Match-Tensor: a Deep Relevance Model for Search

Aaron Jaech ([email protected])
University of Washington, 185 E Stevens Way NE, Seattle, WA 98195, USA

Hetunandan Kamisetty ([email protected])
Facebook Inc, 1101 Dexter Ave N, Seattle, WA 98109, USA

Eric Ringger ([email protected])
Facebook Inc, 1101 Dexter Ave N, Seattle, WA 98109, USA

Charlie Clarke ([email protected])
Facebook Inc, One Hacker Way, Menlo Park, CA 94025, USA

arXiv:1701.07795v1 [cs.IR] 26 Jan 2017

ABSTRACT

The application of Deep Neural Networks for ranking in search engines may obviate the need for the extensive feature engineering common to current learning-to-rank methods. However, we show that combining simple relevance matching features like BM25 with existing Deep Neural Net models often substantially improves the accuracy of these models, indicating that they do not capture essential local relevance matching signals. We describe a novel deep Recurrent Neural Net-based model that we call Match-Tensor. The architecture of the Match-Tensor model simultaneously accounts for both local relevance matching and global topicality signals, allowing for a rich interplay between them when computing the relevance of a document to a query. On a large held-out test set consisting of social media documents, we demonstrate not only that Match-Tensor outperforms BM25 and other classes of DNNs but also that it largely subsumes signals present in these models.

1 INTRODUCTION

Machine learning models have traditionally relied on a set of manually defined features whose weights are trained in order to optimize an objective function on a set of labeled examples. The learning-to-rank models in widespread use for search may require hundreds of such features to be generated for each query/document pair [27]. In particular, the approach requires features that capture the presence of query terms in a given document, the proximity and location of those terms in the document, document quality, document type, and many other important aspects of the query, of the document, and of the relationship between them. In many rankers, the feature set includes classic information retrieval models, such as BM25 [37], and term dependence models [28]. These models in turn combine more basic query and document features in an attempt to capture salient aspects of relevance. Generation of these features requires a substantial software engineering effort, and the addition of a single new feature may require the creation and testing of new and specialized code to compute it. More importantly, the scope of the feature set is limited by the imagination of its engineers. If the feature set fails to capture some aspect of relevance, that aspect is lost to the ranker.

Recently Deep Neural Networks (DNNs) have replaced this traditional machine learning paradigm in many applications by directly learning a model and its features from the inputs, obviating the need for manually specified features. Motivated by success in machine translation and other related NLP tasks, DNNs are now being applied to ranking in search [17, 19, 31, 33, 40, 41, 44]. These efforts share a goal of reducing the need for traditional feature engineering, and even if DNNs cannot completely eliminate feature engineering, they may capture aspects of relevance beyond its scope.

Unfortunately, while previous efforts have demonstrated that these approaches can capture novel aspects of relevance, they can also fail to capture the most salient aspect of the search task, namely local relevance matching [14]. Indeed, some recent efforts to develop completely new DNN architectures tailored to search still show lower accuracy on TREC datasets than the classic information retrieval models, which are used as features in traditional learning-to-rank approaches [35, 36]. While this failure may possibly be due to the lack of sufficient training data, which is critical to the success of DNNs, the failure may also reflect a more fundamental gap in these architectures.

We develop a new deep neural net architecture tailored for the search task involving social media using a method we refer to as Match-Tensor. The model architecture is simultaneously simple and expressive and is schematically illustrated in Fig. 1: a pair of sequence models computes representations of a query and a given document accounting for both local and distributed context; these representations are used in matching each pair of query and document words along several dimensions and stored as a 3-D tensor (hence, the term Match-Tensor). A convolutional network then translates this tensor to a relevance score. Using a large volume of social media search data, we train a model of this type and analyze its performance, demonstrating that it is substantially more accurate than other current model classes. We also demonstrate that this approach largely subsumes other models such as BM25 [37] and SSM [17, 41]: the Match-Tensor model alone performs nearly as well as meta-models trained with both Match-Tensor and these features as input.
Figure 1: Match-Tensor architecture. Word embeddings are mapped to bi-LSTM state separately on the query and document. The transformed bi-LSTM state is combined by way of point-wise product for each query term and each document term to produce multiple match channels. The match channels are joined by an exact-match channel to produce the 3-D Match-Tensor. 2-D convolution (with 3-D filters) maps the Match-Tensor to a probability of relevance for the query and the document.

2 BACKGROUND

Like most learning-to-rank methods, we wish to learn a function Φ(q,d) that computes a score reflecting the relevance of document d for query q. In the traditional feature engineering approach, we would start with a set of hand-crafted features F that capture various aspects of relevance matching, combine them in a single model M – such as logistic regression or boosted decision trees – and train the model using a learning-to-rank approach [3, 4] to predict the labels on training data:

Φ(q,d) = M(F(q,d))

The features employed in this approach may be as simple as binary query term presence in the document or as complex as separately trained classification or ranking sub-models. Furthermore, it is standard to include classic information retrieval models in this feature set, particularly BM25 [37]. Liu [25] provides a thorough overview of traditional learning-to-rank methods for search. Macdonald et al. [27] cover many of the engineering issues associated with deploying learning-to-rank in a search engine.

The advent of Deep Neural Networks has led to the development of an exciting alternative: in this approach, a single learning procedure is used to learn both features and a model simultaneously. Huang et al. [17] introduced the first Deep Neural Network architectures for Web search that operated on (query, title) pairs, using a so-called siamese architecture [23], in which two feed-forward networks NN_Q and NN_D map the query q and the title of a given web document d, respectively, into fixed-length representations:

Φ(q,d) = cos(NN_Q(q), NN_D(d))

The final documents are then ranked by their similarity to the query in this space, computed using cosine similarity.
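To make the siamese setup concrete, the following is a minimal illustrative sketch of the Φ(q,d) = cos(NN_Q(q), NN_D(d)) scoring scheme, not the implementation from [17]: the tiny randomly initialized feed-forward encoders, the bag-of-embeddings inputs, and all shapes are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def feed_forward(x, w1, w2):
    """A tiny two-layer feed-forward encoder standing in for NN_Q or NN_D."""
    h = np.maximum(0.0, x @ w1)      # ReLU hidden layer
    return h @ w2                    # fixed-length representation

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

emb_dim, hid_dim, rep_dim = 256, 128, 64
# Separate (hypothetical) parameters for the query and document towers.
wq1, wq2 = rng.normal(size=(emb_dim, hid_dim)), rng.normal(size=(hid_dim, rep_dim))
wd1, wd2 = rng.normal(size=(emb_dim, hid_dim)), rng.normal(size=(hid_dim, rep_dim))

# Stand-in inputs: a single fixed-length vector per query and per document,
# e.g. an average of word embeddings.
query_vec = rng.normal(size=emb_dim)
doc_vec = rng.normal(size=emb_dim)

score = cosine(feed_forward(query_vec, wq1, wq2), feed_forward(doc_vec, wd1, wd2))
print("Phi(q, d) =", score)
```

In a trained system the two towers would be learned from click or relevance data; here the point is only the shape of the computation: both inputs are reduced to fixed-length vectors before any comparison is made.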
The application of convolutional neural networks, in lieu of feed-forward networks, by Shen et al. [41] marks the next notable advancement using the same siamese architecture. The local connectivity of convolutional networks can allow for more accurate models, especially when it mirrors the structure of the task at hand.

In parallel to these developments, there have been several advances in Deep Neural Networks, especially for modeling text. While earlier approaches to DNNs for text used convolutional networks, more recent approaches have used Recurrent Neural Networks (RNNs), especially those based on Long Short-term Memory (LSTM) units [16]. Unlike convolutional networks, the units in recurrent networks maintain an internal state that is updated from word to word as a given text is processed, allowing the network to capture sequential relations across a query or document. A popular architecture for machine translation uses the so-called sequence-to-sequence paradigm in which the input text in the source language is “encoded” using an encoder network to produce a fixed-length representation (the RNN state) [42]. A “decoder” then begins with this representation and emits an output in the target language. While the use of a fixed-length representation is similar to the architecture of Huang et al. [17] and Shen et al. [41], the use of RNNs such as those based on LSTMs is critical to their performance. Attention-based schemes build on this architecture by dynamically re-weighting (i.e., focusing attention on) various elements of the source representation during the decoding process, and they have demonstrated considerable improvements over their non-attention counterparts [2].

The “representation-based” nature of siamese architectures has also been identified as a limitation in search [14] and has led to the development of alternate “interaction-based” architectures, in which the relationships between query and document are considered earlier. In an approach called MatchPyramid, Pang et al. [36] construct an interaction matrix between query and document terms, where each entry in the matrix denotes the strength of similarity between the corresponding terms. A hierarchical convolutional model then operates on this single interaction matrix to compute the final score. A recent paper by Mitra et al. [32], which appeared while this manuscript was being prepared, uses a “duet” architecture in which two separate networks (one “representation”-based and the other “interaction”-based) are combined to simultaneously account for local and distributed measures of similarity. The key idea in this method is to use an exact-match matrix followed by convolutional layers on the “interaction” half of the network in addition to a siamese architecture. A crucial limitation of such an approach to modeling interactions is that all tokens in the query are given equal importance: the interaction model can therefore not distinguish between query terms that are important and those that are not [39].

To address some of these shortcomings, we developed a new architecture that we refer to as Match-Tensor. Our thesis is that no single interaction matrix can capture all aspects of the matching problem. Unlike previous interaction-based approaches, which use a single representation of the interaction matrix and cannot distinguish between different words that have the same pattern of matching, the Match-Tensor model architecture simultaneously considers similarity along several dimensions during the matching process. This approach allows for a rich interplay between different dimensions in subsequent stages of the architecture in order to determine the relevance of a document to the query.
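For contrast with the multi-channel tensor introduced in the next section, here is a minimal sketch of the kind of single interaction matrix used by MatchPyramid-style models: one similarity value per (query term, document term) pair, so two query terms with identical match patterns become indistinguishable downstream. The embeddings and shapes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

m, n, dim = 4, 12, 50            # query length, document length, embedding size
query_emb = rng.normal(size=(m, dim))
doc_emb = rng.normal(size=(n, dim))

# Normalize rows so the dot product below is a cosine similarity.
q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)

interaction = q @ d.T            # shape (m, n): one similarity per term pair
print(interaction.shape)         # (4, 12)
```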
3 MATCH-TENSOR ARCHITECTURE

The central thesis of the Match-Tensor architecture is that it is vital to incorporate multiple notions of similarity, capturing both immediate and larger contexts in a given document, when computing the relevance of that document to a query. This objective is achieved by a three-dimensional tensor, in which one dimension corresponds to terms in the query, one dimension corresponds to terms in the document, and a third dimension corresponds to various match channels. Each match channel contains a distinct estimate of the match similarity between the query and document, hence the name “Match-Tensor”. The tensor is computed using the output of a neural network operating on word-embeddings and is supplemented with an exact-match channel that operates directly on the tokens; a downstream neural network is then employed to determine the relevance of the document to the query using the tensor. The entire network is trained end-to-end with a discriminative objective. Thus, the manner in which these multiple notions of similarity are combined to produce the final relevance score is deferred until after all channels are computed. Fig. 2 illustrates the final model in detail. In what follows, we motivate and introduce each element of the architecture.

Figure 2: Match-Tensor Architecture in Detail. Individual elements of the architecture are shown in rounded boxes, while tensor state is shown in the rectangular boxes. Query and document representations are combined in the match-tensor, from which a probability of relevance is finally computed.

3.1 Input to the Match-Tensor Layer

To begin, a word-embedding lookup layer converts query and document terms into separate sequences of word-embeddings. The word embedding table is itself computed offline from a large corpus of social media documents using the word2vec package [30] in an unsupervised manner and is held fixed during the training of the Match-Tensor network. In our implementation, word-embeddings are 256-dimensional vectors of floating point numbers. The word embeddings are then passed through a linear projection layer to a reduced l-dimensional space (here l = 40); the same linear projection matrix is applied to both the query and the document word vectors. This linear projection allows the size of the embeddings to be varied and tuned as a hyperparameter without relearning the embeddings from scratch each time. Two Recurrent Neural Networks, specifically bi-directional LSTMs (i.e., bi-LSTMs) [11, 16], then encode the query (respectively, document) word-embedding sequence into a sequence of LSTM states. The bi-LSTM states capture separate representations in vector form of the query and the document, respectively, that reflect their sequential structure, looking beyond the granularity of a word to phrases of arbitrary size. During hyperparameter tuning, we allowed the models to use a linear projection layer inside the bi-LSTM recurrent connection, as defined in Sak et al. [38], but the best models did not make use of it. There is a separate linear projection after the bi-LSTM to establish the same number k of dimensions in the representation of query and document (in our case, k = 50). Thus, at the end, each token in the query and the document is represented as a k-dimensional vector.
3.2 Match-Tensor Layer

For m words in the query and n words in the document, the actual match-tensor – from which the architecture inherits its name – is an m×n×(k+1) tensor, where k+1 is the number of channels in the match-tensor (specifically, 51 in the detailed figure). Each of the k+1 channels is computed from a distinct representation of the query and document: all but one of the channels are computed using the element-wise product of the corresponding bi-LSTM states of the query and document (after applying the subsequent projection). Including each dimension as a separate channel, instead of collapsing them into a single channel, allows the model to include state-specific (approximately: term-specific) signals in the matching process and to weight matches of different terms according to their importance. While this approach captures most of the key signals, one omission of the first k channels is their inability to properly represent out-of-vocabulary tokens in the query or document, beginning in the initial word embedding lookup. To compensate for this problem, the initial embedding lookup includes an out-of-vocabulary vector, and the model appends an extra exact-match channel in the match-tensor (hence, k+1 total channels) such that the entry at position i,j of this channel is set to α if word i in the query is an exact match to word j in the document and zero otherwise. This exact-match channel is critical for capturing local textual match. The value of α is learned via back-propagation along with the rest of the model.
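As a rough illustration of the tensor construction just described (a sketch under assumed shapes, not the production TensorFlow model), the k bi-LSTM-derived channels can be formed from element-wise products of the projected per-token states, with the exact-match channel appended as channel k. Here α is shown as a fixed constant even though the model learns it, and the token ids are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

m, n, k = 5, 20, 50                  # query length, document length, channels
alpha = 1.3                          # stand-in for the learned exact-match weight

# Projected per-token states: one k-dimensional vector per query/document token.
query_states = rng.normal(size=(m, k))
doc_states = rng.normal(size=(n, k))

# Hypothetical token ids, used only for the exact-match channel.
query_tokens = rng.integers(0, 1000, size=m)
doc_tokens = rng.integers(0, 1000, size=n)

# k channels: element-wise product of query state i and document state j.
match_tensor = query_states[:, None, :] * doc_states[None, :, :]              # (m, n, k)

# Extra exact-match channel: alpha where tokens match exactly, zero elsewhere.
exact = alpha * (query_tokens[:, None] == doc_tokens[None, :]).astype(float)  # (m, n)
match_tensor = np.concatenate([match_tensor, exact[:, :, None]], axis=2)      # (m, n, k + 1)

print(match_tensor.shape)            # (5, 20, 51)
```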
3.3 From Match-Tensor to Score

The secondary neural network begins with the match-tensor and applies a convolutional layer: the match-tensor is convolved across the full depth (k+1) of the tensor with three sets of filters, each having a width of three query words and a height of three, four, or five document words. As illustrated in Fig. 1, these 3-D convolution filters enable the model to learn interactions among the representations in ways that would be very difficult to anticipate as a feature engineer, lending expressive strength to the model architecture. The model applies a ReLU (rectified linear unit) function to the output of these convolutions and then convolves that output with a set of 1×1 filters. The ReLU activation function was chosen because it has been shown to be effective for convolutional neural networks in computer vision [18]. Finally, the model applies 2-D max-pooling to coalesce the peaks from the ReLU into a single fixed-size vector. This is fed into a fully-connected layer (labeled as “linear projection” in the diagram) and through a sigmoid to produce a single probability of relevance as the output of the model.
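The scoring head can be sketched in the same spirit: full-depth convolutions of a few shapes, a ReLU, 2-D max-pooling, and a sigmoid. The loop-based convolution below is only meant to make the shapes explicit; the filter counts and weights are made up, and the intermediate 1×1 convolution described above is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def conv_full_depth(tensor, filters):
    """Valid 2-D convolution over (query, doc) positions, spanning all channels."""
    m, n, c = tensor.shape
    num_f, fq, fd, _ = filters.shape
    out = np.zeros((m - fq + 1, n - fd + 1, num_f))
    for f in range(num_f):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, f] = np.sum(tensor[i:i + fq, j:j + fd, :] * filters[f])
    return out

match_tensor = rng.normal(size=(5, 20, 51))           # e.g. from the previous sketch

# Three filter shapes: width 3 on the query axis, height 3/4/5 on the document axis.
feature_maps = []
for doc_span in (3, 4, 5):
    filters = rng.normal(size=(6, 3, doc_span, 51)) * 0.01   # 6 filters per shape (assumed)
    fmap = np.maximum(0.0, conv_full_depth(match_tensor, filters))   # ReLU
    feature_maps.append(fmap.max(axis=(0, 1)))        # 2-D max-pool per filter

pooled = np.concatenate(feature_maps)                 # fixed-size vector
w, b = rng.normal(size=pooled.shape[0]) * 0.01, 0.0   # "linear projection" to a scalar
print("P(relevant) =", sigmoid(pooled @ w + b))
```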
4 ADDITIONAL RELATED WORK

The work presented in this paper follows a rich and rapidly growing body of work that uses Deep Neural Networks in Search [7, 17, 19, 31, 33, 40, 41, 44]. Our work is closest to the so-called MatchPyramid models of Pang et al. [35, 36]: MatchPyramid models construct a single match matrix and then use a convolutional network on top of it (hence “pyramid”) to compute a relevance score. Unlike them, the Match-Tensor architecture developed in this work simultaneously considers multiple channels during the matching process, allowing for a rich interplay between the different channels in determining the relevance of a document to a query. MatchPyramid models are unable to distinguish between different words having the same match pattern. Guo et al. [14] developed a neural-network based model (DRMM) that uses matching histograms and term-gating. They report that this model is more accurate than BM25 and other alternatives on standard TREC test collections (Robust-04 and ClueWeb-09-Cat-B); however, Mitra et al. [32] report that a model incorporating an exact-match channel with a representation-based “distributed” model outperforms DRMM on a larger collection of web-search queries.

Several other models that use word-embeddings have been proposed for various aspects of Search, including Diaz et al. [8] who use them in query expansion, Ganguly et al. [10] who use them in smoothing language models, Nalisnick et al. [33] who proposed dual embeddings, and Grbovic et al. [12, 13] who use them in sponsored search. Cohen et al. [6] have also studied the utility of DNNs for several IR tasks.

5 METHODOLOGY

5.1 Data

We base our experiments on a collection of approximately 1.6 million (query, document, label) triplets, collected on a major social media site between 2016-01-01 and 2016-06-01. Each document is a publicly viewable social media post, which might include videos, photos, and links, as well as text, but for the experiments reported in this paper, we consider only the text. Labels indicate the relevance level of the document with respect to the query. For these experiments, we use three levels of relevance: “VITAL”, “RELEVANT”, and “NONRELEVANT”. We split this dataset (by query) into 3 parts: train, validation, and test, so that each query string appears in exactly one of the partitions. We provide details of the partitioning in Table 1. The training and validation sets were used to train the models and perform hyper-parameter sweeps. The test set was used only for evaluation, at the end of this process.

Table 1: Details of Dataset. Our corpus consists of approximately 1.6M (query, document, label) triplets that we partitioned into train, validation and test. Each query string and the corresponding results appear in exactly one of the partitions. The test partition was used only during the final evaluation.

             Unique Queries   Results     Average Results/Query
Train        59,457           1,032,325   17.36
Validation   3,975            69,005      17.36
Test         35,773           615,242     17.20

5.2 Implementation Details

We implemented our Match-Tensor neural net model using TensorFlow [1]. We used pre-trained 256-dimensional phrase embeddings computed with the word2vec package [29] on a large corpus of documents with a vocabulary size of around 2 million tokens containing unigrams and selected bigrams. Out-of-vocabulary words are mapped to a special token. Queries are truncated to a maximum of eight words in length, whereas documents are truncated to a maximum length of 200 words. Both the query and the documents are then preprocessed by lowercasing and applying a simple tokenizer to split words and remove punctuation. Since social media documents are structured into separate fields (e.g., the title, author, and body), we added special tokens for demarcating the boundaries between fields and for the start and end of a document. The embeddings for these special boundary tokens are randomly initialized and kept fixed during training. We used dropout as a regularizer on the non-recurrent connections of all bi-LSTMs. We employed the Adam optimizer for gradient descent [20] with a learning rate of 0.001 and a mini-batch size of 200. Hyperparameter settings are shown in Table 2.

To investigate the importance of each of the major components in the model, we also experimented with alternative choices for these components and alternate architectures, tailoring them to social media documents. While some architectures have been suggested for the short text common in some social media [26, 36], studies indicate that they do not beat baseline models such as BM25 [35]. In contrast, both early models [17, 41] and recent developments by Mitra et al. that have shown strong performance [31] were designed for web search and are not directly usable in our setting. We therefore adapted these model architectures from web search for our setting. The details are discussed in the following sections.
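A simplified sketch of the preprocessing described above follows; only the truncation lengths and lowercasing come from the text, while the tokenizer, the boundary-token names, and the field list are assumptions made for illustration.

```python
import re

MAX_QUERY_TOKENS = 8
MAX_DOC_TOKENS = 200

# Hypothetical boundary markers for document fields and document start/end.
DOC_START, DOC_END, FIELD_SEP = "<doc>", "</doc>", "<field>"

def tokenize(text, max_tokens):
    """Lowercase, strip punctuation, split on whitespace, then truncate."""
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return tokens[:max_tokens]

def preprocess_document(fields, max_tokens=MAX_DOC_TOKENS):
    """fields: list of strings such as [title, author, body]."""
    tokens = [DOC_START]
    for i, field in enumerate(fields):
        if i > 0:
            tokens.append(FIELD_SEP)
        tokens.extend(tokenize(field, max_tokens))
    tokens.append(DOC_END)
    return tokens[:max_tokens]

query_tokens = tokenize("Match-Tensor: a deep relevance model", MAX_QUERY_TOKENS)
doc_tokens = preprocess_document(["A post title", "some author", "Body text of the post..."])
print(query_tokens, doc_tokens[:6])
```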
5.3 Semantic Similarity Model (SSM)

We constructed a model using the siamese network architecture based on the semantic similarity models (SSM) appearing in other work [17, 34, 41]. In this SSM model, shown in detail in Figure 5, we construct a query embedding by concatenating the last output from each of the forward and backward directions of the query bi-LSTM, and a document embedding by max-pooling over the output bi-LSTM states across the entire document. Max-pooling is used for the document because the documents can be much longer than the query and it is harder for the bi-LSTM to propagate the relevant information all the way to the end of the document [22]. These fixed-length document and query embeddings are then passed through linear projections before computing a dot-product between them, which is then used to compute the final score. The model parameters and hyper-parameters were optimized on the same dataset as the Match-Tensor model.

5.4 Match-Tensor (Exact-only)+SSM

In recent work that appeared while this manuscript was being prepared, Mitra et al. [32] show that a combination of local and distributed matching can outperform other models for web search. Since several details of their model are specific to the structure of web documents, we constructed a model that has similar characteristics for our setting by combining a single-channel, exact-match-only Match-Tensor component with an SSM component into a single model. This is done by concatenating the output from the last layer of Match-Tensor filters with the hidden layer of the SSM comparison network, as shown in Figure 6. The Match-Tensor and SSM components share parameters for the word embedding and LSTM portions of the model.

5.5 Match-Tensor+SSM

We also compare the effect of using all the channels in the Match-Tensor architecture in conjunction with the SSM architecture. This model is shown in Figure 6. The only difference in architecture between this model and the previous (exact-match-only channel) model is the number of channels in the tensor layer: the former has one channel while this model has k+1, like the Match-Tensor model.

5.6 bi-LSTMs vs CNNs

We compared all three model architectures listed above against similar ones that use convolutional layers in place of bi-LSTMs. We used a mix of width-1 and width-3 convolutional filters. Compared to the bi-LSTMs, which can incorporate information over a wide token span, the representations produced by the convolutional filters only look at trigrams (when the width is 3) but are computationally cheaper.

5.7 Attention Pooling

To improve on the query-agnostic pooling schemes of SSM, we also implemented an attention pooling mechanism for the document embeddings as an alternative to max pooling. The hypothesis here is that information from the query is important in deciding how to summarize the document. The attention pooling model learns a ReLU-activated transformation from the query embedding and from each output of the document bi-LSTM. Attention weights are determined by taking the dot product between these vectors and normalizing with the softmax function. The attention-pooled document embedding is the weighted combination of the bi-LSTM outputs. We note that our use of attention is different from that of Palangi et al. [45], where attention-based pooling was used in a query-agnostic manner. In our preliminary experiments, using attention-based pooling did not result in improved results compared to a max pooling baseline, so we did not pursue it further.
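A minimal sketch of the attention pooling variant described above follows; the transformation shapes and the single shared attention dimension are assumptions based on the description, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n, d_doc, d_query, d_att = 20, 190, 64, 64   # hypothetical sizes

doc_states = rng.normal(size=(n, d_doc))     # bi-LSTM outputs for the document
query_emb = rng.normal(size=d_query)         # fixed-length query embedding

# ReLU-activated transformations of the query embedding and of each document state.
w_q = rng.normal(size=(d_query, d_att)) * 0.1
w_d = rng.normal(size=(d_doc, d_att)) * 0.1
q_vec = np.maximum(0.0, query_emb @ w_q)                 # (d_att,)
d_vecs = np.maximum(0.0, doc_states @ w_d)               # (n, d_att)

# Attention weights: dot products, normalized with a softmax.
weights = softmax(d_vecs @ q_vec)                        # (n,)

# Attention-pooled document embedding: weighted combination of bi-LSTM outputs.
doc_embedding = weights @ doc_states                     # (d_doc,)
print(doc_embedding.shape)
```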
5.8 Ensemble models

Comparing different model architectures using absolute metrics can yield insight into the relative importance of the types of signals for the task at hand. However, one model might outperform another model without capturing the signal in the latter model. Consequently, to test whether one model subsumes another, we train additional ensemble models that use the scores of both models. We measure the accuracy of the ensemble models in addition to the individual models.

6 RESULTS

6.1 Model selection

We optimized hyperparameters based on a random grid search on the validation set for each model architecture studied, selecting the one model with the best score out of 200 runs. For each model architecture we next evaluated the single best model on the test set. Table 2 reports these hyperparameters for the three main model architectures we studied. Critically, we note that the final Match-Tensor model has fewer parameters than the final SSM model.

Table 2: Hyperparameter settings for each model architecture. These were selected by performing a random grid search on the validation set and selecting the best setting of 200 runs.

                              SSM    Match-Tensor   Match-Tensor+SSM
Word Embedding Projection     50     40             50
Doc. bi-LSTM Dim.             120    70             95
Query bi-LSTM Dim.            32     15             15
Comparison Net Hidden Layer   50     50             55
Match-Tensor Size             –      40             35
Match Filters 1st Layer       –      18             18
Match Filters 2nd Layer       –      20             30
Training Epochs               4.25   4.5            3.25
Total Parameters              216K   104K           160K

6.2 Sensitivity to Training Size

To evaluate the sensitivity of model performance to the amount of training data, for each of the NN architectures we sub-sampled the training set, retrained the models (keeping the hyper-parameters fixed), and computed the test accuracy. Figure 3 shows the test accuracy of each model as a function of training set size, relative to its accuracy when trained on all the data. Not surprisingly, each architecture we consider here benefits from the availability of large training sets, and the accuracy improves substantially as we increase the size of the training set. However, the relative comparisons between the model architectures appear to be reasonably robust to training data size.

Figure 3: Test-accuracy (AUC of ROC curve) as a function of size of training set for each model, relative to test-accuracy of the model trained with all data. In each case, we fixed the hyperparameter settings to the values selected in Table 2.

6.3 Performance of Neural Models

Figure 4 summarizes the performance of the various neural model architectures relative to a BM25 baseline. The figure reports NDCG at various levels as well as Expected Reciprocal Rank (ERR) [5], with all measures computed using the three relevance grades. Overall, the Match-Tensor model (with bi-LSTMs) is the most accurate individual model, with an 11% improvement in AUC of the ROC curve over a BM25 baseline and smaller but consistent improvements on NDCG and ERR. We note that while the relative ordering of models appears to be robust to variations in the test set, the values of these relative improvements appear to be sensitive to the composition of the test set: relative improvements when restricted to a subset of the test set that is “hard” (at most half the available results are relevant) are much larger.

The SSM architecture had lower NDCG than the BM25 baselines. This result is consistent with [14] and others who have highlighted the limitation of models that only match semantic representations in relevance-matching tasks. Match-Tensor is not only more accurate in aggregate, it is also more accurate at every cutoff: the Precision of the model is higher than that of the others at all values of Recall.

Figure 4: Accuracy of different model architectures (SSM, Match-Tensor+SSM, Match-Tensor, Match-Tensor (Exact Match only)+SSM) relative to BM25 (in percentage).

6.4 bi-LSTMs vs CNNs for text representation

We considered the use of convolutional neural networks to compute the text representations in the first stage of each model architecture in place of bi-LSTMs. Table 3 shows that across the four model architectures under consideration, using bi-LSTMs results in more accurate models than their CNN counterparts in terms of AUC, NDCG, and ERR. For AUC in particular, the relative gain from using bi-LSTMs is between two and three percent. The fact that this increase holds for both SSM and Match-Tensor architecture variants suggests that the improvements are due to bi-LSTMs – across the board – providing more accurate representations at each position. This outcome is not surprising given similar results in other language modeling tasks and is consistent with gains in NDCG observed in [34] in going from convolutional to bi-LSTM-based semantic similarity models.

Table 3: Relative improvement for each model architecture when using bi-LSTMs over using CNNs.

Model                            AUC     NDCG@1   NDCG@3   NDCG@10   ERR
SSM                              2.02%   1.10%    0.83%    0.67%     0.74%
Match-Tensor                     2.87%   1.18%    1.04%    0.66%     0.58%
Match-Tensor+SSM                 2.07%   0.59%    0.57%    0.33%     0.43%
Match-Tensor (Exact Only)+SSM    1.57%   0.12%    0.00%    0.11%     0.15%
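For reference, a small sketch of how NDCG@k could be computed from the three graded labels used in these evaluations; the grade-to-gain mapping and the exponential-gain / logarithmic-discount conventions below are assumptions, since the paper does not state its exact formulas.

```python
import numpy as np

# Hypothetical mapping of the paper's grades to numeric gains.
GRADE = {"NONRELEVANT": 0, "RELEVANT": 1, "VITAL": 2}

def dcg_at_k(labels, k):
    gains = np.array([2 ** GRADE[l] - 1 for l in labels[:k]], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, gains.size + 2))
    return float(gains @ discounts)

def ndcg_at_k(ranked_labels, k):
    ideal = sorted(ranked_labels, key=lambda l: GRADE[l], reverse=True)
    denom = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_labels, k) / denom if denom > 0 else 0.0

# Labels of results in the order a model ranked them.
print(ndcg_at_k(["RELEVANT", "VITAL", "NONRELEVANT", "RELEVANT"], k=3))
```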
6.5 2-D Matching vs. SSM

As we have seen, the Match-Tensor architecture outperforms the SSM architecture, often significantly. Although both architectures are most accurate when using bi-LSTMs, the relative improvement when going from SSM to Match-Tensor is substantial. This improvement holds even when using CNNs to represent the state at each query and document token: AUC goes up by 4% when using bi-LSTMs and 3% when using CNNs, suggesting that the improvement is a consequence of the underlying difference in architectures. The superiority of Match-Tensor is not surprising given that the Match-Tensor architecture has a substantially more expressive matching function. Furthermore, combining the Match-Tensor and SSM architectures gives no substantial gain in performance: as seen, small improvements in AUC are offset by small reductions in NDCG and ERR. The absence of a difference for this hybrid architecture suggests that the bi-LSTM representation at each position is already capturing global context sufficiently well to make the additional explicit per-document representation non-informative for this problem.

6.6 Influence of the exact-match channel

While we introduced the exact-match channel to account for out-of-vocabulary tokens where the bi-LSTM states might not be accurate, it is computed for all cases. Not surprisingly, it is an important contributor to the accuracy of the final model. However, the interplay between all channels improves the accuracy of the model further: the relative improvement in NDCG@1 is 2% with the bi-LSTM channels on, compared to roughly 1% for the exact-match-alone model, i.e., roughly half. This approximate doubling in relative accuracy when moving from the single channel to the full Match-Tensor model is seen across all positions in NDCG and in ERR.

6.7 Ensemble models

To determine whether a deep relevance model is indeed capturing all essential relevance matching signals, we trained ensemble models: boosted trees [9] that combine as inputs the neural model's output as a feature and a BM25 feature, using 5-fold cross-validation on the existing validation set. Neural models that capture essential relevance matching signals better should show relatively small improvements when BM25 is added to the mix, compared to those that don't, since a good model should already capture most of the signal in BM25. As seen in Table 4, Match-Tensor shows the smallest relative increase when BM25 is added to the mix compared to all other alternatives. An exact-match-only Match-Tensor+SSM model also does better in this regard than SSM alone, although again the full Match-Tensor model is substantially better by allowing for interplay among channels even without having an explicit SSM-like component. We note that despite the small relative increase, the Match-Tensor & BM25 model is more accurate than all other ensemble variants and is nearly 1% more accurate than Match-Tensor (Exact only)+SSM & BM25. Thus, the Match-Tensor model is not only the most accurate model in this list, it also largely subsumes the semantic matching signals in SSM and the relevance matching signals in BM25, as indicated by the relatively small improvement in accuracy when BM25 is added to it.

Table 4: Relative improvement in model accuracy by combining BM25 with the original model. The smallest change in each column indicates which architecture already best incorporates the signal available in BM25.

Model                                   AUC     NDCG@1   NDCG@3   NDCG@10   ERR
SSM & BM25                              3.29%   2.53%    2.11%    1.33%     1.47%
Match-Tensor & BM25                     1.64%   0.94%    0.79%    0.54%     0.57%
Match-Tensor+SSM & BM25                 1.52%   1.05%    0.882%   0.65%     0.57%
Match-Tensor (Exact only)+SSM & BM25    2.32%   1.12%    0.877%   0.77%     0.72%
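As a rough approximation of this ensembling setup (the paper's exact boosted-tree configuration and cross-validation code are not given), a two-feature ensemble over a neural score and a BM25 score can be sketched with scikit-learn as a stand-in; the synthetic data below is purely illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)

# Hypothetical validation-set features: one neural-model score and one BM25 score
# per (query, document) pair, with a binary relevance label.
n_examples = 1000
neural_score = rng.uniform(size=n_examples)
bm25_score = rng.uniform(size=n_examples)
labels = (0.6 * neural_score + 0.4 * bm25_score
          + 0.1 * rng.normal(size=n_examples) > 0.5).astype(int)

X = np.column_stack([neural_score, bm25_score])   # two-feature ensemble input
model = GradientBoostingClassifier(n_estimators=100)

# 5-fold cross-validation, mirroring the evaluation protocol described above.
print(cross_val_score(model, X, labels, cv=5, scoring="roc_auc").mean())
```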
6.8 Model Introspection

We illustrate the strengths of the Match-Tensor model over other approaches with a few examples in Table 5. SSM is focused towards matching representations. As a result, it often misses relevance matching signals by finding a result about the same broad topic that differs in several crucial details: as an example, for a query about a celebrity TV show, the model ranks a document about a different celebrity's TV show above a relevant result.

Under its bag-of-words model, BM25 often scores results that have completely wrong phrases but the right set of tokens above a relevant result. This is best illustrated by the query “low fat high carb”, where the model prefers a result about “low carb high fat” over a relevant result. Traditional learning-to-rank methods address this problem with specifically engineered proximity and ordering features. Match-Tensor, on the other hand, correctly ranks these results, learning the necessary proximity, ordering, grammatical, and other relationships directly from training examples.

The Match-Tensor (Exact only)+SSM model uses only exact matches between query and document terms and relies on a single representation of query and document to capture semantic signals. This results in subtler failures, often due to an over-reliance on the exact-match channel: for a query inquiring about scholarships for graduate programs, “scholarship to master degree”, the exact-only model prefers a document that has the exact phrase but is semantically not useful to the searcher. The full Match-Tensor model correctly prefers another result that matches the intent of the query even though the document doesn't contain an exact match.

Table 5: Illustrative examples highlighting pairs of results that were incorrectly ordered by a method but were correctly ordered by the Match-Tensor model.

Query | Irrelevant Result | Relevant Result | Method with incorrect ranking
ariana tv show | Leah Michele's tv show.. | Ariana on the tv.. | SSM
corn shucking song | Blackhawks playing the blues.. | The corn shucking song.. | SSM
cats acted like humans | ..humans acted like cats.. | ..cats trying to act like humans.. | BM25
low fat high carb | Low carb high fat diet.. | ..popular low fat high carb.. | BM25
cleveland wins nba championship | Golden State beats Cleveland in NBA championship.. | Cleveland wins basketball championship.. | Match-Tensor (Exact only)+SSM
scholarship to master degree | My score is low for scholarship to master degree.. | ..Master's application and offers scholarship.. | Match-Tensor (Exact only)+SSM

7 CONCLUDING DISCUSSION

Deep Neural Networks are a compelling development in machine learning that have substantially advanced the state-of-the-art in several disciplines [24]. While initial developments in several domains were focused on the absolute accuracy [21, 42] of these models when compared to alternatives, the focus has more recently gravitated towards the completeness of these models; indeed, in several domains such as speech recognition, computer vision and machine translation, entire production systems have been completely replaced with neural networks that are trained end-to-end [15, 43].

Early neural network models for search focused on semantic matching signals which supplemented existing relevance matching features. By computing similarity between semantic representations of the query and document, this class of models naturally captured signals that were hard to determine using traditional models. However, this general class of models appears to miss critical relevance-matching signals [14]. In this work, we explored Match-Tensor, a new Deep Relevance model architecture for Search. By simultaneously accounting for several notions of similarity with an expressive 3-D tensor layer, and by deferring the combination of these signals into a relevance score to later layers, Match-Tensor is able to achieve higher accuracies than other architectures. More interestingly, this architecture appears to largely subsume the signals in previous models: adding an SSM-like component to the model does not affect the accuracy of the final model, while the improvement when adding BM25 is small and far less than the corresponding improvements in other model architectures. While we have tailored the details of the Match-Tensor model architecture and the alternatives we studied for the search task within a specific social network, we expect that these learnings might also be adaptable to search within other domains – just as we have adapted the SSM-style models originally developed for web search to our setting.

The ability to select diverse ways of computing similarity between query and document, in the form of channels in the match-tensor layer, is a general and powerful primitive. Indeed, we have only explored a few design choices within this general design space (comparing bi-LSTMs to CNNs). It is possible that increasing the diversity of these sources, using a mix of RNNs, CNNs, and other notions of exact matching (by incorporating named entity linking, for example), might further strengthen the accuracy and performance of the model. These, and other explorations, are deferred to future work.
REFERENCES

[1] Martín Abadi et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015). http://tensorflow.org/ Software available from tensorflow.org.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[3] Christopher J. C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An overview. Learning 11 (2010), 23–581.
[4] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: From pairwise approach to listwise approach. In 24th International Conference on Machine Learning. ACM, 129–136.
[5] Olivier Chapelle, Donald Metlzer, Ya Zhang, and Pierre Grinspan. 2009. Expected reciprocal rank for graded relevance. In 18th ACM Conference on Information and Knowledge Management. 621–630.
[6] Daniel Cohen, Qingyao Ai, and W. Bruce Croft. 2016. Adaptability of neural networks on varying granularity IR tasks. arXiv preprint arXiv:1606.07565 (2016).
[7] Nick Craswell, W. Bruce Croft, Jiafeng Guo, Bhaskar Mitra, and Maarten de Rijke. 2016. Neu-IR: The SIGIR 2016 workshop on neural information retrieval. (2016).
[8] Fernando Diaz, Bhaskar Mitra, and Nick Craswell. 2016. Query expansion with locally-trained word embeddings. arXiv preprint arXiv:1605.07891 (2016).
[9] Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics (2001), 1189–1232.
[10] Debasis Ganguly, Dwaipayan Roy, Mandar Mitra, and Gareth J. F. Jones. 2015. Word embedding based generalized language model for information retrieval. In 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 795–798.
[11] Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18, 5 (2005), 602–610.
[12] Mihajlo Grbovic, Nemanja Djuric, Vladan Radosavljevic, and Narayan Bhamidipati. 2015. Search retargeting using directed query embeddings. In 24th International Conference on the World Wide Web. ACM, 37–38.
[13] Mihajlo Grbovic, Nemanja Djuric, Vladan Radosavljevic, Fabrizio Silvestri, and Narayan Bhamidipati. 2015. Context- and content-aware embeddings for query rewriting in sponsored search. In 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 383–392.
[14] Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016. A deep relevance matching model for ad-hoc retrieval. In 25th ACM International on Conference on Information and Knowledge Management. ACM, 55–64.
[15] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and others. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29, 6 (2012), 82–97.
[16] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[17] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In 22nd ACM International Conference on Information & Knowledge Management. ACM, 2333–2338.
[18] Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, and others. 2009. What is the best multi-stage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2146–2153.
[19] Tom Kenter and Maarten de Rijke. 2015. Short text similarity with word embeddings. In 24th ACM International on Conference on Information and Knowledge Management. ACM, 1411–1420.
[20] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[22] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent Convolutional Neural Networks for Text Classification. In 29th AAAI Conference on Artificial Intelligence. 2267–2273.
[23] Yann LeCun and Yoshua Bengio. 1995. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks 3361, 10 (1995), 1995.
[24] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[25] Tie-Yan Liu. 2011. Learning to Rank for Information Retrieval. Springer, Berlin.
[26] Zhengdong Lu and Hang Li. 2013. A deep architecture for matching short texts. In Advances in Neural Information Processing Systems. 1367–1375.
[27] Craig Macdonald, Rodrygo L. Santos, and Iadh Ounis. 2013. The whens and hows of learning to rank for web search. Information Retrieval 16, 5 (October 2013), 584–628.
[28] Donald Metzler and W. Bruce Croft. 2005. A Markov random field model for term dependencies. In 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 472–479.
[29] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[30] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality. In 26th International Conference on Neural Information Processing Systems. 3111–3119.
[31] Bhaskar Mitra. 2015. Exploring session context using distributed representations of queries and reformulations. In 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 3–12.
[32] Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2016. Learning to match using local and distributed representations of text for web search. arXiv preprint arXiv:1610.08136 (2016).
[33] Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. 2016. Improving document ranking with dual word embeddings. In 25th International Conference Companion on the World Wide Web. International World Wide Web Conferences Steering Committee, 83–84.
[34] Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab K. Ward. 2014. Semantic Modelling with Long-Short-Term Memory for Information Retrieval. arXiv preprint arXiv:1412.6629 (2014).
[35] Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, and Xueqi Cheng. 2016. A study of MatchPyramid models on ad-hoc retrieval. arXiv preprint arXiv:1606.04648 (2016).
[36] Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text Matching as Image Recognition. arXiv preprint arXiv:1602.06359 (2016).
[37] Stephen E. Robertson and Steve Walker. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 232–241.
[38] Hasim Sak, Andrew W. Senior, and Françoise Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Interspeech. 338–342.
[39] Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24, 5 (1988), 513–523.
[40] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In 23rd ACM International Conference on Information and Knowledge Management. ACM, 101–110.
[41] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. Learning semantic representations using convolutional neural networks for web search. In 23rd International Conference on the World Wide Web. ACM, 373–374.
[42] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[43] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, and others. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144 (2016).
[44] Liu Yang, Qingyao Ai, Jiafeng Guo, and W. Bruce Croft. 2016. aNMM: Ranking short answer texts with attention-based neural matching model. In 25th ACM International on Conference on Information and Knowledge Management. ACM, 287–296.
[45] Shuangfei Zhai, Keng-hao Chang, Ruofei Zhang, and Zhongfei Mark Zhang. 2016. DeepIntent: Learning attentions for online advertising with recurrent neural networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1295–1304.
