Table Of Content

1 Scalable Nearest Neighbor Search based on kNN Graph Wan-Lei Zhao*, Jie Yang, Cheng-Hao Deng Abstract—Nearestneighborsearchisknownasachallengingissuethathasbeenstudiedforseveraldecades.Recently,thisissue becomesmoreandmoreimminentinviewingthatthebigdataproblemarisesfromvariousfields.Inthispaper,ascalablesolutionbased onhill-climbingstrategywiththesupportofk-nearestneighborgraph(kNN)ispresented.Twomajorissueshavebeenconsideredinthe paper.Firstly,anefficientkNNgraphconstructionmethodbasedontwomeanstreeispresented.Forthenearestneighborsearch,an enhancedhill-climbingprocedureisproposed,whichseesconsiderableperformanceboostoveroriginalprocedure.Furthermore,with thesupportofinvertedindexingderivedfromresiduevectorquantization,ourmethodachievescloseto100%recallwithhighspeed 7 efficiency in two state-of-the-art evaluation benchmarks. In addition, a comparative study on both the compressional and traditional 1 nearestneighborsearchmethodsispresented.Weshowthatourmethodachievesthebesttrade-offbetweensearchquality,efficiency 0 andmemorycomplexity. 2 b e IndexTerms—nearestneighborsearch,k-nngraph,hill-climbing. F (cid:70) 3 ] V 1 INTRODUCTION perform the NN search given kNN graph is supplied. For C thefirstissue,ithasbeenaddressedbythescheme“divide- . Nearest neighbor search (NNS) generally arises from a s and-conquer” [10], [11] and nearest neighbor descent [12]. wide range of subjects, such as database, machine learn- c The combination of these two schemes is also seen from [ ing, computer vision, and information retrieval. Due to the recentwork[6].Thankstotheaboveschemes,thecomplex- fundamental role it plays, it has been studied in many 2 ityofconstructingkNNgraphhasbeenreducedtoaround computer fields for several decades. The nearest neighbor v O(D·n·log(n)). With the support of kNN graph, nearest search problem can be simply defined as follows. Given a 5 7 queryvector(q ∈ RD),andncandidatesthatareunderthe neighbor search is conducted by hill-climbing strategy [4]. However, this strategy could be easily trapped in local op- 4 same dimensionality. It is required to return sample(s) for 8 thequerythatarespatiallyclosesttoitaccordingtocertain timasincetheseedsamplesarerandomlyselected.Recently, 0 metric(usuallyl1-distanceorl2-distance). thisissuehasbeenaddressedbysearchperturbation[5],in 1. Traditionally, this issue has been addressed by various which an intervention scheme is introduced as the search 0 space partitioning strategies. However, these methods are procedure is trapped in a local optima. As indicated in 7 hardly scalable to high dimensional (e.g., D > 20), large- recent work [6], this solution is sub-optimal. Alternatively, 1 scale and dense vector space. In such case, most of the with the support of k-d trees to index the whole reference v: traditional methods such as k-d tree [1], R-tree [2] and set, the hill climbing search is directed to start from the i locality sensitive hashing (LSH) [3] are unable to return neighborhood of potential nearest neighbors [6]. However, X decentresults. inordertoachievehighperformance,multiplek-dtreesare ar Recently, there are two major trends in the literature. required,whichcausesconsiderablememoryoverhead. In one direction, the nearest neighbor search is conducted Considering the big memory cost with the traditional basedonk-nearestneighborgraph(kNNgraph)[4],[5],[6], methods, NNS is addressed based on vector quantization. inwhichthekNNgraphisconstructedoffline.Alternatively, Therepresentativemethodsareproductquantizer(PQ)[7], NNSisaddressedbasedonvectorquantization[7],[8],[9]. additive quantizer (AQ) [8], composite quantizer (CQ) [9] The primary goal of this way is to compress the reference and stacked quantizer (SQ) [13]. Essentially in all these set by vector quantization. Such that it is possible to load methods, vectors in the reference set are compressed via thewholereferenceset(aftercompression)intothememory vector quantization. The advantages that the quantization inthecasethatthereferencesetisextremelylarge. methods bring are two folds. On one hand, the candidate TherearetwokeyissuestobeaddressedinkNNgraph vectors have been compressed (typically the memory con- based methods. The primary issue is about the efficient sumption is one order of magnitude lower than the size construction of the kNN graph, given the original problem of reference set), which makes it possible to scale-up the is at the complexity of O(D·n2). Another issue is how to nearest neighbor search to very large range. On the other hand, the distance computation between query and all the candidates becomes very efficient when it is approximated • Fujian Key Laboratory of Sensing and Computing for Smart City, and theSchoolofInformationScienceandEngineering,XiamenUniversity, by the distances between query and vocabulary words. Xiamen,361005,P.R.China. However, due to the approximation in both vector repre- Wan-LeiZhaoisthecorrespodingauthor.E-mail:[email protected]. sentationanddistancecalculation,highrecallisundesirable 2 withthecompressionalmethods. and multiple k-d trees. Although efficient, sub-optimal re- In our paper, similar as [4] [5] [6], NNS is basically sultsareachieved. addressed via hill-climbing strategy based on kNN graph. For all the aforementioned tree partitioning methods, Two major issues are considered. Firstly, an efficient and another major disadvantage lies in their heavy demand in scalable kNN graph construction method is presented. In memory.Ononehand,inordertosupportfastcomparison, addition,differentfromexistingstrategies,thehill-climbing allthecandidatevectorsareloadedintothememory.Onthe searchisdirectedbymulti-layerresiduevectorquantization other hand, the tree nodes that are used for indexing also (RVQ) [14]. The residue quantization forms a coarse-to- takeupconsiderableamountofextramemory.Overall,the fine partition over the whole vector space. For this reason, memory consumption is usually several times bigger than it is able to direct the hill-climbing search to start most thesizeofcandidateset. likely from within the neighborhood of the true nearest Apart from tree partitioning method, several attempts neighbor.Sincetheinvertedindexingderivedfromresidue havebeenmadetoapplylocalitysensitivehashing(LSH)[3] quantization only keeps the IDs of the reference vectors, on NNS. In general, there are two steps involved in the in comparison to method in [6], ignorable extra memory search stage. Namely, step 1. collects the candidates that is required. As a result, the major memory cost of our share the same or similar hash keys as the query; step 2. method is to hold the reference set. Furthermore, with the performsexhaustivecomparisonbetweenthequeryandall support of the multi-layer quantization, sparse indexing these selected candidates to find out the nearest neighbor. over the big reference set is achievable with a series of SimilarasFLANN,LSHisgoodfortheapplicationthatonly smallvocabularies,whichmakesthesearchprocedurevery requires approximate nearest neighbor. Moreover, in order efficientandeasilyscale-uptobillionleveldataset. tosupportthefastcomparisoninstep2,thewholereference The remaining of this paper is organized as follows. setarerequiredtobeloadedinthememory,whichleadsto In Section 2, a brief review about the literature of NNS alotofmemoryconsumption. is presented. Section 3 presents an efficient algorithm for In recent years, the problem of NNS is addressed by kNN graph construction. Based on the constructed kNN vector quantization [18]. The proposal of applying prod- graph, an enhanced hill-climbing procedure for nearest uct quantizer [7] on NNS opens up new way to solve neighborsearchisthereforepresented.Section4showshow this decades old problem. For all the quantization based the inverted indexing structure is built based on residue methods [19], [7], [20], [9], [13], they share two things in vector quantization, which will boost the performance of common. Firstly, the candidate vectors are all compressed hill-climbing procedure. The experimental study over the via vector quantization. This makes it easier than previous enhancedhill-climbingnearestneighborsearchispresented methodstoholdthewholedatasetinthememory.Secondly, inSection5. NNS is conducted between the query and the compressed candidate vectors. The distance between query and candidates is approximated by the distance between query and 2 RELATED WORKS vocabularywordsthatareusedforquantization.Duetothe heavy compression on the reference vectors, high recall on The early study about the issue of NNS could be traced thetopishardlydesirablewiththesemethods. back to 1970s, when the need of NNS search over file Recently, the hill-climbing strategy has been also ex- system arises. In those days, the data to be processed are plored[4],[5],[6].Basically,nosophisticatedspacepartition- in very low dimension, typically 1D. This problem is well- ingisrequiredintheoriginalidea[4].TheNNSstartsfrom addressedbyB-Tree[15]anditsvariantsB+-tree[15],based a group of random seeds (random location in the vector onwhichtheNNScomplexitycouldbeaslowasO(log(n)). space). The iterative procedure tranverses over the kNN However,B-treeisnotnaturallyextensibletomorethan1D graph by depth-first search. Guided by kNN graph, the case.Moresophisticatedindexingstructuresweredesigned searchprocedureascentsclosertothetruenearestneighbor to handle NNS in multi-dimensional data. Representative in each round. It takes only few number of rounds to structuresarek-d-tree[1],R-tree[2]andR*-tree[16].Fork- converge.Sincetheprocedurestartsfromrandomposition, dtree,pivotvectorisselectedeachtimetosplitthedataset it is likely to be trapped in local optima. To alleviate this evenly into two. By applying this bisecting repeatedly, the issue,aninterventionschemeisintroducedin[5].However, hyper-space is partitioned into embedded hierarchical sub- thismethodturnsouttobesub-optimal.Recently,thisissue spaces. The NNS is performed by tranversing over one is addressed by indexing the reference set by multiple k- or several branches to probe the nearest neighbor. The d trees, which is similar as FLANN [17]. When one query space partitioning aims to restrict the search taking place comes, the search tranverses over these k-d trees. Potential within minimum number of sub-spaces. However, unlike candidatesarecollectedastheseedsforhill-climbingsearch. B-tree in 1D case, the partition scheme does not exclude Suchthatthesearchprocesscouldbefasterandthepossibil- thepossibilitythatnearestneighborresidesoutsideofthese itythatistrappedinalocaloptimaislower.Unfortunately, candidatesub-spaces.Therefore,extensiveprobingoverthe such kind of indexing causes nearly one times of memory large number of branches in k-d tree becomes inevitable. overheadtoloadthek-dtrees,whichmakesitinscalableto For this reason, NNS with k-d tree could be very slow. large-scalesearchtask. Similar comments apply to R-tree and its variants despite Hill-climbing strategy is adopted in our search scheme thatamoresophisticateddesignonthepartitionstrategyis duetoitssimplicityandefficiency.Differentfrom[5],[6],the proposedinthesetreestructures.Recentindexingstructure searchprocedureisdirectedbyresiduevectorquantization FLANN[17]partitionsthespacewithhierarchicalk-means (RVQ) [19]. Firstly, the candidate vectors are quantized by 3 RVQ. However, different from compressional methods, the graph G is refined as new and closer neighbors join in. quantized codes are used only to build inverted indexing. Notice that the result of two means clustering is different Candidate vectors sharing the same RVQ codes (indexing from one round to another. We find it is sufficient to set I0 key) are organized into one inverted list. In comparison to10formillionleveldataset.Theoverallcomplexityofthis to [6], only 4 bytes are required to keep the index for one procedureisO(I0·n·log(n)·D),wheren·log(n)·Disactually vector.Whenaquerycomes,thedistancebetweenthequery thecomplexityoftwomeansclustering. and the RVQ codes are calculated, the candidates reside in the lists of top-ranked codes are taken as the seeds for 3.2 EnhancedHill-climbingProcedureforNNS hill-climbing procedure. The inverted indexing built from Oncethek-NNgraphisready,itisquitenaturaltoadoptthe RVQ codes plays the similar role as k-d trees in [6], while hill-climbingscheme[4]toperformtheNNS.However,we requiringverysmallamountofmemory. find that the procedure proposed in [4] is easily trapped in local optima and involves redundant visits to unlikely NN 3 NNS BASED ON K-NN GRAPH CONSTRUCTION candidates.Forthisreason,theprocedureisrevisedslightly 3.1 k-NNGraphConstructionwithTwoMeansCluster- inthepaper,whichisgiveninAlg.2. ing Algorithm2.EhancedHill-ClimbingNNS In this section, a novel kNN graph construction method 1: Input:q:query,G:k-NNGraph, based on two means tree [21] is presented. Basically, we 2: S:seeds,R←φ:NNrank-list follow the strategy of divide-and-conquer [10], [11], [6]. 3: Output:R While different from [10], [11], [6], the division over the 4: R←R∪{<s,d(q,s)>},s∈S;t←0; vector space is undertaken by two means clustering in our 5: whilet<t0 do paper. The motivation behind this algorithm is based on 6: T←φ; the observation that samples in one cluster are most likely 7: foreachri ∈top-kofRdo neighbors. The general procedure that constructs the kNN 8: foreachnj ∈G[ri]do graph is in two steps. Firstly, the reference set is divided 9: if nj isNOTvisitedthen into small clusters, where the size of each cluster is no 10: T←T∪{<nj,d(q,nj)>}; bigger than a threshold (i.e., 50). This could be efficiently 11: endif fulfilled by bisecting the reference set repeatedly with two 12: endfor meansclustering.Inthesecondstep,anexhaustivedistance 13: endfor comparison is conducted within each cluster. The kNN list 14: R←R∪T; for each cluster member is therefore built or updated. This 15: t←t+1; general procedure could be repeated for several times to 16: endwhile refinethekNNlistofeachsample. end Algorithm1.kNNGconstruction In Alg. 2, seeds could be randomly selected as [4] does or 1: Input:Xn×d:referenceset,k:scaleofk-NNGraph suppliedbyinvertedindexing,whichwillbeintroducedin 2: Output:kNNGraphGn×k Section4.Aswillbeseenintheexperiments,seedssupplied 3: InitializeGn×k;t←0; byinvertedindexingleadtosignificantperformanceboost. 4: fort<I0 do As shown in Alg. 2, the major modification on the 5: ApplytwomeansclusteringonXn×d originalalgorithmisonthewayofkNNexpansion.Asseen 6: CollectclusterssetS from Line 8-14, the NN expansion is undertaken for each 7: foreachSm ∈S do top-k sample in one iteration. In contrast, algorithm given 8: foreach<i,j >do//(i,j ∈Sm &i(cid:54)=j) in [4] expands kNN only for the top-ranked sample in one 9: if <i,j >isNOTvisitedthen iteration, which does not allow all the seeds to hold equal 10: UpdateG[i]andG[j]withd(xi,xj); chancetoclimbthehill.Wefindthattheoriginalprocedure 11: endif ismorelikelytrappedinlocaloptima.Aswillbeseeninthe 12: endfor experiment,morenumberofiterationsisneededtoreachto 13: endfor thesamelevelofsearchqualityasourenhancedversion. 14: t←t+1 15: endfor end 4 FAST NEAREST NEIGHBOR SEARCH 4.1 ReviewonResidueQuantization AsshowninAlg.1,thekNNgraphconstructionmainly relies on two means clustering, which forms a partition on The idea of approximating vectors by the composition of the input reference set in each round. Comparing with k-d several vectors could be traced back to the design of “two- tree,twomeansproducesballshapeclustersinsteadofcubic stage residue vector quantization” [23]. In this scheme, an shapepartition.Inordertoachievehighquality,thecluster- input vector is encoded by a quantizer, and its residue ing is driven by objective function of boost k-means [22], (quantization error). The residue vector is sequentially en- which always leads to higher clustering quality. With the codedbyanotherquantizer.Thisschemehasbeenextended clustering result of each round, the brute-force comparison to several orders (stages) [14], in which the residue vector is conducted within each cluster. Due to the small scale of that is left from the prior quantization stage is quantized each cluster, this process could be very efficient. The kNN recursively.Givenvectorv ∈ RD,andaseriesvocabularies 4 {V1,··· ,Vk}, vector v is approximated by a composition v ≈w(1)+w(2) id c c 11 7 of words from these vocabularies. In particular, one word 1 2 is selected from one stage vocabulary. Given this recursive 6 12 quantization is repeated to k-th stage, vector v is approxi- matedasfollows. v ≈w(1)+···+w(m)+···+w(k) (1) x⏟cx1xxcx2 c1 c2 id i1 im ik indexkey Irn=Eqvn.−1,(cid:80)wi(j(mmm=1−) 1∈)wVmi(jj)incosltlaegcetemd farroemtramin−ed1onstathgee.reFsiindaulleys, 4 3 vector v after RVQ is encoded as c1···ci···cm, where ci is the word ID from vocabulary Vi. When only one stage is Fig. 1: An illustration of inverted indexing derived from employed,residuequantizationisnothingmorethanvector residuevectorquantization.Onlytwo-layercaseisdemon- quantization, which has been well-known as Bag-of-visual strated, while it is extensible to more than two-layer cases. wordmodel[24]. RVQ is shown on the left, the inverted indexing with RVQ Asdiscussedin[19],multiple-stageRVQformsacoarse- codes is shown on the right. Vector vid is encoded by wc(11) troo-lfienaesphaierrtiatricohnicoavlekr-mtheeavnesc(tHorKsMpa)cien, wFLhAicNhNpl[a1y7s].siWmhilialer andwc(22),whicharethewordsfromthefirstandthesecond layervocabulariesrespectively. different from HKM, the space complexity of RVQ is very small given there is no need to keep the big amount of high-dimensional tree nodes. In addition, RVQ is able to produceabigamountofdistinctivecodeswithsmallvocab- In order to speed-up the search, it is more favorable to ulariesasmultiplestagesareemployed.Forinstance,given calculate the distance between query and the first order m = 4,|Vi=1···4| = 256, the number of distinctive codes is codes first. Thereafter, the early pruning is undertaken by asmuchas232. ignoringkeyswhosefirstordercodesarefarawayfromthe query.Toachievethis,Eqn.2isrewrittenas 4.2 InvertedIndexingwithRVQandCascadedPruning d(q,I)=q·qt−2·w(1)·qt−w(1)2 In RVQ encoding, the energy of a vector is mainly kept in c1 c1 (cid:124) (cid:123)(cid:122) (cid:125) thecodesoflowerorder.Similaras[25],[19],inourdesign, term1 the codes from multiple stage RVQ are combined as the −2·w(2)·qt−w1 ·w(2)t−w(2)2. (3) indexingkey,whichindicatestheroughlocationofavector (cid:124) c2 (cid:123)c(cid:122)1 c2 c2 (cid:125) in the space. Such that all the vectors that reside close to term2 this location share the same indexing key. The more orders As shown in Eqn. 3, the search process will calculate “term of codes are incorporated, the more precise this location 1”first.Onlykeysthatarerankedattoparejoinedintothe andthelargeroftheinvertedindexingspace.Consequently, distance calculation in the 2nd term. Based on d(q,I), the theencodedvectorsaresparselydistributedintheinverted indexing keys are sorted again. Only the inverted lists that lists. are pointed by the top-ranked keys are visited. Note that When a query comes, the search process calculates the this cascade pruning scheme is extensible to the case that distance between query and indexing keys. On the first morethantwoordersofcodesareusedastheindexingkey. hand, the indexing keys are decomposed back into several Ineachinvertedlist,onlytheIDsofvectorswhichsharethe RVQcodes.Theasymmetricdistancebetweenthequeryand sameRVQcodesarekept. thesecodesarethereforecalculatedandsorted.Givenquery qandtheindexingkeyI formedbythefirsttwoorderRVQ In the above analysis, we only show the case that in- codes (i.e., I = c1c2), the distance between q and key I is vertedindexingkeysproducedfromthefirsttwoordersof givenas RVQ codes. However, it is extensible to the case of more ordersofRVQcodesareemployed.Itisactuallypossibleto d(q,I)=(q−w(1)−w(2))2 generatekeyspaceoftrillionlevelwithonlyfourordersof c1 c2 =q·qt−2·(w(1)+w(2))·qt+(w(1)+w(2))2, (2) RVQcodes.Forinstance,ifvocabularysizesofthefirstfour c1 c2 c1 c2 ordersaresetto213,28,28and28,thevolumeofkeyspaceis where w1 and w2 are respectively words from the first asbigas237,whichissufficienttobuildinvertedindicesfor c1 c2 orderandthesecondordervocabularies. trillion level reference set. In practice, the number of layers Attheinitialstageofsearch,theinner-productsbetween tobeemployedcloselyrelatedtothesizeofreferenceset. queryandallthevocabularywordsarecalculatedandkept Intuitively, the larger of the reference set, the more in a table. In addition, for the third term in Eqn. 2, it numberoflayersshouldbeincorporated.Accordingtoour involves the calculation of inner product between w1 and observation,formillionleveldataset,two-layerRVQmakes w2 .Inordertofacilitatefastcalculation,theinnerproc1ducts agoodtrade-offbetweenefficiencyandsearchquality.Dur- c2 between words from the first order vocabulary and the ingtheonlinesearch,allthecandidatevectorsthatresidein wordsfromthesecondordervocabularyarepre-calculated the top-ranked inverted lists are collected as the candidate andkeptforlookingup.Asaresult,fastcalculationofEqn.2 seeds.Theseedsareorganizedasapriorityqueueaccording isachievablebyperformingseverallook-upoperations. tod(q,I)andarefedtoAlg.2. 5 5 EXPERIMENT 100 In this section, the experiments on SIFT1M and GIST1M 95 are presented. In SIFT1M and GIST1M, there are 10,000 %) 90 and1,000queriesrespectively.Inthefirsttwoexperiments, all ( following the practice in [6], the curve of average recall e rec 85 againsttotalsearchtime(ofallqueries)ispresentedtoshow ag thetrade-offthatonemethodcouldachievebetweensearch Aver 80 qualityandsearchefficiency.Thecurveisplottedbyvarying 75 thenumberofsamplestobeexpandedineachroundoftop- GNNS E-GNNS rankedlistandthenumberofiterations. 70 0 20 40 60 80 100 120 140 160 180 We mainly study the performance of the enhanced Search time (s) GNNS (E-GNNS) and the performance of E-GNNS when (a) SIFT1M it is supported by inverted indexing (IVF+E-GNNS). Their 90 performance is presented in comparison to state-of-the-art methodssuchasoriginalGNNS[4],EFANNA[6],whichis 85 the most effective method in recent studies, FLANN [17], E2LSH [3] and ANN [26]. For E-GNNS and IVF+E-GNNS, all (%) 80 wexepeursiemtehnetss,anmaemseclayle30ofaknNdN50grfoaprhSsIFaTs1EMFAaNndNAGIiSnT1thMe ge rec 75 a respectively.ForIVF-E-GNNS,twolayersofRVQquantizer Aver 70 are adopted and their vocabulary sizes are set to 256 for 65 eachonbothSIFT1MandGIST1M.Alltheexperimentshave GNNS beenpulledoutbyasinglethreadonaPCwith3.6GHzCPU E-GNNS 60 and16Gmemorysetup. 0 20 40 60 80 100 120 Search time (s) (b) GIST1M 5.1 E-GNNSversusGNNS Fig. 2: NNS quality against search time for GNNS and E- In our first experiment, we try to show the performance GNNS. The scale of kNN graph is fixed to 30 and 50 for improvementachievedbyE-GNNSoverGNNS.Asshown SIFT1MandGIST1Mrespectively. in Figure 2, with the revised procedure (Alg. 2), significant performanceboostisobservedwithE-GNNS.Ononehand, E-GNNStakeslessnumberofiterationstoconverge.Onthe TABLE 1: Parameter settings for non-compressional meth- other hand, it becomes less likely to be trapped in a local ods optimaasitisabletoreachtoconsiderablyhigherrecall. SIFT1M GIST1M ANN[26] ε=0 ε=5 5.2 E-GNNSsupportedbyInvertedIndex E2LSH[3] r=0.45 r=1 p=0.9, p=0.8, FLANN[17] In the second experiment, we study the performance of E- ωm=0.01,ωt=0 ωm=0.01,ωt=0 EFANNA[6] nTree=4,kNN=30 nTree=4,kNN=50 GNNS when it is supported by inverted indexing derived from RVQ. The performance of EFANNA is also presented 5.3 ComparisontoNon-compressionalMethods for comparison, which similarly follows the scheme of GNNS. Comparing to EFANNA, we adopt inverted in- Inthisexperiment,theperformanceofIVF+E-GNNSiscom- dexing instead of multiple k-d trees to direct the search. pared to both compressional methods such as IVFPQ [7], Moreover,theenhancedGNNSprocedureisadoptedinour IVFRVQ[19]andIMI[27]andnon-compressionalmethods case.AsshowninFigure3,E-GNNSsupportedbyinverted such as E2LSH [3], ANN [26] and FLANN [17]. The speed indexing shows considerable performance improvement. efficiency, search quality and memory cost of all these With similar search time, its average recall is boosted more methodsonSIFT1MandGIST1Mareconsideredandshown than 10% across two datasets. Compared to EFANNA, in Table 2. In particular, the memory cost is shown as the IVF+E-GNNS converges to significantly higher recall in ratiobetweenthememoryconsumptionofonemethodand shorter time. While for EFANNA, it takes shorter time the size of reference set. All results reported for the non- than IVF+E-GNNS only when the requirement of search compressional methods are obtained by running the codes quality is low. In order to reach to high recall, the search providedbytheauthors.Theconfigurationononemethod procedure could be unexpectedly elongated. We find that issettowherebalanceismadebetweenspeedefficiencyand its performance could be even degenerated as one chooses search quality (see detailed configurations in Table 1). The larger expansion factor (comparable to top-k in IVF+E- performancefromexhaustiveNNSissuppliedforreference. GNNS). This mainly indicates GNNS directed by multiple As seen from Table 2, ANN shows stable search quality k-dtreesiseasiertobetrappedinlocaloptimathanthatof and best trade-off on two datasets in terms of its recall, IVF+E-GNNS. Notice that IVF+E-GNNS also outperforms however its speed is only several times faster than brute- the method proposed in [5] by a large margin as it is force search. Among all methods considered in this experi- reportedin[6]. ment,IVF+E-GNNSshowsthebesttrade-offbetweenmem- 6 TABLE2:Comparisononaveragetimecostperquery(ms),memoryconsumption(relativeratio)andquality(recall@top- k) of all representative NNS Methods in the literature. In terms of memory consumption, the ratio is taken between real memoryconsumptionofonemethodagainsttheoriginalsizeofthecandidatedataset SIFT1M GIST1M T(ms) Memory R@1 R@100 T(ms) Memory R@1 R@100 IVFPQ-8[7] 8.6 0.086× 0.288 0.943 36.2 0.0029× 0.082 0.510 IVFRVQ-8[19] 4.7 0.070× 0.270 0.931 12.5 0.0023× 0.107 0.616 IMI-8[27] 95.1 0.086× 0.228 0.924 90.0 0.0029× 0.053 0.369 FLANN[17] 0.9 6.25× 0.852 0.852 11.9 2.12× 0.799 0.799 ANN[26] 250.5 8.45× 0.993 1.000 232.0 1.48× 0.938 1.000 E2LSH[3] 30.0 29.10× 0.764 0.764 6907.7 1.74× 0.342 0.990 E-GNNS 2.3 1.225× 0.882 0.882 6.7 1.049× 0.724 0.724 IVF+E-GNNS 1.8 1.234× 0.983 0.983 8.0 1.053× 0.943 0.943 EFANNA[6] 1.8 1.621× 0.955 0.955 9.5 1.049× 0.891 0.891 ExhaustiveNNS 1,107.0 1.0× 1.00 1.00 2,258.0 1.0× 1.00 1.00 100 scalable to very large scale dataset. With the support of constructed kNN graph, an enhanced hill-climbing proce- 95 dure has been presented. Furthermore, with the support of %) 90 invertedindexing,itshowssuperiorperformanceovermost all ( oftheNNSmethodsintheliterature.Wealsoshowthatthis e rec 85 method is easily scalable to billion level dataset with the g Avera 80 sreuspidpuoretvoefcttohreqiunavnertitzeadtiionnd.exing derived from multi-layer 75 IVF+E-GNNS EFANNA E-GNNS 70 REFERENCES 10 20 30 40 50 60 Search time (s) [1] J. L. Bentley, “Multidimensional binary search trees used for (a) SIFT1M associativesearching,”CommunicationsofACM,vol.18,pp.509– 517,Sep.1975. 100 [2] A. Guttman, “R-trees: A dynamic index structure for spatial searching,”inProceedingsofthe1984ACMSIGMODinternational 95 conference on Management of data, vol. 14, (New York, NY, USA), 90 pp.47–57,ACM,Jun.1984. e(s) 85 [3] sMen.sDitaivtaer,hNa.shIminmgosrclhiceam, Pe.bInadseydk,oanndpV-s.taSb.lMe idrriostkrnibi,u“tiLooncsa,”lityin- m h ti 80 Proceedings of the Twentieth Annual Symposium on Computational Searc 75 [4] GKe.oHmaetjeryb,i,(NY.ewAbYboarski,-YNaYd,kUorS,AH),.pSph.a2h5b3a–z2i6,2a,nAdCHM.,Z20h0a4n.g, “Fast 70 approximate nearest-neighbor search with k-nearest neighbor graph,” in International Joint Conference on Artificial Intelligence, 65 IVF+E-GNNS EFANNA pp.1312–1317,2011. 60 E-GNNS [5] J. Wang and S. Li, “Query-driven iterated neighborhood graph 0 5 10 15 20 25 30 35 40 45 search for large scale indexing,” in Proceedings of the 20th ACM correct neighbors(%) International Conference on Multimedia, (New York, NY, USA), pp.179–188,ACM,2012. (b) GIST1M [6] C. Fu and D. Cai, “EFANNA : An extremely fast approximate nearestneighborsearchalgorithmbasedonknngraph,”arXiv.org, Fig.3:NNSqualityagainstsearchtimeforE-GNNS,IVF+E- 2016. arXiv:1609.07228. GNNS and EFANNA. The scale of kNN graphs in SIFT1M [7] H. Je´gou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” Trans. PAMI, vol. 33, pp. 117–128, Jan. and GIST1M is fixed to 30 and 50 respectively for all the 2011. methods. [8] A.BabenkoandV.Lempitsky,“Additivequantizationforextreme vectorcompression,”inCVPR,pp.931–938,2014. [9] T. Zhang, C. Du, and J. Wang, “Composite quantization for ap- proximatenearestneighborsearch,”inICML,pp.838–846,2014. ory cost, speed efficiency and search quality. Additionally, [10] J. L. Bentley, “Multidimensional divide-and-conquer,” Commun. comparing with all other non-compressional methods, its ACM,vol.23,pp.214–229,Apr.1980. [11] J.Wang,J.Wang,G.Zeng,Z.Tu,R.Gan,andS.Li,“Scalablek-nn memoryconsumptionisonlyslightlylargerthanthesizeof graph construction for visual descriptors,” in CVPR, pp. 1106– referenceset. 1113,2012. [12] W.Dong,C.Moses,andK.Li,“Efficientk-nearestneighborgraph constructionforgenericsimilaritymeasures,”inProceedingsofthe 6 CONCLUSION 20thInternationalConferenceonWorldWideWeb,WWW’11,(New York,NY,USA),pp.577–586,ACM,2011. We have presented our solution for efficient nearest neigh- [13] J. Martinez, H. H. Hoos, and J. J. Little, “Stacked bor search, which is based on hill-climbing strategy. Two quantizers for compositional vector compression,” Arxiv.org. https://arxiv.org/abs/1411.2173. major issues have been discussed in the paper. Firstly, a [14] C. F. Barnes, S. A. Rizvi, and N. M. Nasrabadi, “Advances in novel kNN graph construction method is proposed. Its residualvectorquantization:areview.,”IEEETransactionsonImage time complexity is only at O(n·log(n)·D), which makes it Processing,vol.5,no.2,pp.226–262,1996. 7 [15] D.Comer,“Ubiquitousb-tree,”ACMComputingSurveys,vol.11, pp.121–137,Jun.1979. [16] N.Beckmann,H.-P.Kriegel,R.Schneider,andB.Seeger,“TheR*- tree:anefficientandrobustaccessmethodforpointsandrectan- gles,”inInternationalConferenceonManagementofData,pp.322– 331,1990. [17] M.MujaandD.G.Lowe,“Scalablenearestneighboralgorithms for high dimensional data,” Trans. PAMI, vol. 36, pp. 2227–2240, 2014. [18] R.M.GrayandD.L.Neuhoff,“Quantization,”IEEETransactions onInformationTheory,vol.44,pp.2325–2383,Sep.2006. [19] Y.Chen,T.Guan,andC.Wang,“Approximatenearestneighbor searchbyresidualvectorquantization,”Sensors,vol.10,pp.11259– 11273,2010. [20] A.BabenkoandV.Lempitsky,“Efficientindexingofbillion-scale datasetsofdeepdescriptors,”inCVPR,pp.2055–2063,2016. [21] N. Verma, S. Kpotufe, and S. Dasgupta, “Which spatial parti- tiontreesare adaptivetointrinsic dimension?,”inProceedingsof the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp.565–574,2009. [22] W.-L. Zhao, C.-H. Deng, and C.-W. Ngo, “Boost k-means,” Arxiv.org, vol. abs/1610.02483, 2016. https://arxiv.org/abs/1610.02483. [23] R. M. Gray, “Vector quantization,” IEEE Acoust, Speech, Signal ProcessingMagzine,pp.4–29,Apr. [24] J.SivicandA.Zisserman,“VideoGoogle:Atextretrievalapproach toobjectmatchinginvideos,”inICCV,Oct.2003. [25] A. Babenko and V. Lemptisky, “The inverted multi-index,” in CVPR,pp.3069–3076,2012. [26] S. Arya and D. Mount, “ANN Searching Toolkit.” ftp://ftp.cs.umd.edu/pub/faculty/mount/ANN/. [27] A.BabenkoandV.Lemptisky,“Theinvertedmulti-index,”Trans. PAMI,vol.37,no.6,pp.1247–1260,2015.