Quantile-Based KNN Over Multi-Valued Objects

Wenjie Zhang, Xuemin Lin, Muhammad Aamir Cheema, Ying Zhang, Wei Wang
The University of New South Wales, Australia
{zhangw, lxue, macheema, yingz, weiw}@cse.unsw.edu.au

Abstract - K nearest neighbor (KNN) search has many applications including data mining, multi-media, image processing, and monitoring moving objects. In this paper, we study the problem of KNN over multi-valued objects. We aim to provide effective and efficient techniques to identify KNN sensitive to relative distributions of objects. We propose to use quantiles to summarize relative-distribution-sensitive K nearest neighbors. Given a query Q and a quantile φ ∈ (0,1], we first study the problem of efficiently computing the K nearest objects based on a φ-quantile distance (e.g. median distance) from each object to Q. The second problem is to retrieve the K nearest objects to Q based on overall distances in the "best population" (with a given size specified by φ-quantile) of each object. While the first problem can be solved in polynomial time, we show that the second problem is NP-hard. A set of efficient, novel algorithms is proposed to give an exact solution for the first problem and an approximate solution for the second problem with approximation ratio 2. Extensive experiments demonstrate that our techniques are very efficient and effective.

Fig. 1. Motivating Example. (Panels (a)-(c) show the scores of players A and B, distributed from 9.99 down to 9.0, relative to the query score 10; (b) swaps two pairs of instances of A and B, and (c) is the distribution obtained after n/2 such swaps.)

I. INTRODUCTION

Given a set D of objects (points) in a d-dimensional metric space and a d-dimensional query object (point) q, the K nearest neighbor (KNN) search retrieves the K closest objects to q from D. The conventional KNN search has been extensively studied [15], [23] with a wide spectrum of applications including data mining, content-based image retrieval, and location based services. In this paper, we study the problem of K nearest neighbor search over objects each of which has a collection of values (instances) without temporal constraints specified; that is, we do not deal with sequence databases [1], [18].

The existing model, probabilistic KNN, applies uncertain semantics to each object by treating the collection of instances of each object as mutually exclusive. It aims to capture the relative distributions among objects with multiple instances. Two semantics of ranking top-k uncertain tuples are employed in a probabilistic KNN model: 1) retrieving k tuples that can co-exist in a possible world (e.g. U-topk) [24], and 2) retrieving tuples according to the probability that a tuple is top-k or at a specific rank in all possible worlds (e.g. U-kRanks and PT-k) [24], [17].¹ While various probabilistic NN models are proposed in [3], [5], [20], a probabilistic KNN model over uncertain data has been proposed following the U-topk ranking semantics [6]. In these probabilistic KNN models, the probability for an object to be a KNN of a query object is calculated to define the result of a KNN. Nevertheless, below we show that the probabilistic KNN models may provide results insensitive to relative distributions of instances of objects.

¹ When k = 1, these two models are the same.

Motivating Example. Let K = 1. In gymnastics, suppose that we want to select the "best" balance-beam player among all candidates to participate in a world championship. The scores of two players A and B, based on the most recent n games/attempts each, are depicted in Figure 1(a). Assume that the 2n scores of A and B (n for A and n for B) are distributed from 9.99 to 9.0 as depicted in Figure 1(a).

Assume that we approximately treat each player as an uncertain object and the score of an attempt as an instance with equal occurrence probability. It can be immediately verified that, based on the existing probabilistic NN models, player A and player B have the same probability, 1/2, to be the nearest neighbor of the query point q (i.e. the score 10) if |10 − score| is used as the distance metric. We permute the distribution in Figure 1(a) by swapping two pairs of instances of A and B as depicted in Figure 1(b). It is immediate that A and B still have the same probability, 1/2, to be the nearest neighbor regarding the score distribution after these two permutations. Choosing l = n/2, the score distribution in Figure 1(a) is eventually modified to the score distribution in Figure 1(c) after n/2 such pairs of permutations; consequently, the nearest neighbor probabilities of A and B remain unchanged, 1/2, regarding the distribution in Figure 1(c).
U-topk) [24], and 2) the distribution in Figure 1(a) by swapping the two pairs of retrieving tuples according to the probability that a tuple is instancesofAandBasdepictedinFigure1(b).Itisimmediate top-k or at a specific rank in all possible worlds (e.g. U- 1 that A and B still have the same probability, , to be the kRanks and PT-k) [24], [17] 1. While various probabilistic 2 nearest neighbor regarding the score distribution after these NNmodelsareproposedin[3],[5],[20],aprobabilisticKNN two permutations. Choosing l = n, the score distribution in modeloveruncertaindatahasbeenproposedfollowingU-topk 2 Figure 1(a) is eventually modified to the score distribution ranking semantics [6]. In these probabilistic KNN models, in 1(c) after n such pairs of permutations; consequently, the the probability for an object to be KNN to a query object is 2 nearestneighborprobabilitiesofAandB,respectively,remain calculated to define the result of a KNN. Nevertheless, below 1 unchanged, , regarding the distribution in Figure 1(c). we show that the probabilistic KNN models may provide 2 Quantile-Based KNN. The examples in Figures 1 (a)-(c) 1Whenk=1,thesetwomodelsarethesame. demonstrate that the existing probabilistic KNN models may be insensitive to relative distributions of object instances. objectsandagivenqueryobjectthatisalsomulti-valued. Very recently, in [10] a novel model based on the expected • We show that the problem of KNN against the quantile rank for ranking top-k uncertain objects has been proposed. group-base distance is NP-hard. Novel and efficient al- Regarding the distributions and the permuted intermediate gorithms are proposed with the approximation ratio 2. distributions as depicted above in Figures 1 (a)-(c), player A As a byproduct, our techniques to compute a φ-quantile and B always have the same expected rank. Moreover, in the distance is O(n) if single-valued object is involved while the above application we do not need to enforce the uncertain technique in [26] is O(nlogn) where n is the number of semantics among multi-instances of each player by treating instances.Besidesthetheoreticalanalysis,anextensiveperfor- them mutually exclusive. Motivated by these, we treat each mance evaluation demonstrates that the proposed techniques player as a multi-valued object. are both efficient and effective. Quantiles [26] may provide a succinct summary of data The rest of the paper is organized as follows. In Section distributions. In this paper, we investigate the KNN problem II, we formally define the problems and provide some neces- over multi-valued objects based on a φ-quantile distance (φ∈ sary background information. In Section III, we present the (0,1]) from a multi-valued object to a query Q; for example, framework of our algorithms to conduct KNN against these the median is the 0.5-quantile. We extend our investigation to 2 quantile-based KNN problems. Section IV and Section V the KNN problem over multi-valued objects based on overall present query processing techniques for these two KNN prob- distances in the “best population” (with a given size specified lems, respectively. In Section VI, we report our experiment by φ-quantile) regarding each object; such overall distances results. Related work is summarized in Section VII. This is are called a φ-quantile group-base distance. followed by conclusions. Regarding the above example, our KNN problem based on 0.5-quantiledistancesistorankplayersbasedontheirmedian II. 
We present the problem definition and necessary preliminaries. For reference, notations frequently used in the paper are summarized in Table I.

TABLE I
THE SUMMARY OF NOTATIONS

  Notation       | Definition
  ---------------|-----------------------------------------------------------
  U              | set of objects
  U (Q)          | multi-valued (query) object
  E              | entry of an R-tree
  u (q)          | instance of U (Q), a point in d-dimensional space
  w(u) (w(S))    | (total) weight of u (of the set S)
  d(q,u)         | Euclidean distance between q and u
  d^lo(E,E′)     | distance lower-bound between E and E′
  d^up(E,E′)     | distance upper-bound between E and E′
  d_φ(Q,U)       | φ-quantile distance of Q and U
  gbd_φ(Q,U)     | φ-quantile group-base distance of Q and U
  Q×U            | Cartesian product of the instances of Q and U

A. Problem Definition

Given a collection S of m elements, each element s_i has a weight w(s_i), where 0 < w(s_i) ≤ 1 and Σ_{i=1}^m w(s_i) = 1. Let S be sorted increasingly on a search key f (a function); that is, f(s_i) ≤ f(s_j) if i < j. Without loss of generality, in this paper a sorted collection S of data elements hereafter always means S sorted increasingly on a given search key unless otherwise specified.

Definition 1 (φ-quantile of S): Given a φ (0 < φ ≤ 1), the φ-quantile S_φ of S is the first element s_i in the sorted S on the search key such that Σ_{j=1}^i w(s_j) ≥ φ.

In our problem definition, an instance of an object U (or Q) is weighted; the weight gives the representativeness of an instance in U. For instance, in the examples in Section I a game statistic may appear multiple times; consequently, a normalized weight (the occurrences of an instance over the total occurrences of all instances) may be used to indicate the representativeness of an instance. Note that the total of such weights in U (or Q) is 1.

A multi-valued object U is represented as {(u_i, w(u_i)) | 1 ≤ i ≤ m}, where u_i is a point in a d-dimensional space, 0 < w(u_i) ≤ 1 (1 ≤ i ≤ m), and Σ_{i=1}^m w(u_i) = 1. A query object Q is also a multi-valued object. We use U to denote a set of multi-valued objects.

For a given Q and each U ∈ U, there are in total (|Q| × |U|) pairs of instances in Q×U, where each pair (q_i,u_j) (q_i ∈ Q and u_j ∈ U) has the weight w(q_i) × w(u_j), namely w(q_i,u_j). Clearly, Σ_{q_i∈Q, u_j∈U} w(q_i) × w(u_j) = 1. The Euclidean distance d(q_i,u_j) between q_i and u_j is called the distance of (q_i,u_j). Let Q×U = {((q_i,u_j), w(q_i,u_j)) | q_i ∈ Q & u_j ∈ U}.

Definition 2 (φ-quantile distance of Q and U): Given a φ ∈ (0,1], let Q×U be sorted increasingly on the search key, the distance d(q_i,u_j) of each element (q_i,u_j). Then, the distance of the φ-quantile of Q×U is called the φ-quantile distance of Q and U, denoted by d_φ(Q,U).

Definition 2 states that if (q,u) is the φ-quantile of Q×U (i.e., (Q×U)_φ = (q,u)), then d(q,u) is d_φ(Q,U).
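To make Definitions 1 and 2 concrete, the following minimal sketch computes the φ-quantile of a weighted collection by a sorted scan, and the φ-quantile distance by materializing all |Q|×|U| weighted instance pairs; the coordinates and weights in the usage example are made up for illustration.

```python
import math

def phi_quantile(items, phi):
    """items: list of (search_key, weight) with weights summing to 1; 0 < phi <= 1.
    Returns the first element, in increasing key order, whose cumulative
    weight reaches phi (Definition 1)."""
    acc = 0.0
    for key, w in sorted(items, key=lambda e: e[0]):
        acc += w
        if acc >= phi:
            return key, w

def naive_quantile_distance(Q, U, phi):
    """Q, U: lists of (point, weight); points are equal-length tuples.
    Returns d_phi(Q, U) per Definition 2."""
    pairs = [(math.dist(q, u), wq * wu) for q, wq in Q for u, wu in U]
    return phi_quantile(pairs, phi)[0]

Q = [((1.0, 1.0), 0.5), ((2.0, 1.0), 0.25), ((2.5, 1.5), 0.25)]
U = [((4.0, 1.0), 0.5), ((5.0, 2.0), 0.5)]
print(naive_quantile_distance(Q, U, 0.5))  # the median pair distance
```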
Fig. 2. Distances between 2 Multi-Valued Objects. (Query instances q1, q2, q3 of Q and instances u1, u2 of U.)

Example 1: Regarding the example in Figure 2, |Q| = 3 and |U| = 2. Assume that w(q1) = 1/2, w(q2) = w(q3) = 1/4, and w(u1) = w(u2) = 1/2. Consequently, Q×U consists of the following six pairs sorted increasingly on their distances: Q×U = {((q2,u1), 1/8), ((q3,u1), 1/8), ((q3,u2), 1/8), ((q1,u1), 1/4), ((q2,u2), 1/8), ((q1,u2), 1/4)}.

Note that the 2.5/8-quantile and the 3/8-quantile of Q×U are the same pair (q3,u2). The 0.2-quantile distance d_0.2(Q,U) of Q and U is d(q3,u1), d_0.5(Q,U) is d(q1,u1), and d_0.6(Q,U) is also d(q1,u1).

Below, we specify a measure based on aggregates to define the top/best quantile-population of S.

Definition 3 (φ-quantile population of S): Given an S and a φ ∈ (0,1], a φ-quantile population S_{φ,P} of S is a sub-collection S′ of S such that the total weight of the elements in S′ is not smaller than φ and removing any element from S′ makes the total weight of the remaining sub-collection smaller than φ.

Definition 4 (φ-quantile group-base distance): Given a φ ∈ (0,1], the φ-quantile group-base distance of Q and U is the minimum total weighted distance among the φ-quantile populations of Q×U; that is, the minimum value of Σ_{(q,u)∈S′} w(q,u) d(q,u) under the constraint that S′ is a φ-quantile population of Q×U.

The φ-quantile group-base distance between Q and U is denoted by gbd_φ(Q,U). Note that for a φ ∈ (0,1], the example below shows that gbd_φ(Q,U) is not always defined on the set of the "consecutive" smallest distances. In fact, we will show in Section V that the computation of gbd_φ(Q,U) is NP-hard.

Example 2: Regarding Example 1, let φ = 0.5. gbd_0.5(Q,U) = (1/8) d(q2,u1) + (1/8) d(q3,u1) + (1/8) d(q3,u2) + (1/8) d(q2,u2) instead of (1/8) d(q2,u1) + (1/8) d(q3,u1) + (1/8) d(q3,u2) + (1/4) d(q1,u1). Here, {((q2,u1),1/8), ((q3,u1),1/8), ((q3,u2),1/8), ((q1,u1),1/4)} is not even a 0.5-quantile population of Q×U. In fact, there are several 0.5-quantile populations of Q×U, including {((q3,u1),1/8), ((q2,u2),1/8), ((q1,u1),1/4)}, {((q2,u1),1/8), ((q2,u2),1/8), ((q1,u1),1/4)}, etc.

Definition 5 (φ-Quantile KNN): Given a φ ∈ (0,1], a set U of multi-valued objects in a d-dimensional space, and a multi-valued query object Q, the φ-quantile KNN problem is to retrieve the set Φ_K of K objects from U such that for each U ∈ Φ_K and each U′ ∈ U − Φ_K, d_φ(Q,U) ≤ d_φ(Q,U′).

Definition 6 (φ-Quantile Group-base KNN): Given a φ ∈ (0,1], a set U of multi-valued objects in a d-dimensional space, and a multi-valued query object Q, the φ-quantile group-base KNN problem is to retrieve the set Φ_K of K objects from U such that for each U ∈ Φ_K and each U′ ∈ U − Φ_K, gbd_φ(Q,U) ≤ gbd_φ(Q,U′).

Problem Statement. In this paper, for a given φ ∈ (0,1], we study the problems of efficiently computing the φ-quantile KNN and the φ-quantile group-base KNN.²

² Note that the techniques developed in this paper are based on Euclidean distance; nevertheless, they can be immediately extended to cover other distance metrics.
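Definition 4 optimizes over all φ-quantile populations, so a direct computation is exponential. The brute-force sketch below, feasible only for tiny inputs, enumerates sub-collections, keeps those that are φ-quantile populations (total weight at least φ and minimal under removal of any element), and returns the minimum total weighted distance; the distances in the demo are illustrative placeholders, while the weights follow Example 1.

```python
from itertools import combinations

def gbd_brute_force(pairs, phi):
    """pairs: list of (distance, weight) for Q x U. Returns gbd_phi(Q, U)
    by exhaustive enumeration (exponential; for illustration only)."""
    best = float("inf")
    for r in range(1, len(pairs) + 1):
        for sub in combinations(pairs, r):
            total_w = sum(w for _, w in sub)
            # phi-quantile population: weight >= phi, minimal w.r.t. removal
            if total_w >= phi and all(total_w - w < phi for _, w in sub):
                best = min(best, sum(d * w for d, w in sub))
    return best

pairs = [(1.0, 1/8), (1.1, 1/8), (1.2, 1/8), (1.3, 1/4), (1.4, 1/8), (2.0, 1/4)]
print(gbd_brute_force(pairs, 0.5))
```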
B. Preliminaries

Given a collection S of m elements, each element s_i has a weight w(s_i), where 0 < w(s_i) ≤ 1 and Σ_{i=1}^m w(s_i) ≤ 1. A naive way to compute the φ-quantile of S is to firstly sort S regarding a given search key f, and then scan the sorted list to obtain the φ-quantile of S. Clearly, the naive algorithm runs in O(m log m).

In [8], an efficient and effective partitioning technique PARTITIONING(S) is proposed to find an element s ∈ S that divides S into two sub-collections S1 and S2 with the following properties:
1) for each s′ ∈ S1, f(s′) ≤ f(s); and for each s′ ∈ S2, f(s′) ≥ f(s);
2) |S1| ≥ (3/10)m − 6 and |S2| ≥ (3/10)m − 6.

Using the partitioning technique, in Algorithm 1 we present an iteration-based algorithm to compute a φ-quantile when S is not sorted.

Algorithm 1: QUANTILE (S, φ)
Input: S: a collection of m elements; f: a specified search key; φ: 0 < φ ≤ Σ_{i=1}^m w(s_i)
Output: the φ-quantile of S
1  (s, S1, S2) ← PARTITIONING(S);
2  if φ ≤ w(S1) then
3    call QUANTILE(S1, φ);
4  else
5    if φ > w(S1) + w(s) then
6      call QUANTILE(S2, φ − w(S1) − w(s));
7    else
8      return s;

In Algorithm 1, w(S1) denotes the total weight of the elements in S1. When S has only one element, S1 = S2 = ∅.

It is shown in [8] that the time complexity of PARTITIONING(S) is linear, O(|S|). Consequently, each iteration runs in linear time regarding the current sub-collection size. Recall property 2 of PARTITIONING(S) above. It is immediate that the sizes of the sub-collections involved in the iterations of Algorithm 1 are exponentially reduced; at the ith iteration the size is bounded by ((7/10)^{i−1} m + c), where c is a constant. Consequently, the time complexity of Algorithm 1 is linear, O(m). The correctness of Algorithm 1 immediately follows from property 1 of PARTITIONING(S).
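The sketch below is a runnable rendering of Algorithm 1. The paper's PARTITIONING [8] is the deterministic median-of-medians split, which guarantees worst-case linear time; for brevity this sketch substitutes a random pivot, which gives only expected linear time, and groups elements with equal keys together.

```python
import random

def weighted_quantile(S, phi):
    """S: non-empty list of (search_key, weight); 0 < phi <= total weight.
    Returns the phi-quantile element of S in the sense of Definition 1."""
    if len(S) == 1:
        return S[0]
    pivot = random.choice(S)[0]          # stand-in for PARTITIONING(S) of [8]
    S1 = [e for e in S if e[0] < pivot]
    mid = [e for e in S if e[0] == pivot]
    S2 = [e for e in S if e[0] > pivot]
    w1 = sum(w for _, w in S1)
    wmid = sum(w for _, w in mid)
    if phi <= w1:
        return weighted_quantile(S1, phi)
    if phi > w1 + wmid:
        return weighted_quantile(S2, phi - w1 - wmid)
    acc = w1                             # the quantile falls in the pivot group
    for e in mid:
        acc += e[1]
        if acc >= phi:
            return e

S = [(3.0, 0.2), (1.0, 0.3), (4.0, 0.1), (2.0, 0.4)]
print(weighted_quantile(S, 0.5))  # -> (2.0, 0.4): cumulative weight 0.3 + 0.4 >= 0.5
```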
III. FRAMEWORK OVERVIEW

Our techniques for solving the φ-quantile KNN and the φ-quantile group-base KNN for a given φ ∈ (0,1] follow a standard seeding-refinement paradigm, outlined in Algorithm 2.

Algorithm 2: Framework
• Phase 1 - Seeding: Compute the φ-quantile (or φ-quantile group-base) distance from each of the K chosen objects to Q.
• Phase 2 - Refinement: Determine the final solution for φ-quantile KNN (or φ-quantile group-base KNN).

In the seeding phase, we choose K objects and compute their φ-quantile distances (or φ-quantile group-base distances); assume that γ_K (λ_K) is the maximum of these K φ-quantile distances (φ-quantile group-base distances). In the refinement phase, we use γ_K (λ_K) and φ to effectively prune objects, and we iteratively update γ_K (λ_K) if necessary.

Selecting K Objects in the Seeding Phase. Any K multi-valued objects from U could be used for the seeding phase. Clearly, the smaller γ_K is, the more powerful γ_K may be in the refinement for pruning. In our algorithm, we use the mean a_U of the multiple instances of each multi-valued object U, and the mean a_Q of the multiple instances of Q. We then apply the KNN algorithm in [16] to obtain the K nearest neighbors of a_Q from {a_U | U ∈ U}. Subsequently, we use the K objects corresponding to these K nearest means to a_Q as the K objects in the seeding phase for both φ-quantile KNN and φ-quantile group-base KNN.

Data Structures. Below are the data structures used in the seeding and refinement phases of our techniques. For each multi-valued object U ∈ U, a local aR-tree [21] is built to organize its multiple instances. The aggregate information kept in each intermediate entry is the sum of the weights of the instances indexed by the entry. Namely, for every intermediate entry E in the local aR-tree, we record the weight of E as the sum of the weights (total weight) of the instances having E as an ancestor. Local aR-trees facilitate the efficient computation of φ-quantile (or φ-quantile group-base) KNN and support effective pruning techniques. Figure 3 shows the local aR-trees for Q and U, where each intermediate entry records w(E).

Fig. 3. Local aR-trees for Multi-Valued Objects.

Besides, we maintain an R-tree on the MBBs of the multiple instances of the objects. That is, for each object we first obtain the MBB of its multiple instances; we then build an R-tree on these MBBs. This R-tree is called the global R-tree. Note that in the global R-tree, each leaf is the MBB of an object.

Distances between MBBs. Given two MBBs E and E′, we compute the minimal and maximal distances d^lo(E,E′) and d^up(E,E′) between them as follows. For each dimension i (1 ≤ i ≤ d), let I_{E,i} and I_{E′,i} denote the intervals onto which E and E′ are projected, respectively. The minimum distance mindist_i(E,E′) between I_{E,i} and I_{E′,i} is defined as follows: if I_{E,i} overlaps I_{E′,i}, then mindist_i(E,E′) = 0; otherwise mindist_i(E,E′) is the minimal value among the distances of the 4 pairs of ends of I_{E,i} and I_{E′,i}. maxdist_i(E,E′) is the maximal value among the distances of the 4 pairs of ends of I_{E,i} and I_{E′,i}. Then,

  d^lo(E,E′) = sqrt( Σ_{i=1}^d (mindist_i(E,E′))² )
  d^up(E,E′) = sqrt( Σ_{i=1}^d (maxdist_i(E,E′))² )

Fig. 4. Minimal/Maximal Distance between 2 MBBs.

Figure 4 shows representative examples. We can immediately verify that for each pair of instances (u,u′), where u is contained in E and u′ is contained in E′, d^lo(E,E′) ≤ d(u,u′) ≤ d^up(E,E′). Thus, d^lo(E,E′) and d^up(E,E′) can be used as a lower and an upper bound of the distance of any pair in E×E′. Computing the minimal/maximal distance between two d-dimensional MBBs requires O(d) time.
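A minimal sketch of these MBB distance bounds, representing an MBB as a list of (low, high) projection intervals, one per dimension; the boxes in the demo are invented for illustration.

```python
import math

def mindist_i(a, b):
    (lo1, hi1), (lo2, hi2) = a, b
    if hi1 < lo2:
        return lo2 - hi1
    if hi2 < lo1:
        return lo1 - hi2
    return 0.0  # the projected intervals overlap

def maxdist_i(a, b):
    (lo1, hi1), (lo2, hi2) = a, b
    # the maximal distance among the 4 pairs of interval ends
    return max(abs(lo1 - hi2), abs(hi1 - lo2), abs(lo1 - lo2), abs(hi1 - hi2))

def d_lo(E, E2):
    return math.sqrt(sum(mindist_i(a, b) ** 2 for a, b in zip(E, E2)))

def d_up(E, E2):
    return math.sqrt(sum(maxdist_i(a, b) ** 2 for a, b in zip(E, E2)))

E, E2 = [(0, 1), (0, 1)], [(2, 3), (0, 1)]
print(d_lo(E, E2), d_up(E, E2))  # 1.0 and sqrt(10)
```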
IV. φ-QUANTILE KNN

We present our techniques for conducting φ-quantile KNN for a given φ ∈ (0,1]. We first present an efficient algorithm to compute a φ-quantile distance between Q and U instead of a brute-force computation; it is used in both phases. Then, we present a set of novel pruning techniques for the refinement phase, as well as the refinement algorithm.

A. Efficiently Computing φ-Quantile Distances

Given Q, U, and a φ ∈ (0,1], we present an efficient algorithm to compute d_φ(Q,U) in this section.

Naive Linear Algorithm. Firstly, for each (q_i,u_j) in Q×U, we calculate its distance d(q_i,u_j) and its weight w(q_i,u_j) (= w(q_i)w(u_j)). We then call Algorithm 1 to produce the φ-quantile (q,u) of Q×U regarding the search key value (the distance) and the weight w(q_i,u_j) of each (q_i,u_j). Clearly, d_φ(Q,U) = d(q,u), and the naive algorithm runs in linear time regarding |Q×U|; that is, O(|Q×U|).

Pruning-based Linear Algorithm. While it is costly to enumerate all pairs of instances in Q×U, intuitively most pairs in Q×U can be removed without enumerating them. This may be done by using the local aR-trees of Q and U, respectively. Our algorithm is based on a level-by-level synchronous traversal of the local aR-trees of Q and U. The example below gives the basic idea of the algorithm.

Basic Idea. Suppose that the root R_Q of the local aR-tree of Q has 2 entries (E1, E2), and the root R_U of the local aR-tree of U has 2 entries (E1′, E2′). In total, the 4 pairs of entries are depicted in Figure 5, where the two ends of the interval corresponding to (E_i,E_j′) are d^lo(E_i,E_j′) and d^up(E_i,E_j′), respectively, and w(E_i,E_j′) is also shown as w. Assume that a lower bound d^lo_φ and an upper bound d^up_φ of d_φ(Q,U) are as depicted in Figure 5.

Fig. 5. Prune Entries at the Current Level. (Intervals [d^lo, d^up] of the pairs (E1,E1′) with w = 0.2, (E1,E2′) with w = 0.3, (E2,E1′) with w = 0.3, and (E2,E2′) with w = 0.2.)

Since d^up(E1,E1′) is smaller than d^lo_φ and d^lo(E2,E2′) is larger than d^up_φ, the pairs (E1,E1′) and (E2,E2′) can be removed; only (E1,E2′) and (E2,E1′) need to be explored at the next level. Since every pair of instances in (E1,E1′) is guaranteed to have a distance smaller than d_φ(Q,U) and the total weight of (E1,E1′) is 0.2, from the next level onwards we only need to find the (φ − 0.2)-quantile distance among the remaining pairs of instances. This is the basic idea of our algorithm.

Algorithm Description. We outline our algorithm in Algorithm 3 below in a recursive fashion. Note that the input of the algorithm is a collection of pairs of entries; initially it is R_Q × R_U, where R_Q (R_U) is the set of entries in the root of the local aR-tree of Q (U).

Algorithm 3: QUANTILE-DISTANCE (R_Q × R_U, φ)
Input: R_Q × R_U; φ: 0 < φ ≤ 1
Output: d_φ(Q,U)
1  θ := 0; T1 := ∅; T2 := ∅;
2  if R_Q × R_U only contains leaf entries then
3    call Algorithm 1 on R_Q × R_U and φ;
4  else
5    calculate d^lo_φ and d^up_φ regarding R_Q × R_U;
6    for each (E,E′) ∈ R_Q × R_U do
7      if d^up(E,E′) ≥ d^lo_φ and d^lo(E,E′) ≤ d^up_φ then
8        T1 := {(E,E′)} ∪ T1;
9      else if d^up(E,E′) < d^lo_φ then
10       θ := θ + w(E) × w(E′);
11   for each (E,E′) ∈ T1 do
12     T2 := T2 ∪ ENUMERATING(E,E′);
13   QUANTILE-DISTANCE (T2, φ − θ);

In Algorithm 3, lines 7 and 8 keep in T1 only the pairs that may contain the φ-quantile; pairs whose maximum distance is smaller than d^lo_φ or whose minimum distance is larger than d^up_φ are excluded (i.e. not explored further). Lines 9 and 10 cumulatively record the total weight θ of the removed pairs of entries whose maximum distance is smaller than d^lo_φ. Lines 11 and 12 enumerate all the remaining pairs of entries at the next level for the next iteration; this is shown in the example below. To ensure correctness, in the next iteration we compute the (φ − θ)-quantile distance among the remaining pairs of instances.

Example 3: Continue the example (Figure 5) in the Basic Idea part. Using Algorithm 3, T1 = {(E1,E2′), (E2,E1′)} in the 1st iteration and θ = 0.2. Assume that the child node NODE(E1) of R_Q corresponding to the entry E1 has two entries {E1,1, E1,2}, NODE(E2) contains {E2,1, E2,2}, NODE(E1′) contains {E1,1′, E1,2′}, and NODE(E2′) contains {E2,1′, E2,2′}. In Algorithm 3, ENUMERATING(E1,E2′) generates the 4 pairs of entries {(E1,1,E2,1′), (E1,1,E2,2′), (E1,2,E2,1′), (E1,2,E2,2′)}. Similarly, ENUMERATING(E2,E1′) generates the 4 pairs of entries {(E2,1,E1,1′), (E2,1,E1,2′), (E2,2,E1,1′), (E2,2,E1,2′)}. These 8 pairs (in T2), together with (φ − 0.2), are sent to the next iteration, QUANTILE-DISTANCE (T2, φ − 0.2).
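The following sketch isolates one filtering iteration of Algorithm 3 (lines 6-12), with entry pairs abstracted as (d_lo, d_up, weight, children) tuples and the level bounds assumed to be already computed (their derivation via Theorem 1 follows below); the numbers in the demo mimic Figure 5.

```python
def filter_level(pairs, d_lo_phi, d_up_phi):
    """Returns (T2, theta): the child pairs surviving to the next level and
    the total weight of pairs pruned below d_lo_phi."""
    theta = 0.0
    T1 = []
    for p in pairs:
        lo, up, w, children = p
        if up < d_lo_phi:
            theta += w       # whole pair lies below the quantile: record weight
        elif lo <= d_up_phi:
            T1.append(p)     # may contain the phi-quantile: keep exploring
        # pairs with lo > d_up_phi are simply discarded
    T2 = [c for p in T1 for c in p[3]]   # ENUMERATING: expand surviving pairs
    return T2, theta

level = [(0.1, 0.4, 0.2, []),   # (E1,E1'): d_up < d_lo_phi, weight goes to theta
         (0.3, 0.9, 0.3, []),   # (E1,E2'): kept
         (0.5, 1.2, 0.3, []),   # (E2,E1'): kept
         (1.0, 1.5, 0.2, [])]   # (E2,E2'): d_lo > d_up_phi, discarded
print(filter_level(level, 0.45, 0.95)[1])  # theta = 0.2
```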
Remark 1: If {E1′, E2′} are leaf entries (i.e. points), then they have no corresponding child nodes. In this case, ENUMERATING(E1,E2′) and ENUMERATING(E2,E1′) generate the following 4 pairs of entries for the next iteration: {(E1,1,E2′), (E1,2,E2′), (E2,1,E1′), (E2,2,E1′)}.

Calculating d^lo_φ and d^up_φ. In Algorithm 3, at each iteration we need to calculate d^lo_φ and d^up_φ (line 5), except when all entries are at the leaf level. Assume that at the jth iteration there are l pairs of entries left; that is, T2 = {t_i | 1 ≤ i ≤ l}, each t_i of the form (E,E′), where E is an entry from the local aR-tree of Q and E′ is an entry from the local aR-tree of U. Recall that the maximum distance d^up(t_i) and the minimum distance d^lo(t_i) of a pair t_i of entries are defined in Section III. Suppose that in the current iteration we want to compute the φ′-quantile distance d_{φ′}(T) in T, where T denotes the collection of pairs of instances each of which has an element of T2 as an ancestor; that is, for each pair of instances (q,u) ∈ T, there is an (E,E′) ∈ T2 such that E contains q and E′ contains u.

Let the φ′-quantile of T2 regarding the search key d^lo(t_i) be (EL,EL′), and let the φ′-quantile of T2 regarding the search key d^up(t_i) be (EU,EU′).

Theorem 1: d^lo(EL,EL′) ≤ d_{φ′}(T) ≤ d^up(EU,EU′).

Proof: According to the definition of the φ′-quantile distance, it is immediate that Σ_{t∈T2, d^lo(t) ≤ d_{φ′}(T)} w(t) ≥ φ′; consequently, d^lo(EL,EL′) ≤ d_{φ′}(T). Similarly, according to the definition of the φ′-quantile distance, Σ_{t∈T2, d^up(t) < d_{φ′}(T)} w(t) < φ′. Therefore, d_{φ′}(T) ≤ d^up(EU,EU′).

Theorem 1 implies that d^lo(EL,EL′) and d^up(EU,EU′) are a lower bound and an upper bound, respectively, of d_{φ′}(T). Thus, in line 5 of Algorithm 3 we can calculate d^lo_φ := d^lo(EL,EL′) and d^up_φ := d^up(EU,EU′) by Algorithm 1 in linear time.
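A sketch of line 5 via Theorem 1, reusing weighted_quantile from the Algorithm 1 sketch above: the φ′-quantiles of the current entry pairs under the d^lo and d^up search keys give the two bounds; the pair bounds and weights in the demo are illustrative.

```python
def quantile_distance_bounds(pairs, phi):
    """pairs: list of (d_lo, d_up, weight) for the current entry pairs T2.
    Returns (d_lo_phi, d_up_phi) per Theorem 1."""
    lo_items = [(lo, w) for lo, _, w in pairs]
    up_items = [(up, w) for _, up, w in pairs]
    return weighted_quantile(lo_items, phi)[0], weighted_quantile(up_items, phi)[0]

pairs = [(0.1, 0.4, 0.2), (0.3, 0.9, 0.3), (0.5, 1.2, 0.3), (1.0, 1.5, 0.2)]
print(quantile_distance_bounds(pairs, 0.5))  # -> (0.3, 0.9)
```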
Time Complexity. In each iteration, our algorithm is linear regarding |T2|; that is, O(|T2|). Since the total numbers of entries in the local aR-trees of Q and U are O(|Q|) and O(|U|), respectively, Algorithm 3 runs in linear time regarding |Q×U|; that is, O(|Q×U|).

Correctness. The following theorem can be immediately verified based on the definition of the φ-quantile distance.

Theorem 2: Let θ denote the total weight of the pairs of entries pruned so far by d^lo_φ at each iteration. Then d_{φ−θ}(T) = d_φ(Q,U), where T consists of all remaining pairs of instances after the current iteration.

Theorems 1 and 2 imply that Algorithm 3 is correct; that is, it produces d_φ(Q,U).

Filtering while Enumerating. Algorithm 3 can be improved when enumerating children pairs in line 12, ENUMERATING(E,E′). For every enumerated pair t of children entries, before adding it to T2 we check whether it can be pruned by the current distance lower and upper bounds. T2 then keeps only the remaining pairs of children entries for the next iteration. Note that in θ we also include the total weight of the children pairs pruned by the current lower bound d^lo_φ.

Fig. 6. Filtering while Enumerating. (Child entry pairs of (E1,E2′): (E1,1,E2,1′) w = 0.05, (E1,1,E2,2′) w = 0.05, (E1,2,E2,1′) w = 0.1, (E1,2,E2,2′) w = 0.1; child entry pairs of (E2,E1′): (E2,1,E1,1′) w = 0.1, (E2,1,E1,2′) w = 0.05, (E2,2,E1,1′) w = 0.1, (E2,2,E1,2′) w = 0.05.)

Example 4: Continue Example 3. The 8 enumerated pairs are depicted in Figure 6. The pair (E2,1,E1,1′) with weight 0.1 is pruned by the current d^lo_φ. Consequently, the remaining 7 pairs (put in T2) and (φ − 0.2 − 0.1) are used to call QUANTILE-DISTANCE () for the next iteration.

Note that the time complexity of Algorithm 3 with the "Filtering while Enumerating" technique added remains O(|Q| × |U|).

B. Refinement Algorithm

In the seeding phase, d_φ(Q,U) is calculated for each U of the K chosen objects. A max heap maintains these K objects based on their φ-quantile distances; γ_K, the maximum of the K φ-quantile distances, sits on its top. Our refinement algorithm for generating the final result of φ-quantile KNN is outlined below in Algorithm 4. In the algorithm, we effectively use γ_K and φ to prune as many entries as possible, both in the global R-tree on the MBBs of objects and in the local aR-trees. Since "closer" objects have a better chance to be in the final result of KNN (and thus a better chance to reduce γ_K), we traverse the global R-tree based on the priority that an entry E with the smallest minimum distance to the MBB of Q is visited first. This is done by maintaining a heap H on the currently extended entries.

Algorithm 4: Refinement
1  while H ≠ ∅ do
2    E := deheap(H);
3    if not PRUNED1(Q,E) then
4      if E is an intermediate entry then
5        add the MBBs of the child entries to H;
6      else  /* E corresponds to an object U */
7        if not PRUNED2(Q,E) then
8          call Algorithm 3 to compute d_φ(Q,E);
9          if d_φ(Q,E) < γ_K then
10           update the current KNN;

In Algorithm 4, we initially load into H the MBBs of the entries in the root node of the global R-tree. We then iteratively apply PRUNED1(Q,E) to the heap top E by using the pruning rules below in Section IV-C for the global R-tree. If E cannot be pruned (i.e. PRUNED1(Q,E) returns FALSE) and E is an intermediate entry, then we add to H the entries of the child node having E as its MBB. When E cannot be pruned and E is the MBB of an object U, we apply the pruning rules below in Section IV-C on the individual object, PRUNED2(Q,E), to prune U (PRUNED2(Q,E) returns TRUE if pruned) or to "trim" the entries in the local aR-tree of U.
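A sketch of Algorithm 4's best-first traversal. The callables pruned1 and pruned2 (the pruning rules of Section IV-C), d_lo_to_Q (the MBB lower-bound distance to Q), and quantile_dist (Algorithm 3) are assumed to be supplied; entries are dicts with 'children' (empty for an object's MBB) and 'obj' keys. This illustrates the control flow only, not the authors' implementation.

```python
import heapq

def refine(root_entries, seeds, phi, d_lo_to_Q, pruned1, pruned2, quantile_dist):
    """seeds: the K (d_phi, object) pairs from the seeding phase."""
    knn = sorted(seeds, key=lambda s: -s[0])          # current KNN, max first
    gamma_K = knn[0][0]
    heap = [(d_lo_to_Q(e), i, e) for i, e in enumerate(root_entries)]
    heapq.heapify(heap)
    counter = len(heap)
    while heap:
        _, _, E = heapq.heappop(heap)
        if pruned1(E, gamma_K, phi):
            continue
        if E["children"]:                             # intermediate entry
            for c in E["children"]:
                heapq.heappush(heap, (d_lo_to_Q(c), counter, c))
                counter += 1
        elif not pruned2(E, gamma_K, phi):            # MBB of an object U
            d = quantile_dist(E["obj"], phi)          # Algorithm 3
            if d < gamma_K:
                knn = sorted(knn[1:] + [(d, E["obj"])], key=lambda s: -s[0])
                gamma_K = knn[0][0]                   # shrink the pruning bound
    return knn
```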
C. Pruning Rules

Pruning Rules 1 and 2 attempt to prune an entry of the global R-tree, while Pruning Rule 3 further examines the "details" of a remaining object using its local aR-tree.

Pruning an Entry E of the Global R-tree. Pruning Rules 1 and 2 below use Q to prune an entry E of the global R-tree. The correctness of Pruning Rule 1 is immediate.

Pruning Rule 1 (Distance based): If d^lo(E_Q,E) ≥ γ_K, then E can be pruned, where E_Q is the MBB of Q and E is an entry of the global R-tree. (E is pruned means that all objects indexed by E can be pruned.)

Note that the minimum distance between two MBBs is defined in Section III and can be calculated in constant time. Next, we present Pruning Rule 2.

Regarding the local aR-tree aR_Q of Q, a set Γ of entries of aR_Q is a γ_K-cover of aR_Q if 1) no 2 entries in Γ have a descendant relationship, 2) for each E_i ∈ Γ, d^lo(E_i,E) ≤ γ_K, and 3) for each entry E′ which is neither an ancestor nor a descendant of any entry in Γ, d^lo(E′,E) > γ_K. The following theorem is immediate from the definition of the φ-quantile distance.

Theorem 3: Let Γ be a γ_K-cover of aR_Q. If Σ_{E′∈Γ} w(E′) ≤ φ, then for each U ∈ E, d_φ(Q,U) ≥ γ_K.

Clearly, if we can find a γ_K-cover satisfying the condition in Theorem 3, then E can be pruned.

Pruning Rule 2 (Weights based): If there is a γ_K-cover Γ with Σ_{E′∈Γ} w(E′) ≤ φ, then E can be pruned.

Note that there can be many γ_K-covers, as shown in Example 5.

Fig. 7. γ_K-Cover. (Entries E1 with children E1,1, E1,2, E1,3 and E2 with children E2,1, E2,2, E2,3 of aR_Q.)

Example 5: As depicted in Figure 7, the γ_K-covers can be {E1,E2}, {E1,E2,3}, {E1,3,E2}, {E1,3,E2,3}. If E1,3 and E2,3 have child entries, more alternatives could be enumerated. The covers possibly have different total weights. A γ_K-cover Γ is minimum if Σ_{E′∈Γ} w(E′) is minimized. In Example 5, {E1,3,E2,3} has the smallest weight among those 4 covers. Clearly, the minimum γ_K-cover has the maximal pruning power since Σ_{E′∈Γ} w(E′) is minimized.

Executing Pruning Rule 2. Although a minimum γ_K-cover can be computed by traversing aR_Q level-by-level from the root, we do not always try to get a minimum γ_K-cover if E can be pruned earlier. We visit aR_Q level-by-level from the root. At each level i, we generate a todo list TD_i (initially ∅) and remove/trim a child entry E′ of the entries in TD_{i−1} if d^lo(E′,E) > γ_K. For each remaining (not trimmed) child entry E′, an E′ with d^up(E′,E) ≤ γ_K will not be extended at the next level, since all its descendant entries always have their minimum (and maximum) distances not greater than γ_K; we accumulate w(E′) in Δ. An E′ with d^up(E′,E) > γ_K will be extended at the next level for further trimming, and thus it is put into TD_i. E is pruned, and we terminate the execution of Pruning Rule 2, if the value of Δ plus the total weight of the entries in TD_i is not greater than φ. Note that if E cannot be pruned, then the execution terminates when either the current TD_i is empty or the leaf level is reached. Moreover, at the root level (i.e. i = 1), we assume that TD_0 consists of the MBB E_Q of Q.

Clearly, the execution of Pruning Rule 2 terminates once E is pruned or the minimum weight of a γ_K-cover is obtained (Δ plus the total weight in the current TD_i). If E is the MBB of an object U (i.e. E corresponds to U) and U cannot be removed, then we record the obtained total weight of the minimum γ_K-cover in Δ_Q and record the trimmed local aR-tree as aR_{Q,trim}. Δ_Q will be used in the next pruning rule, and aR_{Q,trim} will be used in line 8 to call Algorithm 3 if U cannot be pruned by the next pruning rule.

Example 6: Continue Example 5 regarding Figure 7. Suppose that the root level of aR_Q contains the entries E1 and E2 regarding the depicted γ_K. At the root level, we obtain TD_1 = {E1,E2}; TD_0 = {E_Q}. At the next level, E1,1, E1,2, E2,1, E2,2 are trimmed; consequently, TD_2 = {E1,3,E2,3} and Δ = 0 if d^up(E1,3,E) > γ_K and d^up(E2,3,E) > γ_K. If w(TD_2) < φ, then E is pruned; otherwise we go to the next level for further exploration.

In case d^up(E1,3,E) ≤ γ_K and d^up(E2,3,E) ≤ γ_K, TD_2 remains ∅ and Δ = w(E1,3) + w(E2,3). If Δ > φ, then E cannot be pruned. Since TD_2 = ∅, the execution of Pruning Rule 2 terminates. If E is an object, then we record Δ_Q and aR_{Q,trim}. Here, Δ_Q = w(E1,3) + w(E2,3), and in aR_{Q,trim} the entries E1,1, E1,2, E2,1, E2,2 are pruned/trimmed.

Remark 2: When we trim/remove entries of R-trees, we do a "logical" removal by commenting them out.

PRUNED1(Q,E). For each entry, we first check Pruning Rule 1; PRUNED1(Q,E) returns TRUE if E is pruned. If E cannot be pruned by Pruning Rule 1, then we invoke the above execution of Pruning Rule 2; PRUNED1(Q,E) returns TRUE if E is pruned.
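A sketch of the level-by-level execution of Pruning Rule 2 described above. Entries of aR_Q are modeled as dicts carrying their weight, their children (empty at the leaf level), and precomputed d^lo/d^up bounds to the candidate entry E; this simplified version only reports whether E can be pruned, omitting the bookkeeping of Δ_Q and aR_{Q,trim}.

```python
def pruned_by_rule2(root_entries, gamma_K, phi):
    """root_entries: root-level entries of aR_Q, each a dict with keys
    'w', 'dlo', 'dup', 'children'. Returns True iff E can be pruned."""
    delta = 0.0
    todo = list(root_entries)                 # TD_1 (TD_0 is the MBB of Q)
    while todo:
        td_i = []
        for ent in todo:
            if ent["dlo"] > gamma_K:
                continue                      # trimmed: cannot fall within gamma_K
            if ent["dup"] <= gamma_K:
                delta += ent["w"]             # entirely within gamma_K: stop expanding
            else:
                td_i.append(ent)              # needs further trimming below
        if delta + sum(e["w"] for e in td_i) <= phi:
            return True                       # a gamma_K-cover of weight <= phi exists
        todo = [c for e in td_i for c in e["children"]]
    return False
```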
Trimming the Local aR-Tree of U. Before conducting the computation of the φ-quantile distance by Algorithm 3, we first trim the entries of the local aR-tree by γ_K. We conduct this in a level-by-level fashion from the root of the local aR-tree, in the same way as the execution of Pruning Rule 2 except that we swap the roles of Q and U; that is, Q becomes E and U becomes Q in the execution of Pruning Rule 2. At each level i of the aR-tree of U, we check the following pruning rule.

Pruning Rule 3 (Using Local aR-tree): If (Δ + w(TD_i)) × Δ_Q ≤ φ, then U can be pruned.³

³ Here, Δ_Q is obtained as the weight of the minimum γ_K-cover of the local aR-tree of Q, and Δ and TD_i are recorded when we execute Pruning Rule 2 at level i with the roles of Q and U swapped as described above.

Proof: From the definition of the φ-quantile distance, it is immediate that if (Δ + w(TD_i)) × Δ_Q ≤ φ, then d_φ(Q,U) ≥ γ_K.

PRUNED2(Q,E). As described above, the execution of Pruning Rule 3 is the same as the execution of Pruning Rule 2, except that we swap the roles of Q and U and check Pruning Rule 3 instead of Pruning Rule 2 at each level. PRUNED2(Q,E) terminates and returns TRUE if E is pruned; otherwise, PRUNED2(Q,E) terminates at the leaf level (or when TD_i = ∅) and returns FALSE. When PRUNED2(Q,E) returns FALSE, R_{Q,trim} × R_{U,trim} is used as the input of Algorithm 3 for computing d_φ(Q,U) instead of R_Q × R_U. Here, R_{Q,trim} (R_{U,trim}) consists of the untrimmed entries at the root of aR_{Q,trim} (aR_{U,trim}). Note that in Algorithm 3, level-by-level we use only the untrimmed entries from both aR_{Q,trim} and aR_{U,trim}. We can further speed up the computation by visiting only the intermediate nodes, in aR_{Q,trim} and aR_{U,trim}, respectively, with more than one child.

The correctness of Algorithm 4 immediately follows from the theorems and pruning rules above. Note that when Algorithm 3 is invoked, at each iteration we use the minimum of γ_K and the obtained upper bound d^up_φ as the upper bound.
V. φ-QUANTILE GROUP-BASE KNN

Our algorithm for solving the φ-quantile group-base KNN (φ ∈ (0,1]) (defined in Section II-A) also follows the seeding-refinement framework of Algorithm 2. For the seeding phase, we first show that computing a φ-quantile group-base distance gbd_φ(Q,U) between Q and U is NP-hard, and then an existing algorithm is employed, with approximation factor 2, to approximately compute gbd_φ(Q,U). In the refinement phase, 2 novel, effective pruning techniques are developed.

A. Computing φ-Quantile Group-base Distances

We first show that the Knapsack Problem can be converted to a special case of our problem.

Knapsack Problem. It is NP-complete and can be formally described as follows [11].

INSTANCE: A finite set S; for each element s ∈ S, an integer size c(s) and an integer value v(s); and positive integers X and Y.

QUESTION: Is there a subset S′ of S such that Σ_{s∈S′} c(s) ≤ X and Σ_{s∈S′} v(s) ≥ Y?

NP-hardness. As defined in Section II-A, the problem of computing gbd_φ(Q,U) can be stated as follows: find a subset S′ of Q×U such that Σ_{(q,u)∈S′} w(q,u) ≥ φ and Σ_{(q,u)∈S′} w(q,u) d(q,u) is minimized.

Theorem 4: The problem of computing gbd_φ(Q,U) is NP-hard.

Proof: A special case of the problem of computing gbd_φ(Q,U) is that Q is a point with weight 1. In this case, each w(u) (= w(Q,u)) may be arbitrarily assigned under the constraint Σ_{u∈U} w(u) = 1, and the location of each u can be chosen so that w(Q,u) d(Q,u) equals any integer. If we normalize the above Knapsack Problem by normalizing each v(s) by Σ_{s′∈S} v(s′), then the normalized version of Knapsack is also NP-complete. The decision problem of the above special case of computing gbd_φ(Q,U) is the same as the normalized Knapsack Problem. Consequently, the problem of computing gbd_φ(Q,U) is NP-hard.

Approximately Computing gbd_φ(Q,U). If we want to maximize Σ_{s∈S′} v(s) with respect to a given X in the Knapsack Problem, then there is a PTAS; that is, a polynomial-time approximation scheme giving an approximation factor arbitrarily close to 1. Nevertheless, there is no PTAS to approximately minimize Σ_{s∈S′} c(s) regarding a given Y.

We adopt the approximation algorithm in [13] for the Knapsack Problem. It runs in time O(m log m), where m is the number of elements in S, with approximation factor 2 for minimizing Σ_{s∈S′} c(s) for a given Y. The algorithm can be immediately used to approximately compute gbd_φ(Q,U) if we treat Q×U as S and, for each (q,u) ∈ Q×U, treat w(q,u) as a v value and w(q,u) d(q,u) as a c value in the Knapsack Problem. Let aproxgbd_φ(Q,U) denote the group distance output by the approximation algorithm. The following theorem is shown in [13].

Theorem 5: 1 ≤ aproxgbd_φ(Q,U) / gbd_φ(Q,U) ≤ 2.

We briefly present the basic idea of the algorithm in [13] applied to computing gbd_φ. It iteratively conducts 2 phases, Completion and Growing Seed-Set, on a seed set ST that is initially ∅ (w(ST) always remains smaller than φ). We first sort Q×U increasingly based on d(q,u). In the Completion phase, for each element (q,u) in the remaining Q×U with w(q,u) + w(ST) ≥ φ, 1) replace the current feasible solution S′ if the total weighted distance of ST ∪ {(q,u)} is smaller than that of S′, and 2) remove (q,u) from Q×U. In the Growing Seed-Set phase, move the first element of the remaining Q×U to ST. In each iteration, we first conduct Completion and then Growing Seed-Set; the algorithm terminates and outputs the total weighted distance of S′ once no element is left in the remaining Q×U.

Example 7: Suppose that φ = 0.5 and Q×U contains 4 elements. To simplify the presentation, we denote each element only by its (distance, weight): {(1,0.28), (2,0.12), (3,0.48), (4,0.12)}. In our algorithm, we first sort the list increasingly based on the value of (weight × distance)/weight = distance. In the 1st iteration, nothing is chosen in the Completion phase since all elements have weight less than 0.5; ST becomes {(1,0.28)} and (1,0.28) is removed from Q×U in the Growing Seed-Set phase. In the 2nd iteration, S′ = {(1,0.28), (3,0.48)} is chosen as a feasible solution and (3,0.48) is removed from Q×U in the Completion phase; ST grows to {(1,0.28), (2,0.12)} and (2,0.12) is removed from Q×U since (2,0.12) was the first remaining element. In the 3rd iteration, regarding the Completion phase, (4,0.12) is removed from Q×U as w(ST) + 0.12 = 0.52 > 0.5, and {(1,0.28), (2,0.12), (4,0.12)} becomes S′ since its total weighted distance (1) is smaller than that (1.72) of S′ = {(1,0.28), (3,0.48)}. Consequently, 1 is output as aproxgbd_0.5(Q,U); in this example it happens that aproxgbd_0.5(Q,U) = gbd_0.5(Q,U).
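A runnable sketch of the Completion / Growing Seed-Set procedure exactly as described above; the demo reproduces Example 7 (elements as (distance, weight) pairs).

```python
def approx_gbd(pairs, phi):
    """pairs: list of (distance, weight) for Q x U with total weight >= phi.
    Returns the approximate phi-quantile group-base distance (factor <= 2)."""
    remaining = sorted(pairs)                 # increasing distance
    w_ST, cost_ST = 0.0, 0.0                  # weight and weighted distance of ST
    best = float("inf")
    while remaining:
        kept = []
        for d, w in remaining:                # Completion phase
            if w_ST + w >= phi:
                best = min(best, cost_ST + d * w)   # feasible solution found
            else:
                kept.append((d, w))
        remaining = kept
        if remaining:                         # Growing Seed-Set phase
            d, w = remaining.pop(0)
            w_ST += w
            cost_ST += d * w
    return best

print(approx_gbd([(1, 0.28), (2, 0.12), (3, 0.48), (4, 0.12)], 0.5))  # -> 1.0
```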
Note that this approximation algorithm does not accommodate a pruning-based level-by-level computation of gbd_φ(Q,U) because it requires access to all elements.

B. Refinement

In the seeding phase, we use the above approximation algorithm to approximately compute gbd_φ(Q,U) between Q and each of the K chosen objects. The largest obtained aproxgbd_φ value is denoted by λ_K. The refinement algorithm follows the framework outlined in Algorithm 4 in Section IV-B except that:
• in PRUNED1(Q,E) we use the pruning rules below;
• line 7 is removed;
• the above approximation algorithm is called to (approximately) compute gbd_φ(Q,U) instead of Algorithm 3;
• aproxgbd_φ generated by the above approximation algorithm and λ_K replace d_φ and γ_K, respectively.

In the group attaining the total weighted distance gbd_φ(Q,U), instances may come from many different entries of the local aR-tree of U. Consequently, it is not always possible to trim many entries (subtrees) from the local aR-tree as we do when computing φ-quantile KNN. Thus, in our refinement algorithm we only develop pruning rules that prune entries of the global R-tree.

Pruning Rule 4: Suppose that E_Q is the MBB of Q. If φ × d^lo(E_Q,E) ≥ λ_K, then E is pruned from the global R-tree.

The next pruning rule is used at each level. Suppose that L_k = {E_i | 1 ≤ i ≤ l} consists of all the entries at level k of the local aR-tree of Q. Without loss of generality, we assume that L_k is sorted in increasing order of d^lo(E_i,E); that is, d^lo(E_{i1},E) ≤ d^lo(E_{i2},E) if i1 < i2. Let E_j denote the φ-quantile of L_k according to the search key d^lo(E_i,E) and the weight w(E_i) of each element E_i ∈ L_k.

Pruning Rule 5: E is pruned if

  (φ − Σ_{i=1}^{j−1} w(E_i)) · d^lo(E_j,E) + Σ_{i=1}^{j−1} w(E_i) · d^lo(E_i,E) ≥ λ_K.
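A sketch of the Pruning Rule 5 test at one level of aR_Q. Entries are given as (d_lo_to_E, weight) pairs; the scan finds the φ-quantile entry E_j and evaluates the left-hand side of the rule, a lower bound on the weighted distance of any φ-quantile population.

```python
def pruned_by_rule5(level_entries, phi, lambda_K):
    """level_entries: list of (d_lo, w) for the entries of one level L_k.
    Returns True iff Pruning Rule 5 prunes E."""
    acc_w, acc_cost = 0.0, 0.0
    for d_lo, w in sorted(level_entries):     # increasing d_lo
        if acc_w + w >= phi:                  # this entry is E_j
            return (phi - acc_w) * d_lo + acc_cost >= lambda_K
        acc_w += w
        acc_cost += w * d_lo
    return False  # total weight below phi at this level: rule not applicable
```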
Executing PRUNED1(Q,E). For an entry E of the global R-tree, we first check Pruning Rule 4; this is done in constant time. If E cannot be pruned, then we traverse the local aR-tree of Q level-by-level from the root to test Pruning Rule 5. To test Pruning Rule 5 at each level k, we first need to sort L_k. The total time complexity of traversing the local aR-tree of Q to test Pruning Rule 5 is thus O(|Q|).

Accuracy Guarantee. Our algorithm for solving the φ-quantile group-base KNN has the following accuracy guarantee.

Theorem 6: Suppose that for 1 ≤ i ≤ K, U_i is ranked the top-ith in the exact φ-quantile group-base KNN, and U_i′ is ranked the top-ith by our algorithm. Then, for 1 ≤ i ≤ K, gbd_φ(Q,U_i) ≤ aproxgbd_φ(Q,U_i′) ≤ 2 gbd_φ(Q,U_i).

Proof: First, it can be immediately verified that any object U pruned (i.e., the entry E containing U is pruned) by Pruning Rule 4 or Pruning Rule 5 has the property that gbd_φ(Q,U) ≥ λ_K. From Theorem 5, it then follows that for 1 ≤ i ≤ K, gbd_φ(Q,U_i) ≤ aproxgbd_φ(Q,U_i′) ≤ 2 gbd_φ(Q,U_i).

Theorem 6 states that every ith group-base distance (i ∈ [1,K]) output by our algorithm is between gbd_φ(Q,U_i) and 2 gbd_φ(Q,U_i). Our experiments, nevertheless, indicate that the error can be much smaller in practice.

VI. EXPERIMENTAL STUDY

We report a thorough performance evaluation of the efficiency and effectiveness of our algorithms. In particular, we implement and evaluate the following techniques.

Q-KNN: techniques presented in Section IV to compute KNN based on a φ-quantile distance (φ ∈ (0,1]).
Naive Q-KNN: Q-KNN with the pruning rules removed.
G-KNN: techniques in Section V to compute KNN based on φ-quantile group-base distances.
Naive G-KNN: G-KNN with the pruning rules removed.

All algorithms are implemented in C++ and compiled by GNU GCC. Experiments are conducted on PCs with an Intel Xeon 2.4GHz dual CPU and 4GB memory under Debian Linux. Our experiments are conducted on both real and synthetic datasets.

Real dataset: extracted from NBA players' game-by-game statistics (http://www.nba.com), containing 339,721 records of 1,313 players. Each player is treated as a multi-valued object, where the statistics (score, assistance, rebound) of a player per game are treated as an instance with equal weight (normalized).

Synthetic datasets: generated using the methodologies in [4] regarding the following parameters. Dimensionality d varies from 2 to 5 with default value 3. The data domain in each dimension is [0,1]. The number n of objects varies from 10,000 to 50,000 with default value 10,000. The number m of instances per object follows a uniform distribution over [1,M], where M varies from 400 to 2,000 with default value 400. The value K varies among 5, 10, 20, 30 and 40 with default value 10. The average edge length of object MBBs follows a uniform or a normal distribution. In the normal distribution, the length of an MBB lies in the range [0,h] with expectation h/2 and standard deviation 0.025; in the uniform distribution, the length of MBBs uniformly spreads over [0,h], where h varies from 0.05 to 0.25 with default value 0.05 (i.e., 5% of the edge length of the whole data space). With the default setting, the total number of instances is about 2 million.

Centers of objects (objects' MBBs) follow a uniform, normal or anti-correlated distribution. Locations of the instances of an object follow a uniform or normal distribution. Weights assigned to the instances follow a uniform or normal distribution. Table II summarizes the parameters used in our experiments, with default values marked. For each experiment, we randomly choose 100 objects from the datasets as query objects and record the average performance. Default values are used in our experiments unless otherwise specified.

TABLE II
PARAMETER VALUES

  dimensionality d      | 2, 3 (default), 4, 5
  number of objects n   | 10k (default), 20k, 30k, 40k, 50k
  edge length h         | 0.05 (default), 0.1, 0.15, 0.2, 0.25
  number of instances m | 400 (default), 600, 800, 1k, 2k
  K                     | 5, 10 (default), 15, 20, 30
  φ                     | 0.1, 0.3, 0.5 (default), 0.7, 0.9
  object location       | uniform, normal, anti-correlated
  instance location     | uniform, normal
  weight distribution   | uniform, normal
  h distribution        | uniform, normal
A. Computing φ-Quantile Distance

Fig. 8. Time for Computing d_φ. ((a) varying M; (b) varying d. Processing times of Algorithm 3 versus the naive algorithm.)

Figure 8 evaluates the efficiency of our technique, Algorithm 3, for computing a φ-quantile distance against the naive algorithm described in Section IV-A. In this experiment, we randomly select 1000 pairs of objects from the datasets to test the 2 algorithms and report the average time in seconds. Figure 8(a) shows that our technique has more advantage when the number of instances increases. Figure 8(b) shows that the advantage of using Algorithm 3 gets lower when the dimensionality increases. This is because the pruning costs in Algorithm 3 are proportional to the dimensionality; when the dimensionality increases, more pruning overhead is involved. Nevertheless, Figure 8 indicates that Algorithm 3 significantly outperforms the naive algorithm. Therefore, we always use Algorithm 3 in the remaining experiments. Note that we did not evaluate the techniques in [26] since they are not generally applicable to our problem.

B. Overall Performance

Fig. 9. Overall Performance. ((a) synthetic data: Q-KNN 1.1s, Naive Q-KNN 588.4s, G-KNN 2.8s, Naive G-KNN 1132.0s; (b) NBA data: processing times of the four algorithms, up to 566.8s.)

Figure 9 reports the evaluation of the processing times of Q-KNN, Naive Q-KNN, G-KNN and Naive G-KNN over the real and synthetic datasets. As shown, Q-KNN and G-KNN are much more efficient, by up to 2 orders of magnitude, than their naive versions (i.e. the versions without the pruning techniques in the refinement phase). The improvement is less significant over the NBA data. This is because in the NBA dataset objects' MBB sizes are very large relative to the whole data space; this gives a very high overlapping degree among objects' MBBs, and thus fewer objects can be pruned during query processing.

We further evaluate the pruning powers in the refinement phase by conducting the following experiment. Regarding the φ-quantile KNN, we examine the running times of Naive Q-KNN, Naive Q-KNN with Pruning Rule 1 (P1), Naive Q-KNN with Pruning Rules 1 and 2 (P1-2), and Naive Q-KNN with Pruning Rules 1, 2 and 3 (P1-3; that is, Q-KNN). Similarly, for the φ-quantile group-base KNN, Naive G-KNN, Naive G-KNN with Pruning Rule 4 (P4), and Naive G-KNN with Pruning Rules 4 and 5 (P4-5; that is, G-KNN) are examined. The evaluation results are depicted in Figure 10. They show that all these pruning rules are very effective and efficient. These 2 experiments indicate that Q-KNN and G-KNN are much more efficient than Naive Q-KNN and Naive G-KNN, respectively. Thus, in the rest of the experiments we no longer evaluate Naive Q-KNN and Naive G-KNN. Experiments are also conducted against different settings (e.g., different data distributions, dimensionality, data sizes, etc.), and all of them demonstrate the effectiveness of each proposed pruning rule with similar behavior. The results are not included in the paper due to space limits.

Fig. 10. Pruning Powers. (Q-KNN side: Naive 588.4s, P1 29.9s, P1-2 3.6s, P1-3 1.1s; G-KNN side: Naive 1132.0s, P4 86.0s, P4-5 2.8s.)

C. Accuracy

To evaluate the accuracy of G-KNN, we use two error measures. The first is the average distance error ratio. For 1 ≤ i ≤ K, approx(i) denotes the group-base distance of the top-ith object output by G-KNN, and exact(i) denotes the group-base distance of the top-ith object in the exact solution:

  err_ratio = ( Σ_{i=1}^K |approx(i) − exact(i)| / exact(i) ) / K.

The second measure records the "misplaced" ratio. For 1 ≤ i ≤ K, if the ith object in the exact solution is not the same as the ith object in the solution output by G-KNN, then mp(i) = 1 (otherwise mp(i) = 0):

  mp_ratio = ( Σ_{i=1}^K mp(i) ) / K.
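A small sketch of the two accuracy measures, given the exact and the G-KNN top-K lists as (object id, group-base distance) pairs in rank order; the ids and distances in the demo are invented.

```python
def accuracy(exact, approx):
    """exact, approx: equal-length top-K lists of (object_id, distance)."""
    K = len(exact)
    err_ratio = sum(abs(a[1] - e[1]) / e[1] for e, a in zip(exact, approx)) / K
    mp_ratio = sum(1 for e, a in zip(exact, approx) if e[0] != a[0]) / K
    return err_ratio, mp_ratio

print(accuracy([("A", 1.0), ("B", 1.5)], [("A", 1.2), ("C", 1.6)]))
# -> (0.1333..., 0.5)
```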