Final Summary of Research for NASA Grant NAG5-10739: VEXT: A Virtual Observatory Exploration Toolkit

Jeff Schneider(1) and Andy Connolly(2)

(1) Carnegie Mellon Univ., 5000 Forbes Ave., Pittsburgh, PA 15217
(2) Dept. of Physics and Astronomy, Univ. of Pittsburgh, Pittsburgh, PA 15260

Abstract. This final report consists of two main parts. The first is taken from a paper by the PiCA Group which describes our ongoing work in fast computation of N-point correlation functions (see Moore et al. 2001). We present here a new algorithm for the fast computation of N-point correlation functions in large astronomical data sets. The algorithm is based on kd-trees which are decorated with cached sufficient statistics, thus allowing for orders-of-magnitude speed-ups over the naive non-tree-based implementation of correlation functions. We further discuss the use of controlled approximations within the computation, which allows for further acceleration. In summary, our algorithm now makes it possible to compute exact, all-pairs measurements of the 2-, 3- and 4-point correlation functions for cosmological data sets like the Sloan Digital Sky Survey (SDSS; York et al. 2000) and the next generation of Cosmic Microwave Background experiments (see Szapudi et al. 2000). The second part summarizes the progress made by the PiCA Group in this area through the AISR grant.

1 Introduction

Correlation functions are some of the most widely used statistics within astrophysics (see Peebles 1980 for an extensive review). They are often used to quantify the clustering of objects in the universe (e.g. galaxies, quasars, etc.) compared to a pure Poisson process. More recently, they have also been used to measure fluctuations in the Cosmic Microwave Background (see Szapudi et al. 2000). On large scales, the higher-order correlation functions (3-point and above) can be used to test several fundamental assumptions about the universe: for example, our hierarchical scenario for structure formation, the Gaussianity of the initial conditions, and various models for the biasing between the luminous and dark matter. The reader is referred to Szapudi (2000), Szapudi et al. (1999a,b) and Scoccimarro (2000; and references therein) for an overview of the usefulness of N-point correlation functions in constraining cosmological models.

Over the coming decade, several new, massive cosmological surveys will become available to the astronomical community. In this new era, the quality and quantity of data will warrant a more sophisticated analysis of the higher-order correlation functions of galaxies (and other objects) over the largest range of scales possible. Our ability to perform such studies will be severely limited by the computational time needed to compute such functions, and no longer by the amount of data available. In this paper, we address this computational "bottleneck" by outlining a new algorithm that uses innovative computer science to accelerate the computation of N-point correlation functions far beyond the naive O(R^N) scaling law (where R is the number of objects in the dataset and N is the order of the desired correlation function).

The algorithm presented here was developed as part of the "Computational AstroStatistics" collaboration (see Nichol et al. 2000) and is a member of a family of algorithms for a very general class of statistical computations, including nearest-neighbor methods, kernel density estimation, and clustering.
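To make the naive O(R^N) scaling above concrete, the following minimal Python sketch (ours, not part of the report; the function name and arguments are invented for illustration) counts the pairs of points whose separation falls in a bin [s_lo, s_hi). For the 2-point function this is the O(R^2) all-pairs computation that the tree-based algorithms below accelerate:

```python
import numpy as np

def naive_pair_count(points, s_lo, s_hi):
    """Count pairs (i, j), i < j, with s_lo <= |x_i - x_j| < s_hi.

    This is the naive all-pairs 2-point computation: O(R^2) distance
    evaluations for R points, with no pruning of any kind.
    """
    count = 0
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(points[i] - points[j])
            if s_lo <= d < s_hi:
                count += 1
    return count
```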
The work presented here was initially presented by Gray & Moore (2001) and will soon be discussed in a more substantial paper by Connolly et al. (2001). In this conference proceeding, we provide a brief review of kd-trees (Section 2), a discussion of the use of kd-trees in range searches (Section 3), and an overview of the development of a fast N-point correlation function code (Section 4), as well as presenting the concept of controlled approximations in the calculation of the correlation function (Section 5). In Section 6, we provide preliminary results on the computational speed-up achieved with this algorithm and discuss future prospects for further advances in this field through the use of other tree structures.

2 Review of kd-trees

Our fast N-point correlation function algorithm is built upon the kd-tree data structure, which was introduced by Friedman et al. (1977). A kd-tree is a way of organizing a set of datapoints in k-dimensional space in such a way that, once the tree is built, any query requesting a list of all points in a neighborhood can be answered quickly without needing to scan every single point. The root node of the kd-tree owns all the data points. Each non-leaf node has two children, defined by a splitting dimension n.SPLITDIM and a splitting value n.SPLITVALUE. The two children divide their parent's data points between them, with the left child owning those data points that are strictly less than the splitting value in the splitting dimension, and the right child owning the remainder of the parent's data points:

$x_i \in n.\mathrm{LEFT} \iff x_i[n.\mathrm{SPLITDIM}] < n.\mathrm{SPLITVALUE}$ and $x_i \in n$ (1)

$x_i \in n.\mathrm{RIGHT} \iff x_i[n.\mathrm{SPLITDIM}] \ge n.\mathrm{SPLITVALUE}$ and $x_i \in n$ (2)

As an example, some of the nodes of a kd-tree are illustrated in Figure 1. kd-trees are usually constructed top-down, beginning with the full set of points and then splitting in the center of the widest dimension. This produces two child nodes, each with a distinct set of points. This procedure is then repeated recursively on each of the two child nodes.

[Figure 1: (a) The top node of a kd-tree is simply a hyper-rectangle surrounding the datapoints. (b) The second level contains two nodes. (c) The third level contains four nodes. Note how a parent node creates its two children by splitting in the center of its widest dimension. (d) The set of nodes in the sixth level of the tree.]

A node is declared to be a leaf, and is left unsplit, if the widest dimension of its bounding box is <= some threshold, MINBOXWIDTH. A node is also left unsplit if it contains fewer than some threshold number of points, r_min. A leaf node has no children, but instead contains a list of k-dimensional vectors: the actual datapoints contained in that leaf. The values MINBOXWIDTH = 0 and r_min = 1 would produce the largest kd-tree structure, because all leaf nodes would contain singleton or coincident points. In practice, we set MINBOXWIDTH to 1% of the range of the data point components and r_min to around 10. The tree size and construction cost are thus considerably less than these bounds, because in dense regions tiny leaf nodes are able to summarize dozens of data points. The operations needed in tree-building are computationally trivial and therefore the overhead in constructing the tree is negligible.
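As an illustration of this construction, here is a minimal Python sketch (ours, not the authors' code; the class name `KDNode` and the constants are invented, with the thresholds set to the values suggested in the text) of a top-down build that splits at the center of the widest dimension:

```python
import numpy as np

MIN_BOX_WIDTH = 0.01  # assumed: ~1% of the data range, per the text
R_MIN = 10            # assumed leaf threshold, "around 10" per the text

class KDNode:
    """One node of a kd-tree, decorated with its bounding box and the
    cached sufficient statistic num_points (see Section 3.1)."""

    def __init__(self, points):
        self.points = points                  # datapoints owned by this node
        self.lo = points.min(axis=0)          # bounding-box lower corner
        self.hi = points.max(axis=0)          # bounding-box upper corner
        self.num_points = len(points)         # cached count
        self.left = self.right = None
        widths = self.hi - self.lo
        dim = int(np.argmax(widths))          # widest dimension
        if widths[dim] <= MIN_BOX_WIDTH or len(points) < R_MIN:
            return                            # leaf: keep the point list
        split = 0.5 * (self.lo[dim] + self.hi[dim])  # center of widest dim
        mask = points[:, dim] < split
        self.left = KDNode(points[mask])      # strictly less than split value
        self.right = KDNode(points[~mask])    # the remainder
```

Building the tree is then `root = KDNode(X)` for an R x k data matrix X; both children are always non-empty because the points attaining the minimum and maximum of the widest dimension fall on opposite sides of the split.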
Also, once a tree is built, it can be re-used for many different analysis operations.

Since the introduction of kd-trees, many variations of them have been proposed and used with great success in areas such as databases and computational geometry (Preparata & Shamos 1985). R-trees (Guttman 1984) are designed for disk-resident data sets and efficient incremental addition of data. Metric trees (see Uhlmann 1991) place hyperspheres around tree nodes instead of axis-aligned splitting planes. In all cases, the algorithms we discuss in this paper could be applied equally effectively with these other structures. For example, Moore (2000) shows the use of metric trees for accelerating several clustering and pairwise comparison algorithms.

3 Range Searching

Before proceeding to fast N-point calculations, we will begin with a very standard kd-tree search algorithm that could be used as a building block for fast 2-point computations. For simplicity of exposition we will assume that every node of the kd-tree contains one extra piece of information: the bounding box of all the points it contains. Call this box n.BOUNDBOX. The implication of this is that every node must contain two new k-dimensional vectors to represent the lower and upper limits of each dimension of the bounding box.

The range search operation takes two inputs. The first is a k-dimensional vector q called the query point. The second is a separation distance s_hi. The operation returns the complete set of points in the kd-tree that lie within distance s_hi of q.

• RangeSearch(n, q, s_hi) returns a set of points S such that

$x \in S \iff x \in n$ and $|x - q| < s_{hi}$ (3)

• Let MINDIST := the closest distance from q to n.BOUNDBOX.
• If MINDIST >= s_hi then it is impossible that any point in n can be within range of the query, so simply return the empty set of points without doing any further work.
• Else, if n is a leaf node, we must iterate through all the datapoints in its leaf list. For each point, find whether it is within distance s_hi of q; if so, add it to the set of results.
• Else, n is not a leaf node. Then:
  - Let S_left := RangeSearch(n.LEFT, q, s_hi)
  - Let S_right := RangeSearch(n.RIGHT, q, s_hi)
  - Return S_left ∪ S_right.

[Figure 2: (a) The shaded rectangles denote nodes that were pruned during a search for the set of points that lie inside the circle. (b) When the range is larger there are fewer opportunities for pruning.]

Figure 2a shows the result of running this algorithm in two dimensions. Many large nodes are pruned from the search: 117 distance calculations were needed to perform this range search, compared with the 499 that would have been needed by a naive method.

Note that it is not essential that kd-tree nodes have bounding boxes explicitly stored. Instead, a hyper-rectangle can be passed to each recursive call of the above function and dynamically updated as the tree is searched.

Range searching with a kd-tree can be much faster than without if the range is small, containing only a small fraction of the total number of datapoints. But what if the range is large? Figure 2b shows an example in which kd-trees provide little computational saving because almost all the points match the query and thus need to be visited. In general this problem is unavoidable. But in one special case it can be avoided: if we merely want to count the number of datapoints in a range instead of explicitly finding them all.
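Before moving on to counting, a Python sketch of the RangeSearch procedure above (ours, not the paper's; it assumes the hypothetical `KDNode` class sketched earlier, and `min_dist` is our name for the point-to-box distance) makes the exclusion prune explicit:

```python
def min_dist(node, q):
    """Closest distance from query point q to node's bounding box
    (zero if q lies inside the box)."""
    gap = np.maximum(node.lo - q, 0) + np.maximum(q - node.hi, 0)
    return np.linalg.norm(gap)

def range_search(node, q, s_hi):
    """Return the list of points in node within distance s_hi of q."""
    if min_dist(node, q) >= s_hi:
        return []                         # prune: no point in node can match
    if node.left is None:                 # leaf: scan its point list
        return [x for x in node.points if np.linalg.norm(x - q) < s_hi]
    return (range_search(node.left, q, s_hi)
            + range_search(node.right, q, s_hi))
```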
3.1 Range Counting and Cached Sufficient Statistics

We will add the following field to a kd-tree node. Let n.NUMPOINTS be the number of points contained in node n. This is the first and simplest of a set of kd-tree decorations we refer to as cached sufficient statistics (see Moore & Lee 1998). In general, we frequently also store the centroid of all points in a node and their covariance matrix.

Once we have n.NUMPOINTS it is trivial to write an operation that counts the number of datapoints within some range without explicitly visiting them.

• RangeCount(n, q, s_hi) returns an integer: the number of points that are both inside n and also within distance s_hi of q.
• Let MINDIST := the closest distance from q to n.BOUNDBOX.
• If MINDIST >= s_hi then it is impossible that any point in n can be within range of the query, so simply return 0.
• Let MAXDIST := the furthest distance from q to n.BOUNDBOX.
• If MAXDIST < s_hi then every point in n must be within range of the query, so simply return n.NUMPOINTS.
• Else, if n is a leaf node, we must iterate through all the datapoints in its leaf list. Start a counter at zero. For each point, find whether it is within distance s_hi of q; if so, increment the counter by one. Return the count once the full list has been scanned.
• Else, n is not a leaf node. Then:
  - Let C_left := RangeCount(n.LEFT, q, s_hi)
  - Let C_right := RangeCount(n.RIGHT, q, s_hi)
  - Return C_left + C_right.

The same query that gave the poor range-search performance in Figure 2b gives good performance in Figure 3. The difference is that a second type of pruning of the search is possible: if the hyper-rectangle surrounding n is either entirely outside or entirely inside the range, then we prune.

[Figure 3: When doing a range count, nodes entirely within range can also be pruned and added to the total count. This additional pruning adds significant speed-ups over the slower range count discussed in Figure 2b. Now one only spends time studying nodes on the boundary of the range.]

4 Fast N-point Correlation Functions

4.1 The Single Tree Approach to Two-Point Computation

It is easy to see that the 2-point correlation function is simply a repeated set of range counts. For example, given a minimum and maximum separation s_lo and s_hi, we run the following algorithm:

• SingleTree2Point(X, n, s_lo, s_hi). Input X is a dataset, represented as a matrix in which the kth row corresponds to the kth datapoint; X has R rows and k columns. Input n is the root of a kd-tree built from the data in X. Output is an integer: the number of pairs of points (x_i, x_j) such that s_lo <= |x_i - x_j| < s_hi.
• C := 0
• For i between 1 and R do:
  - C := C + RangeCount(n, x_i, s_hi) - RangeCount(n, x_i, s_lo)

Note that in practice we do not use two range counts at each iteration, but one slightly more complex range-count operation, RangeCountBetweenSeparations(n, q, s_lo, s_hi), that directly counts the number of points whose distance from q is between s_lo and s_hi.
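Continuing the hypothetical Python sketches (ours; `min_dist` is defined in the previous sketch, and `max_dist` is our name for the furthest point-to-box distance), RangeCount and the single-tree 2-point loop become:

```python
def max_dist(node, q):
    """Furthest distance from query point q to node's bounding box."""
    d = np.maximum(np.abs(q - node.lo), np.abs(q - node.hi))
    return np.linalg.norm(d)

def range_count(node, q, s_hi):
    """Count the points in node within distance s_hi of q,
    pruning whole nodes via the cached num_points."""
    if min_dist(node, q) >= s_hi:
        return 0                       # exclusion: node entirely out of range
    if max_dist(node, q) < s_hi:
        return node.num_points         # subsumption: node entirely in range
    if node.left is None:              # leaf: scan the point list
        return sum(np.linalg.norm(x - q) < s_hi for x in node.points)
    return (range_count(node.left, q, s_hi)
            + range_count(node.right, q, s_hi))

def single_tree_2pt(X, root, s_lo, s_hi):
    """SingleTree2Point: for each row x of X, count points whose
    distance from x lies in [s_lo, s_hi)."""
    return sum(range_count(root, x, s_hi) - range_count(root, x, s_lo)
               for x in X)
```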
4.2 The Dual Tree Approach to Two-Point Computation

The previous algorithm iterates over all datapoints, issuing a range-count operation for each. We can save further time by turning that outer iteration into an additional kd-tree search. The new search will be a recursive procedure that takes two nodes, n_a and n_b, as arguments. The goal will be to compute the number of pairs of points (x, y) such that x ∈ n_a, y ∈ n_b, and s_lo <= |x - y| < s_hi.

• DualTreeCount(n_a, n_b, s_lo, s_hi) returns an integer: the number of pairs of points (x, y) such that x ∈ n_a, y ∈ n_b, and s_lo <= |x - y| < s_hi.
• Let MINDIST := the closest distance between n_a.BOUNDBOX and n_b.BOUNDBOX.
• If MINDIST >= s_hi then it is impossible that any pair of points can match. So simply return 0.
• Let MAXDIST := the furthest distance between n_a.BOUNDBOX and n_b.BOUNDBOX.
• If MAXDIST < s_lo then it is again impossible that any pair of points can match. So simply return 0.
• If s_lo <= MINDIST <= MAXDIST < s_hi then all pairs of points must match. Use n_a.NUMPOINTS and n_b.NUMPOINTS to compute the number of resulting pairs, n_a.NUMPOINTS × n_b.NUMPOINTS, and return that value.
• Else, if n_a and n_b are both leaf nodes, we must iterate through all pairs of datapoints in their leaf lists. Return the resulting (slowly computed) count.
• Else at least one of the two nodes is a non-leaf. Pick the non-leaf with the largest number of points (breaking ties arbitrarily), and call it n*. Call the other node n'. Then:
  - Let C_left := DualTreeCount(n*.LEFT, n', s_lo, s_hi)
  - Let C_right := DualTreeCount(n*.RIGHT, n', s_lo, s_hi)
  - Return C_left + C_right.

Computing a 2-point function on a dataset X then simply consists of computing the value C = DualTreeCount(n_root, n_root, s_lo, s_hi), where n_root is a kd-tree built from X, for a range of bins with minimum and maximum boundaries s_lo and s_hi. We note here that for the 2-point correlation function the quantity of interest is not simply C, but C/2 (the number of unique pairs of objects). A further speed-up can be obtained by simultaneously computing DualTreeCount(n_root, n_root, s_lo, s_hi) over a series of bins. We will discuss this in further detail in Connolly et al. (2001).
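A sketch of this dual-tree recursion in the same hypothetical Python setting (ours; `box_min_dist` and `box_max_dist` are our names for the box-to-box distance bounds, and `KDNode` is the class sketched earlier):

```python
from itertools import product

def box_min_dist(a, b):
    """Closest distance between the bounding boxes of nodes a and b."""
    gap = np.maximum(a.lo - b.hi, 0) + np.maximum(b.lo - a.hi, 0)
    return np.linalg.norm(gap)

def box_max_dist(a, b):
    """Furthest distance between the bounding boxes of nodes a and b."""
    return np.linalg.norm(np.maximum(a.hi - b.lo, b.hi - a.lo))

def dual_tree_count(na, nb, s_lo, s_hi):
    """Count ordered pairs (x in na, y in nb) with s_lo <= |x - y| < s_hi."""
    lo, hi = box_min_dist(na, nb), box_max_dist(na, nb)
    if lo >= s_hi or hi < s_lo:
        return 0                              # exclusion: no pair can match
    if s_lo <= lo and hi < s_hi:
        return na.num_points * nb.num_points  # subsumption: every pair matches
    if na.left is None and nb.left is None:   # both leaves: brute force
        return sum(s_lo <= np.linalg.norm(x - y) < s_hi
                   for x, y in product(na.points, nb.points))
    # recurse by splitting the non-leaf node that owns the most points
    if na.left is None or (nb.left is not None and nb.num_points > na.num_points):
        na, nb = nb, na
    return (dual_tree_count(na.left, nb, s_lo, s_hi)
            + dual_tree_count(na.right, nb, s_lo, s_hi))
```

Calling `dual_tree_count(root, root, s_lo, s_hi)` counts every ordered pair, so the number of unique pairs is, as noted above, this value divided by two.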
4.3 Redundancy Elimination

So far, we have discussed two operations, exclusion and subsumption, which remove the need to traverse the whole tree, thus speeding up the computation of the correlation function. Another form of pruning is to eliminate node-node comparisons which have already been performed in the reverse order. This can be done simply by (virtually) ranking the datapoints according to their position in a depth-first traversal of the tree, then recording for each node the minimum and maximum ranks of the points it owns, and pruning whenever n_a's maximum rank is less than n_b's minimum rank. This is useful for all-pairs problems, but will later be seen to be essential for all-k-tuples problems. This kind of pruning is not practical for single-tree search.

4.4 Multiple Trees Approach to N-Point Computation

The advantages of Dual-Tree over Single-Tree are so far twofold: Dual-Tree can be faster, and it can exploit redundancy elimination. But two more advantages remain. First, we can extend the "2-tree for 2-point" method up to "N-trees for N-point". Second (discussed in Section 5.1), we can perform effective approximation with Dual-Trees (or N-trees). We now discuss the first of these advantages.

The N-point computation is parameterized by two n × n symmetric matrices, L and H. We wish to compute

$\sum_{i_1=1}^{R} \sum_{i_2=1}^{R} \cdots \sum_{i_n=1}^{R} I(L, H, i_1, i_2, \ldots, i_n)$ (4)

where $I(L, H, i_1, i_2, \ldots, i_n)$ is zero unless the following conditions hold (in which case it takes the value 1):

$\forall\, 1 \le i < j \le n, \quad L(i,j) \le |x_{i_i} - x_{i_j}| < H(i,j)$ (5)

We will achieve this by calling a recursive function FastNPoint on an n-tuple of kd-tree nodes $(n_1, n_2, \ldots, n_n)$. This recursive function must return

$\sum_{i_1 \in n_1} \sum_{i_2 \in n_2} \cdots \sum_{i_n \in n_n} I(L, H, i_1, i_2, \ldots, i_n)$ (6)

• FastNPoint(n_1, n_2, ..., n_n)
• Let ALLSUBSUMED := TRUE
• For i = 1 to n do:
  - For j = i + 1 to n do:
    * Let MINDIST := the closest distance between n_i.BOUNDBOX and n_j.BOUNDBOX.
    * If MINDIST >= H(i,j) then it is impossible that any n-tuple of points can match, because the distance between the ith and jth points in any such n-tuple must be out of range. So simply return 0.
    * Let MAXDIST := the furthest distance between n_i.BOUNDBOX and n_j.BOUNDBOX.
    * If MAXDIST < L(i,j) then similarly return 0.
    * If L(i,j) <= MINDIST <= MAXDIST < H(i,j) then every n-tuple has the property that its ith member and jth member match. We are interested in whether this is true for all (i,j) pairs, and so the first time we are disappointed (by discovering that the above expression does not hold) we update the ALLSUBSUMED flag. Thus the actual computation at this step is: if L(i,j) > MINDIST or MAXDIST >= H(i,j) then ALLSUBSUMED := FALSE.
• If ALLSUBSUMED has remained true throughout the above double loop, we can be sure that every n-tuple derived from the nodes in the recursive call must match, and so we can simply return

$\prod_{i=1}^{n} n_i.\mathrm{NUMPOINTS}$ (7)

• Else, if all of n_1, n_2, ..., n_n are leaf nodes, we must iterate through all n-tuples of datapoints in their leaf lists. Return the resulting (slowly computed) count.
• Else at least one of the nodes is a non-leaf. Pick the non-leaf with the largest number of points (breaking ties arbitrarily), and assume it has index i = i*. Then:
  - Let C_left := FastNPoint(n_1, ..., n_{i*}.LEFT, ..., n_n)
  - Let C_right := FastNPoint(n_1, ..., n_{i*}.RIGHT, ..., n_n)
  - Return C_left + C_right.

The full N-point computation is achieved by calling FastNPoint with arguments consisting of an n-tuple of copies of the root node.

We should note once again that it is possible to save considerable amounts of computation by eliminating redundancy. For example, in the 4-point statistic, the above implementation will recount each matching 4-tuple of points (x, y, z, w) in 24 different ways: once for each of the 4! permutations of (x, y, z, w). Again, this excess cost can be avoided by ordering the datapoints via a depth-first tree indexing scheme and then pruning any n-tuple of nodes violating that order. But the reader should be aware of an extremely messy problem regarding how much to add to the count in the case that a subsume type of pruning takes place. If all nodes own independent sets of points the answer is simple: the product of the node counts. If all nodes are the same then the answer is again simple: $\binom{u}{n}$, where u is the number of points in the node. Somewhat more subtle combinatorics are needed in the case where some nodes in the n-tuple are identical and others are not. And fearsome computation is needed in the various cases in which some nodes are descendants of other nodes.
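To make the n-tuple recursion concrete, here is a Python sketch of this FastNPoint structure (ours; it reuses the `box_min_dist` and `box_max_dist` helpers from the previous sketch and omits redundancy elimination, so matching tuples are counted once per permutation, as discussed above):

```python
import math
from itertools import product

def fast_n_point(nodes, L, H):
    """Count n-tuples, one point from each node in the list `nodes`,
    with L[i][j] <= |x_i - x_j| < H[i][j] for all i < j.
    Matching tuples are counted once per permutation (see text)."""
    n = len(nodes)
    all_subsumed = True
    for i in range(n):
        for j in range(i + 1, n):
            lo = box_min_dist(nodes[i], nodes[j])
            hi = box_max_dist(nodes[i], nodes[j])
            if lo >= H[i][j] or hi < L[i][j]:
                return 0                      # exclusion: pair (i, j) can never match
            if L[i][j] > lo or hi >= H[i][j]:
                all_subsumed = False          # pair (i, j) not fully subsumed
    if all_subsumed:                          # every n-tuple matches
        return math.prod(nd.num_points for nd in nodes)
    if all(nd.left is None for nd in nodes):  # all leaves: brute force
        return sum(all(L[i][j] <= np.linalg.norm(tup[i] - tup[j]) < H[i][j]
                       for i in range(n) for j in range(i + 1, n))
                   for tup in product(*(nd.points for nd in nodes)))
    # split the non-leaf node that owns the most points
    i_star = max((i for i, nd in enumerate(nodes) if nd.left is not None),
                 key=lambda i: nodes[i].num_points)
    left = nodes[:i_star] + [nodes[i_star].left] + nodes[i_star + 1:]
    right = nodes[:i_star] + [nodes[i_star].right] + nodes[i_star + 1:]
    return fast_n_point(left, L, H) + fast_n_point(right, L, H)
```

The full computation is then `fast_n_point([root] * n, L, H)`, with L and H supplied as n × n arrays of bin boundaries.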
5 Controlled Approximation

In general, when the final answer comes back from FastNPoint, the majority of the quantity in the count will be the sum of components arising from large subsume prunes. But the majority of the computational effort will have been spent on accounting for the vast number of small but unprunable combinations of nodes. We can improve the running time of the algorithm by demanding that it also prunes its search in cases in which only a tiny count of n-tuples is at stake.

This is achieved by adding a parameter, T, to the FastNPoint algorithm, and adding the following lines at the start:

• Let $C_{max} := \prod_{i=1}^{n} n_i.\mathrm{NUMPOINTS}$
• If $C_{max} < T$ then quit this recursive call.

This will clearly cause an inaccurate result, but fortunately it is not hard to maintain tight lower and upper bounds on what the true answer would have been if the approximation had not been made. Thus FastNPoint(n_1, n_2, ..., n_n, T) now returns a pair of counts $(C_{lo}, C_{hi})$, where we can guarantee that the true count C lies in the range $C_{lo} \le C \le C_{hi}$.

5.1 Iterative Deepening for Controlled Approximation

Suppose the true value of the N-point function is C but that we are prepared to accept a fractional error of ε: we will be happy with any value $C_{approx}$ such
