Representation of big data by dimension reduction A.G.Ramm, C. Van Department of Mathematics, Kansas State University, Manhattan, KS 66506, USA 7 1 [email protected]; [email protected] 0 2 n a Abstract J 1 Suppose the data consist of a set S of points xj,1 ≤ j ≤ J, distributed in a 3 bounded domain D ⊂ RN, where N and J are large numbers. In this paper an algorithmisproposedforcheckingwhetherthereexistsamanifoldMoflowdimension ] T nearwhichmanyofthepointsofS lieandfindingsuchMifitexists. Therearemany I . dimension reduction algorithms, both linear and non-linear. Our algorithm is simple s c toimplementandhassomeadvantagescomparedwiththeknownalgorithms. Ifthere [ is a manifold of low dimension near which most of the data points lie, the proposed 1 algorithmwillfindit. Somenumericalresults arepresentedillustratingthealgorithm v and analyzing its performance compared to the classical PCA (principal component 7 analysis) and Isomap. 2 0 0 0 1 Introduction . 2 0 There is a large literature on dimension reduction. Without trying to describe the 7 1 published results, we refer the reader to the reviews [3] and [7] and references therein. : Our algorithm is simple, it is easy to implement, and it has some advantages over the v i known algorithms. We compare its performance with PCA and Isomap algorithms. X The main point in this short paper is the new algorithm for computing a manifold M r a of low dimension in a neighborhood of which most of the data points lie, or finding out that there is no such manifold. In section 2, we describe the details of the proposed algorithm. In section 3, we analyze the performance of the algorithm and compare its performance with PCA and Isomap algorithms’ performances. In section 4, we show some numerical results. 2 Description of the algorithm Let S be the set of the data points x ,1 ≤ j ≤ J, x ∈ D ⊂ RN, where D is a unit j j cube and J and N are very large. Divide the unit cube D into a grid with step size 0 < r ≤ 1. Let a = 1 be the number of intervals on each side of the unit cube D. r Let V be the upper limit of the total volume of the small cubes C , L be the lower m limit of the number of the data points near the manifold M, and p be the lower limit of the number of the data points in a small cube C with the side r. By a manifold m in this paper, a union of piecewise-smooth manifolds is meant. A smooth manifold is defined to be a smooth lower-dimensional domain in RN, for example, a line in R3, or a plane in R3. The one-dimensional manifold that we construct is a union of piecewise-linear curves, a linear curve is a line. Let us describe the algorithm, which is a development of an idea in [8]. The steps of this algorithm are: 1. Take a = a , and a = 2. 1 1 2. Take M = aN, where M is the number of the small cubes with the side r. Denote these small cubes by C ,1 ≤ m ≤ M. Scan the data set S by moving m the small cube with the step size r and calculating the number of points in this cube for each position of this cube in D. Each of the points of S lie in some cube C . Let µ be the number of the points of S in C . Neglect the small m m m cubes with µ < p, where p is the chosen lower limit of the number of the data m points in a small cube C . One chooses p (cid:28) |S|, for example, p = 0.005|S|, m where |S| is the number of the data points in S. Let V be the total volume of t the cubes that we keep. 3. If V = 0, we conclude that there is no manifold found. t (cid:112) 4. If V > V, we set a = a = 2a . If a > N |S|, we conclude that there is no t 2 1 2 manifold found. Otherwise, repeat steps 2 and 3. 5. If V < V, then denote by C ,1 ≤ k ≤ K (cid:28) M, the small cubes that we keep t k (µ ≥ p). Denote the centers of C by c and let C := ∪K C . k k k k=1 k 6. If the total number of the points in C is less than L, where L is the chosen lower limit of the number of the data points near the manifold M, then we conclude that there is no manifold found. 7. Otherwise, we have a set of small cubes C ,1 ≤ k ≤ K (cid:28) M with the side r, k which has total volume less than V and the number of the data points ≥ L. Given the set of small cubes C ,1 ≤ k ≤ K, one can build the sets L of dimension k s s (cid:28) N, in a neighborhood of which maximal amount of points x ∈ S lie. For ex- j ample, to build L , one can start with the point y := c where C is the small cube 1 1 1 1 closest to the origin, and join y with y = c where C is the small cube closest to 1 2 2 2 C . Continuing joining K small cubes, one gets a one-dimensional piecewise-linear 1 manifold L . To build L , one considers the triangles T with vertices y ,y ,y , 1 2 k k k+1 k+2 1 ≤ k ≤ K−2. The union of T forms a two-dimensional manifold L . Similarly, one k 2 can build s-dimensional manifold L from the set of C . s k While building manifolds L , one can construct several such manifolds because the s closest to C cube C may be non-unique. However, each of the constructed man- i−1 i ifolds contains the small cubes C , which have totally at least L data points. For k example, for L = 0.9|S| each of these manifolds contains at least 0.9|S| data points. Of course, there could be a manifold containing 0.99|S| and another one containing 0.9|S|. But if the goal is to include as many data points as possible, the experimenter can increase L, for example, to 0.95|S|. 2 Choice of V,L, and p The idea of the algorithm is to find a region containing most of the data points ( ≥ L ) that has small volume ( ≤ V ) by neglecting the small cubes C that has < p data k points. Depending on how sparse the data set is, one can choose an appropriate L. If the data set has a small number of outliers ( an outlier is an observation point that is distant from other observations), one can choose L = 0.9|S|, as in the first numerical experiment in Section 3. If the data set is sparse, one can choose L = 0.8|S|, as in the second numerical experiment in Section 3. In general, when the experimenter does not know how sparse the data are, one can try L to be 0.95|S|, 0.9|S| or 0.8|S|. One should not choose V too big, for example ≥ 0.5, because it does not make sense to have a manifold with big volume. One should not choose V too small because then either one cannot find a manifold or the number of small cubes C is the same as the k number of data points. So, the authors suggest V to be between 0.3 and 0.4 of the volume of the original unit cube where the data points are. The value p is used to decide whether one should neglect the small cubes C . So, it k is recommended to choose p << |S|. Numerical experiments show that p = 0.005|S| works well. The experimenter can also try smaller values of p but then more compu- tation time is needed. 3 Performance analysis (cid:112) In the algorithm one doubles a each time and 2 ≤ a ≤ N |S|. So the main step runs at most log N(cid:112)|S| or 1 log|S| times. For each a one calculates the number N of points of the data set S in each of M small cubes. The total computational a time is |S|M 1 log|S|. In N-dimensional space, calculating the distance between a N two points takes N operations. Thus, the asymptotic speed of the algorithm is O(cid:0)N |S|M 1 log|S|(cid:1) or O(|S|M log|S|). Since M ≤ |S|, in the worst case, the a N a a asymptotic speed is O(|S|2 log|S|). If one compares this algorithm with the principal component analysis (PCA) algo- rithm,whichhasasymptoticspeedO(|S|3),oneseesthatouralgorithmismuchfaster than the PCA algorithm. PCA theory is presented in [5]. In paper [1] the fastest algorithm for finding a manifold of lower dimension contain- ing many points of S, called Isomap, has the asymptotic speed of O(sl|S|log|S|), where s is the assumed dimension of the manifold and l is the number of landmarks. However, in the Isomap algorithm one has to use a priori assumed dimension of the manifold, which is not known a priori, while our algorithm finds the manifold and its dimension. Also, in the Isomap algorithm one has to specify the landmarks, which can be arbitrarily located in the domain D. Our algorithm is simpler to implement than the Isomap algorithm. For large |S|, one can make our algorithm faster by putting the upper limit for a. For (cid:112) (cid:112) example, instead of 1 ≤ a ≤ N |S|, one can require 1 ≤ a ≤ 2N |S|. 3 4 Numerical results In the following numerical experiments, we use different data sets from paper [2] and [6]. We apply our proposed algorithm above to each data set and get the results as shown in the pictures. The first picture of each data set shows the original data points (as blue points) and the found small cubes (as red points). The second picture of each data set shows the found manifold. 1. Consider the data set from paper [2]. This is a 2-dimensional set containing |S| = 308 points in a 2-dimensional Cartesian coordinate system. Below we take the integer value for p nearest to p and larger than p. /2/2 Figure 1: a data set from paper [2] with |S| = 308 and p = 0.005|S| = 1.56 4 /2/2 Figure 2: a data set from paper [2] with |S| = 308 and p = 0.005|S| = 1.56 Run the algorithm with V = 0.5 (i.e. the final set of C has to have total k volume less than 0.5), L = 0.9×|S| (the manifold has to contains at least 90% of the initial data points), and p = 0.005|S| = 1.56 (we take p = 2 and neglect the small cubes that have ≤ 1 point). One finds the manifold L when a = 16, 1 r = 1/16, the number of small cubes C is M = 89, which has total volume of k 0.35 and contains 0.92|S| data points. 2. Consider another data set from paper [2]. This set has |S| = 296 data points. 5 /2/2 Figure 3: another data set from paper [2] with |S| = 296 and p = 0.005|S| = 1.5 6 /2/2 Figure 4: another data set from paper [2] with |S| = 296 and p = 0.005|S| = 1.5 Run the algorithm with V = 0.5 (i.e. the final set of C has to have total k volume less than 0.5), L = 0.8×|S| (since some part of the data is uniformly distributed), and p = 0.005|S| = 1.5 (we take p = 2 and neglect the small cubes that have ≤ 1 point). One finds the manifold L when a = 16, r = 1/16, the 1 numberofsmallcubesC isM = 76, whichhastotalvolumeof0.3andcontains k 0.83|S| data points. 3. Usethedatasetfrompaper[2]asinthesecondexperiment. Inthisexperiment, p = 4 was used. This is to show that one can try different values of p (besides the suggested one p = 0.005|S|) and may find a manifold with a smaller number of small cubes. 7 /2/2 Figure 5: another data set from paper [2] with |S| = 296 and p = 4 8 /2/2 Figure 6: another data set from paper [2] with |S| = 296 and p = 4 Choosep = 4,V = 0.5andL = 0.8×|S|. OnefindsthemanifoldL whena = 8, 1 r = 1/8, the number of small cubes C is M = 30 ( less than 76 in experiment k 2), which has total volume of 0.47 and the total number of points 0.89|S|. The conclusionis: onecantrydifferentvaluesforpandcanhavedifferentmanifolds. 4. Consider a data set from paper [4] (Figure 7). 9 Figure 7: a data set from paper [4] with |S| = 235 Run the algorithm with V = 0.5, L = 0.8×|S| and let p run from 1 to 10. For any p, one cannot find a set of C which has total volume ≤ V and contains k at least 0.8|S| data points. This data set does not have a manifold of lower dimension in a neighborhood of which most of the data points lie. 5. Consider a data set from paper [6]. This set has |S| = 373 data points. 10