Data Mining via Support Vector Machines

O. L. Mangasarian
Computer Sciences Department
University of Wisconsin
1210 West Dayton Street
Madison, WI 53706
[email protected]

Abstract

Support vector machines (SVMs) have played a key role in broad classes of problems arising in various fields. Much more recently, SVMs have become the tool of choice for problems arising in data classification and mining. This paper emphasizes some recent developments that the author and his colleagues have contributed to, such as: generalized SVMs (a very general mathematical programming framework for SVMs), smooth SVMs (a smooth nonlinear equation representation of SVMs solvable by a fast Newton method), Lagrangian SVMs (an unconstrained Lagrangian representation of SVMs leading to an extremely simple iterative scheme capable of solving classification problems with millions of points) and reduced SVMs (a rectangular kernel classifier that utilizes as little as 1% of the data).

1 Introduction

This paper describes four recent developments, one theoretical and three algorithmic, all centered on support vector machines (SVMs). SVMs have become the tool of choice for the fundamental classification problem of machine learning and data mining. We briefly outline these four developments now.

In Section 2 new formulations for SVMs are given as convex mathematical programs, often quadratic or linear programs. By setting apart the two functions of a support vector machine, namely separation of points by a nonlinear surface in the original space of patterns and maximization of the distance between separating planes in a higher dimensional space, we are able to define indefinite, possibly discontinuous kernels, not necessarily inner product ones, that generate highly nonlinear separating surfaces. Maximizing the distance between the separating planes in the higher dimensional space is surrogated by support vector suppression, which is achieved by minimizing any desired norm of the support vector multipliers. The norm may be one induced by the separation kernel if it happens to be positive definite, or a Euclidean or a polyhedral norm. The latter norm leads to a linear program whereas the former norms lead to convex quadratic programs, all with an arbitrary separation kernel. A standard support vector machine can be recovered by using the same kernel for separation and for support vector suppression.

In Section 3 we apply smoothing methods, extensively used for solving important mathematical programming problems and applications, to generate and solve an unconstrained smooth reformulation of the support vector machine for pattern classification using a completely arbitrary kernel. We term such a reformulation a smooth support vector machine (SSVM). A fast Newton-Armijo algorithm for solving the SSVM converges globally and quadratically. Numerical results and comparisons demonstrate the effectiveness and speed of the algorithm. For example, on six publicly available datasets, SSVM achieved the highest tenfold cross-validation correctness of the five methods compared, and was also the fastest.

In Section 4 an implicit Lagrangian for the dual of a simple reformulation of the standard quadratic program of a linear support vector machine is proposed. This leads to the minimization of an unconstrained differentiable convex function in a space of dimensionality equal to the number of classified points. This problem is solvable by an extremely simple linearly convergent Lagrangian support vector machine (LSVM) algorithm.
LSVM requires the inversion, at the outset, of a single matrix of the order of the much smaller dimensionality of the original input space plus one. The full algorithm is given in this paper in 11 lines of MATLAB code without any special optimization tools such as linear or quadratic programming solvers. This LSVM code can be used "as is" to solve classification problems with millions of points.

In Section 5 an algorithm is proposed which generates a nonlinear kernel-based separating surface that requires as little as 1% of a large dataset for its explicit evaluation. To generate this nonlinear surface, the entire dataset is used as a constraint in an optimization problem with very few variables corresponding to the 1% of the data kept. The remainder of the data can be thrown away after solving the optimization problem. This is achieved by making use of a rectangular $m\times\bar m$ kernel $K(A,\bar A')$ that greatly reduces the size of the quadratic program to be solved and simplifies the characterization of the nonlinear separating surface. Here, the $m$ rows of $A$ represent the original $m$ data points while the $\bar m$ rows of $\bar A$ represent a greatly reduced set of $\bar m$ data points. Computational results indicate that test set correctness for the reduced support vector machine (RSVM), with a nonlinear separating surface that depends on a small randomly selected portion of the dataset, is better than that of a conventional support vector machine (SVM) with a nonlinear surface that explicitly depends on the entire dataset, and much better than a conventional SVM using a small random sample of the data. Computational time, as well as memory usage, is much smaller for RSVM than for a conventional SVM using the entire dataset.

A word about our notation. All vectors are column vectors unless transposed to a row vector by a prime superscript $'$. For a vector $x$ in the $n$-dimensional real space $R^n$, the plus function $x_+$ is defined as $(x_+)_i = \max\{0, x_i\}$, $i = 1,\ldots,n$, while $x_*$ denotes the step function defined as $(x_*)_i = 1$ if $x_i > 0$ and $(x_*)_i = 0$ if $x_i \le 0$, $i = 1,\ldots,n$. The scalar (inner) product of two vectors $x$ and $y$ in $R^n$ will be denoted by $x'y$ and the $p$-norm of $x$ will be denoted by $\|x\|_p$. If $x'y = 0$ we then write $x \perp y$. For a matrix $A \in R^{m\times n}$, $A_i$ is the $i$th row of $A$, which is a row vector in $R^n$. A column vector of ones of arbitrary dimension will be denoted by $e$. For $A \in R^{m\times n}$ and $B \in R^{n\times l}$, the kernel $K(A,B)$ maps $R^{m\times n}\times R^{n\times l}$ into $R^{m\times l}$. In particular, if $x$ and $y$ are column vectors in $R^n$, then $K(x',y)$ is a real number, $K(x',A')$ is a row vector in $R^m$ and $K(A,A')$ is an $m\times m$ matrix. If $f$ is a real valued function defined on $R^n$, the gradient of $f$ at $x$ is denoted by $\nabla f(x)$, which is a row vector in $R^n$, and the $n\times n$ Hessian matrix of second partial derivatives of $f$ at $x$ is denoted by $\nabla^2 f(x)$. The base of the natural logarithm will be denoted by $\varepsilon$.

2 The Generalized Support Vector Machine (GSVM) [25]

We consider the problem of classifying $m$ points in the $n$-dimensional real space $R^n$, represented by the $m\times n$ matrix $A$, according to membership of each point $A_i$ in the classes $+1$ or $-1$ as specified by a given $m\times m$ diagonal matrix $D$ with ones or minus ones along its diagonal. For this problem the standard support vector machine with a linear kernel $AA'$ [38, 11] is given by the following for some $\nu > 0$:

$$
\min_{(w,\gamma,y)\in R^{n+1+m}} \ \nu e'y + \tfrac{1}{2}w'w
\quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0. \tag{1}
$$
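As a minimal, hedged sketch (not from the paper), problem (1) can be handed directly to a generic convex optimization modeler. The data matrix `A`, the label vector `d` holding the diagonal of $D$, and the weight `nu` below are assumed inputs, and the cvxpy package is just one possible solver interface.

```python
import numpy as np
import cvxpy as cp

def linear_svm(A, d, nu=1.0):
    """Hedged sketch of the standard linear SVM (1):
    minimize nu*e'y + 0.5*w'w  s.t.  D(Aw - e*gamma) + y >= e, y >= 0."""
    m, n = A.shape
    w = cp.Variable(n)        # normal to the bounding planes
    gamma = cp.Variable()     # location of the planes relative to the origin
    y = cp.Variable(m)        # nonnegative slack variable
    constraints = [cp.multiply(d, A @ w - gamma) + y >= 1, y >= 0]
    objective = cp.Minimize(nu * cp.sum(y) + 0.5 * cp.sum_squares(w))
    cp.Problem(objective, constraints).solve()
    return w.value, gamma.value, y.value

# Tiny synthetic example (assumed data, for illustration only).
rng = np.random.default_rng(0)
A = np.vstack([rng.normal(2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
d = np.hstack([np.ones(20), -np.ones(20)])
w, gamma, y = linear_svm(A, d, nu=1.0)
```

Any quadratic programming solver would serve equally well here; it is this formulation that Section 2 generalizes.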
Here $w$ is the normal to the bounding planes:

$$
x'w - \gamma = +1, \qquad x'w - \gamma = -1, \tag{2}
$$

and $\gamma$ determines their location relative to the origin. The first plane above bounds the class $+1$ points and the second plane bounds the class $-1$ points when the two classes are strictly linearly separable, that is, when the slack variable $y = 0$. The linear separating surface is the plane

$$
x'w = \gamma, \tag{3}
$$

midway between the bounding planes (2). See Figure 1. If the classes are linearly inseparable then the two planes bound the two classes with a "soft margin" determined by a nonnegative slack variable $y$, that is:

$$
\begin{array}{ll}
x'w - \gamma + y_i \ge +1, & \text{for } x' = A_i \text{ and } D_{ii} = +1,\\
x'w - \gamma - y_i \le -1, & \text{for } x' = A_i \text{ and } D_{ii} = -1.
\end{array} \tag{4}
$$

The 1-norm of the slack variable $y$ is minimized with weight $\nu$ in (1). The quadratic term in (1), which is twice the reciprocal of the square of the 2-norm distance $\frac{2}{\|w\|_2}$ between the two bounding planes of (2) in the $n$-dimensional space of $w \in R^n$ for a fixed $\gamma$, maximizes that distance, often called the "margin". Figure 1 depicts the points represented by $A$, the bounding planes (2) with margin $\frac{2}{\|w\|_2}$, and the separating plane (3) which separates $A+$, the points represented by rows of $A$ with $D_{ii} = +1$, from $A-$, the points represented by rows of $A$ with $D_{ii} = -1$.

[Figure 1: The bounding planes (2) with margin $\frac{2}{\|w\|_2}$, and the plane (3) separating $A+$, the points represented by rows of $A$ with $D_{ii} = +1$, from $A-$, the points represented by rows of $A$ with $D_{ii} = -1$.]

In the GSVM formulation we attempt to discriminate between the classes $+1$ and $-1$ by a nonlinear separating surface which subsumes the linear separating surface (3), and is induced by some kernel $K(A,A')$, as follows:

$$
K(x',A')Du = \gamma, \tag{5}
$$

where $K(x',A') \in R^m$, e.g. $K(x',A') = x'A'$ for the linear separating surface (3), with $w = A'Du$. The parameters $u \in R^m$ and $\gamma \in R$ are determined by solving a mathematical program, typically quadratic or linear. In special cases, such as the standard SVM (13) below, $u$ can be interpreted as a dual variable. A point $x \in R^n$ is classified in class $+1$ or $-1$ according to whether the decision function

$$
(K(x',A')Du - \gamma)_*, \tag{6}
$$

yields 1 or 0 respectively. Here $(\cdot)_*$ denotes the step function defined in the Introduction. The kernel function $K(x',A')$ implicitly defines a nonlinear map from $x \in R^n$ to some other space $z \in R^k$, where $k$ may be much larger than $n$. In particular, if the kernel $K$ is an inner product kernel under Mercer's condition [13, pp. 138-140], [38, 11, 5] (an assumption that we will not make in this paper), then for $x$ and $y$ in $R^n$:

$$
K(x,y) = h(x)'h(y), \tag{7}
$$

and the separating surface (5) becomes:

$$
h(x)'h(A')Du = \gamma, \tag{8}
$$

where $h$ is a function, not easily computable, from $R^n$ to $R^k$, and $h(A') \in R^{k\times m}$ results from applying $h$ to the $m$ columns of $A'$. The difficulty of computing $h$ and the possibly high dimensionality of $R^k$ have been important factors in using a kernel $K$ as a generator of a separating surface that is implicitly nonlinear in the original feature space $R^n$ but linear in the high dimensional space $R^k$. Our separating surface (5), written in terms of a kernel function, retains this advantage and is linear in its parameters $u,\gamma$.
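To make the decision function (6) concrete, here is a small hedged sketch in numpy, assuming a Gaussian kernel and parameters `u` and `gamma` already obtained from a mathematical program such as (9) below; the names and the kernel choice are assumptions, not prescriptions of the paper.

```python
import numpy as np

def gaussian_kernel(X, Y, mu=0.1):
    """An assumed kernel choice: K(X, Y')_ij = exp(-mu * ||X_i - Y_j||^2)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-mu * sq)

def classify(x, A, d, u, gamma, kernel=gaussian_kernel):
    """Decision function (6): the step of K(x', A') D u - gamma.
    Returns 1 for class +1 and 0 for class -1."""
    Kx = kernel(np.atleast_2d(x), A)     # row vector K(x', A') in R^m
    value = Kx @ (d * u) - gamma         # D u computed elementwise as d * u
    return (value > 0).astype(int)
```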
We now state a mathematical program that generates such a surface for a general kernel $K$ as follows:

$$
\min_{u,\gamma,y} \ \nu e'y + f(u)
\quad \text{s.t.} \quad D(K(A,A')Du - e\gamma) + y \ge e, \quad y \ge 0. \tag{9}
$$

Here $f$ is some convex function on $R^m$, typically some norm or seminorm, and $\nu$ is some positive parameter that weights the separation error $e'y$ versus suppression of the separating surface parameter $u$. Suppression of $u$ can be interpreted in one of two ways. We interpret it here as minimizing the number of support vectors, i.e. constraints of (9) with positive multipliers. A more conventional interpretation is that of maximizing some measure of the distance or margin between the bounding parallel planes in $R^k$, under appropriate assumptions, such as $f$ being a quadratic function induced by a positive definite kernel $K$ as in (13) below. As is well known, this leads to improved generalization by minimizing an upper bound on the VC dimension [38, 35]. We term a solution of the mathematical program (9) and the resulting decision function (6) a generalized support vector machine, GSVM. In what follows we derive a number of special cases, including the standard support vector machine.

We consider first support vector machines that include the standard ones [38, 11, 5] and which are obtained by setting $f$ of (9) to be a convex quadratic function $f(u) = \frac{1}{2}u'Hu$, where $H \in R^{m\times m}$ is some symmetric positive definite matrix. The mathematical program (9) becomes the following convex quadratic program:

$$
\min_{u,\gamma,y} \ \nu e'y + \tfrac{1}{2}u'Hu
\quad \text{s.t.} \quad D(K(A,A')Du - e\gamma) + y \ge e, \quad y \ge 0. \tag{10}
$$

The Wolfe dual [39, 22] of this convex quadratic program is:

$$
\min_{r\in R^m} \ \tfrac{1}{2}r'DK(A,A')DH^{-1}DK(A,A')'Dr - e'r
\quad \text{s.t.} \quad e'Dr = 0, \quad 0 \le r \le \nu e. \tag{11}
$$

Furthermore, the primal variable $u$ is related to the dual variable $r$ by:

$$
u = H^{-1}DK(A,A')'Dr. \tag{12}
$$

If we assume that the kernel $K(A,A')$ is symmetric positive definite and let $H = DK(A,A')D$, then our dual problem (11) degenerates to the dual problem of the standard support vector machine [38, 11, 5] with $u = r$:

$$
\min_{u\in R^m} \ \tfrac{1}{2}u'DK(A,A')Du - e'u
\quad \text{s.t.} \quad e'Du = 0, \quad 0 \le u \le \nu e. \tag{13}
$$

The positive definiteness assumption on $K(A,A')$ in (13) can be relaxed to positive semidefiniteness while maintaining the convex quadratic program (10), with $H = DK(A,A')D$, as the direct dual of (13) without utilizing (11) and (12). The symmetry and positive semidefiniteness of the kernel $K(A,A')$ for this version of a support vector machine is consistent with the support vector machine literature. The fact that $r = u$ in the dual formulation (13) shows that the variable $u$ appearing in the original formulation (10) is also the dual multiplier vector for the first set of constraints of (10). Hence the quadratic term in the objective function of (10) can be thought of as suppressing as many multipliers of support vectors as possible, thus minimizing the number of such support vectors. This is another (nonstandard) interpretation of the standard support vector machine, which is usually interpreted as maximizing the margin or distance between parallel separating planes. This leads to the idea of using values for the matrix $H$ other than $DK(A,A')D$ that will also suppress $u$. One particular choice, taken up after the sketch below, is interesting because it puts no restrictions on $K$: no symmetry, no positive definiteness or semidefiniteness, and not even continuity.
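Before turning to that choice, the primal-dual relation (12) can be illustrated numerically: for a symmetric positive definite kernel matrix and $H = DK(A,A')D$, recovering $u$ from any dual vector $r$ via (12) simply returns $r$, consistent with $u = r$ in (13). The Gaussian kernel and the random data in this hedged sketch are assumptions made only so that it runs, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 30, 4
A = rng.normal(size=(m, n))
d = np.where(rng.random(m) < 0.5, 1.0, -1.0)
D = np.diag(d)

# Assumed Gaussian kernel; symmetric positive definite for distinct rows of A.
def gaussian_kernel(X, Y, mu=0.5):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-mu * sq)

K = gaussian_kernel(A, A)
H = D @ K @ D                 # the choice that recovers the standard SVM dual (13)
r = rng.random(m)             # any dual vector with 0 <= r <= nu*e

# Relation (12): u = H^{-1} D K(A,A')' D r
u = np.linalg.solve(H, D @ K.T @ D @ r)
print(np.allclose(u, r))      # True: u = r, as stated for (13)
```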
The choice in question is $H = I$ in (10), which leads to the dual problem (11) with $H = I$ and $u = DK(A,A')'Dr$, as follows:

$$
\min_{r\in R^m} \ \tfrac{1}{2}r'DK(A,A')K(A,A')'Dr - e'r
\quad \text{s.t.} \quad e'Dr = 0, \quad 0 \le r \le \nu e. \tag{14}
$$

We note immediately that $K(A,A')K(A,A')'$ is positive semidefinite with no assumptions on $K(A,A')$, and hence the above problem is an always solvable convex quadratic program for any kernel $K(A,A')$. In fact, by the Frank-Wolfe existence theorem [15], the quadratic program (10) is solvable for any symmetric positive definite matrix $H$ because its objective function is bounded below by zero. Hence, by quadratic programming duality, its dual problem (11) is also solvable. Any solution of (10) can be used to generate a nonlinear decision function (6). Thus we are free to choose any symmetric positive definite matrix $H$ to generate a support vector machine. Experimentation will be needed to determine the most appropriate choices for $H$.

By using the 1-norm instead of the 2-norm, a linear programming formulation for the GSVM can be obtained. We refer the interested reader to [25]. We turn our attention now to an efficient method for generating SVMs based on smoothing ideas that have already been used effectively to solve various mathematical programs [7, 8, 6, 9, 10, 16, 37, 12].

3 SSVM: Smooth Support Vector Machines [21]

In our smooth approach, the square of the 2-norm of the slack variable $y$ is minimized with weight $\frac{\nu}{2}$ instead of the 1-norm of $y$ as in (1). In addition, the distance between the planes (2) is measured in the $(n+1)$-dimensional space of $(w,\gamma) \in R^{n+1}$, that is, $\frac{2}{\|(w,\gamma)\|_2}$. Measuring the margin in this $(n+1)$-dimensional space instead of $R^n$ induces strong convexity and has little or no effect on the problem, as was shown in [27, 29, 21, 20]. Thus, using twice the reciprocal of the square of this margin instead yields our modified SVM problem:

$$
\min_{w,\gamma,y} \ \tfrac{\nu}{2}y'y + \tfrac{1}{2}(w'w + \gamma^2)
\quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0. \tag{15}
$$

At the solution of problem (15), $y$ is given by

$$
y = (e - D(Aw - e\gamma))_+, \tag{16}
$$

where, as defined in the Introduction, $(\cdot)_+$ replaces negative components of a vector by zeros. Thus, we can replace $y$ in (15) by $(e - D(Aw - e\gamma))_+$ and convert the SVM problem (15) into an equivalent unconstrained optimization problem as follows:

$$
\min_{w,\gamma} \ \tfrac{\nu}{2}\|(e - D(Aw - e\gamma))_+\|_2^2 + \tfrac{1}{2}(w'w + \gamma^2). \tag{17}
$$

This is a strongly convex minimization problem without any constraints, and it is easy to show that it has a unique solution. However, the objective function in (17) is not twice differentiable, which precludes the use of a fast Newton method. We thus apply the smoothing techniques of [7, 8] and replace $x_+$ by a very accurate smooth approximation [21, Lemma 2.1] given by $p(x,\alpha)$, the integral of the sigmoid function $\frac{1}{1+\varepsilon^{-\alpha x}}$ of neural networks [23], that is:

$$
p(x,\alpha) = x + \frac{1}{\alpha}\log(1 + \varepsilon^{-\alpha x}), \quad \alpha > 0. \tag{18}
$$

This $p$ function with a smoothing parameter $\alpha$ is used here to replace the plus function of (17) to obtain a smooth support vector machine (SSVM):

$$
\min_{(w,\gamma)\in R^{n+1}} \Phi_\alpha(w,\gamma) := \min_{(w,\gamma)\in R^{n+1}} \ \tfrac{\nu}{2}\|p(e - D(Aw - e\gamma),\alpha)\|_2^2 + \tfrac{1}{2}(w'w + \gamma^2). \tag{19}
$$

It can be shown [21, Theorem 2.2] that the solution of problem (15) is obtained by solving problem (19) with $\alpha$ approaching infinity. Advantage can then be taken of the twice differentiable objective function of (19) to utilize a quadratically convergent algorithm for solving the smooth support vector machine (19), stated as Algorithm 3.1 after the sketch below.
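As an informal aid (not from the paper), the smooth plus function (18) and the SSVM objective (19) take only a few lines of numpy; the function and variable names below are assumptions made for this sketch.

```python
import numpy as np

def smooth_plus(x, alpha):
    """p(x, alpha) of (18): a smooth approximation of max(0, x).
    Written via logaddexp for numerical stability at large alpha*x."""
    return np.logaddexp(0.0, alpha * x) / alpha

def ssvm_objective(w, gamma, A, d, nu, alpha):
    """Phi_alpha(w, gamma) of (19) for the linear SSVM."""
    residual = 1.0 - d * (A @ w - gamma)      # e - D(Aw - e*gamma)
    p = smooth_plus(residual, alpha)
    return 0.5 * nu * np.dot(p, p) + 0.5 * (np.dot(w, w) + gamma ** 2)
```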
Algorithm 3.1 Newton-Armijo Algorithm for SSVM (19)

Start with any $(w^0,\gamma^0) \in R^{n+1}$. Having $(w^i,\gamma^i)$, stop if the gradient of the objective function of (19) is zero, that is, $\nabla\Phi_\alpha(w^i,\gamma^i) = 0$. Else compute $(w^{i+1},\gamma^{i+1})$ as follows:

(i) Newton Direction: Determine direction $d^i \in R^{n+1}$ by setting equal to zero the linearization of $\nabla\Phi_\alpha(w,\gamma)$ around $(w^i,\gamma^i)$, which gives $n+1$ linear equations in $n+1$ variables:

$$
\nabla^2\Phi_\alpha(w^i,\gamma^i)\,d^i = -\nabla\Phi_\alpha(w^i,\gamma^i)'. \tag{20}
$$

(ii) Armijo Stepsize [1]: Choose a stepsize $\lambda_i \in R$ such that:

$$
(w^{i+1},\gamma^{i+1}) = (w^i,\gamma^i) + \lambda_i d^i, \tag{21}
$$

where $\lambda_i = \max\{1, \tfrac{1}{2}, \tfrac{1}{4}, \ldots\}$ such that:

$$
\Phi_\alpha(w^i,\gamma^i) - \Phi_\alpha((w^i,\gamma^i) + \lambda_i d^i) \ge -\delta\lambda_i\nabla\Phi_\alpha(w^i,\gamma^i)d^i, \tag{22}
$$

where $\delta \in (0,\tfrac{1}{2})$.

Note that a key difference between our smoothing approach and that of the classical SVM [38, 11] is that we are solving here a linear system of equations (20) instead of solving a quadratic program, as is the case with the classical SVM. Furthermore, it can be shown [21, Theorem 3.2] that the smoothing algorithm above converges quadratically from any starting point.

To obtain a nonlinear SSVM we consider the GSVM formulation (9) with a 2-norm squared error term on $y$ instead of the 1-norm, and instead of the convex term $f(u)$ that suppresses $u$ we use the 2-norm squared of $(u,\gamma)$ to suppress both $u$ and $\gamma$. We obtain then:

$$
\min_{u,\gamma,y} \ \tfrac{\nu}{2}y'y + \tfrac{1}{2}(u'u + \gamma^2)
\quad \text{s.t.} \quad D(K(A,A')Du - e\gamma) + y \ge e, \quad y \ge 0. \tag{23}
$$

We repeat the same arguments as above, in going from (15) to (19), to obtain the SSVM with a nonlinear kernel $K(A,A')$:

$$
\min_{u,\gamma} \ \tfrac{\nu}{2}\|p(e - D(K(A,A')Du - e\gamma),\alpha)\|_2^2 + \tfrac{1}{2}(u'u + \gamma^2). \tag{24}
$$
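As a rough, hedged illustration (not the authors' MATLAB implementation), a generic Newton-Armijo iteration in the spirit of Algorithm 3.1 can be sketched as follows; the callables `phi`, `grad` and `hess` are assumed to be supplied for a smooth objective such as (19) or (24), and all names are invented for this sketch.

```python
import numpy as np

def newton_armijo(phi, grad, hess, z0, delta=0.25, tol=1e-8, max_iter=100):
    """Hedged sketch of a Newton-Armijo iteration in the spirit of Algorithm 3.1.
    phi, grad, hess: callables returning the objective, its gradient (1-D array)
    and its Hessian at a point z = (w, gamma) stacked into one vector."""
    z = np.asarray(z0, dtype=float)
    for _ in range(max_iter):
        g = grad(z)
        if np.linalg.norm(g) <= tol:          # stop when the gradient vanishes
            break
        d = np.linalg.solve(hess(z), -g)      # Newton direction, equation (20)
        lam = 1.0
        # Armijo backtracking: halve the step until condition (22) holds.
        while phi(z) - phi(z + lam * d) < -delta * lam * g.dot(d):
            lam *= 0.5
            if lam < 1e-12:
                break
        z = z + lam * d                       # update (21)
    return z
```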
