View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Repositori Institucional de la Universitat Jaume I ✩ Functional archetype and archetypoid analysis ∗ Irene Epifanio Dept. Matem`atiques and Institut de Matem`atiques i Aplicacions de Castello´. Campus del 6 Riu Sec. Universitat Jaume I, 12071 Castello´, Spain 1 0 2 n u J Abstract 5 2 Archetypeandarchetypoidanalysiscanbeextendedtofunctionaldata. Each ] function is approximated by a convex combination of actual observations E (functional archetypoids) or functional archetypes, which are a convex com- M bination of observations in the data set. Well-known Canadian temperature . t data are used to illustrate the analysis developed. Computational methods a t are proposed for performing these analyses, based on the coefficients of a ba- s [ sis. Unlike a previous attempt to compute functional archetypes, which was 2 only valid for an orthogonal basis, the proposed methodology can be used for v any basis. It is computationally less demanding than the simple approach of 1 1 discretizing thefunctions. Multivariatefunctional archetype andarchetypoid 9 analysis are also introduced and applied in an interesting problem about the 6 0 study of human development around the world over the last 50 years. These 1. tools can contribute to the understanding of a functional data set, as in the 0 classical multivariate case. 6 1 Keywords: Archetype analysis, Functional data analysis, Unsupervised : v learning, Extreme point, Global human development i X r a 1. Introduction Archetype analysis (AA) is a statistical technique that seeks to approxi- matedatabyaconvexcombinationofpureorextremaltypescalledarchetypes. Archetypes are built as a convex combination of the observations. AA was ✩ The code and data for reproducing the examples are available at http://www3.uji.es/∼epifanio/RESEARCH/faa.rar. ∗ Tel.: +34 964728390;fax: +34 964728429. Email address: [email protected](Irene Epifanio) Preprint submitted to Computational Statistics & Data Analysis June 28, 2016 first introduced by Cutler and Breiman (1994). More recently, archetypoid analysis (ADA) was introduced by Vinu´e, Epifanio, and Alemany (2015a). Unlike AA, the pure types in ADA are not a mixture (convex combination) of observations, but real observations. It has been shown that human under- standing and interpretation of data is made easier when they are represented by their extreme constituents (Davis and Love, 2010) by the principle of op- posites (Thurau et al., 2012). In other words, extremes are better than cen- tral points for human interpretation. Their applications have been growing in recent years, especially after the AA algorithm was implemented in the R package archetypes (Eugster and Leisch (2009)). ADA is available in the R packageAnthropometry(Vinu´e, Epifanio, Simo´, Ib´an˜ez, Domingo, and Ayala (2015b)). The fields of application include, for instance, market research (Li et al. (2003), Porzio et al. (2008), Midgley and Venaik (2013)), biology (D’Esposito et al.(2012)),genetics(Thøgersen et al.(2013)),sports(Eugster (2012)), industrial engineering (Epifanio et al. (2013); Vinu´e et al. (2015a)), theevaluationofscientists(Seiler and Wohlrabe(2013)),astrophysics(Chan et al. (2003), Richards et al. (2012)), e-learning (Theodosiou et al. (2013)), multi- document summarization (Canhasi and Kononenko (2013, 2014)) and differ- ent machine learning problems (Mørup and Hansen (2012), Stone (2002)). In the seminal paper by Cutler and Breiman (1994), one of the illustra- tive examples worked with functional observations, i.e, data consisting of a set of functions, although they converted them into a matrix by consider- ing a set of values of each curve (after being smoothed) at certain points. Functional data analysis (FDA) comprises statistical procedures for func- tional observations (a whole function is a datum). The objectives of FDA are essentially the same as those of any other branch of statistics. Al- though FDA is relatively new field, some classic references in this field in- clude: Ramsay and Silverman (2005), who provide an excellent overview, Ferraty and Vieu (2006), with new methodologies for studying functional data nonparametrically, and Ramsay and Silverman (2002), who offer inter- esting applications in different fields, and Ramsay et al. (2009) regarding software in this field. More recent advances and interesting applications of FDA comprise a variety of fields such as aviation safety (Gregorutti et al., 2015), chromatography (Rakˆet and Markussen, 2014), the quality of cook- ies and the relationship with the flour kneading process (Jacques and Preda, 2014), the relationship between the geometry of the internal carotid artery andthepresenceorabsenceofananeurysm(Usset et al.,2016;Sangalli et al., 2009) and the analysis of hippocampal differences in Alzheimer’s disease 2 (Epifanio and Ventura-Campos, 2011, 2014). AfirstattempttoextendAAtofunctionaldatawasmadebyCostantini et al. (2012). Functions were expressed in a functional basis, and standard multi- variate AA was applied to the coefficients in this basis. This method is only valid when the basis is orthonormal. The same thing is true when comput- ing standard principal component analysis (PCA) of the basis coefficients in order to carry out functional principal component analysis (FPCA). In this paper, a methodology is developed for obtaining functional archetypes and archetypoids, whatever the basis used for approximating the functions. Interest in describing and displaying the important features of a set of curves is not recent. Jones and Rice (1992) considered curves with ex- treme scores of principal components. This could be viewed as searching the archetypoid functions. However, unlike PCA, the goal of AA is to obtain ex- treme individuals, andcurves with extreme PCA scores do not necessarily re- turnarchetypalobservations. ThisisexplainedinCutler and Breiman(1994) and shown in Epifanio et al. (2013) through a problem where archetypes could not be recovered with PCA even if all the components had been con- sidered. In this paper, AA and ADA are extended to univariate and multivariate functional data (more than one function is available per individual). Section 2 presents a review of archetype and archetypoid analysis in the classical multivariate case, and their respective extensions to FDA are introduced and illustrated through a well-studied data set in the field of FDA. The location of functional archetypes and archetypoids is also analyzed. Human develop- ment is a topic of considerable political and public interest, suffice it to say that all United Nations member states committed to help achieve the Mil- lennium Development Goals established in 2000, by 2015 (United Nations, 2015). In Section 3, the proposal is applied to understanding statistics about human development for all countries, in order to obtain the big picture of global development. This can make it easier to interpret the large amount of data about sustainable development even for non-experts. The code in R (R Development Core Team (2015)) and data for reproducing the results are available at http://www3.uji.es/∼epifanio/RESEARCH/faa.rar. Con- clusions and future work are discussed in Section 4. 3 2. Definition of functional AA and ADA 2.1. AA and ADA for (standard) multivariate data LetXbeann×mmatrixthatcontainsausualmultivariatedatasetwithn observationsandmvariables. TheobjectiveofAAistofindak×mmatrixZ, whose rows arethek archetypes inthose data, insuch away thatdata canbe approximated by mixtures of the archetypes. To obtain the archetypes, AA computes two matrices α and β which minimize the residual sum of squares (RSS) that arises from combining the equation where x is approximated by i a mixture of z ’s (archetypes) ( n kx − k α z k2) and the equation j Pi=1 i Pj=1 ij j where z ’s is expressed as a mixture of the data (z = n β x ): j j Pl=1 jl l n k n k n RSS = Xkxi −Xαijzjk2 = Xkxi −XαijXβjlxlk2, (1) i=1 j=1 i=1 j=1 l=1 under the constraints k 1) Xαij = 1 with αij ≥ 0 for i = 1,...,n and j=1 n 2) Xβjl = 1 with βjl ≥ 0 for j = 1,...,k. l=1 Constraint 1) means that the approximations of x are a convex combina- i k tion of archetypes, xˆi = Xαijzj. Each αij is the weight of the archetype j j=1 for the observation i; that is to say, the α coefficients indicate how much each archetype contributes to the approximation of each observation. Constraint n 2) means that archetypes zj are a mixture of the observations, zj = Xβjlxl. l=1 Note that archetypes are not necessarily actual observations. This would happen if only one β is equal to 1 in constraint 2) for each j. This implies jl that β can only take on the values 0 or 1, since β ≥ 0 and the sum of jl jl constraint 2) is 1. In ADA, the continuous optimization problem of AA transforms into the following mixed-integer optimization problem: n k n k n RSS = Xkxi −Xαijzjk2 = Xkxi −XαijXβjlxlk2, (2) i=1 j=1 i=1 j=1 l=1 4 under the constraints k 1) Xαij = 1 with αij ≥ 0 and i = 1,...,n and j=1 n 2) Xβjl = 1 with βjl ∈ {0,1} and j = 1,...,k. l=1 Note that 2) implies that β = 1 for one and only one l and β = 0 jl jl otherwise. Archetypes and archetypoids are extremal (pure types) representatives of the data. Archetypes belong to the boundary of the convex hull of data if k > 1 (see Cutler and Breiman (1994)), whereas archetypoids do not neces- sarily (see Vinu´e et al. (2015a)). For k = 1, the archetype is the mean, and the archetypoid is the medoid (with one cluster) (Kaufman and Rousseeuw (1990)). In order to solve AA, Cutler and Breiman (1994) proposed an alternating minimizing algorithm, which was implemented in R by Eugster and Leisch (2009). To solve the convex least squares problems, they used a penalized version of the non-negative least squares algorithm by Lawson and Hanson (1974). As explained by Eugster and Leisch (2009), the problems to solve are of the ku − Twk kind, where u and w are vectors and T is a matrix, all of appropriate dimensions, and with the non-negativity and equality re- strictions. In the penalized version an extra element H (for “huge”) is added to u and to each observation of T, minimizing under non-negativity con- straints ku− Twk + Hk1 − wk. The larger H is, the more dominant the second term is, which means that the equality restrictions are fulfilled. The penalized non-negative least squares method is quite slow, but it is attractive since it can be used if the number of variables is larger than the number of cases (according to Cutler and Breiman (1994)). Note that in the R library archetypes, data are standardized by default. To solve ADA, Vinu´e et al. (2015a) proposed an algorithm, which was implemented in R by Vinu´e et al. (2015b), based on the Partitioning Around Medoids(PAM)clusteringalgorithm(Kaufman and Rousseeuw(1990)). That algorithm consists of two phases, a BUILD step and a SWAP step. An ini- tial set of archetypoids is computed in the BUILD phase. The SWAP step seeks to improve the set of archetypoids by exchanging chosen observations for unselected cases and by checking if these replacements reduce the RSS. 5 In the R implementation, the initial candidates are determined in three dif- ferent ways. The first are the nearest observations in Euclidean distance to the k archetypes, the so-called cand set. The second candidates, referred ns to as the cand set, are the observations with the maximum α value for α each archetype, i.e., the observations with the largest relative share for the respective archetype. The third initial candidates, the cand set, consist of β the cases with the maximum β value for each archetype, i.e., those whose contribution to the generation of archetypes is largest. Neither archetypes nor archetypoids are necessarily nested. In order to select thevaluek theuser candecide howmany representatives aretobecon- sidered or the elbow criterion can be used as in Cutler and Breiman (1994); Eugster and Leisch (2009); Vinu´e et al. (2015a) (the value k is chosen as the point where the elbow on the RSS representation for a series of different k values is found). 2.2. AA and ADA for functional data The first consideration is that in the functional context, the values of the m variables in the standard multivariate context are replaced by function values with a continuous index t. Similarly, summations are replaced by integration to define the inner product. Let {x (t), ..., x (t)} be a set of 1 n univariate observed functions with argument t defined in the interval [a,b]. It is always assumed that these functions satisfy reasonable smoothness con- ditions and are square-integrable functions on that interval, a Hilbert space. The objective of functional archetype analysis (FAA) is to find k archetype functions, insuchawaythatourfunctionaldatasamplecanbeapproximated by mixtures of those archetypes. Analogously, FAA computes two matrices α and β which minimize the residual sum of squares (RSS) as in equation 1, taking into account that now k.k stands for a functional norm instead of a vector norm and that the vectors x and z correspond to the functions x i j i and z . The meaning of α and β in the functional case is identical to the j standard multivariate case. In the same manner, functional archetypoid analysis (FADA) seeks k functions of the sample (archetypoids), in such a way that our functional data sample can be approximated by mixtures of those archetypoids. As before, the vector norms are replaced by functional norms in equation 2, and the interpretation is identical, changing only vectors for functions. The same proofs developed by Cutler and Breiman (1994) can be used to demonstrate that the functional archetype for k = 1 is the functional mean, 6 whereasfork >1,functionalarchetypesbelongtotheboundaryoftheconvex hull generated by the set S = {x (t), ..., x (t)}, conv(S). The medoid is the 1 n object in the cluster for which the average dissimilarity to all the objects in the cluster is minimal (Kaufman and Rousseeuw, 1990). In the archetypoid case, for k = 1, as in the classical multivariate case, the archetypoid would be the functional medoid (with one cluster) of {x (t), ..., x (t)}. Note that 1 n the minimization of RSS coincides with the definition of functional medoid, if the vector norms are replaced by functional norms in its definition. A vertex of the convex hull of S is a case x of S for which x does not belong i i to conv(S\{x }). For k > 1, archetypoids are not necessarily vertices in i the classical multivariate case or in the functional case. For example, let us consider the following functions defined in [0,1], x (t) = 0, x (t) = 1 if t in 1 2 [0,0.5] and 0.8 in (0.5,1], x (t) = 0.8 if t in [0,0.5] and 1 in (0.5,1] and x (t) 3 4 = 0.9. For k = 2, the functional archetypoids are x and x , and x is not 1 4 4 a vertex, since x = 0.5x + 0.5x . The same examples used by Vinu´e et al. 4 2 3 (2015a) in the standard multivariate case can also be used to show that functional archetypoids are not necessarily on the boundary of the convex hull. The example functions could be a linear combination of (orthonormal) Fourier basis functions, whose coefficients would be the values given in the examples in Vinu´e et al. (2015a). 2.2.1. Computational details After defining the problem, computational methods for FAA and FADA have to be determined. The L2-norm (kfk2 =< f,f >= bf(t)2dt) is con- Ra sidered for RSS computation. Note that, in practice, the functions are not observed continuously, but rather in a finite set of points. Suppose that preliminary steps such as smoothing have been carried out. A naive ap- proach is to discretize the observed functions to a fine grid of m equally spaced values from a to b. This gives an n×m X matrix that can be used with standard multivariate AA and ADA. This was the strategy followed by Cutler and Breiman (1994) in their seminal paper for working with func- tional observations. Depending on the features of the functions, the number m may have to be large to capture them, which increases the computa- tional time (in the computational complexity study by Eugster and Leisch (2009), increasing the number of variables implied a polynomial increase of the computation time per iteration in AA). It should also be noted that this approach uses a fairly crude approximation of the integral bf(t)2dt as Ra the simple sum of discrete values. It would be similar to the trapezoidal 7 rule; in fact, both methods are the same if periodic boundary conditions are considered. Other, more sophisticated numerical integration techniques could be used to obtain a higher accuracy for approximating the integrals, at the expense of increasing the computational complexity. Instead a ba- sis function expansion of the functions will be used. In this way, a high accuracy will be kept without increasing the computational cost. In fact, as the number of basis functions used is usually smaller than the number of sampling points, the computational time will be decreased with respect to the strategy of discretizing the functions. This is really important if we work with large scale data with many observations and variables. Note that many relevant problems, where AA has been applied, involve functions with manysamplingpoints, forexample, inneurology(Mørup and Hansen(2012), Tsanousa et al. (2015)), chemistry (Mørup and Hansen, 2012), remote sens- ing (Zhao et al., 2015), navigation scenarios (Feld et al., 2015), meteorology (Steinschneider and Lall, 2015), sustainability (Thurau et al., 2012), image analysis (Thurau and Bauckhage, 2009), etc. Representing functions by basis functions has the advantage that it can be applied to data where the functions have not all been measured at the same time points, data observations do not have to be equally spaced and the number of sampling points can vary across cases. The critical point is to select an appropriate basis and the number of basis functions (less compu- tation is required for smaller number of basis functions). This is a common question in all FDA problems. Ideal basis functions should have features that match those known to belong to the functions being approximated (see Ramsay and Silverman (2005) for a good explanation about smoothing func- tional data). In the classic FDA situation considered here, functions are densely observed and the basis coefficients are estimated separately to each function. With sparsely observed functions, the information from all func- tions should be utilized to estimate the coefficients for each function (James, 2010). In the basis approach, each function x is expressed as a linear combina- i tion of known basis functions B with h = 1, ..., m: x (t) = m bhB (t) h i Ph=1 i h = b′B, where ′ stands for transpose and b indicates the vector of length m i i of the coefficients and B the functional vector whose elements are the basis functions. Along with this, RSS can be expressed (with the corresponding constraints for FAA and FADA) as: 8 n k n k n RSS = Xkxi −Xαijzjk2 = Xkxi −XαijXβjlxlk2 = i=1 j=1 i=1 j=1 l=1 n k n n k n Xkb′iB−XαijXβjlb′lBk2 = Xk(b′i −Xαij Xβjlb′l)Bk2 = (3) i=1 j=1 l=1 i=1 j=1 l=1 n n n Xka′iBk2 = X < a′iB,a′iB >= Xa′iWai, i=1 i=1 i=1 where a′ = b′ − k α n β b′ and W is the order m symmetric i i Pj=1 ijPl=1 jl l matrix with elements w = B B , i.e. the matrix containing the m1,m2 R m1 m2 inner products of the pairs of basis functions. In the case of an orthonormal basis such as Fourier, W is the order m identity matrix, and FAA (FADA, respectively) is reduced to AA (ADA, respectively) of the basis coefficients. But,inothercases, wemayhavetoresorttonumericalintegrationtoevaluate W, but once W is computed, no more numerical integrations are necessary. As we can see, the first attempt to define FAA made by Costantini et al. (2012) is only valid if the basis is orthonormal, but not otherwise. To illustrate the concepts, a database heavily analyzed in FDA that ap- peared in Ramsay and Silverman (2005) to illustrate FPCA is used, so that readers who are interested can appreciate the differences in the analysis. The mean monthly temperatures for 35 Canadian weather stations averaged over 1960 to 1994 are considered. As in Ramsay and Silverman (2005), the data are approximated with a 12-term Fourier series. As there are four cli- mate zones, four archetypes and archetypoids are chosen. Fig. 1 shows the solution of FAA and FADA (the same archetypoids are obtained from the three initial candidates). The FADA solution corresponds to the following weather station (each of them belongs to a different climate zone): Montreal (Atlantic), Resolute (Arctic), Dawson (Continental) and Victoria (Pacific), in black, red, green and blue, respectively. If we bring to mind a map of Canada, these stations are located near the borders (extreme zones) of the country. 9 0 2 0 ) 1 C . g e d 0 ( e r u 0 at 1 r − e p m 0 e 2 T − 0 3 − 0 2 4 6 8 10 12 Month Figure 1: Four functional archetypes (solid lines) and archetypoids (dashed lines) for Canadian weather stations. 2.3. Multivariate FAA and FADA We often wish to study more than one function at the same time. For the shake of clarity, the extension of FAA and FADA to deal with bivariate functional data is discussed. The key is to define an inner product between bivariate functions, which is computed simply as the sum of the inner prod- ucts of the two components. Therefore, the squared norm of a bivariate function is simply the sum of the squared norms of the two component func- tions. What all this amounts to is that FAA or FADA for M multivariate functions is equivalent to M independent FAA or FADA, respectively, with shared parameters α and β. In practice, a composite function is formed by stringing the M functions together. Let f (t) = (x (t),y (t)) be a bivariate function. Its squared norm is i i i kf k2 = bx (t)2dt+ by (t)2dt. To compute FAA and FADA, let us consider i Ra i Ra i 10
Description: