ebook img

Three-way component analysis using the R package ThreeWay PDF

20 Pages·2011·0.17 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Three-way component analysis using the R package ThreeWay

Journal of Statistical Software JSS MMMMMM YYYY, Volume VV, Issue II. http://www.jstatsoft.org/ R Three-way component analysis using the package ThreeWay Paolo Giordani Henk A.L. Kiers Maria Antonietta Del Ferraro Sapienza University of Rome University of Groningen Sapienza University of Rome Abstract The R package ThreeWay is presented and its main features are illustrated. The aim of ThreeWay is to offer a suit of functions for handling three-way arrays. In particular, the most relevant available functions are T3 and CP, which implement, respectively, the Tucker3andCandecomp/Parafacmethods. Theyarethetwomostpopulartoolsforsum- marizingthree-wayarraysintermsofcomponents. Afterbrieflyrecallingbothtechniques fromatheoreticalpointofview,thefunctionsT3andCParedescribedbyconsideringtwo real life examples. Keywords: multiway analysis, Tucker3, Candecomp/Parafac, R, ThreeWay. 1. Introduction Instatistics,datagenerallyrefertotheobservationsofsomevariablesonasetofunitsandare storedina(two-way)matrix,sayXoforder(I×J),whereI andJ denotethenumbersofunits and variables, respectively. The generic element of X is xij and, therefore, data are indexed by i ∈ {1,...,I} and j ∈ {1,...,J} concerning the unit and variable modes, respectively. Here, we use the term ’mode’ to refer to a set of entities. However, in several situations, the available data can be indexed by i ∈ {1,...,I}, j ∈ {1,...,J} and k ∈ {1,...,K} where K denotes the number of occasions. In this case, the available information consists of some variables collected on a set of units on different occasions and is usually stored in a (three- way) array, say X of order (I ×J ×K), with generic element xijk. The array can then be seen as a box in which the ways (or indices) correspond to the vertical, horizontal and depth axis. For the sake of generality, in the following we decided not to refer to unit, variable and occasions modes. Rather, we refer to, respectively, the A-mode (with I entities), B-mode (with J entities) and C-mode (with K entities). Multiway data analysis concerns the cases in which the number of indices is higher than two 2 The R package ThreeWay (three-waydataanalysiswhenthenumberofindicesisthree). Inthispaperweshalllimitour attention to the three-way case. In the literature there exist several proposals for performing component models on three-way data. The two probably most popular techniques are the Tucker3 (T3) model (Tucker 1966) and the Candecomp/Parafac (CP) model (independently proposed by Carroll and Chang 1970; Harshman 1970). The aim of this work is to illustrate the R package ThreeWay for performing such methods. For this purpose, in the next section, we provide some preliminary notions on three-way data and we introduce the T3 and CP models. Then, Sections3and4aredevotedtothemainfeaturesofThreeWaywithparticular referencetotheimplementationsofCPandT3consideringtwobenchmarkdatasets. Finally, in Section 5 some concluding remarks are given. 2. Methodological background The data array X can be seen as a collection of K matrices of order (I × J). It can be convenient to explicitly take into account this by expressing the available information in terms of a new matrix, which we may call supermatrix, containing such a collection. This can be done for instance as XA = [X··1...X··k...X··K], where X··k stands for the matrix of order (I × J) concerning the k-th entity of the C-mode. X··k is usually called the k-th frontal slab or slice of X. In other words, XA is the matrix with I rows (corresponding to the A-mode entities) and JK columns (corresponding to all the combinations of the B- and C-mode entities) obtained juxtaposing next to each other the frontal slabs of X (A-mode matricization of X). The process of transforming a three-way array into a two-way matrix is usually denoted as matricization or, especially in chemometrics, unfolding. It must be noted that there exist several ways to matricize an array (see, for more details, Kiers 2000). However, in this context, it is sufficient to consider XA for describing the T3 and CP models. In principle standard Principal Component Analysis (PCA) applied to XA or to the matrix containing the mean values of the X··k’s can be used for summarizing the data in X. The former, sometimes called PCA on supermatrices (PCA-SUP) and the latter are available in the functions pcasup3 and pcamean of ThreeWay. Their outputs give a preliminary insight in the data. However, these tools offer information which is incomplete at best because the three-way interactions among the data are arbitrarily ignored. Ad.hoc methods, such as T3 and CP, should be thus considered. In the remaining part of this section, they will be briefly recalled. Refer to Bro (1998), Kroonenberg (1983, 2008), Smilde et al. (2004) for extensive monographs on T3 and CP and, in general, on multiway analysis with applications in several domains. 2.1. The Tucker3 model The Tucker3 (T3) model (Tucker 1966) is a multi-linear model summarizing X by extracting different components for every mode. The Tucker3 model with P components for the A-mode mode, Q for the B-mode and R for the C-mode, can be formalized as XA = AGA(C(cid:2)⊗B(cid:2))+EA, (1) where A, B and C are the component matrices for the A-, B- and C-modes, respectively, the genericelementsofwhichareaip,bjq andckr expressingthecomponentscoresofthei-thentity on the p-th component for the A-mode, of the j-th entity on the q-th component for the B- modeandofthek-thentityonther-thcomponentfortheC-mode,respectively. Furthermore, Journal of Statistical Software 3 GA is the matricized version of the so-called core array G of order (P ×Q×R). Its generic element gpqr gives the interaction among the components of the three modes. Finally EA denotes the matricized array of the errors. The symbol ⊗ is the Kronecker product between ⎡ ⎤ u11V ··· u1JV two matrices (given two matrices U and V, it is U⊗V = ⎢⎣ ... ... ... ⎥⎦). A scalar uI1V ··· uIJV formulation of the T3 model equivalent to (1) can also be provided: (cid:8)P (cid:8)Q (cid:8)R xijk = aipbjqckrgpqr +eijk. (2) p=1q=1r=1 IntheT3model,limitednumbersofcomponentsforall thethreemodesaresought. However, it can be useful to reduce only two modes or just one mode. In these cases, the Tucker2 (T2) or Tucker1 (T1) models can be introduced, respectively. Reducing without loss of generality the A- and B- modes, the T2 model (T2-AB) can be written as (cid:8)P (cid:8)Q xijk = aipbjqgpqk +eijk. (3) p=1q=1 If, without loss of generality, our interest relies in reducing the A-mode, we obtain the T1 model (T1-A): (cid:8)P xijk = aipgpjk +eijk. (4) p=1 The optimal parameter matrices of the T3, T2 and T1 models are found by minimizing (cid:9)I (cid:9)J (cid:9)K (cid:4)EA(cid:4)2 = e2ijk. For the T3 and T2 models, this can be done implementing an Alter- i=1j=1k=1 nating Least Squares (ALS) algorithm, which alternatingly updates every parameter matrix keeping fixed the remaining ones upon convergence. It is assumed that an algorithm con- verges when the values of the loss function in two consecutive iterations differ less than a pre-specified threshold. It can be shown that these algorithms converge to at least a local minimum in a limited number of iterations. To limit the risk of hitting local optima, more than one start is recommended. Since T1 is equivalent to a PCA on XA, the T1 solution can be obtainedbythe SVDofXA. Differentvariants of T1 can be obtainedbychoosing different matricizations. It can be shown that the obtained solution in terms of A, B, C and G is not unique. In fact, equally well-fitting solutions can be obtained considering A˜ = AS, B˜ = BT, C˜ = CU and G˜A = S−1GA(U(cid:2)−1 ⊗T(cid:2)−1), where S, T and U are rotation matrices of appropriate order. Such a rotational freedom holds provided that the rotations of the component matrices are compensated in the core. Although it might represent a limit of the analysis, the indeter- minacy of the solution can be used in order to obtain simple structure solutions. See, for instance, Kiers (1998) in which the procedure called orthomax for jointly rotating to simple structure the component matrices and the core is proposed. The estimation of the T3 model parameters can be carried out using the function T3 of the R package ThreeWay. Its main features shall be illustrated in Section 3. 4 The R package ThreeWay 2.2. The Candecomp/Parafac model The Candecomp/Parafac (CP) model (Carroll and Chang 1970; Harshman 1970) aims at reducing X by extracting the same number of components, say S, for every mode. In scalar notation the CP model can be written as (cid:8)S xijk = aisbjscks+eijk, (5) s=1 with obvious notation. Although the CP model is rather different from T3 and satisfies different properties, it can be seen as a constrained version of T3 with P = Q = R = S and gpqr = 1, if p = q = r, and gpqr = 0, otherwise. The matrix formulation of CP helps to highlight such a relationship. In fact, it is XA = AIA(C(cid:2)⊗B(cid:2))+EA, (6) where IA is the matricized version of the three-way identity array I (ipqr = 1 if p = q = r and 0 otherwise). By comparing (1) and (6) we can conclude that the CP model can be seen as a T3 model with GA = IA. The most relevant difference between CP and T3 concerns the so-called intrinsic axis property. By this, under mild conditions, the CP solution is unique up to rescaling and permutationofthecolumnsofthecomponentmatrices. Itiseasytoseethatuniquenessholds because rotations of the component matrices cannot be compensated into the core which is constrained to be equal to I. A nice property of the CP model is that its solution with S components is the best S-rank approximation of X, where the rank of an array is defined according to Kruskal (1977). It follows that the rank of X is equal to the smallest number of components for which a perfect fit CP solution is obtained. A suitable ALS algorithm can be adopted for obtaining the optimal CP solution. Again the optimal parameter matrices are found such that the residual sum of squares is minimized. InordertoapplytheCPmodel,thefunctionCPoftheRpackageThreeWaycanbeconsidered and its main features will be described in Section 4. T3 R ThreeWay 3. The Function of the package We now apply the T3 model to the ‘Learning to read’ data (Bus, 1982). The data set refers to the process of learning to read of seven pupils (I = 7). Five tests (J = 5) are used to evaluate the learning process: each test measures different reading aspects. The tests are ‘Letter knowledge’ (L), the ability to read ‘Regular orthographic short words’ (P), ‘Regular orthographic long words’ (Q), ‘Regular orthographic long and short words within context’ (S), ‘Irregular orthographic long and short words’ (R). The pupils are tested weekly from week 3 to week 47 except for ten holidays weeks, hence K = 37. The aim of the study is to investigate the learning process and whether the performances of the pupils are equal over time. Of course, the first step of the analysis consists of loading the data. R> library(ThreeWay) R> data(Bus) Journal of Statistical Software 5 The function T3 requires the data set to be analyzed and, if available, the labels laba, labb and labc for the entities of the three modes (if not available, T3 first asks to the users for adding them by keyboard and then generates them automatically when the user decides not to add them). or generate them automatically). The data set can be an object of class array, data.frameormatrix. Inthelattertwocases, theA-modematricizationoftheoriginalarray must be given as input and the numbers of entities of the A-, B- and C-modes must be given interactively by the user. Bus is an object of class matrix with the names of the rows corre- sponding to the labels of the A-mode (pupils) and the names of the columns corresponding to a combination of the labels of the B- and C-modes (tests and time occasions, respectively). R> laba=rownames(Bus) R> labb=substr(colnames(Bus)[1:5],1,1) R> labc=substr(colnames(Bus)[seq(1,ncol(Bus),5)],3,8) A relevant point to be addressed concerns preprocessing. In fact, prior to fitting a model to the data, it is fundamental to decide how to preprocess the data. Preprocessing can be done by centering within a mode or a combination of modes and normalizing across a mode. Differently from a two-way analysis in which data are usually centered and/or normalized across the rows, in a three-way context, it is not straightforward how to preprocess the data since different options are available. We can say that the main aim of the preprocessing step is to eliminate artificial differences in levels and scales. Centering is helpful in order to get ratio-scale (rather than interval-scale) data, i.e. the observed values must be proportional and a zero value denotes a lack of the property being measured. Note that the CP and T3 (and its variants) models require ratio-scale data. The normalization step does not make the data consistent with the model but allows us to avoid that the results are affected by differences in range among entities of one or more modes. In the package ThreeWay centering and normalization are done according to Kiers (2000) and can be performed using functions cent3 and norm3, respectively, specifying the mode(s) within or across we want to center or normalize the data. These functions are automatically implemented when launching T3. For a deeper insight into preprocessing in three-way analysis see Harshman and Lundy (1984) and Bro and Smilde (2003). In norm3, data are normalized to sum of squares equal to product of size of other modes. Alternatively, one can scale the data so that they range from 0 to 1. Since the five tests have different score ranges, we decide to rescaling them as was done by Kroonenberg (1983) and Timmerman (2001), who already analyzed the data using T3. R> max.scale=c(47,10,10,15,15) R> maxBus=rep(max.scale,37) R> BusN=t(t(Bus)/maxBus) Here, max.scale contains the maximum value for every test (the minimum value is 0). We have not centered the data because they have a meaningful zero point in 0. In fact, when the score of the pupil i on the test j at the occasion k is 0, we can conclude that the reading ability of such a pupil is absent. The preprocessed data set is available in BusN and the function T3 can be run. 6 The R package ThreeWay R> t3bus=T3(BusN,laba,labb,labc) In the following, we are going to summarize the most relevant steps for carrying out a three- way analysis process. Thus, some steps are omitted for the sake of brevity. Prior to choosing the numbers of components, T3 gives a suggestion based on the generalized scree test (Timmerman and Kiers 2000; Kiers and der Kinderen 2003) according to the Convex Hull procedure (Ceulemans and Kiers 2006). The corresponding functions called in T3 are T3runsApproxFit and DimSelector R> You are now about to do a Tucker3 analysis. R> R> You have to specify the dimensionalities to use. R> To search these, you are advised to first run PCASUP analyses. R> (PCASUP analyses are PCA’s of supermatrices with slices of the 3way array next to each other, R> thus 3 supermatrices are analyed by PCA, and for each we get a component matrix, and eigenvalues. R> The eigenvalues may give an indication as to the required number of components for each mode.) R> The results can next be used to find useful dimensionalities by the generalized scree test R> If you want to do the PCASUPs for choosing your dimensionality, specify ’1’: R> 1: 0 R> Read 1 item In our analysis we decide to operate as in Timmerman (2001) summarizing the data using twocomponentsfortheA-mode(P = 2), onefortheB-mode(Q = 1)andtwofortheC-mode (R = 2). R> How many A-mode components do you want to use? R> 1: 2 R> Read 1 item R> How many B-mode components do you want to use? R> 1: 1 R> Read 1 item R> How many C-mode components do you want to use? R> 1: 2 R> Read 1 item In this way the fit of the model is very high (96.26%) as one can see inspecting the following output obtained considering two additional random starts. Note that the T3 solution is ob- tained in T3 calling the function T3func. R> By default, only a rationally started analysis run will be carried out. R> To decrease the chance of missing the optimal solution, you may use Journal of Statistical Software 7 additional, randomly started runs. R> If you want additional runs, specify how many (e.g., 4): R> 1: 2 R> Read 1 item R> Run no. 1 R> Rational ORTHONORMALIZED start R> Tucker3 function value at start is 26.5805045184081 R> Tucker3 function value is 25.7735931853048 after 4 iterations R> Fit percentage is 96.26469729604 % R> Procedure used 0.03 seconds R> Run no. 2 R> Random ORTHONORMALIZED starts R> Tucker3 function value at start is 688.092368487312 R> Tucker3 function value is 25.7735931724334 after 6 iterations R> Fit percentage is 96.2646972979055 % R> Procedure used 0.03 seconds R> Run no. 3 R> Random ORTHONORMALIZED starts R> Tucker3 function value at start is 689.615935107658 R> Tucker3 function value is 25.7735931361569 after 5 iterations R> Fit percentage is 96.264697303163 % R> Procedure used 0.03 seconds R> R> Fit (%) values from all runs: R> Start n.1 Start n.2 Start n.3 R> 96.26 96.26 96.26 R> R> Tucker3 analysis with 2 x 1 x 2 components, gave a fit of 96.26 % ThenextstepofT3isaboutsimplestructurerotationbyOrthomax(Kiers 1998)implemented in the function varimcoco. However, in the current analysis this rotation is not carried out because when P = QR we can rotate to simple structure by simply transforming GA of order (2×2) into the identity matrix of order two using the rotational freedom and compensating the transformation in the component matrices. Therefore, we ignore the Orthomax rotation by choosing relative weights equal to 0 (this does not rotate the current solution). R> Find useful simple structure rotation of core and components R> You can now carry out SIMPLE STRUCTURE rotations with varying weights. R> If desired, you can specify a range of different weights for each mode. R> You can also specify a single weight (e.g., just 1). R> Analyses will be carried out with all combinations of relative weights. R> Specify (range of) relative weight(s) for A (default=0): R> 1: R> Read 0 items R> R> Warning: as the number of B-mode components is 1, no simple structure for B 8 The R package ThreeWay will be sought (relative weight=0) R> R> Specify (range of) relative weight(s) for C (default=0): R> 1: R> Read 0 items Prior to transforming GA into the identity matrix, we permute and reflect the solution in such a way that the results are coincident with those in Timmerman (2001). R> You can now manually PERMUTE and REFLECT columns/rows of solution R> If you want to reflect/permute columns/rows, specify ’1’: R> 1: 1 R> Read 1 item R> Give a vector for reflection of columns of A (e.g., 1 -1 -1 1 ..) R> 1: -1 1 R> Read 2 items R> Give a vector with new order of columns of A (e.g., 3 1 4 2 ..) R> 1: 2 1 R> Read 2 items R> Give a vector for reflection of columns of B (e.g., 1 -1 -1 1 ..) R> 1: R> Read 0 items R> R> Warning: the columns of B will not be reflected R> R> Give a vector for reflection of columns of C (e.g., 1 -1 -1 1 ..) R> 1: R> Read 0 items R> R> Warning: the columns of C will not be reflected R> R> Give a vector with new order of columns of C (e.g., 3 1 4 2 ..) R> 1: 2 1 R> Read 2 items We then exit from T3. The resulting output is an object of class list called t3bus. The transformation of GA (denoted by t3bus$core) is compensated in A (t3bus$A): R> t3bus$A=t3bus$A%*%t3bus$core R> t3bus$core=solve(t3bus$core)%*%t3bus$core Justas inTimmerman (2001) we rescalethe componentmatrices so thatthe maximum value of the second A-mode component, of the B-mode component and of the first C-mode compo- nent is equal to 1. The last three rescalings are compensated in the first A-mode component and in the second C-mode component. Journal of Statistical Software 9 R> t3bus$A=t3bus$A%*%t3bus$core R> colnames(t3bus$A)=c("A1","A2") R> t3bus$core=solve(t3bus$core)%*%t3bus$core R> maxA2=max(t3bus$A[,2]) R> t3bus$A[,2]=t3bus$A[,2]/maxA2 R> maxB=max(t3bus$B) R> t3bus$B=t3bus$B/maxB R> maxC1=max(t3bus$C[,1]) R> t3bus$C[,1]=t3bus$C[,1]/maxC1 R> t3bus$A[,1]=t3bus$A[,1]*maxB*maxC1 R> t3bus$C[,2]=t3bus$C[,2]*maxA2*maxB The so-obtained solution is very easy to interpret. To analyze the dynamics of the occasion component scores, we represent them in Figure 1. R> plot(t3bus$C[,1],type="n",xlab="Time occasion", ylab="Component score",ylim=c(min(t3bus$C)-.1,max(t3bus$C)+.1)) R> points(t3bus$C[,1],pch=16) R> points(t3bus$C[,2],pch=17) R> legend(3,1,legend=c("C1","C2"), pch=c(16,17)) The first C-mode component can be interpreted as the ‘General performance level’ because the scores are close to 0 in the beginning and close to 1 at the end of the testing time. The second C-mode component is more complex due to the negative scores in the second half of the occasions. It is interpreted approximately as the ‘Learning rate’ but the negative values do not indicate that the learning rate decreases in the end: it is due only to the rescaling pro- cedure. The B-mode component (see printed values below) is connected with the ‘Difficulties of the items’. Since P (score 1.00) and S (score 0.99) have the highest scores, the judgments of these tests became positive faster than the other tests. The most difficult test is R with a component score equal to 0.58. R> print(round(t3bus$B,2)) R> B1 R> L 0.91 R> P 1.00 R> Q 0.87 R> S 0.99 R> R 0.58 Taking into account that the core is an identity matrix, we can deduce that the first and second C-mode components are related, respectively, to the first and second A-mode compo- nents. The above connection helps us in the interpretation of the components A1 and A2 (see printed values below). Therefore, the pupils whose component scores are high are those who have a performance level (with respect to A1) and a learning rate (with respect to A2) above 10 The R package ThreeWay Figure 1: Component scores for the C-mode from T3 applied to the Bus data average. We can conclude that Pupil 4 is the best one: his (her) scores are the highest (scores 1.28 and 1.00). Pupil 5 follows. In fact, his (her) scores are the highest after the previous pupil. After them, Pupils 6 and 1 have almost the same high scores on the first component, but Pupil 1 has a very low score (the last but one) on the learning rate component. At last, in order, Pupils 3, 2 and 7 appear. Pupil 7 is the worst having the lowest scores (0.89 and -0.42). R> print(round(t3bus$A,2)) R> A1 A2 R> n.1 1.06 -0.42 R> n.2 0.96 -0.30 R> n.3 0.99 -0.38 R> n.4 1.28 1.00 R> n.5 1.16 0.19 R> n.6 1.09 -0.01 R> n.7 0.89 -0.42 Finally, weareinterestedinassessingthestatisticalvalidityoftheobtainedcomponentmatri- ces. Infact,T3allowsustocarryoutabootstrapprocedureforcomputingconfidenceintervals

Description:
stored in a (two-way) matrix, say X of order (I×J), where I and J denote the In principle standard Principal Component Analysis (PCA) applied to XA or to the matrix mode and of the k-th entity on the r-th component for the C-mode,
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.