Improved graph-based SFA: Information preservation complements the slowness principle

Alberto N. Escalante-B.∗, Laurenz Wiskott
Institut für Neuroinformatik, Ruhr-University Bochum, Germany

∗Corresponding author. Email addresses: [email protected] (Alberto N. Escalante-B.), [email protected] (Laurenz Wiskott)

Abstract

Slow feature analysis (SFA) is an unsupervised-learning algorithm that extracts slowly varying features from a multi-dimensional time series. A supervised extension to SFA for classification and regression is graph-based SFA (GSFA). GSFA is based on the preservation of similarities, which are specified by a graph structure derived from the labels. It has been shown that hierarchical GSFA (HGSFA) allows learning from images and other high-dimensional data. The feature space spanned by HGSFA is complex due to the composition of the nonlinearities of the nodes in the network. However, we show that the network discards useful information prematurely before it reaches higher nodes, resulting in suboptimal global slowness and an under-exploited feature space.

To counteract these problems, we propose an extension called hierarchical information-preserving GSFA (HiGSFA), where information preservation complements the slowness-maximization goal. We build a 10-layer HiGSFA network to estimate human age from facial photographs of the MORPH-II database, achieving a mean absolute error of 3.50 years and improving the state-of-the-art performance. HiGSFA and HGSFA support multiple labels and offer a rich feature space, feed-forward training, and linear complexity in the number of samples and dimensions. Furthermore, HiGSFA outperforms HGSFA in terms of feature slowness, estimation accuracy and input reconstruction, giving rise to a promising hierarchical supervised-learning approach.

Keywords: Supervised dimensionality reduction, Similarity-based learning, Information preservation, Deep networks, Age estimation

1. Introduction

Supervised dimensionality reduction (supervised DR) has promising applications in computer vision, pattern recognition and machine learning, where it is frequently used as a pre-processing step to solve classification and regression problems with high-dimensional input. The goal of supervised DR is to extract a low-dimensional representation of the input samples that contains the predictive information about the labels (e.g., [1]). One advantage is that dimensions irrelevant to the supervised problem can be discarded, resulting in a more compact representation and more accurate estimations.

The slowness principle requires the extraction of the most slowly changing features, see [2], and can be used for DR by itself or before other algorithms. It has been shown that this principle might explain in part how the neurons in the brain self-organize to compute invariant representations. Slow features can be computed using a few methods, such as online learning rules (e.g., [3, 4]), slow feature analysis (SFA) [5, 6], which is a closed-form algorithm specific for this task that has biologically feasible variants [7], and an incremental-learning version (inc-SFA) [8].

The optimization objective of SFA is the minimization of the squared output differences between (temporally) consecutive pairs of samples. Although SFA is unsupervised, the order of the samples can be regarded as a weak form of supervised information, allowing the use of SFA for supervised learning tasks.
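To make the objective concrete, linear SFA can be computed in closed form as a generalized eigenvalue problem involving the covariance matrix of the temporal differences and the covariance matrix of the data. The following minimal numpy/scipy sketch only illustrates this objective; it is not the implementation used in this article, and the function name and interface are ours.

```python
import numpy as np
from scipy.linalg import eigh

def linear_sfa(x, n_features=2):
    """Minimal linear SFA.  x: array of shape (N, I), rows ordered in time.
    Returns W (I x n_features) and the delta values of the extracted features;
    the slow features are y = (x - mean(x)) @ W, ordered from slowest to fastest."""
    xc = x - x.mean(axis=0)              # zero mean
    dx = np.diff(xc, axis=0)             # temporal differences x(t+1) - x(t)
    a = dx.T @ dx / (len(x) - 1)         # second moments of the differences (slowness)
    b = xc.T @ xc / (len(x) - 1)         # covariance of the data (variance constraint)
    deltas, w = eigh(a, b)               # generalized eigenvalue problem; small delta = slow
    return w[:, :n_features], deltas[:n_features]

# Toy usage: recover a slow sine mixed into three noisy dimensions
t = np.linspace(0, 4 * np.pi, 2000)
slow = np.sin(t)
rng = np.random.default_rng(0)
x = np.column_stack([slow + 0.1 * rng.standard_normal(t.size),
                     rng.standard_normal(t.size),
                     0.5 * slow + rng.standard_normal(t.size)])
w, deltas = linear_sfa(x, n_features=1)
y = (x - x.mean(axis=0)) @ w             # extracted slow feature, unit variance
```

In practice the input is typically mapped through a nonlinear expansion first, so that linear SFA in the expanded space yields nonlinear slow features.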
Graph-based SFA (GSFA) [9, 10] is an extension of SFA explicitly designed for the solution of classification and regression problems and yields more accurate label estimations than SFA. GSFA has been used to estimate age and gender from synthetic face images [11], and more recently for traffic sign classification [12] and face detection [13]. GSFA is trained with a training graph, where the vertices are the samples and the edges represent similarities of the corresponding labels. The optimization objective of GSFA is the minimization of weighted squared output differences between samples connected by an edge, where the edges are carefully chosen to reflect label similarities.

Figure 1: The combination of three extensions of SFA (graph-based, hierarchical, information-preserving) gives rise to 8 different versions of SFA. In this article we propose the versions with information preservation, with HiGSFA being the most significant one.

A common problem in machine learning is the high computational cost of algorithms when the data are high-dimensional. This is also the case when GSFA is applied to the data directly (direct GSFA), because it has cubic complexity w.r.t. the number of dimensions. However, processing high-dimensional data is still practical if one resorts to hierarchical GSFA (HGSFA). Instead of extracting slow features in a single step, the features are computed by applying GSFA on lower-dimensional data chunks repeatedly. Other advantages of HGSFA over direct GSFA include lower memory requirements and a complex global nonlinearity due to the composition of nonlinear transformations through the layers. Moreover, the local extraction of slow features from lower-dimensional data chunks typically results in less overfitting.

In this article, we show that HGSFA suffers from a drawback: The separate instances of GSFA that process the low-dimensional data chunks in the network, also called GSFA nodes, may prematurely discard features that are not slow at a local level but that would have been useful to improve global slowness (i.e., the slowness of the final output features) if combined with features from other nodes. This drawback, which we call unnecessary information loss, leads to a suboptimal use of the feature space spanned by the network and also affects the estimation accuracy for the underlying supervised learning problem.

Principle or heuristic                                           | Implemented through
Slowness principle, exploit label similarities through slowness | GSFA, supervised training graphs
Spatial localization of features, divide-and-conquer approach   | Hierarchical processing
Robustness to outliers                                          | Normalized or saturating nonlinear expansions
Information preservation (new)                                  | Minimization of a reconstruction error, PCA
Multiple information channels (new)                             | Multi-label learning combining efficient training graphs

Table 1: Principles, heuristics and ideas considered in HiGSFA and the base methods or algorithms used to exploit them.

To reduce the unnecessary information loss in HGSFA, we propose to complement slowness with information preservation (i.e., maximization of mutual information between the input data and the output features). For simplicity and efficiency, we implement this idea as the minimization of a reconstruction error. The resulting network is called hierarchical information-preserving GSFA (HiGSFA), and the algorithm constituting each node of the network is called information-preserving GSFA (iGSFA). The features computed by iGSFA can be divided into two parts: a slow part, which is a linear transformation of the (nonlinear) features computed with GSFA, and an input-reconstructive part. To compute the input-reconstructive part, the input data are approximated using the slow part, resulting in residual data that are then processed by PCA. The construction ensures that both parts are decorrelated and that the features have compatible scales.
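The two-part construction can be sketched in a few lines. The following is only a schematic illustration of the idea, not the actual iGSFA algorithm (which additionally handles the scaling of the slow part and selects the number of slow features via the ∆-threshold discussed in Section 2); the helper name, the gsfa_features/gsfa_deltas inputs and the default values are ours.

```python
import numpy as np

def igsfa_features(x, gsfa_features, gsfa_deltas, delta_threshold=1.96, out_dim=20):
    """Schematic iGSFA node.
    x: input data (N, I); gsfa_features: (nonlinear) GSFA outputs (N, J),
    sorted by their delta values gsfa_deltas (ascending).
    Keeps the GSFA features that are slow enough (slow part) and fills the
    remaining output dimensions with PCA on the reconstruction residual."""
    xc = x - x.mean(axis=0)
    n_slow = int(np.sum(np.asarray(gsfa_deltas) < delta_threshold))
    s = gsfa_features[:, :n_slow]                      # slow part
    # Approximate the input linearly from the slow part (least squares);
    # the residual is what the slow part cannot reconstruct.
    coef, *_ = np.linalg.lstsq(s, xc, rcond=None)
    residual = xc - s @ coef
    # Principal components of the residual form the reconstructive part.
    _, _, vt = np.linalg.svd(residual, full_matrices=False)
    n_rec = max(out_dim - n_slow, 0)
    reconstructive = residual @ vt[:n_rec].T
    return np.hstack([s, reconstructive])
```

In the hierarchical network, both parts are passed on to the next layer, so that information that is not slow at the node level remains available to higher layers.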
Different versions of SFA with and without information preservation are shown in Figure 1. The principles and heuristics behind HiGSFA are presented in Table 1.

The experiments show the advantages of HiGSFA over HGSFA: (1) slower features, (2) better generalization to unseen data, (3) much better input reconstruction (see Figure 2), and (4) improved accuracy for the supervised learning problem. Furthermore, the computational and memory requirements of HiGSFA are asymptotically the same as in HGSFA.

Figure 2: (a) An image from a private database after pose normalization. (b) The same image fully pre-processed (i.e., after pose normalization and face sampling, 96×96 pixels). Linear reconstructions on 75 features extracted with either (c) PCA, (d) HiGSFA or (e) HGSFA. (f) Average over all pre-processed images of the MORPH-II database.

2. Related work

HiGSFA is the main extension to SFA proposed in this work and belongs to supervised DR. Other algorithms for supervised DR include Fisher discriminant analysis (FDA) [14], local FDA (LFDA) [15], pairwise constraints-guided feature projection (PCGFP) [16], semi-supervised dimensionality reduction (SSDR) [17], and semi-supervised LFDA (SELF) [18].

Existing extensions to SFA include extended SFA (xSFA) [19], generalized SFA (genSFA) [20] and graph-based SFA (GSFA) [9, 10, 11]. HiGSFA extends hierarchical GSFA (HGSFA) by adding information preservation. With some adaptations, SFA has been shown to be successful for classification (e.g., [21, 22, 23, 24, 25]) and regression (e.g., [21]).

From a mathematical point of view, SFA, LPP, genSFA and GSFA belong to the same family of optimization problems, see [26], and can be solved via generalized eigenvalues. Conversions between these four algorithms are possible. Two differences between LPP and GSFA are that in GSFA the vertex weights are independent of the edge weights and that GSFA is invariant to the scale of the weights, providing a normalized objective function.

In spite of being closely related mathematically, SFA, LPP, genSFA and GSFA originate from different backgrounds and were first motivated with different goals in mind. Therefore, they usually have different applications and are trained with different similarity matrices, resulting in features with different properties. For instance, LPP originates from the field of manifold learning and transitions are typically defined from input similarities (nearest neighbors). SFA originates from unsupervised learning and transitions are typically defined by the temporal ordering of the samples. genSFA typically combines input similarities with class information. In GSFA, transitions typically reflect label similarities. For SFA and GSFA the use of input similarities might be disadvantageous, because it might compromise the invariance of the features, which is one of their central goals.

HiGSFA is a hierarchical implementation of information-preserving GSFA (iGSFA), which has one parameter more than GSFA, the ∆-threshold ∆_T. This parameter balances the number of slow and reconstructive features. When ∆_T < 0, iGSFA becomes equivalent to PCA, and when ∆_T ≥ 4.0, iGSFA computes a linear transformation of the GSFA features. Theory justifies fixing ∆_T slightly smaller than 2.0 (Section 4.4).
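The reference value of 2.0 can be motivated by a short calculation (ours, stated here only for intuition; the formal argument is the one referenced above). The ∆ value of a feature is the average squared difference between the outputs of connected (e.g., consecutive) samples. For a zero-mean, unit-variance feature y whose consecutive outputs are uncorrelated,

    E[(y(t+1) − y(t))^2] = E[y(t+1)^2] + E[y(t)^2] − 2 E[y(t+1) y(t)] = 1 + 1 − 0 = 2.

Features with ∆ below 2.0 therefore vary more slowly than such a noise-like feature, which is why a ∆-threshold slightly smaller than 2.0 separates features carrying temporal (or graph) structure from those that do not.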
The rest of the article is organized as follows. In the next section we review GSFA. In Section 4, we analyze the advantages and limitations of hierarchical networks for slow feature extraction. In Section 5, we propose the iSFA (and iGSFA) algorithm. In Section 6, we evaluate HiGSFA experimentally using the problem of age estimation from facial photographs. We conclude with a discussion section.

3. Overview of Graph-based SFA (GSFA)

A training graph G = (V, E) (illustrated in Figure 3.a) has a set V of vertices x(n), each vertex being a sample (i.e., an I-dimensional vector), and a set E of edges (x(n), x(n′)), which are pairs of samples, with 1 ≤ n, n′ ≤ N. The index n (or n′) substitutes the time variable t of SFA. The edges are undirected and have symmetric weights γ_{n,n′} = γ_{n′,n}, which indicate the similarity between the connected vertices; also, each vertex x(n) has an associated weight v_n > 0, which can be used to reflect its frequency. This representation includes the standard time series of SFA as a special case (Figure 3.b).

The GSFA optimization problem [10] can be stated as follows. For 1 ≤ j ≤ J, find features y_j(n) = g_j(x(n)) with g_j ∈ F, where 1 ≤ n ≤ N and F is the feature space, such that the objective function (weighted delta value)

    ∆_j := (1/R) ∑_{n,n′} γ_{n,n′} (y_j(n′) − y_j(n))^2    is minimal    (1)

under the constraints

    (1/Q) ∑_n v_n y_j(n) = 0,    (2)
    (1/Q) ∑_n v_n (y_j(n))^2 = 1, and    (3)
    (1/Q) ∑_n v_n y_j(n) y_{j′}(n) = 0 for j′ < j,    (4)

with R := ∑_{n,n′} γ_{n,n′} and Q := ∑_n v_n.

Figure 3: (a) Example of a training graph with N = 7 vertices. (b) A linear graph suitable for GSFA that learns the same features as SFA on the time series x(1), ..., x(6). (Figure from [10].)

The constraints (2–4) are called weighted zero mean, weighted unit variance, and weighted decorrelation. The factors 1/R and 1/Q provide invariance to the scale of the edge weights as well as to the scale of the vertex weights. Typically, a linear feature space is used, but the input samples are preprocessed by a nonlinear expansion function.

Depending on the training graph chosen, the properties of the features extracted by GSFA can be quite different. Training graphs for classification favor connections between samples from the same class, whereas graphs for regression favor connections between samples with similar labels. In Section 6.3, we show how to combine graphs for classification and regression into an efficient graph to learn age (regression), gender (classification) and race (classification) simultaneously.
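For a linear feature space, problem (1)–(4) reduces to a generalized eigenvalue problem built from two weighted second-moment matrices: one accumulated over the edges (objective) and one over the vertices (constraints). The sketch below is a direct transcription of these equations, not the implementation used in this article; the interface is ours, and the input is assumed to be nonlinearly expanded beforehand if a nonlinear feature space is desired. Only the edge list and the weights change between, e.g., a classification (clustered) graph and a regression (serial) graph.

```python
import numpy as np
from scipy.linalg import eigh

def linear_gsfa(x, edges, edge_weights, vertex_weights, n_features=2):
    """Linear GSFA.  x: samples (N, I); edges: list of index pairs (n, n2);
    edge_weights: gamma for each listed edge; vertex_weights: v_n for each sample.
    Returns W and the weighted delta values; the features are
    y = (x - weighted_mean) @ W, ordered from slowest to fastest."""
    v = np.asarray(vertex_weights, dtype=float)
    gamma = np.asarray(edge_weights, dtype=float)
    r, q = gamma.sum(), v.sum()
    mean = (v[:, None] * x).sum(axis=0) / q          # weighted mean, for constraint (2)
    xc = x - mean
    n1, n2 = np.asarray(edges).T
    diffs = xc[n2] - xc[n1]                          # x(n') - x(n) for every edge
    a = (gamma[:, None] * diffs).T @ diffs / r       # objective matrix, cf. (1)
    b = (v[:, None] * xc).T @ xc / q                 # constraint matrix, cf. (3)-(4)
    deltas, w = eigh(a, b)                           # small eigenvalue = slow feature
    return w[:, :n_features], deltas[:n_features]
```

With unit vertex weights and unit-weight edges between consecutive samples (the linear graph of Figure 3.b), this reduces to standard linear SFA.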
The motivation for HiSFA and HiGSFA is to correct certain disadvantages of HSFA and HGSFA while preserving their advantages. An analysis of these advantages and disadvantages is the topic of the next section.

4. Advantages and limitations of HSFA networks

In this section we analyze HSFA networks from the point of view of their advantages, particularly their computational complexity, and their limitations, particularly unnecessary information loss. We focus on HSFA, but HGSFA is covered by extension. Understanding the drawbacks of HSFA is crucial to justify the extensions with information preservation proposed in Section 5.

4.1. Previous work on HSFA and terminology

The first example of HSFA appears already in the paper that first introduces the SFA algorithm [5], where it is employed as a model of the visual system for learning invariant representations. Various contributions have continued with this biological interpretation. In [27], HSFA has been used to learn invariant features from the simulated view of a rat that moves inside a box. In conjunction with a sparseness post-processing step, the extracted features are similar to the responses of place cells in the hippocampus.

Other works have focused more on its computational efficiency compared to direct SFA. In [28], HSFA has been used for object recognition from images and to estimate pose parameters of single objects moving and rotating over a homogeneous background. In [10], an HGSFA network with 11 layers has been used to accurately find the horizontal position of faces in photographs, which is a sub-problem of face detection.

In general, an HSFA network only needs to be directed and acyclic, with otherwise arbitrary connections, but usually it is composed of multiple layers, each having a regular structure. Most of the HSFA networks in the literature have a similar composition. Typical differences are how the data are split into smaller chunks and the particular processing done by the SFA nodes themselves. The structure of the network usually follows the structure of the data. For instance, networks for data in 1 dimension (e.g., audio data represented as fixed-length vectors) typically have a 1-dimensional structure (e.g., Figure 4), and networks for data in 2 dimensions (e.g., images) have a 2-dimensional structure. This idea extends to voxel data in 3 dimensions, and beyond. For simplicity, we refer to the input data as layer 0.

Important parameters that define the structure of a network include:
(a) The output dimensionality of the nodes.
(b) The fan-in of the nodes, which is the number of nodes (or data elements) in a previous layer that feed them.
(c) The receptive fields of the nodes, which refer to all the elements of the input data that (directly or indirectly) provide input to a particular node.
(d) The stride of a layer, which tells how far apart the inputs to adjacent nodes in a layer are. If the stride is smaller than the fan-in, then at least one node in the previous layer will feed two or more nodes in the current layer. This is called receptive field overlap.
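These parameters determine how the receptive fields grow from layer to layer. The following small helper (ours, purely illustrative) computes, for a 1-dimensional network, the receptive field size and node spacing in input elements, and whether a layer has receptive field overlap.

```python
def receptive_fields_1d(layers):
    """layers: list of (fan_in, stride) pairs, one per layer (bottom-up),
    measured in nodes of the previous layer (input elements for layer 1).
    Returns, per layer, (receptive field size, node spacing, overlap) with the
    first two measured in input elements."""
    rf, spacing = 1, 1                      # layer 0: a single input element
    result = []
    for fan_in, stride in layers:
        overlap = stride < fan_in           # adjacent nodes share some inputs
        rf = rf + (fan_in - 1) * spacing    # field grows by (fan_in - 1) spacings
        spacing = spacing * stride          # distance between adjacent nodes
        result.append((rf, spacing, overlap))
    return result

# Example: layer 1 reads non-overlapping blocks of 4 input values and every
# higher layer combines 2 adjacent nodes (cf. the QHSFA network of Section 4.3):
print(receptive_fields_1d([(4, 4), (2, 2), (2, 2)]))
# -> [(4, 4, False), (8, 8, False), (16, 16, False)]
```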
The same complexity holds for GSFA if one uses an efficient training graph (e.g., the clustered or serial graphs, see Section 6.3), otherwise it can be as large as (for arbitrary graphs) T (N,I)=O(N2I2+I3). (6) GSFA For large I (i.e., high-dimensional data) direct SFA and GSFA are therefore inefficient1. Their complexity can be reduced by using HSFA and HGSFA. The exact complexity depends on the structure and parameters of the hierarchical network. As we prove below, it can be linear in I and N for certain networks. 4.3. Complexity of a quadratic HSFA Network AlthoughexistingapplicationshaveexploitedthespeedofHSFA,wearenot aware of any analysis of its actual complexity. We compute the computational complexity of a concrete quadratic HSFA (QHSFA) network for 1D data with L layers, which is illustrated in Figure 4. Each node of the network performs quadratic SFA (QSFA, i.e., a quadratic expansion followed by linear SFA). The receptivefield, fan-in, andstrideofthenodesinlayer1isk inputvalues, where kisafixedparameter. Intherestofthelayersthefan-inandstrideofthenodes is2nodes. AssumethatthetotalinputdimensionalityisI andthateverynode reduces the dimensionality of its input from k components to k/2 components. 1TheproblemisstillfeasibleifN issmallenoughsothatonemightapplysingularvalue decomposition methods. However, a small number of samples N <I usually results in pro- nouncedoverfitting. 8 From the structure described above, it follows that the receptive fields of the nodes are non-overlapping, the network’s output has k/2 components, the number of layers L is related to I and k: I = k2L−1, (7) and the total number of nodes in the network is M = 2L−1. (8) TheinputtoeachQSFAnodehask dimensions, whichisincreasedtok(k+ 3)/2 dimensions by the quadratic expansion. Afterwards, linear SFA reduces the dimensionality to k/2. Hence, the complexity of training a single node is T (N,k) (=5) O(N(k(k+3)/2)2+(k(k+3)/2)3) (9) QSFA = O(Nk4+k6). (10) (7,8) The number of nodes is M = O(I/k). Therefore, the complexity of training the whole network is T (N,I,k)(1=0)O((Nk4+k6)I/k)=O(INk3+Ik5). (11) QHSFA This means the complexity of the QHSFA network above is linear w.r.t. the input dimension I, whereas the complexity of direct QSFA is T (N,I)(1=0)O(NI4+I6), (12) QSFA which is linear w.r.t. I6. Thus, the QHSFA network has a large computational advantage over direct QSFA. Since each layer in the QHSFA network is quadratic, in general the output features of layer l can be written as polynomials of degree 2l. In particular, the output features of the network are polynomials of degree 2L. However, the actual feature space does not span all the polynomials of this degree but only a subset of them due to the restricted connectivity of the network. In contrast, direct QSFA only contains quadratic polynomials (although all of them). One could train direct SFA on data expanded by a polynomial expansion of degree 2L. The expanded dimensionality would be (cid:80)2L (cid:0)d+I−1(cid:1)=O(I2L), d=0 I−1 resulting in a complexity of O(NI2L+1 +I3·2L), being even less feasible than direct QSFA. The space (memory) complexity of linear SFA is S (N,I)=O(I2+NI) (13) SFA (where the term NI is due to the input data). One can reduce this by using HSFA and training each node separately, one at a time. For instance, the space complexity of direct QSFA is S (N,I)=O(I4+NI), (14) QSFA 9 Figure4: Exampleofan1DQHSFAnetworkwithbinaryfan-insandnooverlap. Eachnode performs quadratic SFA reducing the dimensionality from k to k/2 components. 
It is possible to design more sophisticated networks than the QHSFA one and preserve a similar computational complexity. For example, the network proposed in Section 6 has overlapping receptive fields, larger fan-ins, a 2D structure, and a complexity also linear in I and N.

4.4. Limitations of HSFA networks

In this section, we analyze the limitations of HSFA networks, or in general, any network in which the nodes have only one criterion for DR, namely, slowness maximization. Otherwise the nodes are unrestricted; they might be linear or nonlinear, include additive noise, clipping, various passes of SFA, etc.

In spite of the remarkable advantages of HSFA, we show that relying only on slowness to determine which aspects of the data are preserved results in three disadvantages: unnecessary information loss, poor input reconstruction and feature garbling.

Unnecessary information loss. This disadvantage occurs when the nodes discard dimensions of the data that are not significantly slow locally (i.e., at the node level), but which would have been useful for slowness optimization at higher layers of the network if they had been preserved.

We show through a toy experiment that dimensions important for global slowness are not necessarily slow locally. Consider four zero-mean, unit-variance signals s_1(t), s_2(t), s_3(t) and n(t) that can only take binary values in {−1, +1}.
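A minimal numerical illustration of this phenomenon (ours, not the four-signal construction defined above) uses two observed dimensions that are individually far from slow but whose sum is a slow signal; a node that judged each dimension only by its local ∆ value would see no reason to keep either of them.

```python
import numpy as np

def delta_value(y):
    """Delta value of a signal: mean squared difference of consecutive samples
    after normalizing to zero mean and unit variance."""
    y = (y - y.mean()) / y.std()
    return np.mean(np.diff(y) ** 2)

rng = np.random.default_rng(0)
T = 10000
slow = np.sign(np.sin(np.linspace(0, 8 * np.pi, T)))   # slow binary signal in {-1, +1}
fast = rng.choice([-1.0, 1.0], size=T)                  # i.i.d. binary signal, delta ~ 2
a, b = slow + fast, slow - fast                         # two observed dimensions

print(delta_value(a), delta_value(b))   # both around 1.0: far from slow locally
print(delta_value(a + b))               # ~0.003: the combination is very slow
```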