Penalized Maximum Likelihood Estimation of Multi-layered Gaussian Graphical Models

Jiahe Lin∗1, Sumanta Basu∗2, Moulinath Banerjee3, and George Michailidis†4

1University of Michigan. [email protected]
2University of California, Berkeley. [email protected]
3University of Michigan. [email protected]
4University of Florida. gmichail@ufl.edu

∗Equal Contribution. †Corresponding Author.

Abstract

Analyzing multi-layered graphical models provides insight into understanding the conditional relationships among nodes within layers after adjusting for and quantifying the effects of nodes from other layers. We obtain the penalized maximum likelihood estimator for Gaussian multi-layered graphical models, based on a computational approach involving screening of variables, iterative estimation of the directed edges between layers and undirected edges within layers, and a final refitting and stability selection step that provides improved performance in finite sample settings. We establish the consistency of the estimator in a high-dimensional setting. To obtain this result, we develop a strategy that leverages the biconvexity of the likelihood function to ensure convergence of the developed iterative algorithm to a stationary point, as well as careful uniform error control of the estimates over iterations. The performance of the maximum likelihood estimator is illustrated on synthetic data.

Key Words: graphical models; penalized likelihood; block coordinate descent; convergence; consistency

1 Introduction

The estimation of directed and undirected graphs from high-dimensional data has received a lot of attention in the machine learning and statistics literature (e.g., see Bühlmann and Van De Geer, 2011, and references therein), due to their importance in diverse applications including understanding of biological processes and disease mechanisms, financial systems stability and social interactions, just to name a few (Sachs et al., 2005; Sobel, 2000; Wang et al., 2007). In the case of undirected graphs, the edges capture conditional dependence relationships between the nodes, while for directed graphs they are used to model causal relationships (Bühlmann and Van De Geer, 2011).

However, in a number of applications the nodes can be naturally partitioned into sets that exhibit interactions both between them and amongst them. As an example, consider an experiment where one has collected data for both genes and metabolites for the same set of patient specimens. In this case, we have three types of interactions: regulatory interactions of genes on metabolites, co-regulation within the gene compartment, and co-regulation within the metabolic compartment. The latter two types of relationships can be expressed through undirected graphs within the sets of genes and metabolites, respectively, while the regulation of metabolites by genes corresponds to directed edges. Note that in principle there are feedback mechanisms from the metabolic compartment to the gene one, but these are difficult to detect and adequately estimate in the absence of carefully collected time course data. Another example comes from the area of financial economics, where one collects data on returns of financial assets (e.g. stocks, bonds) and also on key macroeconomic indicators (e.g. interest rates, price indices, various measures of money supply and various unemployment indices).
Once again, over short time periods there is influence from the economic variables to the returns (directed edges), while there are co-dependence relationships among the asset returns and among the macroeconomic variables, respectively, that can be modeled as undirected edges.

Technically, such layered network structures correspond to multipartite graphs that possess undirected edges within layers and exhibit a directed acyclic graph structure between the layers, as depicted in Figure 1, where we use solid directed edges to denote the dependencies across layers and dashed undirected edges to denote within-layer conditional dependencies. Selected properties of such so-called chain graphs have been studied in the work of Drton and Perlman (2008), with an emphasis on two alternative Markov properties including the LWF Markov property (Frydenberg, 1990; Lauritzen and Wermuth, 1989) and the AMP Markov property (Andersson et al., 2001).

Figure 1: Diagram for a three-layered network (Layer 1, Layer 2, Layer 3).

While layered networks are interesting from a theoretical perspective and have significant scope for applications, their estimation has received little attention in the literature. Note that for a 2-layered structure, the directed edges can be obtained through a multivariate regression procedure, while the undirected edges in both layers can be obtained through existing procedures for graphical models (for more technical details see Section 2.2). This is the strategy leveraged in the work of Rothman et al. (2010), who, for a 2-layered network structure, proposed a multivariate regression with covariance estimation (MRCE) method for estimating the undirected edges in the second layer and the directed edges between the two layers. A coordinate descent algorithm was introduced to estimate the directed edges, while the popular glasso estimator (Friedman et al., 2008) was used for the undirected edges. However, this method does not scale well according to the simulation results presented, and no theoretical properties of the estimates were provided. In follow-up work, Lee and Liu (2012) proposed the Plug-in Joint Weighted Lasso (PWL) and the Plug-in Joint Graphical Weighted Lasso (PWGL) estimators for estimating the same 2-layered structure, where they use a weighted version of the algorithm in Rothman et al. (2010) and also provide theoretical results for the low-dimensional setting, in which the number of samples exceeds the number of potential directed and undirected edges to be estimated. Finally, Cai et al. (2012) proposed a method for estimating the same 2-layered structure and provided corresponding theoretical results in the high-dimensional setting. The Dantzig-type estimator (Candes and Tao, 2007) was used for the regression coefficients, and the corresponding residuals were used as surrogates for obtaining the precision matrix through the CLIME estimator (Cai et al., 2011). While the above work assumed a Gaussian distribution for the data, in more recent work by Yang et al. (2014), the authors constructed the model under a general mixed graphical model framework, which allows each node-conditional distribution to belong to a potentially different univariate exponential family.
In particular, with an underlying mixed MRF graph structure, instead of maximizing the joint likelihood, the authors proposed to estimate the homogeneous and heterogeneous neighborhood of each node (which correspond to undirected and directed edges, respectively, when put in the layered-network setting) by obtaining the $\ell_1$-regularized M-estimator of the node-conditional distribution parameters, using traditional approaches (e.g. Meinshausen and Bühlmann, 2006) for neighborhood estimation. However, if we consider the overall error incurred by the neighborhood selection procedure across the individual nodes, the resulting error bound is not tight, due to the union bound operation used to obtain it.

In this work, we obtain the regularized maximum likelihood estimator under a sparsity assumption on both directed and undirected parameters for multi-layered Gaussian graphical models and establish its consistency properties in a high-dimensional setting. As discussed in Section 3, the problem is not jointly convex in the parameters, but convex in selected subsets of them. Further, it turns out that the problem is biconvex if we consider a recursive multi-stage estimation approach that at each stage involves only regression parameters (directed edges) from preceding layers and precision matrix parameters (undirected edges) for the last layer considered in that stage. Hence, we decompose the multi-layer network structure estimation into a sequence of 2-layer problems, which allows us to establish the desired results. Leveraging the biconvexity of the 2-layer problem, we establish the convergence of the iterates to the maximum likelihood estimator, which under certain regularity conditions is arbitrarily close to the true parameters. The theoretical guarantees provided require a uniform control of the precision of the regression and precision matrix parameters, which poses a number of theoretical challenges resolved in Section 3. In summary, despite the lack of overall convexity, we are able to provide theoretical guarantees for the MLE in a high-dimensional setting. We believe that the proposed strategy is generally applicable to other non-convex statistical estimation problems that can be decomposed into two biconvex problems. Further, to enhance the numerical performance of the MLE in finite (and small) sample settings, we introduce a screening step that selects active nodes for the iterative algorithm used and that leverages recent developments in the high-dimensional regression literature (e.g., Javanmard and Montanari, 2014; Van de Geer et al., 2014; Zhang and Zhang, 2014). We also post-process the final MLE estimate through a stability selection procedure. As mentioned above, the screening and stability selection steps are beneficial to the performance of the MLE in finite samples and hence recommended for similarly structured problems.

The remainder of the paper is organized as follows. In Section 2, we introduce the proposed methodology, with an emphasis on how the multi-layered network estimation problem is decomposed into a sequence of two-layered network estimation problems. In Section 3, we provide theoretical guarantees for the estimation procedure posited. In particular, we show consistency of the estimates and convergence of the algorithm, under a number of common assumptions in high-dimensional settings.
In Section 4, we show the performance of the proposed algorithm with simulation results under different simulation settings, and introduce several acceleration techniques which speed up the convergence of the algorithm and reduce the computing time in practical settings.

2 Problem Formulation.

Consider an $M$-layered Gaussian graphical model. Suppose there are $p_m$ nodes in Layer $m$, denoted by $X^m = (X^m_1, \cdots, X^m_{p_m})'$, for $m = 1, \cdots, M$. The structure of the model is given as follows:

– Layer 1. $X^1 = (X^1_1, \cdots, X^1_{p_1})' \sim N(0, \Sigma^1)$.

– Layer 2. For $j = 1, \cdots, p_2$:
$$X^2_j = (B^{12}_j)' X^1 + \epsilon^2_j, \quad \text{with } B^{12}_j \in \mathbb{R}^{p_1}, \text{ and } \epsilon^2 = (\epsilon^2_1, \cdots, \epsilon^2_{p_2})' \sim N(0, \Sigma^2).$$

  $\vdots$

– Layer $M$. For $j = 1, 2, \cdots, p_M$:
$$X^M_j = \sum_{m=1}^{M-1} (B^{mM}_j)' X^m + \epsilon^M_j, \quad \text{where } B^{mM}_j \in \mathbb{R}^{p_m} \text{ for } m = 1, \cdots, M-1,$$
and $\epsilon^M = (\epsilon^M_1, \cdots, \epsilon^M_{p_M})' \sim N(0, \Sigma^M)$.

The parameters of interest are all directed edges that encode the dependencies across layers, that is,
$$B^{st} := \big[B^{st}_1 \ \cdots \ B^{st}_{p_t}\big], \quad \text{for } 1 \le s < t \le M,$$
and all undirected edges that encode the conditional dependencies within layers after adjusting for the effects from directed edges, that is,
$$\Theta^m := (\Sigma^m)^{-1}, \quad \text{for } m = 1, \cdots, M.$$
It is assumed that $B^{st}$ and $\Theta^m$ are sparse for all $m = 1, \cdots, M$ and $1 \le s < t \le M$.

Given centered data for all $M$ layers, denoted by $\mathbf{X}^m = [\mathbf{X}^m_1, \cdots, \mathbf{X}^m_{p_m}] \in \mathbb{R}^{n \times p_m}$ for all $m = 1, \cdots, M$, we aim to obtain the MLE for all $B^{st}$, $1 \le s < t \le M$, and all $\Theta^m$, $m = 1, \cdots, M$. Henceforth, we use $X^m$ to denote random vectors, and $\mathbf{X}^m_j$ to denote the $j$th column of the $n \times p_m$ data matrix $\mathbf{X}^m$, whenever there is no ambiguity.

Through Markov factorization (Lauritzen, 1996), the full log-likelihood function can be decomposed as:
$$\begin{aligned}
\ell(X^m; B^{st}, \Theta^m, 1 \le s < t \le M, 1 \le m \le M) &= \ell(X^M \mid X^{M-1}, \cdots, X^1; B^{1M}, \cdots, B^{M-1,M}, \Theta^M) \\
&\quad + \ell(X^{M-1} \mid X^{M-2}, \cdots, X^1; B^{1,M-1}, \cdots, B^{M-2,M-1}, \Theta^{M-1}) \\
&\quad + \cdots + \ell(X^2 \mid X^1; B^{12}, \Theta^2) + \ell(X^1; \Theta^1) \\
&= \ell(X^1; \Theta^1) + \sum_{m=2}^{M} \ell(X^m \mid X^1, \cdots, X^{m-1}; B^{1m}, \cdots, B^{m-1,m}, \Theta^m).
\end{aligned}$$
Note that the summands share no common parameters, which enables us to maximize the likelihood with respect to individual parameters in the $M$ terms separately. More importantly, by conditioning Layer $m$ nodes on nodes in its previous $(m-1)$ layers, we can treat Layer $m$ nodes as the "response" layer, and all nodes in the previous $(m-1)$ layers combined as a super "parent" layer. If we ignore the structure within the bottom layer ($X^1$) for the moment, the $M$-layered network can be viewed as $(M-1)$ two-layered networks, each comprising a response layer and a parent layer. Thus, the network structure in Figure 1 can be viewed as two two-layered networks: for the first network, Layer 3 is the response layer, while Layers 1 and 2 combined form the "parent" layer; for the second network, Layer 2 is the response layer, and Layer 1 is the "parent" layer. Therefore, the problem of estimating all $\binom{M}{2}$ coefficient matrices and $M$ precision matrices can be translated into estimating $(M-1)$ two-layered network structures, with directed edges from the parent layer to the response layer and undirected edges within the response layer, and finally estimating the undirected edges within the bottom layer separately.

Since all estimation problems boil down to estimating the structure of a 2-layered network, we focus the technical discussion on introducing our proposed methodology for a 2-layered network setting.¹ The theoretical results obtained extend in a straightforward manner to an $M$-layered Gaussian graphical model.

¹In Appendix 5.2, we give a detailed example of how our proposed method works under a 3-layered network setting.
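For concreteness, the sketch below simulates data from a small three-layered model of the above form; the layer sizes, sparsity level, AR(1)-type error covariances and helper names (sparse_coef, ar1_cov) are illustrative choices of ours, not part of the model specification.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2, p3 = 200, 30, 20, 15          # illustrative dimensions

def sparse_coef(rows, cols, density=0.1):
    """Sparse coefficient matrix encoding directed edges between two layers."""
    B = rng.normal(size=(rows, cols))
    return B * (rng.random((rows, cols)) < density)

def ar1_cov(p, rho=0.5):
    """AR(1) covariance matrix, used here as a simple within-layer error covariance."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

# Layer 1: X1 ~ N(0, Sigma^1)
X1 = rng.multivariate_normal(np.zeros(p1), ar1_cov(p1), size=n)
# Layer 2: X2 = X1 B^{12} + eps^2, with eps^2 ~ N(0, Sigma^2)
B12 = sparse_coef(p1, p2)
X2 = X1 @ B12 + rng.multivariate_normal(np.zeros(p2), ar1_cov(p2), size=n)
# Layer 3: X3 = X1 B^{13} + X2 B^{23} + eps^3, with eps^3 ~ N(0, Sigma^3)
B13, B23 = sparse_coef(p1, p3), sparse_coef(p2, p3)
X3 = X1 @ B13 + X2 @ B23 + rng.multivariate_normal(np.zeros(p3), ar1_cov(p3), size=n)

# The two 2-layered subproblems: (response X3, parent [X1, X2]) and (response X2, parent X1);
# the undirected graph of Layer 1 is estimated separately from X1.
```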
Remark 1. For the $M$-layer network structure, we impose a certain identifiability-type condition on the largest "parent" layer (encompassing $M-1$ layers), so that the directed edges of the entire network are estimable. The imposed condition translates into a minimum-eigenvalue-type condition on the population precision matrix within layers, and conditions on the magnitude of dependencies across layers. Intuitively, consider a three-layered network: if $X^1$ and $X^2$ are highly correlated, then the proposed (as well as any other) method will exhibit difficulties in distinguishing the effect of $X^1$ on $X^3$ from that of $X^2$ on $X^3$. The (group) identifiability-type condition is thus imposed to obviate such circumstances. An in-depth discussion of this issue is provided in Section 3.4.

2.1 A Two-layered Network Set-up.

Consider a two-layered Gaussian graphical model with $p_1$ nodes in the first layer, denoted by $X = (X_1, \cdots, X_{p_1})'$, and $p_2$ nodes in the second layer, denoted by $Y = (Y_1, \cdots, Y_{p_2})'$. The model is defined as follows:

– $X = (X_1, \cdots, X_{p_1})' \sim N(0, \Sigma_X)$.

– For $j = 1, 2, \cdots, p_2$: $Y_j = B_j' X + \epsilon_j$, with $B_j \in \mathbb{R}^{p_1}$, and $\epsilon = (\epsilon_1, \cdots, \epsilon_{p_2})^\top \sim N(0, \Sigma_\epsilon)$.

The parameters of interest are $\Theta_X := \Sigma_X^{-1}$, $\Theta_\epsilon := \Sigma_\epsilon^{-1}$ and $B = [B_1, \cdots, B_{p_2}]$. As with most estimation problems in the high-dimensional setting, we assume these parameters to be sparse. Now, given data $\mathbf{X} = [\mathbf{X}_1, \cdots, \mathbf{X}_{p_1}] \in \mathbb{R}^{n \times p_1}$ and $\mathbf{Y} = [\mathbf{Y}_1, \cdots, \mathbf{Y}_{p_2}] \in \mathbb{R}^{n \times p_2}$, both centered, we would like to use the penalized maximum likelihood approach to obtain estimates for $\Theta_X$, $\Theta_\epsilon$ and $B$. Throughout this paper, we use $\mathbf{X}$, $\mathbf{Y}$ and $\mathbf{E}$ to denote the size-$n$ realizations of the random vectors $X$, $Y$ and $\epsilon$, respectively. Also, with a slight abuse of notation, we use $\mathbf{X}_i$, $i = 1, 2, \cdots, p_1$, and $\mathbf{Y}_j$, $j = 1, 2, \cdots, p_2$, to denote the columns of the data matrices $\mathbf{X}$ and $\mathbf{Y}$, respectively, whenever there is no ambiguity.

The full log-likelihood can be written as
$$\ell(\mathbf{X}, \mathbf{Y}; B, \Theta_\epsilon, \Theta_X) = \ell(\mathbf{Y}|\mathbf{X}; \Theta_\epsilon, B) + \ell(\mathbf{X}; \Theta_X). \tag{1}$$
Note that the first term only involves $\Theta_\epsilon$ and $B$, and the second term only involves $\Theta_X$. Hence, (1) can be maximized by maximizing $\ell(\mathbf{Y}|\mathbf{X})$ w.r.t. $(\Theta_\epsilon, B)$ and maximizing $\ell(\mathbf{X})$ w.r.t. $\Theta_X$, respectively. $\widehat{\Theta}_X$ can be obtained using traditional methods for estimating undirected graphs, e.g., the graphical lasso (Friedman et al., 2008) or the nodewise regression procedure (Meinshausen and Bühlmann, 2006). Therefore, the rest of this paper will mainly focus on obtaining estimates for $\Theta_\epsilon$ and $B$. In the next subsection, we introduce our estimation procedure for obtaining the MLE for $\Theta_\epsilon$ and $B$.

Remark 2. Our proposed method is targeted towards maximizing $\ell(\mathbf{Y}|\mathbf{X}; \Theta_\epsilon, B)$ (with proper penalization) in (1) only, which gives the estimates for the across-layer dependencies between the response layer and the parent layer, as well as estimates for the conditional dependencies within the response layer, each time we solve a 2-layered network estimation problem. For an $M$-layered estimation problem, the maximization involving $\ell(\mathbf{X}; \Theta_X)$ occurs only when we are estimating the within-layer conditional dependencies for the bottom layer.
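For the $\ell(\mathbf{X}; \Theta_X)$ part, a minimal sketch using scikit-learn's GraphicalLassoCV as one off-the-shelf graphical lasso implementation is shown below; the synthetic data and the cross-validated choice of the penalty are conveniences for illustration and differ from the BIC-based tuning used later for $(\lambda_n, \rho_n)$.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))            # stand-in for the centered bottom-layer data
X -= X.mean(axis=0)

gl = GraphicalLassoCV().fit(X)            # graphical lasso with a cross-validated penalty
Theta_X_hat = gl.precision_               # estimate of Theta_X = Sigma_X^{-1}
edge_set = np.abs(Theta_X_hat) > 1e-8     # nonzero pattern = estimated undirected edges
```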
2.2 Estimation Algorithm.

The conditional likelihood for the response $\mathbf{Y}$ given $\mathbf{X}$ can be written as:
$$L(\mathbf{Y}|\mathbf{X}) = \Big(\tfrac{1}{\sqrt{2\pi}}\Big)^{np_2}\,|\Sigma_\epsilon \otimes I_n|^{-1/2}\exp\Big\{-\tfrac{1}{2}(\mathcal{Y} - \mathcal{X}\beta)^\top(\Sigma_\epsilon \otimes I_n)^{-1}(\mathcal{Y} - \mathcal{X}\beta)\Big\},$$
where $\mathcal{Y} = \mathrm{vec}(\mathbf{Y}_1, \cdots, \mathbf{Y}_{p_2})$, $\mathcal{X} = I_{p_2} \otimes \mathbf{X}$ and $\beta = \mathrm{vec}(B_1, \cdots, B_{p_2})$. After writing out the Kronecker product, the log-likelihood can be written as:
$$\ell(\mathbf{Y}|\mathbf{X}) = \text{constant} + \frac{n}{2}\log\det\Theta_\epsilon - \frac{1}{2}\sum_{j=1}^{p_2}\sum_{i=1}^{p_2}\sigma_\epsilon^{ij}(\mathbf{Y}_i - \mathbf{X}B_i)^\top(\mathbf{Y}_j - \mathbf{X}B_j).$$
Here, $\sigma_\epsilon^{ij}$ denotes the $ij$-th entry of $\Theta_\epsilon$. With $\ell_1$ penalization, which induces sparsity, the optimization problem can be formulated as:
$$\min_{\substack{B \in \mathbb{R}^{p_1 \times p_2} \\ \Theta_\epsilon \in \mathbb{S}_{++}^{p_2 \times p_2}}}\; \frac{1}{n}\sum_{j=1}^{p_2}\sum_{i=1}^{p_2}\sigma_\epsilon^{ij}(\mathbf{Y}_i - \mathbf{X}B_i)^\top(\mathbf{Y}_j - \mathbf{X}B_j) - \log\det\Theta_\epsilon + \lambda_n\sum_{j=1}^{p_2}\|B_j\|_1 + \rho_n\|\Theta_\epsilon\|_{1,\mathrm{off}}, \tag{2}$$
and the first term in (2) can be equivalently written as:
$$\mathrm{tr}\left\{\frac{1}{n}\begin{bmatrix}(\mathbf{Y}_1 - \mathbf{X}B_1)^\top \\ \vdots \\ (\mathbf{Y}_{p_2} - \mathbf{X}B_{p_2})^\top\end{bmatrix}\big[(\mathbf{Y}_1 - \mathbf{X}B_1)\ \cdots\ (\mathbf{Y}_{p_2} - \mathbf{X}B_{p_2})\big]\,\Theta_\epsilon\right\} := \mathrm{tr}(S\Theta_\epsilon),$$
where $S$ is defined as the sample covariance matrix of $\mathbf{E} \equiv \mathbf{Y} - \mathbf{X}B$. This gives rise to the following optimization problem:
$$\min_{\substack{B \in \mathbb{R}^{p_1 \times p_2} \\ \Theta_\epsilon \in \mathbb{S}_{++}^{p_2 \times p_2}}}\; \mathrm{tr}(S\Theta_\epsilon) - \log\det\Theta_\epsilon + \lambda_n\sum_{j=1}^{p_2}\|B_j\|_1 + \rho_n\|\Theta_\epsilon\|_{1,\mathrm{off}} \equiv f(B, \Theta_\epsilon), \tag{3}$$
where $\|\Theta\|_{1,\mathrm{off}}$ is the absolute sum of the off-diagonal entries of $\Theta$, and $\lambda_n$ and $\rho_n$ are both positive tuning parameters. This penalized log-likelihood corresponds to the objective function initially proposed in Rothman et al. (2010), and has also been examined in Lee and Liu (2012).

Note that the objective function (3) is not jointly convex in $(B, \Theta_\epsilon)$, but only convex in $B$ for fixed $\Theta_\epsilon$ and in $\Theta_\epsilon$ for fixed $B$; hence, it is bi-convex, which in turn implies that the proposed algorithm may fail to converge to the global optimum, especially in settings where $p_1 > n$, as pointed out by Lee and Liu (2012). As is the case with most non-convex problems, good initial parameters are beneficial for fast convergence of the algorithm, a fact supported by our numerical work on the present problem. Further, a good initialization is critical in establishing convergence of the algorithm for this problem (see Section 3.1). To that end, we introduce a screening step for obtaining a good initial estimate for $B$. The theoretical justification for employing the screening step is provided in Section 3.3. An outline of the computational procedure is presented in Algorithm 1, while the details of each step involved are discussed next.

Algorithm 1: Computational procedure for estimating $B$ and $\Theta_\epsilon$
Input: Data from the parent layer $\mathbf{X}$ and the response layer $\mathbf{Y}$.
1. Screening:
   for $j = 1, \cdots, p_2$ do
     regress $\mathbf{Y}_j$ on $\mathbf{X}$ using the de-biased Lasso procedure in Javanmard and Montanari (2014) and obtain the corresponding vector of p-values $P_j$;
   end
   obtain adjusted p-values $\widetilde{P}_j$ by applying a Bonferroni correction to $\mathrm{vec}(P_1, \cdots, P_{p_2})$;
   determine the support set $\mathcal{B}_j$ for each regression using (4).
2. Initialization:
   initialize column $j = 1, \cdots, p_2$ of $\widehat{B}^{(0)}$ by solving (5);
   initialize $\widehat{\Theta}_\epsilon^{(0)}$ by applying the graphical lasso (Friedman et al., 2008) to the residuals from (5).
3. Alternating search:
   while $|f(\widehat{B}^{(k)}, \widehat{\Theta}_\epsilon^{(k)}) - f(\widehat{B}^{(k+1)}, \widehat{\Theta}_\epsilon^{(k+1)})| \ge \varepsilon$ do
     update $\widehat{B}$ with (6);
     update $\widehat{\Theta}_\epsilon$ with (8);
   end
4. Refitting $B$ and $\Theta_\epsilon$:
   for $j = 1, \cdots, p_2$ do
     obtain the refitted $\widetilde{B}_j$ using (9);
   end
   re-estimate $\widetilde{\Theta}_\epsilon$ using (10), with $W$ coming from stability selection.
Output: Final estimates $\widetilde{B}$ and $\widetilde{\Theta}_\epsilon$.
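As a point of reference for the stopping rule in Algorithm 1, the short helper below evaluates the penalized objective $f(B, \Theta_\epsilon)$ of (3) for given data and parameter values; it is a direct transcription of the formula, with function and argument names of our choosing.

```python
import numpy as np

def objective(Y, X, B, Theta, lam, rho):
    """Penalized objective f(B, Theta_eps) from (3)."""
    n = Y.shape[0]
    E = Y - X @ B                                   # residual matrix, n x p2
    S = E.T @ E / n                                 # sample covariance of the residuals
    _, logdet = np.linalg.slogdet(Theta)
    off_l1 = np.abs(Theta).sum() - np.abs(np.diag(Theta)).sum()
    return np.trace(S @ Theta) - logdet + lam * np.abs(B).sum() + rho * off_l1
```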
Screening. For each variable $Y_j$, $j = 1, \cdots, p_2$, in the response layer, regress $\mathbf{Y}_j$ on $\mathbf{X}$ via the de-biased Lasso procedure proposed by Javanmard and Montanari (2014). The output consists of the p-values for each predictor in each regression, denoted by $P_j$, with $P_j \in [0,1]^{p_1}$. To control the family-wise error rate of the estimates, we apply a Bonferroni correction at level $\alpha$: define $\alpha^\star = \alpha/p_1 p_2$ and set $B_{j,k} = 0$ if the p-value $P_{j,k}$ obtained for the $k$th predictor in the $j$th regression exceeds $\alpha^\star$. Further, let
$$\mathcal{B}_j = \{B_j \in \mathbb{R}^{p_1} : B_{j,k} = 0 \text{ if } k \in \widehat{S}_j^c\} \subseteq \mathbb{R}^{p_1}, \tag{4}$$
where $\widehat{S}_j$ is the collection of indices of those predictors deemed "active" for response $Y_j$:
$$\widehat{S}_j = \{k : P_{j,k} \le \alpha^\star\}, \quad \text{for } j = 1, \cdots, p_2.$$
Therefore, subsequent estimation of the elements of $B$ will be restricted to $\mathcal{B}_1 \times \cdots \times \mathcal{B}_{p_2}$.

Alternating Search. In this step, we utilize the bi-convexity of the problem and estimate $B$ and $\Theta_\epsilon$ by minimizing, in an iterative fashion, the objective function with respect to (w.r.t.) one set of parameters while holding the other set fixed within each iteration.

As with most iterative algorithms, we need an initializer: for $\widehat{B}^{(0)}$ it corresponds to a Lasso/Ridge regression estimate with a small penalty, while for $\widehat{\Theta}_\epsilon$ we use the graphical lasso procedure applied to the residuals obtained from the first-stage regression. That is, for each $j = 1, \cdots, p_2$,
$$\widehat{B}_j^{(0)} = \operatorname*{argmin}_{B_j \in \mathcal{B}_j}\big\{\|\mathbf{Y}_j - \mathbf{X}B_j\|_2^2 + \lambda_n^0\|B_j\|_1\big\}, \tag{5}$$
where $\lambda_n^0$ is some small tuning parameter for initialization, and set $\widehat{\mathbf{E}}_j^{(0)} := \mathbf{Y}_j - \mathbf{X}\widehat{B}_j^{(0)}$. An initial estimate of $\Theta_\epsilon$ is then given by solving the following optimization problem with the graphical lasso (Friedman et al., 2008) procedure:
$$\widehat{\Theta}_\epsilon^{(0)} = \operatorname*{argmin}_{\Theta_\epsilon \in \mathbb{S}_{++}^{p_2 \times p_2}}\big\{\mathrm{tr}(\widehat{S}^{(0)}\Theta_\epsilon) - \log\det\Theta_\epsilon + \rho_n\|\Theta_\epsilon\|_{1,\mathrm{off}}\big\},$$
where $\widehat{S}^{(0)}$ is the sample covariance matrix based on $(\widehat{\mathbf{E}}_1^{(0)}, \cdots, \widehat{\mathbf{E}}_{p_2}^{(0)})$.

Next, we use an alternating block coordinate descent algorithm with $\ell_1$ penalization to reach a stationary point of the objective function (3):

– Update $B$ as:
$$\widehat{B}^{(k+1)} = \operatorname*{argmin}_{B \in \mathcal{B}_1 \times \cdots \times \mathcal{B}_{p_2}}\left\{\frac{1}{n}\sum_{i=1}^{p_2}\sum_{j=1}^{p_2}(\widehat{\sigma}_\epsilon^{ij})^{(k)}(\mathbf{Y}_i - \mathbf{X}B_i)^\top(\mathbf{Y}_j - \mathbf{X}B_j) + \lambda_n\sum_{j=1}^{p_2}\|B_j\|_1\right\}, \tag{6}$$
which can be obtained by cyclic coordinate descent w.r.t. each column $B_j$ of $B$; that is, update each column $B_j$ by:
$$\widehat{B}_j^{(t+1)} = \operatorname*{argmin}_{B_j \in \mathcal{B}_j}\left\{\frac{(\widehat{\sigma}_\epsilon^{jj})^{(k)}}{n}\|\mathbf{Y}_j + r_j^{(t+1)} - \mathbf{X}B_j\|_2^2 + \lambda_n\|B_j\|_1\right\}, \tag{7}$$
where
$$r_j^{(t+1)} = \frac{1}{(\widehat{\sigma}_\epsilon^{jj})^{(k)}}\left[\sum_{i=1}^{j-1}(\widehat{\sigma}_\epsilon^{ij})^{(k)}(\mathbf{Y}_i - \mathbf{X}\widehat{B}_i^{(t+1)}) + \sum_{i=j+1}^{p_2}(\widehat{\sigma}_\epsilon^{ij})^{(k)}(\mathbf{Y}_i - \mathbf{X}\widehat{B}_i^{(t)})\right],$$
and iterate over all columns until convergence. Here, we use $k$ to index the outer iteration while minimizing w.r.t. $B$ or $\Theta_\epsilon$, and $t$ to index the inner iteration while cyclically minimizing w.r.t. each column of $B$.

– Update $\Theta_\epsilon$ as:
$$\widehat{\Theta}_\epsilon^{(k+1)} = \operatorname*{argmin}_{\Theta_\epsilon \in \mathbb{S}_{++}^{p_2 \times p_2}}\big\{\mathrm{tr}(\widehat{S}^{(k+1)}\Theta_\epsilon) - \log\det\Theta_\epsilon + \rho_n\|\Theta_\epsilon\|_{1,\mathrm{off}}\big\}, \tag{8}$$
where $\widehat{S}^{(k+1)}$ is the sample covariance matrix based on $\widehat{\mathbf{E}}_j^{(k+1)} = \mathbf{Y}_j - \mathbf{X}\widehat{B}_j^{(k+1)}$, $j = 1, \cdots, p_2$.
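A simplified Python sketch of the alternating search is given below, with scikit-learn's Lasso used for the column-wise update (7) (after rescaling the penalty so that the two objectives match) and graphical_lasso used for (8). It omits the screening restriction of each $B_j$ to $\mathcal{B}_j$, performs a single sweep over the columns per outer iteration, and skips the refitting and stability-selection steps, so it should be read as an illustration of the iteration rather than the full procedure; the function and variable names are ours.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.covariance import graphical_lasso

def alternating_search(X, Y, lam, rho, max_iter=50, tol=1e-4):
    """Simplified block relaxation for (3): alternate between B and Theta_eps."""
    n, p2 = Y.shape
    # Initialization: column-wise lasso for B, graphical lasso on the residuals.
    B = np.column_stack([Lasso(alpha=lam, fit_intercept=False).fit(X, Y[:, j]).coef_
                         for j in range(p2)])
    E = Y - X @ B
    _, Theta = graphical_lasso(np.cov(E, rowvar=False, bias=True), alpha=rho)
    prev_obj = np.inf
    for _ in range(max_iter):
        # Update B one column at a time, holding Theta_eps fixed (cf. (7)).
        for j in range(p2):
            r_j = (E @ Theta[:, j] - E[:, j] * Theta[j, j]) / Theta[j, j]
            y_tilde = Y[:, j] + r_j          # working response for column j
            fit = Lasso(alpha=lam / (2 * Theta[j, j]), fit_intercept=False).fit(X, y_tilde)
            B[:, j] = fit.coef_
            E[:, j] = Y[:, j] - X @ B[:, j]  # keep residuals current (Gauss-Seidel style)
        # Update Theta_eps by the graphical lasso on the current residuals (cf. (8)).
        S = np.cov(E, rowvar=False, bias=True)
        _, Theta = graphical_lasso(S, alpha=rho)
        obj = (np.trace(S @ Theta) - np.linalg.slogdet(Theta)[1]
               + lam * np.abs(B).sum()
               + rho * (np.abs(Theta).sum() - np.abs(np.diag(Theta)).sum()))
        if abs(prev_obj - obj) < tol:        # stopping rule of Algorithm 1
            break
        prev_obj = obj
    return B, Theta
```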
Refitting and Stabilizing. As noted in the introduction, this step is beneficial in applications, especially when one deals with large-scale multi-layer networks and relatively smaller sample sizes. Denote the solution obtained by the above iterative procedure by $B^\infty$ and $\Theta_\epsilon^\infty$. For each $j = 1, \cdots, p_2$, set $\widetilde{\mathcal{B}}_j = \{B_j \in \mathbb{R}^{p_1} : B_{j,i} = 0 \text{ if } B_{j,i}^\infty = 0\}$, and the final estimate of $B_j$ is given by ordinary least squares:
$$\widetilde{B}_j = \operatorname*{argmin}_{B_j \in \widetilde{\mathcal{B}}_j} \|\mathbf{Y}_j - \mathbf{X}B_j\|_2^2. \tag{9}$$
For $\Theta_\epsilon$, we obtain the final estimate by a combination of stability selection (Meinshausen and Bühlmann, 2010) and the graphical lasso (Friedman et al., 2008). That is, after obtaining the refitted residuals $\widetilde{\mathbf{E}}_j := \mathbf{Y}_j - \mathbf{X}\widetilde{B}_j$, $j = 1, \cdots, p_2$, we run the stability selection procedure with the graphical lasso and obtain the stability path, or probability matrix, $W$, which records for each edge the proportion of bootstrapped samples of the $\widetilde{\mathbf{E}}_j$'s in which that edge is selected. Then, using this probability matrix $W$ as a weight matrix, we obtain the final estimate $\widetilde{\Theta}_\epsilon$ as follows:
$$\widetilde{\Theta}_\epsilon = \operatorname*{argmin}_{\Theta_\epsilon \in \mathbb{S}_{++}^{p_2 \times p_2}}\big\{\mathrm{tr}(\widetilde{S}\Theta_\epsilon) - \log\det\Theta_\epsilon + \widetilde{\rho}_n\|(1 - W) * \Theta_\epsilon\|_{1,\mathrm{off}}\big\}, \tag{10}$$
where $*$ denotes the element-wise product of two matrices, and $\widetilde{S}$ is the sample covariance matrix based on the refitted residuals $\widetilde{\mathbf{E}}$. Again, (10) can be solved by the graphical lasso procedure (Friedman et al., 2008), with $\widetilde{\rho}_n$ properly chosen.

2.3 Tuning Parameter Selection.

To select the tuning parameters $(\lambda_n, \rho_n)$, we use the Bayesian Information Criterion (BIC), which is the sum of a goodness-of-fit term (log-likelihood) and a penalty term. The explicit form of the BIC (as a function of $B$ and $\Theta_\epsilon$) in our setting is given by
$$\mathrm{BIC}(B, \Theta_\epsilon) = -\log\det\Theta_\epsilon + \mathrm{tr}(S\Theta_\epsilon) + \frac{\log n}{n}\left(\frac{\|\Theta_\epsilon\|_0 - p_2}{2} + \|B\|_0\right),$$
where
$$S := \frac{1}{n}\begin{bmatrix}(\mathbf{Y}_1 - \mathbf{X}B_1)^\top \\ \vdots \\ (\mathbf{Y}_{p_2} - \mathbf{X}B_{p_2})^\top\end{bmatrix}\big[(\mathbf{Y}_1 - \mathbf{X}B_1)\ \cdots\ (\mathbf{Y}_{p_2} - \mathbf{X}B_{p_2})\big],$$
and $\|\Theta_\epsilon\|_0$ is the total number of non-zero entries in $\Theta_\epsilon$. Here we penalize the non-zero elements in the upper-triangular part of $\Theta_\epsilon$ and the non-zero ones in $B$. We choose the combination $(\lambda_n^*, \rho_n^*)$ over a grid of $(\lambda, \rho)$ values, such that $(\lambda_n^*, \rho_n^*)$ minimizes the BIC evaluated at $(B^\infty, \Theta_\epsilon^\infty)$.
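One possible implementation of this selection rule is sketched below: the bic helper transcribes the criterion above, and the (commented-out) grid search re-runs the alternating procedure, e.g. the alternating_search sketch from Section 2.2, for each candidate pair; the grid values shown are arbitrary placeholders.

```python
import numpy as np
from itertools import product

def bic(Y, X, B, Theta):
    """BIC of Section 2.3 evaluated at (B, Theta_eps)."""
    n, p2 = Y.shape
    E = Y - X @ B
    S = E.T @ E / n
    df = (np.count_nonzero(Theta) - p2) / 2.0 + np.count_nonzero(B)
    return -np.linalg.slogdet(Theta)[1] + np.trace(S @ Theta) + np.log(n) / n * df

# Grid search over (lambda, rho); assumes data (X, Y) and alternating_search are in scope.
# grid = list(product([0.05, 0.1, 0.2], [0.05, 0.1, 0.2]))
# lam_star, rho_star = min(grid, key=lambda lr: bic(Y, X, *alternating_search(X, Y, *lr)))
```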
3 Theoretical Results

In this section, we establish a number of theoretical results for the proposed iterative algorithm. We focus the presentation on the two-layer structure, since, as explained in the previous section, the multi-layer estimation problem decomposes into a series of two-layer ones. As mentioned in the introduction, one key challenge in establishing the theoretical results comes from the fact that the objective function (3) is not jointly convex in $B$ and $\Theta_\epsilon$. Consequently, if we simply used properties of block-coordinate descent algorithms, we would not be able to provide the necessary theoretical guarantees for the estimates we obtain. On the other hand, the biconvex nature of the objective function allows us to establish convergence of the alternating algorithm to a stationary point, provided it is initialized from a point close enough to the true parameters. This can be accomplished using a Lasso-based initializer for $B$ and $\Theta_\epsilon$, as previously discussed. The details of algorithmic convergence are presented in Section 3.1.

Another technical challenge is that each update in the alternating search step relies on estimated quantities, namely the regression and precision matrix parameters, rather than the raw data, whose estimation precision needs to be controlled uniformly across all iterations. The details of establishing consistency of the estimates for both fixed and random realizations are given in Section 3.2.

Next, we outline the structure of this section. In Section 3.1, Theorem 1, we show that for any fixed set of realizations of $\mathbf{X}$ and $\mathbf{E}$,² the iterative algorithm is guaranteed to converge to a stationary point if the estimates for all iterations lie in a compact ball around the true value of the parameters. In Section 3.2, we show in Theorem 4 that for any random $\mathbf{X}$ and $\mathbf{E}$, with high probability, the estimates for all iterations lie in a compact ball around the true value of the parameters. Then, in Section 3.3, we show that asymptotically, with $\log(p_1 p_2)/n \to 0$, while keeping the family-wise type I error under some pre-specified level, the screening step correctly identifies the true support set for each of the regressions, based upon which the iterative algorithm is provided with an initializer that is close to the true value of the parameters. Finally, in Section 3.4, we provide sufficient conditions for both directed and undirected edges to be identifiable (estimable) for a multi-layered network. Throughout this section, to distinguish the estimates from the true values, we use $B^*$ and $\Theta_\epsilon^*$ to denote the true values.

²We actually observe $\mathbf{X}$ and $\mathbf{Y}$, where $\mathbf{Y}$ is given by the corresponding set of realizations of $X$ and $\epsilon$ based on the model.

3.1 Convergence of the Iterative Algorithm

In this subsection, we prove that the proposed block relaxation algorithm converges to a stationary point for any fixed set of data, provided that the estimates for all iterations lie in a compact ball around the true value of the parameters. This requirement is shown to be satisfied with high probability in Section 3.2.

Decompose the optimization problem in (3) as follows:
$$\min_{\substack{B \in \mathbb{R}^{p_1 \times p_2} \\ \Theta_\epsilon \in \mathbb{S}_{++}^{p_2 \times p_2}}} f(B, \Theta_\epsilon) \equiv f_0(B, \Theta_\epsilon) + f_1(B) + f_2(\Theta_\epsilon),$$
where
$$f_0(B, \Theta_\epsilon) = \frac{1}{n}\sum_{j=1}^{p_2}\sum_{i=1}^{p_2}\sigma_\epsilon^{ij}(\mathbf{Y}_i - \mathbf{X}B_i)'(\mathbf{Y}_j - \mathbf{X}B_j) - \log\det\Theta_\epsilon = \mathrm{tr}(S\Theta_\epsilon) - \log\det\Theta_\epsilon,$$
$$f_1(B) = \lambda_n\|B\|_1, \qquad f_2(\Theta_\epsilon) = \rho_n\|\Theta_\epsilon\|_{1,\mathrm{off}},$$
and $\mathbb{S}_{++}^{p_2 \times p_2}$ is the collection of $p_2 \times p_2$ symmetric positive definite matrices. Further, denote the limit point (if there is any) of $\{\widehat{B}^{(k)}\}$ and $\{\widehat{\Theta}_\epsilon^{(k)}\}$ by $B^\infty = \lim_{k\to\infty}\widehat{B}^{(k)}$ and $\Theta_\epsilon^\infty = \lim_{k\to\infty}\widehat{\Theta}_\epsilon^{(k)}$, respectively.

Definition 1 (stationary point; Tseng, 2001, p. 479). A point $z$ is a stationary point of $f$ if $z \in \mathrm{dom}(f)$ and $f'(z; d) \ge 0$ for every direction $d = (d_1, \cdots, d_N)$, where $d_t$ is the $t$th coordinate block.