Noname manuscript No. (will be inserted by the editor)

Communication-Efficient Algorithms for Decentralized and Stochastic Optimization

Guanghui Lan · Soomin Lee · Yi Zhou

The date of receipt and acceptance should be inserted later

Abstract We present a new class of decentralized first-order methods for nonsmooth and stochastic optimization problems defined over multiagent networks. Considering that communication is a major bottleneck in decentralized optimization, our main goal in this paper is to develop algorithmic frameworks which can significantly reduce the number of inter-node communications. We first propose a decentralized primal-dual method which can find an ǫ-solution both in terms of functional optimality gap and feasibility residual in O(1/ǫ) inter-node communication rounds when the objective functions are convex and the local primal subproblems are solved exactly. Our major contribution is to present a new class of decentralized primal-dual type algorithms, namely the decentralized communication sliding (DCS) methods, which can skip the inter-node communications while agents solve the primal subproblems iteratively through linearizations of their local objective functions. By employing DCS, agents can still find an ǫ-solution in O(1/ǫ) (resp., O(1/√ǫ)) communication rounds for general convex functions (resp., strongly convex functions), while maintaining the O(1/ǫ²) (resp., O(1/ǫ)) bound on the total number of intra-node subgradient evaluations. We also present a stochastic counterpart for these algorithms, denoted by SDCS, for solving stochastic optimization problems whose objective function cannot be evaluated exactly.
In comparison with existing results for decentralized nonsmooth and stochastic optimization, we can reduce the total number of inter-node communication rounds by orders of magnitude while still maintaining the optimal complexity bounds on intra-node stochastic subgradient evaluations. The bounds on the (stochastic) subgradient evaluations are actually comparable to those required for centralized nonsmooth and stochastic optimization under certain conditions on the target accuracy.

Keywords: decentralized optimization, decentralized machine learning, communication efficient, stochastic programming, nonsmooth functions, primal-dual method, complexity

AMS 2000 subject classification: 90C25, 90C06, 90C22, 49M37, 93A14, 90C15

This work was funded by National Science Foundation grants 1637473 and 1637474, and Office of Naval Research grant N00014-16-1-2802. Department of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 30332. (E-mail: [email protected], [email protected], [email protected])

1 Introduction

Decentralized optimization problems defined over complex multiagent networks are ubiquitous in signal processing, machine learning, control, and other areas in science and engineering (see e.g. [47,21,50,15]). In this paper, we consider the following decentralized optimization problem, which is cooperatively solved by the network of m agents:

    min_x f(x) := Σ_{i=1}^m f_i(x)    (1.1)
    s.t. x ∈ X, X := ∩_{i=1}^m X_i,

where f_i : X_i → R is a convex and possibly nonsmooth objective function of agent i satisfying

    (µ/2)‖x − y‖² ≤ f_i(x) − f_i(y) − ⟨f_i′(y), x − y⟩ ≤ M‖x − y‖, ∀x, y ∈ X_i,    (1.2)

for some M, µ ≥ 0 and f_i′(y) ∈ ∂f_i(y), where ∂f_i(y) denotes the subdifferential of f_i at y, and X_i ⊆ R^d is a closed convex constraint set of agent i. Note that f_i and X_i are private and only known to agent i. Throughout the paper, we assume the feasible set X is nonempty.
In this paper, we also consider the situation where one can only have access to noisy first-order information (function values and subgradients) of the functions f_i, i = 1,...,m (see [41,23]). This happens, for example, when the functions f_i are given in the form of expectation, i.e.,

    f_i(x) := E_{ξ_i}[F_i(x; ξ_i)],    (1.3)

where the random variable ξ_i models a source of uncertainty and the distribution P(ξ_i) is not known in advance. As a special case of (1.3), f_i may be given as the summation of many components, i.e.,

    f_i(x) := Σ_{j=1}^l f_i^j(x),    (1.4)

where l ≥ 1 is some large number. Stochastic optimization problems of this type have great potential for applications in data analysis, especially in machine learning. In particular, problem (1.3) corresponds to the minimization of generalized risk and is particularly useful for dealing with online (streaming) data distributed over a network, while problem (1.4) aims at the collaborative minimization of empirical risk. Currently the dominant approach is to collect all agents' private data on a server (or cluster) and to apply centralized machine learning techniques. However, this centralization scheme would require agents to submit their private data to the service provider without much control over how the data will be used, in addition to incurring a high setup cost related to the transmission of data to the service provider. Decentralized optimization provides a viable approach to deal with these data privacy related issues.

In these decentralized and stochastic optimization problems, each network agent i is associated with the local objective function f_i(x), and all agents intend to cooperatively minimize the system objective f(x) as the sum of all local objectives f_i in the absence of full knowledge about the global problem and network structure. A necessary feature in decentralized optimization is, therefore, that the agents must communicate with their neighboring agents to propagate the distributed information to every location in the network.
One of the most well-studied techniques in decentralized optimization is the class of subgradient based methods (see e.g., [39,35,57,37,14,27,52]), where at each step a local subgradient is taken at each node, followed by communication with neighboring agents. Although the subgradient computation at each step can be inexpensive, these methods usually require a large number of iterations until convergence. Considering that one iteration in decentralized optimization is equivalent to one communication round among agents, this can incur a significant latency. Modern CPUs can read and write memory at over 10 GB per second, whereas communication over TCP/IP is about 10 MB per second. Therefore, the gap between intra-node computation and inter-node communication is about 3 orders of magnitude. The communication start-up cost itself is also not negligible, as it usually takes a few milliseconds.

Another well-known type of decentralized algorithm relies on dual methods (see e.g., [4,62,10]), where at each step, for a fixed dual variable, the primal variables are solved to minimize some local Lagrangian related function, and then the dual variables associated with the consistency constraints are updated accordingly. Although these dual type methods usually require fewer iterations (hence, fewer communication rounds) than the subgradient methods until convergence, one crucial problem of these methods is that the local subproblem associated with each agent cannot be solved efficiently in many cases.

The main goal of this paper is, therefore, to develop dual based decentralized algorithms for solving (1.1) that are communication efficient and whose local subproblems can be easily solved by each agent through the utilization of (noisy) first-order information of f_i. More specifically, we will provide a theoretical understanding of how many inter-node communications and intra-node (stochastic) subgradient evaluations of f_i are required in order to find a certain approximate solution of (1.1).
1.1 Notation and Terminologies

Let R denote the set of real numbers. All vectors are viewed as column vectors, and for a vector x ∈ R^d, we use x⊤ to denote its transpose. For a stacked vector of x_i's, we often use (x_1,...,x_m) to represent the column vector [x_1⊤,...,x_m⊤]⊤. We denote by 0 and 1 the vectors of all zeros and all ones, whose dimensions vary with the context. The cardinality of a set S is denoted by |S|. We use I_d to denote the identity matrix in R^{d×d}. We use A ⊗ B for matrices A ∈ R^{n1×n2} and B ∈ R^{m1×m2} to denote their Kronecker product of size R^{n1m1×n2m2}. For a matrix A ∈ R^{n×m}, we use A_{ij} to denote the entry in the i-th row and j-th column. For any m ≥ 1, the set of integers {1,...,m} is denoted by [m].

1.2 Problem Formulation

Consider a multiagent network system whose communication is governed by an undirected graph G = (N, E), where N = [m] indexes the set of agents, and E ⊆ N × N represents the pairs of communicating agents. If there exists an edge from agent i to j, which we denote by (i,j), agent i may send its information to agent j and vice versa. Thus, each agent i ∈ N can directly receive (resp., send) information only from (resp., to) the agents in its neighborhood

    N_i = {j ∈ N | (i,j) ∈ E} ∪ {i},    (1.5)

where we assume that there always exists a self-loop (i,i) for all agents i ∈ N. Then, the associated Laplacian L ∈ R^{m×m} of G is L := D − A, where D is the diagonal degree matrix, and A ∈ R^{m×m} is the adjacency matrix with the property that A_{ij} = 1 if and only if (i,j) ∈ E and i ≠ j, i.e.,

    L_{ij} = |N_i| − 1   if i = j,
    L_{ij} = −1          if i ≠ j and (i,j) ∈ E,
    L_{ij} = 0           otherwise.    (1.6)

We consider a reformulation of problem (1.1) which will be used in the development of our decentralized algorithms. We introduce an individual copy x_i of the decision variable x for each agent i ∈ N and impose the constraint x_i = x_j for all pairs (i,j) ∈ E.
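As a concrete illustration of (1.6), the sketch below (a minimal Python example; the 4-node ring graph is an arbitrary choice, not one from the paper) assembles L = D − A from an edge list and checks the two properties used repeatedly in the sequel: L1 = 0 and x⊤Lx = Σ_{(i,j)∈E}(x_i − x_j)² ≥ 0.

```python
def laplacian(m, edges):
    """Graph Laplacian per (1.6): diagonal = node degree (= |N_i| - 1, since N_i
    includes the self-loop), off-diagonal = -1 on edges, 0 otherwise."""
    L = [[0.0] * m for _ in range(m)]
    for i, j in edges:
        L[i][j] -= 1.0
        L[j][i] -= 1.0
        L[i][i] += 1.0
        L[j][j] += 1.0
    return L

def matvec(L, x):
    return [sum(Lij * xj for Lij, xj in zip(row, x)) for row in L]

# 4-node ring: 0-1, 1-2, 2-3, 3-0
L = laplacian(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
ones = [1.0] * 4
print(matvec(L, ones))  # the all-ones vector spans the null space: [0.0, 0.0, 0.0, 0.0]

# x^T L x equals the sum of (x_i - x_j)^2 over edges, hence L is positive semidefinite
x = [1.0, 2.0, 4.0, 8.0]
quad = sum(xi * vi for xi, vi in zip(x, matvec(L, x)))
print(quad)  # (1-2)^2 + (2-4)^2 + (4-8)^2 + (8-1)^2 = 70.0
```

The quadratic form vanishes exactly on consensual vectors, which is the algebraic fact behind the reformulation introduced next.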
The transformed problem can be written compactly by using the Laplacian matrix L:

    min_x F(x) := Σ_{i=1}^m f_i(x_i)    (1.7)
    s.t. Lx = 0, x_i ∈ X_i, for all i = 1,...,m,

where x = (x_1,...,x_m) ∈ X_1 × ... × X_m, F : X_1 × ... × X_m → R, and L = L ⊗ I_d ∈ R^{md×md}. The constraint Lx = 0 is a compact way of writing x_i = x_j for all agents i and j which are connected by an edge. By construction, L is symmetric positive semidefinite and its null space coincides with the "agreement" subspace, i.e., L1 = 0 and 1⊤L = 0. To ensure each node gets information from every other node, we need the following assumption.

Assumption 1 The graph G is connected.

Under Assumption 1, problems (1.1) and (1.7) are equivalent. We let Assumption 1 be a blanket assumption for the rest of the paper.

We next consider a reformulation of problem (1.7) as a saddle point problem. By the method of Lagrange multipliers, problem (1.7) is equivalent to the following saddle point problem:

    min_{x∈X^m} [ F(x) + max_{y∈R^{md}} ⟨Lx, y⟩ ],    (1.8)

where X^m := X_1 × ... × X_m and y = (y_1,...,y_m) ∈ R^{md} are the Lagrange multipliers associated with the constraint Lx = 0. We assume that there exists an optimal solution x* ∈ X^m of (1.7) and that there exists y* ∈ R^{md} such that (x*, y*) is a saddle point of (1.8).

1.3 Literature review

Decentralized optimization has been extensively studied in recent years due to the emergence of large-scale networks. The seminal work on distributed optimization [60,59] has been followed by distributed incremental (sub)gradient methods and proximal methods [36,48,2,61], and more recently the incremental aggregated gradient method and its proximal variants [18,3,26]. All of these incremental methods are not fully decentralized, in the sense that they require a special star network topology in which the existence of a central authority is necessary for operation. To consider a more general network topology, a decentralized subgradient algorithm was first proposed in [39], and further studied in many other works (see e.g.
[14,65,35,37,56]). These algorithms are intuitive and simple but very slow, due to the fact that they need to use diminishing stepsize rules. All of these methods require O(1/ǫ²) inter-node communications and intra-node gradient computations in order to obtain an ǫ-optimal solution. First-order algorithms by Shi et al. [52,53] use constant stepsize rules with backtracking and require O(1/ǫ) communications when the objective function in (1.1) is a relatively simple convex function, but require both smoothness and strong convexity in order to achieve a linear convergence rate. Recently, it has been shown in [45,38] that a linear rate of convergence can be obtained for minimizing "unconstrained" smooth and strongly convex problems. These methods do not apply to the general nonsmooth and stochastic optimization problems to be studied in this work.

Another well-known type of decentralized algorithm is based on dual methods, including the distributed dual decomposition [55] and the decentralized alternating direction method of multipliers (ADMM) [51,28,62]. The decentralized ADMM [51,28] has been shown to require O(log 1/ǫ) communications in order to obtain an ǫ-optimal solution under the no-constraint, strong convexity and smoothness assumptions, while [62] has been shown to require O(1/ǫ) communications for relatively simple convex functions f_i (see also [20] for the application of the mirror-prox method to these problems). These dual-based methods have been further studied via proximal-gradient techniques [9,8]. However, the local Lagrangian minimization problem associated with each agent cannot be solved efficiently in many cases, especially when the problem is constrained. Second-order approximation methods [29,30] have been studied in order to handle this issue, but due to the nature of these methods, differentiability of the objective function is necessary in this case.
There exist some distributed methods that only assume smoothness of the objective functions, but actually require more communication rounds than gradient computations. For example, the distributed Nesterov's accelerated gradient method [22] employs multi-consensus in the inner loop. Although this method requires O(1/√ǫ) intra-node gradient computations, inter-node communications must increase at a rate of O(log(k)) as the iteration count k increases. Similarly, the proximal gradient method with the adapt-then-combine (ATC) multi-consensus strategy and Nesterov's acceleration, under the assumption of bounded and Lipschitz gradients [11], is shown to need O(1/√ǫ) intra-node gradient computations, but inter-node communications must increase at a rate of O(k). Due to the nature of decentralized networked systems, the time required for inter-node communications is higher by a few orders of magnitude than that for intra-node computations. Multi-consensus schemes in nested loop algorithms do not account for this feature of networked systems and hence are less desirable.

Decentralized stochastic optimization methods can be useful when the noisy gradient information of the functions f_i, i = 1,...,m, in (1.1) is only available or easier to compute. Stochastic first-order methods for problem (1.1) are studied in [14,49,35], all of which require O(1/ǫ²) inter-node communications and intra-node gradient computations to obtain an ǫ-optimal solution. The multiagent mirror descent method for decentralized stochastic optimization [46] achieves an O(1/ǫ) complexity bound when the objective functions are strongly convex. An alternative form of mirror descent in the multiagent setting was proposed in [63] with an asymptotic convergence result. On a broader scale, decentralized stochastic optimization was also considered in the case of time-varying objective functions in the recent work [54,58].
All these previous works in decentralized stochastic optimization suffer from high communication costs due to the coupled scheme for stochastic subgradient evaluation and communication, i.e., each evaluation of a stochastic subgradient incurs one round of communication.

1.4 Contribution of the paper

The main interest of this paper is to develop communication-efficient decentralized algorithms for solving problem (1.7) in which the f_i's are convex or strongly convex, but not necessarily smooth, and the local subproblem associated with each agent is nontrivial to solve. Our contributions in this paper are listed below.

Firstly, we propose a decentralized primal-dual framework which involves only two inter-node communications per iteration. The proposed method can find an ǫ-optimal solution, both in terms of the primal optimality gap and feasibility residual, in O(1/ǫ) communication rounds when the objective functions are convex and the local proximal projection subproblems can be solved exactly. This algorithm serves as a benchmark in terms of the communication cost for our subsequent development.

Secondly, we introduce a new decentralized primal-dual type method, called decentralized communication sliding (DCS), where the agents can skip communications while solving their local subproblems iteratively through successive linearizations of their local objective functions. We show that agents can still find an ǫ-optimal solution in O(1/ǫ) (resp., O(1/√ǫ)) communication rounds while maintaining the O(1/ǫ²) (resp., O(1/ǫ)) bound on the total number of intra-node subgradient evaluations when the objective functions are general convex (resp., strongly convex). The bounds on the subgradient evaluations are actually comparable to the optimal complexity bounds required for centralized nonsmooth optimization under certain conditions on the target accuracy, and hence are not improvable in general.
Thirdly, we present a stochastic decentralized communication sliding method, denoted by SDCS, for solving stochastic optimization problems, and show complexity bounds similar to those of DCS on the total number of required communication rounds and stochastic subgradient evaluations. In particular, only O(1/ǫ) (resp., O(1/√ǫ)) communication rounds are required while agents perform up to O(1/ǫ²) (resp., O(1/ǫ)) stochastic subgradient evaluations for general convex (resp., strongly convex) functions. Only requiring access to a stochastic subgradient at each iteration, SDCS is particularly efficient for solving problems with f_i given in the form of (1.3) and (1.4). In the former case, SDCS requires only one realization of the random variable at each iteration and provides a communication-efficient way to deal with streaming data and decentralized machine learning. In the latter case, each iteration of SDCS requires only one randomly selected component, leading to up to a factor of O(l) savings on the total number of subgradient computations over DCS.

To the best of our knowledge, this is the first time that these communication sliding algorithms, and the aforementioned separate complexity bounds on communication rounds and (stochastic) subgradient evaluations, are presented in the literature.

1.5 Organization of the paper

This paper is organized as follows. In Section 2, we provide some preliminaries on distance generating functions and prox-functions, as well as the definition of gap functions, which will be used as termination criteria for our primal-dual methods. In Section 3, we present a new decentralized primal-dual method for solving problem (1.8). In Section 4, we present the communication sliding algorithms for the case when exact subgradients of the f_i's are available, and establish their convergence properties for the general and strongly convex cases. In Section 5, we generalize the algorithms of Section 4 to stochastic problems. The proofs of the lemmas in Sections 3-5 are provided in Section 6.
Finally, we provide some concluding remarks in Section 7.

2 Preliminaries

In this section, we provide a brief review of the prox-function, and define appropriate gap functions which will be used for the convergence analysis and termination criteria of our primal-dual algorithms.

2.1 Distance Generating Function and Prox-function

In this subsection, we define the concept of the prox-function, which is also known as the proximity control function or Bregman distance function [5]. The prox-function has played an important role in the recent development of first-order methods for convex programming as a substantial generalization of the Euclidean projection. Unlike the standard projection operator Π_U[x] := argmin_{u∈U} ‖x − u‖², which is inevitably tied to the Euclidean geometry, the prox-function can be flexibly tailored to the geometry of a constraint set U.

For any convex set U equipped with an arbitrary norm ‖·‖_U, we say that a function ω : U → R is a distance generating function with modulus ν > 0 with respect to ‖·‖_U if ω is continuously differentiable and strongly convex with modulus ν with respect to ‖·‖_U, i.e.,

    ⟨∇ω(x) − ∇ω(u), x − u⟩ ≥ ν‖x − u‖²_U, ∀x, u ∈ U.    (2.1)

The prox-function, or Bregman distance function, induced by ω is given by

    V(x,u) ≡ V_ω(x,u) := ω(u) − [ω(x) + ⟨∇ω(x), u − x⟩].    (2.2)

It then follows from the strong convexity of ω that V(x,u) ≥ (ν/2)‖x − u‖²_U, ∀x, u ∈ U.

We now assume that the individual constraint sets X_i for each agent in problem (1.1) are equipped with norms ‖·‖_{X_i}, and that their associated prox-functions are given by V_i(·,·). Moreover, we assume that each V_i(·,·) shares the same strong convexity modulus ν = 1, i.e.,

    V_i(x_i,u_i) ≥ (1/2)‖x_i − u_i‖²_{X_i}, ∀x_i, u_i ∈ X_i, i = 1,...,m.    (2.3)

We define the norm associated with the primal feasible set X^m = X_1 × ... × X_m of (1.8) as follows:¹

    ‖x‖² ≡ ‖x‖²_{X^m} := Σ_{i=1}^m ‖x_i‖²_{X_i},    (2.4)

where x = (x_1,...,x_m) ∈ X^m for any x_i ∈ X_i.
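As a concrete instance of an individual prox-function V_i satisfying (2.2)-(2.3), the sketch below uses the entropy distance generating function ω(u) = Σ_j u_j log u_j on the probability simplex (an illustrative choice of ours, not one made in the paper), which is strongly convex with modulus ν = 1 with respect to ‖·‖₁. The induced Bregman distance is the KL divergence, and the lower bound V(x,u) ≥ ½‖x − u‖₁² is exactly Pinsker's inequality.

```python
import math

def kl_prox(x, u):
    """Bregman distance (2.2) induced by omega(u) = sum_j u_j*log(u_j):
    V(x, u) = KL(u || x) for x, u in the interior of the simplex."""
    return sum(uj * math.log(uj / xj) for uj, xj in zip(u, x))

x = [0.5, 0.3, 0.2]
u = [0.2, 0.2, 0.6]
v = kl_prox(x, u)
# (nu/2) * ||x - u||_1^2 with nu = 1, as in (2.3) but w.r.t. the l1-norm
lower = 0.5 * sum(abs(a - b) for a, b in zip(x, u)) ** 2
print(v >= lower)  # True
```

This is the kind of non-Euclidean geometry the V_i's are allowed to carry; the Euclidean choice ω(u) = ½‖u‖² (for which V(x,u) = ½‖x − u‖²) is the simplest special case.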
Therefore, the corresponding prox-function V(·,·) can be defined as

    V(x,u) := Σ_{i=1}^m V_i(x_i,u_i), ∀x, u ∈ X^m.    (2.5)

Note that by (2.3) and (2.4), it can easily be seen that

    V(x,u) ≥ (1/2)‖x − u‖², ∀x, u ∈ X^m.    (2.6)

Throughout the paper, we endow the dual space where the multipliers y of (1.8) reside with the standard Euclidean norm ‖·‖₂, since the feasible region of y is unbounded. For simplicity, we often write ‖y‖ instead of ‖y‖₂ for a dual multiplier y ∈ R^{md}.

2.2 Gap Functions: Termination Criteria

Given a pair of feasible solutions z = (x,y) and z̄ = (x̄,ȳ) of (1.8), we define the primal-dual gap function Q(z; z̄) by

    Q(z; z̄) := F(x) + ⟨Lx, ȳ⟩ − [F(x̄) + ⟨Lx̄, y⟩].    (2.7)

Sometimes we also use the notations Q(z; z̄) := Q(x,y; x̄,ȳ) or Q(z; z̄) := Q(x,y; z̄) = Q(z; x̄,ȳ). One can easily see that Q(z*; z) ≤ 0 and Q(z; z*) ≥ 0 for all z ∈ X^m × R^{md}, where z* = (x*,y*) is a saddle point of (1.8). For compact sets X^m ⊂ R^{md}, Y ⊂ R^{md}, the gap function

    sup_{z̄∈X^m×Y} Q(z; z̄)    (2.8)

measures the accuracy of the approximate solution z to the saddle point problem (1.8).

¹ We can define the norm associated with X^m in a more general way, e.g., ‖x‖² := Σ_{i=1}^m p_i‖x_i‖²_{X_i}, ∀x = (x_1,...,x_m) ∈ X^m, for some p_i > 0, i = 1,...,m. Accordingly, the prox-function V(·,·) can be defined as V(x,u) := Σ_{i=1}^m p_i V_i(x_i,u_i), ∀x, u ∈ X^m. This setting gives us flexibility to choose the p_i's based on the information of the individual X_i's, and the possibility to further refine the convergence results.

However, the saddle point formulation (1.8) of our problem of interest (1.1) may have an unbounded feasible set. We adopt the perturbation-based termination criterion of Monteiro and Svaiter [31,32,33] and propose a modified version of the gap function in (2.8). More specifically, we define

    g_Y(s,z) := sup_{ȳ∈Y} Q(z; x*, ȳ) − ⟨s, ȳ⟩,    (2.9)

for any closed set Y ⊆ R^{md}, z ∈ X^m × R^{md} and s ∈ R^{md}. If Y = R^{md}, we omit the subscript Y and simply use the notation g(s,z).
This perturbed gap function allows us to bound the objective function value and the feasibility separately. We first define the following terminology.

Definition 1 A point x ∈ X^m is called an (ǫ,δ)-solution of (1.7) if

    F(x) − F(x*) ≤ ǫ and ‖Lx‖ ≤ δ.    (2.10)

We say that x has primal residual ǫ and feasibility residual δ.

Similarly, a stochastic (ǫ,δ)-solution of (1.7) can be defined as a point x̂ ∈ X^m s.t. E[F(x̂) − F(x*)] ≤ ǫ and E[‖Lx̂‖] ≤ δ for some ǫ, δ > 0. Note that for problem (1.7), the feasibility residual measures the disagreement among the local copies x_i, for i ∈ N.

In the following proposition, we adopt a result from [44, Proposition 2.1] to describe the relationship between the perturbed gap function (2.9) and approximate solutions to problem (1.7). Although the proposition was originally developed for deterministic cases, its extension to stochastic cases is straightforward.

Proposition 1 For any Y ⊂ R^{md} such that 0 ∈ Y, if g_Y(Lx,z) ≤ ǫ < ∞ and ‖Lx‖ ≤ δ, where z = (x,y) ∈ X^m × R^{md}, then x is an (ǫ,δ)-solution of (1.7). In particular, when Y = R^{md}, for any s such that g(s,z) ≤ ǫ < ∞ and ‖s‖ ≤ δ, we always have s = Lx.

3 Decentralized Primal-Dual

In this section, we describe an algorithmic framework for solving the saddle point problem (1.8) in a decentralized fashion. The basic scheme of the decentralized primal-dual method in Algorithm 1 is similar to Chambolle and Pock's primal-dual method in [7]. The primal-dual method in [7] is an efficient and simple method for solving saddle point problems, which can be viewed as a refined version of the primal-dual hybrid gradient method by Arrow et al. [1]. However, its design and analysis are more closely related to a few recent important works which established the O(1/k) rate of convergence for solving bilinear saddle point problems (e.g., [43,40,34,19]). Recently, Chen, Lan and Ouyang [12] incorporated Bregman distances into the primal-dual method together with an acceleration step.
Dang and Lan [13], and Chambolle and Pock [6], discussed improved algorithms for problems with strongly convex primal or dual functions. Randomized versions of the primal-dual method have been discussed by Zhang and Xiao [64], and Dang and Lan [13]. Lan and Zhou [26] revealed some inherent relationship between Nesterov's accelerated gradient method and the primal-dual method, and presented an optimal randomized incremental gradient method.

Our main goals in this section are to: 1) adapt the primal-dual framework to the decentralized setting; and 2) provide complexity results (number of communication rounds and subgradient computations) separately in terms of the primal functional optimality gap and the constraint (or consistency) violation. It should be stressed that the main contributions of this paper lie in the development of the decentralized communication sliding algorithms (see Sections 4 and 5). However, introducing the basic decentralized primal-dual method here will help us better explain these methods and provide us with a certain benchmark in terms of the communication cost.

3.1 The Algorithm

The primal-dual algorithm in Algorithm 1 can be decentralized due to the structure of the Laplacian L. Recalling that x = (x_1,...,x_m) and y = (y_1,...,y_m), each agent i's local update rule can be separately written as in Algorithm 2.

Algorithm 1 Decentralized primal-dual
Let x⁰ = x⁻¹ ∈ X^m and y⁰ ∈ R^{md}, the nonnegative parameters {α_k}, {τ_k} and {η_k}, and the weights {θ_k} be given.
for k = 1,...,N do
  Update z^k = (x^k, y^k) according to
    x̃^k = α_k(x^{k−1} − x^{k−2}) + x^{k−1}    (3.1)
    y^k = argmin_{y∈R^{md}} ⟨−Lx̃^k, y⟩ + (τ_k/2)‖y − y^{k−1}‖²    (3.2)
    x^k = argmin_{x∈X^m} ⟨Ly^k, x⟩ + F(x) + η_k V(x^{k−1}, x)    (3.3)
end for
return z̄^N = (Σ_{k=1}^N θ_k)^{−1} Σ_{k=1}^N θ_k z^k.

Algorithm 2 Decentralized primal-dual update for each agent i
Let x_i⁰ = x_i⁻¹ ∈ X_i and y_i⁰ ∈ R^d for i ∈ [m], the nonnegative parameters {α_k}, {τ_k} and {η_k}, and the weights {θ_k} be given.
for k = 1,...,N do
  Update z_i^k = (x_i^k, y_i^k) according to
    x̃_i^k = α_k(x_i^{k−1} − x_i^{k−2}) + x_i^{k−1}    (3.4)
    v_i^k = Σ_{j∈N_i} L_{ij} x̃_j^k    (3.5)
    y_i^k = y_i^{k−1} + (1/τ_k) v_i^k    (3.6)
    w_i^k = Σ_{j∈N_i} L_{ij} y_j^k    (3.7)
    x_i^k = argmin_{x_i∈X_i} ⟨w_i^k, x_i⟩ + f_i(x_i) + η_k V_i(x_i^{k−1}, x_i)    (3.8)
end for
return z̄^N = (Σ_{k=1}^N θ_k)^{−1} Σ_{k=1}^N θ_k z^k

Each agent i maintains two local sequences, namely, the primal estimates {x_i^k} and the dual variables {y_i^k}. The element x_i^k can be seen as agent i's estimate of the decision variable x at time k, while y_i^k is a subvector of all dual variables y^k associated with agent i's consistency constraints with its neighbors. More specifically, each primal estimate x_i⁰ is locally initialized from some arbitrary point in X_i, and x_i⁻¹ is set to the same value. At each time step k ≥ 1, each agent i ∈ N computes a local prediction x̃_i^k using the two previous primal estimates (see (3.4)), and broadcasts it to all of the nodes in its neighborhood, i.e., to all agents j ∈ N_i. In (3.5)-(3.6), each agent i calculates the neighborhood disagreement v_i^k using the messages received from agents in N_i, and updates the dual subvector y_i^k. Then, another round of communication occurs in (3.7) to broadcast the updated dual variables and calculate w_i^k. Therefore, each iteration k involves two communication rounds, one for the primal estimates and the other for the dual variables. Lastly, each agent i solves the proximal projection subproblem (3.8). Note that the description of the algorithm is only conceptual at this moment, since we have not yet specified the parameters {α_k}, {τ_k}, {η_k} and {θ_k}. We will later instantiate this generic algorithm when we state its convergence properties.

3.2 Convergence of the Decentralized Primal-dual Method

For the sake of simplicity, we focus only on the case when the f_i's are general convex functions in this section. We defer the discussion of the case when the f_i's are strongly convex to Sections 4 and 5 on the decentralized communication sliding algorithms.
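To make the per-agent updates (3.4)-(3.8) concrete, here is a self-contained numerical sketch, not the paper's implementation: it uses Euclidean prox-functions V_i(x,u) = ½(x − u)², quadratic local objectives f_i(x) = ½(x − a_i)² (so (3.8) has a closed form), a 3-agent path graph, and the constant stepsizes α_k = θ_k = 1, η_k = 2‖L‖, τ_k = ‖L‖ analyzed later in this section; all of these are illustrative choices.

```python
# Path graph 1-2-3: Laplacian per (1.6); spectral norm ||L|| = 3 (eigenvalues 0, 1, 3)
L = [[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]
a = [0.0, 3.0, 6.0]            # f_i(x) = 0.5*(x - a_i)^2; consensus optimum is x = 3
norm_L = 3.0
eta, tau = 2 * norm_L, norm_L  # constant stepsizes with alpha_k = theta_k = 1

m, N = 3, 20000
x_prev = [0.0] * m             # x^{k-2}
x_cur = [0.0] * m              # x^{k-1}, with x^0 = x^{-1}
y = [0.0] * m
x_sum = [0.0] * m

for k in range(N):
    xt = [2 * x_cur[i] - x_prev[i] for i in range(m)]                 # (3.4), alpha = 1
    v = [sum(L[i][j] * xt[j] for j in range(m)) for i in range(m)]    # (3.5)
    y = [y[i] + v[i] / tau for i in range(m)]                         # (3.6)
    w = [sum(L[i][j] * y[j] for j in range(m)) for i in range(m)]     # (3.7)
    # (3.8) with Euclidean prox: argmin <w_i, x> + 0.5*(x - a_i)^2 + (eta/2)*(x - x_cur_i)^2
    x_new = [(a[i] + eta * x_cur[i] - w[i]) / (1 + eta) for i in range(m)]
    x_prev, x_cur = x_cur, x_new
    for i in range(m):
        x_sum[i] += x_new[i]

x_bar = [s / N for s in x_sum]  # ergodic average over all iterates (theta_k = 1)
print(x_bar)                    # each entry is close to the consensus optimum 3.0
```

The two matrix-vector products per iteration correspond to the two communication rounds; everything else in the loop is purely local to each agent.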
In the following lemma, we present estimates on the gap function defined in (2.7), together with conditions on the parameters {α_k}, {τ_k}, {η_k} and {θ_k} which will be used to provide the rate of convergence for the decentralized primal-dual method. The proof of this lemma can be found in Section 6.

Lemma 1 Let the iterates z^k = (x^k, y^k), k = 1,...,N, be generated by Algorithm 1 and z̄^N be defined as z̄^N := (Σ_{k=1}^N θ_k)^{−1} Σ_{k=1}^N θ_k z^k. Assume that the parameters {α_k}, {τ_k}, {η_k} and {θ_k} in Algorithm 1 satisfy

    θ_k η_k ≤ θ_{k−1} η_{k−1},  k = 2,...,N,    (3.9)
    α_k θ_k = θ_{k−1},          k = 2,...,N,    (3.10)
    θ_k τ_k ≤ θ_{k−1} τ_{k−1},  k = 2,...,N,    (3.11)
    α_k ‖L‖² ≤ η_{k−1} τ_k,     k = 2,...,N,    (3.12)
    θ_1 τ_1 = θ_N τ_N,                          (3.13)
    θ_N ‖L‖² ≤ θ_1 τ_1 η_N.                     (3.14)

Then, for any z := (x,y) ∈ X^m × R^{md}, we have

    Q(z̄^N; z) ≤ (Σ_{k=1}^N θ_k)^{−1} [ θ_1 η_1 V(x⁰,x) + (θ_1 τ_1/2)‖y⁰‖² + ⟨s, y⟩ ],    (3.15)

where Q is defined in (2.7) and s is defined as

    s := θ_N L(x^N − x^{N−1}) + θ_1 τ_1 (y^N − y⁰).    (3.16)

Furthermore, for any saddle point (x*, y*) of (1.8), we have

    (θ_N/2)(1 − ‖L‖²/(η_N τ_N)) max{ η_N‖x^{N−1} − x^N‖², τ_N‖y* − y^N‖² }
        ≤ θ_1 η_1 V(x⁰,x*) + (θ_1 τ_1/2)‖y* − y⁰‖².    (3.17)

In the following theorem, we provide a specific selection of {α_k}, {τ_k}, {η_k} and {θ_k} satisfying (3.9)-(3.14). Using Lemma 1 and Proposition 1, we also establish the complexity of the decentralized primal-dual method for computing an (ǫ,δ)-solution of problem (1.7).

Theorem 1 Let x* be a saddle point of (1.7), and suppose that {α_k}, {τ_k}, {η_k} and {θ_k} are set to

    α_k = θ_k = 1, η_k = 2‖L‖, and τ_k = ‖L‖, ∀k = 1,...,N.    (3.18)

Then, for any N ≥ 1, we have

    F(x̄^N) − F(x*) ≤ (‖L‖/N) [ 2V(x⁰,x*) + (1/2)‖y⁰‖² ]    (3.19)

and

    ‖Lx̄^N‖ ≤ (2‖L‖/N) [ 3√(V(x⁰,x*)) + 2‖y* − y⁰‖ ],    (3.20)

where x̄^N = (1/N) Σ_{k=1}^N x^k.

Proof It is easy to check that (3.18) satisfies conditions (3.9)-(3.14). Therefore, by plugging these values into (3.15), we have

    Q(z̄^N; x*, y) ≤ (1/N) [ 2‖L‖V(x⁰,x*) + (‖L‖/2)‖y⁰‖² ] + (1/N)⟨s, y⟩.    (3.21)
Letting s^N := (1/N)s, then from (3.16) and (3.17) we have

    ‖s^N‖ ≤ (‖L‖/N) [ ‖x^N − x^{N−1}‖ + ‖y^N − y*‖ + ‖y* − y⁰‖ ]
          ≤ (‖L‖/N) [ 3√(4V(x⁰,x*) + ‖y* − y⁰‖²) + ‖y* − y⁰‖ ].

Furthermore, by (3.21) we have

    g(s^N, z̄^N) ≤ (‖L‖/N) [ 2V(x⁰,x*) + (1/2)‖y⁰‖² ].

The results in (3.19) and (3.20) then immediately follow from Proposition 1 and the above two inequalities.

From (3.19)-(3.20), we can see that the complexity of the decentralized primal-dual method for computing an (ǫ,δ)-solution is O(1/ǫ) in terms of the primal functional optimality and O(1/δ) in terms of the constraint violation. Since each iteration involves a constant number of communication rounds, the number of inter-node communications required is also of the same order.

4 Decentralized Communication Sliding

In this section, we present a new decentralized primal-dual type method, namely, the decentralized communication sliding (DCS) method, for the case when the primal subproblem (3.8) is not easy to solve. We show that one can still maintain the same number of inter-node communications even when the subproblem (3.8) is approximately solved through an iterative subgradient descent procedure, and that the total number of required subgradient evaluations is comparable to that of centralized mirror descent methods. Throughout this section, we consider the deterministic case where exact subgradients of the f_i's are available.

Algorithm 3 Decentralized Communication Sliding (DCS)
Let x_i⁰ = x_i⁻¹ = x̂_i⁰ ∈ X_i, y_i⁰ ∈ R^d for i ∈ [m], and the nonnegative parameters {α_k}, {τ_k}, {η_k} and {T_k} be given.
for k = 1,...,N do
  Update z_i^k = (x̂_i^k, y_i^k) according to
    x̃_i^k = α_k(x̂_i^{k−1} − x_i^{k−2}) + x_i^{k−1}    (4.1)
    v_i^k = Σ_{j∈N_i} L_{ij} x̃_j^k    (4.2)
    y_i^k = argmin_{y_i∈R^d} ⟨−v_i^k, y_i⟩ + (τ_k/2)‖y_i − y_i^{k−1}‖² = y_i^{k−1} + (1/τ_k) v_i^k    (4.3)
    w_i^k = Σ_{j∈N_i} L_{ij} y_j^k    (4.4)
    (x_i^k, x̂_i^k) = CS(f_i, X_i, V_i, T_k, η_k, w_i^k, x_i^{k−1})    (4.5)
end for
return z_i^N = (x̂_i^N, y_i^N)

The CS (Communication-Sliding) procedure called at (4.5) is stated as follows.

procedure: (x, x̂) = CS(φ, U, V, T, η, w, x)
Let u⁰ = û⁰ = x and the parameters {β_t} and {λ_t} be given.
for t = 1,...,T do
    h^{t−1} = φ′(u^{t−1}) ∈ ∂φ(u^{t−1})    (4.6)
    u^t = argmin_{u∈U} [ ⟨w + h^{t−1}, u⟩ + ηV(x, u) + ηβ_t V(u^{t−1}, u) ]    (4.7)
end for
Set û^T := (Σ_{t=1}^T λ_t)^{−1} Σ_{t=1}^T λ_t u^t.    (4.8)
Set x = u^T and x̂ = û^T.
end procedure

4.1 The DCS Algorithm

We formally describe our DCS algorithm in Algorithm 3. We say that an outer iteration of the DCS algorithm, which we call the outer loop, occurs whenever the index k in Algorithm 3 is incremented by 1. Since the subproblems are solved inexactly, the outer loop of the primal-dual algorithm also needs to be modified in order to attain the best possible rate of convergence. In particular, in addition to the primal estimates {x_i^k}, we let each agent i maintain another primal sequence {x̂_i^k} (cf. the definition of x̃_i^k in (4.1)), which will later play a crucial role in the development and convergence proof of the algorithm.

Observe that the DCS method, in spirit, has been inspired by some of our recent work on gradient sliding [24]. However, the gradient sliding method in [24] focuses on how to save gradient evaluations for solving certain structured convex optimization problems, rather than how to save communication rounds for decentralized optimization, and its algorithmic scheme is also quite different from the DCS method.
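To make the inner loop (4.6)-(4.8) concrete, here is a minimal sketch with the Euclidean prox-function V(x,u) = ½(x − u)², for which (4.7) has a closed form. The test function φ(u) = |u − 2|, the inputs (w = 0, x = 0, η = 1), and the stepsize/weight rules β_t = t/2, λ_t = t are all illustrative assumptions of ours; the paper's own parameter choices appear in its convergence theorems.

```python
def cs(phi_subgrad, T, eta, w, x):
    """Communication-sliding inner loop (4.6)-(4.8), scalar case, V(x,u) = 0.5*(x-u)^2.
    No communication happens here: w is held fixed for all T subgradient steps."""
    u = x                                   # u^0 = x
    num, den = 0.0, 0.0
    for t in range(1, T + 1):
        beta, lam = t / 2.0, float(t)       # assumed stepsize and weight rules
        h = phi_subgrad(u)                  # (4.6): subgradient of the local objective
        # (4.7): argmin_u <w + h, u> + eta*V(x, u) + eta*beta*V(u_prev, u)
        u = (eta * x + eta * beta * u - (w + h)) / (eta * (1.0 + beta))
        num += lam * u
        den += lam
    return u, num / den                     # (u^T, weighted average u_hat^T per (4.8))

# phi(u) = |u - 2|; with w = 0, x = 0, eta = 1 the subproblem min |u-2| + 0.5*u^2
# is minimized at u = 1 (first-order condition -1 + u = 0 on the branch u < 2)
uT, u_hat = cs(lambda u: 1.0 if u > 2 else -1.0, T=50, eta=1.0, w=0.0, x=0.0)
print(uT, u_hat)  # both close to 1.0
```

The point of the procedure is visible here: all T subgradient steps on the nonsmooth φ reuse the same w, so in DCS they cost no extra communication rounds.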
