Université Pierre et Marie Curie
École Doctorale de Sciences Mathématiques de Paris Centre 386
Laboratoire Jacques-Louis Lions / Équipe Alpines

DOCTORAL THESIS (Thèse de doctorat)
Discipline: Applied Mathematics

presented by
Sophie MOUFAWAD

Enlarged Krylov Subspace Methods and Preconditioners for Avoiding Communication

supervised by Laura GRIGORI

Defended on ** **** 2014 before a jury composed of:
M. Prénom NOM, Université, president
Mlle Prénom NOM, Institut, examiner
Mme Prénom NOM, Université, reviewer
M. Prénom NOM, Université, advisor
M. Prénom NOM, Université, reviewer

Laboratoire Jacques-Louis Lions, Boîte courrier 187, 4 place Jussieu, 75252 Paris cedex 05
École doctorale Paris centre, Boîte courrier 188, 4 place Jussieu, 75252 Paris cedex 05

Contents

1 Introduction 5
2 Preliminaries 9
  2.1 Notation 9
    2.1.1 Communication 9
  2.2 Graphs and partitioning techniques 10
    2.2.1 Graphs 10
    2.2.2 Nested dissection 11
    2.2.3 K-way graph partitioning 12
  2.3 Orthonormalization 13
    2.3.1 Classical Gram-Schmidt (CGS) 13
    2.3.2 Modified Gram-Schmidt (MGS) 15
    2.3.3 Tall and skinny QR (TSQR) 16
  2.4 The A-orthonormalization 17
    2.4.1 Modified Gram-Schmidt A-orthonormalization 18
    2.4.2 Classical Gram-Schmidt A-orthonormalization 22
    2.4.3 Cholesky QR A-orthonormalization 28
  2.5 Matrix powers kernel 30
  2.6 Test matrices 33
3 Krylov Subspace Methods 37
  3.1 Classical Krylov subspace methods 37
    3.1.1 The Krylov subspaces 37
    3.1.2 The Krylov subspace methods 38
    3.1.3 Krylov projection methods 38
    3.1.4 Conjugate gradient 39
    3.1.5 Generalized minimal residual (GMRES) method 41
  3.2 Parallelizable variants of the Krylov subspace methods 43
    3.2.1 Block Krylov methods 43
    3.2.2 The s-step Krylov methods 45
    3.2.3 Communication avoiding methods 48
    3.2.4 Other CG methods 50
  3.3 Preconditioners 54
    3.3.1 Incomplete LU preconditioner 56
    3.3.2 Block Jacobi preconditioner 57
    3.3.3 Restricted additive Schwarz preconditioner 58
4 Enlarged Krylov Subspace (EKS) Methods 61
  4.1 The enlarged Krylov subspace 62
    4.1.1 Krylov projection methods 66
    4.1.2 The minimization property 66
    4.1.3 Convergence analysis 67
  4.2 Multiple search direction with orthogonalization conjugate gradient (MSDO-CG) method 67
    4.2.1 The residual r_k 68
    4.2.2 The domain search direction P_k 69
    4.2.3 Finding the expression of α_{k-1} and β_{k-1} 70
    4.2.4 The MSDO-CG algorithm 70
  4.3 Long recurrence enlarged conjugate gradient (LRE-CG) method 72
    4.3.1 The LRE-CG algorithm 73
  4.4 Convergence results 76
  4.5 Parallel model and expected performance 81
    4.5.1 MSDO-CG 83
    4.5.2 LRE-CG 85
  4.6 Preconditioned enlarged Krylov subspace methods 88
    4.6.1 Convergence 90
  4.7 Summary 94
5 Communication Avoiding Incomplete LU(0) Preconditioner 97
  5.1 ILU matrix powers kernel 99
    5.1.1 The partitioning problem 99
    5.1.2 ILU preconditioned matrix powers kernel 100
  5.2 Alternating min-max layers (AMML(s)) reordering for ILU(0) matrix powers kernel 102
    5.2.1 Nested dissection + AMML(s) reordering of the matrix A 103
    5.2.2 K-way + AMML(s) reordering of the matrix A 109
    5.2.3 Complexity of AMML(s) reordering 115
  5.3 CA-ILU0 preconditioner 118
  5.4 Expected numerical efficiency and performance of CA-ILU0 preconditioner 119
    5.4.1 Convergence 120
    5.4.2 Avoided communication versus memory requirements and redundant flops of the ILU0 matrix powers kernel 123
    5.4.3 Comparison between CA-ILU0 preconditioner and block Jacobi preconditioner 128
  5.5 Summary 129
6 Conclusion and Future work 131
Appendix A ILU(0) preconditioned GMRES convergence for different reorderings 143

Chapter 1

Introduction

Many scientific problems require the solution of systems of linear equations of the form Ax = b, where A is an n × n matrix and b is an n × 1 vector. There are two broad categories of methods for solving systems of linear equations: direct methods and iterative methods. Direct methods solve the system in a finite number of steps or operations.
Examples of direct methods are matrix decompositions like the LU decomposition A = LU, the Cholesky decomposition A = LL^t for symmetric positive definite A, and the QR decomposition A = QR for full rank A, where L is a lower triangular matrix, U and R are upper triangular matrices, and Q is an orthonormal matrix. After decomposing the matrix A, the upper triangular and lower triangular systems are solved by backward and forward substitution. The matrix A can be a dense or a sparse matrix. Several libraries that implement direct methods for solving sparse systems have been introduced, like MUMPS [1], PARDISO [69, 55], and SuperLU [57, 58]. However, when the matrix A is sparse, the factors obtained after decomposition are denser than the input matrix. Moreover, direct methods are prohibitive in terms of memory and flops when it comes to solving very large systems, and they are not easily parallelized on modern-day architectures. Thus, iterative methods, which compute a sequence of approximate solutions for the system Ax = b by starting from an initial guess, are a good alternative.

We are interested in solving systems of linear equations Ax = b where the matrix A is sparse. Such systems may arise from the discretization of partial differential equations. The Krylov subspace methods are among the most practical and popular iterative methods today. They are polynomial iterative methods that aim to solve systems of linear equations (Ax = b) by finding a sequence of vectors x_1, x_2, x_3, x_4, ..., x_k that minimizes some measure of error over the corresponding spaces

x_0 + K_i(A, r_0),   i = 1, ..., k,

where x_0 is the initial iterate, r_0 is the initial residual, and K_i(A, r_0) = span{r_0, A r_0, A^2 r_0, ..., A^{i-1} r_0} is the Krylov subspace of dimension i. Conjugate Gradient (CG) [47], Generalized Minimal Residual (GMRES) [66], bi-Conjugate Gradient [56, 30], and bi-Conjugate Gradient Stabilized [75] are some of the Krylov subspace methods. These methods compute one basis vector or one search direction vector at each iteration i, by performing a matrix-vector multiplication. Then, the ith approximate solution is defined by performing saxpy (ax + y) and dot product (x^t x) operations, where a is a scalar and x, y are vectors.

The performance of an algorithm on any architecture depends on the processing unit's speed for performing floating point operations (flops) and on the speed of accessing memory and disk. Moreover, the efficiency of parallel implementations depends on the amount of computation performed per communication, that is, per data movement. This is due to the fact that the cost of communication is much higher than that of arithmetic operations, and this gap is expected to continue to increase exponentially [37]. As a result, communication is often the bottleneck in numerical algorithms. In a quest to address the communication problem, recent research has focused on reformulating linear algebra operations such that the movement of data is significantly reduced, or even minimized as in the case of dense matrix factorizations like LU factorization, QR factorization, and tall and skinny QR factorization [21, 38, 4]. Such algorithms are referred to as communication avoiding.

The Krylov subspace methods are governed by Blas1 and Blas2 operations like dot products and matrix-vector multiplications, as discussed above. Parallelizing dot products is constrained by communication, since the computation performed is negligible. If the dot products are performed by one processor, then a communication is needed before (synchronization) and after the computations. In both cases, communication is a bottleneck. This problem has been tackled by different approaches.
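The following is a minimal NumPy sketch of the classical, unpreconditioned CG iteration (our own illustration, not an algorithm taken from this thesis). It makes the kernel structure discussed above concrete: one matrix-vector product (Blas2) and a handful of dot products and saxpy updates (Blas1) per iteration, the two dot products being the synchronization points of a parallel implementation.

```python
import numpy as np

def cg(A, b, x0, tol=1e-8, maxiter=1000):
    """Textbook conjugate gradient for symmetric positive definite A."""
    x = x0.copy()
    r = b - A @ x                      # initial residual r_0
    p = r.copy()                       # initial search direction
    rs_old = r @ r                     # dot product (first synchronization point)
    for _ in range(maxiter):
        Ap = A @ p                     # matrix-vector multiplication (Blas2)
        alpha = rs_old / (p @ Ap)      # dot product -> step length
        x = x + alpha * p              # saxpy: update the approximate solution
        r = r - alpha * Ap             # saxpy: update the residual
        rs_new = r @ r                 # dot product (second synchronization point)
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p  # saxpy: new search direction
        rs_old = rs_new
    return x
```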
One approach is to hide the cost of communication by overlapping it with other communications and computations, as in pipelined CG [23, 43] and pipelined GMRES [34]. Another approach consists of replacing Blas1 and Blas2 operations by Blas2 and Blas3 operations, either by introducing new methods or by reformulating the algorithm itself. The first such methods to be introduced were block methods that solve a system with multiple right-hand sides AX = B, like O'Leary's block CG [63]. These methods compute a block of vectors at each iteration by multiplying the matrix with a block of vectors. Then, the ith block approximate solution is obtained by solving small systems and performing tall skinny gaxpy's, Ax + y, where A is an n × m matrix with n >> m, and x, y are vectors.

Unlike the block methods, the s-step methods solve the system Ax = b by computing s basis vectors per iteration and solving small systems. Some of the s-step methods are s-step CG [19] and s-step GMRES [26]. Both families of methods, block and s-step, use Blas2 and Blas3 operations. Recently, communication avoiding Krylov methods based on s-step methods were introduced, like CA-CG and CA-GMRES [60, 48, 12]. The communication avoiding methods aim at further avoiding communication in the Blas2 and Blas3 operations at the expense of performing some redundant flops. For s = 1, where s-step methods are equivalent to classical methods, there are many available preconditioners. One of them, block Jacobi, is a naturally parallelizable and communication avoiding preconditioner. However, except for a discussion in [48], there are no available preconditioners that avoid communication and can be used with s-step methods for s > 1. This is a serious limitation of these methods, since for difficult problems, Krylov subspace methods without a preconditioner can be very slow or might not even converge. In this thesis, we introduce a communication avoiding ILU(0) preconditioner (CA-ILU0) [40] that can be computed in parallel and applied to s vectors of the form y_i = (LU)^{-1} A y_{i-1}, for i = 1, 2, ..., s, without any communication. This preconditioner can be parallelized without communication thanks to a heuristic reordering of the matrix A that we call alternating min-max layers, AMML(s). Moreover, the CA-ILU0 preconditioner can also be used with classical Krylov subspace methods, where it avoids communication.

Since communication avoiding methods are based on s-step methods, which have some stability issues, we introduce a new type of Krylov subspace methods. Our new approach consists of enlarging the Krylov subspace based on domain decomposition. First, we split the initial residual r_0 into t vectors according to a decomposition of the domain. Then, the obtained t vectors are multiplied by A at each iteration to generate the t new basis vectors of the enlarged Krylov subspace (a schematic formulation is given below). Enlarging the Krylov subspace should lead to faster convergence and to parallelizable algorithms with less communication than the classical Krylov methods, due to the use of Blas2 and Blas3 operations. In this thesis, we introduce two new versions of conjugate gradient. The first version, multiple search direction with orthogonalization CG (MSDO-CG), has the same structure as the classical conjugate gradient method: we first define t new search directions, then find the t step lengths by solving a t × t system, and update the solution and the residual. But unlike in CG, the search directions are not A-orthogonal. We A-orthonormalize the search directions to obtain a projection method that guarantees convergence at least as fast as CG.
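Schematically, and only anticipating the precise definitions of chapter 4 with illustrative notation, the splitting replaces r_0 by t vectors r_0^(1), ..., r_0^(t), where r_0^(i) agrees with r_0 on the unknowns of the i-th subdomain and is zero elsewhere, so that r_0 = r_0^(1) + ... + r_0^(t). The enlarged Krylov subspace after k iterations is then

K_{t,k}(A, r_0) = span{ r_0^(1), ..., r_0^(t), A r_0^(1), ..., A r_0^(t), ..., A^{k-1} r_0^(1), ..., A^{k-1} r_0^(t) }.

Since r_0 is the sum of the split vectors, this subspace contains the classical Krylov subspace K_k(A, r_0), which is what makes a comparison with the convergence of CG meaningful.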
The second version, long recurrence enlarged CG (LRE-CG), is similar to GMRES in that we build an orthonormal basis for the enlarged Krylov subspace rather than finding search directions. Then, we use the whole basis to update the solution and the residual.

The thesis is organized as follows. In chapter 2 we briefly introduce some notations and kernels that are used throughout this thesis, such as graphs and graph partitioning, orthonormalization schemes, A-orthonormalization schemes, the matrix powers kernel, and the set of test matrices used to test our introduced methods. In chapter 3 we discuss several variants of Krylov subspace methods, such as classical Krylov subspace methods (CG and GMRES), block Krylov methods (block CG), s-step Krylov methods (s-step CG and s-step GMRES), communication avoiding Krylov methods (CA-GMRES), and other parallelizable versions (MSD-CG and coop-CG). We also discuss preconditioners, such as the incomplete LU preconditioner, the block Jacobi preconditioner, and the restricted additive Schwarz preconditioner, which are crucial for the fast convergence of Krylov subspace methods.

In chapter 4 we introduce the enlarged Krylov subspace, the MSDO-CG method, and the LRE-CG method. We show that both methods are projection methods and hence converge at least as fast as CG in exact precision. We compare the convergence behavior of the MSDO-CG and LRE-CG methods using different A-orthonormalization and orthonormalization methods, and we then compare the most stable versions with CG and other related methods. Both methods converge faster than CG, but LRE-CG converges faster than MSDO-CG since it uses the whole basis to update the solution rather than only t search directions. We also present the parallel algorithms with their expected performance, and the preconditioned versions with their convergence behavior. This chapter is based on the article [41], which is in preparation for submission.

In chapter 5 we introduce the communication avoiding ILU(0) preconditioner (CA-ILU0), which minimizes communication during the construction of M = LU (i.e., the ILU(0) factorization), and during its application to s vectors (z = M^{-1} y = (LU)^{-1} y) at each iteration of the s-step methods. In other words, it is possible to solve s upper triangular systems and s lower triangular systems, in addition to performing the s matrix-vector multiplications, without any communication. First, we adapt the matrix powers kernel to the case of ILU preconditioned systems. Then, we introduce the AMML(s) heuristic reordering based on nested dissection and k-way graph partitioning. We then show that our reordering does not affect the convergence of ILU(0) preconditioned GMRES, and we model the expected performance of our preconditioner based on the needed memory and the redundant flops introduced to reduce the communication. This chapter is based on some parts of a revised version of the technical report [40], and on the article [39], which was submitted to the SIAM Journal on Scientific Computing (SISC) and is in revision. Finally, in chapter 6 we conclude and discuss possible future work on the introduced methods.
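As a concrete, purely illustrative picture of the kernel targeted in chapter 5, the s-step ILU(0)-preconditioned matrix powers computation y_i = (LU)^{-1} A y_{i-1}, i = 1, ..., s, amounts to the following loop. This is a serial SciPy sketch with names of our own choosing; it is not the parallel CA-ILU0 algorithm itself, whose point is precisely to reorder A so that each processor can run all s steps on its local data, at the cost of some redundant flops, without communicating.

```python
import scipy.sparse.linalg as spla

def ilu0_matrix_powers(A, L, U, y0, s):
    """Serial sketch of the s-step kernel y_i = (LU)^{-1} A y_{i-1}, i = 1..s,
    where L and U are the ILU(0) factors of A (all sparse, CSR format):
    one sparse matrix-vector product, one forward solve with L, and one
    backward solve with U per step."""
    ys = [y0]
    for _ in range(s):
        w = A @ ys[-1]                                  # matrix-vector product
        w = spla.spsolve_triangular(L, w, lower=True)   # forward substitution
        w = spla.spsolve_triangular(U, w, lower=False)  # backward substitution
        ys.append(w)
    return ys
```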