Acta Numerica (2016), pp. 1–160
doi:10.1017/S0962492916000015
© Cambridge University Press, 2016

Linear algebra software for large-scale accelerated multicore computing

A. Abdelfattah, H. Anzt, J. Dongarra, M. Gates, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, I. Yamazaki and A. YarKhan

Innovative Computing Laboratory, University of Tennessee,
1122 Volunteer Boulevard, Suite 203 Claxton, Knoxville, TN 37996, USA
E-mail: {ahmad, hanzt, dongarra, mgates3, haidar, kurza, luszczek, tomov, iyamazak, yarkhan}@icl.utk.edu

Many crucial scientific computing applications, ranging from national security to medical advances, rely on high-performance linear algebra algorithms and technologies, underscoring their importance and broad impact. Here we present the state-of-the-art design and implementation practices for the acceleration of the predominant linear algebra algorithms on large-scale accelerated multicore systems. Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDLT factorizations needed for solving linear systems of equations, to eigenvalue and singular value decomposition (SVD) problems. The implementations presented are readily available via the open-source PLASMA and MAGMA libraries, which represent the next-generation modernization of the popular LAPACK library for accelerated multicore systems.

To generate the extreme level of parallelism needed for the efficient use of these systems, algorithms of interest are redesigned and then split into well-chosen computational tasks. The task execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators and/or Xeon Phi coprocessors, using either static scheduling or light-weight runtime systems. The use of light-weight runtime systems keeps scheduling overheads low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code.
This simplifies the development effort and allows exploration of the unique strengths of the various hardware components. Finally, we emphasize the development of innovative linear algebra algorithms using three technologies – mixed precision arithmetic, batched operations, and asynchronous iterations – that are currently of high interest for accelerated multicore systems.

CONTENTS
1 Legacy software packages for dense linear algebra
2 New hardware architectures
3 Dynamic runtime scheduling
4 LU factorization
5 QR factorization
6 Cholesky factorization
7 LDLT decomposition
8 Eigenvalue and singular value problems
9 Mixed precision algorithms
10 Batched operations
11 Sparse linear algebra
12 Conclusion
References

1. Legacy software packages for dense linear algebra

Over the years, computational physics and chemistry have provided ongoing demand for the ever-increasing performance in hardware and software required to efficiently solve problems in these fields. Fortunately, most of these problems can be translated into solutions for systems of linear equations – the very topic of numerical linear algebra. Seemingly then, a set of efficient linear solvers could be solving important scientific problems for years to come. We also argue that dramatic changes in hardware design, precipitated by the shifting nature of the computer hardware marketplace, have had a continuous effect on numerical linear algebra software, and the extraction of high percentages of peak performance from evolving hardware requires continuous adaptation in software. If the history of linear algebra software's adaptive nature is any indication, then the future holds yet more changes – changes aimed at harnessing the incredible advances of the evolving hardware infrastructure.

1.1. The LINPACK library and its implementations

In the 1970s, Fortran was the language for scientific computation: see Figure 1.1. Fortran stores two-dimensional arrays in column-major order (all of the elements of the first column are stored first, then the elements of the second column are stored, and so on). Accessing the array in a row-wise fashion within the matrix could involve successive memory references to locations separated from each other by a large increment, depending on the size of the declared array. This situation was further complicated by the operating system's use of memory pages to effectively control memory usage.

      subroutine dgefa(a,lda,n,ipvt,info)
      integer lda,n,ipvt(1),info
      double precision a(lda,1)
      double precision t
      integer idamax,j,k,kp1,l,nm1
c
c     gaussian elimination with partial pivoting
c
      info = 0
      nm1 = n - 1
      if (nm1 .lt. 1) go to 70
      do 60 k = 1, nm1
         kp1 = k + 1
c
c        find l = pivot index
c
         l = idamax(n-k+1,a(k,k),1) + k - 1
         ipvt(k) = l
c
c        zero pivot implies this column is already triangularized
c
         if (a(l,k) .eq. 0.0d0) go to 40
c
c        interchange if necessary
c
         if (l .eq. k) go to 10
         t = a(l,k)
         a(l,k) = a(k,k)
         a(k,k) = t
   10    continue
c
c        compute multipliers
c
         t = -1.0d0/a(k,k)
         call dscal(n-k,t,a(k+1,k),1)
c
c        row elimination with column indexing
c
         do 30 j = kp1, n
            t = a(l,j)
            if (l .eq. k) go to 20
            a(l,j) = a(k,j)
            a(k,j) = t
   20       continue
            call daxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
   30    continue
         go to 50
   40    continue
         info = k
   50    continue
   60 continue
   70 continue
      ipvt(n) = n
      if (a(n,n) .eq. 0.0d0) info = n
      return
      end

Figure 1.1. LINPACK variant of LU factorization. This is the original FORTRAN 66 code; if LINPACK were written today it would use a modern Fortran standard.
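To make the column-major storage argument concrete, the following minimal sketch (an illustration added here, not code from LINPACK or from the article) sums the entries of a square array with both loop orders in modern Fortran; only the column-oriented loop walks through memory with unit stride.

    program column_vs_row
      implicit none
      integer, parameter :: n = 2000
      double precision, allocatable :: a(:, :)
      double precision :: s
      integer :: i, j

      allocate(a(n, n))
      a = 1.0d0
      s = 0.0d0

      ! Column-oriented traversal: the innermost index is the row index, so
      ! consecutive iterations touch consecutive memory locations (unit stride).
      do j = 1, n
         do i = 1, n
            s = s + a(i, j)
         end do
      end do

      ! Row-oriented traversal: consecutive iterations jump n elements apart
      ! in memory, the strided access pattern that LINPACK deliberately avoids.
      do i = 1, n
         do j = 1, n
            s = s + a(i, j)
         end do
      end do

      print *, s
    end program column_vs_row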
With a large matrix and a row-oriented algorithm in a Fortran environment, an excessive number of page swaps might be generated in the process of running the software. Cleve Moler pointed out this issue in the 1970s (Moler 1972).

In the mid-to-late 1970s, the introduction of vector computer systems marked the beginning of modern supercomputing. These systems offered a performance advantage of at least one order of magnitude over conventional systems of that time. Raw performance was the main, if not the only, selling point of new computing hardware. In the first half of the 1980s the integration of vector systems in conventional computing environments became more important. Only the manufacturers who provided standard programming environments, operating systems, and key applications were successful in earning industrial customers and survived. Performance was mainly increased by improved chip technologies and by producing shared memory multiprocessor systems. These systems were able, in one step, to perform a single operation on a relatively large number of operands stored in vector registers. Expressing matrix algorithms as vector–vector operations was a natural fit for this type of machine (Dongarra, Gustavson and Karp 1984). However, some of the vector designs had a limited ability to load and store the vector registers in main memory. A technique called 'chaining' allowed this limitation to be circumvented by moving data between the registers before accessing main memory. Chaining required recasting linear algebra in terms of matrix–vector operations.

Vector architectures exploit pipeline processing by running mathematical operations on arrays of data in a simultaneous or pipelined fashion. Most algorithms in linear algebra can be easily vectorized. Therefore, in the late 1970s there was an effort to standardize vector operations for use in scientific computations. The idea was to define some simple, frequently used operations and implement them on various systems to achieve portability and efficiency. This package came to be known as the Level 1 Basic Linear Algebra Subprograms (BLAS) or Level 1 BLAS (Lawson, Hanson, Kincaid and Krogh 1979).

The term Level 1 denotes vector–vector operations. As we will see, Level 2 (matrix–vector operations: Hammarling, Dongarra, Du Croz and Hanson 1988) and Level 3 (matrix–matrix operations: Dongarra, Du Croz, Hammarling and Duff 1990) play important roles as well. In the 1970s, dense linear algebra algorithms were implemented in a systematic way by the LINPACK project (Dongarra, Bunch, Moler and Stewart 1979). LINPACK is a collection of Fortran subroutines that analyse and solve linear equations and linear least-squares problems. The package solves linear systems whose matrices are general, banded, symmetric indefinite, symmetric positive definite, triangular, and square tridiagonal (only the diagonal, super-diagonal, and sub-diagonal are present). In addition, the package computes the QR (matrix Q is orthogonal or unitary and R is upper trapezoidal) and singular value decompositions of rectangular matrices and applies them to least-squares problems. LINPACK uses column-oriented algorithms, which increase efficiency by preserving locality of reference. By column orientation, we mean that the LINPACK code always references arrays down columns, not across rows. This is important since Fortran stores arrays in column-major order. This means that as one proceeds down a column of an array, the memory references proceed sequentially through memory.
Thus, if a program references an item in a particular block, the next reference is likely to be in the same block. The software in LINPACK was kept machine-independent partly through the introduction of the Level 1 BLAS routines. Almost all of the computation was done by calling the Level 1 BLAS. For each machine, the set of Level 1 BLAS would be implemented in a machine-specific manner to obtain high performance. The Level 1 BLAS subroutines DAXPY, DSCAL, and IDAMAX are used in the routine DGEFA. It was presumed that the BLAS operations would be implemented in an efficient, machine-specific way suitable for the computer on which the subroutines were executed. On a vector computer, this could translate into a simple, single vector operation. This avoided leaving the optimization up to the compiler and explicitly exposed a performance-critical operation.

In a sense, then, the beauty of the original code was regained with the use of a new vocabulary to describe the algorithms: the BLAS. Over time, the BLAS became a widely adopted standard and was most likely the first to enforce two key aspects of software: modularity and portability. Again, these are taken for granted today, but at the time they were not. One could have the advantages of the compact algorithm representation and the portability, because the resulting Fortran code was portable.

Most algorithms in linear algebra can be easily vectorized. However, to gain the most out of such architectures, simple vectorization is usually not enough. Some vector computers are limited by having only one path between memory and the vector registers. This creates a bottleneck if a program loads a vector from memory, performs some arithmetic operations, and then stores the results. In order to achieve top performance, the scope of the vectorization must be expanded to facilitate chaining operations together and to minimize data movement, in addition to using vector operations. Recasting the algorithms in terms of matrix–vector operations makes it easy for a vectorizing compiler to achieve these goals.

Thus, as computer architectures became more complex in the design of their memory hierarchies, it became necessary to increase the scope of the BLAS routines from Level 1 to Level 2 to Level 3.

1.2. Implementation in terms of submatrices

RISC (reduced instruction set computing) computers were introduced in the late 1980s and early 1990s. While their clock rates might have been comparable to those of the vector machines, the computing speed lagged behind due to their lack of vector registers. Another deficiency was their creation of a deep memory hierarchy with multiple levels of cache memory to alleviate the scarcity of bandwidth that was, in turn, caused mostly by a limited number of memory banks. The eventual success of this architecture is commonly attributed to the price point and astonishing improvements in performance over time as predicted by Moore's law (Moore 1965). With RISC computers, the linear algebra algorithms had to be redone yet again. This time, the formulations had to expose as many matrix–matrix operations as possible, which guaranteed good cache re-use.

As mentioned before, the introduction in the late 1970s and early 1980s of vector machines brought about the development of another variant of algorithms for dense linear algebra. This variant was centred on the multiplication of a matrix by a vector. These subroutines were meant to give improved performance over the dense linear algebra subroutines in LINPACK, which were based on Level 1 BLAS.
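The progression through the BLAS levels is easiest to see in code. The sketch below (an illustration added here, assuming only the standard reference BLAS interfaces for DAXPY, DGEMV and DGEMM and a linked BLAS library) forms a matrix–vector product first as a loop of Level 1 AXPY calls in the LINPACK style, then as a single Level 2 GEMV call, and finally shows the Level 3 GEMM kernel around which the LAPACK algorithms described next are built.

    program blas_levels
      implicit none
      integer, parameter :: n = 500
      double precision, allocatable :: a(:, :), b(:, :), c(:, :), x(:), y(:)
      integer :: j
      external :: daxpy, dgemv, dgemm

      allocate(a(n, n), b(n, n), c(n, n), x(n), y(n))
      call random_number(a)
      call random_number(b)
      call random_number(x)
      y = 0.0d0
      c = 0.0d0

      ! Level 1 (vector-vector): a matrix-vector product written as n AXPY
      ! updates y = y + x(j)*A(:,j), the style of the original LINPACK codes.
      do j = 1, n
         call daxpy(n, x(j), a(1, j), 1, y, 1)
      end do

      ! Level 2 (matrix-vector): the whole product y = A*x + y in one call,
      ! giving an implementation room to chain loads, multiplies and adds.
      call dgemv('N', n, n, 1.0d0, a, n, x, 1, 1.0d0, y, 1)

      ! Level 3 (matrix-matrix): C = A*B performs O(n**3) operations on
      ! O(n**2) data, the ratio that blocked algorithms exploit for cache
      ! re-use.
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

      print *, y(1), c(1, 1)
    end program blas_levels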
In the late 1980s and early 1990s, with the introduction of RISC-type microprocessors (the 'killer micros') and other machines with cache-type memories, we saw the development of LAPACK (Anderson et al. 1999) Level 3 algorithms for dense linear algebra. A Level 3 code is typified by the main Level 3 BLAS, which, in this case, is matrix multiplication (Anderson and Dongarra 1990b).

The original goal of the LAPACK project was to allow linear algebra problems to run efficiently on vector and shared memory parallel processors. On these machines, LINPACK is inefficient because its memory access patterns disregard the multilayered memory hierarchies of the machines, thereby spending too much time moving data instead of doing useful floating-point operations. LAPACK addresses this problem by reorganizing the algorithms to use block matrix operations, such as matrix multiplication, in the innermost loops (see the paper by Anderson and Dongarra 1990a). These block operations can be optimized for each architecture to account for its memory hierarchy, and so provide a transportable way to achieve high efficiency on diverse modern machines.

Here we use the term 'transportable' instead of 'portable' because, for fastest possible performance, LAPACK requires that highly optimized block matrix operations already be implemented on each machine. In other words, the correctness of the code is portable, but high performance is not – if we limit ourselves to a single Fortran source code.

LAPACK can be regarded as a successor to LINPACK in terms of functionality, although it does not use the same function-calling sequences. As such a successor, LAPACK was a win for the scientific community because it could keep LINPACK's functionality while getting improved use out of new hardware.

      SUBROUTINE DGETRF( M, N, A, LDA, IPIV, INFO )
      INTEGER            INFO, LDA, M, N
      INTEGER            IPIV( * )
      DOUBLE PRECISION   A( LDA, * )
      DOUBLE PRECISION   ONE
      PARAMETER          ( ONE = 1.0D+0 )
      INTEGER            I, IINFO, J, JB, NB
      EXTERNAL           DGEMM, DGETF2, DLASWP, DTRSM
      EXTERNAL           XERBLA
      INTEGER            ILAENV
      EXTERNAL           ILAENV
      INTRINSIC          MAX, MIN
      INFO = 0
      IF( M.LT.0 ) THEN
         INFO = -1
      ELSE IF( N.LT.0 ) THEN
         INFO = -2
      ELSE IF( LDA.LT.MAX( 1, M ) ) THEN
         INFO = -4
      END IF
      IF( INFO.NE.0 ) THEN
         CALL XERBLA( 'DGETRF', -INFO )
         RETURN
      END IF
      IF( M.EQ.0 .OR. N.EQ.0 ) RETURN
      NB = ILAENV( 1, 'DGETRF', ' ', M, N, -1, -1 )
      IF( NB.LE.1 .OR. NB.GE.MIN( M, N ) ) THEN
         CALL DGETF2( M, N, A, LDA, IPIV, INFO )
      ELSE
         DO 20 J = 1, MIN( M, N ), NB
            JB = MIN( MIN( M, N )-J+1, NB )
*           Factor diagonal and subdiagonal blocks and test for exact
*           singularity.
            CALL DGETF2( M-J+1, JB, A( J, J ), LDA, IPIV( J ), IINFO )
*           Adjust INFO and the pivot indices.
            IF( INFO.EQ.0 .AND. IINFO.GT.0 ) INFO = IINFO + J - 1
            DO 10 I = J, MIN( M, J+JB-1 )
               IPIV( I ) = J - 1 + IPIV( I )
   10       CONTINUE
*           Apply interchanges to columns 1:J-1.
            CALL DLASWP( J-1, A, LDA, J, J+JB-1, IPIV, 1 )
            IF( J+JB.LE.N ) THEN
*              Apply interchanges to columns J+JB:N.
               CALL DLASWP( N-J-JB+1, A( 1, J+JB ), LDA, J, J+JB-1,
     $                      IPIV, 1 )
*              Compute block row of U.
               CALL DTRSM( 'Left', 'Lower', 'No transpose', 'Unit',
     $                     JB, N-J-JB+1, ONE, A( J, J ), LDA,
     $                     A( J, J+JB ), LDA )
               IF( J+JB.LE.M ) THEN
*                 Update trailing submatrix.
                  CALL DGEMM( 'No transpose', 'No transpose',
     $                        M-J-JB+1, N-J-JB+1, JB, -ONE,
     $                        A( J+JB, J ), LDA, A( J, J+JB ), LDA,
     $                        ONE, A( J+JB, J+JB ), LDA )
               END IF
            END IF
   20    CONTINUE
      END IF
      RETURN
      END

Figure 1.2. LAPACK's LU factorization routine DGETRF using FORTRAN 77.

Most of the computational work in the algorithm from Figure 1.2 is contained in three routines:

• DGEMM – matrix–matrix multiplication,
• DTRSM – triangular solve with multiple right-hand sides,
• DGETF2 – unblocked LU factorization for operations within a block column.

One of the key parameters in the algorithm is the block size, called NB here. If NB is too small or too large, poor performance can result – hence the importance of the ILAENV function, whose standard implementation was meant to be replaced by a vendor implementation encapsulating machine-specific parameters upon installation of the LAPACK library. At any given point of the algorithm, NB columns or rows are exposed to a well-optimized Level 3 BLAS. If NB is 1, the algorithm is equivalent in performance and memory access patterns to LINPACK's version.
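As a usage note, DGETRF is invoked like any other LAPACK routine. The short driver below (a sketch added for illustration, not code from the article) factors a random square system and then solves it with the companion routine DGETRS; it assumes only the standard LAPACK calling sequences for DGETRF and DGETRS and a linked LAPACK/BLAS library.

    program lu_driver
      implicit none
      integer, parameter :: n = 1000
      double precision, allocatable :: a(:, :), b(:)
      integer, allocatable :: ipiv(:)
      integer :: info
      external :: dgetrf, dgetrs

      allocate(a(n, n), b(n), ipiv(n))
      call random_number(a)
      call random_number(b)

      ! Factor A = P*L*U; the block size NB is chosen internally via ILAENV.
      call dgetrf(n, n, a, n, ipiv, info)
      if (info /= 0) stop 'DGETRF failed: singular matrix or illegal argument'

      ! Solve A*x = b by forward/back substitution with the factors and
      ! pivots that DGETRF left in a and ipiv; the solution overwrites b.
      call dgetrs('N', n, 1, a, n, ipiv, b, n, info)

      print *, 'first solution component:', b(1)
    end program lu_driver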
Matrix–matrix operations offer the proper level of modularity for performance and transportability across a wide range of computer architectures, including parallel systems with memory hierarchy. This enhanced performance is due primarily to a greater opportunity for re-using data. There are numerous ways to accomplish this re-use of data to reduce memory traffic and to increase the ratio of floating-point operations to data movement through the memory hierarchy. This improvement can bring a three- to tenfold improvement in performance on modern computer architectures.

The jury is still out concerning the productivity of writing and reading the LAPACK code: How hard is it to generate the code from its mathematical description? The use of vector notation in LINPACK is arguably more natural than LAPACK's matrix formulation. The mathematical formulas that describe algorithms are usually more complex if only matrices are used, as opposed to mixed vector–matrix notation.

1.3. Beyond a single compute node

The traditional design focus for massively parallel processing (MPP) systems was the very high end of performance. In the early 1990s, the symmetric multiprocessing (SMP) systems of various workstation manufacturers, as well as the IBM SP series – which targeted the lower and medium market segments – gained great popularity. Their price/performance ratios were better thanks to the economies of scale associated with larger production numbers, and due to lack of overhead in the design for support of the very large configurations. Due to the vertical integration of performance, it was no longer economically feasible to produce and focus on the highest end of computing power alone. The design focus for new systems shifted to the market of medium performance systems.

The acceptance of MPP systems, not only for engineering applications but also for new commercial applications (e.g. databases), emphasized different criteria for market success, such as the stability of the system, continuity/longevity of the manufacturer, and price/performance. Thriving in commercial environments became an important endeavour for a successful supercomputer business in the late 1990s. Due to these factors, and the consolidation of the number of vendors in the market, hierarchical systems built with components designed for the broader commercial market replaced homogeneous systems at the very high end of performance. The marketplace adopted clusters of SMPs readily, while academic research focused on clusters of workstations and PCs.

At the end of the 1990s, clusters were common in academia but mostly as research objects and not primarily as general-purpose computing platforms for applications. Most of these clusters were of comparably small scale, and as a result the November 1999 edition of the TOP500 (Meuer, Strohmaier, Dongarra and Simon 2011) listed only seven cluster systems.
This changed dramatically as industrial and commercial customers started deploying clusters as soon as applications with less stringent communication requirements permitted them to take advantage of the better price/performance ratio. At the same time, all major vendors in the HPC market started selling this type of cluster to their customer base. In November 2004, clusters were the dominant architectures in the TOP500, with 294 systems at all levels of performance. Companies such as IBM and Hewlett-Packard were selling the majority of these clusters, and a large number of them were installed for commercial and industrial customers.

In the early 2000s, clusters built with off-the-shelf components gained more and more attention not only as academic research objects, but also as viable computing platforms for end-users of HPC computing systems. By 2004, these groups of clusters represented the majority of new systems on the TOP500 in a broad range of application areas. One major consequence of this trend was the rapid rise in the utilization of Intel processors in HPC systems. While virtually absent in the high end at the beginning of the decade, Intel processors are now used in the majority of HPC systems.

Clusters in the 1990s were mostly self-made systems designed and built by small groups of dedicated scientists or application experts. This changed rapidly as soon as the market for clusters based on PC technology matured. Today, the large majority of TOP500-class clusters are manufactured and integrated by either a few traditional large HPC manufacturers, such as IBM or Hewlett-Packard, or numerous small, specialized integrators of such systems.