THREE PARALLEL PROGRAMMING PARADIGMS: COMPARISONS ON AN ARCHETYPAL PDE COMPUTATION

M. EHTESHAM HAYDER*, CONSTANTINOS S. IEROTHEOU†, AND DAVID E. KEYES‡

*Center for Research on High Performance Software, Rice University, Houston, TX 77005, USA, [email protected]
†Parallel Processing Research Group, University of Greenwich, London SE18 6PF, UK, [email protected]
‡Computer Science Department, Old Dominion University and ICASE, Norfolk, VA 23529-0162, USA, [email protected]

Abstract. Three paradigms for distributed-memory parallel computation that free the application programmer from the details of message passing are compared for an archetypal structured scientific computation (a nonlinear, structured-grid partial differential equation boundary value problem) using the same algorithm on the same hardware. All of the paradigms (parallel languages, represented by the Portland Group's HPF; (semi-)automated serial-to-parallel source-to-source translation, represented by CAPTools from the University of Greenwich; and parallel libraries, represented by Argonne's PETSc) are found to be easy to use for this problem class, and all are reasonably effective in exploiting concurrency after a short learning curve. The level of involvement required by the application programmer under any paradigm includes specification of the data partitioning, corresponding to a geometrically simple decomposition of the domain of the PDE. Programming in SPMD style for the PETSc library requires writing only the routines that discretize the PDE and its Jacobian, managing subdomain-to-processor mappings (affine global-to-local index mappings), and interfacing to library solver routines. Programming for HPF requires a complete sequential implementation of the same algorithm as a starting point, introduction of concurrency through subdomain blocking (a task similar to the index mapping), and modest experimentation with rewriting loops to elucidate to the compiler the latent concurrency. Programming with CAPTools involves feeding the same sequential implementation to the CAPTools interactive parallelization system and guiding the source-to-source code transformation by responding to various queries about quantities knowable only at runtime. Results representative of "the state of the practice" for a scaled sequence of structured-grid problems are given on three of the most important contemporary high-performance platforms: the IBM SP, the SGI Origin 2000, and the CRAY T3E.

1. Introduction. Parallel computations advance through synergism in numerical algorithms and system software technology. Algorithmic advances permit more rapid convergence to more accurate results with the same or reduced demands on processor, memory, and communication subsystems. System software advances provide more convenient expression and greater exploitation of latent algorithmic concurrency, and take improved advantage of architecture. These advances can be appropriated by application programmers through a variety of means: parallel languages that are compiled directly, source-to-source translators that aid in the embedding of data exchange and coordination constructs into standard high-level languages, and parallel libraries that support specific parallel kernels. In this paper, we compare all three approaches, as represented by "state-of-the-practice" software, on three machines that can be programmed using message passing.

Unfortunately, the development and tuning of a parallel numerical code from scratch remains a time-consuming task. The burden on the programmer may be reduced if the high-level programming language itself supports parallel constructs, which is the philosophy that underlies the High Performance Fortran [26] extensions to Fortran. With varying degrees of hints from programmers, the HPF approach leaves the responsibility of managing concurrency and data communication to the compiler and runtime system.
The difficulty of complete and fully automatic interprocedural dependence analysis and disambiguation in a language like Fortran suggests the opportunity for an interactive source-to-source approach to parallelization. CAPTools [16] is such an interactive authoring system for message-passing parallel code.

Parallel libraries offer a somewhat higher-level solution for tasks that are sufficiently common that libraries have been written for them. The philosophy underlying parallel libraries is that, for high performance, some expert human programmer must become involved in the concurrency detection, process assignment, interprocess data transfer, and process-to-processor mapping, but only once for each algorithmic archetype. A library, perhaps with multiple levels of entry to allow the application programmer to employ defaults or to exert detailed control, is the embodiment of algorithmic archetypes. One such parallel library is PETSc [3], under continuous expansion at Argonne National Laboratory since 1991. PETSc provides a wide variety of parallel numerical routines for scalable applications involving the solution of partial differential and integral equations, and certain other regular data-parallel applications. It uses message passing via MPI and assumes no physical data sharing or global address space.

In this study we compare the three paradigms discussed above on a simple problem representative of low-order structured-grid discretizations of nonlinear elliptic PDEs, the so-called "Bratu" problem. The solution algorithm is a Newton-Krylov method with subdomain-concurrent ILU preconditioning, also known as a Newton-Krylov-Schwarz (NKS) method [22]. Its basic components are typical of other algorithms for PDEs: (1) sparse matrix-vector products (together with Jacobian matrix and residual vector evaluations) based on regular multidimensional grid stencil operations, (2) sparse triangular solution recurrences, (3) global reductions, and (4) DAXPYs. Our goal is to examine the performance and scalability of these three different programming paradigms for this broadly important class of scientific computations.

With relatively modest effort, we obtain similar and reasonable performance using any of the three paradigms, suggesting that all three technologies are mature for static structured problems. This study expands on [14] by bringing the CAPTools system into the comparison (an important addition, since the CAPTools approach often edges out the other two in performance) and by comparing the three systems on three machines, instead of just the IBM SP.

The organization of this paper is as follows. Section 2 describes a model nonlinear PDE problem and its discretization and solution algorithm. Section 3 discusses the HPF, CAPTools, and PETSc implementations of the algorithm. The performance of the implementations is compared, side by side, in Section 4, and we conclude in Section 5. Our target audience includes both potential users of parallel systems for PDE simulation and developers of future versions of parallel languages, tools, and libraries.

2. Problem and Algorithm. Our test case is a classic nonlinear elliptic PDE, known as the Bratu problem.
In this problem, generation from a nonlinear reaction term is balanced by diffusion. The model problem is given by

(2.1)    $-\nabla^{2}u - \lambda e^{u} = 0,$

with $u = 0$ on the boundary, where $\lambda$ is a constant known as the Frank-Kamenetskii parameter in the combustion context. The Bratu problem is part of the MINPACK-2 test problem collection [2], and its solution is implemented in a variety of ways in the distribution set of demo drivers for the PETSc library, to illustrate different features of PETSc for nonlinear problems. There are two possible steady-state solutions to this problem for certain values of $\lambda$. One solution is close to $u = 0$ and is easy to obtain. A close starting point is needed to converge to the other solution. For our model case, we consider a square domain of unit length and $\lambda = 6$. We use a standard central difference scheme on a uniform grid (shown in Figure 2.1) to discretize (2.1) as

$$ f_{ij}(u) \;\equiv\; \begin{cases} u_{i,j}, & i \text{ or } j = 0 \text{ or } n,\\ 4u_{i,j}-u_{i-1,j}-u_{i+1,j}-u_{i,j-1}-u_{i,j+1}-h^{2}\lambda e^{u_{i,j}}, & \text{otherwise}, \end{cases} \;=\;0, $$

where $f_{ij}$ is a function of the vector of discrete unknowns $u_{i,j}$, defined at each interior and boundary grid point: $u_{i,j}\approx u(x_i,y_j)$, $x_i\equiv ih$, $i=0,1,\ldots,n$; $y_j\equiv jh$, $j=0,1,\ldots,n$; $h\equiv 1/n$. The discretization leads to a nonlinear algebraic problem of dimension $(n+1)^2$, with a sparse Jacobian matrix of condition number $O(n^2)$, asymptotically in $n$, for fixed $\lambda$. The typical number of nonzeros per row of the Jacobian is five, with just one in rows corresponding to boundary points of the physical domain. The algorithmic discussion in the balance of this section is sufficient to understand the main computation and communication costs in solving nonlinear elliptic boundary value problems like (2.1), but we defer full parallel complexity studies, including a discussion of optimal parallel granularities, partitioning strategies, and running times, to the literature, e.g., [12, 24]. In this study, most of the computations were done using a one-dimensional domain decomposition (shown in Figure 2.1). We also present some results for two-dimensional blocking.

Outer Iteration: Newton. We solve $f(u)=0$ by an inexact Newton-iterative method with a cubic backtracking line search [10]. We use the term "inexact Newton method" to denote any nonlinear iterative method approaching the solution $u$ through a sequence of iterates $u^{\ell}=u^{\ell-1}+\alpha\,\delta u^{\ell}$, beginning with an initial iterate $u^{0}$, where $\alpha$ is determined by some globalization strategy, and $\delta u^{\ell}$ approximately satisfies the true Newton correction linear system

(2.2)    $f'(u^{\ell-1})\,\delta u = -f(u^{\ell-1}),$

in the sense that the linear residual norm $\|f'(u^{\ell-1})\,\delta u^{\ell}+f(u^{\ell-1})\|$ is sufficiently small. Typically the right-hand side of the linear Newton correction equation, which is the negative of the nonlinear residual vector $f(u^{\ell-1})$, is evaluated to full precision. The inexactness arises from an incomplete convergence employing the true Jacobian matrix, $f'(u)$, freshly evaluated at $u^{\ell-1}$, or from the employment of an inexact or a "lagged" Jacobian.

An exact Newton method is rarely optimal in terms of memory and CPU resources for large-scale problems, such as finely resolved multidimensional PDE simulations. The pioneering work in showing that properly tuned inexact Newton methods can save enormous amounts of work over a sequence of Newton iterations, while still converging asymptotically quadratically, is [9].
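Since the nonlinear residual $f(u)$ both drives the Newton correction (2.2) and, below, determines termination of the outer iteration, it is worth noting that its evaluation is a single sweep of the stencil formula above over the grid. The following serial Fortran sketch is illustrative only; the routine and variable names, and the choice to store boundary points in the arrays, are our assumptions and are not taken from any of the codes compared in this paper.

    ! Illustrative sketch: evaluate f_ij(u) for the discretized Bratu problem.
    ! u and f are (0:n,0:n) arrays that include boundary points; h = 1/n.
    subroutine bratu_residual(n, lambda, u, f)
      implicit none
      integer, intent(in)           :: n
      double precision, intent(in)  :: lambda, u(0:n,0:n)
      double precision, intent(out) :: f(0:n,0:n)
      double precision :: h
      integer :: i, j
      h = 1.0d0 / dble(n)
      do j = 0, n
         do i = 0, n
            if (i == 0 .or. i == n .or. j == 0 .or. j == n) then
               ! boundary rows: enforce u = 0 (one nonzero per Jacobian row)
               f(i,j) = u(i,j)
            else
               ! interior rows: 5-point stencil plus the nonlinear source term
               f(i,j) = 4.0d0*u(i,j) - u(i-1,j) - u(i+1,j) &
                      - u(i,j-1) - u(i,j+1) - h*h*lambda*exp(u(i,j))
            end if
         end do
      end do
    end subroutine bratu_residual

In the distributed-memory versions, the same loops run over each processor's subdomain, with the values of $u$ along subdomain edges supplied by neighboring processors through the ghost-point exchange discussed under the Krylov iteration below.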
We terminate the nonlinear iterations at the $\ell$ for which the norm of the nonlinear residual first falls below a threshold defined relative to the initial residual: $\|f(u^{\ell})\|/\|f(u^{0})\| < \tau_{\rm rel}$. Our $\tau_{\rm rel}$ in Section 4 is a loose 0.005, to keep total running times modest in the unpreconditioned cases considered below, since the asymptotic convergence behavior of the method has been well studied elsewhere.

Inner Iteration: Krylov. A Newton-Krylov method uses a Krylov method to solve (2.2) for $\delta u^{\ell}$. From a computational point of view, one of the most important characteristics of a Krylov method for the linear system $Ax=b$ is that information about the matrix $A$ needs to be accessed only in the form of matrix-vector products in a relatively small number of carefully chosen directions. When the matrix $A$ represents the Jacobian of a discretized system of PDEs, each of these matrix-vector products is similar in computational and communication cost to a stencil update phase of an explicit method applied to the same set of discrete conservation equations. Periodic nearest-neighbor communication is required to "ghost" the values present in the boundary stencils of one processor but maintained and updated by a neighboring processor.

We use the restarted generalized minimal residual (GMRES) [30] method for the iterative solution of the linearized equation. (Though the linearized operator is self-adjoint, we do not exploit symmetry in the iterative solver, since symmetry is too special for general applications and since our implementation of the discrete boundary conditions does not preserve symmetry.) GMRES constructs an approximate solution $x=\sum_{i=1}^{m}c_i v_i$ as a linear combination of an orthogonal basis $v_i$ of a Krylov subspace, $\mathcal{K}=\{r_0, Ar_0, A^2 r_0, \ldots\}$, built from an initial residual vector, $r_0 = b - Ax_0$, by matrix-vector products and a Gram-Schmidt orthogonalization process. This Gram-Schmidt process requires periodic global reduction operations to accumulate the locally summed portions of the inner products. We employ the conventional modified Gram-Schmidt process that reduces each inner product in sequence, as opposed to the more communication-efficient version that simultaneously reduces a batch of inner products. However, in well-preconditioned time-evolution problems and on large numbers of processors, we often prefer the batched version.

Restarted GMRES of dimension $m$ finds the optimal solution of $Ax=b$ in a least-squares sense within the current Krylov space of dimension up to $m$, and repeats the process with a new subspace built from the residual of the optimal solution in the previous subspace if the resulting linear residual does not satisfy the convergence criterion. The residual norm is monitored at each intermediate stage as a by-product of advancing the iteration. GMRES is well suited for inexact Newton methods, since its convergence can be terminated at any point, with an overall cost that is monotonically and relatively smoothly related to convergence progress. By restarting GMRES at relatively short intervals we can keep its memory requirements bounded. However, a global convergence theory exists only for the nonrestarted version. For problems with highly indefinite matrices, $m$ may need to approach the full matrix dimension, but this does not occur for practically desired $\lambda$ in (2.1).

We define the GMRES iteration for $\delta u^{\ell}$ at each outer iteration $\ell$ with an inner iteration index, $k=0,1,\ldots$, such that $\delta u^{\ell,0}\equiv 0$ and the final (converged) inner iterate is taken as $\delta u^{\ell}$.
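Before stating the stopping criteria for this inner iteration, we note that the global reductions just mentioned are the only communication in the algorithm that is not nearest-neighbor. The sketch below shows the modified Gram-Schmidt step as an MPI Fortran fragment; it is illustrative only, with assumed names, and each processor is taken to own nloc rows of the Krylov basis.

    ! Illustrative sketch: orthogonalize the new Krylov vector w against the
    ! current basis v(:,1:k) by modified Gram-Schmidt.  Each inner product needs
    ! a global sum over all processors, so the reductions occur one at a time.
    subroutine mgs_orthogonalize(nloc, k, v, w, hcol, comm)
      use mpi
      implicit none
      integer, intent(in)             :: nloc, k, comm
      double precision, intent(in)    :: v(nloc,k)
      double precision, intent(inout) :: w(nloc)
      double precision, intent(out)   :: hcol(k)
      double precision :: s_local, s_global
      integer :: i, ierr
      do i = 1, k
         s_local = dot_product(v(:,i), w)            ! locally owned portion
         call MPI_Allreduce(s_local, s_global, 1, MPI_DOUBLE_PRECISION, &
                            MPI_SUM, comm, ierr)     ! one reduction per inner product
         hcol(i) = s_global
         w = w - s_global*v(:,i)                     ! DAXPY, purely local
      end do
    end subroutine mgs_orthogonalize

The batched (classical Gram-Schmidt) variant referred to above would instead compute all $k$ local dot products first and combine them in a single reduction of length $k$, trading one reduction per direction for one per orthogonalization sweep, at some cost in numerical robustness.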
We terminate GMRES at the $k$ for which the norm of the linear residual first falls below a threshold defined relative to the initial: $\|f(u^{\ell-1})+f'(u^{\ell-1})\,\delta u^{\ell,k}\|/\|f(u^{\ell-1})\| < \sigma_{\rm rel}$, or at which it falls below an absolute threshold: $\|f(u^{\ell-1})+f'(u^{\ell-1})\,\delta u^{\ell}\| < \sigma_{\rm abs}$. In the experiments reported below, $\sigma_{\rm rel}$ is 0.5 and $\sigma_{\rm abs}$ is 0.005. (Ordinarily, in an application for which good preconditioning is easily afforded, such as this one, we would employ a tighter $\sigma_{\rm rel}$. However, we wish to compare preconditioned and unpreconditioned cases while keeping the comparison as uncomplicated by parameter differences as possible.) A relative mix of matrix-vector multiplies, function evaluations, inner products, and DAXPYs similar to that of more complex applications is achieved with these settings. The single task that is performed more frequently relative to the rest than might occur in practice is that of Jacobian matrix evaluation, which we carry out on every Newton step.

Inner Iteration Preconditioning: Schwarz. A Newton-Krylov-Schwarz method combines a Newton-Krylov (NK) method with a Krylov-Schwarz (KS) method. If the Jacobian $A$ is ill-conditioned, the Krylov method will require an unacceptably large number of iterations. The system can be transformed into the equivalent form $B^{-1}Ax = B^{-1}b$ through the action of a preconditioner, $B$, whose inverse action approximates that of $A$, but at smaller cost. It is in the choice of preconditioning where the battle for low computational cost and scalable parallelism is usually won or lost. In KS methods, the preconditioning is introduced on a subdomain-by-subdomain basis through conveniently concurrently computable approximations to local Jacobians. Such Schwarz-type preconditioning provides good data locality for parallel implementations over a range of parallel granularities, allowing significant architectural adaptability [13]. In our tests, the preconditioning is applied on the right-hand side; that is, we solve $My=b$, where $M=AB^{-1}$, and recover $x=B^{-1}y$ with a final application of the preconditioner to the $y$ that represents the converged solution.

Two-level Additive Schwarz preconditioning [11], with modest overlap between the subdomains and a coarse grid, is "optimal" for this problem, for sufficiently small $\lambda$. (We use "optimal" in the sense that the convergence rate is asymptotically independent of the fineness of the grid and the granularity of the partitioning into subdomains. For more formally stated conditions on the overlap and the coarse grid required for this, see [31].) However, for conformity with common parallel practice and simplicity of coding, we employ a "poor man's" Additive Schwarz, namely single-level, zero-overlap subdomain-block Jacobi. We further approximate the subdomain-block Jacobi by performing just a single iteration of zero-fill incomplete lower/upper factorization (ILU) on each subdomain during each preconditioner phase. These latter two simplifications (zero overlap and zero fill) save communication, computation, and memory relative to preconditioners with modest overlap and modest fill that possess provably superior convergence rates. Domain-based parallelism is recognized by architects and algorithmicists as the form of data parallelism that most effectively exploits contemporary multi-level memory hierarchy microprocessors [8, 25, 33].
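To make the subdomain preconditioner concrete, the following Fortran sketch shows zero-fill ILU for the 5-point stencil on one processor's block, with the local Jacobian stored by diagonals. The names and storage scheme are assumptions for illustration, not code from any of the implementations compared here. Stencil couplings that reach into a neighboring subdomain are simply not used (zero overlap), so both the factorization and the triangular solves require no communication; this decoupling is what makes each preconditioner application perfectly concurrent, at the price of a convergence rate that slowly degrades as the number of subdomains grows.

    ! Illustrative sketch: ILU(0) for a 5-point stencil on one nx-by-ny subdomain
    ! block.  ac, aw, ae, as, an hold the center, west, east, south, and north
    ! entries of the local Jacobian block; couplings across the block boundary
    ! are dropped (zero overlap), which the i>1 / j>1 guards implement.
    subroutine ilu0_factor(nx, ny, ac, aw, ae, as, an, d)
      implicit none
      integer, intent(in)           :: nx, ny
      double precision, intent(in)  :: ac(nx,ny), aw(nx,ny), ae(nx,ny), as(nx,ny), an(nx,ny)
      double precision, intent(out) :: d(nx,ny)    ! modified pivots of the ILU factors
      integer :: i, j
      do j = 1, ny
         do i = 1, nx
            d(i,j) = ac(i,j)
            if (i > 1) d(i,j) = d(i,j) - aw(i,j)*ae(i-1,j)/d(i-1,j)
            if (j > 1) d(i,j) = d(i,j) - as(i,j)*an(i,j-1)/d(i,j-1)
         end do
      end do
    end subroutine ilu0_factor

    ! Apply the preconditioner on the local block: solve (D+L) D^{-1} (D+U) z = r,
    ! where L and U are the strict lower and upper parts of the local Jacobian.
    subroutine ilu0_solve(nx, ny, aw, ae, as, an, d, r, z)
      implicit none
      integer, intent(in)           :: nx, ny
      double precision, intent(in)  :: aw(nx,ny), ae(nx,ny), as(nx,ny), an(nx,ny)
      double precision, intent(in)  :: d(nx,ny), r(nx,ny)
      double precision, intent(out) :: z(nx,ny)
      integer :: i, j
      do j = 1, ny                                  ! forward (lower triangular) sweep
         do i = 1, nx
            z(i,j) = r(i,j)
            if (i > 1) z(i,j) = z(i,j) - aw(i,j)*z(i-1,j)
            if (j > 1) z(i,j) = z(i,j) - as(i,j)*z(i,j-1)
            z(i,j) = z(i,j) / d(i,j)
         end do
      end do
      do j = ny, 1, -1                              ! backward (upper triangular) sweep
         do i = nx, 1, -1
            if (i < nx) z(i,j) = z(i,j) - ae(i,j)*z(i+1,j)/d(i,j)
            if (j < ny) z(i,j) = z(i,j) - an(i,j)*z(i,j+1)/d(i,j)
         end do
      end do
    end subroutine ilu0_solve

These sweeps are the sparse triangular solution recurrences listed among the basic kernels in the introduction; their inherently sequential character is confined to each subdomain, which is precisely why the block-Jacobi simplification parallelizes so easily.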
Schwarz-type domain decomposition methods have been extensively developed for finite difference/element/volume PDE discretizations over the past decade, as reported in the annual proceedings of the international conferences on domain decomposition methods (see, e.g., [5] and the references therein). The trade-off between cost per iteration and number of iterations is variously resolved in the parallel implicit PDE literature, but our choices are rather common and not far from optimal, in practice.

Algorithmic Behavior. Contours of the initial iterate ($u^{0}$) and final solution ($u^{\infty}$) for our test case are shown in Figure 2.2. Figure 2.3 contains a convergence history for Schwarz-ILU preconditioning on a $512\times512$ grid and for no preconditioning on a quarter-size $256\times256$ grid. The convergence plot depicts in a single graph the outer Newton history and the sequence of inner GMRES histories, as a function of cumulative GMRES iterations; thus, it plots incremental progress against a computational work unit that approximately corresponds to the conventional multigrid work unit of a complete set of stencil operations on the grid. The plateaus in the residual norm plots correspond to successive values of $\|f(u^{\ell})\|$, $\ell = 0,1,\ldots$. (There are five such intermediate plateaus in the preconditioned case, separating the six Newton correction cycles.) The typically concave-down arcs connecting the plateaus correspond to $\|f(u^{\ell-1})+f'(u^{\ell-1})\,\delta u^{\ell,k}\|$, $k=0,1,\ldots$, for each $\ell$. By Taylor's theorem, $f(u^{\ell}) \approx f(u^{\ell-1})+f'(u^{\ell-1})\,\delta u^{\ell}+O((\delta u^{\ell})^{2})$, so for truncated inner iterations, for which $\delta u^{\ell}$ is small, the Taylor estimate for the nonlinear residual norm at the end of every Newton step is an excellent approximation to the true nonlinear residual norm at the beginning of the next Newton step. We do not actually evaluate the true nonlinear residual norm more frequently than once at the end of each cycle of GMRES iterations (that is, on the plateaus); the intermediate arcs are Taylor-based interpolations.

Fig. 2.1. Computational domain interface, showing on-processor (open arrowhead) and off-processor (filled arrowhead) data dependencies for the 5-point star stencil.

Fig. 2.2. Contour plots of initial condition and converged solution.

Fig. 2.3. Convergence histories of illustrative unpreconditioned ($256\times256$) and global ILU-preconditioned ($512\times512$) cases.

3. Parallel Implementations.

3.1. HPF Implementation. High Performance Fortran (HPF) is a set of extensions to Fortran, designed to facilitate efficient data-parallel programming on a wide range of parallel architectures [15]. The basic approach of HPF is to provide directives that allow the programmer to specify the distribution of data across processors, which in turn helps the compiler effectively exploit the parallelism. Using these directives, the user provides high-level "hints" about data locality, while the compiler generates the actual low-level parallel code for communication and scheduling that is appropriate for the target architecture.
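To give the flavor of these directives before describing the conversion of our own code (the distributions actually used for the Bratu solver appear later in this subsection), the fragment below pairs a distribution directive with an independent loop. It is a generic, illustrative example with assumed array names and sizes, not an excerpt from the Bratu code.

    ! Illustrative HPF fragment.  The directives are structured comments: a
    ! Fortran 90 compiler ignores them, while an HPF compiler uses them to place
    ! data and to parallelize the loop they annotate.
    real, dimension (1000,1000) :: a, b
    !HPF$ processors p(4)                  ! a 1-D arrangement of 4 abstract processors
    !HPF$ distribute (*, block) onto p :: a, b
    integer :: i, j
    !HPF$ independent                      ! assert: iterations of the j loop are independent
    do j = 2, 999
       do i = 2, 999
          b(i,j) = 0.25 * (a(i-1,j) + a(i+1,j) + a(i,j-1) + a(i,j+1))
       end do
    end do

Here the j loop ranges over the block-distributed dimension and is the one spread across processors; the references to a(i,j-1) and a(i,j+1) at block edges are what the compiler must translate into nearest-neighbor communication.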
The HPF programming paradigm provides a global name space and a single thread of control, allowing the code to remain essentially sequential, with no explicit tasking or communication statements. The goal is to allow architecture-specific compilers to transform this high-level specification into efficient explicitly parallel code for a wide variety of architectures.

HPF provides an extensive set of directives to specify the mapping of array elements to memory regions referred to as "abstract processors." Arrays are first aligned relative to each other, and then the aligned group of arrays is distributed onto a rectilinear arrangement of abstract processors. The distribution directives allow each dimension of an array to be independently distributed. The simplest forms of distribution are block and cyclic; the former breaks the elements of a dimension of the array into contiguous blocks that are distributed across the target set of abstract processors, while the latter distributes the elements cyclically across the abstract processors.

Data parallelism in the code can be expressed using Fortran array statements. HPF provides the independent directive, which can be used to assert that the iterations of a loop do not have any loop-carried dependencies and thus can be executed in parallel.

HPF is well suited for data-parallel programming. However, in order to accommodate other programming paradigms, HPF provides extrinsic procedures. These define an explicit interface and allow codes expressed using a different language, e.g., C, or a different paradigm, such as an explicit message-passing code, to be called from an HPF program.

The first version of HPF, version 1.0, was released in 1994 and used Fortran 90 as its base language. HPF 2.0, released in January 1997, added new features to the language while modifying and deleting others. Some of the HPF 1 features, e.g., the forall statement and construct, were dropped because they have been incorporated into Fortran 95. The current compilers for HPF, including the PGI compiler used for generating the performance figures in this paper, support the features of HPF 1 only and use Fortran 90 as the base language.

We have provided only a brief description of some of the features of HPF. A full description can be found in [15], while a discussion of how to use these features in various applications can be found in [7, 28, 29].

Conversion of the Code to HPF. The original code for the Bratu problem was a Fortran 77 implementation of the NKS method of Section 2, written by one of the authors, which pre-dated the PETSc NKS implementation. In this subsection we describe the changes made to the Fortran 77 code to port it to HPF, and the reasons for the changes.

Fortran's sequence and storage association models are natural concepts only on machines with linearly addressed memory and cause inefficiencies when the underlying memory is physically distributed. Since HPF targets architectures with distributed memories, it does not support storage and sequence association for data objects that have been explicitly mapped.
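As a concrete illustration of the idiom at issue (with purely hypothetical names), a Fortran 77 code commonly passes a flat workspace array and redeclares it inside the callee with a shape known only at runtime, relying on sequence association; once such an array carries an HPF distribution directive, this redimensioning across the procedure boundary is no longer permitted.

    c Illustrative Fortran 77 fragment: the caller passes a flat workspace and the
    c callee reinterprets it as a 2-D field via sequence association, which HPF
    c forbids for explicitly mapped (distributed) arrays.
          subroutine driver(n, work)
          integer n
          double precision work(*)
          call smooth(work, n+1, n+1)
          end

          subroutine smooth(u, nx, ny)
          integer nx, ny
          double precision u(nx, ny)
    c     ... operate on u as a two-dimensional field ...
          end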
The original code relied on Fortran's model of sequence association to redimension arrays across procedures in order to allow the problem size, and thus the size of the data arrays, to be determined at runtime. The code had to be rewritten so that the sizes of the arrays are hardwired throughout and there is no redimensioning of arrays across procedure boundaries. The code could have been converted to use Fortran 90 allocatable arrays; however, we chose to hardwire the sizes of the arrays. This implied that the code needed to be recompiled whenever the problem size was changed. (This is, of course, no significant sacrifice of programmer convenience or code generality when accomplished through parameter and include statements and makefiles. It does, however, cost the time of recompilation.)

We mention a few other low-level details of the conversion because they address issues that may be more widely applicable in ports to HPF. During the process of conversion, some of the simple do loops were converted into array statements; however, most of the loops were left untouched and were automatically parallelized by PGI's HPF compiler. That is, we did not need to use either the forall construct or the independent directive for these loops; they were simple enough for the compiler to analyze and parallelize automatically. Along with this, two BLAS library routines used in the original code, ddot and dnrm2, were explicitly coded, since the BLAS libraries have not been converted for use with HPF codes.

The original solver was written for a system of equations with multiple unknowns at each grid point. To specialize for a scalar equation, we deleted the corresponding inner loops and the corresponding indices from the field and coefficient arrays. We thereby converted four-dimensional Jacobian arrays (in which was expressed each nontrivial dependence of each residual equation on each unknown at each point in two-dimensional space) into two-dimensional arrays (since there is only one residual equation and one unknown per point, two-dimensional arrays suffice to store values at each grid point). This, in turn, reduced some dense point-block linear algebra subroutines to scalar operations, which we inlined.

We also rewrote the matrix-vector multiplication routine to utilize a single do loop instead of nine small loops, each of which took care of a different interior, side-boundary, or corner-boundary stencil configuration. Some trivial operations are thereby added near boundaries, but checking proximity of the boundary and setting up multiple do loops are avoided. The original nine loops caused the HPF compiler to generate multiple communication statements. Rewriting the code to use a single do loop allowed the compiler to generate the optimal number of communication statements, even though a few extra values had to be communicated.

The sequential ILU routine in the original code was converted to subdomain-block ILU to conform to the simplest preconditioning option in the PETSc library. This was done by strip-mining the loops in the x- and y-directions to run over the blocks, with a sequential ILU within each block. Even though there were no dependencies across the block loops, the HPF compiler could not optimize the code and generated a locality check within the internal loop. This caused unnecessary overhead in the generated code. We avoided the overhead by creating a subroutine for the code within the block loops and declaring it to be extrinsic.
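The extrinsic mechanism referred to here is HPF's escape hatch to per-processor code. A hedged sketch of such a declaration follows; the routine and argument names are hypothetical, and HPF_LOCAL is the extrinsic kind for routines that execute on each processor against its local piece of a distributed array.

    ! Illustrative sketch of an extrinsic interface.  The caller passes the
    ! distributed array; each processor's copy of the routine receives only its
    ! local block, so the sequential ILU sweep inside runs without communication
    ! or locality checks.
    interface
       extrinsic (hpf_local) subroutine block_ilu(ulocal, nx, nyloc)
          real, dimension(:,:) :: ulocal   ! local block of the distributed array
          integer :: nx, nyloc
       end subroutine block_ilu
    end interface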
Since the HPF compiler ensures that a copy of an extrinsic routine is called on each processor, no extraneous communication or locality checks now occur while the block-sequential ILU code is executed on each processor.

The HPF distribution directives are used to distribute the arrays by block. For example, when experimenting with a one-dimensional distribution, a typical array is mapped as follows:

    real, dimension (nxi,nyi) :: U
    !HPF$ distribute (*, block) :: U
    ...
    do i = 1, nxi
       do j = 1, nyi
          U(i,j) = ...
       end do
    end do

The above distribute directive maps the second dimension of the array U by block, i.e., the nyi columns of the array U are block-distributed across the underlying processors. As shown by the do loops above, the computation in an HPF code is expressed using global indices, independent of the distribution of the arrays. To change the mapping of the array U to a two-dimensional distribution, the distribution directive needs to be modified as follows, so as to map a contiguous sub-block of the array onto each processor in a two-dimensional array of processors:

    real, dimension (nxi,nyi) :: U
    !HPF$ distribute (block, block) :: U
    ...
    do i = 1, nxi
       do j = 1, nyi
          U(i,j) = ...
       end do
    end do

The code expressing the computation remains the same; only the distribute directive itself is changed. It is the compiler's responsibility to generate the correct parallel code along with the necessary communication in each case.

Most of the revisions discussed above do nothing more than convert Fortran code written for sequential execution into an equivalent sequential form that is easier for the HPF compiler to analyze, thus allowing it to generate more efficient parallel code. The only two exceptions are (a) the mapping directives, which are comments (see the code example above) and are thus ignored by a Fortran 90 compiler, and (b) the declaration of two routines, the ILU factorization and forward/back solve routines, as extrinsic. Regularity in our problem helped us in parallelizing the code with these extrinsic calls and a small number of HPF directives. The HPF mapping directives themselves constitute only about 5% of the line count of the total code.

The compilation command, showing the autoparallelization switch and the optimization level used in the performance-oriented executions, is:

    pghpf -Mpreprocess -Mautopar -O3 -o bratu bratu.hpf

3.2. CAPTools Implementation. The Computer Aided Parallelisation Tools (CAPTools) [16] were initially developed to assist in the process of parallelizing computational mechanics (CM) codes. A key characteristic of CAPTools is its recognition that the code parallelization process cannot, in general, be fully automated; the interaction between the code parallelizer and CAPTools is therefore crucial during the process of parallelizing codes. An overview of the structure of CAPTools is shown in Figure 3.1.

Fig. 3.1. Overview of CAPTools.
The main components of the tools comprise:
• A detailed control and dependence analysis of the source code, including the acquisition and embedding of user-supplied knowledge.
• User definition of the parallelization strategy.
• Automatic adaptation of the source code to an equivalent parallel implementation.
• Automatic migration, merger, and generation of all required communications.
• Code optimization, including loop interchange, loop splitting, and communication/calculation overlap.

The creation of an accurate dependence graph is essential in all stages of the parallelization process, from parallelism detection to communication placement. As a result, considerable effort is made in the analysis to prove the non-existence of dependencies. This enables the calculation of a significantly higher-quality dependence graph.