
Introduction to Parallel Computing: A Practical Guide with Examples in C

OXFORD TEXTS IN APPLIED AND ENGINEERING MATHEMATICS

* G. D. Smith: Numerical Solution of Partial Differential Equations, 3rd Edition
* R. Hill: A First Course in Coding Theory
* I. Anderson: A First Course in Combinatorial Mathematics, 2nd Edition
* D. J. Acheson: Elementary Fluid Dynamics
* S. Barnett: Matrices: Methods and Applications
* L. M. Hocking: Optimal Control: An Introduction to the Theory with Applications
* D. C. Ince: An Introduction to Discrete Mathematics, Formal System Specification, and Z, 2nd Edition
* O. Pretzel: Error-Correcting Codes and Finite Fields
* P. Grindrod: The Theory and Applications of Reaction–Diffusion Equations: Patterns and Waves, 2nd Edition
1. Alwyn Scott: Nonlinear Science: Emergence and Dynamics of Coherent Structures
2. D. W. Jordan and P. Smith: Nonlinear Ordinary Differential Equations: An Introduction to Dynamical Systems, 3rd Edition
3. I. J. Sobey: Introduction to Interactive Boundary Layer Theory
4. A. B. Tayler: Mathematical Models in Applied Mechanics (reissue)
5. L. Ramdas Ram-Mohan: Finite Element and Boundary Element Applications in Quantum Mechanics
6. Lapeyre et al.: Monte Carlo Methods for Transport and Diffusion Equations
7. I. Elishakoff and Y. Ren: Finite Element Methods for Structures with Large Stochastic Variations
8. Alwyn Scott: Nonlinear Science: Emergence and Dynamics of Coherent Structures, 2nd Edition
9. W. P. Petersen and P. Arbenz: Introduction to Parallel Computing

Titles marked with an asterisk (*) appeared in the Oxford Applied Mathematics and Computing Science Series, which has been folded into, and is continued by, the current series.

Introduction to Parallel Computing

W. P. Petersen
Seminar for Applied Mathematics
Department of Mathematics, ETHZ, Zurich
[email protected]

P. Arbenz
Institute for Scientific Computing
Department Informatik, ETHZ, Zurich
[email protected]

Great Clarendon Street, Oxford OX2 6DP
Oxford University Press is a department of the University of Oxford.
It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide in Oxford and New York: Auckland, Cape Town, Dar es Salaam, Hong Kong, Karachi, Kuala Lumpur, Madrid, Melbourne, Mexico City, Nairobi, New Delhi, Shanghai, Taipei, Toronto; with offices in Argentina, Austria, Brazil, Chile, Czech Republic, France, Greece, Guatemala, Hungary, Italy, Japan, Poland, Portugal, Singapore, South Korea, Switzerland, Thailand, Turkey, Ukraine, Vietnam.

Oxford is a registered trademark of Oxford University Press in the UK and in certain other countries.

Published in the United States by Oxford University Press Inc., New York

© Oxford University Press 2004

The moral rights of the author have been asserted. Database right Oxford University Press (maker). First published 2004.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above.

You must not circulate this book in any other binding or cover and you must impose the same condition on any acquirer.

A catalogue record for this title is available from the British Library.
Library of Congress Cataloging in Publication Data (Data available)

Typeset by Newgen Imaging Systems (P) Ltd., Chennai, India
Printed in Great Britain on acid-free paper by Biddles Ltd., King's Lynn, Norfolk

ISBN 0198515766 (hbk), 0198515774 (pbk)

PREFACE

The contents of this book are a distillation of many projects which have subsequently become the material for a course on parallel computing given for several years at the Swiss Federal Institute of Technology in Zürich. Students in this course have typically been in their third or fourth year, or graduate students, and have come from computer science, physics, mathematics, chemistry, and programs for computational science and engineering. Student contributions, whether large or small, critical or encouraging, have helped crystallize our thinking in a quickly changing area. It is, alas, a subject which overlaps with all scientific and engineering disciplines. Hence, the problem is not a paucity of material but rather the distillation of an overflowing cornucopia. One of the students' most often voiced complaints has been organizational and of information overload. It is thus the point of this book to attempt some organization within a quickly changing interdisciplinary topic. In all cases, we will focus our energies on floating point calculations for science and engineering applications.

Our own thinking has evolved as well: a quarter of a century of experience in supercomputing has been sobering. One source of amusement as well as amazement to us has been that the power of 1980s supercomputers has been brought in abundance to PCs and Macs. Who would have guessed that vector processing computers can now be easily hauled about in students' backpacks? Furthermore, the early 1990s dismissive sobriquets about dinosaurs lead us to chuckle that the most elegant of creatures, birds, are those ancients' successors. Just as those early 1990s contemptuous dismissals of magnetic storage media must now be held up to the fact that 2 GB disk drives are now 1 in. in diameter and mounted in PC-cards. Thus, we have to proceed with what exists now and hope that these ideas will have some relevance tomorrow.
Until the end of 2004, for the three previous years, the tip-top of the famous Top 500 list of supercomputers [143] was the Yokohama Earth Simulator. Currently, the top three entries in the list rely on large numbers of commodity processors: 65,536 IBM PowerPC 440 processors at Livermore National Laboratory; 40,960 IBM PowerPC processors at the IBM Research Laboratory in Yorktown Heights; and 10,160 Intel Itanium II processors connected by an Infiniband network [75] and constructed by Silicon Graphics, Inc. at the NASA Ames Research Centre. The Earth Simulator is now number four and has 5,120 SX-6 vector processors from NEC Corporation.

Here are some basic facts to consider for a truly high performance cluster:

1. Modern computer architectures run internal clocks with cycle times of less than a nanosecond. This defines the time scale of floating point calculations.
2. For a processor to get a datum within a node, which sees a coherent memory image but which resides in a different processor's memory, typically requires a delay of order 1 µs. Note that this is 1000 or more clock cycles.
3. For a node to get a datum which is on a different node by using message passing takes 100 or more µs.

Thus we have the following not particularly profound observations: if the data are local to a processor, they may be used very quickly; if the data are on a tightly coupled node of processors, there should be roughly a thousand or more data items to amortize the delay of fetching them from other processors' memories; and finally, if the data must be fetched from other nodes, there should be 100 times more than that if we expect to write off the delay in getting them. So it is that NEC and Cray have moved toward strong nodes, with even stronger processors on these nodes. They have to expect that programs will have blocked or segmented data structures. As we will clearly see, getting data from memory to the CPU is the problem of high speed computing, not only for NEC and Cray machines, but even more so for the modern machines with hierarchical memory. It is almost as if floating point operations take insignificant time, while data access is everything. This is hard to swallow: the classical books go on in depth about how to minimize floating point operations, but a floating point operation (flop) count is only an indirect measure of an algorithm's efficiency. A lower flop count only approximately reflects that fewer data are accessed. Therefore, the best algorithms are those which encourage data locality. One cannot expect a summation of elements in an array to be efficient when each element is on a separate node.
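To make the amortization arithmetic concrete, the following small C program simply divides the round figures quoted in the list above (roughly 1 ns per floating point operation, 1 µs per intra-node fetch, 100 µs per message). It is a sketch using the preface's illustrative numbers, not measurements of any particular machine.

    /* Break-even counts implied by the latencies quoted above.        */
    /* These are the preface's round numbers, not measured values.     */
    #include <stdio.h>

    int main(void)
    {
        double t_flop = 1.0e-9;  /* ~1 ns per floating point operation (one clock)       */
        double t_node = 1.0e-6;  /* ~1 us to reach another processor's memory on a node  */
        double t_msg  = 1.0e-4;  /* ~100 us for a message between nodes                  */

        printf("flops needed to hide an intra-node fetch:  %.0f\n", t_node / t_flop);
        printf("flops needed to hide an inter-node message: %.0f\n", t_msg  / t_flop);
        return 0;
    }

Run with these numbers it prints roughly 1000 and 100,000, which is exactly the blocking argument made above: a thousand or so data items per remote-memory fetch, and a hundred times that per message.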
This is why we have organized the book in the following manner. Basically, we start from the lowest level and work up.

1. Chapter 1 contains a discussion of memory and data dependencies. When one result is written into a memory location subsequently used/modified by an independent process, who updates what and when becomes a matter of considerable importance.
2. Chapter 2 provides some theoretical background for the applications and examples used in the remainder of the book.
3. Chapter 3 discusses instruction level parallelism, particularly vectorization. Processor architecture is important here, so the discussion is often close to the hardware. We take close looks at the Intel Pentium III, Pentium 4, and Apple/Motorola G-4 chips.
4. Chapter 4 concerns shared memory parallelism. This mode assumes that data are local to nodes or at least part of a coherent memory image shared by processors. OpenMP will be the model for handling this paradigm.
5. Chapter 5 is at the next higher level and considers message passing. Our model will be the message passing interface, MPI, and variants and tools built on this system.

Finally, a very important decision was made to use explicit examples to show how all these pieces work. We feel that one learns by examples and by proceeding from the specific to the general. Our choices of examples are mostly basic and familiar: linear algebra (direct solvers for dense matrices, iterative solvers for large sparse matrices), Fast Fourier Transform, and Monte Carlo simulations. We hope, however, that some less familiar topics we have included will be edifying. For example, how does one do large problems, or high dimensional ones? It is also not enough to show program snippets. How does one compile these things? How does one specify how many processors are to be used? Where are the libraries? Here, again, we rely on examples.

W. P. Petersen and P. Arbenz
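As one small illustration of what those compile-and-run questions look like in practice, here is a minimal MPI sketch in C. It is not taken from the book's own example set, and the mpicc and mpirun commands in the comments follow common MPICH/Open MPI conventions that may be named differently on a particular cluster.

    /* Minimal MPI "who am I" program.                                  */
    /* Typical build/run (wrapper names vary by installation):          */
    /*     mpicc hello.c -o hello                                       */
    /*     mpirun -np 4 ./hello      <- -np selects four processes      */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);                /* start the MPI runtime      */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id          */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes  */
        printf("process %d of %d\n", rank, size);
        MPI_Finalize();                        /* shut the runtime down      */
        return 0;
    }

The number of processes is chosen at launch time (the -np argument above), not in the source; the compiler wrapper and library paths are whatever the local MPI installation provides.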
Authors' comments on the corrected second printing

We are grateful to many students and colleagues who have found errata in the one and a half years since the first printing. In particular, we would like to thank Christian Balderer, Sven Knudsen, and Abraham Nieva, who took the time to carefully list errors they discovered. It is a difficult matter to keep up with such a quickly changing area as high performance computing, both regarding hardware developments and algorithms tuned to new machines. Thus we are indeed thankful to our colleagues for their helpful comments and criticisms.

July 1, 2005.

ACKNOWLEDGMENTS

Our debt to our students, assistants, system administrators, and colleagues is awesome. Former assistants have made significant contributions and include Oscar Chinellato, Dr Roman Geus, and Dr Andrea Scascighini, particularly for their contributions to the exercises. The help of our system gurus cannot be overstated. George Sigut (our Beowulf machine), Bruno Loepfe (our Cray cluster), and Tonko Racic (our HP9000 cluster) have been cheerful, encouraging, and at every turn extremely competent. Other contributors who have read parts of an always changing manuscript and who tried to keep us on track have been Prof. Michael Mascagni and Dr Michael Vollmer. Intel Corporation's Dr Vollmer did so much to provide technical material, examples, and advice, as well as trying hard to keep us out of trouble by reading portions of an evolving text, that a "thank you" hardly seems enough. Other helpful contributors were Adrian Burri, Mario Rütti, Dr Olivier Byrde of Cray Research and ETH, and Dr Bruce Greer of Intel. Despite their valiant efforts, doubtless errors still remain, for which only the authors are to blame. We are also sincerely thankful for the support and encouragement of Professors Walter Gander, Gaston Gonnet, Martin Gutknecht, Rolf Jeltsch, and Christoph Schwab. Having colleagues like them helps make many things worthwhile. Finally, we would like to thank Alison Jones, Kate Pullen, Anita Petrie, and the staff of Oxford University Press for their patience and hard work.

CONTENTS

List of Figures   xv
List of Tables    xvii

1  BASIC ISSUES   1
   1.1  Memory   1
   1.2  Memory systems   5
        1.2.1  Cache designs   5
        1.2.2  Pipelines, instruction scheduling, and loop unrolling   8
   1.3  Multiple processors and processes   15
   1.4  Networks   15

2  APPLICATIONS   18
   2.1  Linear algebra   18
   2.2  LAPACK and the BLAS   21
        2.2.1  Typical performance numbers for the BLAS   22
        2.2.2  Solving systems of equations with LAPACK   23
   2.3  Linear algebra: sparse matrices, iterative methods   28
        2.3.1  Stationary iterations   29
        2.3.2  Jacobi iteration   30
        2.3.3  Gauss–Seidel (GS) iteration   31
        2.3.4  Successive and symmetric successive overrelaxation   31
        2.3.5  Krylov subspace methods   34
        2.3.6  The generalized minimal residual method (GMRES)   34
        2.3.7  The conjugate gradient (CG) method   36
        2.3.8  Parallelization   39
        2.3.9  The sparse matrix vector product   39
        2.3.10 Preconditioning and parallel preconditioning   42
   2.4  Fast Fourier Transform (FFT)   49
        2.4.1  Symmetries   55
   2.5  Monte Carlo (MC) methods   57
        2.5.1  Random numbers and independent streams   58
        2.5.2  Uniform distributions   60
        2.5.3  Non-uniform distributions   64
