AInnt rIondtruocdtuiocnti oton to PPaarraalllleell PPrrooggrraammmmiinngg TToobbiiaass WWiittttwweerr VS SD An Introduction to Parallel Programming An Introduction to Parallel Programming Tobias Wittwer VSSD © Tobias Wittwer First edition 2006 Published by: VSSD Leeghwaterstraat 42, 2628 CA Delft, The Netherlands tel. +31 15 278 2124, telefax +31 15 278 7585, e-mail: [email protected] internet: http://www.vssd.nl/hlf URL about this book: http://www.vssd.nl/hlf/a019.htm All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photo- copying, recording, or otherwise, without the prior written permission of the publisher. ISBN-10 90-71301-78-8 ISBN-13 978-90-71301-78-0 NUR 958 Keywords: parallel programming Foreword Anycomputeruserknowsthecravingformorecomputingpower. Itwasthis“needforspeed” thatmademechoosehighperformancecomputingassubjectofmymasterthesis.Thisgaveme theopportunitytoworkwithsystemsatStuttgart’shighperformancecomputingcenter,HLRS. IspentfivemonthsparallelisingprogramsforgravityfieldmodellingusingOpenMPandMPI, andtestingthemonaCrayOpteroncluster,NECTX-7,andNECSX-6. After graduating I moved to the Netherlands for my PhD research in the Physical and Space Geodesy Groupof the DelftInstituteof EarthObservationand Space System, at thefaculty of AerospaceEngineeringoftheDelftUniversityofTechnology.Myresearchtopic,regionalgrav- ityfieldmodelling,provedtobecomputationallyintensive.Luckily,wehadaccesstoTerasand Aster, twosupercomputersatthe SARA supercomputingfacilityinAmsterdam. Myprograms werequicklyparallelised,shorteningprogramrunsfromhourstominutes. SeeingmycolleaguesstrugglewiththelimitedpoweroftheirPCsgavemetheideaofwritinga tutorialaboutparallelprogramming,togiveeveryonetheopportunitytoeasilyparalleliseheror hisprograms. MoreurgewasaddedwhenourgroupdecidedtobuyitsownLinuxcluster.Now allIneededwasanexampleprogram-andwhenIhadtowriteaprogramforsphericalharmonic analysistochecksomeresults,Ihadthisaswell. Aweekofcodingandwritingensued. Testcomputations,creatinggraphics,proofreading,and finetuning followed. My supervisor Prof. Dr.-Ing. Roland Klees reviewed the document and gavehisapproval. Ihopethatthefinishedproductwillbeusefultothereader,andasenjoyable toreadasitwastowrite. TobiasWittwer,Delft,November2006 v Contents 1 Introduction 1 1.1 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Prerequisites. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 ExampleProgram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 SystemArchitectures 3 2.1 SingleInstruction-SingleData(SISD) . . . . . . . . . . . . . . . . . . . . . . 3 2.2 SingleInstruction-MultipleData(SIMD) . . . . . . . . . . . . . . . . . . . . . 5 2.3 MultipleInstruction-MultipleData(MIMD) . . . . . . . . . . . . . . . . . . . 5 2.4 SharedMemory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.5 DistributedMemory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.6 ccNUMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.7 Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.8 MultipleInstruction-SingleData(MISD) . . . . . . . . . . . . . . . . . . . . . 10 2.9 SomeExamples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3 Software 15 3.1 Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2 BLAS&LAPACK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.3 MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.4 BLACS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.5 ScaLAPACK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 vi CONTENTS vii 4 PerformanceAnalysis 23 4.1 Timing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.2 Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.3 MeasuringPerformance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 5 SHALE-aprogramforsphericalharmonicanalysis 27 5.1 Sphericalharmonicanalysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 5.2 Directmethod . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5.3 Conjugategradientmethod . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Bibliography 51 Index 53 Chapter 1 Introduction 1.1 Goal Manyscientificcomputationsrequireaconsiderableamountofcomputingtime.Thiscomputing timecanbereducedbydistributingaproblemoverseveralprocessors.Multiprocessorcomputers usedtobequiteexpensive,andnoteverybodyhadaccesstothem. Since2005,x86-compatible CPUs designedfor desktopcomputersare availablewithtwo“cores”, whichessentiallymakes themdualprocessorsystems. MorecoresperCPUaretofollow. This cheap extra computingpower has to be used efficiently, which requires parallel program- ming. Parallel programming methods that work on dual-core PCs also work on larger shared memorysystems,andaprogramdesignedforaclusterorothertypeofdistributedmemorysys- temwillalsoperformwellonyourdual-core(ormulti-core)PC. The goalof thistutorialistogivean introductionintoallaspectsof parallel programmingthat arenecessarytowriteyourownparallelprograms. Toachievethis,itexplains • thevariousexistingarchitecturesofparallelcomputers, • thesoftwareneededforparallelprogramming,andhowtoinstallandconfigureit, • howtoanalysesoftwareandfindthepointswereparallelisationmightbehelpful, • howtowriteparallelprogramsforsharedmemorycomputersusingOpenMP, • howtowriteparallelprogramsfordistributedmemorycomputersusingMPIandScaLA- PACK. Thistutorialaimsmainlyatwritingparallelprogramsforsolvinglinearequationsystems.Ihope thatitisalsousefultogivesomehelpforparallelisingprogramsforotherapplications. 1