Programming on Parallel Machines: GPU, Multicore, Clusters and More

Programming on Parallel Machines
GPU, Multicore, Clusters and More

Norm Matloff
University of California, Davis

See the Creative Commons license at http://heather.cs.ucdavis.edu/~matloff/probstatbook.html. This book is often revised and updated; the latest edition is available at http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf.

CUDA and NVIDIA are registered trademarks. The author has striven to minimize the number of errors, but no guarantee is made as to the accuracy of the contents of this book.

Author's Biographical Sketch

Dr. Norm Matloff is a professor of computer science at the University of California at Davis, and was formerly a professor of statistics at that university. He is a former database software developer in Silicon Valley, and has been a statistical consultant for firms such as the Kaiser Permanente Health Plan.

Dr. Matloff was born in Los Angeles, and grew up in East Los Angeles and the San Gabriel Valley. He has a PhD in pure mathematics from UCLA, specializing in probability theory and statistics. He has published numerous papers in computer science and statistics, with current research interests in parallel processing, statistical computing, and regression methodology.

Prof. Matloff is a former appointed member of IFIP Working Group 11.3, an international committee concerned with database software security, established under UNESCO. He was a founding member of the UC Davis Department of Statistics, and participated in the formation of the UCD Computer Science Department as well. He is a recipient of the campuswide Distinguished Teaching Award and Distinguished Public Service Award at UC Davis.

Dr. Matloff is the author of two published textbooks, and of a number of widely used Web tutorials on computer topics, such as the Linux operating system and the Python programming language. He and Dr. Peter Salzman are authors of The Art of Debugging with GDB, DDD, and Eclipse. Prof. Matloff's book on the R programming language, The Art of R Programming, was published in 2011. His book, Parallel Computation for Data Science, will come out in 2014. He is also the author of several open-source textbooks, including From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science (http://heather.cs.ucdavis.edu/probstatbook) and Programming on Parallel Machines (http://heather.cs.ucdavis.edu/~matloff/ParProcBook.pdf).

About This Book

Why is this book different from all other parallel programming books? It is aimed more at the practical end of things, in that:

• There is very little theoretical content, such as O() analysis, maximum theoretical speedup, PRAMs, directed acyclic graphs (DAGs) and so on.

• Real code is featured throughout.

• We use the main parallel platforms (OpenMP, CUDA and MPI) rather than languages that at this stage are largely experimental or arcane.

• The running performance themes (communications latency, memory/network contention, load balancing and so on) are interleaved throughout the book, discussed in the context of specific platforms or applications.

• Considerable attention is paid to techniques for debugging.

The main programming language used is C/C++, but some of the code is in R, which has become the pre-eminent language for data analysis. As a scripting language, R can be used for rapid prototyping. In our case here, it enables me to write examples which are much less cluttered than they would be in C/C++, and thus easier for students to discern the fundamental parallelization principles involved. For the same reason, it makes it easier for students to write their own parallel code, focusing on those principles. And R has a rich set of parallel libraries.
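To give a rough idea of that conciseness, here is a minimal sketch (not one of the book's worked examples) of a parallel computation using R's built-in parallel package; the two-worker cluster and the toy squaring function are arbitrary choices, purely for illustration:

    library(parallel)               # R's standard parallel package (snow-style interface)
    cl <- makeCluster(2)            # launch 2 worker processes (arbitrary choice)
    # split the vector 1:100 among the workers, square each element, collect the results
    squares <- parSapply(cl, 1:100, function(x) x^2)
    stopCluster(cl)                 # shut down the workers

A C/C++ version would require explicit thread or process management; here the partitioning of the work and the collection of the results are handled entirely by the library.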
It is assumed that the student is reasonably adept in programming, and has math background through linear algebra. An appendix reviews the parts of the latter needed for this book. Another appendix presents an overview of various systems issues that arise, such as process scheduling and virtual memory.

It should be noted that most of the code examples in the book are NOT optimized. The primary emphasis is on simplicity and clarity of the techniques and languages used. However, there is plenty of discussion of factors that affect speed, such as cache coherency issues, network delays, GPU memory structures and so on.

Here's how to get the code files you'll see in this book: The book is set in LaTeX, and the raw .tex files are available at http://heather.cs.ucdavis.edu/~matloff/158/PLN. Simply download the relevant file (the file names should be clear), then use a text editor to trim it down to the program code of interest.

In order to illustrate for students the fact that research and teaching (should) enhance each other, I will occasionally make brief references here to some of my research work.

Like all my open source textbooks, this one is constantly evolving. I continue to add new topics, new examples and so on, and of course fix bugs and improve the exposition. For that reason, it is better to link to the latest version, which will always be at http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf, rather than to copy it. Feedback is thus highly appreciated.

I wish to thank Stuart Ambler, Matt Butner, Stuart Hansen, Bill Hsu, Sameer Khan, Mikel McDaniel, Richard Minner, Lars Seeman and Johan Wikström for their comments. I'm also very grateful to Professor Hsu for making advanced GPU-equipped machines available to me.

You may also be interested in my open source textbook on probability and statistics, at http://heather.cs.ucdavis.edu/probstatbook.

This work is licensed under a Creative Commons Attribution-No Derivative Works 3.0 United States License. Copyright is retained by N. Matloff in all non-U.S. jurisdictions, but permission to use these materials in teaching is still granted, provided the authorship and licensing information here is displayed in each unit. I would appreciate being notified if you use this book for teaching, just so that I know the materials are being put to use, but this is not required.

Contents

1 Introduction to Parallel Processing
  1.1 Why Use Parallel Systems?
    1.1.1 Execution Speed
    1.1.2 Memory
    1.1.3 Distributed Processing
    1.1.4 Our Focus Here
  1.2 Parallel Processing Hardware
    1.2.1 Shared-Memory Systems
      1.2.1.1 Basic Architecture
      1.2.1.2 Multiprocessor Topologies
      1.2.1.3 Memory Issues Etc.
    1.2.2 Message-Passing Systems
      1.2.2.1 Basic Architecture
      1.2.2.2 Example: Clusters
    1.2.3 SIMD
  1.3 Programmer World Views
    1.3.1 Example: Matrix-Vector Multiply
    1.3.2 Shared-Memory
      1.3.2.1 Programmer View
      1.3.2.2 Example: Pthreads Prime Numbers Finder
      1.3.2.3 Role of the OS
      1.3.2.4 Debugging Threads Programs
      1.3.2.5 Higher-Level Threads
      1.3.2.6 Example: Sampling Bucket Sort
    1.3.3 Message Passing
      1.3.3.1 Programmer View
      1.3.3.2 Example: MPI Prime Numbers Finder
    1.3.4 Scatter/Gather
      1.3.4.1 R snow Package

2 Recurring Performance Issues
  2.1 Communication Bottlenecks
  2.2 Load Balancing
  2.3 “Embarrassingly Parallel” Applications
    2.3.1 What People Mean by “Embarrassingly Parallel”
    2.3.2 Iterative Algorithms
  2.4 Static (But Possibly Random) Task Assignment Typically Better Than Dynamic
    2.4.1 Example: Matrix-Vector Multiply
    2.4.2 Load Balance, Revisited
    2.4.3 Example: Mutual Web Outlinks
    2.4.4 Work Stealing
    2.4.5 Timing Example
  2.5 Latency and Bandwidth
  2.6 Relative Merits: Performance of Shared-Memory Vs. Message-Passing
  2.7 Memory Allocation Issues
  2.8 Issues Particular to Shared-Memory Systems

3 Shared Memory Parallelism
  3.1 What Is Shared?
  3.2 Memory Modules
    3.2.1 Interleaving
    3.2.2 Bank Conflicts and Solutions
    3.2.3 Example: Code to Implement Padding
  3.3 Interconnection Topologies
    3.3.1 SMP Systems
    3.3.2 NUMA Systems
    3.3.3 NUMA Interconnect Topologies
      3.3.3.1 Crossbar Interconnects
      3.3.3.2 Omega (or Delta) Interconnects
    3.3.4 Comparative Analysis
    3.3.5 Why Have Memory in Modules?
  3.4 Synchronization Hardware
    3.4.1 Test-and-Set Instructions
      3.4.1.1 LOCK Prefix on Intel Processors
      3.4.1.2 Locks with More Complex Interconnects
    3.4.2 May Not Need the Latest
    3.4.3 Fetch-and-Add Instructions
  3.5 Cache Issues
    3.5.1 Cache Coherency
    3.5.2 Example: the MESI Cache Coherency Protocol
    3.5.3 The Problem of “False Sharing”
  3.6 Memory-Access Consistency Policies
  3.7 Fetch-and-Add Combining within Interconnects
  3.8 Multicore Chips
  3.9 Optimal Number of Threads
  3.10 Processor Affinity
  3.11 Illusion of Shared-Memory through Software
      3.11.0.1 Software Distributed Shared Memory
      3.11.0.2 Case Study: JIAJIA
  3.12 Barrier Implementation
    3.12.1 A Use-Once Version
    3.12.2 An Attempt to Write a Reusable Version
    3.12.3 A Correct Version
    3.12.4 Refinements
      3.12.4.1 Use of Wait Operations
      3.12.4.2 Parallelizing the Barrier Operation
        3.12.4.2.1 Tree Barriers
        3.12.4.2.2 Butterfly Barriers

4 Introduction to OpenMP
  4.1 Overview
  4.2 Example: Dijkstra Shortest-Path Algorithm
    4.2.1 The Algorithm
    4.2.2 The OpenMP parallel Pragma
    4.2.3 Scope Issues
    4.2.4 The OpenMP single Pragma
    4.2.5 The OpenMP barrier Pragma
    4.2.6 Implicit Barriers
    4.2.7 The OpenMP critical Pragma
  4.3 The OpenMP for Pragma
    4.3.1 Example: Dijkstra with Parallel for Loops
    4.3.2 Nested Loops
    4.3.3 Controlling the Partitioning of Work to Threads: the schedule Clause
    4.3.4 Example: In-Place Matrix Transpose
    4.3.5 The OpenMP reduction Clause
  4.4 Example: Mandelbrot Set
  4.5 The Task Directive
    4.5.1 Example: Quicksort
  4.6 Other OpenMP Synchronization Issues
    4.6.1 The OpenMP atomic Clause
    4.6.2 Memory Consistency and the flush Pragma
  4.7 Combining Work-Sharing Constructs
  4.8 The Rest of OpenMP
  4.9 Compiling, Running and Debugging OpenMP Code
    4.9.1 Compiling
    4.9.2 Running
    4.9.3 Debugging
  4.10 Performance
    4.10.1 The Effect of Problem Size
    4.10.2 Some Fine Tuning
    4.10.3 OpenMP Internals
  4.11 Example: Root Finding
  4.12 Example: Mutual Outlinks
  4.13 Example: Transforming an Adjacency Matrix
  4.14 Locks with OpenMP
  4.15 Other Examples of OpenMP Code in This Book

5 Introduction to GPU Programming with CUDA
  5.1 Overview
  5.2 Example: Calculate Row Sums
  5.3 Understanding the Hardware Structure
    5.3.1 Processing Units
    5.3.2 Thread Operation
      5.3.2.1 SIMT Architecture
      5.3.2.2 The Problem of Thread Divergence
      5.3.2.3 “OS in Hardware”
    5.3.3 Memory Structure
      5.3.3.1 Shared and Global Memory
      5.3.3.2 Global-Memory Performance Issues
      5.3.3.3 Shared-Memory Performance Issues
      5.3.3.4 Host/Device Memory Transfer Performance Issues
      5.3.3.5 Other Types of Memory
    5.3.4 Threads Hierarchy
    5.3.5 What's NOT There
  5.4 Synchronization, Within and Between Blocks
  5.5 More on the Blocks/Threads Tradeoff
  5.6 Hardware Requirements, Installation, Compilation, Debugging
  5.7 Example: Improving the Row Sums Program
  5.8 Example: Finding the Mean Number of Mutual Outlinks
  5.9 Example: Finding Prime Numbers
  5.10 Example: Finding Cumulative Sums
  5.11 Example: Transforming an Adjacency Matrix
  5.12 Error Checking
  5.13 Loop Unrolling
  5.14 Short Vectors
