Al-Saeed, Majed Mohammed Abdullah (2015) Profiling a parallel domain specific language using off-the-shelf tools. PhD thesis. http://theses.gla.ac.uk/6561/

PROFILING A PARALLEL DOMAIN SPECIFIC LANGUAGE USING OFF-THE-SHELF TOOLS

Majed Mohammed Abdullah Al-Saeed

Submitted in Fulfilment of the Requirements for the Degree of Doctor of Philosophy

University of Glasgow
College of Science and Engineering
School of Computing Science

July 2015

The copyright in this thesis is owned by the author. Any quotation from the thesis or use of any of the information contained in it must acknowledge this thesis as the source of the quotation or information.

Abstract

Profiling tools are essential for understanding and tuning the performance of both parallel programs and parallel language implementations. Assessing the performance of a program in a language with high-level parallel coordination is often complicated by the layers of abstraction present in the language and its implementation. This thesis investigates whether it is possible to profile parallel Domain Specific Languages (DSLs) using existing host language profiling tools. The key challenge is that the host language tools report the performance of the DSL runtime system (RTS) executing the application rather than the performance of the DSL application. The key questions are whether a correct, effective and efficient profiler can be constructed using host language profiling tools; whether it is possible to effectively profile the DSL implementation; and what capabilities are required of the host language profiling tools.

The main contribution of this thesis is the development of an execution profiler for the parallel DSL Haskell Distributed Parallel Haskell (HdpH) using the host language profiling tools. We show that it is possible to construct a profiler (HdpHProf) to support performance analysis of both the DSL applications and the DSL implementation. The implementation uses several new GHC features, including the GHC-Events Library and ThreadScope, and develops two new performance analysis tools for the HdpH DSL internals: Spark Pool Contention Analysis and Registry Contention Analysis.

We present a critical comparative evaluation of the host language profiling tools that we used (GHC-PPS and ThreadScope) against another recent functional profiler, EdenTV, alongside four important imperative profilers. This is the first report on the performance of functional profilers in comparison with well-established, industrial-standard imperative profiling technologies. We systematically compare the profilers for usability and data presentation. We found that the GHC-PPS performs well in terms of overheads and usability, so using it to profile the DSL is feasible and would not have a significant impact on the DSL's performance.

We validate HdpHProf for functional correctness and measure its performance using six benchmarks. HdpHProf works correctly and can scale to profile HdpH programs running on up to 192 cores of a 32-node Beowulf cluster. We characterise the performance of HdpHProf in terms of profiling data size and profiling execution runtime overhead. The results show that HdpHProf does not alter the behaviour of the GHC-PPS and retains low tracing overheads, close to those of the studied functional profilers: 18% on average. Moreover, HdpH trace events account for only a small proportion of the GHC-PPS eventlog, less than 3% on average.

We show that HdpHProf is effective and efficient to use for performance analysis and tuning of DSL applications. We use HdpHProf to identify performance issues and to tune the thread granularity of six HdpH benchmarks with different parallel paradigms, e.g. divide and conquer, flat data parallel, and nested data parallel. This includes identifying problems such as too small or too large thread granularity, a problem size too small for the parallel architecture, and synchronisation bottlenecks.

We show that HdpHProf is effective and efficient for tuning the parallel DSL implementation. We use the Spark Pool Contention Analysis tool to examine how the spark pool implementation performs when accessed concurrently. We found that appropriate thread granularity can significantly reduce both conflict ratios and conflict durations, by more than 90%. We use the Registry Contention Analysis tool to evaluate three alternative registry implementations. We found that the tools can give a better understanding of how different implementations of the HdpH RTS perform.
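As a concrete illustration of the host language mechanisms the profiler builds on, the following minimal sketch (not taken from the HdpHProf sources) shows how a DSL runtime can emit its own trace events into the GHC eventlog and read them back with the GHC-Events Library. The event label and function names are hypothetical, and the record field names assume the 2015-era ghc-events API.

```haskell
-- Minimal sketch of the two eventlog mechanisms HdpHProf builds on:
-- emitting DSL-level trace events, and parsing the resulting log with
-- the GHC-Events Library (package ghc-events). The event label below is
-- hypothetical; record field names assume the 2015-era ghc-events API.

import Debug.Trace (traceEventIO)
import GHC.RTS.Events (readEventLogFromFile, EventLog(..), Data(..))

-- Emit a custom event into the GHC eventlog. When the program runs with
-- "+RTS -l", the message is timestamped by the GHC-PPS and written
-- alongside the RTS's own scheduler and GC events.
traceSparkCreated :: IO ()
traceSparkCreated = traceEventIO "HdpH:SparkCreated"

-- Parse an eventlog file and report how many events it contains;
-- a real analysis would filter for the DSL's user events.
summariseLog :: FilePath -> IO ()
summariseLog path = do
  parsed <- readEventLogFromFile path
  case parsed of
    Left err   -> putStrLn ("eventlog parse error: " ++ err)
    Right elog -> print (length (events (dat elog)))
```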
Dedication

In The Name Of Allah (God), The Most Gracious, The Most Merciful.

I dedicate this thesis to the memory of my mother. I miss her so much, but I am glad that she saw the process of this thesis through to its completion, offering the support to make it possible, as well as plenty of prayers; may Allah rest her soul in peace, amen.

Acknowledgements

My praises to Allah (God) for giving me the wellness, the strength of determination, the patience, and the support to complete this work successfully.

I would like to express my special appreciation and thanks to my supervisor, Professor Phil Trinder, for all his encouragement, guidance and support of my research, and for allowing me to grow as a research scientist. I have benefited greatly from his advice, inspiration, and vast experience. Also, I would like to express my very great appreciation to my supervisors Dr. Lilia Georgieva and Dr. Patrick Maier. Lilia's professional guidance and valuable support have been a great help in my research. Patrick has been a tremendous mentor for me. I have been lucky enough to work with him over the course of my PhD; he was invaluable in providing guidance and support from a different perspective. His willingness to give his time so generously has been very much appreciated.

Many thanks to Professor Greg Michaelson and Dr. J. Paul Siebert for their useful and constructive feedback and recommendations regarding this thesis. Also, I would like to thank my friends at the Dependable Systems Group (DSG) at Heriot-Watt University, Dr. Khari Armih, Dr. Mustafa Aswad, and Dr. Rob Stewart, for their advice and valuable technical support.

I would also like to extend my thanks to the computing officers at Heriot-Watt University for their support and for allowing hardware access for the performance evaluation of HdpHProf and the critical analysis study.

I would like to thank King Faisal University and the Ministry of Higher Education in Saudi Arabia for offering me this scholarship.
I also thank them for providing financial support for me and my family throughout the years of study.

Last, but by no means least, a special thanks to my family. Words cannot express how grateful I am to my father, my wife, my daughters, and my son for all of the sacrifices that you have made on my behalf. Your prayers for me are what have sustained me thus far.

Declaration

I declare that, except where explicit reference is made to the contribution of others, this dissertation is the result of my own work and has not been submitted for any other degree at the University of Glasgow or any other institution.

Majed Al-Saeed

Table of Contents

1 Introduction
  1.1 Thesis Statement
  1.2 Contributions
  1.3 Authorship and Publications

2 Background
  2.1 Parallel Architectures
    2.1.1 Shared-Memory Architectures
    2.1.2 Distributed-Memory Architecture
  2.2 Parallel Programming Models
    2.2.1 Shared-Memory Parallelism
    2.2.2 Distributed-Memory Parallelism
    2.2.3 Parallel Functional Languages
    2.2.4 HdpH
  2.3 Parallel Performance Analysis
    2.3.1 Performance Profiling Process
  2.4 Parallel Profilers
    2.4.1 Imperative Profiling
    2.4.2 Functional Profilers
  2.5 Summary

3 A Survey of Parallel Functional Profilers
  3.1 Experimental Methodology
    3.1.1 Experimental Set-up
    3.1.2 Concordance Benchmark Versions
    3.1.3 Experiments
  3.2 Profiling Data Size
    3.2.1 Profiling Data Size in Relation to Computation Size
    3.2.2 Profiling Data Size in Relation to Number of Processing Elements (PEs)
    3.2.3 Profiling Data Size Discussion
  3.3 Runtime Overheads of Profiling
    3.3.1 Runtime Overhead in Relation to Computation Size
    3.3.2 Profiling Overhead in Relation to Number of PEs
    3.3.3 Runtime Overhead Discussion
  3.4 Data Presentation and Visualisation
    3.4.1 Programming Model
    3.4.2 Presentation of Performance Data
    3.4.3 Software Properties
    3.4.4 Usability
    3.4.5 Discussion
  3.5 Related Work
  3.6 Summary

4 HdpHProf: Design and Implementation
  4.1 HdpHProf Requirements
  4.2 HdpHProf Implementation Design
    4.2.1 HdpH Trace Events
    4.2.2 Multiple Trace Files
    4.2.3 Time Synchronisation
    4.2.4 Merging Trace Files
    4.2.5 Trace Visualisation
    4.2.6 Trace Analysis and Presentation
  4.3 HdpHProf Implementation
    4.3.1 Data Collection
    4.3.2 Data Analysis
    4.3.3 Trace File Time Synchronisation
    4.3.4 Merging Trace Files
    4.3.5 Data Presentation
  4.4 Summary

5 Validating HdpHProf
  5.1 Experimental Tools and Benchmarks
  5.2 Validation of Functional Correctness
    5.2.1 Code Instrumentation
    5.2.2 Time Synchronisation in Trace Files
    5.2.3 Merging Trace Files
    5.2.4 Contention Analysis Tools
  5.3 Scalability
  5.4 Profiling Data Size
    5.4.1 Profiling Data Size vs Computation Size
    5.4.2 Profiling Data Size vs Number of PEs
    5.4.3 Profiling Data Size Comparison and Discussion
  5.5 Execution Time Overhead
    5.5.1 Runtime Overhead vs Computation Size
    5.5.2 Runtime Overhead vs Number of PEs
    5.5.3 Runtime Overhead Discussion
  5.6 HdpH Tracing Overhead
  5.7 Summary

6 Evaluating HdpHProf for Applications
  6.1 Experimental Methodology
    6.1.1 Experimental Set-up
    6.1.2 Experiments
  6.2 Identification of Performance Problems
    6.2.1 Excessively Small Thread Granularity
    6.2.2 Excessively Large Thread Granularity
    6.2.3 Sequentialisation Bottleneck
    6.2.4 Insufficient Work for Machine Architecture
    6.2.5 Combination of Factors
  6.3 Tuning Thread Granularity
    6.3.1 Data Parallel Chunk Size
    6.3.2 Divide and Conquer Threshold
  6.4 Co-location and Sequential Granularity
    6.4.1 High Co-ordination Overheads
    6.4.2 Fine Grained Tasks
    6.4.3 Good Performance
    6.4.4 Coarse Grained Tasks
    6.4.5 Discussion
  6.5 Summary

7 Evaluating HdpHProf for HdpH Internals
  7.1 Spark Pool Contention
    7.1.1 Spark Management
    7.1.2 Experiments
    7.1.3 Conflict Ratio
    7.1.4 Conflict Duration
    7.1.5 Mean Conflict Duration
    7.1.6 Maximum Conflict Duration
    7.1.7 Grouping Conflicts by Schedulers
  7.2 Spark Pool Contention and Granularity
    7.2.1 Experiments
    7.2.2 Conflict Ratio and Granularity
    7.2.3 Conflict Duration and Granularity
    7.2.4 Contention and Granularity Discussion
  7.3 Registry Contention
    7.3.1 Global References and Global IVars
    7.3.2 Experiments
    7.3.3 Conflict Ratio
    7.3.4 Total Conflict Duration
    7.3.5 Mean Conflict Duration
    7.3.6 Maximum Conflict Duration
    7.3.7 Conflicts by Operation Type
    7.3.8 Variability in Execution Time
    7.3.9 Registry Implementations Discussion
  7.4 Summary

8 Conclusion
  8.1 Summary
  8.2 Limitations
  8.3 Future Work