GPU-Acceleration of In-Memory Data Analytics Evangelia Sitaridi Submitted in partial ful(cid:12)llment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences COLUMBIA UNIVERSITY 2016 ⃝c 2016 Evangelia Sitaridi All Rights Reserved ABSTRACT GPU-Acceleration of In-Memory Data Analytics Evangelia Sitaridi Hardware advances strongly in(cid:13)uence the database system design. The (cid:13)attening speed of CPU cores makes many-core accelerators, such as GPUs, a vital alternative to explore for processing the ever-increasing amounts of data. GPUs have a signi(cid:12)cantly higher degree of parallelism than multi-core CPUs but their cores are simpler. As a result, they do not face the power constraints limiting the parallelism of CPUs. Their trade-off, however, is the increased implementation complexity. This thesis adapts and redesigns data analytics operators to better exploit the GPU special memory and threading model. Due to the increasing memory capacity and also the user’s need for fast interaction with the data, we focus on in-memory analytics. Ourtechniquesspandifferentstepsofthedataprocessingpipeline: (1)Datapreprocess- ing, (2) Query compilation, and (3) Algorithmic optimization of the operators. Our data preprocessingtechniquesadaptthedatalayoutfornumericandstringcolumnstomaximize the achieved GPU memory bandwidth. Our query compilation techniques compute the op- timal execution plan for conjunctive (cid:12)lters. We formulate memory divergence for string matching algorithms and suggest how to eliminate it. Finally, we parallelize decompression algorithms in our compression framework Gompresso to (cid:12)t more data into the limited GPU memory. Gompressoachieveshighspeed-upsonGPUsovermulti-coreCPUstate-of-the-art libraries and is suitable for any massively parallel processor. Table of Contents List of Figures v List of Tables ix 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 GPU Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.1 GPU Historical Overview . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.2 GPU Architectural challenges . . . . . . . . . . . . . . . . . . . . . . 6 1.3 Problem Setting and Context . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.4 Thesis Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.4.1 Shared Memory Joins & Aggregations . . . . . . . . . . . . . . . . . 15 1.4.2 Multi-Predicate Selection Execution . . . . . . . . . . . . . . . . . . 16 1.4.3 String Matching Optimization . . . . . . . . . . . . . . . . . . . . . 17 1.4.4 SIMD-Accelerated Regular Expressions . . . . . . . . . . . . . . . . 18 1.4.5 Compression Acceleration . . . . . . . . . . . . . . . . . . . . . . . . 19 1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2 Related Work 22 2.1 Relational Operator Accelerator. . . . . . . . . . . . . . . . . . . . . . . . . 22 2.2 State-of-the-art GPU-Accelerated Database Systems . . . . . . . . . . . . . 23 3 Shared Memory Joins & Aggregations 27 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 i 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.4 Data Placement Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.4.1 Write Con(cid:13)icts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.4.2 Read Con(cid:13)icts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.5.2 Memory Footprint . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.5.3 Query Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.5.4 Optimization Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.6 Summary & Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4 Multi-Predicate Selections 43 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.3 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.4 Execution Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.4.1 Single-Kernel Plans . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.4.2 Multiple-Kernel Plans . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.5 Query Cost-Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.5.1 Single-Kernel Plans . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.5.2 Multiple-Kernel Plans . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.5.3 Model Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.6 Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.7 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.7.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.7.2 Cost Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.7.3 Query Plan Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.7.4 Optimization Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.8 Summary & Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 ii 5 Sub-string Matching Acceleration 59 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.1.1 String Matching Acceleration . . . . . . . . . . . . . . . . . . . . . . 60 5.1.2 String Matching on GPUs . . . . . . . . . . . . . . . . . . . . . . . . 62 5.1.3 String matching on GPU Databases . . . . . . . . . . . . . . . . . . 63 5.1.4 Thread-Divergence on GPUs . . . . . . . . . . . . . . . . . . . . . . 64 5.2 Cache Pressure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.3 String matching Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 5.3.1 Addressing Thread Divergence . . . . . . . . . . . . . . . . . . . . . 66 5.3.2 Addressing Cache Pressure . . . . . . . . . . . . . . . . . . . . . . . 68 5.3.3 Addressing Memory Divergence . . . . . . . . . . . . . . . . . . . . . 71 5.3.4 Combining Different Optimizations . . . . . . . . . . . . . . . . . . . 75 5.3.5 Algorithm Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.4.2 Comparing Algorithm Efficiency . . . . . . . . . . . . . . . . . . . . 83 5.4.3 Effect of Thread Divergence . . . . . . . . . . . . . . . . . . . . . . . 84 5.4.4 Effect of Alphabet Size . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.4.5 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.4.6 Worse-Case Performance. . . . . . . . . . . . . . . . . . . . . . . . . 88 5.4.7 Thread Group Size Tuning . . . . . . . . . . . . . . . . . . . . . . . 89 5.4.8 Comparison with CPUs . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 96 6 SIMD-Accelerated Regular Expressions 97 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.1.1 Substring Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 6.1.2 Regular Expression Matching . . . . . . . . . . . . . . . . . . . . . . 100 6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 6.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 6.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 iii 6.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 7 Decompression Acceleration 115 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 7.3 Gompresso Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 7.3.1 Parallel Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 7.3.2 Parallel Decompression . . . . . . . . . . . . . . . . . . . . . . . . . 121 7.4 Data Dependencies in Nested Back-references . . . . . . . . . . . . . . . . . 124 7.4.1 MRR Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 7.4.2 DE Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 7.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 7.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 7.5.2 Data Dependency Resolution . . . . . . . . . . . . . . . . . . . . . . 131 7.5.3 Performance Impact of Nested back-references . . . . . . . . . . . . 131 7.5.4 Impact of DE on Compression Ratio and Speed . . . . . . . . . . . . 134 7.5.5 Compression Framework Tuning . . . . . . . . . . . . . . . . . . . . 134 7.5.6 Dependency on Data Block Size . . . . . . . . . . . . . . . . . . . . 134 7.5.7 GPU vs. Multi-core CPU Performance . . . . . . . . . . . . . . . . . 135 7.6 Summary & Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 8 Concluding Remarks and Future Work 140 8.1 GPU Query Execution Optimization . . . . . . . . . . . . . . . . . . . . . . 140 8.2 String Matching Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . 141 8.3 Massively Parallel Lossless Compression . . . . . . . . . . . . . . . . . . . . 142 8.4 Heterogeneous Computing Data Analytics . . . . . . . . . . . . . . . . . . . 142 Bibliography 144 Appendix 161 iv List of Figures 1.1 GPU Trends in memory bandwidth and number of cores. . . . . . . . . . . 6 1.2 NVIDIA GPU Architecture overview. . . . . . . . . . . . . . . . . . . . . . 6 1.3 An example of bank/value access pattern for 32 threads in a warp. . . . . . 7 1.4 Thread scheduling in NVIDIA GPUs. . . . . . . . . . . . . . . . . . . . . . 8 1.5 GPU Database system architecture. . . . . . . . . . . . . . . . . . . . . . . 10 3.1 Bank optimization for 8-element chunks. . . . . . . . . . . . . . . . . . . . . 34 3.2 Number of copies per value for shared memory aggregation. . . . . . . . . . 35 3.3 Number of copies per value for shared-memory joins. . . . . . . . . . . . . . 36 3.4 Number of copies per value for varying cardinality. . . . . . . . . . . . . . . 37 3.5 Throughput for different budgets. . . . . . . . . . . . . . . . . . . . . . . . . 37 3.6 Speed-up for Write Con(cid:13)icts. . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.7 Speed-up for Read Con(cid:13)icts. . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.8 Throughput and pro(cid:12)ler Counters for varying con(cid:13)ict degree, t=30M. . . . 39 3.9 Throughput for different windows. . . . . . . . . . . . . . . . . . . . . . . . 40 3.10 Optimizing for value con(cid:13)icts only. . . . . . . . . . . . . . . . . . . . . . . . 41 3.11 Optimization time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.1 Execution of the S111 plan for two warps. . . . . . . . . . . . . . . . . . . . 47 4.2 Time performance in ms of different plans for a query with four conditions. 48 4.3 Execution of the K21 plan for two warps. . . . . . . . . . . . . . . . . . . . 49 4.4 Performance of multi-kernel plans when varying the selectivity of the (cid:12)rst condition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 v 4.5 Performance of multi-kernel plans when varying the selectivity of the (cid:12)rst condition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.6 Single-Kernel Optimization Algorithm. . . . . . . . . . . . . . . . . . . . . . 53 4.7 Actual and estimated performance for different single-kernel plans of Q2. . 55 4.8 Time performance of different plans on the CPU for Q2. . . . . . . . . . . . 56 4.9 Actual and estimated performance of multi-kernel plans for Q1 varying the selectivity of all conditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.10 Timeperformanceofsingle-kernelplansforQ1forvaryingpredicateselectivity. 57 4.11 Time performance of multi-kernel plans for Q1 or varying predicate selectivity. 57 4.12 Execution time of the multi-kernel optimizer as a function of the number of conditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5.1 Strategies for CPU-GPU interaction. . . . . . . . . . . . . . . . . . . . . . . 66 5.2 ExecutionofthebaselinemethodandSplit-2methodforsearchpattern’CAA’. 67 5.3 Execution of Seg 6-4 parallelism method for a pattern of three characters. 68 5.4 Contiguous and pivoted layout for 20-character strings and pivoted-piece of 4-characters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.5 ExecutionofPivot-4methodfor’ATG’patternusingtheKMP-Hybridstring matching method and 4 GPU threads. . . . . . . . . . . . . . . . . . . . . . 75 5.6 Memory access pattern for BM on an average case dataset and an adversar- ially generated input maximizing memory divergence. . . . . . . . . . . . . 76 5.7 Memory access pattern for a worst case input of KMP. . . . . . . . . . . . . 76 5.8 KMP and BM Seg-k-t L2 misses for varying segment size on a dataset of 512K strings with t=4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.9 Performance as a function of string length for Pivot-4 (left) and Pivot-8 (right) layouts. The top row shows the results for shorter strings and the bottom row for longer strings. . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.10 Time performance on A1 for varying pivoted width. . . . . . . . . . . . . . 83 5.11 Time performance on R1 for varying pivoted width. . . . . . . . . . . . . . 84 vi 5.12 Performance of Split-k Optimization on BM for varying number of string occurrences in the input. The (cid:12)rst sub(cid:12)gure is for selectivity 0.6, the second for 0.75, and the last for selectivity 0.9. . . . . . . . . . . . . . . . . . . . . 85 5.13 Time performance of pivoted and unpivoted methods for varying alphabet size and an 8-character pattern. . . . . . . . . . . . . . . . . . . . . . . . . 87 5.14 Performance for increasing pattern length for Independent and Seg-k-t (8) implementations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.15 Performance of BM for average and adversarial input. . . . . . . . . . . . . 88 5.16 Performance of KMP and BM for varying group size. . . . . . . . . . . . . . 89 5.17 Bandwidth of string matching for sparse and dense record lists for pivoted (left) and unpivoted (right) KMP methods. . . . . . . . . . . . . . . . . . . 89 5.18 TimeperformanceofCPUandGPUstringmatchingforA1anda8-character pattern. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 5.19 Time performance of KMP for two different predicates: ’%S1%S2%S3%’ vs. ’%S1S2S3%’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 5.20 PerformanceofpivotedAC,pivotedKMPagainstPFACforvaryingselectivity. 94 6.1 Substring matching for TPC-H Q13. . . . . . . . . . . . . . . . . . . . . . . 99 6.2 A DFA that validates e-mail addresses. . . . . . . . . . . . . . . . . . . . . . 100 6.3 DFAs for combinations of: she, her. . . . . . . . . . . . . . . . . . . . . . . . 101 6.4 Selective loads & stores of rids. . . . . . . . . . . . . . . . . . . . . . . . . . 106 6.5 Unaligned vector gathers in Xeon Phi. . . . . . . . . . . . . . . . . . . . . . 107 6.6 Varying string lengths (URL validation) . . . . . . . . . . . . . . . . . . . . 110 6.7 Varying the failure point (URL validation) . . . . . . . . . . . . . . . . . . . 111 6.8 Varying the failure point (URL validation, long strings) . . . . . . . . . . . 111 6.9 Varying the selectivity (URL validation) . . . . . . . . . . . . . . . . . . . . 112 6.10 Varying the DFA size (DFA for multi-pattern matching using English dictio- nary words) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 6.11 Scalability (multi-pattern matching both positive and negative using English dictionary words) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 vii
Description: