ebook img

Assessment of Cache Coherence Protocols in Shared-memory Multiprocessors by Alexander Grbic ... PDF

195 Pages·2012·1.02 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Assessment of Cache Coherence Protocols in Shared-memory Multiprocessors by Alexander Grbic ...

Assessment of Cache Coherence Protocols in Shared-memory Multiprocessors by Alexander Grbic A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Electrical and Computer Engineering University of Toronto Copyright c 2003 by Alexander Grbic (cid:176) Abstract Assessment of Cache Coherence Protocols in Shared-memory Multiprocessors Alexander Grbic Doctor of Philosophy Graduate Department of Electrical and Computer Engineering University of Toronto 2003 The cache coherence protocol plays an important role in the performance of a distributed shared-memory (DSM) multiprocessor. A variety of cache coherence protocols exist and difier mainly in the scope of the sites that are updated by a write operation. These protocols can be complex and their impact on the performance of a multiprocessor system is often di–cult to assess. To obtain good performance, both architects and users must understand processor communication, data locality, the properties of the interconnection network, and the nature of the coherence protocols. Analyzing the processor data sharing behavior and determining its efiect on cache coherence communication tra–c is the flrst step to a better understanding of overall performance. Toward this goal, this dissertation provides a framework for evaluating the coherence communication tra–c of difierent protocols and considers using more than one protocol in a DSM multiprocessor. The framework consists of a data access characterization and the application of assessment rules. Its usefulness is demonstrated through an investigation into the performance of difierent cache coherence protocols for a variety of systems and parameters. It is shown to be efiective for determining the relative performance of protocols and the efiect of changes in system and application parameters. The investigation also shows that no single protocol is best suited for all communication patterns. Consequently, the dissertation also considers using more than one cache coherence protocol in a DSM multiprocessor. The results show that the hybrid protocol can signiflcantly reduce tra–c in all levels of the interconnection network with little efiect on execution time. ii Acknowledgements Iwouldliketothankmysupervisors,ProfessorsZvonkoVranesicandSinisaSrbljic,fortheir suggestions, guidance and support throughout my thesis. Without their knowledge, experience and time this work would not have been possible. I am grateful for their continued faith in me in spite of my decisions to take on new challenges and responsibilities. In addition, I wish to acknowledge useful discussions with Professor Michael Stumm and thank him for his help. I cannot say enough to thank my wife Gordana and daughter Lidia for their love, patience and understanding. Gordana, you gave me the support I needed to keep going, even when it looked like there was no end in sight to my graduate work. Lidia, the moment you arrived you brightened up my life, provided me with inspiration and taught me about the important things. To both of you, my love. I would like to thank my parents, brother and sister for their support, sacriflces and their love. Tony and Vanda, thanks for being there for Gordana, Lidia and me whenever we needed you. Tony, your dedication to research has motivated me in more ways than just making me realize that you could flnish before me. I must also thank my friends for the continued friendships. Even though I’ve gone largely into seclusion in the last while, you’ve kept in touch and always made me feel welcome. I express my thanks to the old Computer Group crowd and to the people at work for the friendly and frequent reminders of my unflnished business. IgratefullyacknowledgetheflnancialassistanceprovidedtomethroughOGSSTandNSERC Scholarships as well as a UofT Open Fellowship. iii Contents 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Background 5 2.1 Cache Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.1 Type of Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.2 Implementing Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Directory Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.4 Understanding Protocol Performance . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.5 Hybrid Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.5.1 On-line Decision Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.5.2 Ofi-line Decision Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.6 The NUMAchine Multiprocessor - Evolution . . . . . . . . . . . . . . . . . . . . . 22 2.7 Memory Consistency models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.8 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3 The NUMAchine Cache Coherence Protocol 27 3.1 The NUMAchine Multiprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.1.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.1.2 Interconnection Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 iv 3.1.3 Communication Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.1.4 Organization of the Network Cache . . . . . . . . . . . . . . . . . . . . . . 31 3.2 Protocol Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2.1 Processor Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2.2 Protocol Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.2.3 Invalidations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.2.4 Request and Response Forwarding . . . . . . . . . . . . . . . . . . . . . . 35 3.2.5 Negative Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.2.6 Efiect of Network Cache Organization . . . . . . . . . . . . . . . . . . . . 36 3.3 Protocol Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.3.1 Directory Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.3.2 Protocol States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.4 Basic Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.4.1 Local Write . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.4.2 Local Read . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.4.3 Remote Read . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.4.4 Remote Write . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.4.5 Remote Write-Backs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.5 Preserving the Memory Consistency Model . . . . . . . . . . . . . . . . . . . . . 45 3.6 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4 Experimental Environment 48 4.1 Simulation Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.1.1 Mintsim Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.1.2 Architectural Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.2.1 Description of Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.2.2 Rationale for Choices of Benchmarks . . . . . . . . . . . . . . . . . . . . . 52 4.3 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 v 5 Sharing Patterns and Tra–c 54 5.1 Data Access Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.1.1 Data Access Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5.1.2 Obtaining the Data Access Characterization. . . . . . . . . . . . . . . . . 57 5.2 Understanding Cache Coherence Protocols . . . . . . . . . . . . . . . . . . . . . . 58 5.2.1 Description of Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.2.2 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.2.3 Assessment Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.3 Choice of Characterization Interval . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.4 Conflrmation of Rule 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.4.1 Choosing Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 5.4.2 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.5 Extending the Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.6 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 6 Evaluation of Protocol Performance 74 6.1 The Update Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 6.1.1 The Update Protocol in a Distributed System . . . . . . . . . . . . . . . . 77 6.2 The Write-through Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6.3 Uncached Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 6.4 Protocol Communication Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.5 Study Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 6.5.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 6.5.2 Page Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 6.5.3 Interval Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 6.6 Data Access Characterization of Benchmarks . . . . . . . . . . . . . . . . . . . . 83 6.7 Relative Performance of Difierent Protocols . . . . . . . . . . . . . . . . . . . . . 85 6.7.1 Applying the Assessment Rules . . . . . . . . . . . . . . . . . . . . . . . . 85 6.7.2 Verifying the Assessment Rules . . . . . . . . . . . . . . . . . . . . . . . . 89 vi 6.8 Explanation of Application Behavior . . . . . . . . . . . . . . . . . . . . . . . . . 91 6.9 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 7 Hybrid Cache Coherence Protocol 99 7.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 7.2 Processor Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 7.2.1 Base Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 7.2.2 Dirty Shared State Support . . . . . . . . . . . . . . . . . . . . . . . . . . 105 7.3 Directory Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 7.3.1 States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 7.3.2 Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 7.4 Transitions Between Protocols. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 7.4.1 Dealing with Additional States in the Update Protocol . . . . . . . . . . . 109 7.4.2 Network Cache Transitions . . . . . . . . . . . . . . . . . . . . . . . . . . 112 7.4.3 Cache Blocks in Transition . . . . . . . . . . . . . . . . . . . . . . . . . . 113 7.4.4 Transitions Between Protocols in the Processor Cache . . . . . . . . . . . 113 7.5 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 7.5.1 Simulation Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 7.5.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 7.5.3 Decision Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 7.6 Hybrid Protocol Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 7.7 Wrong Protocols for Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 7.8 Decision Functions and Hybrid Protocol Execution Time . . . . . . . . . . . . . . 129 7.8.1 Only the Tra–c-Based Decision Function Changes to Update (t2u) . . . . 132 7.8.2 Only the Tra–c-based Decision Function Changes to Invalidate (t2i) . . . 135 7.8.3 Only the Latency-based Decision Function Changes to Update (l2u) . . . 136 7.8.4 Only the Latency-based Decision Function Changes to Invalidate (l2i) . . 137 7.8.5 General Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 7.9 Latency-based Decision Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 vii 7.10 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 8 Conclusion 147 8.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 A NUMAchine Cache Coherence Protocol - Invalidate 151 A.1 Local System Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 A.2 Remote System Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 A.3 Special Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 A.3.1 Negative Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . 161 A.3.2 Exclusive Reads and Upgrades . . . . . . . . . . . . . . . . . . . . . . . . 163 A.3.3 Non-inclusion of Network Cache, NOTIN Cases . . . . . . . . . . . . . . . 165 B System Events 168 Bibliography 174 viii List of Tables 2.1 Experimental and commercial multiprocessor architectures. . . . . . . . . . . . . 13 2.2 Cache coherence in experimental and commercial multiprocessors. . . . . . . . . 15 3.1 States in memory and network cache directories. . . . . . . . . . . . . . . . . . . 40 4.1 Simulation parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.2 Access latencies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 5.1 Values of parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 6.1 Communicationcostsinnumbersofpacketsforinvalidate,update,write-through and uncached operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.2 System data access characterization and percentage of writes. . . . . . . . . . . . 86 6.3 Data access characterization for the central ring. . . . . . . . . . . . . . . . . . . 87 6.4 Average number of packets per access for difierent cache coherence protocols on a 4-processor system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 6.5 Average number of packets per access for difierent cache coherence protocols on a 64-processor system central ring. . . . . . . . . . . . . . . . . . . . . . . . . . . 91 7.1 Parallel e–ciency for SPLASH2 applications used in the hybrid protocol study. . 116 7.2 Examples of NUMAchine system event costs in terms of number of packets for the invalidate and update protocols. . . . . . . . . . . . . . . . . . . . . . . . . . 117 7.3 Frequency of using incorrect protocols given in numbers of intervals. . . . . . . . 128 ix 7.4 Disagreements between the tra–c-based and latency-based decision functions given in numbers of intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 7.5 MRSW example for the case where only the tra–c decision function changes to update (t2u). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 7.6 SRMW example for the case where only the tra–c decision function changes to update (t2u). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 7.7 MRMW example for the case where only the tra–c decision function changes to update (t2u). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 7.8 MW example for the case where only the tra–c decision function changes to update (t2u). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 7.9 MRSW example for the case where only the tra–c decision function changes to invalidate (t2i). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 7.10 MRSW example for the case where only the latency decision function changes to update (l2u). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 7.11 MRMW example for the case where only the latency decision function changes to update (l2u). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 7.12 MRSW example for the case where only the latency decision function changes to invalidate (l2i). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 7.13 MRMW example for the case where only the latency decision function changes to invalidate (l2i). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 7.14 MW example for the case where only the latency decision function changes to invalidate (l2i). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 A.1 System events for local requests. . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 A.2 System events for remote requests. . . . . . . . . . . . . . . . . . . . . . . . . . . 155 B.1 System event descriptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 B.2 System event details. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 B.3 Tra–c and latency costs for system events. . . . . . . . . . . . . . . . . . . . . . 172 B.4 System parameters that afiect tra–c. . . . . . . . . . . . . . . . . . . . . . . . . . 172 x

Description:
then it can create cache pollution and increase traffic and contention of network resources. A number of dynamic hybrid cache coherence protocols have been proposed and imple- mented. They differ mainly in the implementation of the decision function and the amount of hardware support provided for
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.