Delft University of Technology
Electrical Engineering, Mathematics, and Computer Science
Computer Engineering

Feeding High-Bandwidth Streaming-Based FPGA Accelerators

Thesis by: Yvo Thomas Bernard Mulder
Advisor: Prof. Dr. H.P. Hofstee
Committee chair: Prof. Dr. H.P. Hofstee
Committee members: Dr. Ir. Z. Al-Ars, Dr. Ir. R. van Leuken

Thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science in Computer Engineering
by Yvo Thomas Bernard Mulder, born in Utrecht, The Netherlands,
to be defended publicly on January 29, 2018 at 15:00.

CE-MS-2018-05
ISBN: 978-94-6186-886-2

Computer Engineering
Faculty of Electrical Engineering, Mathematics, and Computer Science
Delft University of Technology
Mekelweg 4, 2628 CD, Delft, The Netherlands

Abstract

A new class of accelerator interfaces has significant implications for system architecture. An order of magnitude more bandwidth forces us to reconsider FPGA design. OpenCAPI is a new interconnect standard that enables attaching FPGAs coherently to a high-bandwidth, low-latency interface. Keeping up with this bandwidth poses new challenges for the design of accelerators and the logic feeding them.

This thesis is conducted as part of a group project in which three other master students investigate database operator accelerators. This thesis focuses on the logic to feed the accelerators, by designing a reconfigurable multi-stream buffer architecture. By generalizing across multiple common streaming-like accelerator access patterns, an interface consisting of multiple read ports with a smaller-than-cache-line granularity is desired. At the same time, multiple read ports are allowed to request any stream, including reading across a cache line boundary.

The proposed architecture exploits different memory primitives available on the latest generation of Xilinx FPGAs. By combining a traditional multi-read-port approach for data duplication with a second level of buffering, a hierarchy typically found in caches, an architecture is proposed which can supply data from 64 streams to eight read ports without any access pattern restrictions.

A correct-by-construction design methodology was used to simplify the validation of the design and to speed up the implementation phase. At the same time, the design methodology is documented and examples are provided for ease of adoption. With the design methodology, the proposed architecture has been implemented and is accompanied by a validation framework.

Various configurations of the multi-stream buffer have been tested. Configurations up to 64 streams with four read ports meet timing with an AFU request-to-response latency of five cycles. The largest configuration, with 64 streams and eight read ports, fails timing. A limiting factor is the inherent architecture of FPGAs, where memories are physically located in specific columns. This makes extracting data complex, especially at the target frequencies of 200 MHz and 400 MHz: wires are scattered across the FPGA and wire delay becomes dominant.

FPGA design at increasing bandwidths requires new design approaches. Synthesis results are no guarantee for the implemented design and, depending on the design size, could indicate a very optimistic operating frequency. Therefore, designing accelerators to keep up with an order of magnitude more bandwidth compared to the current state of the art is complex, and requires carefully thought-out accelerator cores combined with an interface capable of feeding them.
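The multi-stream buffer interface summarized in the abstract (64 streams, eight read ports, sub-cache-line reads that may cross a cache-line boundary) can be illustrated with a small behavioral model. The sketch below is not the thesis implementation: the class name MultiStreamBuffer, the 128-byte cache-line size, and the 32-byte port granularity are assumptions made purely for illustration.

# Minimal behavioral sketch in Python. Names and sizes are assumptions for
# illustration only; they are not taken from the thesis.
CACHE_LINE_BYTES = 128   # assumed cache-line size
NUM_STREAMS = 64         # number of streams, as in the abstract
NUM_READ_PORTS = 8       # number of read ports, as in the abstract
PORT_GRANULARITY = 32    # assumed sub-cache-line read size in bytes


class MultiStreamBuffer:
    """Buffers fetched cache lines per stream and serves sub-cache-line
    reads from any read port, including reads that cross a cache-line
    boundary. A hardware design would map this onto FPGA memory
    primitives with a two-level buffering hierarchy, not a flat bytearray."""

    def __init__(self):
        self.buffers = [bytearray() for _ in range(NUM_STREAMS)]
        self.head = [0] * NUM_STREAMS  # next unread byte offset per stream

    def fill(self, stream, cache_line):
        """Host side: append one fetched cache line to a stream."""
        assert len(cache_line) == CACHE_LINE_BYTES
        self.buffers[stream] += cache_line

    def read(self, port, stream, nbytes=PORT_GRANULARITY):
        """Accelerator side: any port may read from any stream; the
        request is allowed to span two buffered cache lines."""
        assert 0 <= port < NUM_READ_PORTS and 0 <= stream < NUM_STREAMS
        start = self.head[stream]
        data = bytes(self.buffers[stream][start:start + nbytes])
        self.head[stream] += len(data)
        return data


# Example: port 1 issues a read that crosses the boundary between the
# first and second buffered cache line of stream 3.
msb = MultiStreamBuffer()
msb.fill(stream=3, cache_line=bytes(range(128)))
msb.fill(stream=3, cache_line=bytes(range(128, 256)))
print(len(msb.read(port=0, stream=3, nbytes=112)))  # 112 (within line 0)
print(len(msb.read(port=1, stream=3, nbytes=32)))   # 32 (crosses into line 1)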
Preface

This thesis report marks the end of this project, on which I have worked for a year. The year started at the IBM Austin Research Lab, after Peter Hofstee invited me to work with him on emerging coherently attached FPGA accelerators. During my six months in Austin, Peter was always ready to help and chat, and he remained just as available in the period after Austin. Peter, it has been a very pleasant journey and I sincerely hope our paths cross again. I can safely say that you have not only been a tremendous supervisor, but also a good friend.

Because the project is in the field of FPGAs, I had many interesting discussions with Andrew Martin. Andrew is a research staff member and developed a ready-valid design methodology for FPGA design. Andy, I would like to thank you for your support during the design and implementation phase. Your methodology is the missing link for FPGA design.

Dorus, Eric, Jeremy, and Jinho, the three o'clock stretch session must live on, but in different countries and time zones. I enjoyed my time in Austin very much and that is thanks to you.

I would like to thank the Universiteitsfonds for providing funding for the six months I spent in Austin. Without this support, it would have been difficult to have had this experience.

I would like to thank my parents for their life-long support and faith in me. Finally, I would like to thank Fjóla for always being there for me when I needed her the most and making me a better person every day.

Contents

List of Figures
List of Tables
Listings
Revision Log

1 Introduction
  1.1 Thesis Aim
  1.2 Thesis Contributions
  1.3 Thesis Organization

2 Technology Trends
  2.1 Acceleration in the Data Center
    2.1.1 Dennard Scaling
    2.1.2 Homogeneous Multi-Core Systems
    2.1.3 Heterogeneous Multi-Core Systems
    2.1.4 Application Specific Acceleration
    2.1.5 FPGA Adoption in the Data Center
  2.2 Interconnect Trends
    2.2.1 Attached Devices Push Bandwidth Requirements
    2.2.2 Bandwidth Trends at Device-Level
    2.2.3 Bandwidth Trends at System-Level
  2.3 Current Interconnect Bottlenecks
    2.3.1 Traditional IO Model
    2.3.2 Communication and Synchronization Overhead
    2.3.3 Host Memory Access Congestion
  2.4 Interconnect Coherency and Shared Memory: A Necessity
    2.4.1 Coherent IO Model
    2.4.2 System-Wide Shared Memory Address Space
    2.4.3 System-Wide Coherence
    2.4.4 Thread Synchronization
  2.5 Preliminary Concluding Remarks

3 State-of-the-Art Interconnects
  3.1 PCI Express
    3.1.1 Architecture
    3.1.2 PCI Express Gen 3
    3.1.3 PCI Express Gen 4
    3.1.4 PCI Express Gen 5
  3.2 CAPI
    3.2.1 Architecture
    3.2.2 CAPI 1.0
    3.2.3 CAPI 2.0
  3.3 OpenCAPI
    3.3.1 Architecture
    3.3.2 OpenCAPI 3.0
    3.3.3 OpenCAPI 4.0
  3.4 CCIX
    3.4.1 Architecture
  3.5 AMBA AXI
    3.5.1 Architecture
    3.5.2 Handshake Protocol
    3.5.3 AXI Protocol Derivatives
    3.5.4 AXI Coherence Extension
  3.6 Interconnect Comparison
    3.6.1 Bandwidth and Latency
    3.6.2 Address Space
    3.6.3 Coherence
    3.6.4 Synchronization
  3.7 Preliminary Concluding Remarks

4 OpenCAPI Characterization
  4.1 POWER9 System Overview
  4.2 OpenCAPI Architecture
    4.2.1 Protocol Stack
    4.2.2 Data Link Layer Frame Format
    4.2.3 Transaction Layer Packets
    4.2.4 Coherent Accelerator Processor Proxy
    4.2.5 OpenCAPI Attached Device
    4.2.6 Address Spaces and Translation
  4.3 Coherent Programming Model
    4.3.1 Coherent Shared Virtual Memory
    4.3.2 Accelerator Paradigms
  4.4 FPGA Characterization
    4.4.1 FPGA Architecture
    4.4.2 Typical Resources
    4.4.3 Configurable Logic Blocks
    4.4.4 Memory Resources
    4.4.5 DLX and TLX Reference Design
  4.5 Preliminary Concluding Remarks

5 Requirements and Naive Designs
  5.1 Accelerator Classification
  5.2 Merge-Sort Accelerator Case Study
    5.2.1 Naive Buffer Design
    5.2.2 Crossing the Cache Line Boundary
  5.3 Design Requirements
  5.4 Naive Design Exploration
