Comparison of high level design methodologies for algorithmic IPs : Bluespec and C-based synthesis by Abhinav Agarwal Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY February 2009 @ Massachusetts Institute of Technology 2009. All rights reserved. Author ................................... Department of Electrical Engineering and Computer Science Feb 1 2009 (N Certified by ............................ Arvind Professor Thesis Supervisor Accepted by ..................... Terry P. Orlando Chairman, Department Committee on Graduate Students MASSACHUSETTS INSTITlrE OF TECw AL>OLOy MAR 0 5 20R9 ARCHIVES L 'P, R ES Comparison of high level design methodologies for algorithmic IPs : Bluespec and C-based synthesis by Abhinav Agarwal Submitted to the Department of Electrical Engineering and Computer Science on Feb 1, 2009, in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering and Computer Science Abstract High level hardware design of Digital Signal Processing algorithms is an important design problem for decreasing design time and allowing more algorithmic exploration. Bluespec is a Hardware Design Language (HDL) that allows designers to express in- tended microarchitecture through high-level constructs. C-based design tools directly generate hardware from algorithms expressed in C/C++. This research compares these two design methodologies in developing hardware for Reed-Solomon decoding algorithm under area and performance metrics. This work illustrates that C-based design flow may be effective in early stages of the design development for fast prototyping. However, the Bluespec design flow produces hardware that is more customized for performance and resource constraints. This is because in later stages, designers need to have close control over the hardware structure generated that is a part of HDLs like Bluespec, but is difficult to express under the constraints of sequential C semantics. Thesis Supervisor: Arvind Title: Professor Acknowledgments I would like to express my deep and sincere gratitude to my advisor, Professor Arvind. This work has been possible because of his encouragement and guidance provided to me in the course of this project. I wish to extend my warmest thanks to Alfred Ng and S.R.K. Branavan, who collab- orated with me during this work. I am grateful to Dr. Jamey Hicks and Dr. Gopal Raghavan for their assistance in using resources provided by Nokia Research Center at Cambridge. All the members of the ARMO group and the CSG-Arvind group have been a great source of advice and support, and I would like to thank them for it. I would also like to thank my family for their love and encouragement. Contents 1 Introduction 8 1.1 C-based design tools ........... . . .. ... . . . . . . . . 9 1.2 Bluespec ........... .. ...... ............... 10 1.3 Design process ....... ....... ............... 10 1.4 Thesis Outline .......... ..... ............... 12 2 The Application: Reed-Solomon Decoder 13 2.1 Decoding Process and its Complexity . . . ............... 14 2.2 Algorithms and Pseudocode ........ ............... 15 2.2.1 Galois Field Arithmetic . . . . . . ............... 15 2.2.2 Syndrome Computation . . . . . . ............... 16 2.2.3 Berlekamp-Massey Algorithm . . . ............... 17 2.2.4 Chien Search .......... ............... 18 2.2.5 Forney's Algorithm ......... ............... 18 2.2.6 Error Correction . . . . . . . ... ............... 19 3 Hardware Implementation using C-based design flow 20 3.1 Defining Module Interfaces . ..... ............. 20 3.2 Initial Implementation in Catapult-C . ..... ....... 22 3.2.1 Loop Unrolling . . . .................... 22 3.3 Language-related issues with refinements . . . . . . . .. 23 3.3.1 Problems with Streaming under Dynamic Conditions 23 3.3.2 Substantial Source Code Modifications . . . . . . .. 25 4 Hardware Implementation using Bluespec design flow 27 4.1 Initial Implementation in Bluespec .... 28 4.2 Refinements made in Bluespec Design . . . 29 4.2.1 Manual Unrolling . ......... 30 . . . . 4.2.2 Out-of-order Execution . . . . . . . 31 4.2.3 Pipelining . ............. 32 4.2.4 Buffer Sizing . . . . . . . . . ... 33 5 Results and Performance Analysis 34 5.1 Source code size ................. .. ... ........ 34 5.2 Hardware synthesis results . . . . . . . . . . . . . . . . .. 34 5.3 Performance analysis ............ .. . . ........... 35 6 Conclusion 36 Bibliography 37 List of Figures 2-1 Stages in Reed-Solomon decoding ........ 3-1 Top level function................. .......... 21 3-2 Simple streaming example - source code . . . . . . . . . . . 24 3-3 Simple streaming example - hardware output . . . . . . . . . . . 24 3-4 Complex streaming example - source code . . . . . . . . . . . . . 25 3-5 Complex streaming example - hardware output . . . . . . . . . . 25 4-1 Bluespec interface for the decoder . ................. 4-2 Initial version of syndrome module ................. 4-3 Parameterized parallel version of the compute-syndrome rule . 4-4 Statically elaborated code generated by Bluespec ......... 4-5 The original structure of the Forney's Algorithm Implementation . 4-6 The modified structure of the Forney's Algorithm Implementation List of Tables 2.1 Order of GF Arithmetic Operations ...... ............ 14 2.2 Input and output symbols per step . .................. 15 3.1 Performance impact of unrolling ....... . ........... .. . 23 4.1 Performance impact of Buffer size . .................. . 33 5.1 FPGA synthesis summary ............ .. ........ 35 5.2 Performance metrics ......... ... ...... ....... .. 35 Chapter 1 Introduction A wide range of DSP applications require large amounts of computations at reason- able power budgets. To satisfy these requirements, many DSP algorithms are imple- mented in dedicated hardware. Meanwhile, DSP algorithms are often modeled using MATLAB or C/C++ because of the familiarity of algorithm designers with these lan- guages and because it frees the designers from hardware considerations. A tool that could generate efficient hardware directly from these high-level codes would be ideal for implementers. Such a tool, however, must provide its users with mechanisms to control generated hardware so that implementations with different performance, area and power tradeoffs can be generated according to the usage of the DSP algorithm. Several EDA vendors provide tools for this purpose [1, 2, 3, 4]. There are reasons to be optimistic about such C-based tools. DSP algorithms are usually highly structured and have abundant fine-grained parallelism. This is exactly the kind of parallelism where extensive research in parallel compilers over the last three or four decades has shown the greatest success. Since hardware generation is mostly about exploiting fine-grained parallelism, the rich body of compiler research should be directly applicable to the task at hand. The practical question is whether the compiler or the tool can be directed to generate hardware that satisfies some specific area, performance or power metric. Even if ASIC implementation is not the final target, the ability to direct the tool to generate a design within the resource constraints of a specific FPGA is an important question. The economic importance of this question is obvious: if C-based synthesis cannot lead to the ultimate target design then at some stage, the implementers will have to jump the rail at enormous design cost and resort to RTL design and synthesis tools. 1.1 C-based design tools C-based tools fall into two distinct categories - those that adhere to pure C/C++ semantics like Catapult-C [2], Synfora [4], etc. and those that deviate from the pure semantics by allowing new constructs, like HandelC [1], SystemC [5], etc. The external constructs are introduced to aid in expressing parallelism. But the issue then arises that once a designer is willing to give up on the pure sequential C semantics, the main benefit of using C-based design flow is already lost. For our study, we chose a tool that maintains C semantics but allows external annotations and settings for greater customization. Catapult-C, a high-level synthesis tool developed by Mentor Graphics for hard- ware synthesis from ANSI-standard C/C++, is a good example of a tool to explore this domain. It provides mechanisms to control the generated hardware. General constraints, such as the operating frequency, can be applied to the whole design. The tool automatically explores the design space to find a design that satisfies the constraints. The user can give annotations, for mechanisms such as loop unrolling, regarding how to optimize a particular part of the program without affecting the cor- rectness of the code. Hardware designers perceive several advantages in using such a C-based design methodology [6, 7] - having a concise source code allows faster de- sign and simulation, technology-dependent physical design is relatively isolated from the source and using an untimed design description allows high level exploration by raising the level of abstraction. Using Catapult-C, Guo et al. [8] were able to rapidly prototype components used in 4G Wireless systems for FPGA implementation. 1.2 Bluespec Most implementers of DSP algorithms regard designing at the RTL level (as in Verilog or VHDL) as tedious, inflexible, error-prone and slow. The high costs of ASIC and FPGA designs give credence to this belief. Bluespec [9] is a synthesis tool to help the designer express both the structure and the behavior of the desired micro-architecture at a higher abstraction level. Bluespec SystemVerilog (BSV), the input language for the tool, is built on sound ideas from modern functional and object-oriented program- ming languages, and uses guarded atomic actions to express concurrent behaviors succinctly [10, 11]. Bluespec facilitates latency insensitive designs by automatically generating handshaking signals between components. This allows designers to incre- mentally refine modules to meet their requirements. For example, in [12], Dave et. al. describe the design of an 802.11a transmitter using the Bluespec design methodology. From the same parameterized source code they were able to generate many different designs for the IFFT block, which consumed the majority of resources, to obtain an optimal design for the 802.11a transmitter. Similar experiences are reported for many other designs in Bluespec, e.g., reusable blocks for OFDM protocols [13], H.264 decoder [14], out-of-order processors [15], cache coherence protocol processors [16]. However, the Bluespec tool, unlike Catapult-C, does not do any design exploration on its own - it only facilitates the expression of different design alternatives. 1.3 Design process In this work the two design methodologies are compared via the implementation of a Reed-Solomon decoder. This example was chosen because it represented a non- trivial algorithmic IP but is simpler than an H.264 decoder. We were not familiar with Reed-Solomon codes before we started this project and needed a Reed-Solomon decoder to implement an 802.16 WiMAX protocol receiver [17]. We expected that it would be straightforward to express the desired decoder for the target performance in Bluespec. Indeed that turned out to be the case. However, even to understand what
Description: