ebook img

VLSI Architectures for Multi-Gbps Low-Density Parity-Check Decoders by Ahmad Darabiha A ... PDF

128 Pages·2008·4.14 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview VLSI Architectures for Multi-Gbps Low-Density Parity-Check Decoders by Ahmad Darabiha A ...

VLSI Architectures for Multi-Gbps Low-Density Parity-Check Decoders by Ahmad Darabiha A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Electrical and Computer Engineering University of Toronto (cid:13)c Copyright by Ahmad Darabiha 2008 VLSI Architectures for Multi-Gbps Low-Density Parity-Check Decoders Ahmad Darabiha Doctor of Philosophy, 2008 Graduate Department of Electrical and Computer Engineering University of Toronto Abstract Near-capacity performance and parallelizable decoding algorithms have made Low- Density Parity Check (LDPC) codes a powerful competitor to previous generations of codes, such as Turbo and Reed Solomon codes, for reliable high-speed digital communications. As a result, they have been adopted in several emerging standards. This thesis investigates VLSI architectures for multi-Gbps power and area-efficient LDPC decoders. To reduce the node-to-node communication complexity, a decoding scheme is pro- posed in which messages are transferred and computed bit-serially. Also, a broad- casting scheme is proposed in which the traditional computations required in the sum-product and min-sum decoding algorithms are repartitioned between the check and variable node units. To increase decoding throughput, a block interlacing scheme isinvestigatedwhichisparticularlyadvantageousinfully-parallelLDPCdecoders. To increase decoder energy efficiency, an efficient early termination scheme is proposed. In addition, an analysis is given of how increased hardware parallelism coupled with a reduced supply voltage is a particularly effective approach to reduce the power consumption of LDPC decoders. These architectures and circuits are demonstrated in two hardware implementa- tions. Specifically, a 610-Mbps bit-serial fully-parallel (480, 355) LDPC decoder on a single Altera Stratix EP1S80 device is presented. To our knowledge, this is the fastest FPGA-based LDPC decoder reported in the literature. A fabricated 0.13- µm CMOS bit-serial (660, 484) LDPC decoder is also presented. The decoder has iii a 300 MHz maximum clock frequency and a 3.3 Gbps throughput with a nominal 1.2-V supply and performs within 3 dB of the Shannon limit at a BER of 10−5. With more than 60% power saving gained by early termination, the decoder consumes 10.4 pJ/bit/iteration at E /N =4dB. Coupling early termination with supply voltage b 0 scaling results in an even lower energy consumption of 2.7 pJ/bit/iteration with 648 Mbps decoding throughput. The proposed techniques demonstrate that the bit-serial fully-parallel architec- ture is preferred to memory-based partially-parallel architectures, both in terms of throughput and energy efficiency, for applications such as 10GBase-T which use medium-size LDPC code (e.g., 2048 bit) and require multi-Gbps decoding through- put. iv Acknowledgment I feel truly honored for having the chance to work with Prof. Anthony Chan Carusone and Prof. Frank R. Kschischang as my Ph.D. advisors. I would like to thank them both for their academic and personal mentorship throughout this work. The comple- tion of this project would not have been possible without Tony’s continuous support, guidance, dedication and friendship. I have also learned so much from Frank’s in- sightful research approach, skillful teaching methods, and his inspiring weekly group meetings. I thank the members of my PhD supervising committee, Prof. Glenn Gulak and Prof. Roman Genov, for their insightful suggestions during the entire course of this work. I also thank the external members of my final oral examination, Prof. Wei Yu and Prof. Bruce F. Cockburn, for their time and for their valuable comments. I gratefully acknowledge access to fabrication and CAD tools from Gennum Cor- poration. I also thank the FPGA research group at UofT for providing access to the Transmogrifier-4 prototype board. Iamgratefulbeyondmeasurestomydearparents,MohammadKarimandSaeedeh, who have been my best teachers in countless ways. This thesis is dedicated to them, in the hope to acknowledge a tiny fraction of their unconditional and everlasting love and support. I thank my lovely sister, Ladan, and my dear brothers, Mehdi and Majid, who have been my constant source of encouragement even though we have been more than eight time zones away for most of the past few years. I also thank my uncles and their dear families in Mississauga. Their kindness and hospitality have always made me feel like being close to home. And last, but definitely not least, I feel blessed for getting to know so many good friends during my studies at UofT. I have learned a lot from them and I am grateful to them all. I thank Kostas Pagiamtzis, for being the most fun and knowledgeable cubicle mate I could ever hope for. Special thanks to Samira Naraghi for her invalu- able and friendly consultations when they were badly needed. Many thanks to Amir Azarpazhooh for his true friendship (and for paying (most of) his dues in ‘The wood game’!). SpecialthanksgoestoFarsheedMahmoudiforhisenergizing‘ZoorKhooneh’ spirit. IalsothankotherfriendsfromLP392, BA5000, BA5158, Tony’sgroup, Frank’s v vi group, and those from outside the department. In particular, I would like to thank Rubil Ahmadi, Mohamed Ahmed Youssef Abdulla, Masoud Ardakani, Navid Azizi, Mike Bichan, Horace Cheng, Yadi Eslami, Tooraj Esmailian, Vincent Gaudet, Amir Hadji-Abdolhamid, Afshin Haftbaradaran, Mohammad Hajirostam, Meisam Honar- var, Farsheed Mahmoudi, Mahsa Moallem, Alireza Nilchi, Ali Pakzad, Peter Park, Jennifer Pham, Mohammad Poustinchi, Saman Sadr, Sanam Sadr, Siamak Sarvari, Mahdi Shabany, Mehrdad Shahriari, Farzaneh Shahrokhi, Tina Tahmoures-zadeh, Maryam Tohidi, Alexi Tyshchenko, Marcus van Ierssel, and Kentaro Yamamoto, in the alphabetical order. Contents List of Figures ix List of Tables xiii 1 Introduction 1 1.1 Error Control Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Low-Density Parity-Check (LDPC) Codes 7 2.1 Code Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 LDPC Decoding Algorithms . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.1 Sum-product algorithm . . . . . . . . . . . . . . . . . . . . . . 9 2.2.2 Min-sum algorithm . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Capacity-Approaching LDPC Codes . . . . . . . . . . . . . . . . . . . 12 2.4 Decoder Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.5 Architecture-Aware LDPC Codes . . . . . . . . . . . . . . . . . . . . 17 2.6 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3 Reducing Interconnect Complexity 19 3.1 Half-broadcasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.1.1 Half-broadcasting for fully-parallel decoders . . . . . . . . . . 21 3.1.2 Half-broadcasting for partially-parallel decoders . . . . . . . . 28 3.2 Full-broadcasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.3 Comparison and Discussion . . . . . . . . . . . . . . . . . . . . . . . 29 4 Block-Interlaced Decoding 33 4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.2 Interlacing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.3 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.3.1 Design 1: (2048, 1723) RS-based LDPC decoders . . . . . . . 39 4.3.2 Design 2: (660, 484) PEG LDPC decoders . . . . . . . . . . . 42 4.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5 Power-Saving Techniques for LDPC Decoders 47 5.1 Parallelism and Supply-Voltage Scaling . . . . . . . . . . . . . . . . . 47 5.1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 vii viii Contents 5.1.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.2 LDPC Decoding With Early Termination . . . . . . . . . . . . . . . . 52 5.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.2.2 Early termination . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.2.3 Hardware implementation . . . . . . . . . . . . . . . . . . . . 57 5.3 Power vs. BER Performance . . . . . . . . . . . . . . . . . . . . . . . 59 6 Bit-Serial Message-Passing 63 6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 6.2 Bit-serial Blocks for Conventional Min-Sum Decoding . . . . . . . . . 66 6.2.1 CNU architecture . . . . . . . . . . . . . . . . . . . . . . . . . 66 6.2.2 VNU architecture . . . . . . . . . . . . . . . . . . . . . . . . . 70 6.3 Bit-Serial Blocks for Approximate Min-Sum Decoding . . . . . . . . . 73 6.3.1 Approximate Min-Sum decoding . . . . . . . . . . . . . . . . . 73 6.3.2 CNU architecture for approximate Min-Sum decoding . . . . . 75 6.4 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 6.4.1 An FPGA 480-bit LDPC decoder . . . . . . . . . . . . . . . . 77 6.4.2 A 0.13-µm CMOS 660-bit bit-serial decoder . . . . . . . . . . 79 7 Conclusions and Future Work 91 7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 7.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 7.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 7.3.1 Technology scaling . . . . . . . . . . . . . . . . . . . . . . . . 93 7.3.2 Multi-rate and multi-length decoders . . . . . . . . . . . . . . 94 7.3.3 Custom layout for sub-blocks . . . . . . . . . . . . . . . . . . 94 7.3.4 Hardware-based Gaussian noise generation . . . . . . . . . . . 95 7.3.5 Adjustable message word length . . . . . . . . . . . . . . . . . 95 Appendices 97 A Parity-Check Matrix for FPGA-Based Decoder 97 B Parity-Check Matrix for 0.13-µm CMOS Decoder 103 References 109 List of Figures 1.1 An example communication system with ECC coding: (a) the trans- mitter, (b) the receiver. . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 The effect of ECC on BER performance. . . . . . . . . . . . . . . . . 3 2.1 Tanner graph for a (3,6)-regular LDPC code and the corresponding parity check matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Exchange of information in a message-passing decoding algorithm. . . 10 2.3 (a) Conventional receiver with digital ECC decoder, (b) An alternative receiver with analog decoder. . . . . . . . . . . . . . . . . . . . . . . 13 2.4 A partially-parallel LDPC decoder. . . . . . . . . . . . . . . . . . . . 16 2.5 The fully-parallel iterative LDPC decoder architecture. . . . . . . . . 16 3.1 Aconventionalfully-parallelmessage-passingLDPCdecoderwithgeneric functions for check and variable nodes. . . . . . . . . . . . . . . . . . 20 3.2 A half-broadcast architecture. The check node c broadcasts a single m message, P , to all neighboring variable nodes. . . . . . . . . . . . . 22 m 3.3 Broadcasting reduces the total top-level wirelength by sharing the wires. (a) Output messages of a check node without broadcasting (b) Sharing interconnect wires of a check node with broadcasting . . . . . 23 3.4 Asmallsectionofinterconnectsforalength-2048LDPCcode(a)before broadcast(b)afterbroadcast. Thereis40%reductionintotalwirelength. 23 3.5 The routed nets for one check node output highlighted in a fully- parallel LDPC decoder layout: (a) without broadcasting and (b) with broadcasting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.6 (a) The CNU and (b) the VNU architectures for a conventional hard decision message-passing decoder with no broadcasting. . . . . . . . . 26 3.7 (a)TheCNUand(b)theVNUarchitecturesforaharddecisionmessage- passing decoder with half broadcasting. . . . . . . . . . . . . . . . . . 27 3.8 A full-broadcast architecture. The check node c broadcasts P to the m m neighboring variable nodes. The variable node v broadcasts S . . . . 29 n n 4.1 An example of row/column permutation of H matrix in overlapped message-passing [1]: (a) the original H matrix, (b) the permuted H matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 ix x List of Figures 4.2 Message-passingtimingdiagramfor(a)theoriginalmatrixofFig.4.1.a (b) for the permuted matrix of Fig. 4.1.b. . . . . . . . . . . . . . . . . 35 4.3 A timing diagram for the message-passing algorithm: (a) conventional (b) overlapped message passing [1] (c) block interlacing. . . . . . . . . 36 4.4 (a) conventional, and (b) block-interlaced architecture for fully-parallel LDPC decoders. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.5 (a) conventional, and (b) block-interlaced architecture for partially- parallel LDPC decoders. . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.6 (a) CNU, and (b) VNU for the block-interlaced LDPC decoder in De- sign 1B. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.7 Timingdiagramfor(a)Design1A,(b)Design1B(withblockinterlacing). 41 4.8 VNU for the fully-parallel LDPC decoders in Designs 2A and 2B. . . 42 4.9 CNU for the fully-parallel LDPC decoder in Designs 2A. . . . . . . . 43 4.10 The FindMins block to calculate the first and second minimums (i.e., min1 and min2) in Fig. 4.9.. . . . . . . . . . . . . . . . . . . . . . . . 44 5.1 A partially-parallel LDPC decoder. . . . . . . . . . . . . . . . . . . . 48 5.2 Increased parallelism allows reduced supply voltage. . . . . . . . . . . 49 5.3 Power reduction as a result of a parallel architecture: a) Reduction in supply voltage. b) Reduction in dynamic power dissipation. . . . . . 51 5.4 BER vs. maximum number of iterations under 4-bit quantized min- sum decoding: (a) Reed-Solomon based (6,32)-regular 2048-bit code and (b) PEG (4,15)-regular 660-bit code. . . . . . . . . . . . . . . . . 54 5.5 The fractionofuncorrectedframesvs. iterationnumberfor(a)aReed- Solomon based (6,32)-regular 2048-bit code. (b) a PEG (4,15)-regular 660-bit code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.6 Fraction of active iterations (a) Reed-Solomon based (6,32)-regular 2048-bit code and (b) PEG (4,15)-regular 660-bit code. . . . . . . . . 56 5.7 The fully-parallel iterative LDPC decoder with early termination func- tionality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5.8 Block-interlaced decoding timing diagram (a) without early termina- tion, (b) with early termination. . . . . . . . . . . . . . . . . . . . . . 59 5.9 TheeffectofI onBERandpowerperformanceunderafixeddecoding M throughput for the (660, 484) PEG LDPC code. . . . . . . . . . . . . 60 5.10 TheeffectofI onBERandpowerperformanceunderafixeddecoding M throughput for the RS-based (2048, 1723) LDPC code. . . . . . . . . 61 6.1 Bit-serial vs. bit-parallel message passing. . . . . . . . . . . . . . . . 64 6.2 Top-level diagram for a bit-serial CNU. . . . . . . . . . . . . . . . . . 66 6.3 The finite-state machine for bit-serial calculation of CNU output mag- nitudes in a conventional min-sum decoder. . . . . . . . . . . . . . . . 67 6.4 The magnitude calculation module for the CNU in Fig. 6.2. . . . . . . 69

Description:
wireline standard (IEEE802.3an) [13] in 2005. The recent Mobile WiMAX standard. (IEEE802.16e) [14] also suggests LDPC codes as an optional error correction
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.