
Low Power Asynchronous Digital Signal Processing

268 Pages·2000·1.478 MB·English

LOW POWER ASYNCHRONOUS DIGITAL SIGNAL PROCESSING

A thesis submitted to the University of Manchester for the degree of Doctor of Philosophy in the Faculty of Science & Engineering

October 2000

Michael John George Lewis
Department of Computer Science

Contents

Chapter 1: Introduction
  Digital Signal Processing
    Evolution of digital signal processors
  Architectural features of modern DSPs
    High performance multiplier circuits
    Memory architecture
    Data address generation
    Loop management
    Numerical precision, overflows and rounding
  Architecture of the GSM Mobile Phone System
    Channel equalization
    Error correction and Viterbi decoding
    Speech transcoding
      Half-rate and enhanced full-rate coding
    Summary of processing for GSM baseband functions
    Evolution towards 3rd generation systems
  Digital signal processing in 3G systems
  Structure of thesis
  Research contribution

Chapter 2: Design for low power
  Sources of power consumption
    Dynamic power dissipation
    Leakage power dissipation
  Power reduction techniques
    Reducing the supply voltage
      Architecture-driven voltage scaling
      Adaptive supply voltage scaling
      Reducing the voltage swing
      Adiabatic switching
    Reducing switched capacitance
      Feature size scaling
      Transistor sizing
      Layout optimization
      SOI CMOS technology
    Reducing switching activity
      Reducing unwanted activity
      Choice of number representation and signal encoding
      Evaluation of number representations for DSP arithmetic
      Algorithmic transformations
      Reducing memory traffic
  Asynchronous design
    Asynchronous circuit styles
      Delay insensitive design
      Bundled-data design
      Asynchronous handshake circuits
    Latch controllers for low power asynchronous circuits
    Advantages of asynchronous design
      Elimination of clock distribution network
      Automatic idle-mode
      Average case computation
      Reduced electromagnetic interference
      Modularity of design
    Disadvantages compared to clocked designs
      Lack of tool support
      Reduced testability

Chapter 3: CADRE: A new DSP architecture
  Specifications
  Sources of power consumption
  Processor structure
    Choice of parallel architecture
      FIR Filter algorithm
      Fast Fourier Transform
      Choice of number representation
    Supplying instructions to the functional units
    Supplying data to the functional units
    Instruction buffering
  Instruction encoding and execution control
    Interrupt support
    DSP pipeline structure
  Summary of design techniques

Chapter 4: Design flow
  Design style
  High-level behavioural modelling
    Modelling environment
    Datapath model design
    Control model design
    Combined model design
    Integration of simulation and design environment
  Circuit design
  Assembler design

Chapter 5: Instruction fetch and the instruction buffer
  Instruction fetch unit
    Controller operation
    PC incrementer design
  Instruction buffer design
    Word-slice FIFO structure
    Looping FIFO design
      Write and read token passing
    Overall system design
    PC latch scheme
    Control datapath design
    Evaluation of design
    Results
      Loop counter performance

Chapter 6: Instruction decode and index register substitution
  Instruction decoding
    First level decoding
      Parallel instructions
      Move-multiple-immediate instructions
      Other instructions
      Changes of control flow
    Second level decoding
    Third level decoding
    Fourth level decoding
  Control / setup instruction execution
    Branch unit
    DO Setup unit
    Index interface
    LS setup unit
    Configuration unit
  The index registers
    Index register arithmetic
      Circular buffering
      Bit-reversed addressing
    Index unit design
  Index register substitution in parallel instructions

Chapter 7: Load / store operation and the register banks
  Load and store operations
    Decoupled load / store operation
    Read-before-write ordering
    Write-before-read ordering
  Load / store pipeline operation
    Address generation unit
      Address ALU design
    Lock interface
  Register bank design
    Data access patterns
      FIR filter data access patterns
      Autocorrelation data access patterns
    Register bank structure
    Write organization
    Read organisation
      Read operation
    Register locking

Chapter 8: Functional unit design
  Generic functional unit specification
    Decode stage interfaces
    Index substitution stage interfaces
      Secondary interfaces
    Register read stage
    Execution stage
  Functional unit implementation
  Arithmetic / logical unit implementation
    Arithmetic / logic datapath design
      Multiplier Design
      Input Multiplexer and Rounding Unit
      Adder Design
      Logic unit design

Chapter 9: Testing and evaluation
  Functional testing
  Power and performance testing
    Recorded statistics
      Operating speed and functional unit occupancy
      Memory and register accesses
      Instruction issue
      Address register and index register updating
      Register read and write times
  Results
    Instruction execution performance
    Power consumption results
    Evaluation of architectural features
      Register bank performance
      Use of indexed accesses to the register bank
      Effect of instruction buffering
      Effect of sign-magnitude number representation
  Comparison with other DSPs
    Detailed comparisons
    Other comparisons
      OAK / TEAK DSP cores
      Texas Instruments TMS320C55x DSP
      Cogency ST-DSP
      Non-commercial architectures
  Evaluation

Chapter 10: Conclusions
  CADRE as a low-power DSP
  Improving CADRE
    Scaling to smaller process technologies
    Optimising the functional units
      Multiplier optimisation
      Pipelined multiply operation
      Adder optimisation
      Improving overall functional unit efficiency
    Optimising communication pathways
    Optimising configuration memories
    Changes to the register bank
  Conclusions

References

Appendix A: The GSM full-rate codec
  Speech pre-processing
  LPC Analysis
  Short-term analysis filtering
  Long-term prediction analysis
  Regular pulse excitation encoding

Appendix B: Instruction set

Appendix C: The index register units
  Index unit structure
  Index ALU operation
    Split adder / comparator design
    Verification of index ALU operation

Appendix D: Stored opcode and operand configuration
  Functional unit opcode configuration
    Arithmetic operations
    Logical operations
    Conditional execution
  Stored operand format
  Index update encoding
  Load / store operation

List of Figures

1.1 A traditional signal processing system, and its digital replacement
1.2 Traditional DSP architecture
1.3 Multiplication of binary integers
1.4 Simplified diagram of GSM transmitter and receiver
1.5 TDMA frame structure in GSM
1.6 Division of tasks between DSP and microcontroller (after [23])
1.7 Adaptive channel equalization
1.8 1/2 rate convolutional encoder for full-rate channels
1.9 Analysis-by-synthesis model of speech
2.1 A simple CMOS inverter
2.2 Components of node capacitance CL
2.3 Wire capacitances in deep sub-micron technologies
2.4 SOI CMOS transistor structure
2.5 Multiply-Accumulate Unit Model
2.6 2s Complement Model Structure
2.7 Sign-Magnitude Model Structure
2.8 Total Transitions per Component
2.9 Synchronous and asynchronous pipelines
2.10 Dual-rail domino AND gate
2.11 Handshakes in asynchronous micropipelines
2.12 A simple signal transition graph (STG)
2.13 Pipeline latch operating modes
2.14 An early-open latch controller
2.15 Energy per operation using different latch controller designs
3.1 Layout of functional units
3.2 Reducing address generation and data access cost with a register file
3.3 Top level architecture of CADRE
3.4 Parallel instruction expansion
3.5 An algorithm requiring a single configuration memory entry
3.6 Using loop conditionals to reduce pre- and post-loop code
3.7 CADRE pipeline structure
4.1 STG / C-model based design flow for the CADRE processor
4.2 A simple sequencer and its STG specification
4.3 State structure indicating STG token positions
4.4 Evaluation function body
4.5 Evaluation code for input, output and internal transitions
4.6 An example of assembly language for CADRE
4.7 Different encodings for a parallel instruction
5.1 Fetch / branch arbitration
5.2 Data-dependent PC Incrementer circuit
5.3 Adjacent pipeline stages and interfaces to the instruction buffer
5.4 Signal timings for decode unit to instruction buffer communication
5.5 Micropipeline FIFO structure
5.6 Word-slice FIFO structure
5.7 Standard (i) and looping (ii) word-slice FIFO operation
5.8 Looping FIFO element
5.9 Looping FIFO datapath diagram
5.10 Top-level diagram of control datapath
6.1 Structure of the instruction decode stage
6.2 Second and subsequent instruction decode stages
6.3 Index ALU structure
6.4 Passing of index registers for parallel instructions
7.1 Ordering for ALU operations and loads
7.2 Ordering for ALU writebacks and stores
7.3 Illegal and legal sequences of operations with writebacks
7.4 Load / store operations and main pipeline interactions
7.5 Structure of the address generation unit
7.6 Address generator ALU schematic
7.7 Lock interface schematic
7.8 Multiported register cell
7.9 Word and bit lines in a register bank
7.10 Register bank organization
7.11 Write request distribution
7.12 Arbitration block structure and arbitration component
7.13 Read mechanism
8.1 Primary interfaces to a functional unit
8.2 Top-level schematic of functional unit
8.3 Internal structure of mac_unit
8.4 Sequencing of events within the functional unit
8.5 Arithmetic / logic datapath structure
8.6 Signed digit Booth multiplexer and input latch
8.7 Multiplier compression tree structure
8.8 Late-increment adder structure
8.9 Logic unit structure
9.1 Average distribution of energy per operation throughout CADRE
9.2 Breakdown of MAC unit power consumption

List of Tables

1.1 DSP primitive mathematical operations
1.2 Bit-reversed addressing for 8-point FFT
1.3 Computation load of GSM full-rate speech coding sections
1.4 Required processing power, in MIPS, of GSM baseband functions
2.1 Average Transitions per Operation
2.2 Millions of multiplications per second with different latch controllers
3.1 Distribution of operations for simple FIR filter implementation
3.2 Distribution of operations for transformed block FIR filter algorithm
3.3 Distribution of operations for FFT butterfly
3.4 Parallel instruction encoding
5.1 PC Incrementer delays
5.2 Incrementer delays
5.3 Maximum throughput and minimum latency
5.4 Energy consumption per cycle
7.1 Autocorrelation data access patterns
9.1 Functional tests on CADRE
9.2 Parallel instruction issue rates and operations per second
9.3 Power consumption, run times and operation counts
9.4 Distributions of energy (nJ) per arithmetic operation
9.5 Read and write times with different levels of contention
9.6 Register access times for DSP algorithms
9.7 Energy per parallel instruction and per register bank access
9.8 Energy per index and address register update
9.9 Instruction issue count and energy per issue for the instruction buffer
9.10 Fabrication process details from [149], and those for CADRE (estimated values marked with =)
9.11 FIR benchmark results
9.12 FFT benchmark results

Abstract

Cellular phones represent a huge and rapidly growing market. A crucial part of the design of these phones is to minimise the power consumption of the electronic circuitry, as this to a large extent controls the size and longevity of the battery. One of the major sources of power consumption within the digital components of a mobile phone is the digital signal processor (DSP) which performs many of the complex operations required to transmit and receive compressed digital speech data over a noisy radio channel.

This thesis describes an asynchronous DSP architecture called CADRE (Configurable Asynchronous DSP for Reduced Energy), which has been designed to have minimal power consumption while meeting the performance requirements of next-generation cellular phones. Design for low power requires correct decisions to be made at all levels of the design process, from the algorithmic and architectural structure down to the device technology used to fabricate individual transistors.

CADRE exploits parallelism to maintain high throughput at reduced supply voltages, with 4 parallel multiply-accumulate functional units. Execution of instructions is controlled by configuration memories located within the functional units, reducing the power overhead of instruction fetch. A large register file supports the high data rate required by the functional units, while exploiting data access patterns to minimise power consumption. Sign-magnitude number representation for data is used to minimise switching activity throughout the system, and control overhead is minimised by exploiting the typical role of the DSP as an adjunct to a microprocessor in a mobile phone system.

The use of asynchronous design techniques eliminates redundant activity due to the clock signal, and gives automatic power-down when idle, with instantaneous restart. Furthermore, elimination of the clock signal greatly reduces electromagnetic interference.

Simulation results show the benefits obtained from the different architectural features, and demonstrate CADRE’s efficiency at executing complex DSP algorithms. Low-level optimisation will allow these benefits to be fully exploited, particularly when the design is scaled onto deep sub-micron process technologies.
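
The abstract's point about sign-magnitude representation reducing switching activity can be illustrated with a small sketch. It is not the evaluation model used in the thesis: the 16-bit word width, the sample values and the helper names are assumptions chosen for illustration, and switching activity is approximated as the number of bits that change between successive words on a bus.

```python
# Illustrative sketch only: word width and sample values are assumptions,
# not figures taken from the thesis.

def twos_complement(x: int, bits: int = 16) -> int:
    """Encode x as a two's complement word of the given width."""
    return x & ((1 << bits) - 1)

def sign_magnitude(x: int, bits: int = 16) -> int:
    """Encode x as a sign bit followed by the magnitude of x."""
    sign = 1 << (bits - 1) if x < 0 else 0
    return sign | abs(x)

def transitions(words):
    """Count bit flips between each pair of successive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

# A small-amplitude signal that repeatedly crosses zero (hypothetical data).
samples = [3, -2, 1, -4, 2, -1, 0, -3]

tc = transitions([twos_complement(s) for s in samples])
sm = transitions([sign_magnitude(s) for s in samples])

print("two's complement bit transitions:", tc)
print("sign-magnitude bit transitions:  ", sm)
```

For these samples the two's complement stream toggles 107 bits and the sign-magnitude stream 19, because each zero crossing in two's complement flips all of the sign-extension bits, whereas sign-magnitude flips only the sign bit and the low-order magnitude bits.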
