
Amber Core Specification PDF

26 Pages·2015·0.3 MB·English


Amber Open Source Project
Amber 2 Core Specification
March 2015

Table of Contents

1 Introduction
  1.1 Amber 23 Features
  1.2 Amber 25 Features
2 Amber 23 Pipeline Architecture
  2.1 ALU
  2.2 Pipeline Operation
3 Instruction Set
4 Instruction Set Encoding
  4.1 Condition Encoding
  4.2 Opcode Encoding
  4.3 Shifter Operand Encoding
  4.4 Register transfer offset encoding
  4.5 Shift Encoding
  4.6 Load & Store Multiple
  4.7 Branch offset
  4.8 Booth's Multiplication Algorithm
5 Interrupts
6 Registers
7 Cache
8 Amber Project
  8.1 Amber Port List
  8.2 Amber 23 Verilog Files
9 License

Released under the GNU Lesser General Public License (v2.1) terms

1 Introduction

The Amber processor core is an ARM-compatible 32-bit RISC processor. The Amber core is fully compatible with the ARM® v2a instruction set architecture (ISA) and is therefore supported by the GNU toolset. This older version of the ARM instruction set is supported because it is not covered by patents, so it can be implemented without a license from ARM. The Amber project provides a complete embedded system incorporating the Amber core and a number of peripherals, including UARTs, timers and an Ethernet MAC.

There are two versions of the core provided in the Amber project. The Amber 23 has a 3-stage pipeline, a unified instruction and data cache, a Wishbone interface, and is capable of 0.8 DMIPS per MHz. The Amber 25 has a 5-stage pipeline, separate data and instruction caches, a Wishbone interface, and is capable of 1.0 DMIPS per MHz. Both cores implement exactly the same ISA and are 100% software compatible.

The Amber 23 core is a very small 32-bit core that provides good performance. Register-based instructions execute in a single cycle, except for instructions involving multiplication. Load and store instructions require three cycles. The core's pipeline is stalled either when a cache miss occurs or when the core performs a Wishbone access.

The Amber 25 core is a little larger and provides 15% to 20% better performance than the 23 core. Register-based instructions execute in a single cycle, except for instructions involving multiplication. Load and store instructions also execute in a single cycle unless there is a register conflict with a following instruction. The core's pipeline is stalled when a cache miss occurs in either cache, when an instruction conflict is detected, or when the core performs a Wishbone access.

Both cores have been verified by booting a 2.4 Linux kernel. Versions of the Linux kernel from the 2.4 branch and earlier contain configurations for the supported ISA. The 2.6 version of Linux does not explicitly support the ARM v2a ISA, so it requires more modifications to run. Also note that the cores do not contain a memory management unit (MMU), so they can only run the non-virtual memory variant of Linux.

The cores were developed in Verilog 2001 and are optimized for FPGA synthesis. For example, there is no reset logic; all registers are reset as part of FPGA initialization. The complete system has been tested extensively on the Xilinx SP605 Spartan-6 FPGA board.
The full Amber system with the A23 core uses 32% of the Spartan-6 XC6SLX45T-3 FPGA Look Up Tables (LUTs), with the core itself occupying less than 20% of the device using the default configuration, and running at 40MHz. It has also been synthesized to a Virtex-6 device at 80MHz, but not yet tested on a real Virtex-6 device. The maximum frequency is limited by the execution stage of the pipeline, which includes a 32-bit barrel shifter, 32-bit ALU and address incrementing logic.

For a description of the ISA, see "Archimedes Operating System - A Dabhand Guide, Copyright Dabs Press 1991", or "Acorn RISC Machine Family Data Manual, VLSI Technology Inc., 1990".

1.1 Amber 23 Features

• 3-stage pipeline.
• 32-bit Wishbone system bus.
• Unified instruction and data cache, with write through and a read-miss replacement policy. The cache can have 2, 3, 4 or 8 ways and each way is 4kB.
• Multiply and multiply-accumulate operations with 32-bit inputs and 32-bit output in 34 clock cycles using the Booth algorithm. This is a small and slow multiplier implementation.
• Little endian only, i.e. byte 0 is stored in bits 7:0 and byte 3 in bits 31:24.

The following diagram shows the data flow through the 3-stage core.

Figure 1 - Amber 23 Core pipeline stages
[Figure: Stage 1 – Fetch (Cache), Stage 2 – Decode (Instruction Decode, Decode State), Stage 3 - Execute (Instruction Execution, Register Set). Signal labels: Address, Write Data, Read Instruction / Data, Read Data, Control Signals.]

1.2 Amber 25 Features

• 5-stage pipeline.
• 32-bit Wishbone system bus.
• Separate instruction and data caches. Each cache can have 2, 3, 4 or 8 ways and each way is 4kB. Both caches use a read replacement policy and the data cache operates as write through. The instruction cache is read only.
• Multiply and multiply-accumulate operations with 32-bit inputs and 32-bit output in 34 clock cycles using the Booth algorithm. This is a small and slow multiplier implementation.
• Little endian only, i.e. byte 0 is stored in bits 7:0 and byte 3 in bits 31:24.

The following diagram shows the data flow through the 5-stage core.

Figure 2 - Amber 25 Core pipeline stages
[Figure: Stage 1 – Fetch (Instruction Cache), Stage 2 – Decode (Instruction Decode, Decode State), Stage 3 - Execute (Instruction Execution, Register Set), Stage 4 - Memory (Data Cache), Stage 5 – Write Back. Signal labels: Instruction address, Read Instruction, Control Signals, Data address, Write Data, Read Data.]

2 Amber 23 Pipeline Architecture

The Amber 23 core has a 3-stage pipeline architecture. The best way to think of the pipeline structure is as a circle. There is no start or end point. The output from each stage is registered and fed into the next stage. The three stages are:

• Fetch – The cache tag and data RAMs receive an unregistered version of the address output by the execution stage. The registered version of the address is compared to the tag RAM outputs one cycle later to decide if the cache hits or misses. If the cache misses, then the pipeline is stalled while the instruction is fetched from either boot memory or main memory via the Wishbone bus. The cache always does 4-word reads so a complete cache line gets filled. In the case of a cache hit, the output from the cache data RAM goes to the decode stage. This can either be an instruction or data word.
• Decode – The instruction is received from the fetch stage and registered. One cycle later it is decoded and the datapath control signals are prepared for the next cycle. This stage contains a state machine that handles multi-cycle instructions and interrupts.
• Execute – The control signals from the decode stage are registered and passed into the execute stage, along with any read data from the fetch stage. The operands are read from the register bank, shifted, combined in the ALU and the result written back. The next address for the fetch stage is generated.

The following diagram shows the datapath through the three stages in detail. This diagram closely corresponds to the Verilog implementation. Some details, like the Wishbone interface and coprocessor #15, have been left out so as not to overload the diagram completely.

Figure 3 - Detailed 3-Stage Pipeline Structure
[Figure: detailed datapath of the three stages.
 FETCH: Cache Tag SRAMs and Cache Data SRAMs addressed by address_nxt [11:4], tag compare against address [31:12], hit detection, Way Select, Word Select on address [3:2], Cache State Machine, Miss Address, Data Prefetch Address.
 DECODE: Abort Exception inputs (irq, firq, DABT, PABT, ADEX), Decode State Machine, Saved Current Instruction, Pre-Fetch Instruction, Instruction Select, Instruction Decode Logic, read_data [31:0] shifted by { address[1:0], 3'd0 } for ldrb, instruction [31:0], Execute Control Signals.
 EXECUTE: Register Bank (rn_sel [3:0], rds_sel [3:0], rm_sel [3:0]; outputs pc, rn, rd, rs, rm), Barrel Shift Amount Select and Barrel Shift Data Select (imm_shift_amount [4:0], imm32 [31:0], copro_read_data), Multiply (multiply_function [1:0]), Barrel Shift (barrel_shift_function [1:0], shift_amount [4:0]), ALU (alu_function [8:0]), Interrupt Vector Select (interrupt_vector_sel [2:0]), Address Select (address_sel [2:0]), Program Counter Select (pc_sel [1:0]), Register Write Select (reg_write_sel [2:0]), Status Bits Select (status_bits_sel [2:0]), Write Data Select, Byte Enable Select (byte_enable_sel), and write enables copro_write_data_wen, write_data_wen, adr_wen, base_adr_wen, pc_wen, reg_bank_wen[14:0], status_wen.
 Outputs: copro_write_data [31:0], write_data [31:0], byte_enable [3:0], write_enable, address_nxt [31:0], address [31:0].]

2.1 ALU

The diagram below shows the structure of the Arithmetic Logic Unit (ALU).
It consists of a set of different logical functions, a 32-bit adder and a mux to select the function.

Figure 4 - ALU Structure
[Figure: alu_function = { swap_sel, not_sel, cin_sel [1:0], cout_sel, out_sel [2:0] }. Inputs a_in [31:0], b_in [31:0], cpsr_carry and barrel_shift_carry. Blocks: A/B Select (swap_sel), Not Select (not_sel), Cin Select (cin_sel[1:0]), AND, OR, XOR, Zero Extend 8, Full Adder, Out Select (out_sel[2:0]), Cout Select (cout_sel), zero detect, overflow. Outputs: out [31:0] and flags = { n, z, c, v }.]

The alu_function[6:0] bus in the core is a concatenation of the individual control signals in the ALU. The following table describes these control signals.

Table 1 ALU Function Encoding

Field          Function
swap_sel       Swaps the a and b inputs
not_sel        Selects the NOT version of b
cin_sel[1:0]   Selects the carry in to the full adder from { c_in, !c_in, 1, 0 }. Note that bs_c_in is the carry_in from the barrel shifter.
cout_sel       Selects the carry out from { full_adder_cout, barrel_shifter_cout }
out_sel[2:0]   Selects the ALU output from { 0, b_zero_extend_8, b, and_out, or_out, xor_out, full_adder_out }

2.2 Pipeline Operation

2.2.1 Load Example

The load instruction causes the pipeline to stall for two cycles. This section explains why this is necessary. The following is a simple fragment of assembly code with a single load instruction with register instructions before and after it.

    0   mov r0, #0x100
    4   add r1, r0, #8
    8   ldr r4, [r1]
    c   add r4, r4, r0

The table below shows which instruction is active in each stage of the processor core for each clock tick. When the core comes out of reset the execute stage starts generating fetch addresses. It starts at 0 and increments by 4 each tick.
In tick 1 the first instruction, at address 0, is fetched. This simple example assumes that all accesses are already present in the cache, so fetches only take 1 cycle. Otherwise read accesses on the Wishbone bus would add additional stalls and complicate this example. At tick 2 the first instruction, 0, is decoded and at tick 3 it is executed. This means that the r0 register, which is the destination for instruction 0, does not output the new value until tick 4, where it is used as an input to the second instruction. At tick 5 the load instruction, instruction 8, stalls the decode stage. In the execute stage it calculates the load address and this is used by the fetch stage in tick 6. Also in tick 5 the instruction c is saved to the pre_fetch_instruction register. This is used once the load instruction has finished, and its use avoids an additional stall cycle to reread instruction c. At tick 6 the value at address 0x108 is fetched and at tick 7 it is written into r4. The new value of r4 is then available for instruction c in tick 8.

Table 2 Pipeline load example

Stage    Signal                 Tick 0  Tick 1  Tick 2  Tick 3  Tick 4  Tick 5         Tick 6  Tick 7  Tick 8
Fetch    address                -       0       4       8       c       10             108     10      14
         access type                    read    read    read    read    read, ignored  read    read    read
Decode   instruction            -       -       0       4       8       8              8       c       10
         pre_fetch_instruction  -       -       -       -       -       [c]            [c]
Execute  instruction            -       -       -       0       4       8              8       8       c
         address_nxt            0       4       8       c       10      108            10      14      18

2.2.2 Store Example

The store instruction also causes the pipeline to stall for two cycles. This section explains why this is necessary. The following is a simple fragment of assembly code with a single store instruction with register instructions before and after it.

    0   mov r0, #0x100
    4   mov r1, #17
    8   str r1, [r0]
    c   add r1, r0, #20

The table below shows which instruction is active in each stage of the processor core for each clock tick. At tick 5 the store instruction, instruction 8, stalls the decode stage.
In the execute stage it calculates the store address and this is used by the fetch stage in tick 6. Also in tick 5 the instruction c is saved to the pre_fetch_instruction register. This is used once the store instruction has finished, and its use avoids an additional stall cycle to reread instruction c. In tick 7 the instruction after the store instruction is decoded and in tick 8 it is executed.

Table 3 Pipeline store example

Stage    Signal                 Tick 0  Tick 1  Tick 2  Tick 3  Tick 4  Tick 5         Tick 6  Tick 7  Tick 8
Fetch    address                -       0       4       8       c       10             100     10      14
         access type                    read    read    read    read    read, ignored  write   read    read
Decode   instruction            -       -       0       4       8       8              8       c       10
         pre_fetch_instruction  -       -       -       -       -       [c]            [c]
Execute  instruction            -       -       -       0       4       8              8       8       c
         address_nxt            0       4       8       c       10      100            10      14      18
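
The address_nxt rows of Tables 2 and 3 can be reproduced with a very small model. The Python sketch below is an illustration only and is not derived from the Verilog source: the function name address_nxt_trace and its parameters are hypothetical. It assumes that every access hits in the cache, that the program contains a single load or store, that an instruction reaches the execute stage three ticks after its fetch address is issued, and that the fetch address issued just before the data access is ignored and therefore issued again.

```python
def address_nxt_trace(mem_pc, data_address, n_ticks):
    """Model the execute stage's address_nxt output, one value per tick.

    mem_pc       -- address of the single load/store instruction (assumed)
    data_address -- address that the load/store reads or writes
    n_ticks      -- number of clock ticks to model
    """
    # The fetch address is issued at tick mem_pc / 4, then the instruction
    # is fetched, decoded and finally executed three ticks later.
    execute_tick = mem_pc // 4 + 3
    pc = 0
    trace = []
    for tick in range(n_ticks):
        if tick == execute_tick:
            # The load/store drives its data address onto address_nxt.
            trace.append(data_address)
            # The fetch address issued on the previous tick is ignored,
            # so step back and issue it again on the next tick.
            pc -= 4
        else:
            trace.append(pc)
            pc += 4
    return trace


# Load example: ldr r4, [r1] at address 8, reading from 0x108.
load_trace = address_nxt_trace(mem_pc=0x8, data_address=0x108, n_ticks=9)
# Store example: str r1, [r0] at address 8, writing to 0x100.
store_trace = address_nxt_trace(mem_pc=0x8, data_address=0x100, n_ticks=9)

print([format(a, "x") for a in load_trace])
# prints ['0', '4', '8', 'c', '10', '108', '10', '14', '18']
```

The repeated 10 in the output is the re-issued fetch address. Together with the tick spent on the data access itself, it accounts for the two stall cycles described above.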
