The Processor: Datapath and Control

The Processor: Datapath and Control
Computer Organization & Design 5th. Chapter 4 The Processor: Datapath and Control 處理器：資料路徑與控制 ROBERT CHEN

Outlines Introduction Logic Design Conventions Building a Datapath
Computer Organization & Design 5th. Outlines Introduction Logic Design Conventions Building a Datapath A Simple Implementation Scheme A Multicycle Implementation Exception

Introduction 計算機的效能受到下面三個因素影響：
Computer Organization & Design 5th. Introduction 計算機的效能受到下面三個因素影響：指令的數目(instruction count) 每個指令的時脈週期數目 (CPI) 整數指令, 算數邏輯指令, 記憶體相關指令及分支時脈週期的長短(clock cycle time) 編譯器(compiler)和指令集架構(ISA)決定了一個程式所需的指令數目的多寡。時脈週期的長度和每個指令的時脈週期數目(CPI)卻是由處理器本身的製作方式來決定。在本章中，我們分別對於兩種不同的MIPS指令製作方式，建構出其資料路徑和控制單元。單一時脈製作方法多重時脈製作方法

Introduction 製作MIPS時，其功能單元包含兩個不同的邏輯元件：能運算資料的元件含狀態的元件例：ALU
Computer Organization & Design 5th. Introduction 製作MIPS時，其功能單元包含兩個不同的邏輯元件：能運算資料的元件例：ALU 組合式(元件的輸出值僅取決於現有的輸入值) 含狀態的元件例：記憶體和暫存器檔案循序式(輸出值決定在輸入值及其內部的狀態) 循序邏輯

Introduction 執行指令的階段圖4.1以高階的概觀圖來說明MIPS的製作方式指令擷取(Instruction Fetch)
Computer Organization & Design 5th. Introduction 執行指令的階段指令擷取(Instruction Fetch) 解碼 (Decode) 運算元擷取 (Operand Fetch) 執行(Execute) 寫回(Write back) 圖4.1以高階的概觀圖來說明MIPS的製作方式

Introduction We're ready to look at an implementation of the MIPS
Computer Organization & Design 5th. Introduction We're ready to look at an implementation of the MIPS Simplified to contain only: memory-reference instructions: lw, sw arithmetic-logical instructions: add, sub, and, or, slt control flow instructions: beq, j

Introduction State Elements Unclocked vs. Clocked
Computer Organization & Design 5th. Introduction State Elements Unclocked vs. Clocked Clocks used in synchronous logic when should an element that contains state be updated? cycle time rising edge falling edge

Introduction An unclocked state element Latches and Flip-flops
Computer Organization & Design 5th. Introduction An unclocked state element The set-reset latch output depends on present inputs and also on past inputs Latches and Flip-flops Latches and flip-flops are the simplest memory elements. Output is equal to the stored value inside the element (don't need to ask for permission to look at the value) Change of state (value) is based on the clock Latches: whenever the inputs change, and the clock is asserted Flip-flop: state changes only on a clock edge (edge-triggered methodology) A clocking methodology defines when signals can be read and written Wouldn't want to read a signal at the same time it was being written

Introduction D-latch Two inputs: Two outputs:
Computer Organization & Design 5th. Introduction D-latch Two inputs: the data value to be stored (D) the clock signal (C) indicating when to read & store D Two outputs: the value of the internal state (Q) and it's complement When the latch is open (C asserted), the value of Q changes as D changes transparent latch.

Introduction D flip-flop(D型正反器) Flip-flops are not transparent
Computer Organization & Design 5th. Introduction D flip-flop(D型正反器) Flip-flops are not transparent Output changes only on the clock edge The first latch, called the master, is open and follows the input D when C is asserted. When the clock input falls, the first latch is closed, but the 2nd latch, called the slave, is open and gets its input from the output of the master latch. Q _ D l a t c h C

Introduction Set-up time and Hold time D C
Computer Organization & Design 5th. Introduction Set-up time and Hold time Set-up time: the minimum time that the input must remain valid before the clock edge Hold time: the minimum time that the input must be valid after the clock edge (usually very small) D C Set-up time Hold time

Introduction An edge triggered methodology(邊緣觸發)
Computer Organization & Design 5th. Introduction An edge triggered methodology(邊緣觸發) Decide signals when to be read, when to be written Typical execution: read contents of some state elements, send values through some combinational logic write results to one or more state elements C l o c k y e S t a m n 1 b i g 2

Introduction Register File(暫存器檔案)
Computer Organization & Design 5th. Introduction Register File(暫存器檔案) A register file consists of a set of registers that can be read and written by supplying a register number to be accessed. Built using D flip-flops and decoders (specify register number) Read part (left) : supply a register number as input, and the output is the information stored in that register. A register file with 2 read ports and 1 write ports. (right) M u x R e g i s t r 1 n  a d 2 m b R e a d r g i s t n u m b 1 2 f l W

Introduction Register File
Computer Organization & Design 5th. Introduction Register File Write part: need 3 inputs: a register number, the data to write, and a clock that controls the writing into the register. Note: we still use the real clock to determine when to write n - t o 1 d e c r R g i s － C D u m b W a

Introduction Simple Implementation Basic components:
Computer Organization & Design 5th. Introduction Simple Implementation Basic components: two state elements instruction memory (指令記憶體)and program counter (PC) are needed to store and access instructions. An adder is needed to compute the next instruction address. Since the instruction memory is read-only(唯讀), we can treat it as combinational logic. P C I n s t r u c i o m e y a d . b g A S

Introduction Fetching instruction and incrementing PC (擷取指令並遞增PC)
Computer Organization & Design 5th. Introduction Fetching instruction and incrementing PC (擷取指令並遞增PC) A portion of the datapath used for fetching instructions and incrementing Program Counter PC送出位址讀取指令之後，立刻PC+4，指到下一個指令 P C I n s t r u c i o m e y R a d 4 A

Introduction R-Format ALU operations
Computer Organization & Design 5th. Introduction R-Format ALU operations R-format instruction has 3 register operands, 2 read and 1 write Rg. add $t0, $t1, $t2 Register numbers are 5 bits to indicate 32 registers, data bus are 32 bits and ALU control has 4 bits A L U c o n t r l R e g W i s a d 1 2 u D m b . Z 5 4

Introduction Datapath for R-type Instruction Eg. add $t0, $t1, $t2 I n
Computer Organization & Design 5th. Introduction Datapath for R-type Instruction Eg. add $t0, $t1, $t2 I n s t r u c i o R e g W a d 1 2 A L U l Z p 4

Introduction Load and Store Instructions
Computer Organization & Design 5th. Introduction Load and Store Instructions Load and store instructions compute a memory address by adding the base register, to a 16-bit signed offset field contained in the instruction “Sign extension unit” extends the 16-bit data to 32-bit data by replicating the high-order sign bit to the extra higher 16-bit data Eg. lw $t0, 40($t1) sw $t0, 32($t1) 1 6 3 2 S i g n e x t d b . - s o u M m R a W r D y A

Introduction Datapath for load and store instructions 資料路徑的載入和儲存動作
Computer Organization & Design 5th. Introduction Datapath for load and store instructions 資料路徑的載入和儲存動作暫存器的存取發生在記憶體位址計算之後。對記憶體的讀取。如果是載入指令，會有一個寫入動作到暫存器檔案中。 lw $t0, 40($t1) sw $t0, 32($t1) t1 I n s t r u c i o 1 6 3 2 R e g W a d D m y S x A L U l Z t0 40

Introduction J-type Instruction Branch datapath
Computer Organization & Design 5th. Introduction J-type Instruction Branch datapath Needs to compute the branch target address (計算分支目標位址) PC+4 is the address of the next instruction Offset field is left-shifted two bits to make a word offset. (PC0-27  Offset ) Needs to compare register contents(比較暫存器內容) 1 6 3 2 S i g n e x t d Z r o A L U u m h f l T b a c B P C + 4 s p I R W beq $t1, $t2, offset

Computer Organization & Design 5th.
Introduction 聖戰士組合利用多工器(MUX)或資料選擇器(data selector)將R形態指令和記憶體指令的資料路徑組合起來，而不用重複增加相同的功能單元 4

Introduction 聖戰士組合加入指令擷取部份的資料路徑

Introduction 聖戰士組合加入分支部份的資料路徑跳躍指令目標位址=指令之偏移量+跳躍指令之位址
Computer Organization & Design 5th. Introduction 聖戰士組合加入分支部份的資料路徑跳躍指令目標位址=指令之偏移量+跳躍指令之位址

Introduction 大功告成? 最難的是Control Unit 之設計
Computer Organization & Design 5th. Introduction 大功告成? 最難的是Control Unit 之設計

A Simple Implementation Scheme
Computer Organization & Design 5th. A Simple Implementation Scheme 這個簡易的製作方式包含載入字組 (lw) 及儲存字組 (sw) 相等分支 (beq) ALU 指令： add, sub, and , or, 及 set on less than 根據不同的指令形態，ALU需要可以做下列運算加法計算 lw 及 sw 的記憶體位址減法為了相等分支 AND, OR, subtraction, add, 或 slt 為了 R-形態指令需要 (由6位元的功能欄決定) ALU 控制輸入 0000 ： AND 0001 ： OR 0010 ：加法 0110 ：減法 0111 ：小於時設定 set on less than 1100 ：NOR (for other MIPS instructions) ALU a b Zero Result Overflow CarryOut ALU-operation 4

Computer Organization & Design 5th. A Simple Implementation Scheme Purpose Selecting the operations to perform (ALU, read/write, etc.) Controlling the flow of data (multiplexor inputs) How you get these control signals: Information comes from the 32 bits of the instruction Example: add $8, $17, $18 Instruction Format: ALU's operation based on instruction type and function code 000000 10001 10010 01000 00000 100000 op rs rt rd shamt funct

What Control Signals Do We Need?
Computer Organization & Design 5th. What Control Signals Do We Need?

Design Method for Control
Computer Organization & Design 5th. Design Method for Control Multi-level control (decoding) Instruction opcode: main control unit (first level) ALU control Sub-control for arithmetic MUX control Which source registers and destination registers ALU input source Input source of destination register Input source of PC Result for first level Seven 1-bit control lines 2-bit ALUOP control signals The above control signals can be set based solely on the opcode field of the instruction Exception: PCSrc (depends on the beq result)

Computer Organization & Design 5th. A Simple Implementation Scheme ALU控制位元的控制是由 ALUOp 控制位元所決定 ALUOp是來用決定不同的指令型態指令運算碼 ALUOp 指令的運算功能欄位需要的ALU運算 ALU的控制輸入 LW 00 載入字組 XXXXXX 加法 0010 SW 儲存字組 Branch equal 01 相等分支減法 0110 R-type 10 100000 100010 AND 100100 and 0000 OR 100101 or 0001 小於時設定 101010 小於時設定slt 0111

ALU Control ALU Control Instructions using ALU Branch eq R-type
Computer Organization & Design 5th. ALU Control ALU Control Instructions using ALU Load/store address calculation – add lw $t1, offset(t2) Branch eq Subtract for comparison ‘taken’ or ‘not taken’ add/subtract for address calculation beq $t1, $t2, offset R-type and/or set-on-less-than ALU control 4 2 6 function field ALUOp operation

ALU Control Multi-level control (decoding)
Computer Organization & Design 5th. ALU Control Multi-level control (decoding) Instruction opcode: main control unit – first level 00 = lw, sw 01 = beq, 10 = arithmetic 2nd level: function code for arithmetic : sub control Main CU generates the ALUOP bits as inputs of the ALU control unit Reduce the size of main control but may increase the delay

ALU Control Truth table X : don’t care term
Computer Organization & Design 5th. ALU Control Truth table X : don’t care term All zeros or don’t care terms are eliminated Input Output 注意事項： 1.ALUOP 目前無 ’11’項所以原來的’10’改成’1X’ 2.Funct field中F5F4皆為 ’10’故改成’XX’

Design Main Control Unit(設計主要的控制單元)
Computer Organization & Design 5th. Design Main Control Unit(設計主要的控制單元) 指令的格式 Op 欄位：Op[5 : 0] R 型指令、相等則分支(beq)指令及儲存指令中，暫存器：指令的25 : 21 位元及20 : 16 位元的rs 欄位及 rt 欄位載入及儲存指令中的基底暫存器：指令的25 : 21 位元(rs) 相等則分支(beq)指令﹑載入指令及儲存指令的16 位元偏移量(offset)：指令的15 : 0 位元

Computer Organization & Design 5th. A Simple Implementation Scheme Seven single-bit control lines, one 2-bit ALUOp control signal Except for PCSrc, the control signal can be set solely based on the opcode field of the instruction. To generate PCSrc, we need to AND together a signal from the control unit, which we call Branch, with the Zero signal out of the ALU. 未設定設定

The Simple Datapath with the Control Unit
Computer Organization & Design 5th. The Simple Datapath with the Control Unit P C I n s t r u c i o m e y R a d [ 3 1 ] 2 6 5 A M g L U O p W B h D S 4 x l f Z

Computer Organization & Design 5th. The Simple Datapath with the Control Unit 定義所有控制訊號線應該如何對每一種運作碼來設定第一列對應到R 格式的指令(add、sub、and、or 及slt)，來源暫存器都是rs 和rt，目的暫存器都是rd；定義了ALUSrc 和RegDst 控制訊號線如何設定 R-型指令將運算結果寫入暫存器(RegWrite=1)，但不會存取(讀寫)數據記憶體。當Branch 控制訊號等於0 時，PC 的值無條件地由PC+4 取代；否則，如果ALU 的Zero 輸出值也為1 時，PC 的值由分支目標位址取代 R-型指令的ALUOp 欄位被設定為10，表示ALU 的控制是由功能欄位(funct field)來產生

Computer Organization & Design 5th. The Simple Datapath with the Control Unit 第二列及第三列說明lw 指令及sw 指令的控制訊號設定。 ALUSrc 和ALUOp 欄位設定成執行位址的計算。 Mem-Read 及MemWrite 設定成執行記憶體的存取。載入指令，RegDst 及RegWrite 設定成將結果儲存到rt 暫存器中。

Computer Organization & Design 5th. The Simple Datapath with the Control Unit 分支指令的ALUOp 欄位設定成執行減法(ALU 控制=01)，用來測試兩個值是否相等。如果RegWrite 控制訊號為0 時MemtoReg 欄位是無關緊要的：因為暫存器不會被寫入，所以暫存器的Write data 輸入值不會被用到。表格最後兩列的MemtoReg 項目用X 來表示，代表「don’t care」。當RegWrite 為0 時，RegDst 也可以使用X 來表示。

Datapath operation(數據通道運作)
Computer Organization & Design 5th. Datapath operation(數據通道運作) 數據通道對R 型指令：add $t1, $t2, $t3 一個時脈週期內可以把它想像成執行了四個步驟：指令被擷取，並且遞增PC 的值兩個暫存器從暫存器檔案中讀出；同時，主控制單元計算控制訊號線的值 ALU 根據功能碼(指令中的5 :0 位元功能欄位)來產生ALU 功能的控制，以對從暫存器檔案中讀出的值運算 ALU 的運算結果使用指令的15:11 位元來選擇目的暫存器 ($t1)，以寫入暫存器檔案

35 or 43 rs rt address lw $t0, 32($s3) ； 35 19 8 32 31:26 25:21 20:16
Computer Organization & Design 5th. lw $t0, 32($s3) ； 35 or 43 rs rt address 31:26 25:21 20:16 15:0

Computer Organization & Design 5th. Datapath operation(數據通道運作) 載入指令中有動作的功能單元和被設定的控制訊號 lw $t0, 32($s3) 想像成執行了五個步驟：指令從指令記憶體中被擷取，並且遞增PC 的值暫存器的值從暫存器檔案中讀出 ALU 計算由暫存器檔案中讀出的值和符號延伸過的指令中較低的16 位元偏移量的和 ALU 所得的和作為數據記憶體的位址記憶體傳回的數據寫入暫存器檔案；暫存器的目的地可由指令中的位元20:16 得知

4 rs rt address beq $s1, $s2, 100 ； 4 17 18 25 31:26 25:21 20:16 15:0
Computer Organization & Design 5th. beq $s1, $s2, 100 ； 4 rs rt address 31:26 25:21 20:16 15:0

Computer Organization & Design 5th. Datapath operation(數據通道運作) beq指令： beq $s0, $s1, 100 想像成執行時的四個步驟：指令從指令記憶體中被擷取，並且遞增PC 的值兩個暫存器$t1 和$t2 從暫存器檔案中被讀出 ALU 對暫存器檔案中讀出的值執行減法。PC+4 的值與符號延伸並左移兩位的指令中較低的16 位元(偏移量)相加；其結果即為分支目的位址 ALU 的Zero 輸出被用來決定哪一個加法器的結果要寫回 PC

4 rs rt address beq $s1, $s2, 100 ； 4 17 18 25 31:26 25:21 20:16 15:0
Computer Organization & Design 5th. beq $s1, $s2, 100 ； 4 rs rt address 31:26 25:21 20:16 15:0

Complete Control Unit(完成控制單元)
Computer Organization & Design 5th. Complete Control Unit(完成控制單元) 加入跳躍(jump)指令，以便說明如何在基本的數據通道及控制中，再作延伸以處理其他的指令

2 address Jump 31:26 25:0 Computer Organization & Design 5th.

效能議題Performance Issues
Computer Organization & Design 5th. 效能議題Performance Issues Longest delay determines clock period 關鍵路徑：Load指令(Critical path: load instruction) 指令記憶體暫存器檔案 ALU 資料記憶體傳存器檔案 (Instruction memory  register file  ALU  data memory  register file) 對不同指令沒有彈性可以改變週期 (Not feasible to vary period for different instructions) 違反設計原則 Violates design principle 讓一般情況加快(Making the common case fast) 利用管線處理來增進效能 (We will improve performance by pipelining)

Computer Organization & Design 5th. A Simple Implementation Scheme 為什麼單一時脈週期的製作方式不被採用？每個指令的時脈週期都必須有相同長度(因此，CPI = 1) 計算機的運算處理指令中最長的路徑將決定時脈週期的長度整體效能似乎不是很好範例：單一時脈計算機的效能,假設功能單元的運算時間如下：記憶體單元： 2 ns ALU 及加法器： 2 ns 暫存器檔案 (讀取或寫入): 1 ns 下列的製作方式那一種會比較快？每個指令在一個固定長度的時脈週期內運作完成每個指令在一個時脈週期內運作完成，但時脈週期長度是可變動

Computer Organization & Design 5th. A Simple Implementation Scheme 範例 (續) 為了計算效能，假設我們使用下列指令的混合比例： 24% 載入， 12% 儲存， 44% R形態指令， 18% 分支及 2%跳躍指令解答 1. CPU 時脈週期為 8 ns. 2. CPU 時脈週期 = 8*24% + 7*12% + 6*44% + 5*18% + 2*2% = 6.3 ns 效能改進的比例為 8/6.3 = 1.27. 指令種類所用到的功能單元 R格式指令擷取暫存器存取 ALU 載入字組記憶體存取儲存字組分支跳躍指令種類指令記憶體暫存器讀取 ALU運算資料記憶體暫存器寫入總和 R格式 2 1 6ns 載入字組 8ns 儲存字組 7ns 分支 5ns 跳躍 2ns

Computer Organization & Design 5th. A Simple Implementation Scheme 範例假設我們有浮點指令單元：執行浮點加法需要8ns 執行浮點乘法需要16ns 所有功能單元所需的時間如同上例。下列的製作方式何會比較快？ 1.每個指令在一個固定長度的時脈週期內運作完成 2.每個指令在一個時脈週期內運作完成，但時脈週期長度是可變動為了計算效能，假設我們使用下列指令的混合比例： 31%載入， 21%儲存， 27% R形態指令， 5%分支，2% 跳躍指令， 7%浮點加法及7% FP浮點乘法解答 1. 最長的指令為浮點乘法，其時脈週期為 = 20 ns 2. 浮點指令的加法須時 = 12 ns. CPU 時脈週期 = 8*31% + 7*21% + 6*27% + 5*5% + 2*2% +20*7% + 12*7%= 7.0 ns 效能改進的比例為20/7 = 2.9.

管線化 Pipelining 管線式洗衣(Pipelined laundry: overlapping execution)
Computer Organization & Design 5th. 管線化 Pipelining 管線式洗衣(Pipelined laundry: overlapping execution) 平行處理增進效能(Parallelism improves performance) Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup = 2n/0.5n ≈ 4 = number of stages

MIPS管線處理(MIPS Pipeline)
Computer Organization & Design 5th. MIPS管線處理(MIPS Pipeline) 管線處理(pipelining)之定義將指令區分成數個步驟，分別由不同的功能單元同時加以執行，以增進整體程式之效能管線處理五個步驟(Five stages, one step per stage) IF: Instruction Fetch(擷取指令) 從記憶體擷取指令(Instruction fetch from memory) ID: Instruction Decode(指令解碼) 解碼指令並讀取暫存器(Instruction decode & register read) EX: Execution(執行指令) 執行運算或計算位址(Execute operation or calculate address) MEM: Memory Access(記憶體存取) 存取記憶體運算元(Access memory operand) WB: Write Back(寫回) 將結果寫回至暫存器(Write result back to register)

管線處理效能Pipeline Performance
Computer Organization & Design 5th. 管線處理效能Pipeline Performance 假設每一階段的時間為(Assume time for stages is) 暫存器讀寫：100ps for register read or write 其他階段：200ps for other stages 比較管線處理與單一週期的資料路徑 (Compare pipelined datapath with single-cycle datapath) Instr Instr fetch Register read ALU op Memory access Register write Total time lw 200ps 100 ps 800ps sw 700ps R-format 600ps beq 500ps

Pipeline Performance Single-cycle (Tc= 800ps) Pipelined (Tc= 200ps)
Computer Organization & Design 5th. Pipeline Performance Single-cycle (Tc= 800ps) Pipelined (Tc= 200ps)

管線處理加速Pipeline Speedup
Computer Organization & Design 5th. 管線處理加速Pipeline Speedup 所有階段都一致If all stages are balanced 每一階段時間都相同(i.e., all take the same time) Time between instructionspipelined = Time between instructionsnonpipelined Number of stages 若階段不一致，加速值較少If not balanced, speedup is less 管線處理的「加速』源自增加處理量(產量) Speedup due to increased throughput 延遲時間(latency每一指令的時間)沒有減少 Latency (time for each instruction) does not decrease

Pipelining and ISA Design
Computer Organization & Design 5th. Pipelining and ISA Design MIPS ISA專為管線化處理所設計 (MIPS ISA designed for pipelining) 所有指令皆為32位元(All instructions are 32-bits) 較容易在一個週期內擷取並解碼 (Easier to fetch and decode in one cycle) c.f. x86: 1- to 17-byte instructions 少量且規則的指令格式(Few and regular instruction formats) 能在一個步驟內解碼並讀取暫存器 (Can decode and read registers in one step) 載入與儲存定址(Load/store addressing) 能在第3階段計算位址，在第4階段存取記憶體 (Can calculate address in 3rd stage, access memory in 4th stage) 記憶體運算元對齊(Alignment of memory operands) 記憶體存取只需一個週期 (Memory access takes only one cycle)

危障Hazards 下一週期的起始位址不是下一指令 Situations that prevent starting the next instruction in the next cycle 危障種類Hazard types 結構危障(Structure hazards) 所需資源忙碌中(A required resource is busy) 資料(數據)危障(Data hazard) 等待前一指令完成資料讀寫 Need to wait for previous instruction to complete its data read/write 控制危障(Control hazard) 依前一指令結果決定控制動作 Deciding on control action depends on previous instruction

結構危障Structure Hazards
Computer Organization & Design 5th. 結構危障Structure Hazards 定義：當安排好的指令由於硬體無法支援當時應執行的指令在適當的時脈週期內執行使用資源衝突Conflict for use of a resource MIPS中只有一個記憶體 In MIPS pipeline with a single memory Load/store 需要做資料存取 Load/store requires data access 該週期的指令擷取必須延遲(stall)，需管線泡泡 Instruction fetch would have to stall for that cycle Would cause a pipeline “bubble” 管線式資料路徑需要獨立的指令/資料記憶體或獨立的指令/資料快取(記憶體) Hence, pipelined datapaths require separate instruction/data memories Or separate instruction/data caches

結構危障Structure Hazards
Computer Organization & Design 5th. 結構危障Structure Hazards [問題]若有第四個指令進入管線中，則….. 指令1在時間6~8需要存取記憶體，此時指令4也需要從記憶體讀取指令沖!沖!沖!

資料(數據)危障Data Hazards 定義：當安排好的指令執行所需的資料未取得而無法在適當的時脈週期內執行
Computer Organization & Design 5th. 資料(數據)危障Data Hazards 定義：當安排好的指令執行所需的資料未取得而無法在適當的時脈週期內執行與前依指令資料存取完成結果有關 An instruction depends on completion of data access by a previous instruction add $s0, $t0, $t1 sub $t2, $s0, $t3

前饋/繞送 (Forwarding/Bypassing)
Computer Organization & Design 5th. 前饋/繞送 (Forwarding/Bypassing) 使用已經計算完成的結果Use result when it is computed 不需等到存至暫存器中Don’t wait for it to be stored in a register 資料路徑需要額外的連接線 Requires extra connections in the datapath

Load指令-資料危障(Load-Use Data Hazard)
Computer Organization & Design 5th. Load指令-資料危障(Load-Use Data Hazard) 使用前饋仍無法避免要用延遲/停滯(stall) Can’t always avoid stalls by forwarding 當所需要的值尚未計算完成 If value not computed when needed 無法前饋至之前的時間 Can’t forward backward in time!

Code Scheduling to Avoid Stalls
Computer Organization & Design 5th. Code Scheduling to Avoid Stalls 利用指令重排避免下一指令為Load指令 Reorder code to avoid use of load result in the next instruction C code for A = B + E; C = B + F; stall lw $t1, 0($t0) lw $t2, 4($t0) add $t3, $t1, $t2 sw $t3, 12($t0) lw $t4, 8($t0) add $t5, $t1, $t4 sw $t5, 16($t0) 11 cycles 13 cycles

控制危障Control Hazards 定義：當所擷取的指令並非所需的指令而造成適當的指令無法在適當的管線時脈週期中執行；亦即，指令位址產生的順序非管線所期待者分支決定控制流程Branch determines flow of control 擷取下一指令取決於分支結果 Fetching next instruction depends on branch outcome 管線處理不可能永遠擷取正確的下一個指令 Pipeline can’t always fetch correct instruction 仍在分支指令的ID階段 Still working on ID stage of branch MIPS的管線處理中In MIPS pipeline 需要在管線中比較暫存器與提早計算目標位址Need to compare registers and compute target early in the pipeline 在ID階段增加硬體來處理 Add hardware to do it in ID stage

分支中的延遲Stall on Branch 等到分支結果來決定擷取下一指令 Wait until branch outcome determined before fetching next instruction

分支預測Branch Prediction
Computer Organization & Design 5th. 分支預測Branch Prediction 較長的管線無法完全提早決定分之結果 Longer pipelines can’t readily determine branch outcome early 延遲時間變得無法接受 Stall penalty becomes unacceptable 分支預測結果Predict outcome of branch 預測錯誤只有造成延遲Only stall if prediction is wrong MIPS管線處理中In MIPS pipeline 可以預測分支未發生Can predict branches not taken 分支後的擷取指令沒有延遲 Fetch instruction after branch, with no delay

MIPS with Predict Not Taken
Computer Organization & Design 5th. MIPS with Predict Not Taken Prediction correct Prediction incorrect

More-Realistic Branch Prediction
Computer Organization & Design 5th. More-Realistic Branch Prediction 靜態分支預測Static branch prediction 基於典型分支行為Based on typical branch behavior 範例：迴圈與if指令Example: loop and if-statement branches 預測反向分支會發生Predict backward branches taken 預測前向分支不會發生Predict forward branches not taken 動態分支預測Dynamic branch prediction 硬體預測器根據每道分支的行為來作預測，並且在程式運作過程中可以改變對一道分支的預測硬體測量實際分支行為Hardware measures actual branch behavior 例如：記錄每一分支最近結果的歷史 e.g., record recent history of each branch 假設未來行為會持續趨勢 Assume future behavior will continue the trend 當猜錯時，使用重新擷取時延遲stall、並更新歷史紀錄 When wrong, stall while re-fetching, and update history

管線處理總結Pipeline Summary
Computer Organization & Design 5th. 管線處理總結Pipeline Summary 管線處理增加效能是利用增加指令處理量 Pipelining improves performance by increasing instruction throughput 同時執行多個指令Executes multiple instructions in parallel 每一指令有相同延遲時間Each instruction has the same latency 危障Subject to hazards 結構、資料、控制Structure, data, control 指令集設計影響實現管線處理的複雜度 Instruction set design affects complexity of pipeline implementation

了解程式效能有效率的管道運作通常是除了記憶體系統之外，決定處理器的CPI 也就是其效能最重要的因素
Computer Organization & Design 5th. 了解程式效能有效率的管道運作通常是除了記憶體系統之外，決定處理器的CPI 也就是其效能最重要的因素結構危障通常發生在可能無法完全管道化的浮點單元中控制危障通常在有較多分支而且分支較難預測的整數程式中較難處理數據危障通常在浮點程式中由於其較少的分支以及較規律的記憶體存取樣式，方便讓編譯器試著安排指令而較易避免數據危障對於偏向使用指標(pointer)而導致較不規律記憶體存取的整數程式則較難以這個方法來改善管道化改善了指令的處理量然而反而會增加單一指令的執行時間或延遲

管道化數據通道及控制 1. IF：指令擷取 2. ID：指令解碼與暫存器檔案讀取 3. EX：執行或位址計算 4. MEM：數據記憶體存取
Computer Organization & Design 5th. 管道化數據通道及控制單一週期數據通道 5 級的管道代表在任一時脈週期內最多有5 道指令正在執行 1. IF：指令擷取 2. ID：指令解碼與暫存器檔案讀取 3. EX：執行或位址計算 4. MEM：數據記憶體存取 5. WB：寫回

MIPS Pipelined Datapath
Computer Organization & Design 5th. MIPS Pipelined Datapath 2個例外 Write Back(WB) Next PC WB MEM Right-to-left flow leads to hazards

管線暫存器Pipeline registers
Computer Organization & Design 5th. 管線暫存器Pipeline registers 管線各階段間需要暫存器Need registers between stages 保存上一階段產生的資訊 To hold information produced in previous cycle 64bits 128 97 64bits

Pipeline Operation 管線化資料路徑「逐一週期」的流程 Cycle-by-cycle flow of instructions through the pipelined datapath 單一週期管線圖 “Single-clock-cycle” pipeline diagram 顯示單一週期管線的使用Shows pipeline usage in a single cycle 標示使用到的資源Highlight resources used 比較：多重時脈週期管線圖 c.f. “multi-clock-cycle” diagram Graph of operation over time 審視load跟store指令單一週期管線圖 We’ll look at “single-clock-cycle” diagrams for load & store

單一時脈週期數據路徑假設以管道化的方式來執行
Computer Organization & Design 5th. 單一時脈週期數據路徑假設以管道化的方式來執行

管道化數據通道及控制顯示載入指令通過管道中五個階段時數據通道中強調出有動作的部分的情形
Computer Organization & Design 5th. 管道化數據通道及控制顯示載入指令通過管道中五個階段時數據通道中強調出有動作的部分的情形任何在後方管道階段會使用到的資訊都必須透過管道暫存器來傳遞給各該階段數據通道中的每個邏輯元件如指令記憶體、暫存器讀取埠、ALU、數據記憶體和暫存器寫入埠只能用於唯一的管道階段期間，否則會發生結構危障現在我們來發現在載入指令設計中的一個錯誤。你看出來了嗎？寫入暫存器的編號！

IF for Load, Store, …

ID for Load, Store, …

EX for Load

MEM for Load

WB for Load Wrong register number

Load指令之修正資料路徑 (Corrected Datapath for Load)
Computer Organization & Design 5th. Load指令之修正資料路徑 (Corrected Datapath for Load)

EX for Store

MEM for Store

WB for Store

Multi-Cycle Pipeline Diagram
Computer Organization & Design 5th. Multi-Cycle Pipeline Diagram 顯示資源利用(Form showing resource) usage

The Processor: Datapath and Control

Similar presentations

Presentation on theme: "The Processor: Datapath and Control"— Presentation transcript:

Similar presentations

About project

反馈

请登录

Auth with social network:

The Processor: Datapath and Control

Similar presentations

Presentation on theme: "The Processor: Datapath and Control"— Presentation transcript:

Similar presentations

About project

反馈