Lecture on High Performance Processor Architecture (CS05162)

Lecture on High Performance Processor Architecture (CS05162)
Review of Instruction Sets, Pipelines An Hong Fall 2009 School of Computer Science and Technology University of Science and Technology of China

Outline Quick review of everything you should have learned
Instruction Sets, Pipelines 2018/11/16 USTC CS AN Hong

Quick review of everything you should have learned
2018/11/16 USTC CS AN Hong

理解现代计算机系统设计的抽象层次软件指令集体系结构 (以及I/O接口) 硬件应用问题算法+数据结构语言微体系结构
逻辑和集成电路设计器件(晶体管/集成电路工艺学) 算法+数据结构软件硬件 1-4 2018/11/16 USTC CS AN Hong 4

计算机体系结构定义:经典,狭义定义 temp = v[k]; High Level Language Program
v[k] = v[k+1]; v[k+1] = temp; Compiler lw $15, 0($2) lw $16, 4($2) sw $16, 0($2) sw $15, 4($2) Assembly Language Program Assembler Machine Language Program Machine Interpretation Control Signal Specification ALUOP[0:3] <= InstReg[9:11] & MASK 2018/11/16 USTC CS AN Hong

计算机体系结构定义:经典,狭义定义 ... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls the logic design, and the physical implementation – Amdahl, Blaaw, and Brooks, 1964 2018/11/16 USTC CS AN Hong

计算机体系结构定义:经典,狭义定义 ISA 是程序员看到的计算机的属性(概念结构和功能结构) 数据的表示:硬件能直接辨认和处理的数据结构类型
寻址规则:最小寻址单位,寻址方式及其表示寄存器:各种寄存器的定义,数量和使用方式指令集:机器指令的操作类型和格式,指令间的排序和控制机构等中断和例外系统:中断的类型和中断响应硬件的功能等机器工作状态的定义和切换:如管态和目态等存储系统:主存容量,程序员可用的最大存储容量等信息保护:包括信息保护方式和硬件对信息保护的支持 I/O结构:包括I/O联结方式,处理机/存储器与I/O设备间数据传送的方式和格式以及I/O操作的状态等 2018/11/16 USTC CS AN Hong

现代计算机体系结构的广义定义逻辑实现（组成设计）物理实现（硬件设计）设计任务设计问题特征(用途) 成本(价格) 性能
ISA设计逻辑实现: CPU设计,存储系统设计,总线设计物理实现集成电路设计,封装,电源设计,冷却 I/O system Instr. Set Proc. Compiler Operating System Application Logic Design Circuit Design Instruction Set Architecture Firmware Datapath & Control Layout 逻辑实现（组成设计）物理实现（硬件设计） 2018/11/16 USTC CS AN Hong

计算机设计者的任务 2018/11/16 USTC CS AN Hong

理解现代计算机系统设计的影响因素 2018/11/16 2018/11/16 USTC CS AN Hong 10

计算机系统性能评价方法怎样评测一个计算机系统的性能,与测试者所处的角度有关用户: 对单个应用系统的响应时间(快), 程序的运行时间(少)
管理员: 单位时间里系统的吞吐量(大) 2018/11/16 USTC CS AN Hong

Understanding Program Performance
Hardware or software component How this component affects performance Where is this topic covered? Algorithm Determine both the number of source-level statements and the number of I/O operations executed Book1 Programming language, compiler, and architecture Determine the number of machine instructions for each source-level statement Book2:chapters 2 and 3 Processor and memory system Determine how fast instructions can be executed Book2:chapters 5,6,7 I/O system(hardware and operating system) Determine how fast I/O operations may be executed Book2:chapters 8 2018/11/16 USTC CS AN Hong

Text Books Book1: 陈国良，并行算法分析与设计,高等教育出版社
Book2: David A. Patterson and John L. Hennessy，Computer Organization and Design: The Hardware/Software Interface, Third Edition.机械工业出版社. 2018/11/16 USTC CS AN Hong

Computer Performance Inst Count CPI Clock Rate Program X
Cycle time CPU time = Seconds = Instructions x Cycles x Seconds Program Program Instruction Cycle Inst Count CPI Clock Rate Program X Compiler X (X) Inst. Set X X Organization X X Technology X 2018/11/16 USTC CS AN Hong

CPU性能公式执行一个程序所需的时钟周期数× 时钟周期 CPU时间 = = IC ×CPI × Cycle time
时钟频率 IC ×CPI Frequency CPIi: 执行指令 i 所需的平均时钟周期数 ICi : 指令 i 在程序中执行的次数 CPI = = 所有指令的平均时钟周期数指令的执行频度 2018/11/16 USTC CS AN Hong

Instruction Set Design
software hardware Which is easier to change/design??? 2018/11/16 USTC CS AN Hong

指令集的演化 2018/11/16 USTC CS AN Hong

Instruction Set Architecture: 每个周期做什么事?
Fetch Decode Operand Execute Result Store Next Stage1:从存储系统中获得指令 Stage2a:确定做何动作 Stage2b:获得操作数 Stage3:产生运算结果或状态 Stage4:向存储系统中存放运算结果 Stage5:确定下一条要执行的指令 2018/11/16 USTC CS AN Hong

Instruction Set Architecture: 必须定义什么?
Instruction Format or Encoding（编码方式） how is it decoded? Location of operands and result（寻址方式） where other than memory? how many explicit operands? how are memory operands located? which can or cannot be in memory? Data type and Size（数据类型） Operations（操作类型） what are supported Successor instruction（控制流的显式表示） jumps, conditions, branches 指令处理必须经过 fetch-decode-execute ! Instruction Fetch Decode Operand Execute Result Store Next 2018/11/16 USTC CS AN Hong

指令集的操作:CISC指令集结构 CISC指令集结构优点: 指令功能强, 编译生成的程序指令条数少, 软件功能向硬件功能迁移
缺点: 指令功能不均衡; 20%的指令使用频率占运行时间的80%; 实现复杂 2018/11/16 USTC CS AN Hong

指令集的操作: Top 10 80x86 Instructions
Rank instruction Integer Average Percent total executed 1 load 22% 2 conditional branch 20% 3 compare 16% 4 store 12% 5 add 8% 6 and 6% 7 sub 5% 8 move register-register 4% 9 call 1% 10 return Total 96% Simple instructions dominate instruction frequency 2018/11/16 USTC CS AN Hong

指令集的操作:RISC指令集结构 RISC指令集结构 RISC指令集结构设计原则
优点: 只包含那些使用频率很高的指令和一些必要的指令, 降低指令集结构的复杂性,简化实现; 程序的运行时间缩短, 从而提高了性能. RISC指令集结构设计原则选取使用频率最高的指令每条指令的功能尽可能简单, 能在1个机器周期内完成所有指令长度相同，简化指令译码，支持高效的流水线简单的Load/Store型指令集：只有Load和Store指令可以访问存储器, 其它指令的操作均在寄存器间进行以最简单有效的方式支持高级语言，支持高效的编译器 2018/11/16 USTC CS AN Hong

MIPS I Instruction Set Architecture
Registers GPRs: 32bits; R0(=0), R1, …, R31 FPRs: 32bits; F0, …, F31(Single Precision) 64bits; F0(F0,F1), …, F30(F30,F31)(Double Precision) Multiply and Divide Registers, PC Other Registers r31 r0 r1 r30 : : 31 Floating-Point Registers f0 f1 f30 f31 HI Multiply and Divide Registers LO PC Program Counter General Purpose Registers 2018/11/16 USTC CS AN Hong

Example: MIPS ( DLX) Register-Register Op Rs1 Rs2 Rd Opx
31 26 25 21 20 16 15 11 10 6 5 Op Rs1 Rs2 Rd Opx Register-Immediate 31 26 25 21 20 16 15 immediate Op Rs1 Rd Branch 31 26 25 21 20 16 15 immediate Op Rs1 Rs2/Opx Jump / Call 31 26 25 target Op 2018/11/16 USTC CS AN Hong

MIPS I Instruction Set Architecture:Addressing Modes
All instructions 32 bits wide Register (direct) 寄存器寻址 op rs rt rd register Immediate 立即值寻址 immed op rs rt immed op rs rt register + Memory Base+index 偏移寻址 immed op rs rt PC + Memory PC-relative PC相对寻址 2018/11/16 USTC CS AN Hong

MIPS Addressing Modes/Instruction Formats:例子
op rs rt rd shamt funct 6 11 16 21 26 31 6 bits 5 bits ADD and SUB addU rd, rs, rt subU rd, rs, rt OR Immediate: ori rt, rs, imm16 LOAD and STORE Word lw rt, offset(rs) sw rt, offset(rs) BRANCH: beq rs, rt, offset op rs rt imm16 16 21 26 31 6 bits 16 bits 5 bits op rs rt offset 16 21 26 31 6 bits 16 bits 5 bits In today’s lecture, I will show you how to implement the following subset of MIPS instructions: add, subtract, or immediate, load, store, branch, and the jump instruction. The Add and Subtract instructions use the R format. The Op together with the Func fields together specified all the different kinds of add and subtract instructions. Rs and Rt specifies the source registers. And the Rd field specifies the destination register. The Or immediate instruction uses the I format. It only uses one source register, Rs. The other operand comes from the immediate field. The Rt field is used to specified the destination register. (Note that dest is the Rt field!) Both the load and store instructions use the I format and both add the Rs and the immediate filed together to from the memory address. The difference is that the load instruction will load the data from memory into Rt while the store instruction will store the data in Rt into the memory. The branch on equal instruction also uses the I format. Here Rs and Rt are used to specified the registers we need to compare. If these two registers are equal, we will branch to a location offset by the immediate field. Finally, the jump instruction uses the J format and always causes the program to jump to a memory location specified in the address field. I know I went over this rather quickly and you may have missed something. But don’t worry, this is just an overview. You will keep seeing these (point to the format) all day today. +3 = 13 min. (X:53) op rs rt offset 16 21 26 31 6 bits 16 bits 5 bits 2018/11/16 USTC CS AN Hong

Datapath vs Control Datapath Controller Control Points signals Datapath: Storage, FU, interconnect sufficient to perform the desired functions Inputs are Control Points Outputs are signals Controller: State machine to orchestrate operation on the data path Based on desired function and signals 2018/11/16 USTC CS AN Hong

Datapath and Control: Split state diag into 5 pieces
IR <- Mem[PC]; NPC <–PC+4; PC <– PC+4; A <- R[rs]; B<– R[rt]; V<- imm or offset<- imm S <– A + B; R-R型运算 R[rd] <– S; S <– A + offset; R-imm型Load M <– Mem[S] R[rt] <– M; S <– A or V; R-imm型运算 R[rt] <– S; R-imm型Store Mem[S] <- B S <– NPC+offset; Cond <– A op 0 取指（IFetch）读寄存器/译码（Reg/ID）执行（Exec）访存（Mem）写回（WB） If Cond PC <– S else PC <–NPC; 2018/11/16 USTC CS AN Hong

5 Steps of DLX Datapath 4 Instruction Fetch Instr. Decode Reg. Fetch
Execute Addr. Calc Memory Access Write Back Next PC MUX 4 Adder Next SEQ PC Zero? RS1 Reg File Address Memory Inst. RS2 MUX ALU Inst Memory Data L M D RD MUX MUX Sign Extend IR <= mem[PC]; PC <= PC + 4 Imm WB Data Reg[IRrd] <= Reg[IRrs] opIRop Reg[IRrt] 2018/11/16 USTC CS AN Hong

5 Steps of DLX Datapath 4 Instruction Fetch Instr. Decode Reg. Fetch
Execute Addr. Calc Memory Access Write Back Next PC IF/ID ID/EX MEM/WB EX/MEM MUX Next SEQ PC Next SEQ PC 4 Adder Zero? RS1 Reg File Address Memory Inst. MUX RS2 ALU Memory Data MUX MUX IR <= mem[PC]; PC <= PC + 4 Sign Extend Imm WB Data A <= Reg[IRrs]; B <= Reg[IRrt] RD RD RD rslt <= A opIRop B WB <= rslt 2018/11/16 USTC CS AN Hong Reg[IRrd] <= WB

5 Steps of DLX Datapath 4 Data stationary control Instruction Fetch
Instr. Decode Reg. Fetch Execute Addr. Calc Memory Access Write Back Next PC IF/ID ID/EX MEM/WB EX/MEM MUX Next SEQ PC Next SEQ PC 4 Adder Zero? RS1 Reg File Address Memory Inst. MUX RS2 ALU Memory Data MUX MUX Sign Extend Imm WB Data RD RD RD Data stationary control local decode for each instruction phase / pipeline stage 2018/11/16 USTC CS AN Hong

Visualizing Pipelining
Time (clock cycles) Reg ALU DMem Ifetch Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7 Cycle 5 I n s t r. O r d e 2018/11/16 USTC CS AN Hong

Pipelining is not quite that easy!
Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away) Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock) Control hazards: Caused by delay between the fetching of instructions and decisions 2018/11/16 USTC CS AN Hong

Can pipelining get us into trouble?
Yes: Pipeline Hazards structural hazards: attempt to use the same resource two different ways at the same time E.g., combined washer/dryer would be a structural hazard or folder busy doing something else (watching TV) data hazards: attempt to use item before it is ready E.g., one sock of pair in dryer and one in washer; can’t fold until get sock from washer through dryer instruction depends on result of prior instruction still in the pipeline control hazards: attempt to make a decision before condition is evaulated E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in branch instructions Can always resolve hazards by waiting ？ pipeline control must detect the hazard take action (or delay action) to resolve hazards 2018/11/16 USTC CS AN Hong

结构相关：由访存引起的结构相关 Load Instr 1 Instr 2 Instr 3 Instr 4
Mem I n s t r. O r d e Time (clock cycles) Load Instr 1 Instr 2 Instr 3 Instr 4 ALU Reg 取指和取数都要访问同一个存储器 Detection is easy in this case! (right half highlight means read, left half write) 2018/11/16 USTC CS AN Hong

结构相关的解决方案：阻塞 Load Instr 1 Instr 2 Instr 3 Instr 4 Time (clock cycles)
Mem I n s t r. O r d e Time (clock cycles) Load Instr 1 Instr 2 Instr 3 Instr 4 ALU Reg Stall 取指延迟一拍进行 2018/11/16 USTC CS AN Hong

控制相关： What’s the Problem?
例： BEQ rs, rt, offset if R[rs] == R[rt] then PC<- PC+offset bne r2, #0, r3 add r4,r5,r6 sub r7,r8,r9 T NT Need address here Instruction Fetch Decode Execute Memory Access Writeback Branch Delay Compute address here 分支处理问题可划分为两个子问题决定分支的方向（分支条件相关）对需要跳转的分支，使执行延迟最小化 -> 尽快获得转移的目标地址（分支地址相关） 2018/11/16 USTC CS AN Hong

Control Hazard Solution #1: Stall
e Time (clock cycles) Add Beq Load ALU Mem Reg Lost potential Stall: wait until decision is clear Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow Move decision to end of decode save 1 cycle per branch 2018/11/16 USTC CS AN Hong

Pipelined DLX Datapath
Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc Memory Access Write Back Next PC Next SEQ PC ID/EX EX/MEM MEM/WB MUX 4 Adder IF/ID Adder Zero? RS1 Address Reg File Memory RS2 ALU Memory Data MUX MUX Sign Extend Imm WB Data RD RD RD Interplay of instruction set design and cycle time. 2018/11/16 USTC CS AN Hong

Control Hazard Solution #2: Predict
Time (clock cycles) Add Beq Load ALU Mem Reg Predict: guess one direction then back up if wrong Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right 50% of time) More dynamic scheme: history of 1 branch ( 90%) 2018/11/16 USTC CS AN Hong

Control Hazard Solution #3: Delayed Branch
Time (clock cycles) Add Beq Misc ALU Mem Reg Load Delayed Branch: Redefine branch behavior (takes place after next instruction) Impact: 0 clock cycles per branch instruction if can find instruction to put in “slot” ( 50% of time) As launch more instruction per clock cycle, less useful 2018/11/16 USTC CS AN Hong

Data Hazard on R1 Read After Write (RAW) InstrJ tries to read operand before InstrI writes it Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication. I: add r1,r2,r3 J: sub r4,r1,r3 2018/11/16 USTC CS AN Hong

Data Hazard on r1: Read after write hazard (RAW)
add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 2018/11/16 USTC CS AN Hong

Data Hazard on r1: Read after write hazard (RAW)
Dependencies backwards in time are hazards Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 Im Reg ALU Reg Dm I n s t r. O r d e ALU sub r4,r1,r3 Im Reg Dm Reg ALU and r6,r1,r7 Im Reg Dm Reg or r8,r1,r9 Im Reg ALU Dm Reg ALU Im Reg Dm xor r10,r1,r11 2018/11/16 USTC CS AN Hong

Data Hazard Solution: Forwarding
“Forward” result from one stage to another Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 Im Reg ALU Reg Dm I n s t r. O r d e ALU sub r4,r1,r3 Im Reg Dm Reg ALU and r6,r1,r7 Im Reg Dm Reg or r8,r1,r9 Im Reg ALU Dm Reg ALU Im Reg Dm xor r10,r1,r11 2018/11/16 USTC CS AN Hong

HW Change for Forwarding
ID/EX EX/MEM MEM/WR NextPC mux Registers ALU Data Memory mux mux Immediate What circuit detects and resolves this hazard? 2018/11/16 USTC CS AN Hong

Forwarding (or Bypassing): What about Loads?
Dependencies backwards in time are hazards Data Hazard Even with Forwarding Can’t solve with forwarding ，Must delay/stall instruction dependent on loads Time (clock cycles) IF ID/RF EX MEM WB lw r1,0(r2) Im Reg ALU Reg Dm ALU sub r4,r1,r3 Im Reg Dm Reg 2018/11/16 USTC CS AN Hong

Forwarding (or Bypassing): What about Loads ?
Dependencies backwards in time are hazards Data Hazard Even with Forwarding Can’t solve with forwarding ，Must delay/stall instruction dependent on loads Time (clock cycles) IF ID/RF EX MEM WB lw r1,0(r2) Im Reg ALU Reg Dm ALU Im Reg Dm sub r4,r1,r3 Stall 2018/11/16 USTC CS AN Hong

Software Scheduling to Avoid Load Hazards
Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d ,e, and f in memory. Slow code: LW Rb,b LW Rc,c ADD Ra,Rb,Rc SW a,Ra LW Re,e LW Rf,f SUB Rd,Re,Rf SW d,Rd Fast code: LW Rb,b LW Rc,c LW Re,e ADD Ra,Rb,Rc LW Rf,f SW a,Ra SUB Rd,Re,Rf SW d,Rd Compiler optimizes for performance. Hardware checks for safety. 2018/11/16 USTC CS AN Hong

Three Generic Data Hazards
Write After Read (WAR) InstrJ writes operand before InstrI reads it Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”. Can’t happen in DLX 5 stage pipeline because: All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5 I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7 2018/11/16 USTC CS AN Hong

Three Generic Data Hazards
Write After Write (WAW) InstrJ writes operand before InstrI writes it. Called an “output dependence” by compiler writers This also results from the reuse of name “r1”. Can’t happen in DLX 5 stage pipeline because: All instructions take 5 stages, and Writes are always in stage 5 Will see WAR and WAW in more complicated pipes I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7 2018/11/16 USTC CS AN Hong

总结：影响指令级并行性的因素 Pipeline CPI=Ideal pipeline CPI + Structural stalls +
RAW stalls + WAR stalls + WAW stalls + Control stalls 改进理想的CPI：多发射（静态/动态）克服流水线中的相关性结构相关：由资源冲突导致的相关解决办法：增加资源数据相关：由RAW、WAW、WAR导致的相关解决办法（用软件）：编译器静态调度，循环展开，寄存器重命名，软流水（用硬件）：forwarding技术，寄存器重命名，动态调度的乱序执行技术（记分板，Tomasulo算法）控制相关：由分支引起的相关解决方法：静态/动态预测和推测执行 2018/11/16 USTC CS AN Hong

总结：数据相关(又称数据依赖) 在程序的一个基本块中存在的数据相关有以下几种情形：真数据依赖：两条指令之间存在数据流，有真正的数据依赖关系
RAW（Read After Write）相关：对于指令i和j，如果 (1) 指令j使用指令i产生的结果，则称指令j与指令i为RAW相关；或者 (2) 指令j与指令i存在RAW相关，而指令k与指令j存在RAW相关，则称指令k与指令i为RAW相关伪数据依赖（又称名相关）：指令使用的寄存器或存储器称为名。两条指令使用相同名，但它们之间不存在数据流，则它们之间是一种伪数据依赖关系，包括两种情形： WAR（Write After Read）相关：对于指令i和j，如果指令i先执行，指令j写的名是指令i读的名，则称指令j与指令i为WAR相关（又称反相关，anti-dependence） WAW（ Write After Write）相关: 对于指令i和j，如果指令i与指令j写相同的名，则称指令j与指令i为WAW相关（又称输出相关,output-dependence） 2018/11/16 USTC CS AN Hong

总结：开发指令级并行性的技术 2018/11/16 USTC CS AN Hong

Lecture on High Performance Processor Architecture (CS05162)

Similar presentations

Presentation on theme: "Lecture on High Performance Processor Architecture (CS05162)"— Presentation transcript:

Similar presentations

About project

反馈

请登录

Auth with social network:

Lecture on High Performance Processor Architecture (CS05162)

Similar presentations

Presentation on theme: "Lecture on High Performance Processor Architecture (CS05162)"— Presentation transcript:

Similar presentations

About project

反馈