Lecture on High Performance Processor Architecture (CS05162)

Slides:



Advertisements
Similar presentations
期末考试作文讲解 % 的同学赞成住校 30% 的学生反对住校 1. 有利于培养我们良好的学 习和生活习惯; 1. 学生住校不利于了解外 界信息; 2 可与老师及同学充分交流有 利于共同进步。 2. 和家人交流少。 在寄宿制高中,大部分学生住校,但仍有一部分学生选 择走读。你校就就此开展了一次问卷调查,主题为.
Advertisements

考研英语复试 口语准备 考研英语口语复试. 考研英语复试 口语准备 服装 谦虚、微笑、自信 态度积极 乐观沉稳.
第七课:电脑和网络. 生词 上网 vs. 网上 我上网看天气预报。 今天早上看了网上的天气预报。 正式 zhèngshì (报告,会议,纪录) 他被这所学校正式录取 大桥已经落成,日内就可以正式通车 落伍 luòw ǔ 迟到 chídào 他怕迟到,六点就起床了.
§2 计算机系统结构、组成与实现 计算机系统结构、组成与实现的定义和内涵 计算机系统结构、组成和实现的相互关系.
嵌入式系统与单片机 北京科技大学电子信息系.
程序的执行 程序执行和指令执行概述 数据通路基本结构和工作原理 流水线方式下指令的执行
2007年8月龙星课程 周源源老师课程体会 包云岗 中科院计算所
-CHINESE TIME (中文时间): Free Response idea: 你周末做了什么?
建構 Beta電腦 – Fall /29/0.
第3章 流水线技术.
周学海 , 中国科学技术大学 2017/3/16 计算机体系结构 周学海 , 中国科学技术大学 中国科学技术大学.
信息科学与工程学院计算机科学系 2006年9月—2007年1月
資料庫設計 Database Design.
最新計算機概論 第3章 計算機組織.
第4章 VHDL设计初步.
Combinational Logic 組合邏輯
大连理工大学软件学院 软件工程系 赖晓晨 计算机组成与结构 大连理工大学软件学院 软件工程系 赖晓晨
Operating System CPU Scheduing - 3 Monday, August 11, 2008.
Lecture on High Performance Processor Architecture (CS05162)
CH.2 Introduction to Microprocessor-Based Control
第 2 章 中央處理單元.
微处理器设计1 刘鹏 College of ISEE Zhejiang University
臺北市立大學 資訊科學系(含碩士班) 賴阿福 CS TEAM
Quiz 3 假设各种分支占所有指令数的百分比如下表所示:
周学海 , 中国科学技术大学 2018/11/15 计算机体系结构 周学海 , 中国科学技术大学 11/15/2018 中国科学技术大学.
第4章 处理器(CPU) 4.1 引言 4.2 逻辑设计的一般方法 4.3 建立数据通路 4.4 一个简单的实现机制 4.5 多周期实现机制.
指令集架構 計算機也跟人類一樣,需要提供一套完整的語言讓人們跟它充分溝通,以完成正確的計算工作。
英语教学课件系列 八年级(上) it! for Go.
5 Computer Organization (計算機組織).
Lecture on High Performance Processor Architecture (CS05162)
The Processor: Datapath and Control
Fundamentals of Physics 8/e 27 - Circuit Theory
Decision Support System (靜宜資管楊子青)
第4章 网络互联与广域网 4.1 网络互联概述 4.2 网络互联设备 4.3 广域网 4.4 ISDN 4.5 DDN
微程序控制器 刘鹏 Dept. ISEE Zhejiang University
创建型设计模式.
第三章 流水线技术.
計算機結構 – 概論 陳鍾誠 於金門大學.
嵌入式微处理器系统 第二章 处理器技术(1) 北京大学软件与微电子学院.
组合逻辑3 Combinational Logic
Lecture on High Performance Processor Architecture (CS05162)
重點 資料結構之選定會影響演算法 選擇對的資料結構讓您上天堂 程式.
JTAG INTERFACE SRAM TESTER WITH C-LCM
陳慶瀚 機器智慧與自動化技術(MIAT)實驗室 國立中央大學資工系 2013年5月28日
第4章(1) 空间数据库 —数据库理论基础 北京建筑工程学院 王文宇.
数字系统设计 Digital System Design
Decision Support System (靜宜資管楊子青)
第3章 流水线技术 曲冠南
第十五课:在医院看病.
Unit 11.
計算機概論 第3章 計算機組織與結構概觀.
Instructions: Language of the Machine
Unit 8 Our Clothes Topic1 What a nice coat! Section D 赤峰市翁牛特旗梧桐花中学 赵亚平.
Guide to a successful PowerPoint design – simple is best
BORROWING SUBTRACTION WITHIN 20
The Processor: Datapath and Control (Multi-cycle implementation)
中国科学技术大学计算机系 陈香兰 2013Fall 第七讲 存储器管理 中国科学技术大学计算机系 陈香兰 2013Fall.
虚 拟 仪 器 virtual instrument
周学海 中国科学技术大学 2019/4/19 计算机体系结构 周学海 中国科学技术大学.
The Role of Parents in the Moral Development of the Child
计算机问题求解 – 论题1-5 - 数据与数据结构 2018年10月16日.
第4章 指令级并行 授课教师:车喜龙
清华大学计算机科学与技术系高性能计算研究所 郑纬民 教授 2005年5月
怎樣把同一評估 給與在不同班級的學生 How to administer the Same assessment to students from Different classes and groups.
Operating System Software School of SCU
Race Conditions and Semaphore
多维阅读第13级.
Principle and application of optical information technology
以分为镜知对错 以卷为鉴晓得失 —邯郸市一模得与失
Train Track and Children
Presentation transcript:

Lecture on High Performance Processor Architecture (CS05162) Review of Instruction Sets, Pipelines An Hong han@ustc.edu.cn Fall 2009 School of Computer Science and Technology University of Science and Technology of China

Outline Quick review of everything you should have learned Instruction Sets, Pipelines 2018/11/16 USTC CS AN Hong

Quick review of everything you should have learned 2018/11/16 USTC CS AN Hong

理解现代计算机系统设计的抽象层次 软件 指令集体系结构 (以及I/O接口) 硬件 应用问题 算法+数据结构 语言 微体系结构 逻辑和集成电路设计 器件(晶体管/集成电路工艺学) 算法+数据结构 软件 硬件 1-4 2018/11/16 USTC CS AN Hong 4

计算机体系结构定义:经典,狭义定义 temp = v[k]; High Level Language Program v[k] = v[k+1]; v[k+1] = temp; Compiler lw $15, 0($2) lw $16, 4($2) sw $16, 0($2) sw $15, 4($2) Assembly Language Program Assembler 0000 1001 1100 0110 1010 1111 0101 1000 1010 1111 0101 1000 0000 1001 1100 0110 1100 0110 1010 1111 0101 1000 0000 1001 0101 1000 0000 1001 1100 0110 1010 1111 Machine Language Program Machine Interpretation Control Signal Specification ALUOP[0:3] <= InstReg[9:11] & MASK ° 2018/11/16 USTC CS AN Hong

计算机体系结构定义:经典,狭义定义 ... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls the logic design, and the physical implementation. – Amdahl, Blaaw, and Brooks, 1964 2018/11/16 USTC CS AN Hong

计算机体系结构定义:经典,狭义定义 ISA 是程序员看到的计算机的属性(概念结构和功能结构) 数据的表示:硬件能直接辨认和处理的数据结构类型 寻址规则:最小寻址单位,寻址方式及其表示 寄存器:各种寄存器的定义,数量和使用方式 指令集:机器指令的操作类型和格式,指令间的排序和控制机构等 中断和例外系统:中断的类型和中断响应硬件的功能等 机器工作状态的定义和切换:如管态和目态等 存储系统:主存容量,程序员可用的最大存储容量等 信息保护:包括信息保护方式和硬件对信息保护的支持 I/O结构:包括I/O联结方式,处理机/存储器与I/O设备间数据传送的方式和格式以及I/O操作的状态等 2018/11/16 USTC CS AN Hong

现代计算机体系结构的广义定义 逻辑实现 (组成设计) 物理实现 (硬件设计) 设计任务 设计问题 特征(用途) 成本(价格) 性能 ISA设计 逻辑实现: CPU设计,存储系统设计,总线设计 物理实现 集成电路设计,封装,电源设计,冷却 I/O system Instr. Set Proc. Compiler Operating System Application Logic Design Circuit Design Instruction Set Architecture Firmware Datapath & Control Layout 逻辑实现 (组成设计) 物理实现 (硬件设计) 2018/11/16 USTC CS AN Hong

计算机设计者的任务 2018/11/16 USTC CS AN Hong

理解现代计算机系统设计的影响因素 2018/11/16 2018/11/16 USTC CS AN Hong 10

计算机系统性能评价方法 怎样评测一个计算机系统的性能,与测试者所处的角度有关 用户: 对单个应用系统的响应时间(快), 程序的运行时间(少) 管理员: 单位时间里系统的吞吐量(大) 2018/11/16 USTC CS AN Hong

Understanding Program Performance Hardware or software component How this component affects performance Where is this topic covered? Algorithm Determine both the number of source-level statements and the number of I/O operations executed Book1 Programming language, compiler, and architecture Determine the number of machine instructions for each source-level statement Book2:chapters 2 and 3 Processor and memory system Determine how fast instructions can be executed Book2:chapters 5,6,7 I/O system(hardware and operating system) Determine how fast I/O operations may be executed Book2:chapters 8 2018/11/16 USTC CS AN Hong

Text Books Book1: 陈国良,并行算法分析与设计,高等教育出版社 Book2: David A. Patterson and John L. Hennessy,Computer Organization and Design: The Hardware/Software Interface, Third Edition.机械工业出版社. 2018/11/16 USTC CS AN Hong

Computer Performance Inst Count CPI Clock Rate Program X Cycle time CPU time = Seconds = Instructions x Cycles x Seconds Program Program Instruction Cycle Inst Count CPI Clock Rate Program X Compiler X (X) Inst. Set. X X Organization X X Technology X 2018/11/16 USTC CS AN Hong

CPU性能公式 执行一个程序所需的时钟周期数× 时钟周期 CPU时间 = = IC ×CPI × Cycle time 时钟频率 IC ×CPI Frequency CPIi: 执行指令 i 所需的平 均时钟周期数 ICi : 指令 i 在程序中执行 的次数 CPI = = 所有指令的平均时钟周期数 指令的执行频度 2018/11/16 USTC CS AN Hong

Instruction Set Design software hardware Which is easier to change/design??? 2018/11/16 USTC CS AN Hong

指令集的演化 2018/11/16 USTC CS AN Hong

Instruction Set Architecture: 每个周期做什么事? Fetch Decode Operand Execute Result Store Next Stage1:从存储系统中获得指令 Stage2a:确定做何动作 Stage2b:获得操作数 Stage3:产生运算结果或状态 Stage4:向存储系统中存放运算结果 Stage5:确定下一条要执行的指令 2018/11/16 USTC CS AN Hong

Instruction Set Architecture: 必须定义什么? Instruction Format or Encoding(编码方式) how is it decoded? Location of operands and result(寻址方式) where other than memory? how many explicit operands? how are memory operands located? which can or cannot be in memory? Data type and Size(数据类型) Operations(操作类型) what are supported Successor instruction(控制流的显式表示) jumps, conditions, branches 指令处理必须经过 fetch-decode-execute ! Instruction Fetch Decode Operand Execute Result Store Next 2018/11/16 USTC CS AN Hong

指令集的操作:CISC指令集结构 CISC指令集结构 优点: 指令功能强, 编译生成的程序指令条数少, 软件功能向硬件功能迁移 缺点: 指令功能不均衡; 20%的指令使用频率占运行时间的80%; 实现复杂 2018/11/16 USTC CS AN Hong

指令集的操作: Top 10 80x86 Instructions Rank instruction Integer Average Percent total executed 1 load 22% 2 conditional branch 20% 3 compare 16% 4 store 12% 5 add 8% 6 and 6% 7 sub 5% 8 move register-register 4% 9 call 1% 10 return Total 96% Simple instructions dominate instruction frequency 2018/11/16 USTC CS AN Hong

指令集的操作:RISC指令集结构 RISC指令集结构 RISC指令集结构设计原则 优点: 只包含那些使用频率很高的指令和一些必要的指令, 降低指令集结构的复杂性,简化实现; 程序的运行时间缩短, 从而提高了性能. RISC指令集结构设计原则 选取使用频率最高的指令 每条指令的功能尽可能简单, 能在1个机器周期内完成 所有指令长度相同,简化指令译码,支持高效的流水线 简单的Load/Store型指令集:只有Load和Store指令可以访问存储器, 其它指令的操作均在寄存器间进行 以最简单有效的方式支持高级语言,支持高效的编译器 2018/11/16 USTC CS AN Hong

MIPS I Instruction Set Architecture Registers GPRs: 32bits; R0(=0), R1, …, R31 FPRs: 32bits; F0, …, F31(Single Precision) 64bits; F0(F0,F1), …, F30(F30,F31)(Double Precision) Multiply and Divide Registers, PC Other Registers r31 r0 r1 r30 : : 31 Floating-Point Registers f0 f1 f30 f31 HI Multiply and Divide Registers LO PC Program Counter General Purpose Registers 2018/11/16 USTC CS AN Hong

Example: MIPS (­ DLX) Register-Register Op Rs1 Rs2 Rd Opx 31 26 25 21 20 16 15 11 10 6 5 Op Rs1 Rs2 Rd Opx Register-Immediate 31 26 25 21 20 16 15 immediate Op Rs1 Rd Branch 31 26 25 21 20 16 15 immediate Op Rs1 Rs2/Opx Jump / Call 31 26 25 target Op 2018/11/16 USTC CS AN Hong

MIPS I Instruction Set Architecture:Addressing Modes All instructions 32 bits wide Register (direct) 寄存器寻址 op rs rt rd register Immediate 立即值寻址 immed op rs rt immed op rs rt register + Memory Base+index 偏移寻址 immed op rs rt PC + Memory PC-relative PC相对寻址 2018/11/16 USTC CS AN Hong

MIPS Addressing Modes/Instruction Formats:例子 op rs rt rd shamt funct 6 11 16 21 26 31 6 bits 5 bits ADD and SUB addU rd, rs, rt subU rd, rs, rt OR Immediate: ori rt, rs, imm16 LOAD and STORE Word lw rt, offset(rs) sw rt, offset(rs) BRANCH: beq rs, rt, offset op rs rt imm16 16 21 26 31 6 bits 16 bits 5 bits op rs rt offset 16 21 26 31 6 bits 16 bits 5 bits In today’s lecture, I will show you how to implement the following subset of MIPS instructions: add, subtract, or immediate, load, store, branch, and the jump instruction. The Add and Subtract instructions use the R format. The Op together with the Func fields together specified all the different kinds of add and subtract instructions. Rs and Rt specifies the source registers. And the Rd field specifies the destination register. The Or immediate instruction uses the I format. It only uses one source register, Rs. The other operand comes from the immediate field. The Rt field is used to specified the destination register. (Note that dest is the Rt field!) Both the load and store instructions use the I format and both add the Rs and the immediate filed together to from the memory address. The difference is that the load instruction will load the data from memory into Rt while the store instruction will store the data in Rt into the memory. The branch on equal instruction also uses the I format. Here Rs and Rt are used to specified the registers we need to compare. If these two registers are equal, we will branch to a location offset by the immediate field. Finally, the jump instruction uses the J format and always causes the program to jump to a memory location specified in the address field. I know I went over this rather quickly and you may have missed something. But don’t worry, this is just an overview. You will keep seeing these (point to the format) all day today. +3 = 13 min. (X:53) op rs rt offset 16 21 26 31 6 bits 16 bits 5 bits 2018/11/16 USTC CS AN Hong

Datapath vs Control Datapath Controller Control Points signals Datapath: Storage, FU, interconnect sufficient to perform the desired functions Inputs are Control Points Outputs are signals Controller: State machine to orchestrate operation on the data path Based on desired function and signals 2018/11/16 USTC CS AN Hong

Datapath and Control: Split state diag into 5 pieces IR <- Mem[PC]; NPC <–PC+4; PC <– PC+4; A <- R[rs]; B<– R[rt]; V<- imm or offset<- imm S <– A + B; R-R型运算 R[rd] <– S; S <– A + offset; R-imm型Load M <– Mem[S] R[rt] <– M; S <– A or V; R-imm型运算 R[rt] <– S; R-imm型Store Mem[S] <- B S <– NPC+offset; Cond <– A op 0 取指 (IFetch) 读寄存器/译码 (Reg/ID) 执行 (Exec) 访存 (Mem) 写回 (WB) If Cond PC <– S else PC <–NPC; 2018/11/16 USTC CS AN Hong

5 Steps of DLX Datapath 4 Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc Memory Access Write Back Next PC MUX 4 Adder Next SEQ PC Zero? RS1 Reg File Address Memory Inst. RS2 MUX ALU Inst Memory Data L M D RD MUX MUX Sign Extend IR <= mem[PC]; PC <= PC + 4 Imm WB Data Reg[IRrd] <= Reg[IRrs] opIRop Reg[IRrt] 2018/11/16 USTC CS AN Hong

5 Steps of DLX Datapath 4 Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc Memory Access Write Back Next PC IF/ID ID/EX MEM/WB EX/MEM MUX Next SEQ PC Next SEQ PC 4 Adder Zero? RS1 Reg File Address Memory Inst. MUX RS2 ALU Memory Data MUX MUX IR <= mem[PC]; PC <= PC + 4 Sign Extend Imm WB Data A <= Reg[IRrs]; B <= Reg[IRrt] RD RD RD rslt <= A opIRop B WB <= rslt 2018/11/16 USTC CS AN Hong Reg[IRrd] <= WB

5 Steps of DLX Datapath 4 Data stationary control Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc Memory Access Write Back Next PC IF/ID ID/EX MEM/WB EX/MEM MUX Next SEQ PC Next SEQ PC 4 Adder Zero? RS1 Reg File Address Memory Inst. MUX RS2 ALU Memory Data MUX MUX Sign Extend Imm WB Data RD RD RD Data stationary control local decode for each instruction phase / pipeline stage 2018/11/16 USTC CS AN Hong

Visualizing Pipelining Time (clock cycles) Reg ALU DMem Ifetch Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7 Cycle 5 I n s t r. O r d e 2018/11/16 USTC CS AN Hong

Pipelining is not quite that easy! Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away) Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock) Control hazards: Caused by delay between the fetching of instructions and decisions 2018/11/16 USTC CS AN Hong

Can pipelining get us into trouble? Yes: Pipeline Hazards structural hazards: attempt to use the same resource two different ways at the same time E.g., combined washer/dryer would be a structural hazard or folder busy doing something else (watching TV) data hazards: attempt to use item before it is ready E.g., one sock of pair in dryer and one in washer; can’t fold until get sock from washer through dryer instruction depends on result of prior instruction still in the pipeline control hazards: attempt to make a decision before condition is evaulated E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in branch instructions Can always resolve hazards by waiting ? pipeline control must detect the hazard take action (or delay action) to resolve hazards 2018/11/16 USTC CS AN Hong

结构相关:由访存引起的结构相关 Load Instr 1 Instr 2 Instr 3 Instr 4 Mem I n s t r. O r d e Time (clock cycles) Load Instr 1 Instr 2 Instr 3 Instr 4 ALU Reg 取指和取数都要访问同一个存储器 Detection is easy in this case! (right half highlight means read, left half write) 2018/11/16 USTC CS AN Hong

结构相关的解决方案:阻塞 Load Instr 1 Instr 2 Instr 3 Instr 4 Time (clock cycles) Mem I n s t r. O r d e Time (clock cycles) Load Instr 1 Instr 2 Instr 3 Instr 4 ALU Reg Stall 取指延迟一拍进行 2018/11/16 USTC CS AN Hong

控制相关: What’s the Problem? 例: BEQ rs, rt, offset if R[rs] == R[rt] then PC<- PC+offset bne r2, #0, r3 add r4,r5,r6 sub r7,r8,r9 T NT Need address here Instruction Fetch Decode Execute Memory Access Writeback Branch Delay Compute address here 分支处理问题可划分为两个子问题 决定分支的方向(分支条件相关) 对需要跳转的分支,使执行延迟最小化 -> 尽快获得转移的目标地址(分支地址相关) 2018/11/16 USTC CS AN Hong

Control Hazard Solution #1: Stall e Time (clock cycles) Add Beq Load ALU Mem Reg Lost potential Stall: wait until decision is clear Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow Move decision to end of decode save 1 cycle per branch 2018/11/16 USTC CS AN Hong

Pipelined DLX Datapath Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc Memory Access Write Back Next PC Next SEQ PC ID/EX EX/MEM MEM/WB MUX 4 Adder IF/ID Adder Zero? RS1 Address Reg File Memory RS2 ALU Memory Data MUX MUX Sign Extend Imm WB Data RD RD RD Interplay of instruction set design and cycle time. 2018/11/16 USTC CS AN Hong

Control Hazard Solution #2: Predict Time (clock cycles) Add Beq Load ALU Mem Reg Predict: guess one direction then back up if wrong Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right ­ 50% of time) More dynamic scheme: history of 1 branch (­ 90%) 2018/11/16 USTC CS AN Hong

Control Hazard Solution #3: Delayed Branch Time (clock cycles) Add Beq Misc ALU Mem Reg Load Delayed Branch: Redefine branch behavior (takes place after next instruction) Impact: 0 clock cycles per branch instruction if can find instruction to put in “slot” (­ 50% of time) As launch more instruction per clock cycle, less useful 2018/11/16 USTC CS AN Hong

Data Hazard on R1 Read After Write (RAW) InstrJ tries to read operand before InstrI writes it Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication. I: add r1,r2,r3 J: sub r4,r1,r3 2018/11/16 USTC CS AN Hong

Data Hazard on r1: Read after write hazard (RAW) add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 2018/11/16 USTC CS AN Hong

Data Hazard on r1: Read after write hazard (RAW) Dependencies backwards in time are hazards Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 Im Reg ALU Reg Dm I n s t r. O r d e ALU sub r4,r1,r3 Im Reg Dm Reg ALU and r6,r1,r7 Im Reg Dm Reg or r8,r1,r9 Im Reg ALU Dm Reg ALU Im Reg Dm xor r10,r1,r11 2018/11/16 USTC CS AN Hong

Data Hazard Solution: Forwarding “Forward” result from one stage to another Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 Im Reg ALU Reg Dm I n s t r. O r d e ALU sub r4,r1,r3 Im Reg Dm Reg ALU and r6,r1,r7 Im Reg Dm Reg or r8,r1,r9 Im Reg ALU Dm Reg ALU Im Reg Dm xor r10,r1,r11 2018/11/16 USTC CS AN Hong

HW Change for Forwarding ID/EX EX/MEM MEM/WR NextPC mux Registers ALU Data Memory mux mux Immediate What circuit detects and resolves this hazard? 2018/11/16 USTC CS AN Hong

Forwarding (or Bypassing): What about Loads? Dependencies backwards in time are hazards Data Hazard Even with Forwarding Can’t solve with forwarding ,Must delay/stall instruction dependent on loads Time (clock cycles) IF ID/RF EX MEM WB lw r1,0(r2) Im Reg ALU Reg Dm ALU sub r4,r1,r3 Im Reg Dm Reg 2018/11/16 USTC CS AN Hong

Forwarding (or Bypassing): What about Loads ? Dependencies backwards in time are hazards Data Hazard Even with Forwarding Can’t solve with forwarding ,Must delay/stall instruction dependent on loads Time (clock cycles) IF ID/RF EX MEM WB lw r1,0(r2) Im Reg ALU Reg Dm ALU Im Reg Dm sub r4,r1,r3 Stall 2018/11/16 USTC CS AN Hong

Software Scheduling to Avoid Load Hazards Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d ,e, and f in memory. Slow code: LW Rb,b LW Rc,c ADD Ra,Rb,Rc SW a,Ra LW Re,e LW Rf,f SUB Rd,Re,Rf SW d,Rd Fast code: LW Rb,b LW Rc,c LW Re,e ADD Ra,Rb,Rc LW Rf,f SW a,Ra SUB Rd,Re,Rf SW d,Rd Compiler optimizes for performance. Hardware checks for safety. 2018/11/16 USTC CS AN Hong

Three Generic Data Hazards Write After Read (WAR) InstrJ writes operand before InstrI reads it Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”. Can’t happen in DLX 5 stage pipeline because: All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5 I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7 2018/11/16 USTC CS AN Hong

Three Generic Data Hazards Write After Write (WAW) InstrJ writes operand before InstrI writes it. Called an “output dependence” by compiler writers This also results from the reuse of name “r1”. Can’t happen in DLX 5 stage pipeline because: All instructions take 5 stages, and Writes are always in stage 5 Will see WAR and WAW in more complicated pipes I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7 2018/11/16 USTC CS AN Hong

总结:影响指令级并行性的因素 Pipeline CPI=Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls 改进理想的CPI:多发射(静态/动态) 克服流水线中的相关性 结构相关:由资源冲突导致的相关 解决办法:增加资源 数据相关:由RAW、WAW、WAR导致的相关 解决办法 (用软件):编译器静态调度,循环展开,寄存器重命名,软流水 (用硬件):forwarding技术,寄存器重命名,动态调度的乱序执行技术(记分板,Tomasulo算法) 控制相关:由分支引起的相关 解决方法:静态/动态预测和推测执行 2018/11/16 USTC CS AN Hong

总结:数据相关(又称数据依赖) 在程序的一个基本块中存在的数据相关有以下几种情形: 真数据依赖:两条指令之间存在数据流,有真正的数据依赖关系 RAW(Read After Write)相关:对于指令i和j,如果 (1) 指令j使用指令i产生的结果,则称指令j与指令i为RAW相关;或者 (2) 指令j与指令i存在RAW相关,而指令k与指令j存在RAW相关,则称指令k与指令i为RAW相关 伪数据依赖(又称名相关):指令使用的寄存器或存储器称为名。两条指令使用相同名,但它们之间不存在数据流,则它们之间是一种伪数据依赖关系,包括两种情形: WAR(Write After Read)相关:对于指令i和j,如果指令i先执行,指令j写的名是指令i读的名,则称指令j与指令i为WAR相关(又称反相关,anti-dependence) WAW( Write After Write)相关: 对于指令i和j,如果指令i与指令j写相同的名,则称指令j与指令i为WAW相关(又称输出相关,output-dependence) 2018/11/16 USTC CS AN Hong

总结:开发指令级并行性的技术 2018/11/16 USTC CS AN Hong