Lecture on High Performance Processor Architecture (CS05162) Control Hazard and Solution An Hong han@ustc.edu.cn Fall 2007 University of Science and Technology of China Department of Computer Science and Technology CS258 S99
Outline What’s the Problem from Von Neumann Programming Model A Taxonomy of Speculation Execution Techniques What’s Control Hazard and Solution? What makes Control Speculation possible? 2018/11/22 CS of USTC AN Hong
What’s the Problem? Von Neumann Programming Model Instructions are fetched, and then executed in the order in which they appear in a program A program is a static sequence of instructions produced by a programmer or by a compiler. A compiler translates static sequences of high-level language into static sequences of lower-level instructions A machine(processor) executes a dynamic sequence of events based on the static sequence of instructions in the program. Programs contain four primary types of instructions: Data movement instructions(called Loads and Stores) Data processing instructions (e.g., add, subtract, multiply, AND, OR, etc.) Branch instructions Environmental instructions 2018/11/22 CS of USTC AN Hong
What’s the Problem? There are two fundamental restrictions in von Neumann programming model, which limit the amount of instruction level parallelism(ILP) that can be extracted from sequential programs A control flow, which dynamically determined the sequences in which these manipulations were done A data flow, which did the physical manipulation of the data values 2018/11/22 CS of USTC AN Hong
A Taxonomy of Speculation Execution Techniques What can we speculate on? Speculative Execution Control Speculation Data Speculation Data Location Branch Direction (binary) Aliased (binary) Branch Target (multi-valued) Address(multi-valued) Data Value (multi-valued) Get class to tell me these What makes speculation possible? 2018/11/22 CS of USTC AN Hong CS258 S99
Branch Instruction Semantics 例: BEQ rs, rt, offset if R[rs] == R[rt] then PC<- PC+offset If <condition> Then GoTo <address> <condition>: specifies some bits that can be set by other instruction; <address>: is the address of a place in the program to which instruction sequencing is diverted if <condition> is true Unconditional branch: <condition> is the constant value true, then the branch is always taken E.g.: Jump Conditional branch: <condition> is variable , may or may not be true 2018/11/22 CS of USTC AN Hong
What’s the Problem? 分支处理问题可划分为两个子问题 例: BEQ rs, rt, offset if R[rs] == R[rt] then PC<- PC+offset bne r2, #0, r3 add r4,r5,r6 sub r7,r8,r9 T NT Need address here Instruction Fetch Decode Execute Memory Access Writeback Branch Delay Compute address here 分支处理问题可划分为两个子问题 决定分支的方向(分支条件相关) 对需要跳转的分支,使执行延迟最小化 -> 尽快获得转移的目标地址(分支地址相关) 2018/11/22 CS of USTC AN Hong
分支指令对性能的影响 分支指令的影响是开发指令级并行性(增加ILP)的重大障碍 通用代码(SPEC CINT/CFP 2000) 一条指令流中,平均每5~7条指令中就有一条是分支指令,也即基本块大小为5~7条指令。 增大发射宽度 在发射宽度为n的处理器中,遇到分支指令的速度也快了n倍 当前的处理器发射4~6 ops/cycle 增加流水线深度 流水线越深,处理分支指令所需要的时钟周期数就越多 2018/11/22 CS of USTC AN Hong
分支的动态距离 Inter-branch distance(cycles) 模拟环境:4发射乱序机器 @SPECint2000 发射宽度增加,分支间的动态距离就会越小,分支预测的延迟对性能的影响就越大,当要求更高的发射宽度时,需要每个周期做不止一个预测 2018/11/22 CS of USTC AN Hong
解决分支条件相关的方法 阻塞(Stall) 用delay slot容忍延迟(Delayed Branch) 分支预测(Predict) 等待直到分支条件确定 用delay slot容忍延迟(Delayed Branch) 延迟槽指令的来源 分支预测(Predict) 分支条件未确定时预测分支是否成功 静态与动态预测 编译器优化 软件循环展开减少分支指令 产生条件的指令和分支指令隔得足够远 动态调度 硬件自动做循环展开 转换为数据相关 条件指令、谓词 2018/11/22 CS of USTC AN Hong
Control Hazard Solution(1): Stall e Time (clock cycles) Add Beq Load ALU Mem Reg Lost potential Stall: wait until decision is clear Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow 2018/11/22 CS of USTC AN Hong
Control Hazard Solution (2): Delayed Branch Time (clock cycles) Add Beq Misc ALU Mem Reg Load Delayed Branch: Redefine branch behavior (takes place after next instruction) Impact: 0 clock cycles per branch instruction if can find instruction to put in “slot” ( 50% of time) As launch more instruction per clock cycle, less useful 2018/11/22 CS of USTC AN Hong
Delayed Branches(Delay slot,延迟槽) One or more instructions after branch get executed regardless of whether branch taken add r2, r3, r4 sub r7, r8, r9 bne r5, #0, r10 (stall) div r2, r1, r7 NOTE: Here, branch delay is 2 cycle. Exposes branch delay to compiler bne r5, #0, r10 add r2, r3, r4 (delay slot) sub r7, r8, r9 (delay slot) div r2, r1, r7 2018/11/22 CS of USTC AN Hong
Delayed Branches(Delay slot,延迟槽) 来自分支指令前:肯定执行 来自分支目标地址:分支成功才执行 来自分支不成功地址:分支不成功才执行 Cancelling branches allow more slots to be filled 单延迟槽的编译效果 能为 60% 左右的转移延迟槽找到有效操作 大约80%的延迟槽指令用于有效计算 因此大约50% (60% x 80%) 的延迟槽操作用于有效计算 延迟槽的限制 超流水情况下一条延迟槽不够 多发射情况下延迟槽反而成为需要特殊照顾的兼容负担 2018/11/22 CS of USTC AN Hong
Delayed Branches(Delay slot,延迟槽) Pros: Low Hardware Cost Cons: Depends on compiler to fill delay slots Ability to fill delay slots drops as # of slots increases Exposes implementation details to compiler Can’t change pipeline without breaking software compatibility Can’t add to existing architecture and retain compatibility 2018/11/22 CS of USTC AN Hong
Control Hazard Solution (3): 软件循环展开减少分支指令 例:向量的每个元素加常数 For (j=1; j<=1000; j++) x[j]=x[j]+s; 编译: 整数寄存器R1:循环计数器,初值为向量中最高端地址元素的地址 浮点寄存器F2:保存常数s 1 Loop:LD F0,0(R1) ;F0=vector element 2 ADD F4,F0,F2 ;add scalar in F2 3 SD 0(R1),F4 ;store result 4 SUBI R1,R1,8 ;decrement pointer 8B (DW) 5 BNEZ R1,Loop ;branch R1!=zero 2018/11/22 CS of USTC AN Hong
每次循环都有数据相关/控制相关引起的停顿 1 Loop: LD F0,0(R1) ;F0=vector element 2 stall 3 ADDD F4,F0,F2 ;add scalar in F2 4 stall 5 stall 6 SD 0(R1),F4 ;store result 7 SUBI R1,R1,8 ;decrement pointer 8B (DW) 8 BNEZ R1,Loop ;branch R1!=zero 9 stall ;delayed branch slot Instruction Instruction Latency in producing result using result clock cycles FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double FP ALU op 1 每次循环需要 9 clocks Rewrite code to minimize stalls? 2018/11/22 CS of USTC AN Hong
静态调度的例子 Put SUBI to ADDD’s first stall slot 1 Loop: LD F0,0(R1) 2 stall 3 ADDD F4,F0,F2 4 SUBI R1,R1,8 5 BNEZ R1,Loop ;delayed branch 6 SD 8(R1),F4 ;altered when move past SUBI Put SUBI to ADDD’s first stall slot Addjust the displacement of SD from 0(R1) to 8(R1) (3) Swap BNEZ and SD by changing address of SD 每次循环需要 6 clocks Unroll loop (循环展开) 4 times code to make faster? 3 clocks doing work, 3 overhead (stall, branch, sub) 2018/11/22 CS of USTC AN Hong CS258 S99
循环展开+寄存器重命名(Register Renaming)的例子 重命名前 重命名后 1 Loop:LD F0,0(R1) 2 ADDD F4,F0,F2 3 SD 0(R1),F4 4 LD F0,-8(R1) 3 SD -8(R1),F4 7 LD F0,-16(R1) 8 ADDD F4,F0,F2 9 SD -16(R1),F4 10 LD F0,-24(R1) 11 ADDD F4,F0,F2 12 SD -24(R1),F4 13 SUBI R1,R1,#32 14 BNEZ R1,LOOP 15 NOP 1 Loop:LD F0,0(R1) 2 ADDD F4,F0,F2 3 SD 0(R1),F4 4 LD F6,-8(R1) 5 ADDD F8,F6,F2 6 SD -8(R1),F8 7 LD F10,-16(R1) 8 ADDD F12,F10,F2 9 SD -16(R1),F12 10 LD F14,-24(R1) 11 ADDD F16,F14,F2 12 SD -24(R1),F16 13 SUBI R1,R1,#32 14 BNEZ R1,LOOP 15 NOP 寄存器重命名的目的是避免引入新的WAW和WAR数据相关 2018/11/22 CS of USTC AN Hong
循环展开+寄存器重命名(Register Renaming)的例子 1 cycle stall 1 Loop: LD F0,0(R1) 2 ADDD F4,F0,F2 3 SD 0(R1),F4 ;drop SUBI & BNEZ 4 LD F6,-8(R1) 5 ADDD F8,F6,F2 6 SD -8(R1),F8 ;drop SUBI & BNEZ 7 LD F10,-16(R1) 8 ADDD F12,F10,F2 9 SD -16(R1),F12 ;drop SUBI & BNEZ 10 LD F14,-24(R1) 11 ADDD F16,F14,F2 12 SD -24(R1),F16 13 SUBI R1,R1,#32 ;alter to 4*8 14 BNEZ R1,LOOP 15 NOP 15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration Assumes R1 is multiple of 4 Rewrite loop to minimize stalls? 2 cycles stall 2018/11/22 CS of USTC AN Hong CS258 S99
静态调度减少停顿,减少分支的例子 14 clock cycles, or 3.5 per iteration 1 Loop: LD F0,0(R1) 2 LD F6,-8(R1) 3 LD F10,-16(R1) 4 LD F14,-24(R1) 5 ADDD F4,F0,F2 6 ADDD F8,F6,F2 7 ADDD F12,F10,F2 8 ADDD F16,F14,F2 9 SD 0(R1),F4 10 SD -8(R1),F8 11 SD -16(R1),F12 12 SUBI R1,R1,#32 13 BNEZ R1,LOOP 14 SD 8(R1),F16 ;8-32 = -24 14 clock cycles, or 3.5 per iteration When safe to move instructions? What assumptions made when moved code? OK to move store past SUBI even though changes register OK to move loads before stores: get right data? When is it safe for compiler to do such changes? 2018/11/22 CS of USTC AN Hong CS258 S99
Control Hazard Solution (4):动态调度进行自动循环展开 在Tomasulo算法中,利用动态调度以及定点与浮点运算的重叠对如下程序段可以掩盖分支指令的开销,达到自动循环展开的效果 Loop: LD F0 0 R1 MULTD F4 F0 F2 SD F4 0 R1 SUBI R1 R1 #8 BNEZ R1 Loop 如果分支条件依赖于MULTD操作的结果又如何? 需要进行分支预测 2018/11/22 CS of USTC AN Hong
Control Hazard Solution(5): Prediction Time (clock cycles) Add Beq Load ALU Mem Reg Predict: guess one direction then back up if wrong Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right 50% of time) More dynamic scheme: history of 1 branch ( 90%) 2018/11/22 CS of USTC AN Hong
Control Hazard Solution(5): Prediction 分支预测的基本原理 在取指或译码阶段预测分支是否成功以及分支目标进行后续指令的取指,以减少指令流水线由于控制相关而堵塞 在分支条件或分支目标确定后再对预测结果进行修正 静态/动态转移预测 静态预测:总是预测分支成功或总是预测分支不成功 预测分支成功 预测分支不成功 动态预测:根据分支指令执行历史进行预测 复杂预测技术 混合预测:利用编译器的提示,结合动态和静态预测 每周期1个/多个分支预测(宽带分支预测) 每周期1个预测,基本可满足4-6 发射需要的取指带宽 每周期多个预测,大都与Trace Cache相结合,以满足6发射以上需要的取指带宽 2018/11/22 CS of USTC AN Hong
Control Hazard Solution(5): Prediction 单通路执行/多通路执行分支预测 Single path execution full eager execution disjoint eager execution .7 .3 .49 .21 .09 (b) 完全急进执行 .7 .3 .21 .49 .34 .24 .17 .12 .15 .10 .07 .05 (a) 单通路推测执行 .7 .3 .49 .21 .34 .15 .24 .10 (c) dis-joint 急进执行 2018/11/22 CS of USTC AN Hong
Static Prediction Predict whether branches will be taken before execution (compile-time) Predict Branch Not Taken Execute successor instructions in sequence “Squash” instructions in pipeline if branch actually taken Advantage of late pipeline state update 47% DLX branches not taken on average PC+4 already calculated, so use it to get next instruction Predict Branch Taken 53% DLX branches taken on average But haven’t calculated branch target address in DLX DLX still incurs 1 cycle branch penalty Other machines: branch target known before outcome 2018/11/22 CS of USTC AN Hong
Static Prediction Pros: good performance for loops Cons: bad performance for data-dependent branches if (a == 0) b = 3; else b = 4; 2018/11/22 CS of USTC AN Hong
分支处理机制的性能 分支处理机制的性能取决于 预测精度(BPA)-> 设计好的预测器 预测精度越高,能抽取的并行性就越多 预测正确所付的代价 - > 设计分支目标缓冲(BTB)/分支目标地址Cache(BATC)/Trace Cache 为转到分支目标地址处执行所需的延迟 MIPS R10000 无BTAC:延迟为1个周期, MIPS R12000 有BTAC:32项 无延迟 预测错所付的代价 2018/11/22 CS of USTC AN Hong
分支预测错所付的代价 分支预测错所付的代价 -> 重新刷新流水线 影响因素: 静态方面 流水线的总体组织 流水线长度 预测错的指令是否从缓存移走,什么时候移走 动态方面 指令窗口的大小 重定序缓冲中所包含的尚未结束的推测指令条数 Pentium II/III和Alpha 21264: 重新刷新流水线需要11周期以上 Alpha21464(该项目在2001年6月,在设计的后期阶段被中止)处理器中至少为14个周期 2018/11/22 CS of USTC AN Hong
高性能分支处理机制的任务 高性能分支处理机制的任务 尽早确定分支的结果 预测成功,要使执行延迟最小化:如用BTB/BTAC 预测错,要有有效的流水线刷新机制 提供优良的分支预测器和推测执行机制:是研究得最多的 提供2~多级预测/推测机制 2018/11/22 CS of USTC AN Hong
高性能分支处理机制的任务 用于指导预测 Branch outcome Predicted direction and other info Predictor Branch outcome and other info Program Counter (PC) Predicted direction (taken/not taken) Predicted target address (where the branch is going) 用于指导预测 Predicting where the next instruction is in the instruction stream. 2018/11/22 CS of USTC AN Hong
分支目标缓存(BTB,Branch Target Buffer) Store branch address and predicted target (last target) only store targets for taken branches can combine with any prediction scheme 2018/11/22 CS of USTC AN Hong
Predict taken or untaken Example: BHT with BTB Address of branch index to get prediction AND branch address (if taken) Must check for branch match now, since can’t use wrong branch address Grab predicted PC from table since may take several cycles to compute Update predicted PC when branch is actually resolved Return instruction addresses predicted with stack Branch PC (Branch Address) Predicted PC (Taken Target Address) =? PC of instruction FETCH Predict taken or untaken BHT 2018/11/22 CS of USTC AN Hong
Example: PAg with BTB 2018/11/22 CS of USTC AN Hong
控制推测执行机制的分类 Not predict no branch prediction(block pipeline) Intel 8086 loop buffer(a very small I-Cache) Intel 80386 delayed branches(delay slots) eager execution(EE) IBM 360/91, Sun SuperSparc the number of paths grow exponentially unlimited resources not predictable branch, the branch direction changes in an irregular fashion disjoint eager execution(DEE) Alone with minimal control dependencies(MCD) 2018/11/22 CS of USTC AN Hong
推测执行机制的分类 Predict Static(at compile time, via software) Always not taken Intel 80486 Always taken Sun SuperSparc Backward taken;Forward not taken(BTFN) HP PA-7x00 Semistatic(Profiling) Early PowerPC Dynamic(at runtime, via hardware ) 1bit DEC Alpha 21064, AMD-K5 2bit NextGen 586, PowerPC 604, Cyrix 6x86, CyrixM2, MIPS R10000 Two-level adaptive P6, AMD-K6 Selector(local/global) DEC Alpha 21264 Hybrid (combining static and dynamic scheme) ? Multiscalar Pentium IV ? 2018/11/22 CS of USTC AN Hong
推测执行机制的分类 Others Predication alone Denelcor HEP Predication with software Cydrome Cydra5, P6 VLIW Itanium 2018/11/22 CS of USTC AN Hong
术语 预测率(Prediction Rate)或预测精度(Branch Prediction Accurary, BPA) dynamic percentage of branches which are predicted correctly 误预测率(Misprediction Rate) dynamic percentage of branches that are not predicted correctly 误预测代价(Misprediction Penalty) Delay when a branch is not predicted correctly (squash ops, refill pipeline) Taken Delay Delay when a taken branch is correctly predicted Not-taken (fall-through) Delay Delay when a not-taken branch is correctly predicted. 2018/11/22 CS of USTC AN Hong
实现动态分支预测技术要考虑的几个关键问题 如何保证准确的预测:根据记录的分支历史进行预测 如何记录分支历史,记录哪些分支历史 记录多少分支历史 何时更新:更新太早,分支指令也可能被取消;更新太晚,导致分支历史不准确 如何在取消推测执行的操作时现场精确性:使用例外处理的现场恢复机制 增加Commit流水级,在Commit时修改寄存器和内存 I/O指令的推测执行难以取消 如何识别流水线中的指令哪些需要取消,哪些不要取消:Commit时全部取消,执行后部分取消 例外取消一般在Commit时,取消所有后续指令 分支取消一般在执行后,只取消部分指令 延迟槽指令的处理 2018/11/22 CS of USTC AN Hong
分支指令的类型 条件分支和无条件分支 直接分支和间接分支 相对分支和绝对分支 条件分支(branch):常规情况下,需等待条件的确定和目标地址被计算出来后才能取指 整个程序中条件分支所占比例高 整型程序中的条件分支所占的比例,比浮点程序中条件分支所占的比例更高 无条件分支(Jump,call/return):常规情况下,只需等待目标地址被计算出来后才取指 直接分支和间接分支 直接分支(branch,Jump) : 目标地址在指令中给出 间接分支(call/return):目标地址在指令中不给出(一般需要专门的寄存器存放) 相对分支和绝对分支 相对分支:目标地址为当前PC值加上偏移量 绝对分支:目标地址由指令或寄存器内容直接给出 2018/11/22 CS of USTC AN Hong
间接分支(call/return) Predicting return jumps from subroutines is hard 2018/11/22 CS of USTC AN Hong
返回地址栈(Return Address Stack) Hardware stack for subroutine return addresses Push return address (in HW) when make call Pop return address to exit Most programs don’ t have deep call trees,can use small stack(e.g. 32 entries) 2018/11/22 CS of USTC AN Hong
分支指令的频度统计(整型程序) 2018/11/22 CS of USTC AN Hong
分支指令的频度统计(浮点程序) 2018/11/22 CS of USTC AN Hong
分支指令间距离统计(1) 发射宽度增加,分支间的动态距离就会越小,分支预测的延迟对性能的影响就越大,当要求更高的发射宽度时,需要每个周期做不止一个预测 2018/11/22 CS of USTC AN Hong
Inter-branch distance(cycles) 分支指令间距离统计(2) Inter-branch distance(cycles) 模拟环境:4发射乱序机器 @SPECint2000 2018/11/22 CS of USTC AN Hong
分支的可预测性分析(1): 指导设计或选择更好的分支预测算法 单个分支是有“脾气”的->基于模式(单个分支自身历史)的动态预测方法 循环型分支 for- 型循环:TT…TN(成功n次后跟一次不成功 ) while- 型循环:NN…NT(不成功n次后跟一次成功) 周期重复模式型分支 定长重复模式类分支:{pat}q , |pat|=k (每隔k个分支模式就重复一次) 块模式类分支:{Tn Nm}q 成功n次,不成功m次,如此循环 分支之间是有“关系”的->基于相关(多个分支全局历史)的动态预测方法 方向相关 路径相关 2018/11/22 CS of USTC AN Hong
当前分支:指当前试图预测的分支,如例中的X分支 相关分支:指其分支结果与当前分支相关的分支 分支的可预测性分析(2) 方向相关(direction correlation) 两个分支的条件(完全或部分)基于相同或相关的信息 第二个分支的结果基于第一个分支的结果产生 Y:if (cond1) X:if (cond1 AND cond2) Z:if (cond1) Y:if (cond2) Y:if (cond1) a = 2 X:if (a = = 0) Example: 当前分支:指当前试图预测的分支,如例中的X分支 相关分支:指其分支结果与当前分支相关的分支 2018/11/22 CS of USTC AN Hong
分支的可预测性分析(3) 路径相关(In-path correlation) Example: 处在当前分支路径上的分支与当前分支结果之间的相关称为路径相关 Z: if (NOT (cond1)) Y:else if (NOT(cond2)) V:else if (cond3) X:if (cond1 AND cond2) Example: 如果获得了分支V的结果为taken, 就能推知Z,Y的条件为False, 于是,能进一步推知分支X的条件必定满足,即Cond1和Cond2都为True 2018/11/22 CS of USTC AN Hong
分支的可预测性分析(4) 2018/11/22 CS of USTC AN Hong