Lecture on High Performance Processor Architecture (CS05162)

Lecture on High Performance Processor Architecture (CS05162)
Control Hazard and Solution An Hong Fall 2007 University of Science and Technology of China Department of Computer Science and Technology CS258 S99

Outline What’s the Problem from Von Neumann Programming Model
A Taxonomy of Speculation Execution Techniques What’s Control Hazard and Solution? What makes Control Speculation possible? 2018/11/22 CS of USTC AN Hong

What’s the Problem? Von Neumann Programming Model
Instructions are fetched, and then executed in the order in which they appear in a program A program is a static sequence of instructions produced by a programmer or by a compiler. A compiler translates static sequences of high-level language into static sequences of lower-level instructions A machine(processor) executes a dynamic sequence of events based on the static sequence of instructions in the program. Programs contain four primary types of instructions: Data movement instructions(called Loads and Stores) Data processing instructions (e.g., add, subtract, multiply, AND, OR, etc.) Branch instructions Environmental instructions 2018/11/22 CS of USTC AN Hong

What’s the Problem? There are two fundamental restrictions in von Neumann programming model, which limit the amount of instruction level parallelism(ILP) that can be extracted from sequential programs A control flow, which dynamically determined the sequences in which these manipulations were done A data flow, which did the physical manipulation of the data values 2018/11/22 CS of USTC AN Hong

A Taxonomy of Speculation Execution Techniques
What can we speculate on? Speculative Execution Control Speculation Data Speculation Data Location Branch Direction (binary) Aliased (binary) Branch Target (multi-valued) Address(multi-valued) Data Value (multi-valued) Get class to tell me these What makes speculation possible? 2018/11/22 CS of USTC AN Hong CS258 S99

Branch Instruction Semantics
例： BEQ rs, rt, offset if R[rs] == R[rt] then PC<- PC+offset If <condition> Then GoTo <address> <condition>: specifies some bits that can be set by other instruction; <address>: is the address of a place in the program to which instruction sequencing is diverted if <condition> is true Unconditional branch: <condition> is the constant value true, then the branch is always taken E.g.: Jump Conditional branch: <condition> is variable , may or may not be true 2018/11/22 CS of USTC AN Hong

What’s the Problem? 分支处理问题可划分为两个子问题例：
BEQ rs, rt, offset if R[rs] == R[rt] then PC<- PC+offset bne r2, #0, r3 add r4,r5,r6 sub r7,r8,r9 T NT Need address here Instruction Fetch Decode Execute Memory Access Writeback Branch Delay Compute address here 分支处理问题可划分为两个子问题决定分支的方向（分支条件相关）对需要跳转的分支，使执行延迟最小化 -> 尽快获得转移的目标地址（分支地址相关） 2018/11/22 CS of USTC AN Hong

分支指令对性能的影响分支指令的影响是开发指令级并行性（增加ILP）的重大障碍通用代码(SPEC CINT/CFP 2000)
一条指令流中，平均每5~7条指令中就有一条是分支指令，也即基本块大小为5~7条指令。增大发射宽度在发射宽度为n的处理器中，遇到分支指令的速度也快了n倍当前的处理器发射4~6 ops/cycle 增加流水线深度流水线越深，处理分支指令所需要的时钟周期数就越多 2018/11/22 CS of USTC AN Hong

分支的动态距离 Inter-branch distance(cycles)
发射宽度增加，分支间的动态距离就会越小，分支预测的延迟对性能的影响就越大，当要求更高的发射宽度时，需要每个周期做不止一个预测 2018/11/22 CS of USTC AN Hong

解决分支条件相关的方法阻塞（Stall）用delay slot容忍延迟（Delayed Branch）分支预测（Predict）
等待直到分支条件确定用delay slot容忍延迟（Delayed Branch）延迟槽指令的来源分支预测（Predict）分支条件未确定时预测分支是否成功静态与动态预测编译器优化软件循环展开减少分支指令产生条件的指令和分支指令隔得足够远动态调度硬件自动做循环展开转换为数据相关条件指令、谓词 2018/11/22 CS of USTC AN Hong

Control Hazard Solution（1）: Stall
e Time (clock cycles) Add Beq Load ALU Mem Reg Lost potential Stall: wait until decision is clear Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow 2018/11/22 CS of USTC AN Hong

Control Hazard Solution （2）: Delayed Branch
Time (clock cycles) Add Beq Misc ALU Mem Reg Load Delayed Branch: Redefine branch behavior (takes place after next instruction) Impact: 0 clock cycles per branch instruction if can find instruction to put in “slot” ( 50% of time) As launch more instruction per clock cycle, less useful 2018/11/22 CS of USTC AN Hong

Delayed Branches（Delay slot，延迟槽）
One or more instructions after branch get executed regardless of whether branch taken add r2, r3, r4 sub r7, r8, r9 bne r5, #0, r10 (stall) div r2, r1, r7 NOTE: Here, branch delay is 2 cycle. Exposes branch delay to compiler bne r5, #0, r10 add r2, r3, r4 (delay slot) sub r7, r8, r9 (delay slot) div r2, r1, r7 2018/11/22 CS of USTC AN Hong

来自分支指令前：肯定执行来自分支目标地址：分支成功才执行来自分支不成功地址：分支不成功才执行 Cancelling branches allow more slots to be filled 单延迟槽的编译效果能为 60% 左右的转移延迟槽找到有效操作大约80%的延迟槽指令用于有效计算因此大约50% (60% x 80%) 的延迟槽操作用于有效计算延迟槽的限制超流水情况下一条延迟槽不够多发射情况下延迟槽反而成为需要特殊照顾的兼容负担 2018/11/22 CS of USTC AN Hong

Pros: Low Hardware Cost Cons: Depends on compiler to fill delay slots Ability to fill delay slots drops as # of slots increases Exposes implementation details to compiler Can’t change pipeline without breaking software compatibility Can’t add to existing architecture and retain compatibility 2018/11/22 CS of USTC AN Hong

Control Hazard Solution （3）: 软件循环展开减少分支指令
例：向量的每个元素加常数 For (j=1; j<=1000; j++) x[j]=x[j]+s; 编译：整数寄存器R1：循环计数器，初值为向量中最高端地址元素的地址浮点寄存器F2：保存常数s 1 Loop:LD F0,0(R1) ;F0=vector element 2 ADD F4,F0,F2 ;add scalar in F2 3 SD 0(R1),F4 ;store result 4 SUBI R1,R1,8 ;decrement pointer 8B (DW) 5 BNEZ R1,Loop ;branch R1!=zero 2018/11/22 CS of USTC AN Hong

每次循环都有数据相关/控制相关引起的停顿
1 Loop: LD F0,0(R1) ;F0=vector element 2 stall 3 ADDD F4,F0,F2 ;add scalar in F2 4 stall 5 stall 6 SD 0(R1),F4 ;store result 7 SUBI R1,R1,8 ;decrement pointer 8B (DW) 8 BNEZ R1,Loop ;branch R1!=zero 9 stall ;delayed branch slot Instruction Instruction Latency in producing result using result clock cycles FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double FP ALU op 1 每次循环需要 9 clocks Rewrite code to minimize stalls? 2018/11/22 CS of USTC AN Hong

静态调度的例子 Put SUBI to ADDD’s first stall slot
1 Loop: LD F0,0(R1) 2 stall 3 ADDD F4,F0,F2 4 SUBI R1,R1,8 5 BNEZ R1,Loop ;delayed branch 6 SD 8(R1),F4 ;altered when move past SUBI Put SUBI to ADDD’s first stall slot Addjust the displacement of SD from 0(R1) to 8(R1) (3) Swap BNEZ and SD by changing address of SD 每次循环需要 6 clocks Unroll loop (循环展开) 4 times code to make faster? 3 clocks doing work, 3 overhead (stall, branch, sub) 2018/11/22 CS of USTC AN Hong CS258 S99

循环展开+寄存器重命名（Register Renaming）的例子
重命名前重命名后 1 Loop:LD F0,0(R1) 2 ADDD F4,F0,F2 3 SD 0(R1),F4 4 LD F0,-8(R1) 3 SD -8(R1),F4 7 LD F0,-16(R1) 8 ADDD F4,F0,F2 9 SD -16(R1),F4 10 LD F0,-24(R1) 11 ADDD F4,F0,F2 12 SD -24(R1),F4 13 SUBI R1,R1,#32 14 BNEZ R1,LOOP 15 NOP 1 Loop:LD F0,0(R1) 2 ADDD F4,F0,F2 3 SD 0(R1),F4 4 LD F6,-8(R1) 5 ADDD F8,F6,F2 6 SD -8(R1),F8 7 LD F10,-16(R1) 8 ADDD F12,F10,F2 9 SD -16(R1),F12 10 LD F14,-24(R1) 11 ADDD F16,F14,F2 12 SD -24(R1),F16 13 SUBI R1,R1,#32 14 BNEZ R1,LOOP 15 NOP 寄存器重命名的目的是避免引入新的WAW和WAR数据相关 2018/11/22 CS of USTC AN Hong

循环展开+寄存器重命名（Register Renaming）的例子
1 cycle stall 1 Loop: LD F0,0(R1) 2 ADDD F4,F0,F2 3 SD 0(R1),F4 ;drop SUBI & BNEZ 4 LD F6,-8(R1) 5 ADDD F8,F6,F2 6 SD -8(R1),F8 ;drop SUBI & BNEZ 7 LD F10,-16(R1) 8 ADDD F12,F10,F2 9 SD -16(R1),F12 ;drop SUBI & BNEZ 10 LD F14,-24(R1) 11 ADDD F16,F14,F2 12 SD -24(R1),F16 13 SUBI R1,R1,#32 ;alter to 4*8 14 BNEZ R1,LOOP 15 NOP x (1+2) = 27 clock cycles, or 6.8 per iteration Assumes R1 is multiple of 4 Rewrite loop to minimize stalls? 2 cycles stall 2018/11/22 CS of USTC AN Hong CS258 S99

静态调度减少停顿,减少分支的例子 14 clock cycles, or 3.5 per iteration
1 Loop: LD F0,0(R1) 2 LD F6,-8(R1) 3 LD F10,-16(R1) 4 LD F14,-24(R1) 5 ADDD F4,F0,F2 6 ADDD F8,F6,F2 7 ADDD F12,F10,F2 8 ADDD F16,F14,F2 9 SD 0(R1),F4 10 SD -8(R1),F8 11 SD -16(R1),F12 12 SUBI R1,R1,#32 13 BNEZ R1,LOOP 14 SD 8(R1),F16 ;8-32 = -24 14 clock cycles, or 3.5 per iteration When safe to move instructions? What assumptions made when moved code? OK to move store past SUBI even though changes register OK to move loads before stores: get right data? When is it safe for compiler to do such changes? 2018/11/22 CS of USTC AN Hong CS258 S99

Control Hazard Solution （4）：动态调度进行自动循环展开
在Tomasulo算法中，利用动态调度以及定点与浮点运算的重叠对如下程序段可以掩盖分支指令的开销，达到自动循环展开的效果 Loop: LD F0 0 R1 MULTD F4 F0 F2 SD F4 0 R1 SUBI R1 R1 #8 BNEZ R1 Loop 如果分支条件依赖于MULTD操作的结果又如何？需要进行分支预测 2018/11/22 CS of USTC AN Hong

Control Hazard Solution（5）: Prediction
Time (clock cycles) Add Beq Load ALU Mem Reg Predict: guess one direction then back up if wrong Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right 50% of time) More dynamic scheme: history of 1 branch ( 90%) 2018/11/22 CS of USTC AN Hong

分支预测的基本原理在取指或译码阶段预测分支是否成功以及分支目标进行后续指令的取指，以减少指令流水线由于控制相关而堵塞在分支条件或分支目标确定后再对预测结果进行修正静态/动态转移预测静态预测：总是预测分支成功或总是预测分支不成功预测分支成功预测分支不成功动态预测：根据分支指令执行历史进行预测复杂预测技术混合预测：利用编译器的提示，结合动态和静态预测每周期1个/多个分支预测（宽带分支预测）每周期1个预测，基本可满足4-6 发射需要的取指带宽每周期多个预测，大都与Trace Cache相结合，以满足6发射以上需要的取指带宽 2018/11/22 CS of USTC AN Hong

单通路执行/多通路执行分支预测 Single path execution full eager execution disjoint eager execution  .7 .3   .49 .21  .09   (b) 完全急进执行      .7 .3  .21 .49 .34 .24 .17 .12 .15 .10 .07 .05 (a) 单通路推测执行  .7 .3   .49 .21 .34 .15   .24 .10  (c) dis-joint 急进执行 2018/11/22 CS of USTC AN Hong

Static Prediction Predict whether branches will be taken before execution (compile-time) Predict Branch Not Taken Execute successor instructions in sequence “Squash” instructions in pipeline if branch actually taken Advantage of late pipeline state update 47% DLX branches not taken on average PC+4 already calculated, so use it to get next instruction Predict Branch Taken 53% DLX branches taken on average But haven’t calculated branch target address in DLX DLX still incurs 1 cycle branch penalty Other machines: branch target known before outcome 2018/11/22 CS of USTC AN Hong

Static Prediction Pros: good performance for loops
Cons: bad performance for data-dependent branches if (a == 0) b = 3; else b = 4; 2018/11/22 CS of USTC AN Hong

分支处理机制的性能分支处理机制的性能取决于预测精度（BPA）-> 设计好的预测器预测精度越高，能抽取的并行性就越多
预测正确所付的代价 - > 设计分支目标缓冲(BTB)/分支目标地址Cache(BATC)/Trace Cache 为转到分支目标地址处执行所需的延迟 MIPS R10000 无BTAC：延迟为1个周期， MIPS R12000 有BTAC：32项无延迟预测错所付的代价 2018/11/22 CS of USTC AN Hong

分支预测错所付的代价分支预测错所付的代价 -> 重新刷新流水线影响因素：
静态方面流水线的总体组织流水线长度预测错的指令是否从缓存移走，什么时候移走动态方面指令窗口的大小重定序缓冲中所包含的尚未结束的推测指令条数 Pentium II/III和Alpha 21264: 重新刷新流水线需要11周期以上 Alpha21464（该项目在2001年6月，在设计的后期阶段被中止）处理器中至少为14个周期 2018/11/22 CS of USTC AN Hong

高性能分支处理机制的任务高性能分支处理机制的任务尽早确定分支的结果预测成功，要使执行延迟最小化：如用BTB/BTAC
预测错，要有有效的流水线刷新机制提供优良的分支预测器和推测执行机制：是研究得最多的提供2~多级预测/推测机制 2018/11/22 CS of USTC AN Hong

高性能分支处理机制的任务用于指导预测 Branch outcome Predicted direction and other info
Predictor Branch outcome and other info Program Counter (PC) Predicted direction (taken/not taken) Predicted target address (where the branch is going) 用于指导预测 Predicting where the next instruction is in the instruction stream. 2018/11/22 CS of USTC AN Hong

分支目标缓存(BTB,Branch Target Buffer)
Store branch address and predicted target (last target) only store targets for taken branches can combine with any prediction scheme 2018/11/22 CS of USTC AN Hong

Predict taken or untaken
Example: BHT with BTB Address of branch index to get prediction AND branch address (if taken) Must check for branch match now, since can’t use wrong branch address Grab predicted PC from table since may take several cycles to compute Update predicted PC when branch is actually resolved Return instruction addresses predicted with stack Branch PC （Branch Address） Predicted PC （Taken Target Address） =? PC of instruction FETCH Predict taken or untaken BHT 2018/11/22 CS of USTC AN Hong

Example: PAg with BTB 2018/11/22 CS of USTC AN Hong

控制推测执行机制的分类 Not predict
no branch prediction(block pipeline) Intel 8086 loop buffer(a very small I-Cache) Intel 80386 delayed branches(delay slots) eager execution(EE) IBM 360/91, Sun SuperSparc the number of paths grow exponentially unlimited resources not predictable branch, the branch direction changes in an irregular fashion disjoint eager execution(DEE) Alone with minimal control dependencies(MCD) 2018/11/22 CS of USTC AN Hong

推测执行机制的分类 Predict Static（at compile time, via software）
Always not taken Intel 80486 Always taken Sun SuperSparc Backward taken;Forward not taken(BTFN) HP PA-7x00 Semistatic(Profiling) Early PowerPC Dynamic（at runtime, via hardware ） 1bit DEC Alpha 21064, AMD-K5 2bit NextGen 586, PowerPC 604, Cyrix 6x86, CyrixM2, MIPS R10000 Two-level adaptive P6, AMD-K6 Selector(local/global) DEC Alpha 21264 Hybrid (combining static and dynamic scheme) ? Multiscalar Pentium IV ? 2018/11/22 CS of USTC AN Hong

推测执行机制的分类 Others Predication alone Denelcor HEP
Predication with software Cydrome Cydra5, P6 VLIW Itanium 2018/11/22 CS of USTC AN Hong

术语预测率（Prediction Rate）或预测精度（Branch Prediction Accurary, BPA）
dynamic percentage of branches which are predicted correctly 误预测率（Misprediction Rate） dynamic percentage of branches that are not predicted correctly 误预测代价（Misprediction Penalty） Delay when a branch is not predicted correctly (squash ops, refill pipeline) Taken Delay Delay when a taken branch is correctly predicted Not-taken (fall-through) Delay Delay when a not-taken branch is correctly predicted. 2018/11/22 CS of USTC AN Hong

实现动态分支预测技术要考虑的几个关键问题
如何保证准确的预测：根据记录的分支历史进行预测如何记录分支历史，记录哪些分支历史记录多少分支历史何时更新：更新太早，分支指令也可能被取消；更新太晚，导致分支历史不准确如何在取消推测执行的操作时现场精确性：使用例外处理的现场恢复机制增加Commit流水级，在Commit时修改寄存器和内存 I/O指令的推测执行难以取消如何识别流水线中的指令哪些需要取消，哪些不要取消：Commit时全部取消，执行后部分取消例外取消一般在Commit时，取消所有后续指令分支取消一般在执行后，只取消部分指令延迟槽指令的处理 2018/11/22 CS of USTC AN Hong

分支指令的类型条件分支和无条件分支直接分支和间接分支相对分支和绝对分支
条件分支（branch）：常规情况下，需等待条件的确定和目标地址被计算出来后才能取指整个程序中条件分支所占比例高整型程序中的条件分支所占的比例，比浮点程序中条件分支所占的比例更高无条件分支（Jump，call/return）：常规情况下，只需等待目标地址被计算出来后才取指直接分支和间接分支直接分支(branch,Jump) : 目标地址在指令中给出间接分支(call/return):目标地址在指令中不给出（一般需要专门的寄存器存放）相对分支和绝对分支相对分支：目标地址为当前PC值加上偏移量绝对分支：目标地址由指令或寄存器内容直接给出 2018/11/22 CS of USTC AN Hong

间接分支(call/return) Predicting return jumps from subroutines is hard
2018/11/22 CS of USTC AN Hong

返回地址栈(Return Address Stack)
Hardware stack for subroutine return addresses Push return address (in HW) when make call Pop return address to exit Most programs don’ t have deep call trees，can use small stack(e.g. 32 entries) 2018/11/22 CS of USTC AN Hong

分支指令的频度统计（整型程序） 2018/11/22 CS of USTC AN Hong

分支指令的频度统计(浮点程序) 2018/11/22 CS of USTC AN Hong

分支指令间距离统计（1）发射宽度增加，分支间的动态距离就会越小，分支预测的延迟对性能的影响就越大，当要求更高的发射宽度时，需要每个周期做不止一个预测 2018/11/22 CS of USTC AN Hong

Inter-branch distance(cycles)
分支指令间距离统计（2） Inter-branch distance(cycles) 2018/11/22 CS of USTC AN Hong

分支的可预测性分析（1）：指导设计或选择更好的分支预测算法
单个分支是有“脾气”的->基于模式（单个分支自身历史）的动态预测方法循环型分支 for- 型循环：TT…TN（成功n次后跟一次不成功） while- 型循环：NN…NT（不成功n次后跟一次成功）周期重复模式型分支定长重复模式类分支：{pat}q , |pat|=k （每隔k个分支模式就重复一次）块模式类分支：{Tn Nm}q 成功n次，不成功m次，如此循环分支之间是有“关系”的->基于相关（多个分支全局历史）的动态预测方法方向相关路径相关 2018/11/22 CS of USTC AN Hong

当前分支：指当前试图预测的分支，如例中的X分支相关分支：指其分支结果与当前分支相关的分支
分支的可预测性分析（2）方向相关(direction correlation) 两个分支的条件（完全或部分）基于相同或相关的信息第二个分支的结果基于第一个分支的结果产生 Y：if (cond1)  X：if (cond1 AND cond2) Z：if (cond1) Y：if (cond2) Y：if (cond1) a = 2 X：if (a = = 0) Example: 当前分支：指当前试图预测的分支，如例中的X分支相关分支：指其分支结果与当前分支相关的分支 2018/11/22 CS of USTC AN Hong

分支的可预测性分析（3）路径相关(In-path correlation) Example:
处在当前分支路径上的分支与当前分支结果之间的相关称为路径相关 Z： if (NOT (cond1)) Y：else if (NOT(cond2)) V：else if (cond3)  X：if (cond1 AND cond2) Example: 如果获得了分支V的结果为taken, 就能推知Z，Y的条件为False, 于是，能进一步推知分支X的条件必定满足，即Cond1和Cond2都为True 2018/11/22 CS of USTC AN Hong

分支的可预测性分析（4） 2018/11/22 CS of USTC AN Hong

Lecture on High Performance Processor Architecture (CS05162)

Similar presentations

Presentation on theme: "Lecture on High Performance Processor Architecture (CS05162)"— Presentation transcript:

Similar presentations

About project

反馈

请登录

Auth with social network:

Lecture on High Performance Processor Architecture (CS05162)

Similar presentations

Presentation on theme: "Lecture on High Performance Processor Architecture (CS05162)"— Presentation transcript:

Similar presentations

About project

反馈