高等计算机系统结构 VLIW/EPIC 基于静态调度的ILP (第五讲) 程 旭 2011年4月16日
复习: 先进超标量技术 简单的转移预测技术也会有不错的效果 基于路径(Path-based)的预测器可取得>95%的正确率 BTB可以在流水线的前期就改变控制流, BHT的每个表项成本小,但需要等到指令译码 转移错误预测的恢复需要使用流水线的快照(snapshots)以减少损失 统一物理寄存器堆的设计,可以避免从不同地方 (ROB+arch regfile)读取数据 超标量处理器可以在一个时钟周期对多条相关指令进行重命名 需要采用推测式存储缓冲器(speculative store buffer)来避免等待存储操作的确认
复习: 转移预测和推测式执行 kill PC Fetch Decode & Rename Reorder Buffer Commit Update predictors Branch Prediction Branch Resolution kill PC Fetch Decode & Rename Reorder Buffer Commit Reg. File Branch Unit ALU MEM Store Buffer D$ Execute
超标量处理器控制逻辑的可扩展性 Issue Width W Issue Group Previously Issued Instructions Lifetime L 每条被发射的指令都必须检查与W*L条指令间的关系,即硬件的增长 W*(W*L) 对于按序机器,L与流水线延迟相关 对于乱序机器, L还包括指令缓冲器消耗的时间(指令窗口 或 ROB) 随着W的增加,需要更大的指令窗口来找都维持机器忙碌所需的指令级并行性 => L变得更大 => 乱序控制逻辑的复杂程度增长速度大于 W2 (~W3)
[ SGI/MIPS Technologies Inc., 1995 ] 乱序控制的复杂度:MIPS R10000 Control Logic [ SGI/MIPS Technologies Inc., 1995 ]
串行ISA的瓶颈 Superscalar compiler Superscalar processor Sequential source code Superscalar compiler Find independent operations Sequential machine code Schedule operations a = foo(b); for (i=0, i< Check instruction dependencies Superscalar processor Schedule execution
VLIW: Very Long Instruction Word Int Op 2 Mem Op 1 Mem Op 2 FP Op 1 FP Op 2 Int Op 1 Two Integer Units, Single Cycle Latency Two Load/Store Units, Three Cycle Latency Two Floating-Point Units, Four Cycle Latency Multiple operations packed into one instruction Each operation slot is for a fixed function Constant operation latencies are specified Architecture requires guarantee of: Parallelism within an instruction => no x-operation RAW check No data use before data ready => no data interlocks
VLIW 编译器的职责 编译器: 调度成最大并行执行 确保指令内的并行性 避免数据冒险(无互锁) 通常用显式的NOP指令来分隔实际操作
早期VLIW机器 FPS AP120B (1976) Multiflow Trace (1987) scientific attached array processor first commercial wide instruction machine hand-coded vector math libraries using software pipelining and loop unrolling Multiflow Trace (1987) commercialization of ideas from Fisher’s Yale group including “trace scheduling” available in configurations with 7, 14, or 28 operations/instruction 28 operations packed into a 1024-bit instruction word Cydrome Cydra-5 (1987) 7 operations encoded in 256-bit instruction word rotating register file
循环的执行 How many FP ops/cycle? 1 fadd / 8 cycles = 0.125 for (i=0; i<N; i++) B[i] = A[i] + C; Int1 Int 2 M1 M2 FP+ FPx loop: add r1 ld Compile loop: ld f1, 0(r1) add r1, 8 fadd f2, f0, f1 sd f2, 0(r2) add r2, 8 bne r1, r3, loop fadd Schedule add r2 bne sd How many FP ops/cycle? 1 fadd / 8 cycles = 0.125
Unroll inner loop to perform 4 iterations at once 循环展开(Loop Unrolling) for (i=0; i<N; i++) B[i] = A[i] + C; Unroll inner loop to perform 4 iterations at once for (i=0; i<N; i+=4) { B[i] = A[i] + C; B[i+1] = A[i+1] + C; B[i+2] = A[i+2] + C; B[i+3] = A[i+3] + C; } Need to handle values of N that are not multiples of unrolling factor with final cleanup loop
循环展开后代码的调度 How many FLOPS/cycle? 4 fadds / 11 cycles = 0.36 Unroll 4 ways loop: ld f1, 0(r1) ld f2, 8(r1) ld f3, 16(r1) ld f4, 24(r1) add r1, 32 fadd f5, f0, f1 fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4 sd f5, 0(r2) sd f6, 8(r2) sd f7, 16(r2) sd f8, 24(r2) add r2, 32 bne r1, r3, loop Int1 Int 2 M1 M2 FP+ FPx loop: ld f1 ld f2 ld f3 add r1 ld f4 fadd f5 Schedule fadd f6 fadd f7 fadd f8 sd f5 sd f6 sd f7 add r2 bne sd f8 How many FLOPS/cycle? 4 fadds / 11 cycles = 0.36
软件流水(Software Pipelining) Int1 Int 2 M1 M2 FP+ FPx Unroll 4 ways first ld f1 ld f2 ld f3 ld f4 fadd f5 fadd f6 fadd f7 fadd f8 sd f5 sd f6 sd f7 sd f8 add r1 add r2 bne loop: ld f1, 0(r1) ld f2, 8(r1) ld f3, 16(r1) ld f4, 24(r1) add r1, 32 fadd f5, f0, f1 fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4 sd f5, 0(r2) sd f6, 8(r2) sd f7, 16(r2) add r2, 32 sd f8, -8(r2) bne r1, r3, loop loop: iterate prolog epilog ld f1 ld f2 ld f3 ld f4 fadd f5 fadd f6 fadd f7 fadd f8 sd f5 sd f6 sd f7 sd f8 add r1 add r2 bne ld f1 ld f2 ld f3 ld f4 fadd f5 fadd f6 fadd f7 fadd f8 sd f5 add r1 How many FLOPS/cycle? 4 fadds / 4 cycles = 1
软件流水与循环展开 Loop Unrolled Software Pipelined Wind-down overhead performance Startup overhead time Loop Iteration Software Pipelined performance time Loop Iteration Software pipelining pays startup/wind-down costs only once per loop, not once per iteration
如果没有循环,怎么办? Basic block 转移指令限制了控制流敏感的非规则代码中基本块的大小 很难从单独的基本块中发现ILP
踪迹调度(Trace Scheduling) [ Fisher,Ellis] Pick string of basic blocks, a trace, that represents most frequent branch path Use profiling feedback or compiler heuristics to find common branch paths Schedule whole “trace” at once Add fixup code to cope with branches jumping out of trace
“典型” VLIW的问题 目标代码的兼容性 目标代码的大小 需要调度访存延迟可变的操作 确认转移发生概率 对于静态时刻不可预测的转移进行调度 需要针对每种机器重新编译所有代码,甚至对于同一代中的两种不同机器也需要如此 目标代码的大小 指令填充浪费指令存取空间(存储器/cache) 循环展开/软件流水将复制代码 需要调度访存延迟可变的操作 caches 和/或 主存块的冲突可能产生静态时刻不可预知的变化 确认转移发生概率 动态剖视(Profiling)需要增加巨大的一步来进行静态转移预测 对于静态时刻不可预测的转移进行调度 根据转移路径的不同,最佳调度策略也不同
VLIW指令的编码 Schemes to reduce effect of unused fields Group 1 Group 2 Group 3 Schemes to reduce effect of unused fields Compressed format in memory, expand on I-cache refill used in Multiflow Trace introduces instruction addressing challenge Mark parallel groups used in TMS320C6x DSPs, Intel IA-64 Provide a single-op VLIW instruction Cydra-5 UniOp instructions
Rotating Register Files Problems: Scheduled loops require lots of registers, Lots of duplicated code in prolog, epilog ld r1, () add r2, r1, #1 st r2, () ld r1, () add r2, r1, #1 st r2, () ld r1, () add r2, r1, #1 st r2, () ld r1, () add r2, r1, #1 st r2, () Prolog Epilog Loop Solution: Allocate new set of registers for each loop iteration
Rotating Register File P0 P1 P2 P3 P4 P5 P6 P7 RRB=3 R1 + Rotating Register Base (RRB) register points to base of current register set. Value added on to logical register specifier to give physical register number. Usually, split into rotating and non-rotating registers. dec RRB bloop ld r1, () add r3, r2, #1 st r4, () add r2, r1, #1 Prolog Epilog Loop Loop closing branch decrements RRB
Rotating Register File (Previous Loop Example) Three cycle load latency encoded as difference of 3 in register specifier number (f4 - f1 = 3) Four cycle fadd latency encoded as difference of 4 in register specifier number (f9 – f5 = 4) bloop sd f9, () fadd f5, f4, ... ld f1, () bloop sd P17, () fadd P13, P12, ld P9, () RRB=8 bloop sd P16, () fadd P12, P11, ld P8, () RRB=7 bloop sd P15, () fadd P11, P10, ld P7, () RRB=6 bloop sd P14, () fadd P10, P9, ld P6, () RRB=5 bloop sd P13, () fadd P9, P8, ld P5, () RRB=4 bloop sd P12, () fadd P8, P7, ld P4, () RRB=3 bloop sd P11, () fadd P7, P6, ld P3, () RRB=2 bloop sd P10, () fadd P6, P5, ld P2, () RRB=1
Cydra-5: 存储延迟寄存器(Memory Latency Register:MLR) 问题: 装入操作可能出现时延变化 解决方案:让软件选择所需时延 编译对代码进行调度以满足最大装入-使用(load-use)距离 软件将MLR设置为机器代码调度的延迟 硬件保证装入操作在MLR要求的周期将所需数值返回给处理器流水线 硬件将缓存提前返回的装入数值 如果装入没有按时返回数值,硬件将暂停处理器
Intel EPIC IA-64 与传统CISC和RISC相比,EPIC是一种新型的体系结构 IA-64是Intel最新采用的一种ISA Explicitly Parallel Instruction Computing IA-64是Intel最新采用的一种ISA IA-64 = Intel Architecture 64-bit 一种目标代码兼容的VLIW Itanium (最早称为Merced)是采用IA-64的第一种处理器 1995年预计第一种产品在1997问世 (实际上为2001) McKinley是2002年推出的第二类产品
128-bit instruction bundle IA-64 的指令格式 Instruction 2 Instruction 1 Instruction 0 Template 128-bit instruction bundle Template bits describe grouping of these instructions with others in adjacent bundles Each group contains instructions that can execute in parallel bundle j-1 bundle j bundle j+1 bundle j+2 group i-1 group i group i+1 group i+2
IA-64中的寄存器 128个通用64位定点寄存器 128个通用64/80位浮点寄存器 64个1位预测寄存器 GPR可以旋转(rotate)以减小软件流水化循环的代码大小
IA-64 Predicated Execution Problem: Mispredicted branches limit ILP Solution: Eliminate hard to predict branches with predicated execution Almost all IA-64 instructions can be executed conditionally under predicate Instruction becomes NOP if predicate register false Inst 1 Inst 2 br a==b, b2 Inst 3 Inst 4 br b3 Inst 5 Inst 6 Inst 7 Inst 8 b0: b1: b2: b3: if else then Four basic blocks Inst 1 Inst 2 p1,p2 <- cmp(a==b) (p1) Inst 3 || (p2) Inst 5 (p1) Inst 4 || (p2) Inst 6 Inst 7 Inst 8 Predication One basic block Mahlke et al, ISCA95: On average >50% branches removed
Predicate Software Pipeline Stages Single VLIW Instruction (p1) bloop (p3) st r4 (p2) add r3 (p1) ld r1 Dynamic Execution (p1) ld r1 (p2) add r3 (p3) st r4 (p1) bloop Software pipeline stages turned on by rotating predicate registers Much denser encoding of loops
IA-64 Speculative Execution Problem: Branches restrict compiler code motion Solution: Speculative operations that don’t cause exceptions Load.s r1 Inst 1 Inst 2 br a==b, b2 Chk.s r1 Use r1 Inst 3 Speculative load never causes exception, but sets “poison” bit on destination register Check for exception in original home block jumps to fixup code if exception detected Inst 1 Inst 2 br a==b, b2 Load r1 Use r1 Inst 3 Can’t move load above branch because might cause spurious exception Particularly useful for scheduling long latency loads early
IA-64 Data Speculation Problem: Possible memory hazards limit code scheduling Solution: Hardware to check pointer hazards Load.a r1 Inst 1 Inst 2 Store Load.c Use r1 Inst 3 Data speculative load adds address to address check table Store invalidates any matching loads in address check table Check if load invalid (or missing), jump to fixup code if so Inst 1 Inst 2 Store Load r1 Use r1 Inst 3 Can’t move load above store because store might be to same address Requires associative hardware in address check table
集群化VLIW(Clustered VLIW) 将机器划分成具有局部寄存器堆和局部功能部件的集群 集群间采用低带宽/高时延的互联 由软件负责进行计算划分,以实现最小化通讯开销(communication overhead) 在商用嵌入式VLIW处理器中经常使用,例如, TI C6x DSPs, HP Lx处理器 (在一些超标量处理器中也使用该技术,例如, Alpha 21264) Cluster Interconnect Local Regfile Local Regfile Cluster Memory Interconnect Cache/Memory Banks
静态调度的制约 不可预知的转移 可变存储时延(不可预知的cache失效) 代码大小爆炸(Code size explosion) 编译的复杂性 问题: 对于传统RISC/CISC处理器,VLIW能够激发哪些可采用的技术?