高等计算机系统结构 VLIW/EPIC 基于静态调度的ILP （第五讲）程旭 2011年4月16日.

高等计算机系统结构 VLIW/EPIC 基于静态调度的ILP （第五讲）程旭 2011年4月16日

复习: 先进超标量技术简单的转移预测技术也会有不错的效果基于路径（Path-based）的预测器可取得>95%的正确率
BTB可以在流水线的前期就改变控制流, BHT的每个表项成本小，但需要等到指令译码转移错误预测的恢复需要使用流水线的快照（snapshots）以减少损失统一物理寄存器堆的设计，可以避免从不同地方 (ROB+arch regfile)读取数据超标量处理器可以在一个时钟周期对多条相关指令进行重命名需要采用推测式存储缓冲器（speculative store buffer）来避免等待存储操作的确认

复习: 转移预测和推测式执行 kill PC Fetch Decode & Rename Reorder Buffer Commit
Update predictors Branch Prediction Branch Resolution kill PC Fetch Decode & Rename Reorder Buffer Commit Reg. File Branch Unit ALU MEM Store Buffer D$ Execute

超标量处理器控制逻辑的可扩展性 Issue Width W Issue Group
Previously Issued Instructions Lifetime L 每条被发射的指令都必须检查与W*L条指令间的关系，即硬件的增长 W*(W*L) 对于按序机器，L与流水线延迟相关对于乱序机器, L还包括指令缓冲器消耗的时间（指令窗口或 ROB) 随着W的增加，需要更大的指令窗口来找都维持机器忙碌所需的指令级并行性 => L变得更大 => 乱序控制逻辑的复杂程度增长速度大于 W2 (~W3)

[ SGI/MIPS Technologies Inc., 1995 ]
乱序控制的复杂度:MIPS R10000 Control Logic [ SGI/MIPS Technologies Inc., 1995 ]

串行ISA的瓶颈 Superscalar compiler Superscalar processor
Sequential source code Superscalar compiler Find independent operations Sequential machine code Schedule operations a = foo(b); for (i=0, i< Check instruction dependencies Superscalar processor Schedule execution

VLIW: Very Long Instruction Word
Int Op 2 Mem Op 1 Mem Op 2 FP Op 1 FP Op 2 Int Op 1 Two Integer Units, Single Cycle Latency Two Load/Store Units, Three Cycle Latency Two Floating-Point Units, Four Cycle Latency Multiple operations packed into one instruction Each operation slot is for a fixed function Constant operation latencies are specified Architecture requires guarantee of: Parallelism within an instruction => no x-operation RAW check No data use before data ready => no data interlocks

VLIW 编译器的职责编译器: 调度成最大并行执行确保指令内的并行性避免数据冒险(无互锁) 通常用显式的NOP指令来分隔实际操作

早期VLIW机器 FPS AP120B (1976) Multiflow Trace (1987)
scientific attached array processor first commercial wide instruction machine hand-coded vector math libraries using software pipelining and loop unrolling Multiflow Trace (1987) commercialization of ideas from Fisher’s Yale group including “trace scheduling” available in configurations with 7, 14, or 28 operations/instruction 28 operations packed into a 1024-bit instruction word Cydrome Cydra-5 (1987) 7 operations encoded in 256-bit instruction word rotating register file

循环的执行 How many FP ops/cycle? 1 fadd / 8 cycles = 0.125
for (i=0; i<N; i++) B[i] = A[i] + C; Int1 Int 2 M1 M2 FP+ FPx loop: add r1 ld Compile loop: ld f1, 0(r1) add r1, 8 fadd f2, f0, f1 sd f2, 0(r2) add r2, 8 bne r1, r3, loop fadd Schedule add r2 bne sd How many FP ops/cycle? 1 fadd / 8 cycles = 0.125

Unroll inner loop to perform 4 iterations at once
循环展开（Loop Unrolling） for (i=0; i<N; i++) B[i] = A[i] + C; Unroll inner loop to perform 4 iterations at once for (i=0; i<N; i+=4) { B[i] = A[i] + C; B[i+1] = A[i+1] + C; B[i+2] = A[i+2] + C; B[i+3] = A[i+3] + C; } Need to handle values of N that are not multiples of unrolling factor with final cleanup loop

循环展开后代码的调度 How many FLOPS/cycle? 4 fadds / 11 cycles = 0.36
Unroll 4 ways loop: ld f1, 0(r1) ld f2, 8(r1) ld f3, 16(r1) ld f4, 24(r1) add r1, 32 fadd f5, f0, f1 fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4 sd f5, 0(r2) sd f6, 8(r2) sd f7, 16(r2) sd f8, 24(r2) add r2, 32 bne r1, r3, loop Int1 Int 2 M1 M2 FP+ FPx loop: ld f1 ld f2 ld f3 add r1 ld f4 fadd f5 Schedule fadd f6 fadd f7 fadd f8 sd f5 sd f6 sd f7 add r2 bne sd f8 How many FLOPS/cycle? 4 fadds / 11 cycles = 0.36

软件流水（Software Pipelining）
Int1 Int 2 M1 M2 FP+ FPx Unroll 4 ways first ld f1 ld f2 ld f3 ld f4 fadd f5 fadd f6 fadd f7 fadd f8 sd f5 sd f6 sd f7 sd f8 add r1 add r2 bne loop: ld f1, 0(r1) ld f2, 8(r1) ld f3, 16(r1) ld f4, 24(r1) add r1, 32 fadd f5, f0, f1 fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4 sd f5, 0(r2) sd f6, 8(r2) sd f7, 16(r2) add r2, 32 sd f8, -8(r2) bne r1, r3, loop loop: iterate prolog epilog ld f1 ld f2 ld f3 ld f4 fadd f5 fadd f6 fadd f7 fadd f8 sd f5 sd f6 sd f7 sd f8 add r1 add r2 bne ld f1 ld f2 ld f3 ld f4 fadd f5 fadd f6 fadd f7 fadd f8 sd f5 add r1 How many FLOPS/cycle? 4 fadds / 4 cycles = 1

软件流水与循环展开 Loop Unrolled Software Pipelined
Wind-down overhead performance Startup overhead time Loop Iteration Software Pipelined performance time Loop Iteration Software pipelining pays startup/wind-down costs only once per loop, not once per iteration

如果没有循环，怎么办? Basic block 转移指令限制了控制流敏感的非规则代码中基本块的大小很难从单独的基本块中发现ILP

踪迹调度（Trace Scheduling） [ Fisher,Ellis]
Pick string of basic blocks, a trace, that represents most frequent branch path Use profiling feedback or compiler heuristics to find common branch paths Schedule whole “trace” at once Add fixup code to cope with branches jumping out of trace

“典型” VLIW的问题目标代码的兼容性目标代码的大小需要调度访存延迟可变的操作确认转移发生概率对于静态时刻不可预测的转移进行调度
需要针对每种机器重新编译所有代码，甚至对于同一代中的两种不同机器也需要如此目标代码的大小指令填充浪费指令存取空间（存储器/cache）循环展开/软件流水将复制代码需要调度访存延迟可变的操作 caches 和/或主存块的冲突可能产生静态时刻不可预知的变化确认转移发生概率动态剖视（Profiling）需要增加巨大的一步来进行静态转移预测对于静态时刻不可预测的转移进行调度根据转移路径的不同，最佳调度策略也不同

VLIW指令的编码 Schemes to reduce effect of unused fields
Group 1 Group 2 Group 3 Schemes to reduce effect of unused fields Compressed format in memory, expand on I-cache refill used in Multiflow Trace introduces instruction addressing challenge Mark parallel groups used in TMS320C6x DSPs, Intel IA-64 Provide a single-op VLIW instruction Cydra-5 UniOp instructions

Rotating Register Files
Problems: Scheduled loops require lots of registers, Lots of duplicated code in prolog, epilog ld r1, () add r2, r1, #1 st r2, () ld r1, () add r2, r1, #1 st r2, () ld r1, () add r2, r1, #1 st r2, () ld r1, () add r2, r1, #1 st r2, () Prolog Epilog Loop Solution: Allocate new set of registers for each loop iteration

Rotating Register File
P0 P1 P2 P3 P4 P5 P6 P7 RRB=3 R1 + Rotating Register Base (RRB) register points to base of current register set. Value added on to logical register specifier to give physical register number. Usually, split into rotating and non-rotating registers. dec RRB bloop ld r1, () add r3, r2, #1 st r4, () add r2, r1, #1 Prolog Epilog Loop Loop closing branch decrements RRB

Rotating Register File (Previous Loop Example)
Three cycle load latency encoded as difference of 3 in register specifier number (f4 - f1 = 3) Four cycle fadd latency encoded as difference of 4 in register specifier number (f9 – f5 = 4) bloop sd f9, () fadd f5, f4, ... ld f1, () bloop sd P17, () fadd P13, P12, ld P9, () RRB=8 bloop sd P16, () fadd P12, P11, ld P8, () RRB=7 bloop sd P15, () fadd P11, P10, ld P7, () RRB=6 bloop sd P14, () fadd P10, P9, ld P6, () RRB=5 bloop sd P13, () fadd P9, P8, ld P5, () RRB=4 bloop sd P12, () fadd P8, P7, ld P4, () RRB=3 bloop sd P11, () fadd P7, P6, ld P3, () RRB=2 bloop sd P10, () fadd P6, P5, ld P2, () RRB=1

Cydra-5: 存储延迟寄存器（Memory Latency Register：MLR)
问题: 装入操作可能出现时延变化解决方案：让软件选择所需时延编译对代码进行调度以满足最大装入-使用（load-use）距离软件将MLR设置为机器代码调度的延迟硬件保证装入操作在MLR要求的周期将所需数值返回给处理器流水线硬件将缓存提前返回的装入数值如果装入没有按时返回数值，硬件将暂停处理器

Intel EPIC IA-64 与传统CISC和RISC相比，EPIC是一种新型的体系结构 IA-64是Intel最新采用的一种ISA
Explicitly Parallel Instruction Computing IA-64是Intel最新采用的一种ISA IA-64 = Intel Architecture 64-bit 一种目标代码兼容的VLIW Itanium (最早称为Merced)是采用IA-64的第一种处理器 1995年预计第一种产品在1997问世 (实际上为2001) McKinley是2002年推出的第二类产品

128-bit instruction bundle
IA-64 的指令格式 Instruction 2 Instruction 1 Instruction 0 Template 128-bit instruction bundle Template bits describe grouping of these instructions with others in adjacent bundles Each group contains instructions that can execute in parallel bundle j-1 bundle j bundle j+1 bundle j+2 group i-1 group i group i+1 group i+2

IA-64中的寄存器 128个通用64位定点寄存器 128个通用64/80位浮点寄存器 64个1位预测寄存器
GPR可以旋转（rotate）以减小软件流水化循环的代码大小

IA-64 Predicated Execution
Problem: Mispredicted branches limit ILP Solution: Eliminate hard to predict branches with predicated execution Almost all IA-64 instructions can be executed conditionally under predicate Instruction becomes NOP if predicate register false Inst 1 Inst 2 br a==b, b2 Inst 3 Inst 4 br b3 Inst 5 Inst 6 Inst 7 Inst 8 b0: b1: b2: b3: if else then Four basic blocks Inst 1 Inst 2 p1,p2 <- cmp(a==b) (p1) Inst || (p2) Inst 5 (p1) Inst || (p2) Inst 6 Inst 7 Inst 8 Predication One basic block Mahlke et al, ISCA95: On average >50% branches removed

Predicate Software Pipeline Stages
Single VLIW Instruction (p1) bloop (p3) st r4 (p2) add r3 (p1) ld r1 Dynamic Execution (p1) ld r1 (p2) add r3 (p3) st r4 (p1) bloop Software pipeline stages turned on by rotating predicate registers  Much denser encoding of loops

IA-64 Speculative Execution
Problem: Branches restrict compiler code motion Solution: Speculative operations that don’t cause exceptions Load.s r1 Inst 1 Inst 2 br a==b, b2 Chk.s r1 Use r1 Inst 3 Speculative load never causes exception, but sets “poison” bit on destination register Check for exception in original home block jumps to fixup code if exception detected Inst 1 Inst 2 br a==b, b2 Load r1 Use r1 Inst 3 Can’t move load above branch because might cause spurious exception Particularly useful for scheduling long latency loads early

IA-64 Data Speculation Problem: Possible memory hazards limit code scheduling Solution: Hardware to check pointer hazards Load.a r1 Inst 1 Inst 2 Store Load.c Use r1 Inst 3 Data speculative load adds address to address check table Store invalidates any matching loads in address check table Check if load invalid (or missing), jump to fixup code if so Inst 1 Inst 2 Store Load r1 Use r1 Inst 3 Can’t move load above store because store might be to same address Requires associative hardware in address check table

集群化VLIW（Clustered VLIW）
将机器划分成具有局部寄存器堆和局部功能部件的集群集群间采用低带宽/高时延的互联由软件负责进行计算划分，以实现最小化通讯开销（communication overhead）在商用嵌入式VLIW处理器中经常使用，例如, TI C6x DSPs, HP Lx处理器 (在一些超标量处理器中也使用该技术，例如, Alpha 21264) Cluster Interconnect Local Regfile Local Regfile Cluster Memory Interconnect Cache/Memory Banks

静态调度的制约不可预知的转移可变存储时延(不可预知的cache失效) 代码大小爆炸（Code size explosion）
编译的复杂性问题: 对于传统RISC/CISC处理器，VLIW能够激发哪些可采用的技术?

高等计算机系统结构 VLIW/EPIC 基于静态调度的ILP （第五讲）程旭 2011年4月16日.

Similar presentations

Presentation on theme: "高等计算机系统结构 VLIW/EPIC 基于静态调度的ILP （第五讲）程旭 2011年4月16日."— Presentation transcript:

Similar presentations

About project

反馈

请登录

Auth with social network:

高等计算机系统结构 VLIW/EPIC 基于静态调度的ILP （第五讲） 程 旭 2011年4月16日.

Similar presentations

Presentation on theme: "高等计算机系统结构 VLIW/EPIC 基于静态调度的ILP （第五讲） 程 旭 2011年4月16日."— Presentation transcript:

Similar presentations

About project

反馈

高等计算机系统结构 VLIW/EPIC 基于静态调度的ILP （第五讲）程旭 2011年4月16日.

Presentation on theme: "高等计算机系统结构 VLIW/EPIC 基于静态调度的ILP （第五讲）程旭 2011年4月16日."— Presentation transcript: