高等计算机系统结构 VLIW/EPIC 基于静态调度的ILP （第五讲）程旭 2011年4月16日.

Slides:

Advertisements

Similar presentations

程序的执行程序执行和指令执行概述数据通路基本结构和工作原理流水线方式下指令的执行

Advertisements

第 2 章中央處理單元.

多核结构与程序设计杨全胜东南大学成贤学院计算机系.

Chapter 10 效能測量與分析.

Memory Pool ACM Yanqing Peng.

如何在Elsevier期刊上发表文章 china.elsevier.com

信息科学与工程学院计算机科学系 2006年9月—2007年1月

Performance Evaluation

資料庫設計 Database Design.

Chapter 2: Computer-System Structures计算机系统结构

最新計算機概論第3章計算機組織.

新世代計算機概論第3章　電腦的系統單元.

第4章 VHDL设计初步.

大连理工大学软件学院软件工程系赖晓晨计算机组成与结构大连理工大学软件学院软件工程系赖晓晨

Leftmost Longest Regular Expression Matching in Reconfigurable Logic

隱藏之投影片 2016/8/ /8/29 Streamlining Inter-operation Memory Communication via Data Dependence Prediction 胡連鈞 Hu, Lien Chun 電機系, Department of Electrical.

Lecture on High Performance Processor Architecture (CS05162)

CH.2 Introduction to Microprocessor-Based Control

周学海 , 中国科学技术大学 2018/9/20 现代微处理器体系结构周学海 , 中国科学技术大学 2018/9/20 计算机体系结构.

周学海 , 中国科学技术大学 2018/9/21 计算机体系结构周学海 , 中国科学技术大学.

Chapter 5 電腦元件目標---- 研讀完本章後，你應該可以：閱讀有關電腦的廣告以及了解它的專業用語(行話)。

第 2 章中央處理單元.

微处理器设计1 刘鹏 College of ISEE Zhejiang University

Operating System Concepts 作業系統原理 Chapter 3 行程觀念 (Process Concept)

Quiz 3 假设各种分支占所有指令数的百分比如下表所示：

CPU資料處理醫務管理暨醫療資訊學系陳以德副教授: 濟世CS 轉

周学海中国科学技术大学 2018/11/16 计算机体系结构周学海中国科学技术大学.

PIC16F1827介紹以微控器為基礎之電路設計實務-微處理器實驗室.

Lecture on High Performance Processor Architecture (CS05162)

第4章处理器（CPU） 4.1 引言 4.2 逻辑设计的一般方法 4.3 建立数据通路 4.4 一个简单的实现机制 4.5 多周期实现机制.

第五讲数据的分组、合并与转换.

指令集架構計算機也跟人類一樣，需要提供一套完整的語言讓人們跟它充分溝通，以完成正確的計算工作。

1-1 微電腦系統單元 1-2 微電腦系統架構 1-3 微控制器（單晶片微電腦） 1-4 類比與數位訊號介面

1-1 微電腦系統單元 1-2 微電腦系統架構 1-3 微控制器（單晶片微電腦） 1-4 類比與數位訊號介面

5 Computer Organization (計算機組織).

Lecture on High Performance Processor Architecture (CS05162)

The Processor: Datapath and Control

Chapter 3 行程觀念 (Process Concept)

微程序控制器刘鹏 Dept. ISEE Zhejiang University

計算機結構 – 概論陳鍾誠於金門大學.

Ch 9: Input/Output System 输入/输出系统

微处理器设计2 刘鹏 College of ISEE Zhejiang University

C 語言簡介 - 2.

第五組 : 廖震昌 / 謝坤吉 / 黃麗珍陳曉伶 / 陳思因 / 林慧佳

第14章其它DSP设计库 14.1 总线控制库 14.2 复数信号库 14.3 Gates库 14.4 状态机函数库

KeyStone I DSP[C665x 与 C6678] 视频教程

重點資料結構之選定會影響演算法選擇對的資料結構讓您上天堂程式.

陳慶瀚機器智慧與自動化技術(MIAT)實驗室國立中央大學資工系 2013年5月28日

习题课第1-2次作业孙凡.

組合語言和程式範例.

計算機概論第3章計算機組織與結構概觀.

105-1 Data Structure Exam /12/27.

Architecture and Systems 研究群報告人：單智君陳昌居鍾崇斌中華民國95年11月30日

The Processor: Datapath and Control (Multi-cycle implementation)

中国科学技术大学计算机系陈香兰 2013Fall 第七讲存储器管理中国科学技术大学计算机系陈香兰 2013Fall.

虚拟仪器 virtual instrument

中国科学技术大学计算机系陈香兰 Fall 2013 第三讲线程中国科学技术大学计算机系陈香兰 Fall 2013.

计算机系统结构（2012年春） ----存储层次： Cache基本概念

周学海中国科学技术大学 2019/4/19 计算机体系结构周学海中国科学技术大学.

An Efficient MSB Prediction-based Method for High-capacity Reversible Data Hiding in Encrypted Images 基于有效MSB预测的加密图像大容量可逆数据隐藏方法。本文目的：做到既有较高的藏量（1bpp),

第六章記憶體.

唐常杰四川大学计算机学院计算机科学技术系

第4章指令级并行授课教师：车喜龙

 隐式欧拉法 /* implicit Euler method */

2 Number Systems, Operations, and Codes

清华大学计算机科学与技术系高性能计算研究所郑纬民教授 2005年5月

Race Conditions and Semaphore

MGT 213 System Management Server的昨天，今天和明天

Introduction to Computer Security and Cryptography

如何在Elsevier期刊上发表文章 china.elsevier.com

Presentation transcript:

高等计算机系统结构 VLIW/EPIC 基于静态调度的ILP （第五讲）程旭 2011年4月16日

复习: 先进超标量技术简单的转移预测技术也会有不错的效果基于路径（Path-based）的预测器可取得>95%的正确率 BTB可以在流水线的前期就改变控制流, BHT的每个表项成本小，但需要等到指令译码转移错误预测的恢复需要使用流水线的快照（snapshots）以减少损失统一物理寄存器堆的设计，可以避免从不同地方 (ROB+arch regfile)读取数据超标量处理器可以在一个时钟周期对多条相关指令进行重命名需要采用推测式存储缓冲器（speculative store buffer）来避免等待存储操作的确认

复习: 转移预测和推测式执行 kill PC Fetch Decode & Rename Reorder Buffer Commit Update predictors Branch Prediction Branch Resolution kill PC Fetch Decode & Rename Reorder Buffer Commit Reg. File Branch Unit ALU MEM Store Buffer D$ Execute

超标量处理器控制逻辑的可扩展性 Issue Width W Issue Group Previously Issued Instructions Lifetime L 每条被发射的指令都必须检查与W*L条指令间的关系，即硬件的增长 W*(W*L) 对于按序机器，L与流水线延迟相关对于乱序机器, L还包括指令缓冲器消耗的时间（指令窗口或 ROB) 随着W的增加，需要更大的指令窗口来找都维持机器忙碌所需的指令级并行性 => L变得更大 => 乱序控制逻辑的复杂程度增长速度大于 W2 (~W3)

[ SGI/MIPS Technologies Inc., 1995 ] 乱序控制的复杂度:MIPS R10000 Control Logic [ SGI/MIPS Technologies Inc., 1995 ]

串行ISA的瓶颈 Superscalar compiler Superscalar processor Sequential source code Superscalar compiler Find independent operations Sequential machine code Schedule operations a = foo(b); for (i=0, i< Check instruction dependencies Superscalar processor Schedule execution

VLIW: Very Long Instruction Word Int Op 2 Mem Op 1 Mem Op 2 FP Op 1 FP Op 2 Int Op 1 Two Integer Units, Single Cycle Latency Two Load/Store Units, Three Cycle Latency Two Floating-Point Units, Four Cycle Latency Multiple operations packed into one instruction Each operation slot is for a fixed function Constant operation latencies are specified Architecture requires guarantee of: Parallelism within an instruction => no x-operation RAW check No data use before data ready => no data interlocks

VLIW 编译器的职责编译器: 调度成最大并行执行确保指令内的并行性避免数据冒险(无互锁) 通常用显式的NOP指令来分隔实际操作

早期VLIW机器 FPS AP120B (1976) Multiflow Trace (1987) scientific attached array processor first commercial wide instruction machine hand-coded vector math libraries using software pipelining and loop unrolling Multiflow Trace (1987) commercialization of ideas from Fisher’s Yale group including “trace scheduling” available in configurations with 7, 14, or 28 operations/instruction 28 operations packed into a 1024-bit instruction word Cydrome Cydra-5 (1987) 7 operations encoded in 256-bit instruction word rotating register file

循环的执行 How many FP ops/cycle? 1 fadd / 8 cycles = 0.125 for (i=0; i<N; i++) B[i] = A[i] + C; Int1 Int 2 M1 M2 FP+ FPx loop: add r1 ld Compile loop: ld f1, 0(r1) add r1, 8 fadd f2, f0, f1 sd f2, 0(r2) add r2, 8 bne r1, r3, loop fadd Schedule add r2 bne sd How many FP ops/cycle? 1 fadd / 8 cycles = 0.125

Unroll inner loop to perform 4 iterations at once 循环展开（Loop Unrolling） for (i=0; i<N; i++) B[i] = A[i] + C; Unroll inner loop to perform 4 iterations at once for (i=0; i<N; i+=4) { B[i] = A[i] + C; B[i+1] = A[i+1] + C; B[i+2] = A[i+2] + C; B[i+3] = A[i+3] + C; } Need to handle values of N that are not multiples of unrolling factor with final cleanup loop

循环展开后代码的调度 How many FLOPS/cycle? 4 fadds / 11 cycles = 0.36 Unroll 4 ways loop: ld f1, 0(r1) ld f2, 8(r1) ld f3, 16(r1) ld f4, 24(r1) add r1, 32 fadd f5, f0, f1 fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4 sd f5, 0(r2) sd f6, 8(r2) sd f7, 16(r2) sd f8, 24(r2) add r2, 32 bne r1, r3, loop Int1 Int 2 M1 M2 FP+ FPx loop: ld f1 ld f2 ld f3 add r1 ld f4 fadd f5 Schedule fadd f6 fadd f7 fadd f8 sd f5 sd f6 sd f7 add r2 bne sd f8 How many FLOPS/cycle? 4 fadds / 11 cycles = 0.36

软件流水（Software Pipelining） Int1 Int 2 M1 M2 FP+ FPx Unroll 4 ways first ld f1 ld f2 ld f3 ld f4 fadd f5 fadd f6 fadd f7 fadd f8 sd f5 sd f6 sd f7 sd f8 add r1 add r2 bne loop: ld f1, 0(r1) ld f2, 8(r1) ld f3, 16(r1) ld f4, 24(r1) add r1, 32 fadd f5, f0, f1 fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4 sd f5, 0(r2) sd f6, 8(r2) sd f7, 16(r2) add r2, 32 sd f8, -8(r2) bne r1, r3, loop loop: iterate prolog epilog ld f1 ld f2 ld f3 ld f4 fadd f5 fadd f6 fadd f7 fadd f8 sd f5 sd f6 sd f7 sd f8 add r1 add r2 bne ld f1 ld f2 ld f3 ld f4 fadd f5 fadd f6 fadd f7 fadd f8 sd f5 add r1 How many FLOPS/cycle? 4 fadds / 4 cycles = 1

软件流水与循环展开 Loop Unrolled Software Pipelined Wind-down overhead performance Startup overhead time Loop Iteration Software Pipelined performance time Loop Iteration Software pipelining pays startup/wind-down costs only once per loop, not once per iteration

如果没有循环，怎么办? Basic block 转移指令限制了控制流敏感的非规则代码中基本块的大小很难从单独的基本块中发现ILP

踪迹调度（Trace Scheduling） [ Fisher,Ellis] Pick string of basic blocks, a trace, that represents most frequent branch path Use profiling feedback or compiler heuristics to find common branch paths Schedule whole “trace” at once Add fixup code to cope with branches jumping out of trace

“典型” VLIW的问题目标代码的兼容性目标代码的大小需要调度访存延迟可变的操作确认转移发生概率对于静态时刻不可预测的转移进行调度需要针对每种机器重新编译所有代码，甚至对于同一代中的两种不同机器也需要如此目标代码的大小指令填充浪费指令存取空间（存储器/cache）循环展开/软件流水将复制代码需要调度访存延迟可变的操作 caches 和/或主存块的冲突可能产生静态时刻不可预知的变化确认转移发生概率动态剖视（Profiling）需要增加巨大的一步来进行静态转移预测对于静态时刻不可预测的转移进行调度根据转移路径的不同，最佳调度策略也不同

VLIW指令的编码 Schemes to reduce effect of unused fields Group 1 Group 2 Group 3 Schemes to reduce effect of unused fields Compressed format in memory, expand on I-cache refill used in Multiflow Trace introduces instruction addressing challenge Mark parallel groups used in TMS320C6x DSPs, Intel IA-64 Provide a single-op VLIW instruction Cydra-5 UniOp instructions

Rotating Register Files Problems: Scheduled loops require lots of registers, Lots of duplicated code in prolog, epilog ld r1, () add r2, r1, #1 st r2, () ld r1, () add r2, r1, #1 st r2, () ld r1, () add r2, r1, #1 st r2, () ld r1, () add r2, r1, #1 st r2, () Prolog Epilog Loop Solution: Allocate new set of registers for each loop iteration

Rotating Register File P0 P1 P2 P3 P4 P5 P6 P7 RRB=3 R1 + Rotating Register Base (RRB) register points to base of current register set. Value added on to logical register specifier to give physical register number. Usually, split into rotating and non-rotating registers. dec RRB bloop ld r1, () add r3, r2, #1 st r4, () add r2, r1, #1 Prolog Epilog Loop Loop closing branch decrements RRB

Rotating Register File (Previous Loop Example) Three cycle load latency encoded as difference of 3 in register specifier number (f4 - f1 = 3) Four cycle fadd latency encoded as difference of 4 in register specifier number (f9 – f5 = 4) bloop sd f9, () fadd f5, f4, ... ld f1, () bloop sd P17, () fadd P13, P12, ld P9, () RRB=8 bloop sd P16, () fadd P12, P11, ld P8, () RRB=7 bloop sd P15, () fadd P11, P10, ld P7, () RRB=6 bloop sd P14, () fadd P10, P9, ld P6, () RRB=5 bloop sd P13, () fadd P9, P8, ld P5, () RRB=4 bloop sd P12, () fadd P8, P7, ld P4, () RRB=3 bloop sd P11, () fadd P7, P6, ld P3, () RRB=2 bloop sd P10, () fadd P6, P5, ld P2, () RRB=1

Cydra-5: 存储延迟寄存器（Memory Latency Register：MLR) 问题: 装入操作可能出现时延变化解决方案：让软件选择所需时延编译对代码进行调度以满足最大装入-使用（load-use）距离软件将MLR设置为机器代码调度的延迟硬件保证装入操作在MLR要求的周期将所需数值返回给处理器流水线硬件将缓存提前返回的装入数值如果装入没有按时返回数值，硬件将暂停处理器

Intel EPIC IA-64 与传统CISC和RISC相比，EPIC是一种新型的体系结构 IA-64是Intel最新采用的一种ISA Explicitly Parallel Instruction Computing IA-64是Intel最新采用的一种ISA IA-64 = Intel Architecture 64-bit 一种目标代码兼容的VLIW Itanium (最早称为Merced)是采用IA-64的第一种处理器 1995年预计第一种产品在1997问世 (实际上为2001) McKinley是2002年推出的第二类产品

128-bit instruction bundle IA-64 的指令格式 Instruction 2 Instruction 1 Instruction 0 Template 128-bit instruction bundle Template bits describe grouping of these instructions with others in adjacent bundles Each group contains instructions that can execute in parallel bundle j-1 bundle j bundle j+1 bundle j+2 group i-1 group i group i+1 group i+2

IA-64中的寄存器 128个通用64位定点寄存器 128个通用64/80位浮点寄存器 64个1位预测寄存器 GPR可以旋转（rotate）以减小软件流水化循环的代码大小

IA-64 Predicated Execution Problem: Mispredicted branches limit ILP Solution: Eliminate hard to predict branches with predicated execution Almost all IA-64 instructions can be executed conditionally under predicate Instruction becomes NOP if predicate register false Inst 1 Inst 2 br a==b, b2 Inst 3 Inst 4 br b3 Inst 5 Inst 6 Inst 7 Inst 8 b0: b1: b2: b3: if else then Four basic blocks Inst 1 Inst 2 p1,p2 <- cmp(a==b) (p1) Inst 3 || (p2) Inst 5 (p1) Inst 4 || (p2) Inst 6 Inst 7 Inst 8 Predication One basic block Mahlke et al, ISCA95: On average >50% branches removed

Predicate Software Pipeline Stages Single VLIW Instruction (p1) bloop (p3) st r4 (p2) add r3 (p1) ld r1 Dynamic Execution (p1) ld r1 (p2) add r3 (p3) st r4 (p1) bloop Software pipeline stages turned on by rotating predicate registers  Much denser encoding of loops

IA-64 Speculative Execution Problem: Branches restrict compiler code motion Solution: Speculative operations that don’t cause exceptions Load.s r1 Inst 1 Inst 2 br a==b, b2 Chk.s r1 Use r1 Inst 3 Speculative load never causes exception, but sets “poison” bit on destination register Check for exception in original home block jumps to fixup code if exception detected Inst 1 Inst 2 br a==b, b2 Load r1 Use r1 Inst 3 Can’t move load above branch because might cause spurious exception Particularly useful for scheduling long latency loads early

IA-64 Data Speculation Problem: Possible memory hazards limit code scheduling Solution: Hardware to check pointer hazards Load.a r1 Inst 1 Inst 2 Store Load.c Use r1 Inst 3 Data speculative load adds address to address check table Store invalidates any matching loads in address check table Check if load invalid (or missing), jump to fixup code if so Inst 1 Inst 2 Store Load r1 Use r1 Inst 3 Can’t move load above store because store might be to same address Requires associative hardware in address check table

集群化VLIW（Clustered VLIW）将机器划分成具有局部寄存器堆和局部功能部件的集群集群间采用低带宽/高时延的互联由软件负责进行计算划分，以实现最小化通讯开销（communication overhead）在商用嵌入式VLIW处理器中经常使用，例如, TI C6x DSPs, HP Lx处理器 (在一些超标量处理器中也使用该技术，例如, Alpha 21264) Cluster Interconnect Local Regfile Local Regfile Cluster Memory Interconnect Cache/Memory Banks

静态调度的制约不可预知的转移可变存储时延(不可预知的cache失效) 代码大小爆炸（Code size explosion）编译的复杂性问题: 对于传统RISC/CISC处理器，VLIW能够激发哪些可采用的技术?