Presentation is loading. Please wait.

Presentation is loading. Please wait.

周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学 2018/11/16 计算机体系结构 周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学.

Similar presentations


Presentation on theme: "周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学 2018/11/16 计算机体系结构 周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学."— Presentation transcript:

1 周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学
2018/11/16 计算机体系结构 周学海 , 中国科学技术大学

2 第6章 Data-Level Parallelism in Vector, SIMD, and GPU Architectures
向量体系结构 多媒体SIMD指令集扩展 图形处理单元 GPU 2018/11/16 计算机体系结构

3 传统指令级并行技术的问题 挖掘ILP的传统方法的主要缺陷:
提高流水线的时钟频率: 提高时钟频率,有时导 致CPI随着增加 (branches, other hazards) 指令预取和译码: 有时在每个时钟周期很难预取 和译码多条指令 提高Cache命中率 : 在有些计算量较大的应用中 (科学计算)需要大量的数据,其局部性较差,有 些程序处理的是连续的媒体流(multimedia),其局部 性也较差。 2018/11/16 计算机体系结构

4 The University of Adelaide, School of Computer Science
16 November 2018 Introduction SIMD 结构可有效地挖掘数据级并行: 基于矩阵运算的科学计算 图像和声音处理 SIMD比MIMD更节能 针对每组数据操作仅需要取指一次 SIMD对PMD( personal mobile devices)更具 吸引力 SIMD 允许程序员继续以串行模式思维 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

5 The University of Adelaide, School of Computer Science
16 November 2018 SIMD Parallelism 向量体系结构 多媒体SIMD指令集 扩展 Graphics Processor Units (GPUs) For x86 processors: 每年增加2cores/chip SIMD 宽度每4年翻一番 SIMD潜在加速比是MIMD的2倍 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

6 2018/11/16 计算机体系结构

7 Supercomputers Supercomputer的定义: 由Seymour Cray设计的机器
对于给定任务而言世界上最快的机器 任何造价超过3千万美元的机器 计算能力达到每秒万亿次的机器 由Seymour Cray设计的机器 CDC6600 (Cray, 1964) 被认为是第一 台超级计算机 2018/11/16 计算机体系结构

8 CDC 6600 Seymour Cray, 1963 A fast pipelined machine with 60-bit words
128 Kword main memory capacity, 32 banks Ten functional units (parallel, unpipelined) Floating Point: adder, 2 multipliers, divider Integer: adder, 2 incrementers, ... Hardwired control (no microcoding) Scoreboard for dynamic scheduling of instructions Ten Peripheral Processors for Input/Output a fast multi-threaded 12-bit integer ALU Very fast clock, 10 MHz (FP add in 4 clocks) >400,000 transistors, 750 sq. ft., 5 tons, 150 kW, novel freon-based technology for cooling Fastest machine in world for 5 years (until 7600) over 100 sold ($7-10M each) 新颖的制冷技术:弗里昂制冷 第一次使用:动态调度技术 2018/11/16 计算机体系结构 CS252 S05

9 IBM Memo on CDC6600 Thomas Watson Jr., IBM CEO, August 1963: “Last week, Control Data ... announced the 6600 system. I understand that in the laboratory developing the system there are only 34 people including the janitor. Of these, 14 are engineers and 4 are programmers... Contrasting this modest effort with our vast development activities, I fail to understand why we have lost our industry leadership position by letting someone else offer the world's most powerful computer.” To which Cray replied: “It seems like Mr. Watson has answered his own question.” 2018/11/16 计算机体系结构 CS252 S05

10 Supercomputer Applications
典型应用领域 军事研究领域(核武器研制、密码学) 科学研究 天气预报 石油勘探 工业设计 (car crash simulation) 生物信息学 密码学 均涉及大量的数据集处理 70-80年代Supercomputer = Vector Machine 2018/11/16 计算机体系结构

11 Alternative Model:Vector Processing
向量处理机具有更高层次的操作,一条向量指令 可以处理N个或N对操作数(处理对象是向量) + r1 r2 r3 add r3, r1, r2 SCALAR (1 operation) v1 v2 v3 vector length add.vv v3, v1, v2 VECTOR (N operations) 2018/11/16 计算机体系结构 25

12 向量处理机的基本特性 基本思想:两个向量的对应分量进行运算,产生一个结果向量。
简单的一条向量指令包含了多个操作=> fewer instruction fetches 每一结果独立于前面的结果 长流水线,编译器保证操作间没有相关性 硬件仅需检测两条向量指令间的相关性 较高的时钟频率 向量指令以已知的模式访问存储器 可有效发挥多体交叉存储器的优势 可通过重叠减少存储器操作的延时 ­ 64 elements 不需要数据Cache! (仅使用指令cache) 在流水线控制中减少了控制相关 2018/11/16 计算机体系结构

13 向量处理机的基本结构 memory-memory vector processors: 所有的向 量操作是存储器到存储器
vector-register processors: 除了load 和store操 作外,所有的操作是向量寄存器与向量寄存 器间的操作 向量机的Load/Store结构 1980年以后的所有的向量处理机都是这 种结构: Cray, Convex, Fujitsu, Hitachi, NEC 我们也主要针对这种结构 2018/11/16 计算机体系结构

14 Vector Memory-Memory versus Vector Register Machines
存储器-存储器型向量机所有指令操作的操作数来源于存储器 第一台向量机 CDC Star-100 (‘73) and TI ASC (‘71), 是存储器-存储 器型机器 Cray-1 (’76) 是第一台寄存器型向量机 ADDV C, A, B SUBV D, A, B Vector Memory-Memory Code for (i=0; i<N; i++) { C[i] = A[i] + B[i]; D[i] = A[i] - B[i]; } Example Source Code LV V1, A LV V2, B ADDV V3, V1, V2 SV V3, C SUBV V4, V1, V2 SV V4, D Vector Register Code 2018/11/16 计算机体系结构

15 Vector Memory-Memory vs. Vector Register Machines
存储器-存储器型向量机 (VMMA) 需要更高的存储器 带宽 All operands must be read in and out of memory VMMA结构使得多个向量操作重叠执行更困难 Must check dependencies on memory addresses VMMA启动时间更长 CDC Star-100 在向量元素小于100时,标量代码的性能高于 向量化代码 For Cray-1, vector/scalar 均衡点在2个元素 CDC Cray-1后续的机器 (Cyber-205, ETA-10) 都是寄存 器型向量机 2018/11/16 计算机体系结构

16 向量处理机的基本组成单元 Vector Register: 固定长度的一块区域,存放单个向量
至少2个读端口和一个写端口(一般最少16个读端口,8个写 端口) 典型的有8-32 向量寄存器,每个寄存器存放64到128个64位的 元素 Vector Functional Units (FUs): 全流水化的,每一个clock启 动一个新的操作 一般4到8个FUs: FP add, FP mult, FP reciprocal (1/X), integer add, logical, shift; 可能有些重复设置的部件 Vector Load-Store Units (LSUs): 全流水化地load 或store一 个向量,可能会配置多个LSU部件 Scalar registers: 存放单个元素用于标量处理或存储地址 用交叉开关连接(Cross-bar) FUs , LSUs, registers 2018/11/16 计算机体系结构

17 2018/11/16 计算机体系结构

18 Vector Supercomputers
Cray-1的变体(1976): Scalar Unit:Load/Store Architecture Vector Extension Vector Registers Vector Instructions Implementation 硬布线逻辑控制 高效流水化的功能部件 多体交叉存储系统 无Data Cache 不支持 Virtual Memory 2018/11/16 计算机体系结构

19 Vector Instruction Set Advantages
格式紧凑 一条指令包含N个操作 表达能力强, 一条指令能告诉硬件: N个操作之间无相关性 使用同样的功能部件 访问不相交的寄存器 与前面的操作以相同模式访问寄存器 访问存储器中的连续块 (unit-stride load/store) 以已知的模式访问存储器 (strided load/store) 可扩展性好 可以在多个并行的流水线上运行同样的代码 (lanes) 2018/11/16 计算机体系结构

20 Vector Arithmetic Execution
使用较深的流水线(=> fast clock) 执行向量元素的操作 由于向量元素相互独立,简化了深度流水线的控制 (=> no hazards!) V1 V2 V3 Six stage multiply pipeline V3 <- v1 * v2 2018/11/16 计算机体系结构

21 Vector Instruction Execution
ADDV C,A,B C[1] C[2] C[0] A[3] B[3] A[4] B[4] A[5] B[5] A[6] B[6] 使用一条流水化的功能部件执行 C[4] C[8] C[0] A[12] B[12] A[16] B[16] A[20] B[20] A[24] B[24] C[5] C[9] C[1] A[13] B[13] A[17] B[17] A[21] B[21] A[25] B[25] C[6] C[10] C[2] A[14] B[14] A[18] B[18] A[22] B[22] A[26] B[26] C[7] C[11] C[3] A[15] B[15] A[19] B[19] A[23] B[23] A[27] B[27] 使用4条流水化的功能部件执行 2018/11/16 计算机体系结构

22 Interleaved Vector Memory System
Cray-1, 16 banks, 4 cycle bank busy time, 12 cycle latency Bank busy time: Time before bank ready to accept next request 1 2 3 4 5 6 7 8 9 A B C D E F + Base Stride Vector Registers Memory Banks Address Generator 2018/11/16 计算机体系结构

23 Vector Unit Structure Functional Unit Vector Registers Lane
Elements 0, 4, 8, … Elements 1, 5, 9, … Elements 2, 6, 10, … Elements 3, 7, 11, … Lane Memory Subsystem 2018/11/16 计算机体系结构

24 T0 Vector Microprocessor (UCB/ICSI, 1995)
Vector register elements striped over lanes [0] [8] [16] [24] [1] [9] [17] [25] [2] [10] [18] [26] [3] [11] [19] [27] [4] [12] [20] [28] [5] [13] [21] [29] [6] [14] [22] [30] [7] [15] [23] [31] Lane 2018/11/16 计算机体系结构

25 Vector Instruction Parallelism
多条向量指令可重叠执行(链接技术) 例如:每个向量 32 个元素,8 lanes(车道) Load Unit Multiply Unit Add Unit load mul add time load mul add Instruction issue Complete 24 operations/cycle while issuing 1 short instruction/cycle 2018/11/16 计算机体系结构

26 “DLXV” Vector Instructions
Instr. Operands Operation Comment ADDV V1,V2,V3 V1=V2+V3 vector + vector ADDSV V1,F0,V2 V1=F0+V2 scalar + vector MULTV V1,V2,V3 V1=V2xV3 vector x vector MULSV V1,F0,V2 V1=F0xV2 scalar x vector LV V1,R1 V1=M[R1..R1+63] load, stride=1 LVWS V1,R1,R2 V1=M[R1..R1+63*R2] load, stride=R2 LVI V1,R1,V2 V1=M[R1+V2i,i=0..63] indir.("gather") CeqV VM,V1,V2 VMASKi = (V1i=V2i)? comp. setmask MOV VLR,R1 Vec. Len. Reg. = R1 set vector length MOV VM,R1 Vec. Mask = R1 set vector mask 2018/11/16 计算机体系结构

27 05/13-review: Multithreading
Simultaneous Multithreading Time (processor cycle) Superscalar Fine-Grained Coarse-Grained Multiprocessing Thread 1 Thread 3 Thread 5 Thread 2 Thread 4 Idle slot 2018/11/16 计算机体系结构

28 05/13-Review 向量处理机基本概念 向量处理机基本特征 向量处理机基本结构 向量处理机性能评估
基本思想:两个向量的对应分量进行运算,产生一个 结果向量 向量处理机基本特征 VSIW-一条指令包含多个操作 单条向量指令内所包含的操作相互独立 以已知模式访问存储器 控制相关少 向量处理机基本结构 向量处理机性能评估 向量指令流执行时间: Convey, Chimes, Start-up time 其他指标: R , N1/2 , NV 2018/11/16 计算机体系结构

29 Vector Execution Time Time = f(vector length, data dependencies, struct. hazards) Initiation rate: 功能部件消耗向量元素的速率 Convoy: 可在同一时钟周期开始执行的指令集合 (no structural or data hazards) Chime: 执行一个convoy所花费的大致时间(approx. time) m convoys take m chimes; 如果每个向量长度为n, 那么m个convoys 所花费的时间是m个chimes 每个chime所花费的时间是n个clocks,该程序所花费的总时间大约为m x n clock cycles (ignores overhead; good approximization for long vectors) 1: LV V1,Rx ;load vector X 2: MULV V2,F0,V1 ;vector-scalar mult. LV V3,Ry ;load vector Y 3: ADDV V4,V2,V3 ;add 4: SV Ry,V4 ;store the result 4 convoys, 1 lane, VL=64 => 4 x 64 = 256 clocks (or 4 clocks per result) 2018/11/16 计算机体系结构

30 Vector Startup 向量启动时间由两部分构成 功能部件延时:一个操作通过功能部件的时间
截止时间或恢复时间(dead time or recovery time ): 运行下一条向量指令的间隔时间 Functional Unit Latency R X W R X W First Vector Instruction R X W R X W R X W Dead Time R X W R X W R X W Dead Time Second Vector Instruction R X W R X W 2018/11/16 计算机体系结构

31 Dead Time and Short Vectors
T0, Eight lanes No dead time 100% efficiency with 8 element vectors 4 cycles dead time 64 cycles active Cray C90, Two lanes 4 cycle dead time Maximum efficiency 94% with 128 element vectors 2018/11/16 计算机体系结构

32 DLXV Start-up Time Start-up time: FU 部件流水线的深度
Operation Start-up penalty (from CRAY-1) Vector load/store Vector multiply Vector add 6 Assume convoys don't overlap; vector length = n Convoy Start st result last result 1. LV n (12+n-1) 2. MULV, LV n n n Multiply startup 12+n n n Load start-up 3. ADDV n n n Wait convoy 2 4. SV n n n Wait convoy 3 2018/11/16 计算机体系结构

33 Vector Length 当向量的长度不是64时(假设向量寄存器的长度 是64)怎么办?
vector-length register (VLR) 控制特定向量操作的 长度, 包括向量的load/store. (当然一次操作的向 量的长度不能 > 向量寄存器的长度) 例如: do 10 i = 1, n 10 Y(i) = a * X(i) + Y(i) n的值只有在运行时才能知道 n > Max. Vector Length (MVL)怎么办? 2018/11/16 计算机体系结构

34 Strip Mining(分段开采) 假设Vector Length > Max. Vector Length (MVL)?
Strip mining: 产生新的代码,使得每个向量操作的元素数  MVL 第一次循环做最小片(n mod MVL), 以后按VL = MVL操作 low = VL = (n mod MVL) /*find the odd size piece*/ do 1 j = 0, (n / MVL) /*outer loop*/ do 10 i = low, low+VL-1 /*runs for length VL*/ Y(i) = a*X(i) + Y(i) /*main operation*/ 10 continue low = low+VL /*start of next vector*/ VL = MVL /*reset the length to max*/ 1 continue 2018/11/16 计算机体系结构

35 Strip Mining的向量执行时间计算
试计算A=B×s,其中A,B为长度为200的向量(每个向量元素占8个字节),s是一个标量。向量寄存器长度为64。各功能部件的启动时间如前所述,求总的执行时间,(Tloop = 15) 2018/11/16 计算机体系结构

36 ADDI R2,R0,#1600 ;total # bytes in vector ADD R2,R2,Ra ;address of the end of A vector ADDI R1,R0,#8 ;loads length of 1st segment MOVI2S VLR,R1 ;load vector length in VLR ADDI R1,R0,#64 ;length in bytes of 1st segment ADDI R3,R0,#64 ;vector length of other segments Loop: LV V1,Rb ;load B MULSV V2,V1,Fs ;vector * scalar SV Ra,V2 ;store A ADD Ra,Ra,R1 ;address of next segment of A ADD Rb,Rb,R1 ;address of next segment of B ADDI R1,R0,#512 ;load byte offset next segment MOVI2S VLR,R3 ;set length to 64 elements SUB R4,R2,Ra ;at the end of A? BNEZ R4,Loop ;if not, go back 2018/11/16 计算机体系结构

37 Tstart = = 31 T200 = 660+4*31 = 784 每一元素的执行时间 = 784/200 = 3.9 2018/11/16 计算机体系结构

38 2018/11/16 计算机体系结构

39 Common Vector Metrics Rn 表示向量长度为n时的向量流水线的性能
R: 当向量长度为无穷大时的向量流水线的最大 性能。常在评价峰值性能时使用,单位为 MFLOPS 实际问题是向量长度不会无穷大,start-up的开销还 是比较大的 Rn 表示向量长度为n时的向量流水线的性能 N1/2: 达到R 一半的值所需的向量长度,是评价 向量流水线start-up 时间对性能的影响。 NV: 向量流水线方式的工作速度优于标量串行方 式工作时所需的向量长度临界值。 该参数既衡量建立时间,也衡量标量、向量速度比对 性能的影响 2018/11/16 计算机体系结构

40 Example Vector Machines
Machine Year Clock Regs Elements FUs LSUs Cray MHz Cray XMP MHz L, 1 S Cray YMP MHz L, 1 S Cray C MHz Cray T MHz Conv. C MHz Conv. C MHz Fuj. VP MHz Fuj. VP MHz NEC SX/ MHz 8+8K 256+var 16 8 NEC SX/ MHz 8+8K 256+var 16 8 Cray 1; fastest scalar computer + 1st commercially successful vector computer, offered another 10X 6600 1st scoreboard Cray XMP: 3 LSUs, Multiprocessor 4 way (not by Cray) => YMP, C-90, T-90; 2X processors, 1.5X clock Cray 2 went to DRAM to get more memory, not so great Like parallel teams as Intel (486, PPro, Pentium, next one) Japan Fujitsu, vary number of registers elements (8x1024 or 32x256) NEC, 8x K of varying elements 2018/11/16 计算机体系结构

41 A Modern Vector Super: NEC SX-9 (2008)
65nm CMOS technology Vector unit (3.2 GHz) 8 foreground VRegs background VRegs (256x64-bit elements/VReg) 64-bit functional units: 2 multiply, 2 add, 1 divide/sqrt, 1 logical, 1 mask unit 8 lanes (32+ FLOPS/cycle, GFLOPS peak per CPU) 1 load or store unit (8 x 8-byte accesses/cycle) Scalar unit (1.6 GHz) 4-way superscalar with out-of- order and speculative execution 64KB I-cache and 64KB data cache Memory system provides 256GB/s DRAM bandwidth per CPU Up to 16 CPUs and up to 1TB DRAM form shared-memory node total of 4TB/s bandwidth to shared DRAM memory Up to 512 nodes connected via 128GB/s network links (message passing between nodes) Picture from NEC article “A hardware overview of SX-6 and SX-7 supercomputer” 2018/11/16 计算机体系结构

42 Vector Linpack Performance (MFLOPS)
Matrix Inverse (gaussian elimination) Machine Year Clock 100x100 1kx1k Peak(Procs) Cray MHz (1) Cray XMP MHz (4) Cray YMP MHz ,667(8) Cray C MHz ,238(16) Cray T MHz ,600(32) Conv. C MHz (1) Conv. C MHz (4) Fuj. VP MHz (1) NEC SX/ MHz (1) NEC SX/ MHz ,600(4) 6X in 20 years; 32X in 20 years; Peak is 360X speedup Weighed tons 2018/11/16 计算机体系结构

43 Memory Banking 独立存储体方式:由多个相互独立的存储体(Bank) 构成存储器组 织。可独立访问存储体,各存储体共享数据和地址总线 (minimize pin cost) 每个周期可以启动和完成一个bank的访问 如果N个存储器访问访问不同的bank可以并行执行 2018/11/16 计算机体系结构

44 Interleaved Vector Memory System
Cray-1, 16 banks, 4 cycle bank busy time, 12 cycle latency Bank busy time: Time before bank ready to accept next request If stride = 1 & consecutive elements interleaved across banks & number of banks >= bank latency, then can sustain 1 element/cycle throughput 1 2 3 4 5 6 7 8 9 A B C D E F + Base Stride Vector Registers Memory Banks Address Generator 2018/11/16 计算机体系结构

45 Example(AppF F-15) Suppose we want to fetch a vector of 64 elements starting at byte address 136,and a memory access takes 6 clocks. How many memory banks must we have to support one fetch per clock cycle? With what addresses are the banks accessed? When will the various elements arrive at the CPU? 2018/11/16 计算机体系结构

46 Vector Stride 假设处理顺序相邻的元素在存储器中不顺序存储。例如 do 10 i = 1,100 do 10 j = 1,100
A(i,j) = 0.0 do 10 k = 1,100 10 A(i,j) = A(i,j)+B(i,k)*C(k,j) B 或 C 的两次访问不会相邻 (相隔800 bytes) stride: 向量中相邻元素间的距离 => LVWS (load vector with stride) instruction Strides => 会导致体冲突 (e.g., stride = 32 and 16 banks) 2018/11/16 计算机体系结构

47 Memory operations Load/store 操作成组地在寄存器和存储器之 间移动数据 三类寻址方式
Unit stride (单步长) Fastest Non-unit (constant) stride (常数步长) Indexed (gather-scatter) (间接寻址) 等价于寄存器间接寻址方式 对稀疏矩阵有效 用于向量化操作的指令增多 2018/11/16 计算机体系结构 32 32

48 DAXPY (Y = a × X + Y) Scalar vs. Vector
Assuming vectors X, Y are length 64 Scalar vs. Vector LD F0,a ;load scalar a LV V1,Rx ;load vector X MULTS V2,F0,V1 ;vector-scalar mult. LV V3,Ry ;load vector Y ADDV V4,V2,V3 ;add SV Ry,V4 ;store the result LD F0,a ADDI R4,Rx,#512 ;last address to load loop: LD F2, 0(Rx) ;load X(i) MULTD F2,F0,F2 ;a*X(i) LD F4, 0(Ry) ;load Y(i) ADDD F4,F2, F4 ;a*X(i) + Y(i) SD F4 ,0(Ry) ;store into Y(i) ADDI Rx,Rx,#8 ;increment index to X ADDI Ry,Ry,#8 ;increment index to Y SUB R20,R4,Rx ;compute bound BNZ R20,loop ;check if done 578 (2+9*64) vs (1+5*64) ops (1.8X) 578 (2+9*64) vs instructions (96X) 64 operation vectors no loop overhead also 64X fewer pipeline hazards 2018/11/16 计算机体系结构

49 Vector Opt#1: Vector Chaining
寄存器定向路径的向量机版本 首次在Cray-1上使用 Memory V1 Load Unit Mult. V2 V3 Chain Add V4 V5 Chain LV v1 MULV v3,v1,v2 ADDV v5, v3, v4 2018/11/16 计算机体系结构

50 Vector Chaining Advantage
Load Mul Add Time 不采用链接技术,必须处理完前一条指令的最后一个元素,才能启动下一条相关的指令 采用链接技术,前一条指令的第一个结果出来后,就可以启动下一条相关指令的执行 Load Mul Add 2018/11/16 计算机体系结构

51 Vector Instruction Parallelism
多条向量指令可重叠执行(链接技术) 例如:每个向量 32 个元素,8 lanes(车道) Load Unit Multiply Unit Add Unit load mul add time load mul add Instruction issue Complete 24 operations/cycle while issuing 1 short instruction/cycle 2018/11/16 计算机体系结构

52 Vector Opt #2: Conditional Execution
Suppose: do 100 i = 1, 64 if (A(i) .ne. 0) then A(i) = A(i) – B(i) endif 100 continue vector-mask control 使用长度为MVL的布尔向量控 制向量指令的执行 当vector-mask register 使能时,向量指令操作仅 对 vector-mask register中 对应位为1的分量起作用 2018/11/16 计算机体系结构

53 Masked Vector Instructions
B[3] A[4] B[4] A[5] B[5] A[6] B[6] M[3]=0 M[4]=1 M[5]=1 M[6]=0 M[2]=0 M[1]=1 M[0]=0 Write data port Write Enable A[7] B[7] M[7]=1 Simple Implementation execute all N operations, turn off result writeback according to mask C[4] C[5] C[1] Write data port A[7] B[7] M[3]=0 M[4]=1 M[5]=1 M[6]=0 M[2]=0 M[1]=1 M[0]=0 M[7]=1 Density-Time Implementation scan mask vector and only execute elements with non-zero masks 2018/11/16 计算机体系结构

54 LV V1,Ra ; load vector A into V1 LV V2,Rb ; load vector B L
LV V1,Ra ; load vector A into V1 LV V2,Rb ; load vector B L.D F0,#0 ; load FP zero into F0 SNEVS.D V1,F0 ;sets VM(i) to 1 if V1(i)!=F0 SUBV.D V1,V1,V2 ;subtract under vector mask CVM ;set the vector mask to all 1s SV Ra,V1 ;store the result in A 使用vector-mask寄存器的缺陷 简单实现时,条件不满足时向量指令仍然需要花费时间 有些向量处理器带条件的向量执行仅控制向目标寄存器的写操作,可能会有除法错。 2018/11/16 计算机体系结构

55 Vector Opt #3: Sparse Matrices
Suppose: do 100 i = 1,n 100 A(K(i)) = A(K(i)) + C(M(i)) gather (LVI) operation 使用index vector 中给出的 偏移再加基址来读取 => a nonsparse vector in a vector register 这些元素以密集的方式操作完成后,再使用同样 的index vector存储到稀疏矩阵的对应位置 这些操作编译时可能无法完成。主要原因:编译 器无法预知Ki以及是否有数据相关 使用CVI 设置步长( index 0, 1xm, 2xm, ..., 63xm) CVI gets used under mask 2018/11/16 计算机体系结构

56 Sparse Matrix Example Cache (1993) vs. Vector (1988)
IBM RS6000 Cray YMP Clock 72 MHz 167 MHz Cache 256 KB 0.25 KB Linpack 140 MFLOPS 160 (1.1) Sparse Matrix 17 MFLOPS 125 (7.3) (Cholesky Blocked ) Cache: 1 address per cache block (32B to 64B) Vector: 1 address per element (4B) 2018/11/16 计算机体系结构

57 Automatic Code Vectorization
for (i=0; i < N; i++) C[i] = A[i] + B[i]; Vector Instruction load add store Iter. 1 Iter. 2 Vectorized Code Time load add store Iter. 1 Iter. 2 Scalar Sequential Code 向量化是指在编译期间对操作重定序 需要进行大量的循环相关分析 2018/11/16 计算机体系结构

58 并行的类型 指令级并行(ILP) 线程级并行 (TLP) 数据级并行(DLP) Which is easiest to program?
以并行方式执行某个指令流中的独立无关的指令 (pipelining, superscalar, VLIW) 线程级并行 (TLP) 以并行方式执行多个独立的指令流 (multithreading, multiple cores) 数据级并行(DLP) 以并行方式执行多个相同类型的操作 (vector/SIMD execution) Array Processor 、Vector Processor Which is easiest to program? Which is most flexible form of parallelism? i.e., can be used in more situations Which is most efficient? i.e., greatest tasks/second/area, lowest energy/task 2018/11/16 计算机体系结构

59 DLP的兴起 应用需求和技术约束推动着体系结构的发展 和选择
图形、机器视觉、语音识别、机器学习等新 的应用均需要大量的数值计算,其算法通常 具有数据并行 SIMD-based 结构 (vector-SIMD, subword-SIMD, SIMT/GPUs) 是执行这些算法的最有效途径 2018/11/16 计算机体系结构

60 Vector/SIMD Processing Summary
同样的操作作用于不同的数据元素 向量内的元素操作独立可有效提高性能,简化设计 性能的提升受限于代码的向量化 标量操作限制了向量机的性能 Amdahl’s Law 目前很多ISA包含SIMD操作指令 Intel MMX/SSEn/AVX, PowerPC AltiVec, ARM Advanced SIMD 2018/11/16 计算机体系结构

61 Array vs. Vector Processors
2018/11/16 计算机体系结构

62 SIMD Array Processing vs. VLIW
2018/11/16 计算机体系结构

63 SIMD Array Processing vs. VLIW
Array processor: 单个操作作用在多个不同的数据元素上 2018/11/16 计算机体系结构

64 Multimedia Extensions (aka SIMD extensions)
64b 32b 16b 8b 在已有的ISA中添加一些向量长度很短的向量操作指令 将已有的 64-bit 寄存器拆分为 2x32b or 4x16b or 8x8b 1957年,Lincoln Labs TX-2 将36bit datapath 拆分为2x18b or 4x9b 新的设计具有较宽的寄存器 128b for PowerPC Altivec, Intel SSE2/3/4 256b for Intel AVX 单条指令可实现寄存器中所有向量元素的操作 16b + 4x16b adds 2018/11/16 计算机体系结构

65 Intel Pentium MMX Operations
idea: 一条指令操作同时作用于不同的数据元 全阵列处理 用于多媒体操作 No VLEN register Opcode determines data type: 8 8-bit bytes 4 16-bit words 2 32-bit doublewords 1 64-bit quadword Stride always equal to 1. 2018/11/16 计算机体系结构

66 MMX Example: Image Overlaying (I)
2018/11/16 计算机体系结构

67 MMX Example: Image Overlaying (II)
2018/11/16 计算机体系结构

68 Multimedia Extensions versus Vectors
受限的指令集: 无向量长度控制 Load/store操作无 常数步长寻址和 scatter/gather操作 loads 操作必须64/128-bit 边界对齐 受限的向量寄存器长度: 需要超标量发射以保持multiply/add/load 部件忙 通过循环展开隐藏延迟增加了寄存器读写压力 在微处理器设计中向全向量化发展 更好地支持非对齐存储器访问 支持双精度浮点数操作 (64-bit floating-point) Intel AVX spec (announced April 2008), 256b vector registers (expandable up to 1024b) 2018/11/16 计算机体系结构

69 05/18-Review 多媒体扩展指令 GPU 向量机的存储器访问 存储器组织:独立存储体、多体交叉方式
Stride : 固定步长(1 or 常数), 非固定步长(index) 基于向量机模型的优化 链接技术 有条件执行 稀疏矩阵的操作 多媒体扩展指令 扩展的指令类型较少 向量寄存器长度较短 GPU 2018/11/16 计算机体系结构

70 Recap: Vector/SIMD Processing Summary
同样的操作作用在许多数据元素上 提高性能、设计简单(向量内的操作相互独立) 性能的提升受限于代码的向量化 标量操作限制着向量机的性能 很多已有的ISA扩展了一些SIMD操作 Intel MMX/SSEn/AVX, PowerPC AltiVec, ARM Advanced SIMD

71 Graphics Processing Units (GPUs)
原来的GPU是指带有高性能浮点运算部件、可高效生成3D图 形的具有固定功能的专用设备 (mid-late 1990s) 让PC机具有类似工作站的图形功能 用户可以配置图形处理流水线,但不是真正的对其编程 随着时间的推移,GPU加入了越来越多的可编程性 ( ) 例如新的语言 Cg可用来编写一些小的程序处理图形的顶点或像素, 是Windows DirectX的变体 大规模并行(针对每帧上百万顶点或像素)但非常受限于编程模型 有些用户注意到通过将输入和输出数据映射为图像,并对顶 点或像素渲染计算 可进行通用计算 因为不得不使用图形流水线模型,这对完成通用计算来说是个非常难 用的编程模型 2018/11/16 计算机体系结构

72 General-Purpose GPUs (GP-GPUs)
2006年, Nvidia 的 GeForce 8800 GPU 支持一种新的 编程语言: CUDA “Compute Unified Device Architecture” 随后工业界推出OpenCL,与CUDA具有相同的ideas, 但独 立于供应商 Idea: 针对通用计算,发挥GPU的计算的高性能和存 储器的高带宽来加速一些通用计算中的核心 (Kernels) 一种附加处理器模型(GPU作为附加设备):Host CPU发射数据并行的kernels 到GP-GPU上运行 我们仅讨论Nvidia CUDA样式的简化版本,仅考虑 GPU的计算核部分,不涉及图形加速部分 2018/11/16 计算机体系结构

73 Using CPU+GPU Architecture
异构系统(异构多核) 针对每个任务选择合适的处 理器和存储器 通用CPU 适合执行一些串行 的线程 串行执行快 带有cache,访问存储器延时低 GPU 适合执行大量并行的线 程 可扩放的并行执行 高带宽的并行存取 GPU SMem PCIe Bridge Host Memory CPU Cache Device Memory

74 GPUs are SIMD Engines Underneath
指令流水线类似于SIMD的流水线,但是,用线程来编程而不 是SIMD指令 两个概念 Programming Model (Software) vs Execution Model (Hardware) 编程模型指程序员如何表达代码 例如, 顺序模型 (von Neumann), 数据并行(SIMD), 数据流模型、多线程 模型 (MIMD, SPMD), … 执行模型指硬件底层如何执行代码 例如, 乱序执行、向量机、数据流处理机、多处理机、多线程处理机等 执行模型与编程模型可以差别很大 例如., 顺序模型可以在乱序执行的处理器上执行。 SPMD 模型可以用SIMD处理器实现 (a GPU)

75 How Can You Exploit Parallelism Here?
for (i=0; i < N; i++) C[i] = A[i] + B[i]; load add store Iter. 1 Iter. 2 Scalar Sequential Code Let’s examine three programming options to exploit instruction-level parallelism present in this sequential code: 1. Sequential (SISD) 2. Data-Parallel (SIMD) 3. Multithreaded (MIMD/SPMD)

76 Prog. Model 1: Sequential (SISD)
for (i=0; i < N; i++) C[i] = A[i] + B[i]; Can be executed on a: Pipelined processor Out-of-order execution processor Independent instructions executed when ready Different iterations are present in the instruction window and can execute in parallel in multiple functional units In other words, the loop is dynamically unrolled by the hardware Superscalar or VLIW processor Can fetch and execute multiple instructions per cycle load add store Iter. 1 Iter. 2 Scalar Sequential Code

77 Prog. Model 2: Data Parallel (SIMD)
for (i=0; i < N; i++) C[i] = A[i] + B[i]; Vectorized Code load add store Iter. 1 Iter. 2 Scalar Sequential Code Vector Instruction load add store load add store VLD A  V1 VLD B  V2 VADD V1 + V2  V3 VST V3  C Iter. 1 Iter. 2 Realization: Each iteration is independent Idea: Programmer or compiler generates a SIMD instruction to execute the same instruction from all iterations across different data Best executed by a SIMD processor (vector, array)

78 Prog. Model 3: Multithreaded
for (i=0; i < N; i++) C[i] = A[i] + B[i]; load add store Iter. 1 Iter. 2 Scalar Sequential Code load add store load add store Iter. 1 Iter. 2 Realization: Each iteration is independent Idea: Programmer or compiler generates a thread to execute each iteration. Each thread does the same thing (but on different data) Can be executed on a MIMD machine

79 Prog. Model 3: Multithreaded
for (i=0; i < N; i++) C[i] = A[i] + B[i]; load add store load add store Iter. 1 Iter. 2 Realization: Each iteration is independent Idea: Programmer or compiler generates a thread to execute each iteration. Each thread does the same thing (but on different data) Can be executed on a MIMD machine This particular model is also called: SPMD: Single Program Multiple Data Can be executed on a SIMT machine Single Instruction Multiple Thread Can be executed on a SIMD machine

80 A GPU is a SIMD (SIMT) Machine
GPU不是用SIMD指令编程 使用线程 (SPMD 编程模型) 每个线程执行同样的代码,但操作不同的数据元素 每个线程有自己的上下文(即可以独立地启动/执行等) 一组执行相同指令的线程由硬件动态组织成warp 一个warp是由硬件形成的SIMD操作

81 The University of Adelaide, School of Computer Science
16 November 2018 Threads and Blocks 一个线程对应一个数据元素 大量的线程组织成很多线程块 许多线程块组成一个网格 GPU 由硬件对线程进行管理 Thread Block Scheduler SIMD Thread Scheduler 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

82 Programmer’s View of Execution
blockIdx 0 threadId 0 threadId 1 threadId 255 blockDim = 256 (programmer can choose) 创建足够的线程块以适应输入向量 (Nvidia 中将由多个线程块构成的、在GPU上运行的代码 称为Grid,Grid可以是2维的) blockIdx 1 threadId 0 threadId 1 threadId 255 blockIdx (n+255)/256) threadId 0 threadId 1 threadId 255 Conditional (i<n) turns off unused threads in last block 2018/11/16 计算机体系结构

83 Hardware Execution Model
GPU Core 0 Lane 0 Lane 1 Lane 15 Core 1 Core 15 CPU CPU Memory GPU Memory GPU 由多个并行核构成,每个核包含一个多线程 SIMD处理器(包含多个车道) CPU 发送整个 “grid”到GPU,由GPU将这些线程块分 发到多个核上(每个线程块在一个核上运行) GPU上核的数量对程序员而言是透明的 2018/11/16 计算机体系结构

84 Simplified CUDA Programming Model
计算由大量的相互独立的线程(CUDA threads or microthreads) 完成,这些线程组合成线程块 (thread blocks) // C version of DAXPY loop. void daxpy(int n, double a, double*x, double*y) { for (int i=0; i<n; i++) y[i] = a*x[i] + y[i]; } // CUDA version. __host__ // Piece run on host processor. int nblocks = (n+255)/256; // 256 CUDA threads/block daxpy<<<nblocks,256>>>(n,2.0,x,y); __device__ // Piece run on GP-GPU. { int i = blockIdx.x*blockDim.x + threadId.x; if (i<n) y[i]=a*x[i]+y[i]; } 2018/11/16 计算机体系结构

85 NVIDIA Instruction Set Arch.
The University of Adelaide, School of Computer Science 16 November 2018 NVIDIA Instruction Set Arch. ISA 是硬件指令集的抽象 “Parallel Thread Execution (PTX)” 使用虚拟寄存器 用软件将其翻译成机器码 Example: shl.s32 R8, blockIdx, 9 ; Thread Block ID * Block size (512 or 29) add.s32 R8, R8, threadIdx ; R8 = i = my CUDA thread ID ld.global.f64 RD0, [X+R8] ; RD0 = X[i] ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i] mul.f64 R0D, RD0, RD4 ; Product in RD0 = RD0 * RD4 (scalar a) add.f64 R0D, RD0, RD2 ; Sum in RD0 = RD0 + RD2 (Y[i]) st.global.f64 [Y+R8], RD0 ; Y[i] = sum (X[i]*a + Y[i]) 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

86 CUDA kernel maps to Grid of Blocks
kernel_func<<<nblk, nthread>>>(param, … ); Host Thread Grid of Thread Blocks . . . econve GPU SMs: SMem PCIe Bridge Host Memory CPU Cache Device Memory

87 Thread blocks execute on an SM Thread instructions execute on a core
float myVar; __shared__ float shVar; __device__ float glVar; Thread Block Per-block Shared Memory Per-app Device Global Registers Per-thread Local Memory GPU SMs: SMem PCIe Bridge Host Memory CPU Cache Device Memory SM: Streaming Multiprocessor PTX: Parallel Thread Execution

88 CUDA:“Single Instruction, Multiple Thread”
GPUs 使用 SIMT模型, 每个CUDA线程的标量指 令流汇聚在一起在硬件上以SIMD方式执行 (Nvidia groups 32 CUDA threads into a warp) µT0 µT1 µT2 µT3 µT4 µT5 µT6 µT7 ld x Scalar instruction stream mul a ld y add st y SIMD execution across warp 2018/11/16 计算机体系结构

89 GPU Memory Hierarchy [ Nvidia, 2010] 2018/11/16 计算机体系结构

90 SPMD on SIMT Machine Warp: A set of threads that execute
for (i=0; i < N; i++) C[i] = A[i] + B[i]; load add store load add store Warp 0 at PC X Warp 0 at PC X+1 Warp 0 at PC X+2 Warp 0 at PC X+3 Iter. 1 Iter. 2 Warp: A set of threads that execute the same instruction (i.e., at the same PC) Realization: Each iteration is independent Idea: Programmer or compiler generates a thread to execute each iteration. Each thread does the same thing (but on different data) Can be executed on a MIMD machine This particular model is also called: SPMD: Single Program Multiple Data A GPU executes it using the SIMT model: Single Instruction Multiple Thread Can be executed on a SIMD machine

91 SIMD vs. SIMT Execution Model
SIMD: 一条指令流(一串顺序的SIMD指令),每条指 令对应多个数据输入(向量指令) SIMT: 多个指令流(标量指令)构成线程, 这些线程 动态构成warp。一个Warp处理多个数据元素 SIMT 主要优点: 可以独立地处理线程,即每个线程可以在任何标量流水线上 单独执行( MIMD 处理模式) 可以将线程组织成warp,即可以将执行相同指令流的线程构 成warp,形成SIMD 处理模式, 以充分发挥SIMD处理的优势

92 Multithreading of Warps
for (i=0; i < N; i++) C[i] = A[i] + B[i]; 设一个warp由 32 threads构成 如果有32K次循环  1K 个warps 这些Warps可以在同一条流水线上交替执行 Fine grained multithreading of warps load add store load add store Warp 1 at PC X Warp 0 at PC X Warp 20 at PC X+2 Iter. 1 Iter. 2 Iter. 33 Iter. 34 Iter. 20*32 + 2 Iter. 20*32 + 1

93 Warps and Warp-Level FGMT
Warp: The threads that run lengthwise in a woven fabric … Thread Warp 3 Thread Warp 8 Common PC Thread Warp In SIMD, you need to specify the data array + an instruction (on which to operate the data on) + THE INSTRUCTION WIDTH. Eg: You might want to add 2 integer arrays of length 16, then a SIMD instruction would look like (the instruction has been cooked-up by me for demo) add.16 arr1 arr2 However, SIMT doesn't bother about the instruction width. So, essentially, you could write the above example as: arr1[i] + arr2[i] and then launch as many threads as the length of the array, as you want. Note that, if the array size was, let us say, 32, then SIMD EXPECTS you to explicitly call two such 'add.16' instructions! Whereas, this is not the case with SIMT. Scalar Scalar Scalar Scalar Thread Warp 7 Thread Thread Thread Thread W X Y Z SIMD Pipeline

94 High-Level View of a GPU

95 Latency Hiding via Warp-Level FGMT
Fine-grained multithreading 流水线上每次执行每个线程的一 条指令 (No interlocking) 通过warp的交叉执行来隐藏延时 所有线程的寄存器值保存在寄存器文 件中 FGMT 可以容忍长延时 Millions of pixels Decode R F A L U D-Cache Thread Warp 6 Thread Warp 1 Thread Warp 2 Data All Hit? Miss? Warps accessing memory hierarchy Thread Warp 3 Thread Warp 8 Writeback Warps available for scheduling Thread Warp 7 I-Fetch SIMD Pipeline With a large number of shader threads multiplexed on the same execution re- sources, our architecture employs fine-grained multithreading where individual threads are interleaved by the fetch unit to proactively hide the potential latency of stalls before they occur. As illustrated by Figure, warps are issued fairly in a round-robin queue. When a thread is blocked by a memory request, shader core simply removes that thread’s warp from the pool of “ready” warps and thereby allows other threads to proceed while the memory system processes its request. With a large number of threads (1024 per shader core) interleaved on the same pipeline, FGMT effectively hides the latency of most memory operations since the pipeline is occupied with instructions from other threads while memory operations complete. also hides the pipeline latency so that data bypassing logic can potentially be omitted to save area with minimal impact on performance. simplify the dependency check logic design by restricting each thread to have at most one instruction running in the pipeline at any time. Slide credit: Tor Aamodt

96 Warp Execution (Recall the Slide)
32-thread warp executing ADD A[tid],B[tid]  C[tid] C[1] C[2] C[0] A[3] B[3] A[4] B[4] A[5] B[5] A[6] B[6] Execution using one pipelined functional unit C[4] C[8] C[0] A[12] B[12] A[16] B[16] A[20] B[20] A[24] B[24] C[5] C[9] C[1] A[13] B[13] A[17] B[17] A[21] B[21] A[25] B[25] C[6] C[10] C[2] A[14] B[14] A[18] B[18] A[22] B[22] A[26] B[26] C[7] C[11] C[3] A[15] B[15] A[19] B[19] A[23] B[23] A[27] B[27] Execution using four pipelined functional units Slide credit: Krste Asanovic

97 SIMD Execution Unit Structure
Functional Unit Lane Registers for each Thread Registers for thread IDs 0, 4, 8, … Registers for thread IDs 1, 5, 9, … Registers for thread IDs 2, 6, 10, … Registers for thread IDs 3, 7, 11, … Memory Subsystem Slide credit: Krste Asanovic

98 Warp Instruction Level Parallelism
Can overlap execution of multiple instructions Example machine has 32 threads per warp and 8 lanes Completes 24 operations/cycle while issuing 1 warp/cycle Load Unit Multiply Unit Add Unit W0 W1 W2 time W3 W4 W5 Warp issue Slide credit: Krste Asanovic

99 SIMT Memory Access 不同线程的相同指令使用线程id来存取不同的数据元素 +
Let’s assume N=16, 4 threads per warp  4 warps 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Threads + 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Data elements + + + + Warp 0 Warp 1 Warp 2 Warp 3 Slide credit: Hyesoon Kim

100 Sample GPU SIMT Code (Simplified)
CPU code for (ii = 0; ii < ; ++ii) { C[ii] = A[ii] + B[ii]; } CUDA code // there are threads __global__ void KernelFunction(…) { int tid = blockDim.x * blockIdx.x + threadIdx.x; int varA = aa[tid]; int varB = bb[tid]; C[tid] = varA + varB; } Slide credit: Hyesoon Kim

101 Sample GPU Program (Less Simplified)
Slide credit: Hyesoon Kim

102 Warp-based SIMD vs. Traditional SIMD
Lock step: 一条向量指令必须执行完然后启动下一条向量指令 编程模型为SIMD (no extra threads)  SW 需要知道向量长度 ISA 包含vector/SIMD指令 Warp-based SIMD 由多个标量线程构成,以SIMD方式执行 (即所有的 线程执行相同的指令) 不需要lock step 每个线程可以单独对待(即可以映射到不同的warp中),编程模型不是 SIMD SW 不必知道向量长度 多个线程可以动态构成warp ISA是标量ISA  可以动态形成逻辑上的向量指令 是一种在SIMD硬件上实现的SPMD编程模型

103 SPMD Single procedure/program, multiple data 多条指令流执行相同的程序
它是一种编程模型而不是计算机组织 每个处理单元执行同样的过程,处理不同的数据 这些过程可以在程序中的某个点上同步,例如 e.g. barriers 多条指令流执行相同的程序 每个程序/过程 操作不同的数据 运行时可以执行不同的控制流路径 许多科学计算应用以这种方式编程运行在MIMD硬件结构上 (multiprocessors) 现代 GPUs 以这种类似的方式编程运行在SIMD硬件上

104 SIMD vs. SIMT Execution Model
SIMD: 一条指令流(一串顺序的SIMD指令),每条指 令对应多个数据输入(向量指令) SIMT: 多个指令流(标量指令)构成线程, 这些线程 动态构成warp。一个Warp处理多个数据元素 SIMT 主要优点: 可以独立地处理线程,即每个线程可以在任何标量流水线上 单独执行( MIMD processing) 可以将线程组织成warp,即可以将执行相同指令流的线程构 成warp,形成SIMD processing, 以充分发挥SIMD处理的优势

105 Threads Can Take Different Paths in Warp-based SIMD
每个线程可以包含控制流指令 这些线程可以执行不同的控制流路径 B C D E F A G Thread Warp Common PC Thread 1 Thread 2 Thread 3 Thread 4 Slide credit: Tor Aamodt

106 Control Flow Problem in GPUs/SIMT
GPU 控制逻辑使用 SIMD 流水线 以节省资源 这些标量线程构成warp 当一个WARP中的线程分 支到不同的执行路径时 ,产生分支发散( Branch divergence) Branch Path A Path B Branch Path A Path B 与向量处理机模型的条件执行相同 (Vector Mask and Masked Vector Operations?) Slide credit: Tor Aamodt

107 Branch Divergence Handling (I)
Idea: Dynamic predicated (conditional) execution Reconv. PC Next PC Active Mask Stack B C D E F A G A/1111 E D 0110 1001 TOS - 1111 E D 0110 C 1001 TOS - 1111 - B 1111 TOS - A 1111 TOS - E 1111 TOS E D 0110 TOS - 1111 - G 1111 TOS B/1111 C/1001 D/0110 Thread Warp Common PC Thread 2 3 4 1 E/1111 Reconv. PC : 重新汇聚地址 G/1111 A B C D E G A Time Slide credit: Tor Aamodt

108 Branch Divergence Handling (II)
if (some condition) { B; } else { C; } D; One per warp Control Flow Stack Next PC Recv PC Active Mask TOS A -- 1111 D -- 1111 B D 1110 A C D 0001 D Execution Sequence 1 A 1 C 1 B 1 D B C D Time Slide credit: Tor Aamodt

109 Remember: Each Thread Is Independent
SIMT 主要优点: 可以独立地处理线程,即每个线程可以在任何标量流水线上单 独执行( MIMD 处理模式) 可以将线程组织成warp,即可以将执行相同指令流的线程构 成warp,形成SIMD 处理模式, 以充分发挥SIMD处理的优势 如果有许多线程,对具有相同PC值的线程可以将它们动 态组织到一个warp中 这样可以减少“分支发散” 提高SIMD 利用率 SIMD 利用率: 执行有用操作的SIMD lanes的比例 (即, 执行活动 线程的比例)

110 Dynamic Warp Formation/Merging
Idea: Dynamically merge threads executing the same instruction (after branch divergence) Form new warps from warps that are waiting Enough threads branching to each path enables the creation of full new warps Warp X Warp Z Warp Y

111 Dynamic Warp Formation/Merging
Idea: Dynamically merge threads executing the same instruction (after branch divergence) Fung et al., “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow,” MICRO 2007. Branch Path A Path B Branch Path A

112 Dynamic Warp Formation Example
Execution of Warp x at Basic Block A Execution of Warp y Legend A B x/1110 y/0011 C x/1000 D x/0110 F x/0001 y/0010 y/0001 y/1100 A new warp created from scalar threads of both Warp x and y executing at Basic Block D D E x/1110 y/0011 G x/1111 y/1111 A A B B C C D D E E F F G G A A Baseline Time Dynamic Warp Formation A A B B C D E E F G G A A Time Slide credit: Tor Aamodt

113 Hardware Constraints Limit Flexibility of Warp Grouping
Functional Unit Lane Registers for each Thread Registers for thread IDs 0, 4, 8, … Registers for thread IDs 1, 5, 9, … Registers for thread IDs 2, 6, 10, … Registers for thread IDs 3, 7, 11, … Can you move any thread flexibly to any lane? Memory Subsystem Slide credit: Krste Asanovic

114 When You Group Threads Dynamically
存储器访问如何处理? 固定模式的存储器访问相对简单,当动态构成warp 时,使得访问模式具有随机性,使得问题变得复杂。  降低存储器访问的局部性  导致存储器带宽利用率的下降

115 What About Memory Divergence?
现代 GPUs 包括高速缓存,减少对存储器的访问 Ideally: 一个warp中的所有线程的存储器访问都命 中 (互相没有冲突) Problem: 一个Warp中有些命中,有些失效 Problem: 一个线程的stall导致整个warp停顿 需要有相关技术来解决存储器发散访问问题

116 NVIDIA GeForce GTX 285 NVIDIA-speak: Generic speak:
240 stream processors “SIMT execution” Generic speak: 30 cores 8 SIMD functional units per core Slide credit: Kayvon Fatahalian

117 NVIDIA GeForce GTX 285 “core”
64 KB of storage for thread contexts (registers) 30 * 32 * 32 = 30 * 1024 = 30K fragments 64KB register file = bit registers per thread = 64B (1/32 that of LRB) 16KB of shared scratch 80KB / core available to software = SIMD functional unit, control shared across 8 units = instruction stream decode = multiply-add = execution context storage = multiply Slide credit: Kayvon Fatahalian

118 NVIDIA GeForce GTX 285 “core”
64 KB of storage for thread contexts (registers) To get maximal latency hiding: Run 1/32 of the time 16 words per thread = 64B Groups of 32 threads share instruction stream (each group is a Warp) Up to 32 warps are simultaneously interleaved Up to 1024 thread contexts can be stored Slide credit: Kayvon Fatahalian

119 30 cores on the GTX 285: 30,720 threads
NVIDIA GeForce GTX 285 Tex Tex Tex Tex Tex Tex Tex Tex If you’re running a CUDA program, and your not launching 30K threads, you are certainly not getting full latency hiding, and you might not be using the GPU well Tex Tex 30 cores on the GTX 285: 30,720 threads Slide credit: Kayvon Fatahalian

120 GPU Readings Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro 2008. Fatahalian and Houston, “A Closer Look at GPUs,” CACM Narasiman et al., “Improving GPU Performance via Large Warps and Two-Level Warp Scheduling,” MICRO 2011. Fung et al., “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow,” MICRO 2007. Jog et al., “Orchestrated Scheduling and Prefetching for GPGPUs,” ISCA 2013.

121 Conditionals in SIMT model
简单的 if-then-else 编译为谓词(有条件)执行 等价于在向量屏蔽寄存器的作用下的向量运算 比较复杂的控制流编译生成分支 如何处理分支? µT0 µT1 µT2 µT3 µT4 µT5 µT6 µT7 tid=threadid Scalar instruction stream If (tid >= n) skip Call func1 add st y skip: SIMD execution across warp 2018/11/16 计算机体系结构

122 Conditional Branching
The University of Adelaide, School of Computer Science 16 November 2018 Conditional Branching 与向量结构类似, GPU 使用内部的屏蔽字(masks) 还使用了 分支同步堆栈 保存分支成功的路径地址 保存该路径的每个SIMD lane 的屏蔽字(mask) 即指示哪些车道可以提交结果 (all threads execute) 指令标记(instruction markers) 管理何时分支到多个执行路径,何时路径汇合 PTX层 CUDA线程的控制流由PTX分支指令(branch、call、return and exit) 由程序员指定的每个线程车道的1-bit谓词寄存器 GPU硬件指令层,控制流包括: 分支指令(branch,jump call return) 特殊的指令用于管理分支同步栈 GPU硬件为每个SIMD thread 提供堆栈保存分支成功的路径 GPU硬件指令带有控制每个线程车道1-bit谓词寄存器 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

123 Branch divergence 硬件跟踪各µthreads转移的方向(判定哪些是成功的转移,哪 些是失败的转移)
如果所有线程所走的路径相同,那么可以保持这种 SIMD fashion 如果各线程选择的方向不一致,那么创建一个屏蔽(mask) 向量来指示各线程的转移方向(成功、失败) 继续执行分支失败的路径,将分支成功的路径压入硬件堆栈 (分支同步堆栈),待后续执行 SIMD 车道何时执行分支同步堆栈中的路径? 通过执行pop操作,弹出执行路径以及屏蔽字,执行转移成功路径 SIMD lane完成整个分支路径执行后再执行下一条指令 称为 converge(汇聚) 2018/11/16 计算机体系结构

124 The University of Adelaide, School of Computer Science
16 November 2018 Example if (X[i] != 0) X[i] = X[i] – Y[i]; else X[i] = Z[i]; ld.global.f64 RD0, [X+R8] ; RD0 = X[i] setp.neq.s32 P1, RD0, #0 ; P1 is predicate register 1 @!P1, bra ELSE1, *Push ; Push old mask, set new mask bits ; if P1 false, go to ELSE1 ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i] sub.f64 RD0, RD0, RD2 ; Difference in RD0 st.global.f64 [X+R8], RD0 ; X[i] = RD0 @P1, bra ENDIF1, *Comp ; complement mask bits ; if P1 true, go to ENDIF1 ELSE1: ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i] st.global.f64 [X+R8], RD0 ; X[i] = RD0 ENDIF1: <next instruction>, *Pop ; pop to restore old mask 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

125 Warps are multithreaded on core
一个 warp 由 32个 µthreads 构成 多个warp线程在单个核上交叉运 行,以隐藏存储器访问呢和功能 部件的延迟 单个线程块包含多个warp (up to 512 µT max in CUDA), 这些warp都 映射到同一个核上 多个线程块也可以在同一个核上 运行 2018/11/16 [Nvidia, 2010] 计算机体系结构

126 SIMT Warp Execution in the SM
Warp: 一组32个CUDA 线程,构成一条SIMD指令 SIMT: Single-Instruction Multi-Thread 建立适用于warp的并行线程的指令 SM 双发射流水线选择两个warp 发 射到并行的核上 SIMT warp 执行warp中的指令 (SIMD指令)(32个CUDA thread) 通过谓词寄存器控制单个线程 (CUDA thread ) 执行结果 (enable/disable) 通过同步堆栈管理线程的分支 通过规整的一致的计算来处理分 支速度高于非规整的分支转移 2018/11/16 计算机体系结构

127 2018/11/16 计算机体系结构

128 Nvidia Fermi GF100 GPU [Nvidia, 2010] 2018/11/16 计算机体系结构

129 Fermi “Streaming Multiprocessor” Core
2018/11/16 计算机体系结构

130 Fermi Dual-Issue Warp Scheduler
2018/11/16 计算机体系结构

131 Apple A5X Processor for iPad v3 (2012)
12.90mm x 12.79mm 45nm technology 2018/11/16 计算机体系结构 [Source: Chipworks, 2012]

132 Loop-Level Parallelism
The University of Adelaide, School of Computer Science 16 November 2018 Loop-Level Parallelism 研究的焦点问题是:在循环结构中,后面的循环是否依 赖于前面的循环 Loop-carried dependence Example 1: for (i=999; i>=0; i=i-1) x[i] = x[i] + s; 循环级并行:No loop-carried dependence 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

133 Loop-Level Parallelism
The University of Adelaide, School of Computer Science 16 November 2018 Loop-Level Parallelism Example 2: for (i=0; i<100; i=i+1) { A[i+1] = A[i] + C[i]; /* S1 */ B[i+1] = B[i] + A[i+1]; /* S2 */ } S1 and S2 均使用了前一次循环的结果 S2 使用同一循环中的S1的结果 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

134 Loop-Level Parallelism
The University of Adelaide, School of Computer Science 16 November 2018 Loop-Level Parallelism Example 3: for (i=0; i<100; i=i+1) { A[i] = A[i] + B[i]; /* S1 */ B[i+1] = C[i] + D[i]; /* S2 */ } S1 使用了S2前一次循环的结果,但数据依赖没有形成环路,也是循环 级并行。 Transform to: A[0] = A[0] + B[0]; for (i=0; i<99; i=i+1) { B[i+1] = C[i] + D[i]; A[i+1] = A[i+1] + B[i+1]; B[100] = C[99] + D[99]; 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

135 Loop-Level Parallelism
The University of Adelaide, School of Computer Science 16 November 2018 Loop-Level Parallelism Example 4: for (i=0;i<100;i=i+1) { A[i] = B[i] + C[i]; D[i] = A[i] * E[i]; } Example 5: for (i=1;i<100;i=i+1) { Y[i] = Y[i-1] + Y[i]; //递推关系 通常循环间相关表现形式为:递推关系 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

136 The University of Adelaide, School of Computer Science
16 November 2018 Finding dependencies Assume indices are affine: a x i + b (i is loop index) Assume: Store to a x i + b, then Load from c x i + d i runs from m to n Dependence exists if: Given j, k such that m ≤ j ≤ n, m ≤ k ≤ n Store to a x j + b, load from a x k + d, and a x j + b = c x k + d 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

137 The University of Adelaide, School of Computer Science
16 November 2018 Finding dependencies 通常在编译时无法确定是否相关 数据依赖关系的测试 GCD test: 如果存在相关,那么 GCD(c,a) 必须能整除 (d-b) Example: for (i=0; i<100; i=i+1) { X[2*i+3] = X[2*i] * 5.0; } 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

138 The University of Adelaide, School of Computer Science
16 November 2018 Finding dependencies Example 2: for (i=0; i<100; i=i+1) { Y[i] = X[i] / c; /* S1 */ X[i] = X[i] + c; /* S2 */ Z[i] = Y[i] + c; /* S3 */ Y[i] = c - Y[i]; /* S4 */ } 反相关和输出相关 通过重命名消除这两类相关 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

139 The University of Adelaide, School of Computer Science
16 November 2018 Reductions Reduction Operation: for (i=9999; i>=0; i=i-1) sum = sum + x[i] * y[i]; Transform to… sum [i] = x[i] * y[i]; finalsum = finalsum + sum[i]; //该循环存在循环间相关 Do on p processors: for (i=999; i>=0; i=i-1) finalsum[p] = finalsum[p] + sum[i+1000*p]; Note: assumes associativity! 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

140 GPU Future High-end desktops have separate GPU chip, but trend towards integrating GPU on same die as CPU (already in laptops, tablets and smartphones) Advantage is shared memory with CPU, no need to transfer data Disadvantage is reduced memory bandwidth compared to dedicated smaller-capacity specialized memory system Graphics DRAM (GDDR) versus regular DRAM (DDR3) Will GP-GPU survive? Or will improvements in CPU DLP make GP-GPU redundant? On same die, CPU and GPU should have same memory bandwidth GPU might have more FLOPS as needed for graphics anyway 2018/11/16 计算机体系结构

141 Acknowledgements These slides contain material developed and copyright by: John Kubiatowicz (UCB) Krste Asanovic (UCB) David Patterson (UCB) Chenxi Zhang (Tongji) UCB material derived from course CS152、CS252 CMU Introduction to Computer Architecture 2018/11/16 计算机体系结构


Download ppt "周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学 2018/11/16 计算机体系结构 周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学."

Similar presentations


Ads by Google