周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学 2018/11/16 计算机体系结构周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学.

周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学
2018/11/16 计算机体系结构周学海 , 中国科学技术大学

第6章 Data-Level Parallelism in Vector, SIMD, and GPU Architectures
向量体系结构多媒体SIMD指令集扩展图形处理单元 GPU 2018/11/16 计算机体系结构

传统指令级并行技术的问题挖掘ILP的传统方法的主要缺陷:
提高流水线的时钟频率: 提高时钟频率，有时导致CPI随着增加 (branches, other hazards) 指令预取和译码: 有时在每个时钟周期很难预取和译码多条指令提高Cache命中率 : 在有些计算量较大的应用中（科学计算）需要大量的数据，其局部性较差，有些程序处理的是连续的媒体流(multimedia),其局部性也较差。 2018/11/16 计算机体系结构

The University of Adelaide, School of Computer Science
16 November 2018 Introduction SIMD 结构可有效地挖掘数据级并行: 基于矩阵运算的科学计算图像和声音处理 SIMD比MIMD更节能针对每组数据操作仅需要取指一次 SIMD对PMD( personal mobile devices)更具吸引力 SIMD 允许程序员继续以串行模式思维 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

16 November 2018 SIMD Parallelism 向量体系结构多媒体SIMD指令集扩展 Graphics Processor Units (GPUs) For x86 processors: 每年增加2cores/chip SIMD 宽度每4年翻一番 SIMD潜在加速比是MIMD的2倍 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

2018/11/16 计算机体系结构

Supercomputers Supercomputer的定义: 由Seymour Cray设计的机器
对于给定任务而言世界上最快的机器任何造价超过3千万美元的机器计算能力达到每秒万亿次的机器由Seymour Cray设计的机器 CDC6600 (Cray, 1964) 被认为是第一台超级计算机 2018/11/16 计算机体系结构

CDC 6600 Seymour Cray, 1963 A fast pipelined machine with 60-bit words
128 Kword main memory capacity, 32 banks Ten functional units (parallel, unpipelined) Floating Point: adder, 2 multipliers, divider Integer: adder, 2 incrementers, ... Hardwired control (no microcoding) Scoreboard for dynamic scheduling of instructions Ten Peripheral Processors for Input/Output a fast multi-threaded 12-bit integer ALU Very fast clock, 10 MHz (FP add in 4 clocks) >400,000 transistors, 750 sq. ft., 5 tons, 150 kW, novel freon-based technology for cooling Fastest machine in world for 5 years (until 7600) over 100 sold ($7-10M each) 新颖的制冷技术:弗里昂制冷第一次使用:动态调度技术 2018/11/16 计算机体系结构 CS252 S05

IBM Memo on CDC6600 Thomas Watson Jr., IBM CEO, August 1963: “Last week, Control Data ... announced the 6600 system. I understand that in the laboratory developing the system there are only 34 people including the janitor. Of these, 14 are engineers and 4 are programmers... Contrasting this modest effort with our vast development activities, I fail to understand why we have lost our industry leadership position by letting someone else offer the world's most powerful computer.” To which Cray replied: “It seems like Mr. Watson has answered his own question.” 2018/11/16 计算机体系结构 CS252 S05

Supercomputer Applications
典型应用领域军事研究领域（核武器研制、密码学）科学研究天气预报石油勘探工业设计 (car crash simulation) 生物信息学密码学均涉及大量的数据集处理 70-80年代Supercomputer = Vector Machine 2018/11/16 计算机体系结构

Alternative Model:Vector Processing
向量处理机具有更高层次的操作，一条向量指令可以处理N个或N对操作数（处理对象是向量） + r1 r2 r3 add r3, r1, r2 SCALAR (1 operation) v1 v2 v3 vector length add.vv v3, v1, v2 VECTOR (N operations) 2018/11/16 计算机体系结构 25

向量处理机的基本特性基本思想：两个向量的对应分量进行运算，产生一个结果向量。
简单的一条向量指令包含了多个操作=> fewer instruction fetches 每一结果独立于前面的结果长流水线，编译器保证操作间没有相关性硬件仅需检测两条向量指令间的相关性较高的时钟频率向量指令以已知的模式访问存储器可有效发挥多体交叉存储器的优势可通过重叠减少存储器操作的延时 64 elements 不需要数据Cache! (仅使用指令cache) 在流水线控制中减少了控制相关 2018/11/16 计算机体系结构

向量处理机的基本结构 memory-memory vector processors: 所有的向量操作是存储器到存储器
vector-register processors: 除了load 和store操作外，所有的操作是向量寄存器与向量寄存器间的操作向量机的Load/Store结构 1980年以后的所有的向量处理机都是这种结构: Cray, Convex, Fujitsu, Hitachi, NEC 我们也主要针对这种结构 2018/11/16 计算机体系结构

Vector Memory-Memory versus Vector Register Machines
存储器-存储器型向量机所有指令操作的操作数来源于存储器第一台向量机 CDC Star-100 (‘73) and TI ASC (‘71), 是存储器-存储器型机器 Cray-1 (’76) 是第一台寄存器型向量机 ADDV C, A, B SUBV D, A, B Vector Memory-Memory Code for (i=0; i<N; i++) { C[i] = A[i] + B[i]; D[i] = A[i] - B[i]; } Example Source Code LV V1, A LV V2, B ADDV V3, V1, V2 SV V3, C SUBV V4, V1, V2 SV V4, D Vector Register Code 2018/11/16 计算机体系结构

Vector Memory-Memory vs. Vector Register Machines
存储器-存储器型向量机 (VMMA) 需要更高的存储器带宽 All operands must be read in and out of memory VMMA结构使得多个向量操作重叠执行更困难 Must check dependencies on memory addresses VMMA启动时间更长 CDC Star-100 在向量元素小于100时，标量代码的性能高于向量化代码 For Cray-1, vector/scalar 均衡点在2个元素 CDC Cray-1后续的机器 (Cyber-205, ETA-10) 都是寄存器型向量机 2018/11/16 计算机体系结构

向量处理机的基本组成单元 Vector Register: 固定长度的一块区域，存放单个向量
至少2个读端口和一个写端口（一般最少16个读端口，8个写端口）典型的有8-32 向量寄存器，每个寄存器存放64到128个64位的元素 Vector Functional Units (FUs): 全流水化的，每一个clock启动一个新的操作一般4到8个FUs: FP add, FP mult, FP reciprocal (1/X), integer add, logical, shift; 可能有些重复设置的部件 Vector Load-Store Units (LSUs): 全流水化地load 或store一个向量，可能会配置多个LSU部件 Scalar registers: 存放单个元素用于标量处理或存储地址用交叉开关连接(Cross-bar) FUs , LSUs, registers 2018/11/16 计算机体系结构

2018/11/16 计算机体系结构

Vector Supercomputers
Cray-1的变体（1976）: Scalar Unit：Load/Store Architecture Vector Extension Vector Registers Vector Instructions Implementation 硬布线逻辑控制高效流水化的功能部件多体交叉存储系统无Data Cache 不支持 Virtual Memory 2018/11/16 计算机体系结构

Vector Instruction Set Advantages
格式紧凑一条指令包含N个操作表达能力强, 一条指令能告诉硬件: N个操作之间无相关性使用同样的功能部件访问不相交的寄存器与前面的操作以相同模式访问寄存器访问存储器中的连续块 (unit-stride load/store) 以已知的模式访问存储器 (strided load/store) 可扩展性好可以在多个并行的流水线上运行同样的代码 (lanes) 2018/11/16 计算机体系结构

Vector Arithmetic Execution
使用较深的流水线(=> fast clock) 执行向量元素的操作由于向量元素相互独立，简化了深度流水线的控制 (=> no hazards!) V1 V2 V3 Six stage multiply pipeline V3 <- v1 * v2 2018/11/16 计算机体系结构

Vector Instruction Execution
ADDV C,A,B C[1] C[2] C[0] A[3] B[3] A[4] B[4] A[5] B[5] A[6] B[6] 使用一条流水化的功能部件执行 C[4] C[8] C[0] A[12] B[12] A[16] B[16] A[20] B[20] A[24] B[24] C[5] C[9] C[1] A[13] B[13] A[17] B[17] A[21] B[21] A[25] B[25] C[6] C[10] C[2] A[14] B[14] A[18] B[18] A[22] B[22] A[26] B[26] C[7] C[11] C[3] A[15] B[15] A[19] B[19] A[23] B[23] A[27] B[27] 使用4条流水化的功能部件执行 2018/11/16 计算机体系结构

Interleaved Vector Memory System
Cray-1, 16 banks, 4 cycle bank busy time, 12 cycle latency Bank busy time: Time before bank ready to accept next request 1 2 3 4 5 6 7 8 9 A B C D E F + Base Stride Vector Registers Memory Banks Address Generator 2018/11/16 计算机体系结构

Vector Unit Structure Functional Unit Vector Registers Lane
Elements 0, 4, 8, … Elements 1, 5, 9, … Elements 2, 6, 10, … Elements 3, 7, 11, … Lane Memory Subsystem 2018/11/16 计算机体系结构

T0 Vector Microprocessor (UCB/ICSI, 1995)
Vector register elements striped over lanes [0] [8] [16] [24] [1] [9] [17] [25] [2] [10] [18] [26] [3] [11] [19] [27] [4] [12] [20] [28] [5] [13] [21] [29] [6] [14] [22] [30] [7] [15] [23] [31] Lane 2018/11/16 计算机体系结构

Vector Instruction Parallelism
多条向量指令可重叠执行(链接技术) 例如：每个向量 32 个元素，8 lanes（车道） Load Unit Multiply Unit Add Unit load mul add time load mul add Instruction issue Complete 24 operations/cycle while issuing 1 short instruction/cycle 2018/11/16 计算机体系结构

“DLXV” Vector Instructions
Instr. Operands Operation Comment ADDV V1,V2,V3 V1=V2+V3 vector + vector ADDSV V1,F0,V2 V1=F0+V2 scalar + vector MULTV V1,V2,V3 V1=V2xV3 vector x vector MULSV V1,F0,V2 V1=F0xV2 scalar x vector LV V1,R1 V1=M[R1..R1+63] load, stride=1 LVWS V1,R1,R2 V1=M[R1..R1+63*R2] load, stride=R2 LVI V1,R1,V2 V1=M[R1+V2i,i=0..63] indir.("gather") CeqV VM,V1,V2 VMASKi = (V1i=V2i)? comp. setmask MOV VLR,R1 Vec. Len. Reg. = R1 set vector length MOV VM,R1 Vec. Mask = R1 set vector mask 2018/11/16 计算机体系结构

05/13-review: Multithreading
Simultaneous Multithreading Time (processor cycle) Superscalar Fine-Grained Coarse-Grained Multiprocessing Thread 1 Thread 3 Thread 5 Thread 2 Thread 4 Idle slot 2018/11/16 计算机体系结构

05/13-Review 向量处理机基本概念向量处理机基本特征向量处理机基本结构向量处理机性能评估
基本思想：两个向量的对应分量进行运算，产生一个结果向量向量处理机基本特征 VSIW-一条指令包含多个操作单条向量指令内所包含的操作相互独立以已知模式访问存储器控制相关少向量处理机基本结构向量处理机性能评估向量指令流执行时间: Convey, Chimes, Start-up time 其他指标： R , N1/2 , NV 2018/11/16 计算机体系结构

Vector Execution Time Time = f(vector length, data dependencies, struct. hazards) Initiation rate: 功能部件消耗向量元素的速率 Convoy: 可在同一时钟周期开始执行的指令集合 (no structural or data hazards) Chime: 执行一个convoy所花费的大致时间（approx. time） m convoys take m chimes; 如果每个向量长度为n, 那么m个convoys 所花费的时间是m个chimes 每个chime所花费的时间是n个clocks，该程序所花费的总时间大约为m x n clock cycles (ignores overhead; good approximization for long vectors) 1: LV V1,Rx ;load vector X 2: MULV V2,F0,V1 ;vector-scalar mult. LV V3,Ry ;load vector Y 3: ADDV V4,V2,V3 ;add 4: SV Ry,V4 ;store the result 4 convoys, 1 lane, VL=64 => 4 x 64 = 256 clocks (or 4 clocks per result) 2018/11/16 计算机体系结构

Vector Startup 向量启动时间由两部分构成功能部件延时：一个操作通过功能部件的时间
截止时间或恢复时间（dead time or recovery time ）：运行下一条向量指令的间隔时间 Functional Unit Latency R X W R X W First Vector Instruction R X W R X W R X W Dead Time R X W R X W R X W Dead Time Second Vector Instruction R X W R X W 2018/11/16 计算机体系结构

Dead Time and Short Vectors
T0, Eight lanes No dead time 100% efficiency with 8 element vectors 4 cycles dead time 64 cycles active Cray C90, Two lanes 4 cycle dead time Maximum efficiency 94% with 128 element vectors 2018/11/16 计算机体系结构

DLXV Start-up Time Start-up time: FU 部件流水线的深度
Operation Start-up penalty (from CRAY-1) Vector load/store Vector multiply Vector add 6 Assume convoys don't overlap; vector length = n Convoy Start st result last result 1. LV n (12+n-1) 2. MULV, LV n n n Multiply startup 12+n n n Load start-up 3. ADDV n n n Wait convoy 2 4. SV n n n Wait convoy 3 2018/11/16 计算机体系结构

Vector Length 当向量的长度不是64时（假设向量寄存器的长度是64）怎么办？
vector-length register (VLR) 控制特定向量操作的长度, 包括向量的load/store. (当然一次操作的向量的长度不能 > 向量寄存器的长度) 例如： do 10 i = 1, n 10 Y(i) = a * X(i) + Y(i) n的值只有在运行时才能知道 n > Max. Vector Length (MVL)怎么办？ 2018/11/16 计算机体系结构

Strip Mining（分段开采）假设Vector Length > Max. Vector Length (MVL)?
Strip mining: 产生新的代码，使得每个向量操作的元素数  MVL 第一次循环做最小片(n mod MVL), 以后按VL = MVL操作 low = VL = (n mod MVL) /*find the odd size piece*/ do 1 j = 0, (n / MVL) /*outer loop*/ do 10 i = low, low+VL-1 /*runs for length VL*/ Y(i) = a*X(i) + Y(i) /*main operation*/ 10 continue low = low+VL /*start of next vector*/ VL = MVL /*reset the length to max*/ 1 continue 2018/11/16 计算机体系结构

Strip Mining的向量执行时间计算
试计算A=B×s，其中A,B为长度为200的向量（每个向量元素占8个字节），s是一个标量。向量寄存器长度为64。各功能部件的启动时间如前所述，求总的执行时间，(Tloop = 15) 2018/11/16 计算机体系结构

ADDI R2,R0,#1600 ;total # bytes in vector ADD R2,R2,Ra ;address of the end of A vector ADDI R1,R0,#8 ;loads length of 1st segment MOVI2S VLR,R1 ;load vector length in VLR ADDI R1,R0,#64 ;length in bytes of 1st segment ADDI R3,R0,#64 ;vector length of other segments Loop: LV V1,Rb ;load B MULSV V2,V1,Fs ;vector * scalar SV Ra,V2 ;store A ADD Ra,Ra,R1 ;address of next segment of A ADD Rb,Rb,R1 ;address of next segment of B ADDI R1,R0,#512 ;load byte offset next segment MOVI2S VLR,R3 ;set length to 64 elements SUB R4,R2,Ra ;at the end of A? BNEZ R4,Loop ;if not, go back 2018/11/16 计算机体系结构

Tstart = = 31 T200 = 660+4*31 = 784 每一元素的执行时间 = 784/200 = 3.9 2018/11/16 计算机体系结构

2018/11/16 计算机体系结构

Common Vector Metrics Rn 表示向量长度为n时的向量流水线的性能
R: 当向量长度为无穷大时的向量流水线的最大性能。常在评价峰值性能时使用，单位为 MFLOPS 实际问题是向量长度不会无穷大，start-up的开销还是比较大的 Rn 表示向量长度为n时的向量流水线的性能 N1/2: 达到R 一半的值所需的向量长度，是评价向量流水线start-up 时间对性能的影响。 NV: 向量流水线方式的工作速度优于标量串行方式工作时所需的向量长度临界值。该参数既衡量建立时间，也衡量标量、向量速度比对性能的影响 2018/11/16 计算机体系结构

Example Vector Machines
Machine Year Clock Regs Elements FUs LSUs Cray MHz Cray XMP MHz L, 1 S Cray YMP MHz L, 1 S Cray C MHz Cray T MHz Conv. C MHz Conv. C MHz Fuj. VP MHz Fuj. VP MHz NEC SX/ MHz 8+8K 256+var 16 8 NEC SX/ MHz 8+8K 256+var 16 8 Cray 1; fastest scalar computer + 1st commercially successful vector computer, offered another 10X 6600 1st scoreboard Cray XMP: 3 LSUs, Multiprocessor 4 way (not by Cray) => YMP, C-90, T-90; 2X processors, 1.5X clock Cray 2 went to DRAM to get more memory, not so great Like parallel teams as Intel (486, PPro, Pentium, next one) Japan Fujitsu, vary number of registers elements (8x1024 or 32x256) NEC, 8x K of varying elements 2018/11/16 计算机体系结构

A Modern Vector Super: NEC SX-9 (2008)
65nm CMOS technology Vector unit (3.2 GHz) 8 foreground VRegs background VRegs (256x64-bit elements/VReg) 64-bit functional units: 2 multiply, 2 add, 1 divide/sqrt, 1 logical, 1 mask unit 8 lanes (32+ FLOPS/cycle, GFLOPS peak per CPU) 1 load or store unit (8 x 8-byte accesses/cycle) Scalar unit (1.6 GHz) 4-way superscalar with out-of- order and speculative execution 64KB I-cache and 64KB data cache Memory system provides 256GB/s DRAM bandwidth per CPU Up to 16 CPUs and up to 1TB DRAM form shared-memory node total of 4TB/s bandwidth to shared DRAM memory Up to 512 nodes connected via 128GB/s network links (message passing between nodes) Picture from NEC article “A hardware overview of SX-6 and SX-7 supercomputer” 2018/11/16 计算机体系结构

Vector Linpack Performance (MFLOPS)
Matrix Inverse (gaussian elimination) Machine Year Clock 100x100 1kx1k Peak(Procs) Cray MHz (1) Cray XMP MHz (4) Cray YMP MHz ,667(8) Cray C MHz ,238(16) Cray T MHz ,600(32) Conv. C MHz (1) Conv. C MHz (4) Fuj. VP MHz (1) NEC SX/ MHz (1) NEC SX/ MHz ,600(4) 6X in 20 years; 32X in 20 years; Peak is 360X speedup Weighed tons 2018/11/16 计算机体系结构

Memory Banking 独立存储体方式：由多个相互独立的存储体（Bank) 构成存储器组织。可独立访问存储体，各存储体共享数据和地址总线 (minimize pin cost) 每个周期可以启动和完成一个bank的访问如果N个存储器访问访问不同的bank可以并行执行 2018/11/16 计算机体系结构

Interleaved Vector Memory System
Cray-1, 16 banks, 4 cycle bank busy time, 12 cycle latency Bank busy time: Time before bank ready to accept next request If stride = 1 & consecutive elements interleaved across banks & number of banks >= bank latency, then can sustain 1 element/cycle throughput 1 2 3 4 5 6 7 8 9 A B C D E F + Base Stride Vector Registers Memory Banks Address Generator 2018/11/16 计算机体系结构

Example(AppF F-15) Suppose we want to fetch a vector of 64 elements starting at byte address 136,and a memory access takes 6 clocks. How many memory banks must we have to support one fetch per clock cycle? With what addresses are the banks accessed? When will the various elements arrive at the CPU? 2018/11/16 计算机体系结构

Vector Stride 假设处理顺序相邻的元素在存储器中不顺序存储。例如 do 10 i = 1,100 do 10 j = 1,100
A(i,j) = 0.0 do 10 k = 1,100 10 A(i,j) = A(i,j)+B(i,k)*C(k,j) B 或 C 的两次访问不会相邻 (相隔800 bytes) stride: 向量中相邻元素间的距离 => LVWS (load vector with stride) instruction Strides => 会导致体冲突 (e.g., stride = 32 and 16 banks) 2018/11/16 计算机体系结构

Memory operations Load/store 操作成组地在寄存器和存储器之间移动数据三类寻址方式
Unit stride (单步长) Fastest Non-unit (constant) stride (常数步长) Indexed (gather-scatter) (间接寻址) 等价于寄存器间接寻址方式对稀疏矩阵有效用于向量化操作的指令增多 2018/11/16 计算机体系结构 32 32

DAXPY (Y = a × X + Y) Scalar vs. Vector
Assuming vectors X, Y are length 64 Scalar vs. Vector LD F0,a ;load scalar a LV V1,Rx ;load vector X MULTS V2,F0,V1 ;vector-scalar mult. LV V3,Ry ;load vector Y ADDV V4,V2,V3 ;add SV Ry,V4 ;store the result LD F0,a ADDI R4,Rx,#512 ;last address to load loop: LD F2, 0(Rx) ;load X(i) MULTD F2,F0,F2 ;a*X(i) LD F4, 0(Ry) ;load Y(i) ADDD F4,F2, F4 ;a*X(i) + Y(i) SD F4 ,0(Ry) ;store into Y(i) ADDI Rx,Rx,#8 ;increment index to X ADDI Ry,Ry,#8 ;increment index to Y SUB R20,R4,Rx ;compute bound BNZ R20,loop ;check if done 578 (2+9*64) vs (1+5*64) ops (1.8X) 578 (2+9*64) vs instructions (96X) 64 operation vectors no loop overhead also 64X fewer pipeline hazards 2018/11/16 计算机体系结构

Vector Opt#1: Vector Chaining
寄存器定向路径的向量机版本首次在Cray-1上使用 Memory V1 Load Unit Mult. V2 V3 Chain Add V4 V5 Chain LV v1 MULV v3,v1,v2 ADDV v5, v3, v4 2018/11/16 计算机体系结构

Vector Chaining Advantage
Load Mul Add Time 不采用链接技术，必须处理完前一条指令的最后一个元素，才能启动下一条相关的指令采用链接技术，前一条指令的第一个结果出来后，就可以启动下一条相关指令的执行 Load Mul Add 2018/11/16 计算机体系结构

Vector Instruction Parallelism
多条向量指令可重叠执行(链接技术) 例如：每个向量 32 个元素，8 lanes（车道） Load Unit Multiply Unit Add Unit load mul add time load mul add Instruction issue Complete 24 operations/cycle while issuing 1 short instruction/cycle 2018/11/16 计算机体系结构

Vector Opt #2: Conditional Execution
Suppose: do 100 i = 1, 64 if (A(i) .ne. 0) then A(i) = A(i) – B(i) endif 100 continue vector-mask control 使用长度为MVL的布尔向量控制向量指令的执行当vector-mask register 使能时，向量指令操作仅对 vector-mask register中对应位为1的分量起作用 2018/11/16 计算机体系结构

Masked Vector Instructions
B[3] A[4] B[4] A[5] B[5] A[6] B[6] M[3]=0 M[4]=1 M[5]=1 M[6]=0 M[2]=0 M[1]=1 M[0]=0 Write data port Write Enable A[7] B[7] M[7]=1 Simple Implementation execute all N operations, turn off result writeback according to mask C[4] C[5] C[1] Write data port A[7] B[7] M[3]=0 M[4]=1 M[5]=1 M[6]=0 M[2]=0 M[1]=1 M[0]=0 M[7]=1 Density-Time Implementation scan mask vector and only execute elements with non-zero masks 2018/11/16 计算机体系结构

LV V1,Ra ; load vector A into V1 LV V2,Rb ; load vector B L
LV V1,Ra ; load vector A into V1 LV V2,Rb ; load vector B L.D F0,#0 ; load FP zero into F0 SNEVS.D V1,F0 ;sets VM(i) to 1 if V1(i)!=F0 SUBV.D V1,V1,V2 ;subtract under vector mask CVM ;set the vector mask to all 1s SV Ra,V1 ;store the result in A 使用vector-mask寄存器的缺陷简单实现时，条件不满足时向量指令仍然需要花费时间有些向量处理器带条件的向量执行仅控制向目标寄存器的写操作，可能会有除法错。 2018/11/16 计算机体系结构

Vector Opt #3: Sparse Matrices
Suppose: do 100 i = 1,n 100 A(K(i)) = A(K(i)) + C(M(i)) gather (LVI) operation 使用index vector 中给出的偏移再加基址来读取 => a nonsparse vector in a vector register 这些元素以密集的方式操作完成后，再使用同样的index vector存储到稀疏矩阵的对应位置这些操作编译时可能无法完成。主要原因：编译器无法预知Ki以及是否有数据相关使用CVI 设置步长（ index 0, 1xm, 2xm, ..., 63xm） CVI gets used under mask 2018/11/16 计算机体系结构

Sparse Matrix Example Cache (1993) vs. Vector (1988)
IBM RS6000 Cray YMP Clock 72 MHz 167 MHz Cache 256 KB 0.25 KB Linpack 140 MFLOPS 160 (1.1) Sparse Matrix 17 MFLOPS 125 (7.3) (Cholesky Blocked ) Cache: 1 address per cache block (32B to 64B) Vector: 1 address per element (4B) 2018/11/16 计算机体系结构

Automatic Code Vectorization
for (i=0; i < N; i++) C[i] = A[i] + B[i]; Vector Instruction load add store Iter. 1 Iter. 2 Vectorized Code Time load add store Iter. 1 Iter. 2 Scalar Sequential Code 向量化是指在编译期间对操作重定序 需要进行大量的循环相关分析 2018/11/16 计算机体系结构

并行的类型指令级并行(ILP) 线程级并行 (TLP) 数据级并行(DLP) Which is easiest to program?
以并行方式执行某个指令流中的独立无关的指令 (pipelining, superscalar, VLIW) 线程级并行 (TLP) 以并行方式执行多个独立的指令流 (multithreading, multiple cores) 数据级并行(DLP) 以并行方式执行多个相同类型的操作 (vector/SIMD execution) Array Processor 、Vector Processor Which is easiest to program? Which is most flexible form of parallelism? i.e., can be used in more situations Which is most efficient? i.e., greatest tasks/second/area, lowest energy/task 2018/11/16 计算机体系结构

DLP的兴起应用需求和技术约束推动着体系结构的发展和选择
图形、机器视觉、语音识别、机器学习等新的应用均需要大量的数值计算，其算法通常具有数据并行 SIMD-based 结构 (vector-SIMD, subword-SIMD, SIMT/GPUs) 是执行这些算法的最有效途径 2018/11/16 计算机体系结构

Vector/SIMD Processing Summary
同样的操作作用于不同的数据元素向量内的元素操作独立可有效提高性能，简化设计性能的提升受限于代码的向量化标量操作限制了向量机的性能 Amdahl’s Law 目前很多ISA包含SIMD操作指令 Intel MMX/SSEn/AVX, PowerPC AltiVec, ARM Advanced SIMD 2018/11/16 计算机体系结构

Array vs. Vector Processors
2018/11/16 计算机体系结构

SIMD Array Processing vs. VLIW
2018/11/16 计算机体系结构

SIMD Array Processing vs. VLIW
Array processor: 单个操作作用在多个不同的数据元素上 2018/11/16 计算机体系结构

Multimedia Extensions (aka SIMD extensions)
64b 32b 16b 8b 在已有的ISA中添加一些向量长度很短的向量操作指令将已有的 64-bit 寄存器拆分为 2x32b or 4x16b or 8x8b 1957年，Lincoln Labs TX-2 将36bit datapath 拆分为2x18b or 4x9b 新的设计具有较宽的寄存器 128b for PowerPC Altivec, Intel SSE2/3/4 256b for Intel AVX 单条指令可实现寄存器中所有向量元素的操作 16b + 4x16b adds 2018/11/16 计算机体系结构

Intel Pentium MMX Operations
idea: 一条指令操作同时作用于不同的数据元全阵列处理用于多媒体操作 No VLEN register Opcode determines data type: 8 8-bit bytes 4 16-bit words 2 32-bit doublewords 1 64-bit quadword Stride always equal to 1. 2018/11/16 计算机体系结构

MMX Example: Image Overlaying (I)
2018/11/16 计算机体系结构

MMX Example: Image Overlaying (II)
2018/11/16 计算机体系结构

Multimedia Extensions versus Vectors
受限的指令集: 无向量长度控制 Load/store操作无常数步长寻址和 scatter/gather操作 loads 操作必须64/128-bit 边界对齐受限的向量寄存器长度: 需要超标量发射以保持multiply/add/load 部件忙通过循环展开隐藏延迟增加了寄存器读写压力在微处理器设计中向全向量化发展更好地支持非对齐存储器访问支持双精度浮点数操作 (64-bit floating-point) Intel AVX spec (announced April 2008), 256b vector registers (expandable up to 1024b) 2018/11/16 计算机体系结构

05/18-Review 多媒体扩展指令 GPU 向量机的存储器访问存储器组织：独立存储体、多体交叉方式
Stride : 固定步长（1 or 常数）, 非固定步长（index）基于向量机模型的优化链接技术有条件执行稀疏矩阵的操作多媒体扩展指令扩展的指令类型较少向量寄存器长度较短 GPU 2018/11/16 计算机体系结构

Recap: Vector/SIMD Processing Summary
同样的操作作用在许多数据元素上提高性能、设计简单（向量内的操作相互独立）性能的提升受限于代码的向量化标量操作限制着向量机的性能很多已有的ISA扩展了一些SIMD操作 Intel MMX/SSEn/AVX, PowerPC AltiVec, ARM Advanced SIMD

Graphics Processing Units (GPUs)
原来的GPU是指带有高性能浮点运算部件、可高效生成3D图形的具有固定功能的专用设备 (mid-late 1990s) 让PC机具有类似工作站的图形功能用户可以配置图形处理流水线，但不是真正的对其编程随着时间的推移，GPU加入了越来越多的可编程性 ( ) 例如新的语言 Cg可用来编写一些小的程序处理图形的顶点或像素，是Windows DirectX的变体大规模并行（针对每帧上百万顶点或像素）但非常受限于编程模型有些用户注意到通过将输入和输出数据映射为图像，并对顶点或像素渲染计算可进行通用计算因为不得不使用图形流水线模型，这对完成通用计算来说是个非常难用的编程模型 2018/11/16 计算机体系结构

General-Purpose GPUs (GP-GPUs)
2006年, Nvidia 的 GeForce 8800 GPU 支持一种新的编程语言: CUDA “Compute Unified Device Architecture” 随后工业界推出OpenCL，与CUDA具有相同的ideas, 但独立于供应商 Idea: 针对通用计算，发挥GPU的计算的高性能和存储器的高带宽来加速一些通用计算中的核心（Kernels）一种附加处理器模型（GPU作为附加设备）：Host CPU发射数据并行的kernels 到GP-GPU上运行我们仅讨论Nvidia CUDA样式的简化版本，仅考虑 GPU的计算核部分，不涉及图形加速部分 2018/11/16 计算机体系结构

Using CPU+GPU Architecture
异构系统（异构多核）针对每个任务选择合适的处理器和存储器通用CPU 适合执行一些串行的线程串行执行快带有cache，访问存储器延时低 GPU 适合执行大量并行的线程可扩放的并行执行高带宽的并行存取 GPU SMem PCIe Bridge Host Memory CPU Cache Device Memory

GPUs are SIMD Engines Underneath
指令流水线类似于SIMD的流水线，但是，用线程来编程而不是SIMD指令两个概念 Programming Model (Software) vs Execution Model (Hardware) 编程模型指程序员如何表达代码例如, 顺序模型 (von Neumann), 数据并行(SIMD), 数据流模型、多线程模型 (MIMD, SPMD), … 执行模型指硬件底层如何执行代码例如, 乱序执行、向量机、数据流处理机、多处理机、多线程处理机等执行模型与编程模型可以差别很大例如., 顺序模型可以在乱序执行的处理器上执行。 SPMD 模型可以用SIMD处理器实现 (a GPU)

How Can You Exploit Parallelism Here?
for (i=0; i < N; i++) C[i] = A[i] + B[i]; load add store Iter. 1 Iter. 2 Scalar Sequential Code Let’s examine three programming options to exploit instruction-level parallelism present in this sequential code: 1. Sequential (SISD) 2. Data-Parallel (SIMD) 3. Multithreaded (MIMD/SPMD)

Prog. Model 1: Sequential (SISD)
for (i=0; i < N; i++) C[i] = A[i] + B[i]; Can be executed on a: Pipelined processor Out-of-order execution processor Independent instructions executed when ready Different iterations are present in the instruction window and can execute in parallel in multiple functional units In other words, the loop is dynamically unrolled by the hardware Superscalar or VLIW processor Can fetch and execute multiple instructions per cycle load add store Iter. 1 Iter. 2 Scalar Sequential Code

Prog. Model 2: Data Parallel (SIMD)
for (i=0; i < N; i++) C[i] = A[i] + B[i]; Vectorized Code load add store Iter. 1 Iter. 2 Scalar Sequential Code Vector Instruction load add store load add store VLD A  V1 VLD B  V2 VADD V1 + V2  V3 VST V3  C Iter. 1 Iter. 2 Realization: Each iteration is independent Idea: Programmer or compiler generates a SIMD instruction to execute the same instruction from all iterations across different data Best executed by a SIMD processor (vector, array)

Prog. Model 3: Multithreaded
for (i=0; i < N; i++) C[i] = A[i] + B[i]; load add store Iter. 1 Iter. 2 Scalar Sequential Code load add store load add store Iter. 1 Iter. 2 Realization: Each iteration is independent Idea: Programmer or compiler generates a thread to execute each iteration. Each thread does the same thing (but on different data) Can be executed on a MIMD machine

Prog. Model 3: Multithreaded
for (i=0; i < N; i++) C[i] = A[i] + B[i]; load add store load add store Iter. 1 Iter. 2 Realization: Each iteration is independent Idea: Programmer or compiler generates a thread to execute each iteration. Each thread does the same thing (but on different data) Can be executed on a MIMD machine This particular model is also called: SPMD: Single Program Multiple Data Can be executed on a SIMT machine Single Instruction Multiple Thread Can be executed on a SIMD machine

A GPU is a SIMD (SIMT) Machine
GPU不是用SIMD指令编程使用线程 (SPMD 编程模型) 每个线程执行同样的代码，但操作不同的数据元素每个线程有自己的上下文(即可以独立地启动/执行等）一组执行相同指令的线程由硬件动态组织成warp 一个warp是由硬件形成的SIMD操作

16 November 2018 Threads and Blocks 一个线程对应一个数据元素大量的线程组织成很多线程块许多线程块组成一个网格 GPU 由硬件对线程进行管理 Thread Block Scheduler SIMD Thread Scheduler 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

Programmer’s View of Execution
blockIdx 0 threadId 0 threadId 1 threadId 255 blockDim = 256 (programmer can choose) 创建足够的线程块以适应输入向量 (Nvidia 中将由多个线程块构成的、在GPU上运行的代码称为Grid，Grid可以是2维的) blockIdx 1 threadId 0 threadId 1 threadId 255 blockIdx (n+255)/256) threadId 0 threadId 1 threadId 255 Conditional (i<n) turns off unused threads in last block 2018/11/16 计算机体系结构

Hardware Execution Model
GPU Core 0 Lane 0 Lane 1 Lane 15 Core 1 Core 15 CPU CPU Memory GPU Memory GPU 由多个并行核构成，每个核包含一个多线程 SIMD处理器（包含多个车道） CPU 发送整个 “grid”到GPU，由GPU将这些线程块分发到多个核上（每个线程块在一个核上运行） GPU上核的数量对程序员而言是透明的 2018/11/16 计算机体系结构

Simplified CUDA Programming Model
计算由大量的相互独立的线程(CUDA threads or microthreads) 完成，这些线程组合成线程块（thread blocks） // C version of DAXPY loop. void daxpy(int n, double a, double*x, double*y) { for (int i=0; i<n; i++) y[i] = a*x[i] + y[i]; } // CUDA version. __host__ // Piece run on host processor. int nblocks = (n+255)/256; // 256 CUDA threads/block daxpy<<<nblocks,256>>>(n,2.0,x,y); __device__ // Piece run on GP-GPU. { int i = blockIdx.x*blockDim.x + threadId.x; if (i<n) y[i]=a*x[i]+y[i]; } 2018/11/16 计算机体系结构

NVIDIA Instruction Set Arch.
The University of Adelaide, School of Computer Science 16 November 2018 NVIDIA Instruction Set Arch. ISA 是硬件指令集的抽象 “Parallel Thread Execution (PTX)” 使用虚拟寄存器用软件将其翻译成机器码 Example: shl.s32 R8, blockIdx, 9 ; Thread Block ID * Block size (512 or 29) add.s32 R8, R8, threadIdx ; R8 = i = my CUDA thread ID ld.global.f64 RD0, [X+R8] ; RD0 = X[i] ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i] mul.f64 R0D, RD0, RD4 ; Product in RD0 = RD0 * RD4 (scalar a) add.f64 R0D, RD0, RD2 ; Sum in RD0 = RD0 + RD2 (Y[i]) st.global.f64 [Y+R8], RD0 ; Y[i] = sum (X[i]*a + Y[i]) 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

CUDA kernel maps to Grid of Blocks
kernel_func<<<nblk, nthread>>>(param, … ); Host Thread Grid of Thread Blocks . . . econve GPU SMs: SMem PCIe Bridge Host Memory CPU Cache Device Memory

Thread blocks execute on an SM Thread instructions execute on a core
float myVar; __shared__ float shVar; __device__ float glVar; Thread Block Per-block Shared Memory Per-app Device Global Registers Per-thread Local Memory GPU SMs: SMem PCIe Bridge Host Memory CPU Cache Device Memory SM: Streaming Multiprocessor PTX: Parallel Thread Execution

CUDA：“Single Instruction, Multiple Thread”
GPUs 使用 SIMT模型, 每个CUDA线程的标量指令流汇聚在一起在硬件上以SIMD方式执行 (Nvidia groups 32 CUDA threads into a warp) µT0 µT1 µT2 µT3 µT4 µT5 µT6 µT7 ld x Scalar instruction stream mul a ld y add st y SIMD execution across warp 2018/11/16 计算机体系结构

GPU Memory Hierarchy [ Nvidia, 2010] 2018/11/16 计算机体系结构

SPMD on SIMT Machine Warp: A set of threads that execute
for (i=0; i < N; i++) C[i] = A[i] + B[i]; load add store load add store Warp 0 at PC X Warp 0 at PC X+1 Warp 0 at PC X+2 Warp 0 at PC X+3 Iter. 1 Iter. 2 Warp: A set of threads that execute the same instruction (i.e., at the same PC) Realization: Each iteration is independent Idea: Programmer or compiler generates a thread to execute each iteration. Each thread does the same thing (but on different data) Can be executed on a MIMD machine This particular model is also called: SPMD: Single Program Multiple Data A GPU executes it using the SIMT model: Single Instruction Multiple Thread Can be executed on a SIMD machine

SIMD vs. SIMT Execution Model
SIMD: 一条指令流（一串顺序的SIMD指令），每条指令对应多个数据输入（向量指令） SIMT: 多个指令流（标量指令）构成线程，这些线程动态构成warp。一个Warp处理多个数据元素 SIMT 主要优点：可以独立地处理线程,即每个线程可以在任何标量流水线上单独执行（ MIMD 处理模式）可以将线程组织成warp，即可以将执行相同指令流的线程构成warp，形成SIMD 处理模式, 以充分发挥SIMD处理的优势

Multithreading of Warps
for (i=0; i < N; i++) C[i] = A[i] + B[i]; 设一个warp由 32 threads构成如果有32K次循环  1K 个warps 这些Warps可以在同一条流水线上交替执行 Fine grained multithreading of warps load add store load add store Warp 1 at PC X Warp 0 at PC X Warp 20 at PC X+2 Iter. 1 Iter. 2 Iter. 33 Iter. 34 Iter. 20*32 + 2 Iter. 20*32 + 1

Warps and Warp-Level FGMT
Warp: The threads that run lengthwise in a woven fabric … Thread Warp 3 Thread Warp 8 Common PC Thread Warp In SIMD, you need to specify the data array + an instruction (on which to operate the data on) + THE INSTRUCTION WIDTH. Eg: You might want to add 2 integer arrays of length 16, then a SIMD instruction would look like (the instruction has been cooked-up by me for demo) add.16 arr1 arr2 However, SIMT doesn't bother about the instruction width. So, essentially, you could write the above example as: arr1[i] + arr2[i] and then launch as many threads as the length of the array, as you want. Note that, if the array size was, let us say, 32, then SIMD EXPECTS you to explicitly call two such 'add.16' instructions! Whereas, this is not the case with SIMT. Scalar Scalar Scalar Scalar Thread Warp 7 Thread Thread Thread Thread W X Y Z SIMD Pipeline

High-Level View of a GPU

Latency Hiding via Warp-Level FGMT
Fine-grained multithreading 流水线上每次执行每个线程的一条指令 (No interlocking) 通过warp的交叉执行来隐藏延时所有线程的寄存器值保存在寄存器文件中 FGMT 可以容忍长延时 Millions of pixels Decode R F A L U D-Cache Thread Warp 6 Thread Warp 1 Thread Warp 2 Data All Hit? Miss? Warps accessing memory hierarchy Thread Warp 3 Thread Warp 8 Writeback Warps available for scheduling Thread Warp 7 I-Fetch SIMD Pipeline With a large number of shader threads multiplexed on the same execution re- sources, our architecture employs fine-grained multithreading where individual threads are interleaved by the fetch unit to proactively hide the potential latency of stalls before they occur. As illustrated by Figure, warps are issued fairly in a round-robin queue. When a thread is blocked by a memory request, shader core simply removes that thread’s warp from the pool of “ready” warps and thereby allows other threads to proceed while the memory system processes its request. With a large number of threads (1024 per shader core) interleaved on the same pipeline, FGMT effectively hides the latency of most memory operations since the pipeline is occupied with instructions from other threads while memory operations complete. also hides the pipeline latency so that data bypassing logic can potentially be omitted to save area with minimal impact on performance. simplify the dependency check logic design by restricting each thread to have at most one instruction running in the pipeline at any time. Slide credit: Tor Aamodt

Warp Execution (Recall the Slide)
32-thread warp executing ADD A[tid],B[tid]  C[tid] C[1] C[2] C[0] A[3] B[3] A[4] B[4] A[5] B[5] A[6] B[6] Execution using one pipelined functional unit C[4] C[8] C[0] A[12] B[12] A[16] B[16] A[20] B[20] A[24] B[24] C[5] C[9] C[1] A[13] B[13] A[17] B[17] A[21] B[21] A[25] B[25] C[6] C[10] C[2] A[14] B[14] A[18] B[18] A[22] B[22] A[26] B[26] C[7] C[11] C[3] A[15] B[15] A[19] B[19] A[23] B[23] A[27] B[27] Execution using four pipelined functional units Slide credit: Krste Asanovic

SIMD Execution Unit Structure
Functional Unit Lane Registers for each Thread Registers for thread IDs 0, 4, 8, … Registers for thread IDs 1, 5, 9, … Registers for thread IDs 2, 6, 10, … Registers for thread IDs 3, 7, 11, … Memory Subsystem Slide credit: Krste Asanovic

Warp Instruction Level Parallelism
Can overlap execution of multiple instructions Example machine has 32 threads per warp and 8 lanes Completes 24 operations/cycle while issuing 1 warp/cycle Load Unit Multiply Unit Add Unit W0 W1 W2 time W3 W4 W5 Warp issue Slide credit: Krste Asanovic

SIMT Memory Access 不同线程的相同指令使用线程id来存取不同的数据元素 +
Let’s assume N=16, 4 threads per warp  4 warps 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Threads + 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Data elements + + + + Warp 0 Warp 1 Warp 2 Warp 3 Slide credit: Hyesoon Kim

Sample GPU SIMT Code (Simplified)
CPU code for (ii = 0; ii < ; ++ii) { C[ii] = A[ii] + B[ii]; } CUDA code // there are threads __global__ void KernelFunction(…) { int tid = blockDim.x * blockIdx.x + threadIdx.x; int varA = aa[tid]; int varB = bb[tid]; C[tid] = varA + varB; } Slide credit: Hyesoon Kim

Sample GPU Program (Less Simplified)
Slide credit: Hyesoon Kim

Warp-based SIMD vs. Traditional SIMD
Lock step: 一条向量指令必须执行完然后启动下一条向量指令编程模型为SIMD (no extra threads)  SW 需要知道向量长度 ISA 包含vector/SIMD指令 Warp-based SIMD 由多个标量线程构成，以SIMD方式执行 (即所有的线程执行相同的指令) 不需要lock step 每个线程可以单独对待(即可以映射到不同的warp中），编程模型不是 SIMD SW 不必知道向量长度多个线程可以动态构成warp ISA是标量ISA  可以动态形成逻辑上的向量指令是一种在SIMD硬件上实现的SPMD编程模型

SPMD Single procedure/program, multiple data 多条指令流执行相同的程序
它是一种编程模型而不是计算机组织每个处理单元执行同样的过程，处理不同的数据这些过程可以在程序中的某个点上同步，例如 e.g. barriers 多条指令流执行相同的程序每个程序/过程操作不同的数据运行时可以执行不同的控制流路径许多科学计算应用以这种方式编程运行在MIMD硬件结构上 (multiprocessors) 现代 GPUs 以这种类似的方式编程运行在SIMD硬件上

SIMD vs. SIMT Execution Model
SIMD: 一条指令流（一串顺序的SIMD指令），每条指令对应多个数据输入（向量指令） SIMT: 多个指令流（标量指令）构成线程，这些线程动态构成warp。一个Warp处理多个数据元素 SIMT 主要优点：可以独立地处理线程,即每个线程可以在任何标量流水线上单独执行（ MIMD processing）可以将线程组织成warp，即可以将执行相同指令流的线程构成warp，形成SIMD processing, 以充分发挥SIMD处理的优势

Threads Can Take Different Paths in Warp-based SIMD
每个线程可以包含控制流指令这些线程可以执行不同的控制流路径 B C D E F A G Thread Warp Common PC Thread 1 Thread 2 Thread 3 Thread 4 Slide credit: Tor Aamodt

Control Flow Problem in GPUs/SIMT
GPU 控制逻辑使用 SIMD 流水线以节省资源这些标量线程构成warp 当一个WARP中的线程分支到不同的执行路径时，产生分支发散（ Branch divergence） Branch Path A Path B Branch Path A Path B 与向量处理机模型的条件执行相同（Vector Mask and Masked Vector Operations?） Slide credit: Tor Aamodt

Branch Divergence Handling (I)
Idea: Dynamic predicated (conditional) execution Reconv. PC Next PC Active Mask Stack B C D E F A G A/1111 E D 0110 1001 TOS - 1111 E D 0110 C 1001 TOS - 1111 - B 1111 TOS - A 1111 TOS - E 1111 TOS E D 0110 TOS - 1111 - G 1111 TOS B/1111 C/1001 D/0110 Thread Warp Common PC Thread 2 3 4 1 E/1111 Reconv. PC : 重新汇聚地址 G/1111 A B C D E G A Time Slide credit: Tor Aamodt

Branch Divergence Handling (II)
if (some condition) { B; } else { C; } D; One per warp Control Flow Stack Next PC Recv PC Active Mask TOS A -- 1111 D -- 1111 B D 1110 A C D 0001 D Execution Sequence 1 A 1 C 1 B 1 D B C D Time Slide credit: Tor Aamodt

Remember: Each Thread Is Independent
SIMT 主要优点：可以独立地处理线程,即每个线程可以在任何标量流水线上单独执行（ MIMD 处理模式）可以将线程组织成warp，即可以将执行相同指令流的线程构成warp，形成SIMD 处理模式, 以充分发挥SIMD处理的优势如果有许多线程，对具有相同PC值的线程可以将它们动态组织到一个warp中这样可以减少“分支发散” 提高SIMD 利用率 SIMD 利用率: 执行有用操作的SIMD lanes的比例 (即, 执行活动线程的比例）

Dynamic Warp Formation/Merging
Idea: Dynamically merge threads executing the same instruction (after branch divergence) Form new warps from warps that are waiting Enough threads branching to each path enables the creation of full new warps Warp X Warp Z Warp Y

Dynamic Warp Formation/Merging
Idea: Dynamically merge threads executing the same instruction (after branch divergence) Fung et al., “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow,” MICRO 2007. Branch Path A Path B Branch Path A

Dynamic Warp Formation Example
Execution of Warp x at Basic Block A Execution of Warp y Legend A B x/1110 y/0011 C x/1000 D x/0110 F x/0001 y/0010 y/0001 y/1100 A new warp created from scalar threads of both Warp x and y executing at Basic Block D D E x/1110 y/0011 G x/1111 y/1111 A A B B C C D D E E F F G G A A Baseline Time Dynamic Warp Formation A A B B C D E E F G G A A Time Slide credit: Tor Aamodt

Hardware Constraints Limit Flexibility of Warp Grouping
Functional Unit Lane Registers for each Thread Registers for thread IDs 0, 4, 8, … Registers for thread IDs 1, 5, 9, … Registers for thread IDs 2, 6, 10, … Registers for thread IDs 3, 7, 11, … Can you move any thread flexibly to any lane? Memory Subsystem Slide credit: Krste Asanovic

When You Group Threads Dynamically
存储器访问如何处理？固定模式的存储器访问相对简单，当动态构成warp 时，使得访问模式具有随机性，使得问题变得复杂。  降低存储器访问的局部性  导致存储器带宽利用率的下降

What About Memory Divergence?
现代 GPUs 包括高速缓存，减少对存储器的访问 Ideally: 一个warp中的所有线程的存储器访问都命中 (互相没有冲突） Problem: 一个Warp中有些命中，有些失效 Problem: 一个线程的stall导致整个warp停顿需要有相关技术来解决存储器发散访问问题

NVIDIA GeForce GTX 285 NVIDIA-speak: Generic speak:
240 stream processors “SIMT execution” Generic speak: 30 cores 8 SIMD functional units per core Slide credit: Kayvon Fatahalian

NVIDIA GeForce GTX 285 “core”
… 64 KB of storage for thread contexts (registers) 30 * 32 * 32 = 30 * 1024 = 30K fragments 64KB register file = bit registers per thread = 64B (1/32 that of LRB) 16KB of shared scratch 80KB / core available to software = SIMD functional unit, control shared across 8 units = instruction stream decode = multiply-add = execution context storage = multiply Slide credit: Kayvon Fatahalian

NVIDIA GeForce GTX 285 “core”
… 64 KB of storage for thread contexts (registers) To get maximal latency hiding: Run 1/32 of the time 16 words per thread = 64B Groups of 32 threads share instruction stream (each group is a Warp) Up to 32 warps are simultaneously interleaved Up to 1024 thread contexts can be stored Slide credit: Kayvon Fatahalian

30 cores on the GTX 285: 30,720 threads
NVIDIA GeForce GTX 285 Tex Tex … … … … Tex Tex … … Tex Tex … … Tex Tex If you’re running a CUDA program, and your not launching 30K threads, you are certainly not getting full latency hiding, and you might not be using the GPU well … … Tex Tex … … 30 cores on the GTX 285: 30,720 threads Slide credit: Kayvon Fatahalian

GPU Readings Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro 2008. Fatahalian and Houston, “A Closer Look at GPUs,” CACM Narasiman et al., “Improving GPU Performance via Large Warps and Two-Level Warp Scheduling,” MICRO 2011. Fung et al., “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow,” MICRO 2007. Jog et al., “Orchestrated Scheduling and Prefetching for GPGPUs,” ISCA 2013.

Conditionals in SIMT model
简单的 if-then-else 编译为谓词（有条件）执行等价于在向量屏蔽寄存器的作用下的向量运算比较复杂的控制流编译生成分支如何处理分支？ µT0 µT1 µT2 µT3 µT4 µT5 µT6 µT7 tid=threadid Scalar instruction stream If (tid >= n) skip Call func1 add st y skip: SIMD execution across warp 2018/11/16 计算机体系结构

Conditional Branching
The University of Adelaide, School of Computer Science 16 November 2018 Conditional Branching 与向量结构类似, GPU 使用内部的屏蔽字(masks) 还使用了分支同步堆栈保存分支成功的路径地址保存该路径的每个SIMD lane 的屏蔽字(mask) 即指示哪些车道可以提交结果 (all threads execute) 指令标记(instruction markers）管理何时分支到多个执行路径，何时路径汇合 PTX层 CUDA线程的控制流由PTX分支指令(branch、call、return and exit）由程序员指定的每个线程车道的1-bit谓词寄存器 GPU硬件指令层,控制流包括：分支指令(branch,jump call return) 特殊的指令用于管理分支同步栈 GPU硬件为每个SIMD thread 提供堆栈保存分支成功的路径 GPU硬件指令带有控制每个线程车道1-bit谓词寄存器 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

Branch divergence 硬件跟踪各µthreads转移的方向（判定哪些是成功的转移，哪些是失败的转移）
如果所有线程所走的路径相同，那么可以保持这种 SIMD fashion 如果各线程选择的方向不一致，那么创建一个屏蔽（mask）向量来指示各线程的转移方向（成功、失败）继续执行分支失败的路径，将分支成功的路径压入硬件堆栈（分支同步堆栈），待后续执行 SIMD 车道何时执行分支同步堆栈中的路径? 通过执行pop操作，弹出执行路径以及屏蔽字，执行转移成功路径 SIMD lane完成整个分支路径执行后再执行下一条指令称为 converge(汇聚） 2018/11/16 计算机体系结构

16 November 2018 Example if (X[i] != 0) X[i] = X[i] – Y[i]; else X[i] = Z[i]; ld.global.f64 RD0, [X+R8] ; RD0 = X[i] setp.neq.s32 P1, RD0, #0 ; P1 is predicate register 1 @!P1, bra ELSE1, *Push ; Push old mask, set new mask bits ; if P1 false, go to ELSE1 ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i] sub.f64 RD0, RD0, RD2 ; Difference in RD0 st.global.f64 [X+R8], RD0 ; X[i] = RD0 @P1, bra ENDIF1, *Comp ; complement mask bits ; if P1 true, go to ENDIF1 ELSE1: ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i] st.global.f64 [X+R8], RD0 ; X[i] = RD0 ENDIF1: <next instruction>, *Pop ; pop to restore old mask 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

Warps are multithreaded on core
一个 warp 由 32个 µthreads 构成多个warp线程在单个核上交叉运行，以隐藏存储器访问呢和功能部件的延迟单个线程块包含多个warp (up to 512 µT max in CUDA), 这些warp都映射到同一个核上多个线程块也可以在同一个核上运行 2018/11/16 [Nvidia, 2010] 计算机体系结构

SIMT Warp Execution in the SM
Warp: 一组32个CUDA 线程，构成一条SIMD指令 SIMT: Single-Instruction Multi-Thread 建立适用于warp的并行线程的指令 SM 双发射流水线选择两个warp 发射到并行的核上 SIMT warp 执行warp中的指令（SIMD指令）(32个CUDA thread）通过谓词寄存器控制单个线程 (CUDA thread ) 执行结果 (enable/disable) 通过同步堆栈管理线程的分支通过规整的一致的计算来处理分支速度高于非规整的分支转移 2018/11/16 计算机体系结构

2018/11/16 计算机体系结构

Nvidia Fermi GF100 GPU [Nvidia, 2010] 2018/11/16 计算机体系结构

Fermi “Streaming Multiprocessor” Core
2018/11/16 计算机体系结构

Fermi Dual-Issue Warp Scheduler
2018/11/16 计算机体系结构

Apple A5X Processor for iPad v3 (2012)
12.90mm x 12.79mm 45nm technology 2018/11/16 计算机体系结构 [Source: Chipworks, 2012]

Loop-Level Parallelism
The University of Adelaide, School of Computer Science 16 November 2018 Loop-Level Parallelism 研究的焦点问题是：在循环结构中，后面的循环是否依赖于前面的循环 Loop-carried dependence Example 1: for (i=999; i>=0; i=i-1) x[i] = x[i] + s; 循环级并行：No loop-carried dependence 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

The University of Adelaide, School of Computer Science 16 November 2018 Loop-Level Parallelism Example 2: for (i=0; i<100; i=i+1) { A[i+1] = A[i] + C[i]; /* S1 */ B[i+1] = B[i] + A[i+1]; /* S2 */ } S1 and S2 均使用了前一次循环的结果 S2 使用同一循环中的S1的结果 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

The University of Adelaide, School of Computer Science 16 November 2018 Loop-Level Parallelism Example 3: for (i=0; i<100; i=i+1) { A[i] = A[i] + B[i]; /* S1 */ B[i+1] = C[i] + D[i]; /* S2 */ } S1 使用了S2前一次循环的结果，但数据依赖没有形成环路，也是循环级并行。 Transform to: A[0] = A[0] + B[0]; for (i=0; i<99; i=i+1) { B[i+1] = C[i] + D[i]; A[i+1] = A[i+1] + B[i+1]; B[100] = C[99] + D[99]; 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

The University of Adelaide, School of Computer Science 16 November 2018 Loop-Level Parallelism Example 4: for (i=0;i<100;i=i+1) { A[i] = B[i] + C[i]; D[i] = A[i] * E[i]; } Example 5: for (i=1;i<100;i=i+1) { Y[i] = Y[i-1] + Y[i]; //递推关系通常循环间相关表现形式为：递推关系 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

16 November 2018 Finding dependencies Assume indices are affine: a x i + b (i is loop index) Assume: Store to a x i + b, then Load from c x i + d i runs from m to n Dependence exists if: Given j, k such that m ≤ j ≤ n, m ≤ k ≤ n Store to a x j + b, load from a x k + d, and a x j + b = c x k + d 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

16 November 2018 Finding dependencies 通常在编译时无法确定是否相关数据依赖关系的测试 GCD test: 如果存在相关，那么 GCD(c,a) 必须能整除 (d-b) Example: for (i=0; i<100; i=i+1) { X[2*i+3] = X[2*i] * 5.0; } 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

16 November 2018 Finding dependencies Example 2: for (i=0; i<100; i=i+1) { Y[i] = X[i] / c; /* S1 */ X[i] = X[i] + c; /* S2 */ Z[i] = Y[i] + c; /* S3 */ Y[i] = c - Y[i]; /* S4 */ } 反相关和输出相关通过重命名消除这两类相关 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

16 November 2018 Reductions Reduction Operation: for (i=9999; i>=0; i=i-1) sum = sum + x[i] * y[i]; Transform to… sum [i] = x[i] * y[i]; finalsum = finalsum + sum[i]; //该循环存在循环间相关 Do on p processors: for (i=999; i>=0; i=i-1) finalsum[p] = finalsum[p] + sum[i+1000*p]; Note: assumes associativity! 2018/11/16 计算机体系结构 Chapter 2 — Instructions: Language of the Computer

GPU Future High-end desktops have separate GPU chip, but trend towards integrating GPU on same die as CPU (already in laptops, tablets and smartphones) Advantage is shared memory with CPU, no need to transfer data Disadvantage is reduced memory bandwidth compared to dedicated smaller-capacity specialized memory system Graphics DRAM (GDDR) versus regular DRAM (DDR3) Will GP-GPU survive? Or will improvements in CPU DLP make GP-GPU redundant? On same die, CPU and GPU should have same memory bandwidth GPU might have more FLOPS as needed for graphics anyway 2018/11/16 计算机体系结构

Acknowledgements These slides contain material developed and copyright by: John Kubiatowicz (UCB) Krste Asanovic (UCB) David Patterson (UCB) Chenxi Zhang (Tongji) UCB material derived from course CS152、CS252 CMU Introduction to Computer Architecture 2018/11/16 计算机体系结构

周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学 2018/11/16 计算机体系结构周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学.

Similar presentations

Presentation on theme: "周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学 2018/11/16 计算机体系结构周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学."— Presentation transcript:

Similar presentations

About project

反馈

请登录

Auth with social network:

周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学 2018/11/16 计算机体系结构 周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学.

Similar presentations

Presentation on theme: "周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学 2018/11/16 计算机体系结构 周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学."— Presentation transcript:

Similar presentations

About project

反馈

周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学 2018/11/16 计算机体系结构周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学.

Presentation on theme: "周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学 2018/11/16 计算机体系结构周学海 xhzhou@ustc.edu.cn 0551-63601556, 63492149 中国科学技术大学."— Presentation transcript: