Computer Architecture — Xuehai Zhou, xhzhou@ustc.edu.cn, 0551-63606864, University of Science and Technology of China, 2019/7/23
05/13 Review: vector-machine memory access; optimizations for the vector-machine model; multimedia extension instructions; GPUs. Memory organization: independent memory banks, multi-bank interleaving. Stride: fixed (1 or a constant) vs. non-fixed (indexed gather/scatter). Vector-machine optimizations: chaining, conditional (masked) execution, sparse-matrix operations. Multimedia extensions: only a few extended instruction types, short vector registers. GPUs.
Array vs. Vector Processors. Array processor: also called a parallel processor or SIMD processor. Its core is an array of processing elements; a single control unit directs all processing elements to perform the same operation, each on its own data.
SIMD Array Processing vs. VLIW
SIMD Array Processing vs. VLIW. Array processor: a single operation is applied to multiple different data elements.
Vector/SIMD Processing Summary. The same operation is applied to different data elements. Operations on the elements of a vector are independent, which raises performance and simplifies the design. The speedup is limited by how much of the code can be vectorized: scalar operations bound vector-machine performance (Amdahl's Law). Many ISAs include SIMD instructions: Intel MMX/SSEn/AVX, PowerPC AltiVec, ARM Advanced SIMD.
Multimedia Extensions (aka SIMD extensions). Add vector operations with very short vector lengths to an existing ISA, splitting an existing 64-bit register into 2x32b, 4x16b, or 8x8b lanes. In 1957 the Lincoln Labs TX-2 split its 36-bit datapath into 2x18b or 4x9b. Newer designs use wider registers: 128b for PowerPC AltiVec and Intel SSE2/3/4 (Streaming SIMD Extensions), 256b for Intel AVX (Advanced Vector Extensions). A single instruction operates on all vector elements in a register.
Multimedia Extensions (aka SIMD extensions) — figure: one 64-bit register viewed as 64b, 2x32b, 4x16b, or 8x8b lanes, with an example of 4x16b parallel adds.
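To make the packed-add idea concrete, here is a minimal C sketch (illustrative only; the function name paddw_model is hypothetical) that models a 4x16-bit packed add in the spirit of MMX's PADDW: each 16-bit lane of two 64-bit values is added independently, and carries never cross lane boundaries.

#include <stdint.h>
#include <stdio.h>

/* Model of a 4x16-bit packed add (in the spirit of MMX PADDW):
   each 16-bit lane is added independently; carries do not cross lanes. */
static uint64_t paddw_model(uint64_t a, uint64_t b) {
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(a >> (16 * lane));
        uint16_t y = (uint16_t)(b >> (16 * lane));
        uint16_t sum = (uint16_t)(x + y);            /* wraps within the lane */
        result |= (uint64_t)sum << (16 * lane);
    }
    return result;
}

int main(void) {
    uint64_t a = 0x0001000200030004ULL;
    uint64_t b = 0x0010002000300040ULL;
    printf("%016llx\n", (unsigned long long)paddw_model(a, b)); /* prints 0011002200330044 */
    return 0;
}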
Intel Pentium MMX Operations. Idea: one instruction operates on all data elements in a register at once ("full array" processing), intended for multimedia workloads. There is no VLEN register; the opcode determines the data type: 8 8-bit bytes, 4 16-bit words, 2 32-bit doublewords, or 1 64-bit quadword. Stride is always equal to 1.
MMX Example: Image Overlaying (I)
MMX Example: Image Overlaying (II)
Multimedia Extensions versus Vectors. Limited instruction set: no vector-length control; loads/stores lack strided addressing and scatter/gather; loads must be aligned to 64/128-bit boundaries. Limited vector register length: superscalar issue is needed to keep the multiply/add/load units busy, and hiding latency through loop unrolling increases register pressure. Microprocessor designs are moving toward full vector support: better handling of unaligned memory accesses, support for double-precision (64-bit floating-point) operations, and the Intel AVX spec (announced April 2008) with 256b vector registers (expandable up to 1024b).
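As a concrete illustration of such a wide SIMD extension, the sketch below uses Intel AVX intrinsics (assumes an AVX-capable x86 CPU, a compiler flag such as -mavx, and the immintrin.h header; the function name add_avx is made up) to add two double arrays four elements per 256-bit operation; the unaligned load/store intrinsics reflect the improved unaligned-access support mentioned above.

#include <immintrin.h>

/* Sketch: c[i] = a[i] + b[i] using 256-bit AVX registers (4 doubles at a time).
   Assumes n is a multiple of 4; otherwise a scalar tail loop would finish the rest. */
void add_avx(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i + 4 <= n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);   /* unaligned load of 4 doubles */
        __m256d vb = _mm256_loadu_pd(b + i);
        __m256d vc = _mm256_add_pd(va, vb);    /* 4 independent double adds   */
        _mm256_storeu_pd(c + i, vc);           /* unaligned store             */
    }
}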
GPU: origins of the GPU; basic GPU hardware structure; GPU programming model; GPU memory hierarchy; GPU branch handling (divergence and reconvergence).
Graphics Processing Units (GPUs). Early GPUs (mid-late 1990s) were fixed-function devices with high-performance floating-point units for efficiently generating 3D graphics, giving PCs workstation-class graphics; users could configure the graphics pipeline but not truly program it. From 2001 to 2005, GPUs gained more and more programmability: for example, the new language Cg (and Windows DirectX variants) could express small programs that process vertices or pixels. These were massively parallel (millions of vertices or pixels per frame) but had a very restricted programming model. Some users noticed that general-purpose computation was possible by mapping inputs and outputs to images and computing with vertex or pixel shaders, but having to go through the graphics pipeline made this a very awkward programming model for general-purpose work.
General-Purpose GPUs (GP-GPUs). In 2006, Nvidia's GeForce 8800 GPU supported a new programming language: CUDA, "Compute Unified Device Architecture". Industry later introduced OpenCL, built on the same ideas but vendor-independent. Idea: use the GPU's high compute performance and high memory bandwidth to accelerate the kernels of general-purpose applications. It is a coprocessor model (the GPU is an attached device): the host CPU launches data-parallel kernels that run on the GP-GPU. We discuss only a simplified version of the Nvidia CUDA style, covering the GPU's compute cores and not the graphics-acceleration parts.
Using a CPU+GPU Architecture: pick the right processor and memory for each task. A general-purpose CPU is suited to a few serial threads: fast serial execution, and caches give low memory-access latency (strong control, weak compute). A GPU is suited to large numbers of parallel threads: scalable parallel execution and high-bandwidth parallel memory access (weak control, strong compute). Figure: CPU with cache and host memory, connected through a PCIe bridge to the GPU with its shared memory (SMem) and device memory.
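A minimal CUDA sketch of this coprocessor model (standard CUDA runtime calls; error checking omitted; the wrapper name daxpy_on_gpu is made up, and the kernel mirrors the DAXPY example that appears later in these slides): data is copied from host memory across PCIe into device memory, a data-parallel kernel is launched on the GPU, and the result is copied back.

#include <cuda_runtime.h>

__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* one element per CUDA thread */
    if (i < n) y[i] = a * x[i] + y[i];
}

/* Host side: allocate device memory, copy across PCIe, launch, copy back. */
void daxpy_on_gpu(int n, double a, const double *x, double *y) {
    double *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(double));
    cudaMalloc((void **)&d_y, n * sizeof(double));
    cudaMemcpy(d_x, x, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(double), cudaMemcpyHostToDevice);
    daxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
    cudaMemcpy(y, d_y, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
}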
GPU: a multithreaded coprocessor. SP (scalar processor, "CUDA core"): executes one thread. SM (streaming multiprocessor): 32 SPs (or 16, 48, or more) plus a fast local "shared memory" shared among the SPs, 16 KiB (or 64 KiB). Figure: an SM's array of SPs sits above its shared memory; all SMs access global memory on the device.
GPUs are SIMD Engines Underneath. The application is expressed as a set of threads executing ordinary (scalar) instructions. Two concepts: Programming Model (software) vs. Execution Model (hardware). The programming model is how the programmer describes the application (the machine as seen from the programmer's point of view), e.g., sequential (von Neumann), data parallel (SIMD), dataflow, multithreaded (MIMD, SPMD), ... The execution model is how the hardware actually executes the code, e.g., out-of-order execution, vector processor, dataflow processor, multiprocessor, multithreaded processor, etc. The execution model can differ greatly from the programming model: e.g., the sequential model can run on an out-of-order processor, and the SPMD model can be implemented on a SIMD processor (a GPU).
How Can You Exploit Parallelism Here? for (i=0; i < N; i++) C[i] = A[i] + B[i]; — scalar sequential code (figure: each iteration is a load, add, store sequence). Let's examine three programming options to exploit the instruction-level parallelism present in this sequential code: 1. Sequential (SISD), 2. Data-Parallel (SIMD), 3. Multithreaded (MIMD/SPMD).
Prog. Model 1: Sequential (SISD). for (i=0; i < N; i++) C[i] = A[i] + B[i]; — scalar sequential code (figure: load, add, store per iteration). Can be executed on: a pipelined processor; an out-of-order execution processor, where independent instructions execute when ready, and different iterations are present in the instruction window and can execute in parallel in multiple functional units — in other words, the loop is dynamically unrolled by the hardware; a superscalar or VLIW processor, which can fetch and execute multiple instructions per cycle.
Prog. Model 2: Data Parallel (SIMD). for (i=0; i < N; i++) C[i] = A[i] + B[i]; Realization: each iteration is independent. Idea: the programmer or compiler generates a SIMD instruction to execute the same instruction from all iterations across different data. Best executed by a SIMD processor (vector, array). Vectorized code: VLD A -> V1; VLD B -> V2; VADD V1 + V2 -> V3; VST V3 -> C (figure: the per-iteration scalar load/add/store is replaced by vector instructions spanning all iterations).
Prog. Model 3: Multithreaded. for (i=0; i < N; i++) C[i] = A[i] + B[i]; (figure: each iteration's load/add/store becomes its own thread). Realization: each iteration is independent. Idea: the programmer or compiler generates a thread to execute each iteration; each thread does the same thing, but on different data. Can be executed on a MIMD machine.
Prog. Model 3: Multithreaded (continued). Realization: each iteration is independent. Idea: the programmer or compiler generates a thread to execute each iteration; each thread does the same thing, but on different data. Can be executed on a MIMD machine. This particular model is also called SPMD: Single Program Multiple Data. It can be executed on a SIMT machine (Single Instruction Multiple Thread) and on a SIMD machine.
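A minimal CUDA sketch of this SPMD model (the kernel name vec_add is made up): each thread executes the body of one loop iteration, selected by its global thread index, and the guard disables threads beyond N.

/* Each CUDA thread handles one iteration i of: C[i] = A[i] + B[i] */
__global__ void vec_add(const float *A, const float *B, float *C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* this thread's iteration */
    if (i < N)                                       /* turn off extra threads  */
        C[i] = A[i] + B[i];
}
/* Launch, one thread per element: vec_add<<<(N + 255) / 256, 256>>>(A, B, C, N); */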
SPMD: Single procedure/program, multiple data. Each processing element executes the same procedure on different data; this is a programming model, not a computer organization. The procedures can synchronize at certain points in the program, e.g., barriers. Multiple instruction streams execute the same program; each program/procedure works on different data and may follow a different control-flow path at run time. Many scientific applications are programmed this way and run on MIMD hardware (multiprocessors); modern GPUs are programmed in a similar way but run on SIMD hardware.
A GPU is a SIMD (SIMT) Machine. The GPU is not programmed with SIMD instructions; it is programmed with threads (the SPMD programming model). Each thread executes the same code but operates on different data elements, and each thread has its own context (so it can be launched and executed independently). The hardware dynamically groups threads that execute the same instruction into warps; a warp is therefore a SIMD operation formed by hardware.
Threads and Blocks. One thread corresponds to one data element; large numbers of threads are organized into thread blocks, and many thread blocks make up a grid. The GPU manages threads in hardware: a Thread Block Scheduler assigns thread blocks to processors, and a SIMD Thread Scheduler schedules the SIMD threads within each block.
Programmer's View of Execution. blockDim = 256 (the programmer can choose). Create enough thread blocks to cover the input vector; Nvidia calls the code that runs on the GPU, organized as multiple thread blocks, a Grid, and a Grid can be 2-dimensional. Figure: block blockIdx 0 holds threadIdx 0..255, block blockIdx 1 holds threadIdx 0..255, and so on up to the last of the (n+255)/256 blocks. The conditional (i < n) turns off the unused threads in the last block.
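Since a grid can be 2-dimensional, a launch can use dim3 block and grid dimensions. The sketch below (hypothetical kernel process_image, operating on a device image buffer) shows each thread deriving its 2D coordinates from the block and thread indices, with the boundary guard playing the same role as (i < n) above.

__global__ void process_image(float *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   /* column index */
    int y = blockIdx.y * blockDim.y + threadIdx.y;   /* row index    */
    if (x < width && y < height)                     /* guard partial blocks at the edges */
        img[y * width + x] *= 2.0f;
}

void launch_2d(float *d_img, int width, int height) {
    dim3 block(16, 16);                                /* 256 threads per block, arranged 2D */
    dim3 grid((width + 15) / 16, (height + 15) / 16);  /* enough blocks to cover the image   */
    process_image<<<grid, block>>>(d_img, width, height);
}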
Program organization for a vector of length 8192: the Grid contains 16 thread blocks (512 elements per block); each block contains 16 SIMD threads; each SIMD thread comprises 32 CUDA threads; each CUDA thread processes one element (16 x 16 x 32 = 8192).
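In CUDA launch syntax this organization corresponds to something like the following sketch (assuming the daxpy kernel shown on the Simplified CUDA Programming Model slide below): 512 CUDA threads per block, which the hardware groups into 16 SIMD threads (warps) of 32.

void launch_daxpy_8192(double *x, double *y) {
    int n = 8192;
    int threads_per_block = 512;                  /* 16 SIMD threads x 32 CUDA threads */
    int nblocks = (n + threads_per_block - 1) / threads_per_block;  /* = 16 blocks in the grid */
    daxpy<<<nblocks, threads_per_block>>>(n, 2.0, x, y);
}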
Hardware Execution Model. The GPU consists of many parallel cores; each core is a multithreaded SIMD processor containing multiple lanes. The CPU sends an entire "grid" to the GPU, which distributes the thread blocks across its cores (each thread block runs on one core). The number of cores on the GPU is not exposed to the programmer. Figure: CPU with CPU memory, attached to a GPU with Core 0..Core 15 (each with Lane 0..Lane 15) and GPU memory.
Simplified CUDA Programming Model. The computation is performed by a large number of independent threads (CUDA threads or microthreads), which are grouped into thread blocks.
// C version of DAXPY loop.
void daxpy(int n, double a, double *x, double *y)
{ for (int i = 0; i < n; i++) y[i] = a*x[i] + y[i]; }
// CUDA version.
// Piece run on the GP-GPU (device).
__global__ void daxpy(int n, double a, double *x, double *y)
{ int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i]; }
// Piece run on the host processor.
int nblocks = (n+255)/256; // 256 CUDA threads/block
daxpy<<<nblocks,256>>>(n, 2.0, x, y);
NVIDIA Instruction Set Arch. "Parallel Thread Execution (PTX)" is an abstraction of the hardware instruction set: it uses virtual registers, and software translates it into machine code. Example:
shl.s32 R8, blockIdx, 9 ; Thread Block ID * Block size (512 or 2^9)
add.s32 R8, R8, threadIdx ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8] ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4 ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2 ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0 ; Y[i] = sum (X[i]*a + Y[i])
CUDA kernel maps to Grid of Blocks. kernel_func<<<nblk, nthread>>>(param, ... ); Figure: the host thread on the CPU (with cache and host memory) launches, across the PCIe bridge, a grid of thread blocks onto the GPU's SMs (each with shared memory), backed by device memory.
Thread blocks execute on an SM; thread instructions execute on a core. Variable scopes: float myVar; — per-thread (registers / per-thread local memory); __shared__ float shVar; — per-block shared memory; __device__ float glVar; — per-application device global memory. SM: Streaming Multiprocessor. PTX: Parallel Thread Execution.
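A small CUDA sketch (hypothetical kernel scopes_demo, launched with at most 256 threads per block) of the three declaration scopes named above: an automatic variable lives in per-thread registers/local memory, a __shared__ variable is per-block, and a __device__ variable is global to the whole application on the device.

__device__ float glVar = 1.0f;             /* per-application: device global memory */

__global__ void scopes_demo(float *out) {
    __shared__ float shVar[256];           /* per-block: shared memory on the SM   */
    float myVar = threadIdx.x * 1.0f;      /* per-thread: register / local memory  */

    shVar[threadIdx.x] = myVar;            /* visible to all threads in this block */
    __syncthreads();                       /* block-level barrier before reading   */

    if (threadIdx.x == 0)
        out[blockIdx.x] = shVar[0] + glVar;  /* combine shared and global data     */
}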
CUDA: "Single Instruction, Multiple Thread". GPUs use the SIMT model: the scalar instruction streams of individual CUDA threads are grouped together and executed on the hardware in SIMD fashion (Nvidia groups 32 CUDA threads into a warp). Figure: microthreads µT0-µT7 each run the scalar instruction stream ld x; mul a; ld y; add; st y, executed as SIMD across the warp.
SPMD on SIMT Machine. for (i=0; i < N; i++) C[i] = A[i] + B[i]; Warp: a set of threads that execute the same instruction (i.e., are at the same PC). Figure: iterations 1 and 2 run as lanes of Warp 0, which steps through PC X, X+1, X+2, X+3 for the load/add/store sequence. Realization: each iteration is independent. Idea: the programmer or compiler generates a thread to execute each iteration; each thread does the same thing, but on different data, and could run on a MIMD machine. This particular model is also called SPMD: Single Program Multiple Data. A GPU executes it using the SIMT model (Single Instruction Multiple Thread), i.e., on a SIMD machine.
SIMD vs. SIMT Execution Model. SIMD: a single instruction stream (a sequence of SIMD instructions), where each instruction operates on multiple data inputs (vector instructions). SIMT: multiple instruction streams of scalar instructions form threads, and the threads are dynamically grouped into warps; a warp processes multiple data elements. Main advantages of SIMT: threads can be treated independently, i.e., each thread can execute on its own on any scalar pipeline (MIMD-style processing); and threads can be grouped into warps, i.e., threads executing the same instruction stream are combined into a warp and run in SIMD fashion, retaining the benefits of SIMD processing.
Multithreading of Warps. for (i=0; i < N; i++) C[i] = A[i] + B[i]; Assume a warp consists of 32 threads; with 32K iterations there are 1K warps. These warps can be interleaved on the same pipeline: fine-grained multithreading of warps. Figure: Warp 0 at PC X covers iterations 1-32, Warp 1 at PC X covers iterations 33-64, ..., Warp 20 at PC X+2 covers iterations 20*32+1, 20*32+2, ...
Warps and Warp-Level FGMT. Warp: the threads that run lengthwise in a woven fabric ... In SIMD you must specify the data array, the instruction to apply to it, and the instruction width. For example, to add two integer arrays of length 16, a SIMD instruction might look like add.16 arr1, arr2 (a made-up instruction for illustration). SIMT, by contrast, does not care about instruction width: you simply write arr1[i] + arr2[i] and launch as many threads as there are array elements. If the array size were 32, SIMD expects you to explicitly issue two such "add.16" instructions, whereas SIMT does not. Figure: thread warps (e.g., Warps 3, 7, 8) each share a common PC; within a warp, scalar threads W, X, Y, Z feed the SIMD pipeline.
Warp Instruction Level Parallelism. Execution of multiple warp instructions can be overlapped: the example machine has 32 threads per warp and 8 lanes, and completes 24 operations/cycle while issuing 1 warp/cycle. Figure: over time, warps W0-W5 are issued across the Load Unit, Multiply Unit, and Add Unit. Slide credit: Krste Asanovic
Latency Hiding via Warp-Level FGMT. Fine-grained multithreading: the pipeline executes one instruction from one warp at a time (no interlocking), and latency is hidden by interleaving the execution of different warps; the register values of all threads stay in the register file. FGMT can thus tolerate long latencies (millions of pixels). Figure: the SIMD pipeline (I-Fetch, Decode, RF, ALU, D-Cache, Writeback) with warps available for scheduling (e.g., Thread Warps 1, 2, 6, 7) and warps accessing the memory hierarchy (e.g., Thread Warps 3, 8); on a miss the warp waits for its data, on a hit it proceeds to writeback. Notes: with a large number of shader threads multiplexed on the same execution resources, the architecture employs fine-grained multithreading, where individual threads are interleaved by the fetch unit to proactively hide the latency of stalls before they occur; warps are issued fairly in a round-robin queue. When a thread is blocked by a memory request, the shader core removes that thread's warp from the pool of "ready" warps, allowing other threads to proceed while the memory system processes the request. With a large number of threads (1024 per shader core) interleaved on the same pipeline, FGMT effectively hides the latency of most memory operations, since the pipeline is occupied with instructions from other threads while memory operations complete; it also hides the pipeline latency itself, so data-bypassing logic can potentially be omitted to save area with minimal impact on performance, and dependency-check logic is simplified by restricting each thread to at most one instruction in the pipeline at any time. Slide credit: Tor Aamodt
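A toy C sketch of the round-robin issue policy described above (purely illustrative, not how a real shader core is built; all names are made up): one instruction is issued per cycle from the next ready warp, and a warp waiting on memory is skipped until its request completes.

#include <stdio.h>

#define NUM_WARPS 8

typedef struct {
    int pc;            /* next instruction for this warp     */
    int stall_cycles;  /* > 0 while the warp waits on memory */
} Warp;

int main(void) {
    Warp warps[NUM_WARPS] = {{0, 0}};
    int next = 0;                                   /* round-robin pointer */
    for (int cycle = 0; cycle < 32; cycle++) {
        for (int w = 0; w < NUM_WARPS; w++)         /* age outstanding memory requests */
            if (warps[w].stall_cycles > 0) warps[w].stall_cycles--;
        for (int tries = 0; tries < NUM_WARPS; tries++) {
            int w = (next + tries) % NUM_WARPS;     /* skip warps blocked on memory */
            if (warps[w].stall_cycles == 0) {
                printf("cycle %2d: issue warp %d at PC %d\n", cycle, w, warps[w].pc);
                warps[w].pc++;
                if (warps[w].pc % 4 == 0)           /* pretend every 4th instruction is a */
                    warps[w].stall_cycles = 10;     /* load that misses and blocks the warp */
                next = (w + 1) % NUM_WARPS;
                break;
            }
        }
    }
    return 0;
}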
Warp-based SIMD vs. Traditional SIMD. Traditional SIMD contains a single thread: lock step — one vector instruction completes before the next one starts; the programming model is SIMD (no extra threads); software must know the vector length; the ISA contains vector/SIMD instructions. Warp-based SIMD consists of multiple scalar threads executing in SIMD fashion (i.e., all threads execute the same instruction): it does not have to be lock step; each thread can be treated individually (i.e., placed in a different warp), so the programming model is not SIMD; software does not need to know the vector length, and threads are grouped into warps dynamically; the ISA is a scalar ISA, with logical vector instructions formed dynamically. In essence, it is an SPMD programming model implemented on SIMD hardware.
Warp Execution (Recall the Slide). A 32-thread warp executes ADD A[tid], B[tid] -> C[tid]. Figure: with one pipelined functional unit, one element enters the pipeline per cycle (C[0], C[1], C[2], ... in flight); with four pipelined functional units, four elements enter per cycle, interleaved across the lanes (lane 0 handles elements 0, 4, 8, 12, ...; lane 1 handles 1, 5, 9, 13, ...; and so on). Slide credit: Krste Asanovic
SIMD Execution Unit Structure. Figure: four lanes, each containing a functional unit and the registers for its threads — lane 0 holds registers for thread IDs 0, 4, 8, ...; lane 1 for 1, 5, 9, ...; lane 2 for 2, 6, 10, ...; lane 3 for 3, 7, 11, ... — all connected to the memory subsystem. Slide credit: Krste Asanovic
GPU Memory Hierarchy [Nvidia, 2010]
SIMT Memory Access. The same instruction in different threads uses the thread id to access different data elements. Example: assume N = 16 and 4 threads per warp, giving 4 warps; threads 0-15 map one-to-one onto data elements 0-15, so Warp 0 covers elements 0-3, Warp 1 covers 4-7, Warp 2 covers 8-11, and Warp 3 covers 12-15. Slide credit: Hyesoon Kim
Summary: Vector Processors vs. GPUs — a comparison of corresponding terms at the different levels.
Summary: Vector Processors vs. GPUs (continued).
Acknowledgements. These slides contain material developed and copyrighted by: John Kubiatowicz (UCB), Krste Asanovic (UCB), John Hennessy (Stanford) and David Patterson (UCB), Chenxi Zhang (Tongji), Muhamed Mudawar (KFUPM). UCB material derived from courses CS152, CS252, CS61C; KFUPM material derived from courses COE501, COE502; CMU Introduction to Computer Architecture: http://www.ece.cmu.edu/~ece447/