
1 CUDA Massively Parallel Programming
赵开勇 zhao.kaiyong@gmail.com
http://www.comp.hkbu.edu.hk/~kyzhao http://blog.csdn.net/openhero
Department of Computer Science, Hong Kong Baptist University
GPU High-Performance Development Consultant, Inspur

2 Outline
From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
CUDA programming tools
The new generation: Fermi GPU

3 Graphics Processing Unit (GPU)
A specialized image display device for personal computers, workstations, and game consoles
Discrete graphics cards
  NVIDIA and ATI (now AMD) are the main manufacturers
  Intel planned to enter this market with Larrabee
Motherboard-integrated graphics
  Intel

4 The 3D Graphics Pipeline
A typical frame at 30 frames/s: 1M triangles, 3M vertices, 25M fragments
Per second: 30M triangles/s, 90M vertices/s, 750M fragments/s

5 Traditional GPU Architecture
Graphics program → Vertex processors → Fragment processors → Pixel operations → Output image

6 The GPU's Computational Power
Data-level parallelism: uniform computation across elements
Dedicated memory channels
Memory latency hidden effectively

7 General Purpose Computing on GPU (GPGPU)

8 GPGPU
Core idea: describe a general-purpose computation in a graphics language, mapping the data onto vertex or fragment processors
But:
  hardware resources are underutilized
  memory access patterns are severely restricted
  debugging and error checking are difficult
  requires advanced graphics knowledge and programming skill

9 G80 GPU
[Architecture diagram: Host → Input Assembler → Vtx/Geom/Pixel thread issue → array of Streaming Multiprocessors (SMs), each a group of Streaming Processors (SPs), with texture fetch (TF) units, L2 caches, and framebuffer (FB) partitions]

10 CUDA: Compute Unified Device Architecture
CUDA: integrated CPU + GPU C application programs
A general-purpose parallel computing model
Single-instruction, multiple-data (SIMD) execution
  all threads execute the same code (1000s of threads on the fly)
  massive parallel compute resources process different data
Hides memory latency
  raises the computation-to-communication ratio
  coalesces accesses to adjacent memory addresses
  fast thread switching: ~1 cycle, versus ~1000 cycles on a CPU

11 Evolution of CUDA-Enabled GPUs
Compute 1.0: basic CUDA compatibility
  G80
Compute 1.1: asynchronous memory copies and atomic global operations
  G84, G86, G92, G94, G96, and G98
Compute 1.2: dramatically improved memory coalescing rules, double the register count, intra-warp voting primitives, atomic shared-memory operations
  GT21X
Compute 1.3: double precision
  GT200

12 CUDA Success Stories

13 Outline
From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
CUDA programming tools
The new generation: Fermi GPU

14 Dimensions of Parallelism
1D: y = a + b        // y, a, b are vectors
2D: P = M × N        // P, M, N are matrices
3D: CT or MRI imaging
[Diagram: element-wise vector addition, y[i] = a[i] + b[i]]

15 Parallel Thread Organization
Thread: the basic unit of parallelism
Thread block: a group of cooperating threads
  a Cooperative Thread Array (CTA)
  threads may synchronize with each other
  threads exchange data through fast shared memory
  organized in 1, 2, or 3 dimensions
  contains at most 512 threads
Grid: a set of thread blocks
  organized in 1 or 2 dimensions
  blocks share global memory
Kernel: the core program executed on the GPU
  one kernel ↔ one grid
[Diagram: the host launches Kernel 1 into Grid 1 and Kernel 2 into Grid 2; each grid is a 2-D array of blocks, and each block a 2-D array of threads]

16 Parallel Program Organization in CUDA
Software        Hardware
Thread          SP
Thread block    SM
Grid            GPU

17 Parallel Thread Execution
Calling a kernel function requires specifying an execution configuration
Threads and blocks have IDs
  threadIdx: 1D, 2D, or 3D
  blockIdx: 1D or 2D
  the IDs determine which data each thread processes

__global__ void kernel(...);
dim3 DimGrid(3, 2);     // 6 thread blocks
dim3 DimBlock(16, 16);  // 256 threads per block
kernel<<<DimGrid, DimBlock>>>(...);
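A hedged sketch (not from the original slides; the kernel name and the scaling operation are illustrative) of combining blockIdx, blockDim, and threadIdx to give every thread a unique 2-D index:

__global__ void scale2d(float *data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)      // guard against partial blocks
        data[y * width + x] *= 2.0f;  // each thread scales one element
}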

18 Example 1: Element-Wise Addition

// CPU program: sum of two vectors a and b
void add_cpu(float *a, float *b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] += b[idx];
}

int main()
{
    ...
    add_cpu(a, b, N);
}

// CUDA program: sum of two vectors a and b
__global__ void add_gpu(float *a, float *b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] += b[idx];
}

int main()
{
    ...
    dim3 dimBlock(256);
    dim3 dimGrid((N + 255) / 256);  // ceil(N / 256)
    add_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}
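A hedged, self-contained sketch of the host side for Example 1 (the vector length N and the initialization are illustrative; the allocation and copy calls are the standard runtime API covered on later slides):

#include <cuda_runtime.h>
#include <stdlib.h>

int main()
{
    int N = 1 << 20;
    size_t bytes = N * sizeof(float);
    float *a = (float*)malloc(bytes);
    float *b = (float*)malloc(bytes);
    // ... initialize a and b ...

    float *da, *db;                    // device copies of a and b
    cudaMalloc((void**)&da, bytes);
    cudaMalloc((void**)&db, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(256);
    dim3 dimGrid((N + 255) / 256);     // ceil(N / 256) blocks
    add_gpu<<<dimGrid, dimBlock>>>(da, db, N);

    cudaMemcpy(a, da, bytes, cudaMemcpyDeviceToHost);  // result back into a
    cudaFree(da);
    cudaFree(db);
    free(a);
    free(b);
    return 0;
}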

19 Outline
From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
CUDA programming tools
The new generation: Fermi GPU

20 CUDA Processing Flow

21 Parallel Thread Execution
Within an SM, threads execute in units of warps (32 threads each)
All threads within a warp execute the same instruction
A half-warp is the basic unit of memory operations
[Diagram: Blocks 0–2 assigned to an SM and partitioned into warps]

22 Control Flow
Branch statements within a warp may diverge onto different instruction paths
Threads on different instruction paths can only execute sequentially
  N instruction paths → 1/N throughput
Only divergence within a single warp matters; different warps taking different paths are independent of each other
G80 uses instruction predication to speed up divergent execution

23 Control Flow
Common case: when the branch condition is a function of the thread ID, divergence arises easily

Example with divergence:
if (threadIdx.x > 2) { }
  creates two instruction paths within a thread block
  branch granularity < warp size: threads 0–2 follow a different instruction path from the other threads of the 1st warp

Example without divergence:
if (threadIdx.x / WARP_SIZE > 2) { }
  also creates two instruction paths within a thread block
  branch granularity is a whole multiple of warp size: all threads of any given warp follow the same instruction path

24 Thread Synchronization
void __syncthreads();
Barrier synchronization
  synchronizes all threads within a thread block
  avoids RAW/WAR/WAW hazards when accessing shared memory

__shared__ float scratch[256];
scratch[threadID] = begin[threadID];
__syncthreads();   // wait here until every thread of the block has arrived
float left = scratch[threadID - 1];

25 Deadlock with __syncthreads
Deadlock occurs if
  some threads have val larger than threshold and return early
  while the others reach the barrier and wait forever

__global__ void compute(...)
{
    // do some computation for val
    if (val > threshold)
        return;           // these threads never reach the barrier
    __syncthreads();      // the remaining threads wait here forever
    // work with val & store it
}
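A hedged sketch of a common fix (not from the slides): make every thread reach the barrier and replace the early return with a flag:

__global__ void compute_safe(...)
{
    // do some computation for val
    bool active = (val <= threshold);   // no early return
    __syncthreads();                    // every thread reaches the barrier
    if (active) {
        // work with val & store it
    }
}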

26 Outline
From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
CUDA programming tools
The new generation: Fermi GPU

27 CUDA Language Extensions
Declspecs: __global__, __device__, __shared__, __local__, __constant__
Keywords: threadIdx, blockIdx, blockDim, gridDim
Intrinsics: __syncthreads
Runtime API: memory, symbol, and execution management
Function launch: kernel<<<grid, block>>>(...)

__device__ float filter[N];

__global__ void convolve(float *image)
{
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage;
cudaMalloc(&myimage, bytes);

// 100 blocks, 10 threads per block
foo<<<100, 10>>>(parameters);

28 Memory Spaces
R/W per-thread registers
  1-cycle latency
R/W per-thread local memory
  slow: it is register spilling into global memory
R/W per-block shared memory ("__shared__")
  1-cycle latency, but bank conflicts may drag performance down
R/W per-grid global memory ("__device__")
  ~500-cycle latency, but coalesced accesses can hide the latency
Read-only per-grid constant and texture memories
  ~500-cycle latency, but cached
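A hedged sketch tying the qualifiers above to declarations (all names are illustrative):

__constant__ float coeffs[16];   // per-grid constant memory (cached, read-only)
__device__   float table[1024];  // per-grid global memory

__global__ void kernel(float *out)
{
    __shared__ float tile[256];                 // per-block shared memory
    float r = coeffs[0] * table[threadIdx.x];   // r lives in a per-thread register
    tile[threadIdx.x] = r;
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];
}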

29 GPU Global Memory Allocation
cudaMalloc()
  allocates global memory on the device
  two parameters: a pointer to the allocated array and the allocation size
cudaFree()
  frees global memory on the device
  one parameter: the array pointer

int blk_sz = 64;
float* Md;
int size = blk_sz * blk_sz * sizeof(float);
cudaMalloc((void**)&Md, size);
...
cudaFree(Md);
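These runtime calls return a cudaError_t; a hedged sketch of checking it (cudaGetErrorString is part of the standard runtime API; requires <stdio.h>):

cudaError_t err = cudaMalloc((void**)&Md, size);
if (err != cudaSuccess)
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));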

30 Host–Device Data Exchange
cudaMemcpy()
  memory data transfer
  requires four parameters:
    pointer to the destination
    pointer to the source
    number of bytes to copy
    type of transfer: host to host, host to device, device to host, or device to device

cudaMemcpy(Md, M.elements, size, cudaMemcpyHostToDevice);
cudaMemcpy(M.elements, Md, size, cudaMemcpyDeviceToHost);

31 CUDA Function Definitions

                                 Executed on:   Only callable from:
__device__ float DeviceFunc()    device         device
__global__ void  KernelFunc()    device         host
__host__   float HostFunc()      host           host

__global__ defines a kernel function
  it must return void
__device__ functions
  cannot have their address taken with the & operator
  do not support recursion
  do not support static variables
  do not support variable-length argument lists
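A hedged sketch (illustrative names) of a __device__ helper called from a __global__ kernel:

__device__ float square(float x)   // device-only helper, callable from device code
{
    return x * x;
}

__global__ void squareAll(float *v, int n)   // kernel, launched from the host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = square(v[i]);
}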

32 CUDA Math Functions
pow, sqrt, cbrt, hypot, exp, exp2, expm1, log, log2, log10, log1p, sin, cos, tan, asin, acos, atan, atan2, sinh, cosh, tanh, asinh, acosh, atanh, ceil, floor, trunc, round, etc.
  scalar operations only
Many functions have a faster, less accurate counterpart
  prefixed with "__", e.g. __sinf()
  the compiler switch -use_fast_math forces generation of these versions
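A hedged sketch contrasting the two versions (the kernel name is illustrative; __sinf operates on single precision only):

__global__ void sines(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = sinf(in[i]);        // accurate library version
        // out[i] = __sinf(in[i]);   // faster, less accurate intrinsic
    }
}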

33 Example 2: Matrix Multiplication
A Matrix data type – not part of CUDA!
  single-precision floating point
  width × height elements
  matrix elements in a 1-D array (elements)
  row-major storage

typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

[Diagram: input and output matrices, with M.width = N.height]

34 Example 2: Matrix Multiplication
C = A × B, all of size WIDTH × WIDTH
One thread computes one matrix element
  only one thread block is needed
Each thread loads one row of A and one column of B
  one multiply and one add per pair of corresponding elements of A and B
[Diagram: matrices A, B, C, each WIDTH wide]

35 CUDA Implementation – Host Side

// Matrix multiplication on the device
void Mul(const Matrix A, const Matrix B, Matrix C)
{
    int size = A.width * A.width * sizeof(float);

    // Load A and B onto the device
    float *Ad, *Bd, *Cd;
    cudaMalloc((void**)&Ad, size);   // matrices stored in linear order
    cudaMemcpy(Ad, A.elements, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B.elements, size, cudaMemcpyHostToDevice);

    // Allocate C on the device
    cudaMalloc((void**)&Cd, size);

36 CUDA Implementation – Host Side

    // Launch the device computation threads!
    dim3 dimGrid(1);
    dim3 dimBlock(A.width, A.width);
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, Cd, A.width);

    // Read C back from the device
    cudaMemcpy(C.elements, Cd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}

37 CUDA Implementation – Kernel

// Matrix multiplication kernel – thread specification
__global__ void Muld(float* Ad, float* Bd, float* Cd, int width)
{
    // 2D thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // cvalue stores the matrix element computed by this thread
    float cvalue = 0;

38 CUDA Implementation – Kernel

    for (int k = 0; k < width; ++k) {
        float ae = Ad[ty * width + k];   // element k of row ty of A
        float be = Bd[k * width + tx];   // element k of column tx of B
        cvalue += ae * be;
    }

    // Write the matrix to device memory;
    // each thread writes one element
    Cd[ty * width + tx] = cvalue;
}

39 Outline
From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
  Shared memory
  Global memory
CUDA programming tools
The new generation: Fermi GPU

40 Shared Memory
Located inside each streaming multiprocessor
Shared by all threads of one thread block
Entirely software-controlled
Accessing an address takes only 1 clock cycle
(a tiled rewrite of the Example 2 kernel using shared memory is sketched below)
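A hedged sketch (not part of the original deck) of that tiled rewrite, staging TILE × TILE sub-blocks of A and B in shared memory so each global element is read once per block; it assumes width is a multiple of TILE and a launch with TILE × TILE threads per block on a (width/TILE) × (width/TILE) grid:

#define TILE 16

__global__ void MuldShared(float* Ad, float* Bd, float* Cd, int width)
{
    __shared__ float As[TILE][TILE];   // tile of A staged by the whole block
    __shared__ float Bs[TILE][TILE];   // tile of B staged by the whole block

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;
    float cvalue = 0;

    for (int m = 0; m < width / TILE; ++m) {
        As[ty][tx] = Ad[row * width + (m * TILE + tx)];
        Bs[ty][tx] = Bd[(m * TILE + ty) * width + col];
        __syncthreads();               // tiles fully loaded before use

        for (int k = 0; k < TILE; ++k)
            cvalue += As[ty][k] * Bs[k][tx];
        __syncthreads();               // done with tiles before overwriting
    }
    Cd[row * width + col] = cvalue;
}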

41 Shared Memory Organization
G80's shared memory is organized as 16 banks
  addressed in 4-byte words
  bank ID = (4-byte word address) % 16
  adjacent 4-byte addresses map to adjacent banks
  each bank has a bandwidth of 4 bytes per clock cycle
Simultaneous accesses to the same bank cause a bank conflict
  conflicting accesses can only be processed sequentially
  conflicts arise only among threads of the same thread block
[Diagram: banks 0–15; bank 0 holds words 0, 16, 32, …, bank 1 holds 1, 17, 33, …, and so on up to bank 15 with 15, 31, 47, …]

42 Bank Addressing Examples
No bank conflicts:
  linear addressing, stride == 1 (s = 1)
No bank conflicts:
  random 1:1 permutation

__shared__ float shared[256];
float foo = shared[threadIdx.x];

43 Bank Addressing Examples
2-way bank conflicts:
  linear addressing, stride == 2 (s = 2)
  __shared__ float shared[256];
  float foo = shared[2 * threadIdx.x];

8-way bank conflicts:
  linear addressing, stride == 8 (s = 8)
  __shared__ float shared[256];
  float foo = shared[8 * threadIdx.x];
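A hedged aside (not on the slides): when a block walks a 2-D shared array by column, padding each row with one extra word is a common way to break the stride-16 pattern that would otherwise land every access in the same bank:

__shared__ float tile[16][16 + 1];   // extra column shifts each row by one bank
float foo = tile[threadIdx.x][0];    // a column access now touches 16 distinct banks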

44 Outline
From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
  Shared memory
  Global memory
CUDA programming tools
The new generation: Fermi GPU

45 Global Memory
Global memory has no cache on G80/GT200
  ~500-cycle access latency
  it very easily becomes the performance bottleneck
  optimizing global memory access is the key to performance!
Constant memory and texture memory have small caches

46 Coalesced Global Memory Accesses

47 Non-Coalesced Global Memory Accesses

48 Non-Coalesced Global Memory Accesses

49 Coalescing on 1.2 and Higher Devices
Global memory accesses by the threads of a half-warp can be coalesced when the words accessed by all threads lie in the same segment of size:
  32 bytes if all threads access 8-bit words
  64 bytes if all threads access 16-bit words
  128 bytes if all threads access 32-bit or 64-bit words
This holds for any pattern of addresses requested by the half-warp
  including patterns where multiple threads access the same address
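A hedged sketch (illustrative kernels; bounds checks omitted) of access patterns that do and do not coalesce:

__global__ void copy_coalesced(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];   // consecutive threads touch consecutive 32-bit words:
                      // one transaction per half-warp
}

__global__ void copy_strided(float *out, const float *in, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];   // stride > 1 spreads the half-warp across segments:
                      // more transactions, lower effective bandwidth
}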

50 Example of New Coalescing Rules
[Diagram: threads 0–15 of a half-warp access consecutive 4-byte words that start near the end of one 128B segment and run into the next; the transaction for the first segment is reduced to 32B]
Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, and 128 bytes for 32-, 64-, and 128-bit data.

51 Outline
From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
  Shared memory
  Global memory
CUDA programming tools
The new generation: Fermi GPU

52 Downloading the CUDA Software
http://www.nvidia.cn/object/cuda_get_cn.html
CUDA driver
  the hardware driver
CUDA toolkit
  the development tools
CUDA SDK
  sample programs and dynamic link libraries
CUDA Visual Profiler
  a profiling tool

53 Compiling a CUDA Program
CUDA source files are processed by nvcc
  nvcc is a compiler driver
nvcc outputs:
  PTX (Parallel Thread eXecution) code
    a virtual ISA covering multiple generations of GPU hardware
    just-in-time compiled by the CUDA runtime for the actual device
  GPU binary
    a device-specific binary object
  plus standard C/C++ CPU code, passed on to the host compiler
[Diagram: C/C++ CUDA application (standard C code with explicit parallelism) → NVCC → generic PTX code and C/C++ CPU code; the CUDA runtime JIT-compiles the PTX for G80, GT200, or other GPUs, or a specialized CUDA binary targets one device]
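As a hedged example of driving nvcc from the command line (the file name and flags are illustrative; -arch selects the target compute capability and -ptx stops after PTX generation):

nvcc -O2 -arch=sm_13 vectorAdd.cu -o vectorAdd
nvcc -ptx vectorAdd.cu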

54 Debugging
make dbg=1
  compiles the CPU code in debug mode
  the program can be run under a debugger (e.g. gdb or Visual Studio)
  but intermediate results of the GPU code cannot be inspected
make emu=1
  runs the program sequentially on the CPU in emulation mode
  printf() can print intermediate results
  execution is essentially sequential
  but races between threads cannot be reproduced
  and floating-point results may differ slightly from the GPU's

55 Checking Resource Usage
Compile with the -cubin flag
Inspect the "code" section of the resulting .cubin file:

architecture {sm_10}
abiversion {0}
modname {cubin}
code {
    name = BlackScholesGPU
    lmem = 0     // per-thread local memory
    smem = 68    // per-thread-block shared memory
    reg = 20     // per-thread registers
    bar = 0
    bincode { ... }
}
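A hedged alternative (a standard nvcc option, though not shown on the slide): passing the verbose flag through to ptxas prints the same per-kernel register, shared-memory, and local-memory statistics at compile time:

nvcc --ptxas-options=-v foo.cu -o foo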

56 CUDA Debugger: cuda-gdb
Released with CUDA 2.2
A ported version of the GNU debugger, gdb
  Red Hat Enterprise Linux 5.x, 32-bit and 64-bit
Compiling with debug support:
  nvcc -g -G foo.cu -o foo
Single-step individual warps ("next" or "step")
  advances all threads of the same warp
Display device memory in the device kernel
  data residing in the various GPU memory regions, such as shared, local, and global memory
Switch to any CUDA block/thread:
  thread <<<(BX,BY),(TX,TY,TZ)>>>
Break into running applications
  Ctrl+C breaks into hanging programs
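A hedged sketch of a session (the kernel name is illustrative; the thread-switch syntax follows the slide):

$ nvcc -g -G foo.cu -o foo
$ cuda-gdb ./foo
(cuda-gdb) break myKernel
(cuda-gdb) run
(cuda-gdb) thread <<<(0,0),(1,0,0)>>>
(cuda-gdb) next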

57 "Nexus" GPU/CPU Development Suite
Major components:
  Nexus Debugger
    source-code debugger for GPU source code (CUDA, DirectCompute, HLSL, …)
  Nexus Analyzer
    system-wide event viewer for both GPU and CPU events
  Nexus Graphics Inspector
    for frame-based, deep inspection of textures and geometry
Full integration with Visual Studio
Windows 7/Vista
Available October 29, 2009

58 Outline
From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
  Shared memory
  Global memory
CUDA programming tools
The new generation: Fermi GPU

59 3 Major Generations of CUDA GPUs

                                  G80                GT200              GT300 (Fermi)
CUDA cores                        128                240                512
Process (nm)                      90                 65                 40
Transistors                       681 million        1.4 billion        3.0 billion
Double-precision floating point   none               30 FMA ops/clock   256 FMA ops/clock
Single-precision floating point   128 MAD ops/clock  240 MAD ops/clock  512 FMA ops/clock
Warp schedulers / SM              1                  1                  2
Special function units / SM       2                  2                  4
CUDA cores / SM                   8                  8                  32
Shared memory / SM                16KB               16KB               configurable 48KB or 16KB
L1 cache / SM                     none               none               configurable 16KB or 48KB
L2 cache                          none               none               768KB
Concurrent kernels                no                 no                 up to 16
Load/store address width          32-bit             32-bit             64-bit

60 Fermi GPU Architecture
[Die diagram: arrays of CUDA cores (SPs) with LD/ST units and special function units, the thread scheduler, the 768KB L2 cache, and the GDDR5 DRAM interfaces]

61 Third-Generation Streaming Multiprocessor
32 CUDA cores (SPs) per SM, 4× the count in GT200
8× the peak double-precision floating-point performance of GT200
Dual warp scheduler that schedules and dispatches two warps of 32 threads concurrently
Memory
  16 SMs × 128KB = 2048KB of register file
  16 SMs × 64KB = 1024KB of combined L1 cache/shared memory, with configurable partitioning
  768KB of L2 cache

62 Dual Warp Scheduler

63 Second-Generation Parallel Thread Execution ISA
64-bit memory address space, with ECC
Unified address space with full C++ support
Optimized execution of OpenCL and DirectCompute
Full IEEE single- and double-precision floating point

64 NVIDIA GigaThread™ Engine
Multi-kernel execution
  10× faster application context switching
  concurrent kernel execution
  out-of-order thread block execution
Two streaming transfer engines
  they work in a pipelined, overlapped manner
  each alone can saturate the PCIe interface

65 Performance

66 Bibliography
NVIDIA, CUDA Programming Guide, NVIDIA, 2008, 2009.
张舒, 褚艳利, 赵开勇, 张钰勃, GPU高性能运算之CUDA (High-Performance GPU Computing with CUDA), 2009.
T. Mattson, et al., Patterns for Parallel Programming, Addison-Wesley, 2005.

67 References: Special Issues on GPU Computing
IEEE, Proceedings of the IEEE, Vol. 96, No. 5, May 2008.
ACM, Queue, Vol. 6, No. 2, March/April 2008.
Elsevier, Journal of Parallel and Distributed Computing, Vol. 68, No. 10, October 2008.

