CUDA 超大规模并行程序设计赵开勇 zhao.kaiyong@gmail.com http://www.comp.hkbu.edu.hk/~kyzhao http://blog.csdn.net/openhero 香港浸会大学计算机系浪潮GPU高性能开发顾问.

Slides:

Advertisements

Similar presentations

1 I/O 设备访问方式和类型. 2 Overview n The two main jobs of a computer: l I/O (Input/Output) l processing n The control of devices connneted to the computer is.

Advertisements

基本概論 Basic concepts.

第 2 章中央處理單元.

DATE: 14/10/2009 陳威宇格網技術組雲端運算相關應用 (Based on Hadoop)

Foundations of Computer Science

Memory Pool ACM Yanqing Peng.

Performance Evaluation

操作系统结构.

最新計算機概論第3章計算機組織.

Oracle 并行服务器介绍

Operating System Process Management - 4 Monday, August 11, 2008.

Leftmost Longest Regular Expression Matching in Reconfigurable Logic

C# 程式設計第一部分第1-4章 C# 程式設計 - 南華大學資管系.

CH.2 Introduction to Microprocessor-Based Control

FC OB1 FB SFC 操作系统 SFB OBs 结构化编程其它

第6章電腦軟體應用軟體多元程式處理系統軟體記憶體配置作業系統簡介虛擬記憶體作業系統的演進與發展行程管理

第 2 章中央處理單元.

CUDA程序设计.

異質計算教學課程內容「異質計算」種子教師研習營洪士灝國立台灣大學資訊工程學系

Operating System Concepts 作業系統原理 Chapter 3 行程觀念 (Process Concept)

数字系统设计 I Digital System Design I

臺北市立大學資訊科學系(含碩士班) 賴阿福 CS TEAM

第一章 C语言概述.

Chapter 1 用VC++撰寫程式 Text book: Ivor Horton.

PIC16F1827介紹以微控器為基礎之電路設計實務-微處理器實驗室.

并行计算实验上机国家高性能计算中心（合肥）.

Cuda 平行運算機制報告者：林威辰.

Chapter 2. The Graphics Rendering Pipeline 图形绘制流水线

簡易 Visual Studio 2010 C++ 使用手冊

结构化编程 FC OB1 FB SFC 操作系统 SFB OBs 其它

GPU分散式演算法設計與單機系統模擬(第二季)

5 Computer Organization (計算機組織).

C 程式設計— 語言簡介台大資訊工程學系資訊系統訓練班.

Operating System Concepts 作業系統原理 CHAPTER 2 系統結構 (System Structures)

中国散裂中子源小角谱仪的实验数据格式与处理算法报告人：张晟恺中国科学院高能物理研究所 SCE 年8月18日

Operating System Internals and Design principles

Chapter 3 行程觀念 (Process Concept)

微程序控制器刘鹏 Dept. ISEE Zhejiang University

C++ 程式設計— 語言簡介台大資訊工程學系資訊系統訓練班.

ICT RTOS Research Group 胡伟平，王剑

华南理工大学陈虎博士 CUDA编程模型华南理工大学陈虎博士

Introduction to OpenGL (1)

預官考試輔導計算機概論提要 91年12月4日.

第三章项目设定.

KeyStone I DSP[C665x 与 C6678] 视频教程

Programmable Logic Architecture Verilog HDL FPGA Design

重點資料結構之選定會影響演算法選擇對的資料結構讓您上天堂程式.

第七讲网际协议IP.

邹佳恒第十八届全国科学计算与信息化会议 • 威海，

第9章 DSP集成开发环境CCS 内容提要 CCS是TI公司推出的用于开发DSP芯片的集成开发环境，它采用Windows风格界面，集编辑、编译、链接、软件仿真、硬件调试以及实时跟踪等功能于一体，极大地方便了DSP芯片的开发与设计，是目前使用最为广泛的DSP开发软件之一。本章对CCS开发软件的使用作了详细地介绍。首先，对CCS开发软件作了简要地说明，并介绍了该软件的安装及配置；其次，介绍了CCS的基本操作，包括：CCS的窗口和工具条、文件的编辑、反汇编窗口、存储器窗口、寄存器窗口、观察窗口

簡易 Visual Studio 2005 C++ 使用手冊

第3章認識處理元.

华南理工大学陈虎博士 CUDA例子程序——矩阵乘法华南理工大学陈虎博士

Unit 11.Operating System 11.1 What’s OS 11.2 Related Courses

TinyOS 石万兵 2019/4/6 mice.

The Practical Issues of Sonar Image Processing

易成 Institute of High Energy Physics

Architecture and Systems 研究群報告人：單智君陳昌居鍾崇斌中華民國95年11月30日

中国科学技术大学计算机系陈香兰 2013Fall 第七讲存储器管理中国科学技术大学计算机系陈香兰 2013Fall.

虚拟仪器 virtual instrument

中国科学技术大学计算机系陈香兰 Fall 2013 第三讲线程中国科学技术大学计算机系陈香兰 Fall 2013.

ARM Developer Suite 介绍.

计算机系统结构（2012年春） ----存储层次： Cache基本概念

從 ER 到 Logical Schema ──兼談Schema Integration

OpenMP程序设计 2019/4/25.

第10章存储器接口罗文坚中国科大计算机学院

Race Conditions and Semaphore

MGT 213 System Management Server的昨天，今天和明天

ADX series Configuration

第三章计算机体系结构.

Presentation transcript:

CUDA 超大规模并行程序设计赵开勇 zhao.kaiyong@gmail.com http://www.comp.hkbu.edu.hk/~kyzhao http://blog.csdn.net/openhero 香港浸会大学计算机系浪潮GPU高性能开发顾问

提纲从GPGPU到CUDA 并行程序组织并行执行模型 CUDA基础存储器 CUDA程序设计工具新一代Fermi GPU

Graphic Processing Unit (GPU) 用于个人计算机、工作站和游戏机的专用图像显示设备显示卡 nVidia和ATI (now AMD)是主要制造商 Intel准备通过Larrabee进入这一市场主板集成 Intel

3维图像流水线一帧典型图像 30 frames/s 1M triangles 3M vertices 25M fragments 30M triangles/s 90M vertices/s 750M fragments/s 3维图像流水线

传统GPU架构 Graphics program Vertex processors Fragment processors Pixel operations Output image

GPU的强大运算能力数据级并行: 计算一致性专用存储器通道有效隐藏存储器延时

General Purpose Computing on GPU (GPGPU)

GPGPU 核心思想但是用图形语言描述通用计算问题把数据映射到vertex或者fragment处理器硬件资源使用不充分存储器访问方式严重受限难以调试和查错高度图形处理和编程技巧

G80 GPU Streaming Processor (SP) Streaming Multiprocessor (SM) L2 FB TF Thread Processor Vtx Thread Issue Setup / Rstr / ZCull Geom Thread Issue Pixel Thread Issue Input Assembler Host Streaming Processor (SP) Streaming Multiprocessor (SM)

CUDA: Compute Unified Device Architecture CUDA: 集成CPU + GPU C应用程序通用并行计算模型单指令、多数据执行模式 (SIMD) 所有线程执行同一段代码(1000s threads on the fly) 大量并行计算资源处理不同数据隐藏存储器延时提升计算／通信比例合并相邻地址的内存访问快速线程切换1 cycle@GPU vs. ~1000 cycles@CPU

Evolution of CUDA-Enabled GPUs Compute 1.0: basic CUDA compatibility G80 Compute 1.1: asynchronous memory copies and atomic global operations G84, G86, G92, G94, G96, and G98 Compute 1.2: dramatically improved memory coalescing rules, double the register count, intra-warp voting primitives, atomic shared memory operations GT21X Compute 1.3: double precision GT200

CUDA成功案例

提纲从GPGPU到CUDA 并行程序组织并行执行模型 CUDA基础存储器 CUDA程序设计工具新一代Fermi GPU

并行性的维度 1维 2维 3维 y = a + b //y, a, b vectors P = M  N //P, M, N matrices 3维 CT or MRI imaging a[0] a[1] … a[n] b[0] b[1] b[n] y[0] y[1] y[n] + =  =

并行线程组织结构 Thread: 并行的基本单位 Thread block: 互相合作的线程组 Grid: 一组thread block Cooperative Thread Array (CTA) 允许彼此同步通过快速共享内存交换数据以1维、2维或3维组织最多包含512个线程 Grid: 一组thread block 以1维或2维组织共享全局内存 Kernel: 在GPU上执行的核心程序 One kernel  one grid Host Kernel 1 Kernel 2 Device Grid 1 Block (0, 0) (1, 0) (2, 0) (0, 1) (1, 1) (2, 1) Grid 2 Block (1, 1) Thread (3, 1) (4, 1) (0, 2) (1, 2) (2, 2) (3, 2) (4, 2) (3, 0) (4, 0)

Parallel Program Organization in CUDA Software Hardware SP Thread Thread block SM … GPU Grid

并行线程执行调用kernel function 需要指定执行配置 Threads和blocks具有IDs threadIdx: 1D, 2D, or 3D blockIdx: 1D, or 2D 由此决定相应处理数据 __global__ void kernel(...); dim3 DimGrid(3, 2); // 6 thread blocks dim3 DimBlock(16, 16); // 256 threads per block kernel<<< DimGrid, DimBlock>>> (...);

实例1: Element-Wise Addition //CPU program //sum of two vectors a and b void add_cpu(float *a, float *b, int N) { for (int idx = 0; idx<N; idx++) a[idx] += b[idx]; } void main() ..... fun_add(a, b, N); //CUDA program //sum of two vectors a and b __global__ void add_gpu(float *a, float *b, int N) { Int idx =blockIdx.x* blockDim.x+ threadIdx.x; if (idx < N) a[idx] += b[idx]; } void main() ….. dim3 dimBlock (256); dim3 dimGrid( ceil( N / 256 ); fun_add<<<dimGrid, dimBlock>>>(a, b, N);

提纲从GPGPU到CUDA 并行程序组织并行执行模型 CUDA基础存储器 CUDA程序设计工具新一代Fermi GPU

CUDA Processing Flow

并行线程执行 SM内以(warp即32 threads)为单位并行执行 Warp Warp内线程执行同一条指令 Block 0 Block 1 Block 2 SM内以(warp即32 threads)为单位并行执行 Warp内线程执行同一条指令 Half-warp是存储操作的基本单位 Warp

控制流(Control Flow) 同一warp内的分支语句可能执行不同的指令路径不同指令路径的线程只能顺序执行 N条指令路径→1/N throughput 只需要考虑同一warp即可，不同warp的不同的指令路径不具相关性 G80上使用指令预测技术加速指令执行

控制流(Control Flow) 常见情况: 分支条件是thread ID的函数时, 容易导致分支（divergence） Example with divergence: If (threadIdx.x > 2) { } 在thread block产生两条不同指令路径 Branch granularity < warp size threads 0 and 1与1st warp中其它指令的指令路径不同 Example without divergence: If (threadIdx.x / WARP_SIZE > 2) { } 也在thread block产生两条不同指令路径 Branch granularity is a whole multiple of warp size 同一warp的所有线程具备相同指令路径

线程同步 void __syncthreads(); Barrier synchronization 同步thread block之内的所有线程避免访问共享内存时发生RAW/WAR/WAW 冒险(hazard) __shared__ float scratch[256]; scratch[threadID] = begin[threadID]; __syncthreads(); int left = scratch[threadID -1]; 在此等待，直至所有线程到达才开始执行下面的代码

Dead-Lock with __syncthreads Dead-lock if Some threads have val larger than threshold And others not __global__ void compute(...) { // do some computation for val if( val > threshold ) return; __syncthreads(); // work with val & store it return; }

提纲从GPGPU到CUDA 并行程序组织并行执行模型 CUDA基础存储器 CUDA程序设计工具新一代Fermi GPU

CUDA扩展语言结构 Declspecs Keywords Intrinsics Runtime API Function launch __device__ float filter[N]; __global__ void convolve (float *image) { __shared__ float region[M]; ... region[threadIdx] = image[i]; __syncthreads() image[j] = result; } // Allocate GPU memory void *myimage = cudaMalloc(bytes) // 100 blocks, 10 threads per block foo<<<100, 10>>> (parameters); Declspecs global, device, shared, local, constant Keywords threadIdx, blockIdx threadDim, blockDim Intrinsics __syncthreads Runtime API Memory, symbol, execution management Function launch

存储器空间 R/W per-thread registers R/W per-thread local memory 1-cycle latency R/W per-thread local memory Slow – register spilling to global memory R/W per-block shared memory “__shared__” But bank conflicts may drag down R/W per-grid global memory ~500-cycle latency “__device__” But coalescing accessing could hide latency Read only per-grid constant and texture memories ~500-cycle latency, but cached

GPU Global Memory分配 cudaMalloc() cudaFree() 分配显存中的global memory 两个参数对象数组指针和数组尺寸 cudaFree() 释放显存中的global memory 对象数组指针 int blk_sz = 64; float* Md; int size = blk_sz * blk_sz * sizeof(float); cudaMalloc((void**)&Md, size); … cudaFree(Md);

Host – Device数据交换 cudaMemcpy() Memory data transfer Requires four parameters Pointer to destination Pointer to source Number of bytes copied Type of transfer Host to Host, Host to Device, Device to Host, Device to Device cudaMemcpy(Md, M.elements, size, cudaMemcpyHostToDevice); cudaMemcpy(M.elements, Md, size, cudaMemcpyDeviceToHost);

Only callable from the: CUDA函数定义 Executed on the: Only callable from the: __device__ float DeviceFunc() device __global__ void KernelFunc() host __host__ float HostFunc() __global__ 定义kernel函数必须返回void __device__ 函数不能用&运算符取地址, 不支持递归调用, 不支持静态变量(static variable), 不支持可变长度参数函数调用

CUDA数学函数 pow, sqrt, cbrt, hypot, exp, exp2, expm1, log, log2, log10, log1p, sin, cos, tan, asin, acos, atan, atan2, sinh, cosh, tanh, asinh, acosh, atanh, ceil, floor, trunc, round, etc. 只支持标量运算许多函数有一个快速、较不精确的对应版本以”__”为前缀，如__sin() 编译开关-use_fast_math强制生成该版本的目标码

实例2: 矩阵相乘矩阵数据类型 – 不属于CUDA! 单精度浮点数 width  height个元素矩阵元素在elements中 B 矩阵数据类型 – 不属于CUDA! 单精度浮点数 width  height个元素矩阵元素在elements中 1-D数组存放矩阵数据 Row-major storage typedef struct { int width; int height; float* elements; } Matrix; WM.width = N.heightI A C M.height M.width N.width

实例2: 矩阵相乘 C = A  B of size WIDTH x WIDTH 一个线程处理一个矩阵元素只需要一个thread block 线程载入A的一行和B的一列 A和B的一对相应元素作一次乘法和一次加法 WIDTH A C WIDTH WIDTH WIDTH

CUDA Implementation – Host Side // Matrix multiplication on the device void Mul(const Matrix A, const Matrix B, Matrix C) { int size = A.width  A.width  sizeof(float); // Load M and N to the device float *Ad, *Bd, *Cd; cudaMalloc((void**)&Ad, size); //matrix stored in linear order cudaMemcpy(Ad, A.elements, size, cudaMemcpyHostToDevice); cudaMalloc((void**)&Bd, size); cudaMemcpy(Bd, B.elements, size, cudaMemcpyHostToDevice); // Allocate C on the device cudaMalloc((void**)&Cd, size);

CUDA Implementation – Host Side // Launch the device computation threads! dim3 dimGrid(1); dim3 dimBlock(M.width, M.width); Muld<<<dimGrid, dimBlock>>>(Ad, Bd, Cd, M.width); // Read P from the device copyFromDeviceMatrix(C.elements, Cd); cudaMemCopy(C, Cd, N * size, cudaMemcpyDeviceToHost); // Free device matrices cudaFree(Ad); cudaFree(Bd); cudaFree(Cd); }

CUDA Implementation – Kernel // Matrix multiplication kernel – thread specification __global__ void Muld (float* Ad, float* Bd, float* Cd, int width) { // 2D Thread ID int tx = threadIdx.x; int ty = threadIdx.y; // cvalue is used to store the element of the matrix // that is computed by the thread float cvalue = 0;

CUDA Implementation – Kernel B for (int k = 0; k < width; ++k) { float ae = Ad[ty * width + k]; float be = Bd [tx + k * width]; cvalue += ae * be; } // Write the matrix to device memory; // each thread writes one element Cd[ty * width + tx] = cvalue; WIDTH A C ty WIDTH tx WIDTH WIDTH

提纲从GPGPU到CUDA 并行程序组织并行执行模型 CUDA基础存储器 CUDA程序设计工具新一代Fermi GPU Shared memory Global memory CUDA程序设计工具新一代Fermi GPU

共享存储器(Shared Memory) 设置于streaming multiprocessor内部由一个线程块内部全部线程共享完全由软件控制访问一个地址只需要1个时钟周期

共享存储器结构 … G80的共享存储器组织为16 banks 对同一bank的同时访问导致bank conflict Addressed in 4 bytes Bank ID = 4-byte address % 16 相邻4-byte地址映射相邻banks 每一bank的带宽为4 bytes per clock cycle 对同一bank的同时访问导致bank conflict 只能顺序处理仅限于同一线程块内的线程 Bank 15 Bank 7 Bank 6 Bank 5 Bank 4 Bank 3 Bank 2 Bank 1 Bank 0 00, 16, 32, … 01, 17, 33, … 02, 18, 34, … 03, 19, 35, … 15, 31, 47, … …

Bank Addressing实例 No Bank Conflicts No Bank Conflicts Linear addressing stride == 1 (s=1) No Bank Conflicts Random 1:1 Permutation __shared__ float shared[256]; float foo = shared[threadIdx.x];

Bank Addressing实例 2-way bank conflicts 8-way bank conflicts Linear addressing stride == 2 (s=2) 8-way bank conflicts Linear addressing stride == 8 (s=8) __shared__ float shared[256]; float foo = shared[2 * threadIdx.x]; __shared__ float shared[256]; float foo = shared[8 * threadIdx.x];

提纲从GPGPU到CUDA 并行程序组织并行执行模型 CUDA基础存储器 CUDA程序设计工具新一代Fermi GPU Shared memory Global memory CUDA程序设计工具新一代Fermi GPU

全局内存(Global Memory) 全局内存在G80/G200上没有缓存存取延时非常容易成为性能瓶颈优化是提高性能的关键! Constant memory和texture memory有少量缓存存取延时 400-600 clock cycles 非常容易成为性能瓶颈优化是提高性能的关键!

Coalesced Global Memory Accesses

Non-Coalesced Global Memory Accesses

Non-Coalesced Global Memory Accesses

Coalescing on 1.2 and Higher Devices Global memory access by threads in a half-warp can be coalesced When the words accessed by all threads lie in the same segment of size equal to: 32 bytes if all threads access 8-bit words 64 bytes if all threads access 16-bit words 128 bytes if all threads access 32-bit or 64-bit words Any pattern of addresses requested by the half-warp Including patterns where multiple threads access the same address

Example of New Coalescing Rules Address 0 Thread 0 Address 4 Address … Address 116 Address 120 Address 124 Address 128 Address 172 Address 176 Address 180 Address 184 Address 188 Address 252 Thread 1 Thread 2 Thread 3 Thread … Thread 14 Thread 15 … Segment 0 (128B) Segment 1 (128B) Reduced to 32B Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, 128 bytes for 32-, 64- and 128-bit data.

提纲从GPGPU到CUDA 并行程序组织并行执行模型 CUDA基础存储器 CUDA程序设计工具新一代Fermi GPU Shared memory Global memory CUDA程序设计工具新一代Fermi GPU

下载CUDA软件 http://www.nvidia.cn/object/cuda_get_cn.html CUDA driver 硬件驱动 CUDA toolkit 工具包 CUDA SDK 程序范例及动态链接库 CUDA Visual Profiler 程序剖析工具

C/C++ CUDA Application CUDA程序的编译(compile) CUDA源文件被nvcc处理 nvcc is a compiler driver nvcc输出： PTX (Parallel Thread eXecution) Virtual ISA for multiple GPU hardware Just-In-Time compilation by CUDA runtime GPU binary Device-specific binary object Standard C code With explicit parallelism C/C++ CUDA Application NVCC PTX Code C/C++ CPU Code Generic CUDA Runtime Specialized Other GPUs G80 GT200 CUDA Binary

DEBUG make dbg=1 make emu=1 CPU代码以debug模式编译可以用debugger (e.g. gdb, visual studio)运行但不能检查GPU代码的中间结果 make emu=1 在CPU上以emulation方式顺序运行可以使用printf()打印中间结果基本顺序执行但不能再现线程间的竞争(race)现象浮点运算结果可能有微小的差别

检查资源使用使用-cubin flag编译开关检查.cubin文件的”code”部分 architecture {sm_10} abiversion {0} modname {cubin} code { name = BlackScholesGPU lmem = 0 smem = 68 reg = 20 bar = 0 bincode { 0xa0004205 0x04200780 0x40024c09 0x00200780 … per thread local memory per thread block shared memory per thread registers

CUDA Debugger: cuda-gdb Released with CUDA 2.2 A ported version of GNU Debugger, gdb Red Hat Enterprise Linux 5.x 32-bit and 64-bit Compiling with debug support nvcc –g –G foo.cu –o foo Single-step individual warps (“next” or “step”) Advances all threads in the same warp Display device memory in the device kernel Data that resides in various GPU memory regions such as shared, local, and global memory Switch to any CUDA block/thread thread <<<(BX,BY),(TX,TY,TZ)>>> Breaking into running applications Ctrl+C to break into hanging programs

“Nexus” GPU/CPU Development Suite Major components Nexus Debugger Source code debugger for GPU source code CUDA, DirectCompute, HLSL, … Nexus Analyzer System-wide event viewer for both GPU & CPU events Nexus Graphics Inspector For frame based, deep inspection of textures and geometry Full integration with Visual Studio Windows 7/Vista Available on Oct. 29, 2009

提纲从GPGPU到CUDA 并行程序组织并行执行模型 CUDA基础存储器 CUDA程序设计工具新一代Fermi GPU Shared memory Global memory CUDA程序设计工具新一代Fermi GPU

3 Major Generations of CUDA GPUs GT200 GT300 (Fermi) CUDA cores 128 240 512 Process (nm) 90 45 40 Transistors 681 Million 1.4 Billion 3.0 Billion Double precision floating point capability None 30 FMA ops/clock 256 FMA ops/clock Single precision floating point capability 128 MAD ops/clock 240 MAD ops/clock 512 MAD ops/clock Warp scheduler 1 2 Special function units / SM 4 CUDA cores / SM 8 32 Shared memory / SM 16KB Configurable 48KB or 16KB L1 cache / SM Configurable 16KB or 48KB L2 cache / SM 786KB Concurrent kernels Up to 16 Load/store memory space 32-bit 64-bit

Fermi GPU 架构 CUDA core (SP) GDDR5 DRAM Interface LD/ST unit Special functional unit 786KB L2 cache Thread scheduler

Third Generation Streaming Multiprocessor 32 CUDA cores (SPs) per SM, 4x over G200 8x peak double precision floating point performance over G200 Dual warp scheduler that schedule and dispatch two warps of 32 threads Memory 16x128KB = 2048KB register file 16x64KB=768KB L1 cache/shared memory Configurable partitioning 768KB L2 cache

Dual Warp Scheduler

Second Generation Parallel Thread Execution ISA 64-bit memory address space - with ECC Unified address space with full C++ support Optimized execution of OpenCL and DirectCompute Full IEEE 754-2008 single and double precision floating point numbers

NVIDIA GigaThreadTM Engine Multi-kernel execution 10x faster application context switching Concurrent kernel execution Out-of order thread block execution Two streaming transfer engines Work in a pipelined, overlapped manner Each could saturate the PCIe interface

Performance

参考书目 NVIDIA, CUDA Programming Guide, NVidia, 2008, 2009. 张舒、褚艳利、赵开勇、张钰勃, GPU高性能运算之CUDA, 2009. http://www.hpctech.com/. T. Mattson, et al., Patterns for Parallel Programming, Addison Wesley, 2005.

参考文献 Special issues on GPU computing IEEE, Proceedings of IEEE, Vol. 96, No. 5, May, 2008. ACM, Queue, Vol. 6, No. 2, March/April, 2008. Elsevier, Journal of Parallel and Distributed Computing, Vol. 68, No. 10, October, 2008.