CUDA Parallel Computing
Presenter: Lin Wei-Chen
Slides
MPI Fundamentals
Overview of Parallel Computing
Introduction to CUDA
Building CUDA with VS2005.NET
CUDA Basics
CUDA Hardware Architecture
CUDA Limitations
My Research
Slides
http://courses.ece.uiuc.edu/ece498/al/Syllabus.html
National Center for High-Performance Computing
http://sites.google.com/a/crypto.tw/cuda-lab/
http://pccluster.nchc.org.tw/main/tutor/09nctu/
http://www.nvidia.com.tw/object/cuda_home_tw.html
NCHC education and training site: https://edu.nchc.org.tw/
http://heresy.spaces.live.com/blog/cns!E0070FB8ECF9015F!3114.entry
http://www.kimicat.com/cuda%E7%B0%A1%E4%BB%8B
MPI Fundamentals
Parallel Computing Introduction
Flynn's Taxonomy
Amdahl's Law
Moore's Law
Finding Concurrency
Flynn's Taxonomy: architectures classified by their instruction and data streams (SISD, SIMD, MISD, MIMD)
Amdahl's Law
Expected speedup from a partial improvement:
Speedup = 1 / ((1 - P) + P/S)
P: proportion of the program that is parallel
S: speedup of the parallel portion
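A worked instance of the formula (the numbers are illustrative, not from the original deck):

\[
P = 0.9,\quad S = 10:\qquad
\text{Speedup} \;=\; \frac{1}{(1 - 0.9) + 0.9/10} \;=\; \frac{1}{0.19} \;\approx\; 5.26
\]

Even as S grows without bound, the speedup stays capped at 1/(1 - P) = 10.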
Moore's Law
The number of transistors on ICs doubles every 18 months.
In the past, CPUs took advantage of Moore's Law to:
Increase clock frequency
Increase exploitation of ILP (Instruction-Level Parallelism)
Result: CPUs get faster, but it is becoming increasingly difficult to keep getting faster.
Overview of Parallel Computing
Illustration of parallel computing
Serial Computing
Parallel computing: two processors
Parallel computing: four processors
N Processors
Performance Development
Parallel Processing Technology
Shared Memory Multiprocessor
Distributed Memory Multiprocessor System
Clustering System
Shared Memory Multiprocessor
A Shared Memory Multiprocessor is also called a Symmetric Multiprocessor, abbreviated SMP.
This architecture connects the system's CPUs, local memory, and I/O devices over a system bus. A single operating system dispatches the work of different threads to whichever CPU is least loaded, sharing the work across processors.
Shared Memory Multiprocessor (cont.)
Distributed Memory Multiprocessor System
A Distributed Memory Multiprocessor System is also called a Massively Parallel Processor, abbreviated MPP.
In this architecture a single computer contains many CPUs, each with its own independent local memory; the CPUs can communicate with one another only through message passing.
Distributed Memory Multiprocessor System (cont.)
Clustering System
A Clustering System consists of several independent computers connected by a high-speed network into one large system; each computer has its own CPU, local memory, and operating system. Because every machine is independent, large amounts of data exchange are unnecessary: information is exchanged over the high-speed network only when needed.
Parallel computing on clustering systems today follows the message-passing model, using the TCP/IP protocol as the communication channel. The two common frameworks are:
Parallel Virtual Machine
Message Passing Interface
Clustering System (cont.)
Parallel Virtual Machine
Parallel Virtual Machine provides a set of Application Programming Interfaces (APIs) that let users develop parallel programs intuitively and efficiently on existing hardware. It treats the heterogeneous operating systems in the system as a single parallel computer, transparently handling all message passing, data conversion, and network task scheduling.
Message Passing Interface
Message Passing Interface defines a standard message-passing interface. Unlike PVM, it does not run different operating systems on one virtual platform: it includes no system tasks and does not directly control I/O devices; it acts purely as a communication layer. These characteristics have led many PVM users to move to MPI.
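As a minimal sketch of the message-passing model (this example is not from the original deck), rank 0 sends one integer to rank 1 through the standard MPI interface:

#include <stdio.h>
#include <mpi.h>

/* run with at least two processes, e.g.: mpirun -np 2 ./a.out */
int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}

The processes share no memory; data moves between them only through explicit send/receive pairs, which is exactly the property that lets the same program span the separate machines of a cluster.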
Introduction to CUDA
GPGPU
Using the GPU for applications beyond traditional 3D graphics is generally called GPGPU (General-Purpose computing on Graphics Processing Units).
Suitable problems: mostly those that can be decomposed into a large number of small subproblems that are identical but mutually independent; in that case the GPU can process all of these small problems massively in parallel.
Drawback: traditional GPGPU development goes through existing graphics libraries such as OpenGL or Direct3D, writing shading-language code to coax the shaders into performing the desired computation.
CUDA: "Compute Unified Device Architecture"
Website: http://www.nvidia.com.tw/object/cuda_home_tw.html#
Select an OS
Download the packages
Building CUDA with VS2005.NET
Using VS2005.NET: adding a build rule
Using VS2005.NET (cont.): adding a build rule (cont.)
Using VS2005.NET (cont.)
There are two ways to obtain a rule file:
The CUDA SDK provides an official build rule under "C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\common\scripts", with the filename "nvcc.rules".
The rule provided at the end of the thread http://forums.nvidia.com/index.php?showtopic=30273
Using VS2005.NET (cont.): defining a custom build rule
Using VS2005.NET (cont.)
CUDA: http://forums.nvidia.com/index.php?showtopic=30273
CudaCompile: nvcc.rules
Using VS2005.NET (cont.)
Using VS2005.NET (cont.)
Files containing GPU code use the extension .cu
Using VS2005.NET (cont.): remaining project-setup steps (screenshots)
CUDA Basics
Design Philosophy
GPUs use the transistor budget to get wider
Good for data-parallel computations
CPU vs. GPU comparison
CUDA Performance Test
CPU: computed on the CPU with no parallelization (e.g., no OpenMP), so presumably only one CPU core is used.
GPU: a straightforward implementation through CUDA global memory, with no particular optimization.
GPU Texture: accesses data through a texture bound to CUDA linear memory.
GPU without transfer: the straightforward global-memory implementation, but the measured time excludes copying memory from host to device.
GPU Texture without transfer: the texture version, likewise excluding the host-to-device copy time.
CUDA Performance Test (cont.)
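The "without transfer" timings above can be reproduced with CUDA events. Below is a minimal sketch (the kernel and sizes are illustrative, not from the deck); only the kernel launch sits between the two event records, so the host-to-device copy is excluded:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void scale(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0f;                /* trivial stand-in workload */
}

int main(void) {
    const int n = 1 << 20;
    size_t size = n * sizeof(float);
    float *host = (float*)malloc(size), *dev;
    for (int i = 0; i < n; i++) host[i] = 1.0f;
    cudaMalloc((void**)&dev, size);

    cudaMemcpy(dev, host, size, cudaMemcpyHostToDevice);  /* not timed */

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    scale<<<(n + 255) / 256, 256>>>(dev, n);              /* timed region */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    free(host);
    return 0;
}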
Finding Concurrency
At a high level, algorithms can be decomposed by tasks and data
Task: groups of instructions that can execute in parallel
Data: partitions of the data that can be used independently
Inside tasks, there is parallelism among the instructions
Level tree
Example
int a[8] = {1,2,3,4,5,6,7,8};
int sum = 0;
for (int i = 0; i < 8; i++) {
    sum += a[i];
}
printf("%d\n", sum);   /* sum = 36 */
Example (cont.)
int compute_sum(int stride, int a[]) {
    if (stride >= 8) return a[0];
    for (int i = 0; i < 8; i += stride * 2)
        a[i] = a[i] + a[i + stride];    /* combine one pair per step */
    return compute_sum(stride * 2, a);
}

int main() {
    int sum = 0;
    int a[8] = {1,2,3,4,5,6,7,8};
    sum = compute_sum(1, a);
    printf("%d\n", sum);   /* sum = 36 */
}
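The recursion above maps naturally onto CUDA threads: in each round one thread per pair performs the addition, and the stride doubles. A minimal device version (a sketch; the kernel name and fixed size of 8 are illustrative, not from the deck):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void reduce_sum(int *data, int n) {
    __shared__ int s[8];                  /* one element per thread */
    int tid = threadIdx.x;
    s[tid] = (tid < n) ? data[tid] : 0;
    __syncthreads();
    /* Tree reduction: the same strides as the recursive compute_sum(). */
    for (int stride = 1; stride < n; stride *= 2) {
        if (tid % (2 * stride) == 0 && tid + stride < n)
            s[tid] += s[tid + stride];
        __syncthreads();                  /* finish each round together */
    }
    if (tid == 0) data[0] = s[0];
}

int main(void) {
    int h[8] = {1,2,3,4,5,6,7,8}, *d;
    cudaMalloc((void**)&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    reduce_sum<<<1, 8>>>(d, 8);
    cudaMemcpy(h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h[0]);                 /* prints 36 */
    cudaFree(d);
    return 0;
}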
GeForce 8800
CUDA Architecture
A CUDA program is split between:
Host (CPU)
Device (GPU)
CUDA: C with no shader limitations
Integrated host + device application C program
Serial or modestly parallel parts in host C code
Highly parallel parts in device SPMD kernel C code
CUDA Devices and Threads
A compute device:
Is a coprocessor to the CPU (host)
Has its own DRAM
Runs many threads in parallel
Is typically a GPU, but can also be another type of parallel processor
Differences between GPU and CPU threads:
CPU: software threads
GPU: hardware threads (far cheaper to create and switch)
Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads
All threads run the same code (SPMD)
Each thread has an ID that it uses to compute memory addresses and make control decisions
Thread Blocks: Scalable Cooperation
Divide the monolithic thread array into multiple blocks
Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
Threads in different blocks cannot cooperate
Block IDs and Thread IDs
Each thread uses its IDs to decide what data to work on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
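The usual pattern combines the two IDs into a global element index. A small sketch (a hypothetical vector-add kernel, not from the deck):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global index */
    if (i < n) c[i] = a[i] + b[i];    /* guard: the grid may overshoot n */
}

int main(void) {
    const int n = 1024;
    size_t sz = n * sizeof(float);
    float ha[1024], hb[1024], hc[1024], *da, *db, *dc;
    for (int i = 0; i < n; i++) { ha[i] = (float)i; hb[i] = 2.0f * i; }
    cudaMalloc((void**)&da, sz);
    cudaMalloc((void**)&db, sz);
    cudaMalloc((void**)&dc, sz);
    cudaMemcpy(da, ha, sz, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, sz, cudaMemcpyHostToDevice);
    vec_add<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, sz, cudaMemcpyDeviceToHost);
    printf("%f\n", hc[10]);           /* 10 + 20 = 30 */
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}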
CUDA Device Memory Spaces
Each thread can:
R/W per-thread registers
R/W per-thread local memory
R/W per-block shared memory
R/W per-grid global memory
Read per-grid constant memory
Read per-grid texture memory
The host can R/W global, constant, and texture memory
Parallel Memory Sharing
Local memory (per thread):
Private to each thread
Auto variables, register spill
Slow
Shared memory (per block):
Shared by the threads of the same block
Inter-thread communication
Global memory (per application):
Shared by all threads
Inter-grid communication
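Roughly where each space shows up in code (the declarations below are an illustrative sketch, not from the deck):

#include <cuda_runtime.h>

__constant__ float coef[16];              /* per-grid constant memory */

__global__ void spaces_demo(float *out) { /* out points into global memory */
    __shared__ float tile[64];            /* per-block shared memory */
    float r = coef[threadIdx.x % 16];     /* r lives in a register */
    tile[threadIdx.x] = r;
    __syncthreads();
    out[threadIdx.x] = tile[63 - threadIdx.x];
}

int main(void) {
    float h[16] = {1.0f};                 /* host sets constant memory */
    float *d_out;
    cudaMemcpyToSymbol(coef, h, sizeof(h));
    cudaMalloc((void**)&d_out, 64 * sizeof(float));
    spaces_demo<<<1, 64>>>(d_out);
    cudaFree(d_out);
    return 0;
}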
CUDA Device Memory Allocation
cudaMalloc()
Allocates an object in device global memory
Requires two parameters:
Address of a pointer to the allocated object
Size of the allocated object
cudaFree()
Frees an object from device global memory
Takes the pointer to the freed object
CUDA Device Memory Allocation (cont.)
Example:
int width = 32;
float* Array;
int size = width * width * sizeof(float);
cudaMalloc((void**)&Array, size);
. . . .
cudaFree(Array);
CUDA Host-Device Data Transfer
cudaMemcpy()
Memory data transfer
Requires four parameters:
Pointer to destination
Pointer to source
Number of bytes copied
Type of transfer:
Host to Host
Host to Device
Device to Host
Device to Device
Asynchronous transfer
CUDA Host-Device Data Transfer (cont.)
Example:
const int width = 32;
float* Array;
float HostArray[width * width];
int size = width * width * sizeof(float);
cudaMalloc((void**)&Array, size);
cudaMemcpy(Array, HostArray, size, cudaMemcpyHostToDevice);
. . .
cudaMemcpy(HostArray, Array, size, cudaMemcpyDeviceToHost);
cudaFree(Array);
CUDA Function Declarations
                                 Executed on the:   Only callable from the:
__device__ float DeviceFunc()    device             device
__global__ void  KernelFunc()    device             host
__host__   float HostFunc()      host               host
__global__ defines a kernel function; it must return void
__device__ and __host__ can be used together
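A small sketch of the qualifiers in use (the function names are illustrative, not from the deck):

__device__ float square(float x) { return x * x; }  /* callable from device code only */

__global__ void square_all(float *v, int n) {       /* launched from the host */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = square(v[i]);
}

__host__ __device__ float twice(float x) { return 2.0f * x; }  /* compiled for both sides */

/* host-side launch: square_all<<<(n + 255) / 256, 256>>>(d_v, n); */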
Language Extension: Built-in Variables
dim3 gridDim: dimensions of the grid in blocks
dim3 blockDim: dimensions of the block in threads
dim3 blockIdx: block index within the grid
dim3 threadIdx: thread index within the block
Device Runtime Component: Mathematical Functions
Some mathematical functions (e.g., sinf(x)) have a less accurate but faster device-only version (e.g., __sinf(x)):
__powf
__logf, __log2f, __log10f
__expf
__sinf, __cosf, __tanf
Device Runtime Component: Synchronization Function
void __syncthreads();
Synchronizes all threads in a block
Once all threads have reached this point, execution resumes normally
Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory
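A classic illustration of the barrier (a sketch, not from the deck): reversing an array through shared memory. The barrier separates every thread's write from reads of elements written by other threads, avoiding the RAW hazard:

__global__ void reverse64(int *d) {
    __shared__ int s[64];
    int t = threadIdx.x;
    s[t] = d[t];        /* each thread writes one shared element */
    __syncthreads();    /* all writes complete before any read below */
    d[t] = s[63 - t];   /* read an element written by another thread */
}

/* launch with exactly one 64-thread block: reverse64<<<1, 64>>>(d_data); */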
CUDA Hardware Architecture
Processing Units
The processing units inside an NVIDIA GPU:
SP (Streaming Processor)
SM (Streaming Multiprocessor)
TPC (Texture Processing Cluster)
In the G80/G92 architecture there are 128 SPs in total: groups of 8 SPs form 16 SMs, and pairs of SMs form 8 TPCs.
Processing Units (cont.)
Processing Units (cont.)
Warps and Blocks in an SM
When the device actually executes, the block is the unit of work: blocks are assigned to SMs for computation.
The threads of a block are grouped into units of 32 threads, called warps.
Warp grouping is performed automatically by the SM, assigning consecutive threads to the same warp.
At any one moment an SM executes only one warp of one block.
Warp Scheduling
The following is an example of warp scheduling:
Transparent Scalability
Hardware is free to assign blocks to any processor at any time
A kernel scales across any number of parallel processors
CUDA Limitations
CUDA Limitations
1. Very new technology
2. Tied to NVIDIA graphics cards
3. No support for double precision
4. Debugging is cumbersome
5. Memory allocation often lands on memory that is already in use
6. Recursion cannot be used
My Research
Master/Slave Architecture
The system combines two components: during computation, a single server corresponds to multiple clients. The server dynamically partitions the computation into segments and distributes the work to the clients using the TCP/IP protocol, so the work distribution forms a star-like topology.
Model
Model (cont.)
Overall Architecture
Division of Labor
CPU: message passing, search
GPU: compute arbitrage
Example
Theorem 8: If C and P are rationally determined American call and put prices, then C and P are convex functions of the exercise price X. For three otherwise identical calls with strike prices X1 < X2 < X3:
C(X2) <= lambda * C(X1) + (1 - lambda) * C(X3)
where lambda = (X3 - X2) / (X3 - X1), so that X2 = lambda * X1 + (1 - lambda) * X3.
Remark: The above arguments can also be applied to European options.
Robert C. Merton (1973)
Example (cont.)
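Theorem 8 gives a mechanical no-arbitrage test over triples of strikes, which is the kind of check the GPU can run for many triples in parallel. A minimal host-side sketch of the test itself (the function name and quotes are illustrative, not from the slides):

#include <stdio.h>

/* Returns 1 if C(X2) > lambda*C(X1) + (1-lambda)*C(X3), i.e. the
 * convexity bound is violated and a butterfly arbitrage exists. */
int butterfly_arbitrage(double X1, double X2, double X3,
                        double C1, double C2, double C3) {
    double lambda = (X3 - X2) / (X3 - X1);
    return C2 > lambda * C1 + (1.0 - lambda) * C3;
}

int main(void) {
    /* Illustrative quotes at strikes 90/100/110: bound is 9.5, C2 = 9,
     * so no arbitrage and the program prints 0. */
    printf("%d\n", butterfly_arbitrage(90, 100, 110, 14.0, 9.0, 5.0));
    return 0;
}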
Thank you