CUDA Parallel Computing Mechanisms. Presenter: 林威辰


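Outline
MPI fundamentals
Overview of parallel computing approaches
Introduction to CUDA
Building CUDA projects with VS2005.NET
CUDA fundamentals
CUDA hardware architecture
Drawbacks of CUDA
My Research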

Slides
http://courses.ece.uiuc.edu/ece498/al/Syllabus.html
National Center for High-Performance Computing
http://sites.google.com/a/crypto.tw/cuda-lab/
http://pccluster.nchc.org.tw/main/tutor/09nctu/
http://www.nvidia.com.tw/object/cuda_home_tw.html
NCHC education and training site: https://edu.nchc.org.tw/
http://heresy.spaces.live.com/blog/cns!E0070FB8ECF9015F!3114.entry
http://www.kimicat.com/cuda%E7%B0%A1%E4%BB%8B

MPI Fundamentals

Parallel Computing Introduction Flynn’s Taxonomy Amdahl’s Law Moore’s Law Finding Concurrency

Flynn’s Taxonomy

Amdahl's Law
Expected speedup from partial improvement: $\text{speedup} = \dfrac{1}{(1 - P) + P/S}$
P: proportion of the program that is parallel
S: speedup of the parallel portion
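As a minimal sketch of how Amdahl's Law is evaluated, the function below computes 1 / ((1 - P) + P / S) for the P and S defined on this slide. The function name `amdahl_speedup` is an illustrative assumption, not something from the slides.

```c
#include <assert.h>

/* Overall speedup predicted by Amdahl's Law (hypothetical helper).
 * p: proportion of the program that is parallel, 0 <= p <= 1
 * s: speedup of the parallel portion, s > 0 */
double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0);
    assert(s > 0.0);
    /* The serial part (1 - p) is unchanged; the parallel part shrinks by s. */
    return 1.0 / ((1.0 - p) + p / s);
}
```

For example, `amdahl_speedup(0.9, 128.0)` gives roughly 9.3: even with 128 processors, a program that is 90% parallel cannot exceed a 10x speedup, because the serial 10% dominates.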

Moore's Law
The number of transistors on ICs doubles roughly every 18 months.
In the past, CPUs have taken advantage of Moore's Law to:
Increase clock frequency
Increase exploitation of ILP (instruction-level parallelism)
Result: CPUs got faster, but it is increasingly difficult to make them faster.

Overview of Parallel Computing Approaches

Parallel computing diagrams

Serial computing

Parallel computing: two processors

Parallel computing: four processors

N Processors

Performance Development

Parallel Processing Technology Shared Memory Multiprocessor Distributed Memory Multiprocessor System Clustering System

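Shared Memory Multiprocessor
A shared memory multiprocessor is also known as a symmetric multiprocessor (SMP). This architecture uses a system bus to connect the system's CPUs, local memory, and I/O devices. A single shared operating system dispatches the work of different threads to the less heavily loaded CPUs, which divides the workload among them.

Shared Memory Multiprocessor (continued)

Distributed Memory Multiprocessor System
A distributed memory multiprocessor system is also known as a massively parallel processor (MPP). In this architecture a single computer contains many CPUs, each with its own independent local memory, and the CPUs can communicate with one another only through message passing.

Distributed Memory Multiprocessor System (continued)

Clustering System
A clustering system consists of several independent computers connected by a high-speed network to form one large system. Each computer has its own CPU, local memory, and operating system. Because every computer is independent, large volumes of data need not be exchanged; information travels over the high-speed network only when necessary. Parallel computing on clustering systems currently relies on message passing, with TCP/IP as the communication protocol. Two approaches are common:
Parallel Virtual Machine (PVM)
Message Passing Interface (MPI)

Clustering System (continued)

Parallel Virtual Machine
Parallel Virtual Machine (PVM) provides a set of application program interfaces (APIs) that let users develop parallel programs intuitively and efficiently on existing hardware. It treats the heterogeneous operating systems in the system as a single parallel computer and transparently handles all message passing, data conversion, and network task scheduling.

Message Passing Interface
Message Passing Interface (MPI) defines a standard message-passing interface. Unlike PVM, it does not make different operating systems run on one virtual platform; it does not include system tasks, and it does not directly control I/O device support. It acts only as a communication interface layer. These characteristics have led many PVM users to switch to MPI.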

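Introduction to CUDA

GPGPU
Using the GPU for applications other than traditional 3D graphics is generally called GPGPU (general-purpose computing on graphics processing units).
Suitable problems: mostly those that can be decomposed into a large number of identical but mutually independent small problems. With GPGPU, these identical small problems are handed to the graphics card's GPU for massively parallel processing.
Drawback: traditional GPGPU development goes through existing graphics libraries such as OpenGL or Direct3D, writing code in a shading language and driving the shaders to coax out the desired computation.

CUDA
"Compute Unified Device Architecture"
Website: http://www.nvidia.com.tw/object/cuda_home_tw.html#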

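Select the OS

Download the packages

Building CUDA Projects with VS2005.NET

Using VS2005.NET: adding rules

Using VS2005.NET (continued): adding rules (continued)

Using VS2005.NET (continued)
There are two ways to choose the rule file:
The CUDA SDK directory provides an official build rule named "nvcc.rules", located under "C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\common\scripts".
The rule provided at the end of this forum post: http://forums.nvidia.com/index.php?showtopic=30273

Using VS2005.NET (continued): custom build rules

Using VS2005.NET (continued)
CUDA: the rule from http://forums.nvidia.com/index.php?showtopic=30273
CudaCompile: nvcc.rules

Using VS2005.NET (continued)

Using VS2005.NET (continued)
Files that contain GPU code use the .cu extension.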

Using VS2005.NET (continued)

CUDA Fundamentals

Design Philosophy
GPUs use the transistor budget to get wider.
Good for data-parallel computations.

CPU vs. GPU comparison

CUDA performance test
CPU: computed on the CPU without OpenMP or similar parallelization, so presumably only one CPU core is used.
GPU: computed simply through CUDA global memory, with no particular optimization.
GPU Texture: memory is accessed through a texture bound to CUDA linear memory.
GPU without transfer: same as GPU, but the timing excludes copying memory from host to device.
GPU Texture without transfer: same as GPU Texture, but the timing excludes copying memory from host to device.

CUDA performance test (continued)

Finding Concurrency
At a high level, algorithms can be decomposed by tasks and data.
Task: groups of instructions that can execute in parallel
Data: partitions in the data that can be used independently
Inside tasks, there is parallelism among the instructions.
Level Tree

Example
int a[8] = {1,2,3,4,5,6,7,8};
int sum = 0;
/* C arrays are zero-based: iterate over indices 0..7 so that all eight values are added. */
for (int i = 0; i < 8; i++) {
    sum += a[i];
}
printf("%d\n", sum);
sum = 36

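Example (continued)
#include <stdio.h>
/* Pairwise (tree) sum: the additions within one pass are independent,
   so they could be carried out in parallel. tid is the current stride. */
int compute_sum(int tid, int a[]) {
    if (tid >= 8) return a[0];
    for (int i = 0; i < 8; i = i + tid * 2)
        a[i] = a[i] + a[i + tid];
    return compute_sum(tid * 2, a);
}
int main(void) {
    int sum = 0;
    int a[8] = {1,2,3,4,5,6,7,8};
    sum = compute_sum(1, a);
    printf("%d\n", sum);
    return 0;
}
sum = 36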
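The pairwise summation in this example can also be written iteratively and for any array length. The sketch below is an illustrative variant, not the presenter's code; the name `tree_sum` is an assumption. Each pass of the outer loop corresponds to one level of the tree, and the additions inside a pass touch disjoint elements, which is what makes them candidates for parallel execution.

```c
#include <assert.h>

/* Tree-style sum of a[0..n-1] (hypothetical helper). Overwrites a[].
 * Pass k adds elements that are `stride` = 2^k apart; the additions
 * within one pass are independent of each other. */
int tree_sum(int *a, int n)
{
    assert(n >= 0);
    for (int stride = 1; stride < n; stride *= 2) {
        for (int i = 0; i + stride < n; i += 2 * stride) {
            a[i] += a[i + stride];
        }
    }
    return n > 0 ? a[0] : 0;
}
```

For eight elements the loop needs three passes (strides 1, 2, 4) instead of the seven dependent additions of the serial loop.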

GeForce 8800

CUDA architecture
CUDA program structure:
Host (CPU)
Device (GPU)

CUDA – C with no shader limitations Integrated host + device app C program Serial or modestly parallel parts in host C code Highly parallel parts in device SPMD kernel C code

CUDA Devices and Threads
A compute device:
Is a coprocessor to the CPU, or host
Has its own DRAM
Runs many threads in parallel
Is typically a GPU but can also be another type of parallel processing device
Differences between GPU and CPU threads:
CPU: software threads
GPU: hardware threads (switching between them is much faster)

Arrays of Parallel Threads A CUDA kernel is executed by an array of threads All threads run the same code (SPMD) Each thread has an ID that it uses to compute memory addresses and make control decisions
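To make the SPMD idea concrete without requiring a GPU, the sketch below simulates a one-dimensional kernel launch on the CPU in plain C. Every simulated thread runs the same function and uses its ID both to compute a memory address and to make a control decision. The names `scale_kernel` and `launch_scale` are illustrative assumptions; a real CUDA kernel would read `blockIdx.x`, `blockDim.x`, and `threadIdx.x` instead of taking them as parameters, and the threads would run concurrently rather than in a loop.

```c
#include <assert.h>

/* Code that every simulated thread executes (same code, different IDs). */
void scale_kernel(int block_idx, int block_dim, int thread_idx,
                  float *data, int n, float factor)
{
    int i = block_idx * block_dim + thread_idx;  /* global thread ID */
    if (i < n) {              /* control decision: skip IDs past the end */
        data[i] *= factor;    /* memory address derived from the ID */
    }
}

/* Sequential stand-in for launching grid_dim blocks of block_dim threads. */
void launch_scale(float *data, int n, float factor, int block_dim)
{
    assert(block_dim > 0);
    int grid_dim = (n + block_dim - 1) / block_dim;  /* blocks needed to cover n */
    for (int b = 0; b < grid_dim; b++) {
        for (int t = 0; t < block_dim; t++) {
            scale_kernel(b, block_dim, t, data, n, factor);
        }
    }
}
```

With n = 10 and a block size of 4, three blocks are launched and the last two thread IDs (10 and 11) fall outside the array, which is why the bounds check is needed.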

Thread Blocks: Scalable Cooperation
Divide the monolithic thread array into multiple blocks.
Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization.
Threads in different blocks cannot cooperate.

Block IDs and Thread IDs
Each thread uses IDs to decide what data to work on.
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
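As a rough illustration of how a thread turns a 2D block ID and a 2D thread ID into the data element it works on, the plain C sketch below maps the IDs to a row-major index in a width x height matrix. The `Dim2` struct is a hypothetical stand-in for CUDA's built-in index types, and `element_index` is an assumed helper name.

```c
#include <assert.h>

typedef struct { int x, y; } Dim2;  /* hypothetical stand-in for CUDA index types */

/* Row-major index of the element handled by thread `thread_idx` of block
 * `block_idx`, or -1 if that thread falls outside the matrix. */
int element_index(Dim2 block_idx, Dim2 block_dim, Dim2 thread_idx,
                  int width, int height)
{
    assert(width > 0 && height > 0);
    int col = block_idx.x * block_dim.x + thread_idx.x;
    int row = block_idx.y * block_dim.y + thread_idx.y;
    if (col >= width || row >= height) {
        return -1;  /* thread has no element to work on */
    }
    return row * width + col;
}
```

For instance, thread (3, 5) of block (1, 2) with 16 x 16 blocks lands on column 19, row 37 of the matrix.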

CUDA Device Memory Space
Each thread can:
R/W per-thread registers
R/W per-thread local memory
R/W per-block shared memory
R/W per-grid global memory
Read only per-grid constant memory
Read only per-grid texture memory
The host can R/W global, constant, and texture memories.

Parallel Memory Sharing
Local memory (per-thread): private to each thread; holds automatic variables and register spills; slow.
Shared memory (per-block): shared by threads of the same block; used for inter-thread communication.
Global memory (per-application): shared by all threads; used for inter-grid communication.

CUDA Device Memory Allocation
cudaMalloc(): allocates an object in device global memory. Requires two parameters:
Address of a pointer to the allocated object
Size of the allocated object
cudaFree(): frees an object from device global memory. Takes one parameter:
Pointer to the freed object

CUDA Device Memory Allocation (continued)
Example:
int width = 32;
float* Array;
int size = width * width * sizeof(float);
cudaMalloc((void**) &Array, size);
. . . .
cudaFree(Array);

CUDA Host-Device Data Transfer
cudaMemcpy(): memory data transfer. Requires four parameters:
Pointer to destination
Pointer to source
Number of bytes copied
Type of transfer: Host to Host, Host to Device, Device to Host, Device to Device
Asynchronous transfer (available through the separate cudaMemcpyAsync() call)

CUDA Host-Device Data Transfer (continued)
Example:
const int width = 32;
float* Array;
float HostArray[width * width];
int size = width * width * sizeof(float);
cudaMalloc((void**) &Array, size);
/* The destination is the device pointer itself, not its address. */
cudaMemcpy(Array, HostArray, size, cudaMemcpyHostToDevice);
. . .
/* Copy the results back from the device to the host. */
cudaMemcpy(HostArray, Array, size, cudaMemcpyDeviceToHost);
cudaFree(Array);

CUDA Function Declarations
Declaration                        Executed on the:   Only callable from the:
__device__ float DeviceFunc()      device             device
__global__ void  KernelFunc()      device             host
__host__   float HostFunc()        host               host
__global__ defines a kernel function, which must return void.
__device__ and __host__ can be used together.

Language Extension: Built-in Variables
dim3 gridDim; Dimensions of the grid in blocks
dim3 blockDim; Dimensions of the block in threads
dim3 blockIdx; Block index within the grid
dim3 threadIdx; Thread index within the block

Device Runtime Component: Mathematical Functions
Some mathematical functions (e.g. sinf(x)) have a less accurate, but faster device-only version (e.g. __sinf(x)):
__powf
__logf, __log2f, __log10f
__expf
__sinf, __cosf, __tanf

Device Runtime Component: Synchronization Function
void __syncthreads();
Synchronizes all threads in a block.
Once all threads have reached this point, execution resumes normally.
Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory.

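CUDA Hardware Architecture

Processing units
The processing units in an NVIDIA GPU are:
SP (Streaming Processor)
SM (Streaming Multiprocessor)
TPC (Texture Processing Cluster)
In the G80/G92 architecture there are 128 SPs in total. Every 8 SPs form one SM, giving 16 SMs, and every 2 SMs form one TPC, giving 8 TPCs.

Processing units (continued)

Processing units (continued)

Warps and blocks in an SM
When the device actually executes, the block is the unit of execution: blocks are assigned to SMs for computation.
Threads within a block are executed in units of warps; 32 threads form one warp.
The SM groups threads into warps automatically, in consecutive order. At any one time an SM executes only one warp of one block.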
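The warp grouping described on this slide reduces to simple integer arithmetic. The sketch below assumes the 32-thread warp size stated above and consecutive grouping by linear thread ID; the helper names are illustrative assumptions, not CUDA API functions.

```c
#include <assert.h>

enum { WARP_SIZE = 32 };  /* threads per warp, as stated on the slide */

/* Number of warps needed for a block; a partly filled last warp still counts. */
int warps_per_block(int threads_per_block)
{
    assert(threads_per_block > 0);
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Warp that a thread belongs to, given its linear ID within the block. */
int warp_of_thread(int linear_tid)
{
    assert(linear_tid >= 0);
    return linear_tid / WARP_SIZE;
}

/* Position of the thread inside its warp (0..31). */
int lane_of_thread(int linear_tid)
{
    assert(linear_tid >= 0);
    return linear_tid % WARP_SIZE;
}
```

A block of 100 threads therefore occupies 4 warps, the last of which is only partly filled; choosing block sizes that are multiples of 32 avoids that waste.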

Warp scheduling
The following is an example of warp scheduling.

Transparent Scalability Hardware is free to assign blocks to any processor at any time A kernel scales across any number of parallel processors

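Drawbacks of CUDA

Drawbacks of CUDA
1. Too new
2. Tied to the graphics card
3. No support for double
4. Debugging is cumbersome
5. Memory allocation often grabs memory that is already in use
6. Recursion cannot be used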

My Research

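The master/slave architecture combines two components. During computation, one server serves multiple clients. The server dynamically partitions the work into multiple computation segments and distributes them to the clients over TCP/IP, so the work distribution resembles a star topology.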
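The slides do not say how the server sizes its computation segments, so the following is only a sketch under a simplifying assumption: a static, nearly even split of `total` work items into `parts` half-open ranges, one per client. The `Segment` type and the `work_segment` function are hypothetical names; a dynamic scheme would hand out further segments as clients finish.

```c
#include <assert.h>

typedef struct { long begin, end; } Segment;  /* half-open range [begin, end) */

/* Segment number `index` when [0, total) is cut into `parts` nearly equal
 * pieces; the first (total % parts) segments get one extra item. */
Segment work_segment(long total, int parts, int index)
{
    assert(total >= 0 && parts > 0 && index >= 0 && index < parts);
    long base = total / parts;
    long extra = total % parts;
    Segment s;
    s.begin = index * base + (index < extra ? index : extra);
    s.end = s.begin + base + (index < extra ? 1 : 0);
    return s;
}
```

The segments are contiguous and cover the whole range exactly once, so the server can send each client just the two numbers `begin` and `end`.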

Model

Model (continued)

Overall architecture

Division of labor
CPU: message passing, search
GPU: compute arbitrage

Example
Theorem 8: If C and P are rationally determined American call and put prices, then C and P are convex functions of the exercise price (X).
For three otherwise identical calls with strike prices $X_1 < X_2 < X_3$, convexity means $C(X_2) \le \lambda C(X_1) + (1 - \lambda) C(X_3)$, where $\lambda = (X_3 - X_2)/(X_3 - X_1)$.
Remark: The above arguments can also be applied to European options.
Robert C. Merton (1973)
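Convexity in the exercise price gives a directly checkable condition, which is the kind of test a per-option arbitrage computation could run. The sketch below assumes three otherwise identical calls with strikes x1 < x2 < x3 and prices c1, c2, c3, and flags a violation when c2 exceeds the weighted average lambda * c1 + (1 - lambda) * c3 with lambda = (x3 - x2) / (x3 - x1). The function name and interface are illustrative assumptions and are not taken from the presenter's system.

```c
#include <assert.h>

/* Returns 1 if the middle call is priced above the convex combination of
 * its neighbours (a convexity violation), 0 otherwise.
 * Requires x1 < x2 < x3. */
int violates_convexity(double x1, double x2, double x3,
                       double c1, double c2, double c3)
{
    assert(x1 < x2 && x2 < x3);
    double lambda = (x3 - x2) / (x3 - x1);  /* weight so that x2 = lambda*x1 + (1-lambda)*x3 */
    return c2 > lambda * c1 + (1.0 - lambda) * c3;
}
```

With strikes 90, 100, and 110 the weight is 0.5, so prices of 12, 6, and 2 satisfy the condition (6 is at most 7), while prices of 12, 8, and 2 violate it. Each triple can be checked independently of the others, which is why the work parallelizes well.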

Example (continued)

Thank you