Multicore Architecture and Programming
Yang Quansheng (杨全胜)
http://www.njyangqs.com/
Department of Computer Science, Southeast University Chengxian College
Programming with OpenMP
Contents: What is OpenMP? · Parallel regions · Loop parallelization · Data scoping and protecting data · Explicit synchronization · The schedule clause · Other useful constructs and clauses
What is OpenMP
OpenMP (Open specifications for Multi Processing) is an API for developing multithreaded programs:
- a multiprocessor, multithreaded parallel programming model for shared-memory and distributed shared-memory machines;
- an application programming interface (API) that can be used to explicitly direct multithreaded, shared-memory parallelism.
OpenMP is highly portable and supports several programming languages. It is available on many platforms, including most Unix-like systems and the Windows NT family (Windows 2000, Windows XP, Windows Vista, and so on).
What is OpenMP
- Provides a set of compiler directives for multithreaded programming -- instructions embedded in the source code
- Makes it easy to create threaded Fortran and C/C++ code
- Supports the data-parallel model
- Lets you convert a serial program into a parallel one step by step (incremental parallelism), reducing the programmer's burden
- Supports both coarse-grained and fine-grained parallelism
- Supports and standardizes loop-level parallelism

Introduce OpenMP by stating that OpenMP, Open specifications for Multi Processing, is an application programming interface (API) that was formulated in 1997. It is used for writing portable, multithreaded applications. OpenMP makes it easy to create threaded Fortran and C/C++ programs. Unlike Windows threads, OpenMP does not require you to create, synchronize, load-balance, and destroy threads. OpenMP is not formally accepted as an industry standard for programming, but it is used widely. OpenMP provides a set of compiler directives -- instructions embedded in source code -- that inform the compiler about the intent of compilation. For example, the #include <stdio.h> directive tells the compiler to include the stdio.h file and compile the code using the contents of this header file. Earlier, shared-memory platform vendors had their own sets of compiler directives to implement shared-memory parallelism. OpenMP standardizes almost 20 years of compiler-directed threading experience by combining common functionality from various vendors into a portable API. OpenMP provides a special syntax that programmers can use to create multithreaded code, and it lets you combine serial and parallel code in a single source. If your compiler does not recognize OpenMP directives, it ignores the directives and the code enclosed within them, yielding a serial executable. OpenMP also supports incremental parallelism: you first thread one segment of the code and test it for better performance, then successively modify and test other code segments. This lets you give priority to the most critical hotspots; if threading results in worse performance, you can remove the new OpenMP directives and proceed to the next code segment. OpenMP also supports both coarse-grained and fine-grained parallelism, and it supports and standardizes loop-level parallelism: special syntax parallelizes loops where performance hotspots or bottlenecks have been identified, so that threads can execute different iterations of the same loop simultaneously. For participants who want to learn more about OpenMP, recommend that they visit http://www.openmp.org.
What is OpenMP (combined C/C++ and Fortran)
A sampler of OpenMP syntax -- directives, runtime calls, and environment variables:
C$OMP FLUSH · #pragma omp critical · C$OMP THREADPRIVATE(/ABC/) · CALL OMP_SET_NUM_THREADS(10) · call omp_test_lock(jlok) · C$OMP parallel do shared(a, b, c) · C$OMP MASTER · call OMP_INIT_LOCK(ilok) · C$OMP ATOMIC · C$OMP SINGLE PRIVATE(X) · setenv OMP_SCHEDULE "dynamic" · C$OMP PARALLEL DO ORDERED PRIVATE(A, B, C) · C$OMP ORDERED · C$OMP PARALLEL REDUCTION(+: A, B) · C$OMP SECTIONS · #pragma omp parallel for private(A, B) · !$OMP BARRIER · C$OMP PARALLEL COPYIN(/blk/) · C$OMP DO lastprivate(XX) · Nthrds = OMP_GET_NUM_PROCS() · omp_set_lock(lck)
http://www.openmp.org -- the current spec is OpenMP 3.1 (combined C/C++ and Fortran)
What is OpenMP -- OpenMP* structures
- The fork-join model
- Loop parallelization
- Data environment constructs
- Synchronization constructs
- An extensive API for finer control
What is OpenMP -- the programming model
Fork-join parallelism:
- The master thread spawns a team of threads as needed.
- Parallelism is added incrementally: the serial program evolves into a parallel program.
Fork-Join Model: Master Thread -> Parallel Regions (FORK ... JOIN)

Introduce the slide by saying that OpenMP is based on the fork-join model, which facilitates parallel programming. Every threading activity in OpenMP follows the fork-join model. Explain the fork-join model with the help of the figure on the slide: all programs based on the fork-join model begin as a single process. In the figure, the application starts executing serially with a single thread called the master thread. The execution of parallel regions of a program following the fork-join model proceeds through two stages. Fork: when the master thread reaches a parallel region -- a code segment that must be executed in parallel by a team of threads -- it creates a team of parallel threads, called worker threads, that execute the code in the parallel region concurrently. Join: at the end of the parallel region, the worker threads synchronize and terminate, leaving only the master thread. In an OpenMP implementation, threads reside in a team or thread pool. A thread pool is a common set of threads to which tasks are assigned. The thread assigned to a task completes the task and returns to the pool to wait for the next assignment, without terminating. When threads are assigned to a parallel region that has more work than a previous parallel region, additional threads can be added to the pool at run time. Threads are not destroyed until the end of the last parallel region. A thread pool can prevent your machine from running out of memory, because a small number of threads are created and scheduled to execute tasks in parallel as needed, rather than continuously creating a new thread for each task; continuous creation of threads can result in out-of-memory errors. The fork-join method enables incremental parallelism: you do not need to thread the entire algorithm. For example, if there are three hotspots in the code, you can address them one by one based on their severity.
What is OpenMP -- OpenMP* directive syntax
Most OpenMP constructs are expressed as compiler directives (pragmas) carrying OpenMP semantics. For C and C++, a directive takes the form:

    #pragma omp construct [clause [clause]...]
What is OpenMP -- runtime library functions
The OpenMP runtime library was originally intended for setting and querying the execution environment; it also contains a set of synchronization APIs. To use the library functions, include the header file omp.h.
- omp_get_thread_num(void)
- omp_get_num_threads(void)
The runtime library supports changing and tuning the parallel environment at run time, giving the programmer enough flexibility to control how the program behaves as it runs. Different platforms provide different runtime libraries; Windows provides a dynamic-link library.
Example: Hello World

    #include "stdafx.h"
    #include "omp.h"

    int _tmain(int argc, _TCHAR* argv[])
    {
        printf("Hello from serial.\n");
        printf("Thread number = %d\n", omp_get_thread_num());  // serial execution
        #pragma omp parallel  // begin parallel execution
        {
            printf("Hello from parallel. Thread number = %d\n", omp_get_thread_num());
        }
        printf("Hello from serial again.\n");
        return 0;
    }

Questions to Ask Students
Q: Why is the output different between the two versions of the application that were created and run? Why don't we get multiple prints of the same iteration number in the second version?
A: The for-iteration variable is shared when not included in the scope of the pragma. Thus, each thread increments the single iteration variable from the multiple, concurrent for-loops being executed.
Programming with OpenMP
Contents: What is OpenMP? · Parallel regions · Loop parallelization · Data scoping and protecting data · Explicit synchronization · The schedule clause · Other useful constructs and clauses
Parallel regions
- A parallel region defines a structured block of code to run in parallel.
- Threads are created as the parallel pragma is crossed.
- Threads block at the end of the region (implicit barrier).
- Unless specified otherwise, data is shared among the threads.
Master thread -> threads 1, 2, 3 -> implicit barrier -> master thread.
Defining a parallel region in C/C++:

    #pragma omp parallel
    {
        block
    }
Parallel regions -- how many threads?
Set an environment variable to determine the number of threads used:

    set OMP_NUM_THREADS=4

There is no standard default value for this variable. On many systems the default is
    number of threads = number of processors
and the Intel® compilers use this as their default.
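The precedence among these controls can be shown in a short program. A minimal sketch (not from the slides; assumes an OpenMP-enabled compiler, e.g. gcc -fopenmp): the num_threads clause overrides omp_set_num_threads(), which overrides OMP_NUM_THREADS.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_num_threads(4);             // request 4 threads for later regions

        #pragma omp parallel
        {
            #pragma omp single
            printf("team size = %d\n", omp_get_num_threads());  // likely 4
        }

        #pragma omp parallel num_threads(2) // clause overrides the earlier call
        {
            #pragma omp single
            printf("team size = %d\n", omp_get_num_threads());  // 2
        }
        return 0;
    }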
Programming with OpenMP
Contents: What is OpenMP? · Parallel regions · Loop parallelization · Data scoping and protecting data · Explicit synchronization · The schedule clause · Other useful constructs and clauses
Loop parallelization
- The threads are managed for you: loop iterations are split among the threads.
- The highlighted worksharing construct must appear inside a parallel region.
- The OMP directive must immediately precede the loop.

    #pragma omp parallel
    #pragma omp for
    for (i = 0; i < N; i++) {
        Do_Work(i);
    }

Thread handles left lying around take up memory space. Continuously creating threads without cleaning up the handles when done could lead to a memory leak. Another possibility is that the maximum number of threads supported within a process will be reached and no new threads can be created. Clean up after your threads! The examples in this talk don't show this because they are so small and process exit reclaims the resources automatically.
Loop parallelization -- restrictions on parallelized loops
A loop to be parallelized must have the form:

    for (index = start; index < end; increment_expr)

- The loop body must have a single entry and a single exit.
- The loop variable must be a signed integer.
- The comparison in the loop test may only be one of <, <=, >, >=.
- Each iteration must move the loop variable monotonically toward the loop bound.
- The step must be an integer addition or subtraction, and the amount added or subtracted must be a loop invariant.
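As an illustration (hypothetical loops; the array a and the bound N are assumed to be declared elsewhere), the restrictions above admit the first loop below and reject the commented-out ones:

    int i;

    #pragma omp parallel for         // OK: signed int index, "<" test, "+=" step
    for (i = 0; i < N; i += 2)
        a[i] = 2 * i;

    // These loops violate the restrictions and cannot use "omp for":
    // for (i = 0; i != N; i++)      -- "!=" is not one of <, <=, >, >=
    // for (i = 1; i < N; i *= 2)    -- step must be an integer add or subtract
    // for (i = 0; i < N; i++)
    //     if (a[i] < 0) break;      -- "break" gives the block a second exit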
Loop parallelization

    #pragma omp parallel
    #pragma omp for
    for (i = 1; i < 13; i++)
        c[i] = a[i] + b[i];

The iterations i = 1 ... 12 are divided among the threads, with an implicit barrier at the end of the loop.
- Each thread is assigned an independent set of iterations.
- Threads must wait at the end of the worksharing construct (barrier).

To understand the functioning of the omp for directive, consider the following code example:

    #pragma omp parallel
    #pragma omp for
    for (i = 0; i < 12; i++)
        c[i] = a[i] + b[i];

In this code example, the for-loop has 12 iterations. If there are three worker threads available in the parallel region, the iterations are split among these three threads, each assigned an independent set of iterations. The figure displays a static division of iterations based on the number of threads: the master thread forms a team of three worker threads, and each worker thread performs 4 of the 12 iterations.
Questions to Ask Students
Q: Why is there a barrier at the end of the work-sharing construct?
A: Code following the for-loop may rely on the results of the computations within the for-loop. In serial code, the for-loop completes before the next computation begins. Thus, to remain serially consistent, the barrier at the end of the construct is enforced.
Loop parallelization -- combining pragmas
These two code segments are equivalent:

    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < MAX; i++) {
            res[i] = huge();
        }
    }

    #pragma omp parallel for
    for (i = 0; i < MAX; i++) {
        res[i] = huge();
    }
Loop parallelization -- nested loops
A loop-parallelization directive may be placed before any loop in a nest; the loop immediately following it is parallelized, and the rest of the nest is left unchanged.

    #pragma omp parallel for private(j)
    for (i = 0; i < MAX; i++)
        for (j = 0; j < MAX; j++) {
            printf("i=%d,j=%d", i, j);
        }

    for (i = 0; i < MAX; i++)
        #pragma omp parallel for private(j)
        for (j = 0; j < MAX; j++) {
            printf("i=%d,j=%d", i, j);
        }
Programming with OpenMP
Contents: What is OpenMP? · Parallel regions · Loop parallelization · Data scoping and protecting data · Explicit synchronization · The schedule clause · Other useful constructs and clauses
Data scoping and protecting data -- the data environment
- OpenMP uses a shared-memory programming model; most variables are shared by default.
- Global variables are shared among threads. C/C++: file-scope variables and statics.
- But not every variable is shared:
  - Stack variables (parameters and locals) in functions called from parallel regions are PRIVATE.
  - Loop variables are private. C/C++: the loop variable of the first loop in a nest following #pragma omp for.
[Figure: per-thread stacks (thread 1, thread 2), the shared heap, and the code segment.]

Stack Variables: Stack variables in functions that are called from parallel regions are private. Actual parameters and variables declared locally are placed on the stack. Automatic Variables: Automatic variables within a statement block are private; these are primarily relevant to Fortran. Loop Index Variables: Loop index variables in work-sharing constructs are private. In C/C++, the first loop index variable in nested loops following a #pragma omp for is private.
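A small hedged sketch of these rules (the names global_counter and work are illustrative, not from the slides): the file-scope variable is shared by all threads, while the stack variable inside the called function is private to each thread.

    #include <stdio.h>
    #include <omp.h>

    int global_counter = 0;            // file scope: shared by all threads

    void work(int tid)
    {
        int local = tid * 10;          // stack variable in a called function:
                                       // private -- each thread has its own copy
        printf("thread %d: local = %d\n", tid, local);
    }

    int main(void)
    {
        #pragma omp parallel           // global_counter is visible (and shared)
        work(omp_get_thread_num());    // here; "local" inside work() is not
        return 0;
    }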
Data scoping and protecting data -- data scope attributes
The default scoping can be changed with the following clauses:
- default (shared | none)
- shared(varname, ...) -- gives the listed variables shared scope
- private(varname, ...) -- gives the listed variables private scope
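A hedged example of the clauses in use (names are illustrative; N is assumed to be a compile-time constant). With default(none), every variable used in the region must be listed explicitly, which turns forgotten-scope bugs into compile-time errors:

    float a[N], b[N];
    int   i;
    float scale = 2.0f;

    #pragma omp parallel for default(none) shared(a, b, scale) private(i)
    for (i = 0; i < N; i++)
        b[i] = scale * a[i];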
Data scoping and protecting data -- the private clause
- Each thread gets its own copy of the variable.
- The copies are uninitialized; C++ objects are default-constructed.
- Any value the variable had outside the parallel region is undefined inside it.

    void* work(float* c, int N)
    {
        float x, y;
        int i;
        #pragma omp parallel for private(x, y)
        for (i = 0; i < N; i++) {
            x = a[i];       // a, b: arrays visible in the enclosing scope
            y = b[i];
            c[i] = x + y;
        }
    }

The for-loop iteration variable is PRIVATE by default. Emphasize that private variables are uninitialized when entering the region; thus, x and y are assigned before their values are read. The private variables are destroyed at the end of the construct to which they are scoped.
Data scoping and protecting data -- example: dot product
What is wrong here?

    float dot_prod(float* a, float* b, int N)
    {
        float sum = 0.0;
        #pragma omp parallel for shared(sum)
        for (int i = 0; i < N; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

Details: Standard dot product algorithm -- multiply corresponding elements of 1-D vectors and add all the products into a single value. The "shared" clause is superfluous, since shared is the default; it is used here to emphasize the problem. We can't make sum private because we need its value after the parallel region.
Questions to Ask Students
Q: What is wrong with this code?
A: The shared variable "sum" is being read and updated by multiple threads, and the answer is likely to be incorrect because of this data race.
Data scoping and protecting data -- protecting shared data
Access to modifiable shared data must be protected:

    float dot_prod(float* a, float* b, int N)
    {
        float sum = 0.0;
        #pragma omp parallel for shared(sum)
        for (int i = 0; i < N; i++) {
            #pragma omp critical
            sum += a[i] * b[i];
        }
        return sum;
    }

Details: The "critical" pragma allows only one thread at a time to execute the update of sum.
Programming with OpenMP
Contents: What is OpenMP? · Parallel regions · Loop parallelization · Data scoping and protecting data · Explicit synchronization · The schedule clause · Other useful constructs and clauses
Explicit synchronization -- the OpenMP* critical construct
#pragma omp critical [(lock_name)] defines a critical section around a structured block.

    float R1, R2;
    #pragma omp parallel
    {
        float A, B;
        #pragma omp for
        for (int i = 0; i < niters; i++) {
            B = big_job(i);
            #pragma omp critical (R1_lock)
            consum(B, &R1);
            A = bigger_job(i);
            #pragma omp critical (R2_lock)
            consum(A, &R2);
        }
    }

Threads wait their turn -- only one at a time may call consum() -- which protects R1 and R2 from race conditions. Naming the critical constructs is optional, but it improves performance.
The build points reveal that critical constructs can be named. Without names, every construct shares the same (anonymous) name, and only one thread may execute inside any of them at a time. Thus, in the example, since R1 and R2 are unrelated and never aliased, naming the constructs allows one thread to be in each at the same time.
Explicit synchronization -- the OpenMP* reduction clause

    reduction (op : list)

- The variables in "list" must be shared in the enclosing parallel region.
- Inside a parallel or loop-worksharing construct:
  - A PRIVATE copy of each list variable is created and initialized according to "op".
  - The copies are updated locally by the threads.
  - At the end of the construct, the local copies are combined with "op" into a single value, which is then combined with the original SHARED variable.

The operation must be associative. Different operations are defined for C and Fortran, depending on the intrinsics available in each language. The private copies of the list variables are initialized with a value that depends on the operation (the initial values are shown two slides ahead).
Explicit synchronization -- reduction example

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++) {
        sum += a[i] * b[i];
    }

Each thread gets a local copy of sum; all the local copies are then added together and stored in the "global" variable.
Explicit synchronization -- C/C++ reduction operations
A range of associative operations can be used for reduction; the initial value is the identity of the operation.

    Operand   Initial value        Operand   Initial value
    +         0                    &         ~0
    *         1                    |         0
    -         0                    &&        1
    ^         0                    ||        0
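As a hedged illustration of a non-arithmetic reduction (names are illustrative; a and N assumed declared elsewhere), the private copies of all_positive below start at 1 -- the identity for && in the table above -- and are combined with && at the end of the loop:

    int all_positive = 1;
    #pragma omp parallel for reduction(&&:all_positive)
    for (int i = 0; i < N; i++)
        all_positive = all_positive && (a[i] > 0);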
Numerical integration example

    f(x) = 4.0 / (1 + x^2),   and   ∫₀¹ 4.0/(1 + x²) dx = π

    static long num_steps = 100000;
    double step, pi;

    void main()
    {
        int i;
        double x, sum = 0.0;
        step = 1.0 / (double) num_steps;
        for (i = 0; i < num_steps; i++) {
            x = (i + 0.5) * step;
            sum = sum + 4.0 / (1.0 + x * x);
        }
        pi = step * sum;
        printf("Pi = %f\n", pi);
    }

[Figure: plot of f(x) = 4/(1 + x²) over 0 ≤ x ≤ 1.]

Questions for Discussion: Where is the bulk of the work found in the code? Where should we focus attention if we were to thread this code?
Answers: The for-loop. Now ask the participants to thread this serial code.
This is a serial version of the source code. It uses a "sum" variable that could give a clue to an efficient solution, that is, a local partial-sum variable that is updated each loop iteration. The code is small and efficient in serial, but will challenge the students to come up with an efficient solution. Of course, efficiency is not one of the goals of such a short module; getting the answer correct with multiple threads is enough of a goal.
Lab 2 -- computing π
Parallelize the numerical integration code with OpenMP:
- Which variables can be shared?
- Which variables need to be private?
- Which variables can use a reduction?

    static long num_steps = 1000000;
    double step, pi;

    void main()
    {
        int i;
        double x, sum = 0.0;
        step = 1.0 / (double) num_steps;
        for (i = 0; i < num_steps; i++) {
            x = (i + 0.5) * step;
            sum = sum + 4.0 / (1.0 + x * x);
        }
        pi = step * sum;
        printf("Pi = %12.9f\n", pi);
    }

Details: This is the serial version of the source code. It uses a "sum" variable that could give a clue to an efficient solution (i.e., a local partial-sum variable updated each loop iteration). The code is small and efficient in serial, but will challenge the students to come up with an efficient solution. Of course, efficiency is not one of the goals of such a short module; getting the answer correct with multiple threads is enough of a goal.
Lab 2 -- computing π

    static long num_steps = 1000000;
    double step, pi;

    void main()
    {
        int i;
        double x, sum = 0.0;
        step = 1.0 / (double) num_steps;
        #pragma omp parallel for private(x) reduction(+:sum)
        for (i = 0; i < num_steps; i++) {
            x = (i + 0.5) * step;
            sum = sum + 4.0 / (1.0 + x * x);
        }
        pi = step * sum;
        printf("Pi = %12.9f\n", pi);
    }
Lab 3 -- computing π by the Monte Carlo method
Experiment: estimate π with the Monte Carlo method.
From the circle-area formula S = πr², consider the unit circle. The area of one quarter of the unit circle is a fraction of the area of the unit square, which is 1. Now generate a large number of random points inside the unit square: the fraction of points that fall inside the quarter circle equals the area of the quarter circle, π/4.
A point lies inside the quarter circle exactly when its coordinates satisfy x² + y² ≤ 1.
[Figure: quarter of the unit circle inscribed in the unit square.]
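The method can be sketched portably before looking at the lab code. This is a minimal serial sketch, not the lab's implementation: it uses the C library's rand() instead of the Intel MKL VSL generator used on the next slide.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const long iter = 10000000;
        long under_curve = 0;

        for (long j = 0; j < iter; j++) {
            double x = (double)rand() / RAND_MAX;   // random point in the unit square
            double y = (double)rand() / RAND_MAX;
            if (x*x + y*y <= 1.0)                   // inside the quarter circle?
                under_curve++;
        }
        printf("Pi is approximately %f\n", 4.0 * under_curve / (double)iter);
        return 0;
    }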
Lab 3 -- computing π by the Monte Carlo method

    unsigned int iter = 200000000;
    int i, j;
    double x, y;
    double dUnderCurve = 0.0;
    double pi = 0.0;
    double r[BLOCK_SIZE*2];
    VSLStreamStatePtr stream;

    vslNewStream(&stream, BRNG, (int) clock());
    for (j = 0; j < iter/BLOCK_SIZE; j++) {
        vdRngUniform(METHOD, stream, BLOCK_SIZE*2, r, 0.0, 1.0);
        for (i = 0; i < BLOCK_SIZE; i++) {
            x = r[i];
            y = r[i + BLOCK_SIZE];
            if (x*x + y*y <= 1.0) {
                dUnderCurve++;
            }
        }
    }
    vslDeleteStream(&stream);
    pi = dUnderCurve / (double) iter * 4;
Lab 3 -- computing π by the Monte Carlo method

    unsigned int iter = 200000000;
    int i, j;
    double x, y;
    double dUnderCurve = 0.0;
    double pi = 0.0;
    VSLStreamStatePtr stream;

    #pragma omp parallel private(stream, x, y, i) reduction(+:dUnderCurve)
    {
        double r[BLOCK_SIZE*2];          // per-thread buffer (on the stack, private)
        vslNewStream(&stream, BRNG, (int) clock());
        #pragma omp for
        for (j = 0; j < iter/BLOCK_SIZE; j++) {
            vdRngUniform(METHOD, stream, BLOCK_SIZE*2, r, 0.0, 1.0);
            for (i = 0; i < BLOCK_SIZE; i++) {
                x = r[i];
                y = r[i + BLOCK_SIZE];
                if (x*x + y*y <= 1.0) {
                    dUnderCurve++;
                }
            }
        }
        vslDeleteStream(&stream);
    }
    pi = dUnderCurve / (double) iter * 4;
Removing synchronization to reduce overhead
Consider the following program. How can the critical section be removed?

    float dot_prod(float* a, float* b, int N)
    {
        float sum = 0.0;
        #pragma omp parallel for shared(sum)
        for (int i = 0; i < N; i++) {
            #pragma omp critical
            sum += a[i] * b[i];
        }
        return sum;
    }
Removing synchronization to reduce overhead
Compute each thread's start and end iterations and accumulate into per-thread partial sums:

    float dot_prod(float* a, float* b, int N)
    {
        float sum = 0.0, lsum[10] = {0.0};   // one partial sum per core
        int Ncore = omp_get_num_procs();
        int nstep = N / Ncore;               // assumes N divisible by Ncore
        int k;
        #pragma omp parallel for
        for (k = 0; k < Ncore; k++)
            for (int i = k * nstep; i < (k + 1) * nstep; i++) {
                lsum[k] += a[i] * b[i];
            }
        for (k = 0; k < Ncore; k++)
            sum += lsum[k];
        return sum;
    }
Programming with OpenMP
Contents: What is OpenMP? · Parallel regions · Loop parallelization · Data scoping and protecting data · Explicit synchronization · The schedule clause · Other useful constructs and clauses
The schedule clause -- load-balancing problems
Consider the following program. What is the problem, and how can it be solved?

    int i, j;
    int a[100][100];
    for (i = 0; i < 100; i++) {
        for (j = 0; j < 100; j++)
            a[i][j] = i * j;
    }
The schedule clause -- task scheduling
In a multithreaded program, achieving good load balance -- and hence optimal performance -- requires efficient scheduling of the loop. The goal of scheduling is to keep every thread busy, rather than having some threads overloaded while others sit idle. The approach is to distribute data and work sensibly across the threads.
The schedule clause

    #pragma omp parallel for schedule(kind[, chunksize])

- The kind parameter has four types: static, dynamic, guided, runtime.
- The optional chunksize parameter specifies a number of loop iterations; it must be an integer.
- When kind is runtime, no chunksize may be given.
The schedule clause -- available schedules

    Schedule type      Description
    static (default)   With n iterations and m threads, each thread is statically
                       assigned n/m iterations; if n/m is not an integer, the actual
                       counts may differ by one chunksize.
    dynamic            Uses an internal work queue. When a thread becomes available,
                       it is handed chunksize iterations; after finishing them it
                       returns to the queue for the next chunk.
    guided             Like dynamic, but the chunks start large and shrink roughly
                       exponentially down to chunksize.
    runtime            The schedule is chosen at run time from the OMP_SCHEDULE
                       environment variable (one of the types above).
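A hedged sketch of the runtime type, reusing the factorial example from the following slides: the schedule is read from OMP_SCHEDULE when the program starts, so different schedules can be compared without recompiling.

    // schedule(runtime): pick the schedule from the environment, e.g.
    //   set OMP_SCHEDULE=guided,3        (Windows)
    //   export OMP_SCHEDULE="guided,3"   (Unix-like shells)
    #pragma omp parallel for schedule(runtime)
    for (i = 1; i <= 12; i++) {
        a[i] = factorial(i);
    }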
The schedule clause -- example

    #pragma omp parallel for schedule(static)
    for (i = 1; i <= 12; i++) {
        a[i] = factorial(i);
    }

The iterations are divided into contiguous blocks (by default, of roughly equal size). With 2 threads:
- thread 0 computes i = 1, 2, 3, 4, 5, 6
- thread 1 computes i = 7, 8, 9, 10, 11, 12

Details: This C example uses STATIC scheduling. The set of iterations is divided into equal chunks and distributed among the threads.
The schedule clause -- example

    #pragma omp parallel for schedule(static, 3)
    for (i = 1; i <= 12; i++) {
        a[i] = factorial(i);
    }

The iterations are divided into chunks of 3. With 2 threads:
- thread 0 computes i = 1, 2, 3 and 7, 8, 9
- thread 1 computes i = 4, 5, 6 and 10, 11, 12
The schedule clause -- example

    #pragma omp parallel for schedule(dynamic, 3)
    for (i = 1; i <= 12; i++) {
        a[i] = factorial(i);
    }

The iterations are handed out in chunks of 3, in whatever order threads become free. With 2 threads, one possible outcome:
- thread 0 computes i = 4, 5, 6 and 7, 8, 9
- thread 1 computes i = 1, 2, 3 and 10, 11, 12
The schedule clause -- example

    #pragma omp parallel for schedule(guided, 3)
    for (i = 1; i <= 12; i++) {
        a[i] = factorial(i);
    }

Chunk sizes decrease as the loop progresses, with 3 as the minimum. With 2 threads, one possible outcome:
- thread 0 computes i = 1, 2, 3, 4, 5 and 10, 11, 12
- thread 1 computes i = 6, 7, 8, 9
Programming with OpenMP
Contents: What is OpenMP? · Parallel regions · Loop parallelization · Data scoping and protecting data · Explicit synchronization · The schedule clause · Other useful constructs and clauses
Other useful constructs and clauses -- parallel sections (the sections and section constructs)
Independent pieces of code can execute in parallel:

    #pragma omp parallel sections
    {
        #pragma omp section
        phase1();
        #pragma omp section
        phase2();
        #pragma omp section
        phase3();
    }

Serially the three phases run one after another; with sections they run in parallel.
Sections are distributed among the threads in the parallel team. Each section is executed only once, and each thread may execute zero or more sections. It's not possible to determine whether one section will execute before another; therefore, the output of one section should not serve as the input to another. Instead, the section that generates the output should be moved before the sections construct.
Other useful constructs and clauses -- the single construct
- The marked block is executed by only one thread.
- There is an implicit barrier at the end.
- The first thread to arrive is chosen.

    #pragma omp parallel
    {
        DoManyThings();
        #pragma omp single
        ExchangeBoundaries();   // all other threads wait here
        DoManyMoreThings();
    }

Details: Used when only one thread should execute a portion of code. This is handy when two parallel regions have very little serial code in between: combine the two regions into a single region (less fork-join overhead) and add the "single" construct for the serial portion. In the example above, the team of threads starts executing in the parallel region and runs DoManyThings() concurrently. One thread, after completing DoManyThings(), enters the single region, at whose end an implicit barrier exists. The remaining threads skip the single region and wait at the barrier. When the thread completes the single region and all other threads have finished DoManyThings(), the team concurrently executes DoManyMoreThings().
Other useful constructs and clauses -- the master construct
- The marked block is executed only by the master thread.
- There is no implicit barrier at the end.

    #pragma omp parallel
    {
        DoManyThings();
        #pragma omp master
        ExchangeBoundaries();
        DoManyMoreThings();
    }

Details: Similar to "single", but the master thread is chosen and there is no barrier at the end of the construct. In the example on the slide, the team of threads starts executing in the parallel region and runs DoManyThings() concurrently. Only the master thread is allowed to enter the master region. Because there is no implicit barrier at the end, the remaining threads skip the master region and start executing DoManyMoreThings() without waiting for the master thread to finish.
Other useful constructs and clauses -- implicit barriers
Several OpenMP* constructs carry implicit barriers: parallel, for, single. Unnecessary barriers hurt performance -- waiting threads accomplish nothing! When it is safe, suppress the implicit barrier with the nowait clause.
Excessive use of implicit barriers defeats the purpose of parallel execution. Unnecessary barriers hamper performance because the waiting threads are idle. You can use the nowait clause to remove these barriers when it is safe.
Other useful constructs and clauses -- the nowait clause
The nowait clause suppresses the implicit barrier:

    #pragma omp for nowait
    for (...) {...};

    #pragma omp single nowait
    { [...] }

Use it when threads would otherwise wait between independent computations:

    #pragma omp for schedule(dynamic, 1) nowait
    for (int i = 0; i < n; i++)
        a[i] = bigFunc1(i);

    #pragma omp for schedule(dynamic, 1)
    for (int j = 0; j < m; j++)
        b[j] = bigFunc2(j);

Details: The schedule in each loop is (dynamic, 1), and the computations in the two loops are independent of each other (one loop updates a[], the other updates b[]). Without the nowait clause, threads would pause until all work in the first loop is done; with it, threads that exhaust the work of the first loop can begin executing work in the second. (The second loop may even complete all of its work before the first loop finishes.)
Other useful constructs and clauses -- the barrier construct
An explicit barrier synchronization: each thread waits until all threads have arrived.

    #pragma omp parallel shared(A, B, C)
    {
        DoSomeWork(A, B);
        printf("Processed A into B\n");
        #pragma omp barrier
        DoSomeWork(B, C);
        printf("Processed B into C\n");
    }

A, B, and C are shared variables. Threads enter the parallel region and execute the DoSomeWork() function in parallel on A and B. The value of B must be fully updated before any computation involving B -- the second DoSomeWork() call -- begins. The barrier creates that threshold: threads wait for the other threads before entering the next code block.
Other useful constructs and clauses -- the atomic construct
The atomic construct creates a very small critical section, which on most processors is implemented with one or just a few machine instructions. Because it is so small, it executes faster than a critical section. Its syntax:

    #pragma omp parallel for shared(x, y, index, n)
    for (i = 0; i < n; i++) {
        #pragma omp atomic
        x[index[i]] += work1(i);
        y[i] += work2(i);
    }

Only a small set of statement forms can be used within the atomic pragma. Since index[i] can be the same for different values of i, the update to x must be protected; here the update to an element of x[] is atomic. The other computations (the call to work1() and the evaluation of index[i]) are not done atomically. A critical section would serialize all updates to x; atomic protects individual elements of the x array, so that when concurrent instances of index[i] differ, the updates can still proceed in parallel.
Other useful constructs and clauses -- the OpenMP* API

    int omp_get_thread_num(void);   // thread number within the executing team
    int omp_get_num_threads(void);  // number of threads in the team
    int omp_get_num_procs(void);    // number of processors (cores)

Relying on these calls can make the code behave inconsistently with its serial version, so avoid them unless there is a genuine need (e.g., debugging). To use them, you must include a header file:

    #include <omp.h>

int omp_get_thread_num(void); returns the thread number of the calling thread; if called from a serial region, it returns 0. To use the function calls, include the <omp.h> header file; the compiler automatically links to the correct libraries. These functions are not usually needed in OpenMP code except for debugging. int omp_get_num_threads(void); returns the number of threads currently in the team executing the parallel region from which it is called; called from a serial portion of the program or a serialized nested parallel region, it returns 1. The default number of threads is implementation-dependent.
Details: Emphasize that API calls are usually not needed -- assignment of loop iterations and other computations to threads is already built into OpenMP.
Background: MPI programmers who use OpenMP tend to use the API routines more often than others. This is likely due to the need in MPI to know the rank of the process and compute the work to be done based on that rank within the set of all processes.
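A small sketch combining the three calls (note that omp_get_thread_num() returns 0 and omp_get_num_threads() returns 1 when called from serial code):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("processors: %d\n", omp_get_num_procs());
        #pragma omp parallel
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }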
Summary
OpenMP* is a simple parallel programming model for shared-memory machines. We explored how, in OpenMP code, to:
- create parallel code regions (omp parallel)
- divide work (omp for)
- classify variables (omp private, ...)
- synchronize (omp critical, ...)
- schedule (omp schedule, ...)
Further topics
More about OpenMP* -- data environment constructs
- FIRSTPRIVATE
- LASTPRIVATE
- THREADPRIVATE
The firstprivate clause
The private variable's initial value comes from the shared variable of the same name; C++ objects are copy-constructed.

    incr = 30;
    #pragma omp parallel for firstprivate(incr)
    for (i = 0; i <= MAX; i++) {
        if ((i % 2) == 0) incr++;
        A[i] = incr;
    }

Each private copy of incr is initialized from the shared incr, so it starts at 30.
The lastprivate clause
The private variable's value from the sequentially last iteration is assigned to the shared variable of the same name; C++ objects are updated as if by assignment.

    void sq2(int n, double *lastterm)
    {
        double x;
        int i;
        #pragma omp parallel
        #pragma omp for lastprivate(x)
        for (i = 0; i < n; i++) {
            x = a[i]*a[i] + b[i]*b[i];
            b[i] = sqrt(x);
        }
        *lastterm = x;
    }
The threadprivate clause
- The threadprivate directive marks a global variable as thread-private data; while the program runs, it cannot be accessed by other threads.
- The copyin clause initializes the thread-private copies of the global variable.

    struct Astruct A;
    #pragma omp threadprivate(A)
    ...
    #pragma omp parallel copyin(A)
        do_something_to(&A);
    ...
    #pragma omp parallel
        do_something_else_to(&A);

Private copies of A persist between regions.
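A self-contained hedged sketch of threadprivate with copyin (the name counter is illustrative, not from the slides). Each thread keeps its own copy of counter; copyin initializes every copy from the master's value, and the per-thread copies persist into the next parallel region, assuming the thread count does not change between regions:

    #include <stdio.h>
    #include <omp.h>

    int counter = 100;
    #pragma omp threadprivate(counter)

    int main(void)
    {
        #pragma omp parallel copyin(counter)   // every copy starts at 100
        counter += omp_get_thread_num();

        #pragma omp parallel                   // per-thread values persist here
        printf("thread %d: counter = %d\n",
               omp_get_thread_num(), counter);
        return 0;
    }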