Support of Multidimensional Parallelism in the OpenMP Programming Model

Haoqiang Jin and Gabriele Jost*
NAS Division, NASA Ames Research Center, Moffett Field, CA 94035-1000
{hjin,gjost}@nas.nasa.gov

Abstract

OpenMP is the current standard for shared-memory programming. While providing ease of parallel programming, the OpenMP programming model also has limitations which often affect the scalability of applications. Examples of these limitations are work distribution and point-to-point synchronization among threads. We propose extensions to the OpenMP programming model which allow the user to easily distribute the work in multiple dimensions and synchronize the workflow among the threads. The proposed extensions include four new constructs and the associated runtime library. They do not require changes to the source code and can be implemented based on the existing OpenMP standard. We illustrate the concept in a prototype translator and test it with benchmark codes and a cloud modeling code.

1. Introduction

OpenMP [11] was introduced as an industrial standard for shared-memory programming with directives. It has gained significant popularity and wide compiler support. The main advantage of the OpenMP programming model is that it is easy to use and allows the incremental parallelization of existing sequential codes. OpenMP provides a fork-and-join execution model in which a program begins execution as a single process or thread. This thread executes sequentially until a PARALLEL construct is encountered. At this point, the thread creates a team of threads and becomes its master thread. All threads execute the statements lexically enclosed by the parallel construct. Work-sharing constructs (DO, SECTIONS and SINGLE) are provided to divide the execution of the enclosed code region among the members of a team. All threads are independent and may synchronize at the end of each work-sharing construct or at specific points (specified by the BARRIER directive). Exclusive execution is also possible through the definition of CRITICAL and ORDERED regions. Synchronization of all threads is required at the end of PARALLEL constructs. An existing code can be easily parallelized by placing OpenMP directives around time-consuming loops which do not contain data dependences, leaving the rest of the source code unchanged.

* Computer Sciences Corporation, M/S T27A-1, NASA Ames Research Center.

The ease of programming is a big advantage over a parallelization based on data distribution and message passing, as required for distributed-memory computer architectures. There are, however, limitations to the programming model which can affect the scalability of OpenMP programs. Comparative studies (e.g., [7]) have shown that high scalability in message-passing based codes is often due to user-optimized work distribution and process synchronization. When employing OpenMP the user no longer has this level of control. For example, a problem arises if the outer loop in a loop nest does not contain a sufficient number of iterations to keep all of the threads busy. In [1] and [5] it has been shown that directive nesting can be beneficial in these cases. Even though the OpenMP standard allows the nesting of directives, this feature is not supported by most commercial compilers. Another issue is the workflow between threads. Data dependences may require point-to-point synchronization between individual threads.
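As an illustration of the directive nesting mentioned above, the following minimal sketch (our own, not taken from the paper; the array and the bounds NZ and NY are placeholders) shows an outer and an inner loop each parallelized with its own PARALLEL DO. Nested parallelism must be enabled at runtime and, as discussed above, many compilers simply serialize the inner region.

      PROGRAM NESTED_EXAMPLE
      USE OMP_LIB
      IMPLICIT NONE
      INTEGER, PARAMETER :: NZ = 4, NY = 1024
      DOUBLE PRECISION :: A(NY, NZ)
      INTEGER :: J, K

      CALL OMP_SET_NESTED(.TRUE.)
!$OMP PARALLEL DO PRIVATE(J)
      DO K = 1, NZ
!$OMP PARALLEL DO
         DO J = 1, NY
            A(J, K) = DBLE(J + K)
         ENDDO
!$OMP END PARALLEL DO
!        implicit barrier at the end of the inner parallel region
      ENDDO
!$OMP END PARALLEL DO
      PRINT *, 'A(NY,NZ) =', A(NY, NZ)
      END PROGRAM NESTED_EXAMPLE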
The OpenMP standard requires barrier synchronization at the end of parallel regions, which can destroy this workflow, especially when nested parallel regions are used.

The SPMD programming model is a way to mimic the data distribution and workflow achieved by message-passing programs. In this model the programmer manually expresses the data and work distribution among the threads. The data is copied to small working arrays which are then accessed locally by each thread. The work is distributed according to the distribution of the data. The parallelization of the loop nests requires explicitly computing which thread executes which iteration. The bounds for each loop have to be calculated from the number of threads and the identifier of each thread, thereby introducing changes to the source code; a minimal sketch of this style is shown below. This programming style allows the programmer to carefully manage the workflow, but completely defeats the ease-of-use advantage of the OpenMP programming model.
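The following sketch (ours, for illustration only; the array, its size, and the block distribution are placeholders) shows the manual SPMD style just described: each thread derives its own block of the K loop from its thread number, so the loop bounds no longer match the original serial code.

      PROGRAM SPMD_STYLE
      USE OMP_LIB
      IMPLICIT NONE
      INTEGER, PARAMETER :: NZ = 64, NY = 64
      DOUBLE PRECISION :: V(NY, NZ)
      INTEGER :: J, K, MYID, NTHREADS, CHUNK, KLOW, KHIGH

!$OMP PARALLEL PRIVATE(MYID, NTHREADS, CHUNK, KLOW, KHIGH, J, K)
      MYID     = OMP_GET_THREAD_NUM()
      NTHREADS = OMP_GET_NUM_THREADS()
!     Block distribution of the K iterations computed by hand:
!     thread MYID only executes iterations KLOW..KHIGH.
      CHUNK = (NZ + NTHREADS - 1) / NTHREADS
      KLOW  = MYID * CHUNK + 1
      KHIGH = MIN(NZ, KLOW + CHUNK - 1)
      DO K = KLOW, KHIGH
         DO J = 1, NY
            V(J, K) = DBLE(J + K)
         ENDDO
      ENDDO
!$OMP END PARALLEL
      PRINT *, 'V(1,1) =', V(1, 1)
      END PROGRAM SPMD_STYLE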
We are proposing extensions to the OpenMP programming model which allow the automatic generation of SPMD-style code based on user directives. Our goal is to remove some of the performance-inhibiting limitations of OpenMP while preserving the ease of programming. In the current work we address the issue of work distribution in multiple dimensions and point-to-point thread synchronization.

The rest of the paper is structured as follows. In Section 2 we discuss related work regarding multidimensional parallelization and present our own approach in comparison. In Section 3 we give a description of our prototype implementation. In Section 4 we present two case studies to demonstrate the proposed concept, and we conclude in Section 5.

2. Multidimensional Parallelism

The OpenMP standard allows the nesting of the OMP PARALLEL directive. According to the standard, a new team of threads is created when an inner OMP PARALLEL directive is encountered. The new team is joined at the inner OMP END PARALLEL. The OMP DO directive cannot be nested without nesting the OMP PARALLEL directive. At this point few compilers support this type of directive nesting. For example, the SGI compiler does not support nested OMP PARALLEL directives, but provides some support for multidimensional work distribution: it accepts the NEST clause on the OMP DO directive [10]. The NEST clause requires at least two variables as arguments to identify the indices of subsequent DO loops. The identified loops must be perfectly nested, and no code is allowed between the identified DO statements and the corresponding END DO statements. The NEST clause on the OMP DO directive informs the compiler that the entire set of iterations across the identified loops can be executed in parallel. The compiler can then linearize the execution of the loop iterations and divide them among the available single level of threads.

An example of a research platform that supports true nested OpenMP parallelism is the OpenMP NanosCompiler [3]. The NanosCompiler accepts Fortran-77 code containing OpenMP directives and generates plain Fortran-77 code with calls to the NthLib thread library [8], currently implemented for the SGI Origin. In contrast to the SGI MP library, NthLib allows for multilevel parallel execution such that inner parallel constructs are not serialized. The NanosCompiler programming model supports several extensions to the OpenMP standard which allow the user to control the allocation of work to the participating threads.

The NanosCompiler extension for multilevel parallelization is based on the concept of thread groups. A group of threads is composed of a subset of the total number of threads available in the team to run a parallel construct. In a parallel construct, the programmer may define the number of groups and the composition of each one. When a thread in the current team encounters a PARALLEL construct defining groups, the thread creates a new team and becomes its master thread. The new team is composed of as many threads as there are groups. The rest of the threads are used to support the execution of nested parallel constructs. In other words, the definition of groups establishes an allocation strategy for the inner levels of parallelism. To define groups of threads, the NanosCompiler supports the GROUPS clause extension to the PARALLEL directive. The NanosCompiler also provides the PRED/SUCC extensions [4] to allow point-to-point synchronization between threads.

We propose the following extensions to OpenMP in support of multidimensional parallelism, for short MOMP directives. We introduce two new directives, TMAP and MDO, for mapping a team of threads to a grid of multiple dimensions and for distributing work in the multiple dimensions. The concept is illustrated for a two-dimensional grid; it can easily be extended to higher dimensions.

TMAP(ndim, sfactor1, sfactor2)
   Maps a team of threads to a grid of multiple dimensions. ndim is the dimension of the mapped thread grid, currently 1 or 2; sfactor1 and sfactor2 are the shape factors for the grid, i.e. the ratio of the two factors is proportional to that of the grid sizes in each dimension. For example, (2,1,1) defines a square grid; (2,N,0) indicates that the number of threads mapped to the first dimension should not exceed N. See Figure 1 for an illustration.

MDO(idim[, gplow, gphigh])
   Binds or distributes a work-sharing DO loop to the 'idim' dimension of the thread grid. The optional parameters gplow and gphigh can be used to specify additional ghost iterations to be assigned at the low and high ends of the bound loop. This mimics the ghost-point concept used in message-passing programs. For the (2,1,1) mapping in Figure 1, if loops K and J are bound to idim=1 and 2, respectively, threads 0,1,2,3 will be bound to the same iterations of loop K, while threads 0,4,8,12 will be bound to the same iterations of loop J.

By default, threads are not synchronized at the end of an MDO loop, as opposed to the implicit synchronization at the end of an OMP DO loop. This choice reflects the SPMD coding style more closely. To enforce synchronization, the user needs to insert explicit synchronization directives. To support flexible synchronization among threads in the thread grid, we include two more directives, TSIGNAL and TWAIT.

TSIGNAL(idir[, ...])
   Sends a signal in the direction 'idir': -1 lower neighbor in the first dimension, 1 higher neighbor in the first dimension, -2 lower neighbor in the second dimension, 2 higher neighbor in the second dimension (see Figure 1). Multiple directions can be listed.

TWAIT(idir[, ...])
   Waits for a signal from the direction 'idir': -1 lower neighbor in the first dimension, 1 higher neighbor in the first dimension, -2 lower neighbor in the second dimension, 2 higher neighbor in the second dimension. Multiple directions can be listed.
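To make the grid mapping and the idir numbering concrete, the small sketch below computes a thread's grid coordinates and prints them for thread 5 of Figure 1. It is our illustration only: we assume a row-major mapping of the flat OpenMP thread number onto a 4x4 grid, which is consistent with the thread groupings described above but need not be the exact scheme used by the MOMP runtime.

      PROGRAM GRID_COORDS
      USE OMP_LIB
      IMPLICIT NONE
!     Assumed 4x4 grid for 16 threads, i.e. the (2,1,1) mapping of Figure 1.
      INTEGER, PARAMETER :: N1 = 4, N2 = 4
      INTEGER :: MYID, I1, I2

!$OMP PARALLEL PRIVATE(MYID, I1, I2) NUM_THREADS(N1*N2)
      MYID = OMP_GET_THREAD_NUM()
!     Grid coordinates under the assumed row-major numbering:
      I1 = MYID / N2          ! position along the first dimension (idim=1)
      I2 = MOD(MYID, N2)      ! position along the second dimension (idim=2)
!     Neighbors addressed by idir = -1, 1, -2, 2 (where they exist):
!       idir = -1  ->  thread (I1-1)*N2 + I2
!       idir =  1  ->  thread (I1+1)*N2 + I2
!       idir = -2  ->  thread I1*N2 + (I2-1)
!       idir =  2  ->  thread I1*N2 + (I2+1)
      IF (MYID .EQ. 5) THEN
         PRINT *, 'thread 5 sits at grid position', I1, I2
      ENDIF
!$OMP END PARALLEL
      END PROGRAM GRID_COORDS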
[Figure 1: two thread grids for 16 threads, with the idim=1 and idim=2 directions marked; the left grid shows the (2,1,1) mapping, the right grid the (2,8,0) mapping.]
Figure 1: Examples of thread grids mapped with TMAP for 16 threads. The left (2,1,1) mapping defines a square shape, while the right (2,8,0) mapping limits the number of threads mapped to the first dimension to 8, which gives an 8x2 topology. The light shaded boxes indicate the neighboring threads of a given thread (5 in the figure) corresponding to idir=-1,1,-2,2.

TSIGNAL/TWAIT must always be used in matching pairs; otherwise a deadlock will occur. We could easily include constructs for signal/wait on a particular thread if required.

A sample code using MOMP directives is given in Figure 2. The TMAP clause is added at the beginning of a parallel region to define a 2-D thread grid. The first MDO distributes the work of the K loop, while the second MDO distributes the work of the J loop. The MOMP extensions work seamlessly with most of the OpenMP directives.

Serial code:

      DO K=1,NZ
         ZETA = K*0.1
         DO J=1,NY
            do more work...
         ENDDO
      ENDDO

Code with MOMP directives:

!$OMP PARALLEL TMAP(2,NZ,0)
!$OMP MDO(1)
      DO K=1,NZ
         ZETA = K*0.1
!$OMP MDO(2)
         DO J=1,NY
            do more work...
         ENDDO
      ENDDO
!$OMP END PARALLEL

Figure 2: A sample code segment in its serial form and with MOMP directives.

Code with MOMP directives:

!$OMP PARALLEL
!$OMP& TMAP(2,NZ,0)
!$OMP MDO(1)
      DO K=1,NZ
!$OMP MDO(2)
         DO J=1,NY
            do more work...
         ENDDO
      ENDDO
!$OMP END PARALLEL

OpenMP with Nanos extensions:

!$OMP PARALLEL GROUPS(NZ)
!$OMP DO
      DO K=1,NZ
!$OMP PARALLEL DO
         DO J=1,NY
            do more work...
         ENDDO
!$OMP END PARALLEL DO
      ENDDO
!$OMP END PARALLEL

OpenMP with SGI extensions:

!$OMP PARALLEL DO
!$SGI+NEST ONTO(NZ,*)
      DO K=1,NZ
         DO J=1,NY
            do more work...
         ENDDO
      ENDDO
!$OMP END PARALLEL DO

Figure 3: The same loop nest expressed with MOMP directives, with the NanosCompiler GROUPS extension, and with the SGI NEST extension.

3. Prototype Implementation

To illustrate the concept of using the MOMP directives described above for multidimensional parallelization, we implemented a prototype translator and the supporting runtime library. The translator simply translates all the MOMP directives into appropriate runtime calls and leaves the other OpenMP directives intact. The resulting code is standard OpenMP code and can be compiled with any OpenMP compiler and linked with the MOMP runtime library. In Table 1 we summarize the translation of MOMP directives into runtime functions for FORTRAN codes; the concept applies to C as well. The translation of TMAP, TSIGNAL and TWAIT is straightforward, simply mapping to the corresponding runtime functions. The MDO directive is translated into a call to momp_get_range, which computes the new loop limits after the associated loop is distributed in the given grid dimension, and the loop is then rewritten with the new limits. For simplicity we only consider the block distribution, which also simplifies the implementation of momp_tsignal() and momp_twait(). The ghost iterations are translated into an overlap of the iteration space between two neighboring threads. When no ghost iteration is involved, (gplow,gphigh)=(0,0) is used.

Table 1: Translation of MOMP directives into runtime functions.

  MOMP directive                   Runtime function
  -------------------------------  ------------------------------------------------
  TMAP(ndim,sfactor1,sfactor2)     call momp_create_map(ndim, sfactor1, sfactor2)
  MDO(idim,gplow,gphigh)           call momp_get_range(idim, gplow, gphigh,
  DO lvar=low,high,step                 low, high, step, new_low, new_high)
                                   DO lvar=new_low,new_high,step
  MDO(idim)                        call momp_get_range(idim, 0, 0, ...)
  TSIGNAL(idir[,...])              call momp_tsignal(idir) [call momp_tsignal(...)]
  TWAIT(idir[,...])                call momp_twait(idir) [call momp_twait(...)]
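The paper does not list the body of momp_get_range; the following sketch is only our illustration of the block distribution it is described to perform, with the thread's grid coordinate MYCOORD and the number of threads NP in the given dimension passed in explicitly (the real runtime presumably obtains them from the grid created by momp_create_map). A positive STEP is assumed.

      SUBROUTINE BLOCK_RANGE(MYCOORD, NP, GPLOW, GPHIGH,
     &                       LOW, HIGH, STEP, NEW_LOW, NEW_HIGH)
      IMPLICIT NONE
      INTEGER, INTENT(IN)  :: MYCOORD, NP, GPLOW, GPHIGH
      INTEGER, INTENT(IN)  :: LOW, HIGH, STEP
      INTEGER, INTENT(OUT) :: NEW_LOW, NEW_HIGH
      INTEGER :: NITER, CHUNK, FIRST

!     Number of iterations of the original loop (positive STEP assumed).
      NITER = MAX(0, (HIGH - LOW + STEP) / STEP)
!     Block size, rounded up so that NP blocks cover all NITER iterations.
      CHUNK = (NITER + NP - 1) / NP
!     First iteration owned by the thread at coordinate MYCOORD.
      FIRST    = MYCOORD * CHUNK
      NEW_LOW  = LOW + FIRST * STEP
      NEW_HIGH = LOW + (MIN(NITER, FIRST + CHUNK) - 1) * STEP
!     Ghost iterations widen the local range at either end.
      NEW_LOW  = MAX(LOW,  NEW_LOW  - GPLOW  * STEP)
      NEW_HIGH = MIN(HIGH, NEW_HIGH + GPHIGH * STEP)
      END SUBROUTINE BLOCK_RANGE

For example, for a loop DO K=1,64 distributed over 4 threads in one dimension with no ghost iterations, the thread at coordinate 1 would receive NEW_LOW=17 and NEW_HIGH=32.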
In addition, an environment variable MOMP_SHAPE can be used to control the grid shape externally. MOMP_SHAPE takes a value of the form "S1xS2", which is equivalent to specifying sfactor1=S1 and sfactor2=S2 to TMAP, and it overrides the values given in TMAP. This allows a user to freely change the grid shape at runtime without recompiling the code. If S1xS2=N, where N is the total number of threads, the shape "S1xS2" defines the actual grid topology.

An example of the translation of the code given in Figure 2 is shown in Figure 4. The translated code is standard OpenMP code with a proper list of private variables for the new loop limits and can be compiled with any OpenMP compiler.

Code with MOMP directives:

!$OMP PARALLEL
!$OMP& TMAP(2,NZ,0)
!$OMP MDO(1)
      DO K=1,NZ
         ZETA = K*0.1
!$OMP MDO(2)
         DO J=1,NY
            do more work...
         ENDDO
      ENDDO
!$OMP END PARALLEL

Translated OpenMP code:

!$OMP PARALLEL PRIVATE(K_NLOW,
!$OMP& K_NHIGH,J_NLOW,J_NHIGH)
      CALL MOMP_CREATE_MAP(2,NZ,0)
      CALL MOMP_GET_RANGE(1, 0, 0, 1,
     &     NZ, 1, K_NLOW, K_NHIGH)
      DO K=K_NLOW,K_NHIGH
         ZETA = K*0.1
         CALL MOMP_GET_RANGE(2, 0, 0,
     &        1, NY, 1, J_NLOW, J_NHIGH)
         DO J=J_NLOW,J_NHIGH
            do more work...
         ENDDO
      ENDDO
!$OMP END PARALLEL

Figure 4: Translation of a sample MOMP code to standard OpenMP code with MOMP runtime calls.

4. Case Studies

In this section we show examples of using our proposed MOMP directives. We describe the usage of the directives and discuss the performance of the resulting code. All tests were run on an SGI Origin 3000 with 400 MHz R12000 CPUs and 2 GB of local memory per node. We have parallelized two codes (BT and LU) from the NAS Parallel Benchmark suite [2], using the baseline implementation as described in [5]. In addition, we have applied our extensions to a full-scale cloud modeling code.

4.1. The BT Benchmark

BT is a simulated CFD application. It uses an implicit algorithm to solve the 3D compressible Navier-Stokes equations. The x, y, and z dimensions are decoupled by use of an Alternating Direction Implicit (ADI) factorization method. The resulting systems are block-tridiagonal with a block size of 5x5 and are solved sequentially along each dimension. All of the parallelized loop nests are at least triply nested. The structure of the loops is such that the two outermost loops can be parallelized and enclose a reasonably large amount of computational work. We applied the TMAP and MDO directives as shown in Figure 2 throughout the code.

In Figure 5 we show timings obtained for the BT benchmark class A, which corresponds to a problem size of 64 grid points in each dimension. We compare timings achieved for different numbers of threads running in various topologies. We denote by n1 the first dimension of the thread topology; in our experiments n1 is the total number of threads divided by the second dimension of the thread grid. For example, if we are running on 32 CPUs employing 32 threads, the topology n1x4 corresponds to employing 8 threads in the first and 4 threads in the second dimension. The timings show that for more than 64 threads, distributing the work in multiple dimensions can improve the performance significantly. The loop length resulting from 64 grid points does not provide enough iterations in one dimension to keep more than 64 threads busy; distributing the work in two dimensions allows the exploitation of additional parallelism. The ratio of L2 cache misses to floating-point instructions per thread increases as more threads are assigned to the second dimension, due to the larger-stride memory access, which increases the running time.
However, the positive impact of the better workload balance is stronger than the negative impact of the loss of data locality in these runs.

[Figure 5: plot of BT class A execution time versus the number of CPUs (8 to 128) for several thread topologies.]
Figure 5: MOMP timing comparison for various numbers of threads running in different topologies. The first dimension n1 corresponds to the number of CPUs divided by the second topology dimension.

In Figure 6 we compare timings achieved for different thread topologies of the MOMP version with timings obtained for the nested OpenMP version using the NanosCompiler. For the NanosCompiler we use the GROUPS clause extension as shown in Figure 3. The timings for the MOMP version seem to be slightly better than for the Nanos version, but not significantly so. The extra barrier synchronization points at the end of the inner parallel regions do not introduce considerable overhead into the NanosCompiler-generated code, which is due to the very efficient NthLib thread library. We also noticed that the thread scheduling applied by the NanosCompiler yields a somewhat better workload balance than our approach. This indicates that additional performance might be gained for MOMP with an optimized runtime library.

[Figure 6: bar chart "BT Class A on 100 CPUs, SGI Origin 3000" comparing Nanos and MOMP times for the thread topologies 100x1, 50x2, 25x4, and best.]
Figure 6: Time comparison between Nanos and MOMP for different thread configurations. The columns in the category "best" indicate the best time over all tested topologies.

The SGI NEST clause is not applicable to some of the time-consuming loops in the BT benchmark because they are not perfectly nested. More details on this can be found in [5].

4.2. The LU Benchmark

The LU application benchmark is a simulated CFD application that uses the symmetric successive over-relaxation (SSOR) method to solve a seven-band block-diagonal system, which results from a finite-difference discretization of the 3D compressible Navier-Stokes equations by splitting it into block lower and block upper triangular systems. All of the loops involved carry data dependences that prevent straightforward parallelization. There is, however, the possibility to exploit a certain level of parallelism by using software pipelining, which requires explicit synchronization of individual threads. We use the MOMP TMAP and MDO directives to distribute the work in multiple dimensions. Distributing the work in multiple dimensions requires a two-dimensional thread pipeline, which we set up with the MOMP TWAIT and TSIGNAL directives. An example of the source code for the 2D pipeline implemented using the MOMP directives is shown in Figure 7. At this point we note that a two-dimensional pipeline involving nested OpenMP parallel regions runs into the problem that the inner parallel region imposes an extra barrier synchronization point which inhibits the 2D pipeline.

Code with MOMP directives:

!$OMP PARALLEL TMAP(2,NZ,0)
      DO K=2,NZ
!$OMP TWAIT(-1,-2)
!$OMP MDO(1)
         DO J=2,NY
!$OMP MDO(2)
            DO I=2,NX
               V(I,J,K) = V(I,J,K) + V(I-1,J,K) +
     &                    V(I,J-1,K) + V(I,J,K-1)
            ENDDO
         ENDDO
!$OMP TSIGNAL(1,2)
      ENDDO
!$OMP END PARALLEL

Figure 7: Code example for the implementation of a 2D thread pipeline using MOMP directives.
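The paper does not show how momp_tsignal and momp_twait are implemented. One common way to realize such point-to-point synchronization on top of standard OpenMP is a per-thread counter in a shared array together with explicit FLUSH, as in the pipelined OpenMP versions of the NAS LU benchmark. The sketch below is our illustration of that idea for a 1-D pipeline only; the names, sizes, and busy-wait scheme are ours, not the MOMP runtime's.

      PROGRAM P2P_SYNC
      USE OMP_LIB
      IMPLICIT NONE
      INTEGER, PARAMETER :: NT = 4, NSTEPS = 8
      INTEGER :: ISYNC(0:NT-1)
      INTEGER :: ME, STEP

      ISYNC = 0
!$OMP PARALLEL PRIVATE(ME, STEP) SHARED(ISYNC) NUM_THREADS(NT)
      ME = OMP_GET_THREAD_NUM()
      DO STEP = 1, NSTEPS
!        "TWAIT(-1)": spin until the lower neighbor has signalled STEP.
         IF (ME .GT. 0) THEN
            DO
!$OMP FLUSH(ISYNC)
               IF (ISYNC(ME-1) .GE. STEP) EXIT
            ENDDO
         ENDIF
!        ... work on this thread's block for pipeline step STEP ...
!        "TSIGNAL(1)": tell the higher neighbor that STEP is complete.
         ISYNC(ME) = STEP
!$OMP FLUSH(ISYNC)
      ENDDO
!$OMP END PARALLEL
      PRINT *, 'pipeline finished, ISYNC =', ISYNC
      END PROGRAM P2P_SYNC

A 2-D pipeline such as the one in Figure 7 would use one such counter per grid direction, so that a thread waits on both its idir=-1 and idir=-2 neighbors before computing and signals its idir=1 and idir=2 neighbors afterwards.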
We used the Paraver [12] performance analysis system to show the effect of the 2D pipeline on the workflow of the threads. In Figure 8 we show the time-line view of the useful thread time during the forward substitution phase of LU for a 16-CPU run. Dark shades indicate time the threads spend in computation; light shades indicate time spent in synchronization or other OpenMP-introduced overhead. The left image shows a 16x1 thread topology, corresponding to a 1D pipeline; the right image shows a 4x4 topology, corresponding to a 2D pipeline. The use of the 2D pipeline decreases the pipeline startup time of the computations.

Figure 8: Time-line views of the thread workflow during the forward substitution phase in LU running on 16 threads. Dark shading indicates time spent in computation, light shading indicates time spent in synchronization. The left image shows a 1D thread pipeline, the right image shows a 2D thread pipeline. The 2D pipeline decreases the pipeline startup time.
