
Performance Analysis and Grid Computing: Selected Articles from the Workshop on Performance Analysis and Distributed Computing, August 19–23, 2002, Dagstuhl, Germany


We determine the contention factor by comparing the delay in the execution times of communication operations, when they are performed concurrently, with the execution times of these operations without network contention. The contention factor itself is modelled as a function of p and n, and the shape of the contention factor (or contention function) has been determined experimentally from the measured delays in the execution times. The parameters within this function are then determined by curve fitting. The resulting contention factor is used for modelling the communication times of those program parts of complex applications that are executed in a mixed task and data parallel way. Each parallel machine is characterised by a different contention factor because of the different network architectures of the machines.

To summarize, when using the contention factor the structure of the runtime formula for one communication operation does not change; the contention factor only adjusts the byte transfer time t_c to the specific situation. In particular, the startup times are not changed. Thus, for example, the runtime formula of a multi-broadcast operation executed with concurrent message transmissions is obtained from t_mb in Table 1 by replacing t_c with C(p, n) · t_c (Equation (1)). The same contention factor is used for each communication operation, i.e., there is no need to determine different contention factors for different communication operations. The communication operations of pure data parallel parts of application programs are modelled without the contention factor because there is no interference of message transmissions. The contention factors for the modelling of concurrent message transmissions are

    C_T3E(p, n)  = 0.04 · p · log2(log2(p)) · log2(n)
    C_CLiC(p, n) = 0.0045 · p · p · log2(p) · (log2(n) + log2(p)),

where p denotes the total number of processors participating in the execution and n the size of the message transmitted.
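As an illustration of how the contention factor enters the model, the following Python sketch evaluates the two machine-specific contention functions given above and scales the byte-transfer term of a hypothetical multi-broadcast formula. Since Table 1 is not reproduced in this excerpt, the structure tau + t_c · n · p used for t_mb below, as well as the function and parameter names, are assumptions; only the treatment of the contention factor (scaling t_c while leaving the startup time untouched) follows the text.

```python
import math

def contention_t3e(p: int, n: int) -> float:
    """Contention factor for the Cray T3E, as given in the text (meaningful for p > 2)."""
    return 0.04 * p * math.log2(math.log2(p)) * math.log2(n)

def contention_clic(p: int, n: int) -> float:
    """Contention factor for the CLiC Beowulf cluster, as given in the text."""
    return 0.0045 * p * p * math.log2(p) * (math.log2(n) + math.log2(p))

def t_multi_broadcast(p, n, tau, t_c, contention=None):
    """Hypothetical multi-broadcast runtime: the startup time tau stays unchanged,
    only the byte-transfer time t_c is scaled by C(p, n). The shape
    tau + t_c * n * p merely stands in for t_mb from Table 1."""
    factor = contention(p, n) if contention is not None else 1.0
    return tau + factor * t_c * n * p

# Predicted times without and with network contention on 64 processors for
# messages of 4096 bytes (tau and t_c are made-up machine parameters).
print(t_multi_broadcast(64, 4096, tau=1.0e-5, t_c=5.0e-9))
print(t_multi_broadcast(64, 4096, tau=1.0e-5, t_c=5.0e-9, contention=contention_t3e))
```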
4. Example Application

To investigate the usefulness of the modelling approach for complex application programs, we consider parallel implementations of a specific solution method for ordinary differential equations (ODEs), the iterated Runge-Kutta method (iterated RK method).

4.1 Task structure of iterated RK methods

The iterated RK method is an explicit one-step method for the solution of initial value problems of ODEs. It determines a sequence of approximation values y_1, y_2, y_3, ... for the exact solution of the ODE system in a series of sequential time steps. In each time step, a fixed number s of stage vectors is iteratively computed and combined to the next approximation vector in the following way:

    for l = 1, ..., s: initialize stage vector μ^(l,0)
    for j = 1, ..., m:
        for l = 1, ..., s: compute new stage vector approximation μ^(l,j)
    compute new approximation vector y_{k+1}
    compute approximation vector ŷ_{k+1} for step size control.

The number m of iterations is given by the specific RK method. Each computation of a stage vector approximation requires an evaluation of the function f that describes the ODE to be solved. The advantage of the iterated RK methods for a parallel execution is that the iteration system of size s · n consists of s independent function evaluations that can be performed in parallel [8]. For systems of differential equations, an additional data parallelism can be exploited. Thus, the algorithm provides several possibilities for a parallel implementation. The computation of the stage vectors in one iteration j of the stage-vector computation can be performed on subsets of processors at the same time (group implementation) or, alternatively, by all processors one after another (consecutive implementation). The group implementation is a mixed task and data parallel implementation, whereas the consecutive implementation is a pure data parallel realization.

4.2 Consecutive implementation

The computation time of the consecutive execution on p processors without the stepsize control can be modelled by a runtime formula in which T_f denotes the time for the evaluation of one component of f and t_op denotes the time for the execution of one arithmetic operation. In each loop body, each processor has to compute n/p components of the argument vector, using 2s + 1 operations per component, and n/p components of f. Since f and its access pattern are not known in advance, the complete argument vector has to be made available to each processor with a multi-broadcast operation. For the communication time, a corresponding formula with t_mb from Table 1 is used.

4.3 Group implementation

A group implementation of the iterated RK method uses s independent groups of processors, where each group G_i of size g_i, i = 1, ..., s, is responsible for computing the approximations of one specific stage vector. The pseudo-code in Figure 1 illustrates the structure of the program.

    forall i ∈ {1, ..., s} do in parallel (group parallelism)
        forall processors q ∈ G_i do (data parallelism) {
            compute ⌈n/g_i⌉ components of f(y_k);
            initialize ⌈n/g_i⌉ components of μ^(1,0), ..., μ^(s,0);
        }
    for j = 1, ..., m do (sequential iteration) {
        forall l ∈ {1, ..., s} do in parallel (group parallelism)
            forall q ∈ G_l do (data parallelism inside groups) {
                compute ⌈n/g_l⌉ components of argument μ^(l,j);
                group-multi-broadcast of local components of μ^(l,j);
                evaluate ⌈n/g_l⌉ components of f(μ^(l,j));
                group-multi-broadcast of local components of f(μ^(l,j));
            }
        forall processors q do (data parallelism on all processors)
            compute ⌈n/p⌉ components of y_{k+1};
        multi-broadcast the computed components of y_{k+1};
    }
    stepsize control;

Figure 1. Group implementation for one time step of the iterated RK method.

The processor groups should be of equal size, since the computation of each stage vector approximation requires an equal amount of computation. As p may not be a multiple of s, the group with the smallest number g_min = ⌊p/s⌋ of processors determines the computation time. The communication time is modelled by the runtime formula with the machine-specific contention factor, see Equation (1), to reflect the concurrently executed multi-broadcast operations.
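The displayed runtime formulas for the computation times are not legible in this excerpt; the following Python sketch assembles plausible versions of them from the verbal description above (m iterations, s stage vectors, 2s + 1 operations per argument component, evaluation time T_f per component of f). The exact expressions, like the function names, are therefore assumptions rather than the authors' formulas.

```python
from math import ceil, floor

def t_comp_consecutive(n, p, s, m, t_f, t_op):
    """Computation time per time step of the consecutive implementation:
    in each of the m iterations every processor handles, for each of the
    s stage vectors, n/p argument components (2s + 1 operations each)
    and n/p components of f."""
    return m * s * ceil(n / p) * ((2 * s + 1) * t_op + t_f)

def t_comp_group(n, p, s, m, t_f, t_op):
    """Computation time per time step of the group implementation: each
    group handles one stage vector, and the smallest group with
    g_min = floor(p/s) processors determines the runtime."""
    g_min = floor(p / s)
    return m * ceil(n / g_min) * ((2 * s + 1) * t_op + t_f)

# With p a multiple of s the two estimates coincide (s * n/p == n / (p/s));
# otherwise the group variant is penalized by its smallest group.
print(t_comp_consecutive(n=32768, p=64, s=4, m=6, t_f=1e-6, t_op=1e-8))
print(t_comp_group(n=32768, p=64, s=4, m=6, t_f=1e-6, t_op=1e-8))
```

The communication terms (t_mb from Table 1, with the contention factor of Equation (1) in the group variant) would be added on top of these computation estimates.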
5. Runtime experiments

To validate the runtime formulas, we have performed runtime tests on the T3E and the CLiC for up to 128 processors. Since the execution time of the iterated RK method strongly depends on the ODE system to be solved, we consider two classes of ODE systems:

Sparse ODE systems. These ODE systems have a right-hand side function f for which the evaluation of each component has a fixed evaluation time that is independent of the size n of the ODE system (sparse function), i.e., the evaluation time of the entire function f consisting of n components increases linearly with the size of the ODE system.

Dense ODE systems. These ODE systems have a right-hand side function f for which the evaluation of each component has an evaluation time that increases linearly with n, i.e., the evaluation time of the entire function f increases quadratically with the size of the ODE system.
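To make the two classes concrete, here is a small Python sketch with one right-hand side of each kind: the dense case uses a full matrix-vector product, while the sparse case uses a simple three-point stencil as a stand-in for systems such as the discretized Brusselator used in the experiments below. The concrete functions are illustrative assumptions, not the authors' test problems.

```python
import numpy as np

def f_dense(y, A):
    """Dense right-hand side f(y) = A y: each component is a full dot
    product, so one component costs O(n) and the whole evaluation O(n^2)."""
    return A @ y

def f_sparse(y):
    """Sparse right-hand side: each component depends on a constant number
    of neighbours (a three-point diffusion stencil), so one component costs
    O(1) and the whole evaluation O(n)."""
    return np.roll(y, 1) - 2.0 * y + np.roll(y, -1)

n = 1024
rng = np.random.default_rng(0)
y = rng.random(n)
A = rng.random((n, n))
print(f_dense(y, A).shape, f_sparse(y).shape)
```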
For dense ODE systems, the computation time usually dominates the communication time, i.e., the comparison of measured and predicted execution times shows how well the computation times are modelled. For sparse ODE systems, on the other hand, the communication time plays an important role and the comparison also includes this part.

We have used an iterated RK method with s = 4 stages that is based on an implicit RadauIIA method. This method leads to a convergence order of 7 if m = 6 iterations are executed in each time step. The runtimes shown in the following tables and figures are runtimes for one time step of the method and are obtained by averaging over a large number of time steps. The (measured and predicted) runtimes include the time for stepsize and error control.

On the T3E, the runtimes of the consecutive implementations of the iterated RK method for dense ODE systems can be modelled very accurately and are not shown here. Sparse ODE systems also lead to good predictions. In this case, no concurrent message transmissions take place, and therefore the contention factor does not need to be used. In both cases, the predictions are quite accurate, but not as accurate as the predictions for dense ODE systems. Nevertheless, they are accurate enough to be used for predicting the effect of a task parallel implementation.

Figure 2 shows the deviations between measured and predicted runtime for a group implementation of the iterated RK method for dense (left) and for sparse (right) ODE systems on a T3E, again using the contention factor for the concurrent message transmissions caused by the task parallel execution. Figure 3 illustrates the accuracy of the group implementation for dense (left) and sparse (right) ODE systems of large sizes on the T3E for different numbers of processors. In contrast to dense ODE systems, the solution of sparse ODE systems leads to considerable speedups only for processor numbers of up to 16. For larger numbers of processors, the communication time dominates the computation time.

Figure 2. Deviation between measured and predicted runtime of a group implementation for dense (left) and sparse (right) ODE systems on the Cray T3E-1200.

Figure 3. Measured and predicted execution times for a group implementation of one time step of the iterated RK method for a dense function of size n = 5000 (left) and for a Brusselator system of size n = 32768 (right) on a Cray T3E-1200.

Figure 4 compares measured and predicted execution times for group implementations of the iterated RK method for dense ODE systems on the CLiC, and Figure 5 shows an illustration for a fixed message size and different numbers of processors. For this machine, the deviations between the measured and predicted execution times are much larger than for the T3E, especially for larger numbers of processors. Although the deviations may be quite large, the predictions can still be used as a rough estimate of the performance of a task parallel execution. The main reason for the large deviations is the large increase of the communication time with an increasing number of processors. But also for smaller numbers of processors, there are considerable deviations for large system sizes because of caching effects caused by the memory hierarchy of the single processors (Intel Pentium III). Such effects are much smaller on the Alpha 21164 processor of the T3E-1200 because of its different cache organization. The runtimes on the CLiC show that even for dense ODE systems, the machine only leads to satisfactory speedups for up to 32 processors. For larger processor numbers, the communication time and the network contention are too high.

Figure 4. Deviation between measured and predicted runtime of a group implementation of one time step of the iterated RK method for a dense function on the Beowulf cluster CLiC.

Figure 5. Measured and predicted execution times for a group implementation of one time step of the iterated RK method for dense ODE systems of size n = 1000 (left) and n = 5000 (right) on the Beowulf cluster CLiC.

6. Conclusions

In this article, we have shown that it is possible to model the execution times of mixed task and data parallel implementations by runtime formulas and that the use of a simple contention factor is sufficient to capture the interference of concurrent message transmissions. The runtime formulas model the execution times quite accurately for parallel machines like the T3E with a high-speed interconnection network. For a Beowulf cluster with an Ethernet-based network, the network contention caused by concurrent transmissions is much larger and it is more difficult to capture the effects by a simple contention factor. But the predictions are still reasonable and give a first impression of the possible effects of a task parallel realization.

References

[1] A. Alexandrov, M. Ionescu, K.E. Schauser, and C. Scheiman. LogGP: Incorporating Long Messages into the LogP Model – One Step Closer Towards a Realistic Model for Parallel Computation. Technical Report TRCS95-09, University of California at Santa Barbara, 1995.
[2] D.E. Culler, R. Karp, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a Realistic Model of Parallel Computation. In Proc. 4th Symp. on Principles and Practice of Parallel Programming, 28(4):1-12, 1993.
[3] The MPI Forum. MPI: A Message Passing Interface Standard. Technical report, University of Tennessee, April 1994.
[4] R. Foschia, T. Rauber, and G. Rünger. Modeling the Communication Behavior of the Intel Paragon. In Proc. 5th Symp. on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'97), IEEE, pages 117-124, 1997.
[5] M. Hill, W. McColl, and D. Skillicorn. Questions and Answers about BSP. Scientific Programming, 6(3):249-274, 1997.
[6] S. Johnsson. Performance Modeling of Distributed Memory Architectures. Journal of Parallel and Distributed Computing, 12:300-312, 1991.
[7] W.F. McColl. Universal Computing. In Proceedings of EuroPar'96, Springer LNCS 1123, pages 25-36, 1996.
[8] T. Rauber and G. Rünger. Parallel Iterated Runge-Kutta Methods and Applications. International Journal of Supercomputer Applications, 10(1):62-90, 1996.
[9] T. Rauber and G. Rünger. PVM and MPI Communication Operations on the IBM SP2: Modeling and Comparison. In Proc. 11th Symp. on High Performance Computing Systems (HPCS'97), 1997.
[10] T. Rauber and G. Rünger. Modelling the Runtime of Scientific Programs on Parallel Computers. In Proc. ICPP Workshop on High Performance Scientific and Engineering Computing with Applications (HPSECA-00), pages 307-314, Toronto, Canada, August 2000.
[11] Z. Xu and K. Hwang. Early Prediction of MPP Performance: SP2, T3D and Paragon Experiences. Parallel Computing, 22:917-942, 1996.

COLLECTIVE COMMUNICATION PATTERNS ON THE QUADRICS NETWORK*

Salvador Coll, Jose Duato, Francisco J. Mora
Department of Electronic Engineering, Technical University of Valencia, Valencia, Spain

Fabrizio Petrini, Adolfy Hoisie
Performance and Architecture Laboratory, CCS-3, Los Alamos National Laboratory, Los Alamos, NM, USA

Abstract: The efficient implementation of collective communication is a key factor to provide good performance and scalability of communication patterns that involve global data movement and global control. Moreover, this is essential to enhance the fault-tolerance of a parallel computer, for instance, to check the status of the nodes, perform some distributed algorithm to balance the load, synchronize the local clocks, or do performance monitoring. Therefore, the support for multicast communications can improve the performance and resource utilization of a parallel computer. The Quadrics interconnect (QsNET), which is being used in some of the largest machines in the world, provides hardware support for multicast. The basic mechanism consists of the capability for a message to be sent to any set of contiguous nodes in the same time it takes to send a unicast message. The two main collective communication primitives provided by the network software are the barrier synchronization and the broadcast, which are both implemented in two different ways: either using the hardware support, when nodes are contiguous, or a balanced tree and unicast messaging otherwise. In this paper some performance results are given for the above collective communication services, which show, on the one hand, the outstanding performance of the hardware-based primitives even in the presence of high network background traffic and, on the other hand, the limited performance achieved with the software-based implementation.

Keywords: interconnection networks, Quadrics, collective communication, multicast, performance evaluation

*The work was supported by the Spanish CICYT through contract TIC2000-1151-C07-05.

1. Introduction

Current trends in high-speed interconnects include the availability of a communication processor in the network interface card [3, 11], which allows the implementation of high-level messaging libraries without explicit intervention of the main CPU [4], and the support for collective communication at hardware level [12], which outperforms traditional software-based multicast implementations.
Both approaches can aid in the implementation of communication patterns which involve global data movement and global control. Hardware support for multicast communication, combined with the local processing power provided by network processors, gives the opportunity of addressing several open problems in current and future medium- and large-scale parallel computers: scalability, responsiveness, programmability, performance, resource utilization and fault-tolerance. Many recent research results show that job scheduling and resource management techniques based on gang scheduling and coscheduling algorithms can provide solutions to these open problems [1, 7, 6, 10]. Another aspect where efficient multicast communication can play a key role is performance diagnosis and tuning, and performance control [9, 13]. By integrating dynamic performance instrumentation with configurable resource management algorithms and a real-time adaptive control mechanism, runtime systems could automatically configure resource managers. Such systems would increase achieved performance by adapting to temporally varying application behavior.

Hardware support for multicast communication requires many functionalities that are dependent on the network topology, the routing algorithm and the flow control strategy. For example, in a wormhole network, switches must be capable of forwarding flits from one input channel to multiple output channels at the same time [14]. Unfortunately, these tree-based algorithms can suffer from blocking problems in the presence of congestion [15]. Also, the packets must be able to encode the set of destinations in an easy-to-decode, compact manner, in order to reduce the packet size and to guarantee fast routing times in the switches. Software multicasts, based on unicast messages, are simpler to implement, do not require dedicated hardware and are not constrained by the network topology and routing algorithms, but they can be much slower than the hardware ones.
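To make the software alternative concrete, the following Python sketch builds the kind of round-by-round schedule a tree-based software multicast over unicast messages follows: in every round, each node that already holds the data forwards it to one node that does not, so p destinations are covered in roughly log2(p) point-to-point steps. This is a generic illustration, not the QsNET library implementation; the function name and structure are assumptions.

```python
from math import ceil, log2

def tree_broadcast_schedule(nodes, root=0):
    """Round-by-round send schedule of a binomial-tree software broadcast:
    each round doubles the number of nodes holding the data, so the
    broadcast needs ceil(log2(len(nodes))) rounds of unicast messages."""
    have = [root]                              # nodes that already hold the data
    pending = [n for n in nodes if n != root]  # nodes still waiting for it
    rounds = []
    while pending:
        sends = []
        for src in list(have):                 # every holder sends once per round
            if not pending:
                break
            dst = pending.pop(0)
            sends.append((src, dst))
            have.append(dst)
        rounds.append(sends)
    return rounds

# For 8 nodes the schedule needs 3 rounds, i.e. ceil(log2(8)) unicast steps.
schedule = tree_broadcast_schedule(list(range(8)))
assert len(schedule) == ceil(log2(8))
print(schedule)
```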
In previous work we analyzed in depth how hardware- and software-based multicasts are designed and implemented in the Quadrics network (QsNET) [12]. In this paper, some modifications have been performed in the communication libraries to evaluate the underlying mechanisms that provide multicast support.

The initial part of the paper introduces the mechanisms at the base of the hardware and software multicast primitives which, in their turn, are at the base of more sophisticated collective communication patterns such as broadcast, barrier, scatter, gather, reduce, etc. In the second part we provide an extensive performance evaluation of two user-level collective communication patterns, barrier and broadcast, implemented using both hardware and software multicast algorithms.

The rest of this paper is organized as follows. Section 2 presents the basic mechanisms that support collective communication on the QsNET, while Section 3 gives a detailed description of the main collective communication services. Section 4 presents the experimental results and performance analysis. Finally, some concluding remarks and future directions are given in Section 5.

2. Collective Communication on the Quadrics Network

The QsNET is a butterfly bidirectional multistage interconnection network with 4 x 4 switches [11], which can be viewed as a quaternary fat-tree. It is based on two building blocks, a programmable network interface called Elan and a low-latency, high-bandwidth communication switch called Elite. It uses wormhole switching with two virtual channels per physical link, source-based routing and adaptive routing. Some of the most important aspects of this network are: the integration of the local memory into a distributed virtual shared memory, the support for zero-copy remote DMA transactions and the hardware support for collective communication [12].

The basic hardware mechanism that supports collective communication is provided by the Elite switches, which can forward a packet to a set of physically contiguous output ports. Thus, a multicast packet can be sent to any group of adjacent nodes with a single hardware-based multicast transaction. When the destination nodes are not contiguous, a software-based implementation which uses a tree and point-to-point messages is used.

2.1 Hardware-Based Multicast

Hardware-based broadcasts are propagated into the network by sending a packet to the top of the tree and then forwarding the packet to more than one switch output as the packet is sent down the tree. Deadlocks might occur on the way down when multiple broadcasts are sent simultaneously [5]. This situation is avoided by always sending broadcast packets to a fixed top tree switch, thus serializing all broadcasts. Figure 1(a) shows that the top leftmost switch is chosen as the logical root for the collective communication, and every request, in the ascending phase, must pass through one of the dotted paths until it gets to the root switch. In Figure 1(b) we can see how a multicast packet reaches the root node; the multiple branches are then propagated in parallel. If another collective communication is issued while the first one is still in progress, it is serialized in the root switch. The second multicast packet will be able to proceed
