Distributed-Memory Computing With the Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA)

Christopher J. Riley* and F. McNeil Cheatwood†
NASA Langley Research Center, Hampton, VA 23681

Paper presented at the 4th NASA National Symposium on Large-Scale Analysis and Design on High-Performance Computers and Workstations, Oct. 15-17, 1997, Williamsburg, VA

* Research Engineer, Aerothermodynamics Branch, Aero- and Gas-Dynamics Division.
† Research Engineer, Vehicle Analysis Branch, Space Systems and Concepts Division.

The Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA), a Navier-Stokes solver, has been modified for use in a parallel, distributed-memory environment using the Message-Passing Interface (MPI) standard. A standard domain decomposition strategy is used in which the computational domain is divided into subdomains with each subdomain assigned to a processor. Performance is examined on dedicated parallel machines and a network of desktop workstations. The effect of domain decomposition and frequency of boundary updates on performance and convergence is also examined for several realistic configurations and conditions typical of large-scale computational fluid dynamic analysis.

Introduction

The design of an aerospace vehicle for space transportation and exploration requires knowledge of the aerodynamic forces and heating along its trajectory. Experiments (both ground-test and flight) and computational fluid dynamic (CFD) solutions are currently used to provide this information. At high-altitude, high-velocity conditions that are characteristic of atmospheric reentry, CFD contributes significantly to the design because of the ability to duplicate flight conditions and to model high temperature effects. Unfortunately, CFD solutions of the hypersonic, viscous, reacting-gas flow over a complete vehicle are both CPU-time and memory intensive even on the most powerful supercomputers; hence, the design role of CFD is generally limited to a few solutions along a vehicle's trajectory.

One CFD code that has been used extensively for the computation of hypersonic, viscous, reacting-gas flows over reentry vehicles is the Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA) [1,2]. LAURA has been used in the past to provide aerothermodynamic characteristics for a number of aerospace vehicles (e.g., the AFE [3], HL-20 [4], Shuttle Orbiter [5], Mars Pathfinder [6], and SSTO Access to Space [7]) and is currently being used in the design and evaluation of blunt aerobraking configurations used in planetary exploration missions [8,9] and Reusable Launch Vehicle (RLV) concepts (e.g., the X-33 [10,11] and X-34 [12] programs). Although the LAURA computer code is continually being updated with new capabilities, it is a mature piece of software with numerous options and utilities that allow the user to tailor the code to a particular application [13].

LAURA was originally developed and tuned for multiprocessor, vector computers with shared memory such as the CRAY C-90. Parallelism using LAURA is achieved through the use of macrotasking, where large sections of code are executed in parallel on multiple processors. Because LAURA employs a point-implicit relaxation strategy that is free to use the latest available data from neighboring cells, the solution may evolve without the need to synchronize tasks. This results in a very efficient use of the multitasking capabilities of the supercomputer [14]. But future supercomputing may be performed on clusters of less powerful machines that offer a better price per performance than current large-scale vector systems. Parallel computers such as the IBM SP2 consist of large numbers of workstation-class processors with memory distributed among the processors instead of being shared. In addition, improvements in workstation processor and network speed and the availability of message-passing libraries allow networks of desktop workstations (that may sit idle during non-work hours) to be used for practical parallel computations [15]. As a result, many CFD codes are making the transition from serial to parallel computing [16-20]. The current shared-memory, macrotasking version of LAURA requires modification before exploiting these distributed-memory parallel computers and workstation clusters.
Several issues need to be addressed in creating a distributed-memory version of LAURA: 1) There is the choice of programming paradigm to use. A domain decomposition strategy [17] (which involves dividing the computational domain into subdomains and assigning each to a processor) is a popular approach to massively parallel processing and is chosen due to its similarity to the current macrotasking version. 2) To minimize memory requirements, the current data structure of the macrotasking, shared-memory version is changed since each processor requires storage only for its own subdomain. 3) The choice of message-passing library (which processors use to explicitly exchange information) may impact portability and performance. 4) The frequency of boundary data exchanges between computational subdomains can influence (and may impede) convergence of a solution, although the point-implicit nature of LAURA already allows asynchronous relaxation [14]. 5) There are also portability and performance concerns involved in designing a version of LAURA to run on different (cache-based and vector) architectures. 6) Finally, a distributed-memory, message-passing version of LAURA should retain all of the functionality, capabilities, utilities, and ease of use of the current shared-memory version.

This paper describes the modifications to LAURA that permit its use in a parallel, distributed-memory environment using the Message-Passing Interface (MPI) standard [21]. An earlier, elementary version of LAURA for perfect gas flows using the Parallel Virtual Machine (PVM) library [22] provides a guide for the current modifications [23]. Performance of the modified version of LAURA is examined on dedicated parallel machines (e.g., IBM SP2, SGI Origin 2000, SGI multiprocessor) as well as on a network of workstations (e.g., SGI R10000). Also, the effect of domain decomposition and frequency of boundary updates on performance and convergence is examined for several realistic configurations and conditions typical of large-scale CFD analysis.

LAURA

LAURA is a finite-volume, shock-capturing algorithm for the steady-state solution of inviscid or viscous, hypersonic flows on rectangularly ordered, structured grids. The upwind-biased inviscid flux is constructed using Roe's flux-difference splitting [24] and Harten's entropy fix [25] with second-order corrections based on Yee's symmetric total-variation-diminishing (TVD) scheme [26]. Gas chemistry options include perfect gas, equilibrium air, and air in chemical and thermal nonequilibrium. More details of the algorithm can be found in Refs. 1, 2, and 13.

The point-implicit relaxation strategy is obtained by treating the variables at the local cell center L at the advanced iteration level and using the latest available data from neighboring cells. Thus, the governing relaxation equation is

\[ M_L \, \delta q_L = r_L \qquad (1) \]

where M_L is the n x n point-implicit Jacobian, delta q_L is the change in the vector of conserved variables q_L, r_L is the residual vector, and n is the number of unknown variables. For a perfect gas and equilibrium air, n is equal to 5. For nonequilibrium chemistry, n is equal to 4 plus the number of constituent species. The residual vector r_L and the Jacobian M_L are evaluated using the latest available data. The change in conserved variables, delta q_L, may be calculated using Gaussian elimination. An LU factorization of the Jacobian can be saved (frozen) over large blocks of iterations (approximately 10 to 50) to reduce computational costs as the solution converges. However, the Jacobian will need to be updated every iteration early in the computation when the solution is changing rapidly.
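The per-cell update in Eq. (1) amounts to solving a small n x n linear system at each cell using the most recent neighbor data. The following C sketch illustrates the idea for a perfect gas (n = 5) with a plain Gaussian elimination solve; the function name, the synthetic cell data, and the omission of pivoting are illustrative assumptions for brevity, not LAURA's actual data structures or solver.

```c
/* Minimal sketch of a point-implicit update: solve M_L * dq = r_L for one cell.
 * Assumptions (not from LAURA itself): n = 5 unknowns (perfect gas), no pivoting
 * in the elimination, and synthetic, diagonally dominant cell data. */
#include <stdio.h>

#define N 5  /* unknowns per cell: perfect gas / equilibrium air */

/* Naive Gaussian elimination: overwrites M and r, leaves the solution in dq. */
static void solve_point_implicit(double M[N][N], double r[N], double dq[N])
{
    for (int k = 0; k < N; k++) {                 /* forward elimination */
        for (int i = k + 1; i < N; i++) {
            double f = M[i][k] / M[k][k];         /* assumes M[k][k] != 0 */
            for (int j = k; j < N; j++)
                M[i][j] -= f * M[k][j];
            r[i] -= f * r[k];
        }
    }
    for (int i = N - 1; i >= 0; i--) {            /* back substitution */
        double s = r[i];
        for (int j = i + 1; j < N; j++)
            s -= M[i][j] * dq[j];
        dq[i] = s / M[i][i];
    }
}

int main(void)
{
    /* One synthetic cell: a diagonally dominant Jacobian and a residual. */
    double M[N][N] = {{4,1,0,0,0},{1,4,1,0,0},{0,1,4,1,0},{0,0,1,4,1},{0,0,0,1,4}};
    double r[N] = {1, 0, 0, 0, 1};
    double dq[N];

    solve_point_implicit(M, r, dq);
    for (int i = 0; i < N; i++)
        printf("dq[%d] = %f\n", i, dq[i]);        /* change in conserved variables */
    return 0;
}
```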
Macrotasking

LAURA utilizes macrotasking by assigning pieces of the computational domain to individual tasks. First, the computational domain is divided into blocks, where a block is defined as a rectangularly ordered array of cells containing all or part of the solution domain. Then each block may be subdivided in the computational sweep direction into one or more partitions. Partitions are then separately assigned to a task (processor). Figure 1 shows a two-dimensional (2D) domain divided into 2 blocks with each block divided into 2 partitions. Thus a task may work on one or more partitions, which may be contained in a single block or may overlap several blocks. Each task then gathers and distributes its data to a master copy of the solution which resides in shared memory. With the point-implicit relaxation, there is no need to synchronize tasks, which results in a very efficient parallel implementation.

Fig. 1 Domain decomposition of macrotasking version (blocks and partitions).

Message-passing

In the new message-passing version of LAURA, the computational domain is again subdivided into blocks along any of the three (i, j, k) coordinate directions with each block assigned to a processor. As compared to the macrotasking version, this is analogous to defining each block to contain only one partition and assigning each partition to a separate task. The number of blocks is therefore equal to the total number of processors. Due to the distributed memory of the processors, each task requires storage only for its own block plus storage for boundary data from as many as six neighboring blocks (i.e., one for each of the six block faces). Figure 2 shows a 2D domain divided equally into 4 separate blocks. Each processor works only on its own block and pauses at user-specified intervals to exchange boundary data with its neighbors. The boundary data exchange is explicitly handled with send and receive calls from the MPI message-passing library [21]. The MPI library was chosen because it is a standard and because there are multiple implementations that run on workstations as well as dedicated parallel machines [27,28]. Synchronization of tasks occurs when messages are exchanged, but this exchange is not required for any particular iteration due to the point-implicit relaxation scheme. As in the macrotasking version, tasks (or blocks) of various sizes may accumulate differing numbers of iterations during a run. For blocks of equal size, it may be convenient to synchronize the message exchange at specified iteration intervals.

Fig. 2 Domain decomposition of message-passing version (blocks with boundary data storage and communication between blocks).
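A minimal sketch of the kind of exchange described above, in C with MPI: each rank owns one block and, every nexch iterations, swaps a face of boundary data with its neighbors. The 1-D chain of blocks, the single face buffer per neighbor, and the variable names (nexch, face_send, etc.) are illustrative assumptions; LAURA's blocks exchange data on up to six faces, but the pattern is the same.

```c
/* Sketch of a boundary-data exchange between neighboring blocks using MPI.
 * Assumptions: one block per rank arranged in a 1-D chain, one face buffer
 * per neighbor, and an exchange every "nexch" iterations. */
#include <mpi.h>
#include <stdio.h>

#define FACE_SIZE 4   /* illustrative number of values on a block face */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    double face_send[FACE_SIZE], face_recv_l[FACE_SIZE], face_recv_r[FACE_SIZE];
    const int nexch = 20;          /* iterations between boundary exchanges */
    const int niter = 200;

    for (int it = 1; it <= niter; it++) {
        for (int i = 0; i < FACE_SIZE; i++)       /* stand-in for one relaxation sweep */
            face_send[i] = rank + 0.001 * it;

        if (it % nexch == 0) {                    /* lagged boundary update */
            MPI_Sendrecv(face_send,   FACE_SIZE, MPI_DOUBLE, left,  0,
                         face_recv_r, FACE_SIZE, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(face_send,   FACE_SIZE, MPI_DOUBLE, right, 1,
                         face_recv_l, FACE_SIZE, MPI_DOUBLE, left,  1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }
    if (rank == 0) printf("done after %d iterations\n", niter);
    MPI_Finalize();
    return 0;
}
```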
Results

The performance of the distributed-memory, message-passing version of LAURA is examined in terms of computational speed and convergence. Measuring the elapsed wall clock time of the code on different machines estimates the communication overhead and message-passing efficiency of the code. The communication overhead associated with exchanging boundary data between nodes depends on the parallel machine, the size of the problem, and the frequency of exchanges. The frequency of data exchanges may be decreased if necessary to reduce the communication penalty, but this may adversely affect convergence. Therefore, the impact of boundary data exchange frequency on convergence is determined for several realistic vehicles and flow conditions.

Computational Speed

Timing estimates using the message-passing version of LAURA are presented for an IBM SP2, an SGI Origin 2000, an SGI multiprocessor machine, and a network of SGI R10000 workstations. The single-node performance of LAURA on a cache-based (as opposed to vector) architecture is not addressed. Viscous, perfect gas computations are performed on the forebody of an X-33 [10,11] configuration with a grid size of 64 x 56 x 64. The computational domain is split along each of the coordinate directions (depending on the number of nodes) into blocks of equal size. The individual block sizes are shown in Table 1.

Table 1 Block sizes for timing study.

  Nodes    Block
    2      32 x 56 x 64
    4      32 x 28 x 64
    8      32 x 28 x 32
   16      16 x 28 x 32
   32      16 x 14 x 32
   64      16 x 14 x 16
  128       8 x 14 x 16

Because the blocks are equal in size, boundary data exchanges are synchronized at a specified iteration interval for convenience. Each run begins with a partially converged solution and is run for 200 iterations with second order accuracy. Two values (1 and 20) are used for nexch, the number of iterations between boundary data exchanges, to estimate the communication overhead on each machine. The number of iterations that the Jacobian is held fixed, njcobian, is equal to 20 and represents a typical value for solutions that are partially converged.
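One way to produce equal-size blocks like those in Table 1 is to halve one coordinate direction each time the node count doubles, cycling through the i-, j-, and k-directions. The short C sketch below reproduces the table under that rule; the cycling order is inferred from the table itself and the routine is an illustrative assumption, not LAURA's grid-partitioning code.

```c
/* Sketch: split a 64 x 56 x 64 grid into 2^p equal blocks by halving one
 * coordinate direction per doubling of the node count, cycling i -> j -> k.
 * The cycle order is inferred from Table 1; this is illustrative only. */
#include <stdio.h>

int main(void)
{
    int dims[3] = {64, 56, 64};               /* i, j, k cell counts */
    int nodes = 1;

    printf("Nodes   Block\n");
    for (int p = 1; p <= 7; p++) {            /* 2, 4, ..., 128 nodes */
        dims[(p - 1) % 3] /= 2;               /* halve the next direction */
        nodes *= 2;
        printf("%5d   %d x %d x %d\n", nodes, dims[0], dims[1], dims[2]);
    }
    return 0;
}
```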
Four different architectures are used to obtain timing estimates. The first is a 160-node IBM SP2 located at the Numerical Aerospace Simulation (NAS) Facility at NASA Ames using IBM's implementation of MPI. The second is a 64-processor (R10000) SGI Origin 2000, also located at NAS, using SGI's version of MPI. The third is a 12-processor (R10000) SGI machine operating in a multiuser environment, and the fourth is a network of SGI R10000 workstations connected by Ethernet. Both of these SGI machines use the MPICH implementation of MPI [27]. On all architectures, the MPI-defined timer, MPI_WTIME, is used to measure elapsed wall clock time for the main algorithm only. The time to read and write restart files and to perform pre- and post-processing is not measured, although it may account for a significant fraction of the total time. Compiler options include '-O3 -qarch=pwr2' on the IBM SP2 and '-O2 -n32 -mips4' on the SGI machines. No effort is made to optimize the single-node performance of LAURA on these cache-based architectures.
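The measurement described above can be reproduced with the standard MPI wall clock timer: bracket only the relaxation loop with MPI_Wtime calls, leaving restart-file I/O and pre/post-processing outside the timed region. The sketch below shows that pattern; the relax() stand-in and the iteration count are assumptions, not LAURA code.

```c
/* Sketch of timing the main algorithm only with the MPI timer (MPI_Wtime).
 * Restart-file I/O and pre/post-processing sit outside the timed region,
 * mirroring the measurement described in the text. */
#include <mpi.h>
#include <stdio.h>

static void relax(int it) { (void)it; /* placeholder for one iteration */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... read restart file, set up blocks (not timed) ... */

    MPI_Barrier(MPI_COMM_WORLD);          /* start all ranks together */
    double t0 = MPI_Wtime();
    for (int it = 1; it <= 200; it++)     /* 200 iterations, as in the timing study */
        relax(it);
    double t1 = MPI_Wtime();

    /* ... write restart file, post-process (not timed) ... */

    if (rank == 0)
        printf("elapsed wall clock time: %.3f s\n", t1 - t0);
    MPI_Finalize();
    return 0;
}
```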
Figures 3-6 display the elapsed wall clock times on the various machines. A time based on the single-node time and assuming a linear speedup equal to the number of nodes is shown for comparison. The measured times are less than the comparison time for most of the cases as a result of the smaller blocks on each node making better use of the cache. This increase in cache performance offsets the communication penalty. Improving the single-node performance of LAURA on these cache-based architectures would reduce the single-node times and give a more accurate measure of the communication overhead. Nevertheless, the speedup of the code on all machines is good. As anticipated, the relative message-passing performance on the dedicated machines (IBM SP2, SGI Origin 2000, SGI multiprocessor) is better than on the network of SGI workstations. Also, the performance with data exchanged every 20 iterations is noticeably better on the network of workstations than with data exchanged every iteration. However, there is little influence of nexch on elapsed time on the dedicated machines, which indicates that the communication overhead is very low. The degradation in performance of the 8-processor runs on the SGI multiprocessor is due to the load on the machine from other users and is not a result of the communication overhead. Of course, the times (and message-passing efficiency) measured will vary depending on machine and problem size. Also shown in Fig. 3 is the elapsed time from a multitasking run with the original version of LAURA on a CRAY C-90 using 9 CPUs. This shows that performance comparable to current vector supercomputers may be obtained on dedicated parallel machines (albeit with more processors) using this distributed-memory version of LAURA.

Fig. 3 Elapsed wall clock time on IBM SP2.
Fig. 4 Elapsed wall clock time on SGI Origin 2000.
Fig. 5 Elapsed wall clock time on SGI multiprocessor.
Fig. 6 Elapsed wall clock time on network of SGI R10000 workstations.

Convergence

The effect of problem size, gas chemistry, and boundary data exchange frequency on convergence is examined for four realistic geometries: the X-33 [10,11] and X-34 [12] RLV concepts, the X-33 forebody, and the Stardust sample return capsule forebody [8]. All four geometries are shown in Fig. 7. A viscous (thin-layer Navier-Stokes), perfect gas solution is computed over the X-33 and X-33 forebody configurations. The convergence of an inviscid, perfect gas solution is examined using the X-34 vehicle. Nonequilibrium air chemistry effects on the convergence and performance of the distributed-memory version of LAURA are determined from a viscous, 11-species air calculation over the Stardust capsule. For all geometries, the vehicle is defined by the k = 1 surface, and the outer boundary of the volume grid is defined by k = kmax.

Fig. 7 Vehicle geometries: a) X-33, b) X-34, c) X-33 forebody, d) Stardust capsule.

Each viscous solution is computed with the same sequence of parameters for consistency and is started with all flow-field variables initially set to their freestream values. Slightly different values are used for the inviscid solutions due to low densities on the leeside of the vehicle causing some instability when switching from first to second order accuracy. Methods to speed convergence, such as computing on a coarse grid before proceeding to the fine grid and converging blocks sequentially beginning at the nose (i.e., block marching), are not used. The relevant LAURA parameters are shown in Tables 2 and 3.

Table 2 LAURA parameters - viscous.

  Iterations   Order   njcobian
     0-100       1         1
   101-300       1         2
   301-500       2        10
      >500       2        20

Table 3 LAURA parameters - inviscid.

  Iterations   Order   njcobian
     0-100       1         1
   101-300       1         2
   301-900       1        10
  901-1100       2        10
     >1100       2        20

Two values of nexch are used (except for the run involving the complete X-33 configuration). A baseline solution is generated with nexch equal to 1. Updating the boundary data every iteration should mimic the communication between blocks in the shared-memory version of LAURA. A second computation is made with nexch equal to njcobian, since acceptable values for both parameters depend on transients in the flow. Solutions that are changing rapidly should update the Jacobian and exchange boundary data frequently, while partially converged solutions may be able to freeze the Jacobian and lag the boundary data for a number of iterations. A simple strategy is to link the two parameters. Convergence is measured by the L2 norm defined by

\[ L_2 = \frac{1}{C_N} \left[ \sum_{L=1}^{N} \frac{r_L \cdot r_L}{\rho_L^{2}} \right]^{1/2} \qquad (2) \]

where C_N is the Courant number, N is the total number of cells, r_L is the residual vector, and rho_L is the local density. All solutions are generated on the IBM SP2.
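In a distributed-memory run, each block can accumulate its own contribution to the sum in Eq. (2), with a single global reduction combining the partial sums. The C sketch below shows that pattern with MPI_Allreduce; the synthetic residual and density arrays and the Courant number value are illustrative assumptions rather than LAURA's implementation.

```c
/* Sketch of evaluating the L2 norm of Eq. (2) across distributed blocks:
 * each rank sums (r_L . r_L) / rho_L^2 over its own cells, a global
 * reduction adds the partial sums, and the result is scaled by 1/C_N.
 * The synthetic cell data and the Courant number are assumptions. */
#include <mpi.h>
#include <math.h>
#include <stdio.h>

#define NVAR  5         /* residual components per cell (perfect gas) */
#define NCELL 1000      /* cells owned by this block (illustrative)   */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double r[NCELL][NVAR], rho[NCELL];
    for (int c = 0; c < NCELL; c++) {             /* synthetic residuals */
        rho[c] = 1.0;
        for (int v = 0; v < NVAR; v++)
            r[c][v] = 1.0e-3 / (1 + rank);
    }

    double local = 0.0;                           /* this block's partial sum */
    for (int c = 0; c < NCELL; c++) {
        double rr = 0.0;
        for (int v = 0; v < NVAR; v++)
            rr += r[c][v] * r[c][v];              /* r_L . r_L */
        local += rr / (rho[c] * rho[c]);
    }

    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    double CN = 10.0;                             /* Courant number (assumed value) */
    double L2 = sqrt(global) / CN;
    if (rank == 0) printf("L2 = %e\n", L2);
    MPI_Finalize();
    return 0;
}
```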
This is the same con- L2 (cid:12)guration used to obtain the timing estimates, and 10-2 freestream conditions correspond to the complete X- 33 vehicle case. The 64 x 56 x 64 grid is (cid:12)rst divided inthei-,j-,andk-directionsinto16blockscomprised 10-4 of 32 x 28 x 16 cells each. Two cases, nexch = 1 and nexch = njcobian, are run using this blocking. An- 10-6 other case is computed with the grid divided in the i- 0 1 2 3 4 andj-directionsonlyresultinginblocksof16x14x64 wallclocktime(hr) cells. Next,theasynchronousrelaxationcapabilitiesof LAURAaretestedbyreblockingapartiallyconverged b)Convergence as function of time restart (cid:12)le in the k-direction to cluster work (and it- Fig. 9 Convergence histories of viscous, perfect erations) in the boundary layer. Each block has i x j gas (cid:13)ow (cid:12)eld over X-33 forebody. dimensionsof32x28,butthek dimensionissplitinto 8, 8, 16, and 32 cells. Blocks near the wall contain 32 the k-direction. x 28 x 8 cells, while blocks near the outer boundary have 32 x 28 x 32 cells. Thus, the smaller blocks ac- Figure 9(b) shows the convergence as a function of cumulate more iterations than the larger outer blocks wall clock time. Because of the low communication in a given amount of time and should convergefaster. overhead on the IBM SP2, the time saved by mak- Figure9showsthe convergencehistoryforthis(cid:13)ow ing fewer boundary data exchanges is small. As seen (cid:12)eld. For viscous solutions, convergence is typically from the timing data, this would not necessarily be dividedintotwostages. First,the inviscidshocklayer true on a network of workstations where the decrease develops and then the majority of the iterations are in communication overhead might o(cid:11)set any increase spentconvergingtheboundarylayer(andsurfaceheat- in number of iterations. Also shown are LAURA’s ing). Laggingtheboundarydataappearstohavemore asynchronous relaxation capabilities. After 1 hr (and of an impact on the early convergence of the inviscid 3500 iterations), the outer inviscid layer is partially featuresofthe(cid:13)owandlessofanimpactonthebound- converged. Restructuring the block structure at this arylayerconvergence. Thise(cid:11)ectismuchlargerwhen point by splitting the k dimension into 8, 8, 16, and the blocksaresplit in the k-directionacrossthe shock 32cellsallowstheboundarylayertoaccumulatemore layer. The communication delay a(cid:11)ects the develop- iterationsandacceleratesconvergence. Theresultisa ing shock wave as it crosses the block boundaries in 15percentdecreaseinwallclocktimecomparedtothe 6of9 baseline (nexch = 1) case. A similar strategy would also have accelerated the convergence of the baseline 104 nspelxitcihn=I-1J-K } nexch=njcobian case. splitinI-J 103 X-34 102 Thee(cid:11)ectofboundarydataexchangefrequencyand X-34 block splitting on convergence of inviscid, perfect gas 101 M =6.32 ¥ a =23deg (cid:13)owsisinvestigatedfortheX-34con(cid:12)guration(minus the body (cid:13)ap and vertical tail). Inviscid solutions are L2100 InviscidPG usefulinpredictingaerodynamiccharacteristicsforve- 10-1 120x152x32 hicledesignandmaybecoupledwithaboundary-layer 32nodes-IBMSP2 techniquetopredictsurfaceheattransferaswell. The 10-2 freestreamMachnumberis6.32,theangleofattackis 10-3 23 deg, and the altitude is 36 km. The grid is 120 x 152x32andis(cid:12)rstdividedinto32blocksof30x38x 10-4 16cells. Thegridisalsosplitinthei-andj-directions 0 1000 2000 3000 4000 iterations into blocks of 30 x 19 x 32 cells to check the e(cid:11)ect of block structure on convergence. 
The convergence his- tories areshownin Figure10. The aerodynamics(not a)Convergence as function of number of iterations shown)ofthevehicleareconvergedat4000iterations. Thespikeinconvergenceat900iterationsiscausedby 104 nexch=1 the switch from (cid:12)rst to second order accuracy. With splitinI-J-K } nexch=njcobian splitinI-J the grid split in all directions, the bas(cid:0)e3line solution 103 (nexch = 1) reaches an L2 norm of 10 at 3300 it- erationswhilethesolutionwithboundarydatalagged 102 takes 3640iterations. The solution with the grid split 101 in the i- and j-directions requires 3530 iterations. As showninFig.10(b),thereisacorrespondingdi(cid:11)erence L2100 in run times to reach that convergence level because the savings from fewer boundary data exchanges are 10-1 small on the SP2. Nevertheless, the e(cid:11)ect of lagging 10-2 the boundary data on convergenceis minimal. Stardust 10-3 Theconvergenceofanonequilibriumair(11species, 10-4 0 1 2 two temperature), viscous computation is examined wallclocktime(hr) for the forebody of the Stardust capsule. The freestream Mach number is 17 and the angle of at- b)Convergence as function of time tackis 10 deg. The gridis 56 x 32 x 60and is divided into 32 blocks of 7 x 8 x 60 cells each. There are Fig. 10 Convergence histories of inviscid, perfect no splits in the k-direction. Figure 11 shows the con- gas (cid:13)ow (cid:12)eld over X-34 vehicle. vergence as a function of iterations and elapsed wall clock time. Because of the larger number of (cid:13)ow-(cid:12)eld take advantage of distributed-memory parallel ma- variables, considerably more data must be exchanged chines. A standard domain decomposition strategy between blocks for nonequilibrium (cid:13)ows. Even on a yields good speedup on dedicated parallel systems, dedicated parallel machine such as the IBM SP2, the but the single-nodeperformanceof LAURA on cache- communication penalty for this particular case has a based architecturesrequires further study. The point- signi(cid:12)cant impact on the elapsed time. The baseline (cid:0)4 implicit relaxation strategy in LAURA is well-suited case reaches an L2 norm of 10 at 6900 iterations for parallel computing and allows the communication compared to 7500 iterations for the nexch = njcobian overhead to be minimized (if necessary) by reducing solution. However,thesavingsincommunicationtime the frequency of boundary data exchanges. The com- allowsthe nexch = njcobian solution to converge1 hr munication overhead is greatest on the network of faster than the baseline case. workstationsandfornonequilibrium(cid:13)owsduetomore data passing between nodes. Lagging the boundary Conclusions databetweenblocksappearstoa(cid:11)ectthedevelopment The shared-memory, multitasking version of the of the inviscid shock layer more than the convergence CFD code LAURA has been successfully modi(cid:12)ed to of the boundary layer. Its largest e(cid:11)ect occurs when 7of9 Acknowledgements 104 nexch=1 } splitinI-J nexch=njcobian The authors wish to acknowledge Peter Gno(cid:11)o of the Aerothermodynamics Branch at NASA LaRC for 102 Stardust his assistancewith the inner workingsof LAURA and M =17 Jerry Mall of Computer Sciences Corporation for his ¥ a =10deg help in pro(cid:12)ling LAURA on the IBM SP2. 100 Viscous11-speciesair L2 56x32x60 References 32nodes-IBMSP2 1Gno(cid:11)o, P. A., \An Upwind-Biased, Point-Implicit Relax- 10-2 ationAlgorithmforViscous,CompressiblePerfect-GasFlows," NASATP{2953, Feb.1990. 2 Gno(cid:11)o, P. 
References

[1] Gnoffo, P. A., "An Upwind-Biased, Point-Implicit Relaxation Algorithm for Viscous, Compressible Perfect-Gas Flows," NASA TP-2953, Feb. 1990.
[2] Gnoffo, P. A., "Upwind-Biased, Point-Implicit Relaxation Strategies for Viscous, Hypersonic Flows," AIAA Paper 89-1972, Jun. 1989.
[3] Gnoffo, P. A., "Code Calibration Program in Support of the Aeroassist Flight Experiment," Journal of Spacecraft and Rockets, Vol. 27, No. 2, 1990, pp. 131-142.
[4] Weilmuenster, K. J. and Greene, F. A., "HL-20 Computational Fluid Dynamics Analysis," Journal of Spacecraft and Rockets, Vol. 30, No. 5, 1993, pp. 558-566.
[5] Gnoffo, P. A., Weilmuenster, K. J., and Alter, S. J., "Multiblock Analysis for Shuttle Orbiter Re-Entry Heating From Mach 24 to Mach 12," Journal of Spacecraft and Rockets, Vol. 31, No. 3, 1994, pp. 367-377.
[6] Mitcheltree, R. A. and Gnoffo, P. A., "Wake Flow About the Mars Pathfinder Entry Vehicle," Journal of Spacecraft and Rockets, Vol. 32, No. 5, 1994, pp. 771-776.
[7] Weilmuenster, K. J., Gnoffo, P. A., Greene, F. A., Riley, C. J., Hamilton, H. H., and Alter, S. J., "Hypersonic Aerodynamic Characteristics of a Proposed Single-Stage-to-Orbit Vehicle," Journal of Spacecraft and Rockets, Vol. 33, No. 4, 1995, pp. 463-469.
[8] Mitcheltree, R. A., Wilmoth, R. G., Cheatwood, F. M., Brauckmann, G. J., and Greene, F. A., "Aerodynamics of Stardust Sample Return Capsule," AIAA Paper 97-2304, Jun. 1997.
[9] Mitcheltree, R. A., Moss, J. N., Cheatwood, F. M., Greene, F. A., and Braun, R. D., "Aerodynamics of the Mars Microprobe Entry Vehicles," AIAA Paper 97-3658, Aug. 1997.
[10] Cook, S. A., "X-33 Reusable Launch Vehicle Structural Technologies," AIAA Paper 96-4573, Nov. 1996.
[11] Gnoffo, P. A., Weilmuenster, K. J., Hamilton, H. H., Olynick, D. R., and Venkatapathy, E., "Computational Aerothermodynamic Design Issues for Hypersonic Vehicles," AIAA Paper 97-2473, Jun. 1997.
[12] Levine, J., "NASA X-34 Program," Meeting Papers on Disc A9710806, AIAA, Nov. 1996.
[13] Cheatwood, F. M. and Gnoffo, P. A., "User's Manual for the Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA)," NASA TM-4674, Apr. 1996.
[14] Gnoffo, P. A., "Asynchronous, Macrotasked Relaxation Strategies for the Solution of Viscous, Hypersonic Flows," AIAA Paper 91-1579, Jun. 1991.
[15] Jayasimha, D. N., Hayder, M. E., and Pillay, S. K., "An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations," NASA CR-198308, Mar. 1996.
[16] Venkatakrishnan, V., "Parallel Implicit Unstructured Grid Euler Solvers," AIAA Paper 94-0759, Jan. 1994.
[17] Wong, C. C., Blottner, F. G., Payne, J. L., and Soetrisno, M., "A Domain Decomposition Study of Massively Parallel Computing in Compressible Gas Dynamics," AIAA Paper 95-0572, Jan. 1995.
[18] Borrelli, S., Schettino, A., and Schiano, P., "Hypersonic Nonequilibrium Parallel Multiblock Navier-Stokes Solver," Journal of Spacecraft and Rockets, Vol. 33, No. 5, 1996, pp. 748-750.
[19] Domel, N. D., "Research in Parallel Algorithms and Software for Computational Aerosciences," NAS 96-004, Apr. 1996.
[20] Van der Wijngaart, R. F. and Yarrow, M., "RANS-MP: A Portable Parallel Navier-Stokes Solver," NAS 97-004, Feb. 1997.
[21] Message Passing Interface Forum, "MPI: A Message-Passing Interface Standard," Computer Science Dept. Technical Report CS-94-230, University of Tennessee, Knoxville, TN, 1994.
[22] Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V., "PVM 3.0 User's Guide and Reference Manual," Tech. rep., Feb. 1993.
[23] Balasubramanian, R., "Modification of Program LAURA to Execute in PVM Environment," Spectrex Report 95.10.01, Oct. 1995.
[24] Roe, P. L., "Approximate Riemann Solvers, Parameter Vectors, and Difference Schemes," Journal of Computational Physics, Vol. 43, No. 2, 1981, pp. 357-372.
[25] Harten, A., "High Resolution Schemes for Hyperbolic Conservation Laws," Journal of Computational Physics, Vol. 49, No. 3, 1983, pp. 357-393.
[26] Yee, H. C., "On Symmetric and Upwind TVD Schemes," NASA TM-86842, Sep. 1985.
[27] Gropp, W. and Lusk, E., "User's Guide for mpich, a Portable Implementation of MPI," Tech. Rep. ANL/MCS-TM-ANL-96/6, Argonne National Laboratory, 1996.
[28] Burns, G., Daoud, R., and Vaigl, J., "LAM: An Open Cluster Environment for MPI," Tech. rep., Ohio Supercomputing Center, May 1994.
