
DTIC ADA374311: Design, Development, Benchmarking and Evaluation of Parallel Applications for High Performance Embedded Systems PDF

302 Pages·13.3 MB·English


AFRL-IF-RS-TR-1999-269
Final Technical Report
January 2000

DESIGN, DEVELOPMENT, BENCHMARKING AND EVALUATION OF PARALLEL APPLICATIONS FOR HIGH PERFORMANCE EMBEDDED SYSTEMS

Syracuse University

Wei-keng Liao, Donald Weiner, and Alok Choudhary

APPROVED FOR PUBLIC RELEASE; DISTRIBUTION UNLIMITED.

AIR FORCE RESEARCH LABORATORY
INFORMATION DIRECTORATE
ROME RESEARCH SITE
ROME, NEW YORK

This report has been reviewed by the Air Force Research Laboratory, Information Directorate, Public Affairs Office (IFOIPA) and is releasable to the National Technical Information Service (NTIS). At NTIS it will be releasable to the general public, including foreign nations.

AFRL-IF-RS-TR-1999-269 has been reviewed and is approved for publication.

APPROVED: ZENON J. PRYK, Project Engineer

FOR THE DIRECTOR: NORTHRUP FOWLER, Technical Advisor, Information Technology Division

If your address has changed or if you wish to be removed from the Air Force Research Laboratory Rome Research Site mailing list, or if the addressee is no longer employed by your organization, please notify AFRL/IFTS, 26 Electronic Pky, Rome, NY 13441-4514. This will assist us in maintaining a current mailing list.

Do not return copies of this report unless contractual obligations or notices on a specific document require that it be returned.

REPORT DOCUMENTATION PAGE
Form Approved OMB No. 0704-0188

[...] for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 20503.

1. AGENCY USE ONLY (Leave blank)
2. REPORT DATE: January 2000
3. REPORT TYPE AND DATES COVERED: Final, Nov 96 - May 99
4. TITLE AND SUBTITLE: DESIGN, DEVELOPMENT, BENCHMARKING AND EVALUATION OF PARALLEL APPLICATIONS FOR HIGH PERFORMANCE EMBEDDED SYSTEMS
5. FUNDING NUMBERS: C - F30602-97-C-0026; PE - 63755D; PR - HPCM; TA - 00; WU - P1
6. AUTHOR(S): Wei-keng Liao, Donald Weiner, and Alok Choudhary
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Syracuse University, Office of Sponsored Programs, 113 Browne Hall, Syracuse, New York 13441-4514
8. PERFORMING ORGANIZATION REPORT NUMBER: N/A
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES): Air Force Research Laboratory/IFTC, 26 Electronic Pky, Rome, New York 13441-4514
10. SPONSORING/MONITORING AGENCY REPORT NUMBER: AFRL-IF-RS-TR-1999-269
11. SUPPLEMENTARY NOTES: Air Force Research Laboratory Project Engineer: Zenon J. Pryk/IFTC/(315) 330-2596
12a. DISTRIBUTION AVAILABILITY STATEMENT: APPROVED FOR PUBLIC RELEASE; DISTRIBUTION UNLIMITED.
12b. DISTRIBUTION CODE
13. ABSTRACT (Maximum 200 words)

Due to the nature of the algorithms typically employed in applications such as STAP, sensor data fusion, and target detection, it was decided to integrate the signal processing areas of space-time adaptive processing and signal detection. In particular, the following algorithms were parallelized:
1) AFRL (Rome) version of a PRI-staggered post-Doppler STAP algorithm. This algorithm, comprising more than 23,000 lines of code, included the steps of a) Doppler filter processing, b) weight computation, c) beam forming, d) pulse compression, and e) constant false alarm rate (CFAR) processing.
2) Ozturk (clutter characterization) algorithm. This algorithm is used to analyze random data and includes the steps of a) goodness-of-fit test and b) probability distribution approximation.
3) Ordered-statistic CFAR algorithm. This CFAR algorithm is in addition to the cell averaging CFAR algorithm contained in the PRI-staggered post-Doppler STAP algorithm.
In carrying out the algorithm parallelizations, the following task/technical requirements were accomplished:
1) Efficient techniques for high-speed, high-volume I/O applicable to embedded high-performance systems were designed and implemented.
2) Data distribution and redistribution strategies for both inter-task and intra-task data communications in real-time pipelined and parallelized applications were designed and implemented.
3) A documented beta code release was implemented to illustrate the full system with all major functional, technical, programming, documentation, installation, and user application features to be included in the full delivery.
4) The individual algorithms, as well as the integrated applications, were implemented, demonstrated, benchmarked, and evaluated on the Intel Paragon and IBM SP2.

14. SUBJECT TERMS: Signal Processing, High Performance Computing, Programming
15. NUMBER OF PAGES
16. PRICE CODE
17. SECURITY CLASSIFICATION OF REPORT: UNCLASSIFIED
18. SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED
19. SECURITY CLASSIFICATION OF ABSTRACT: UNCLASSIFIED
20. LIMITATION OF ABSTRACT: UL

Standard Form 298 (Rev. 2-89) (EG). Prescribed by ANSI Std. 239.18. Designed using Perform Pro, WHS/DIOR, Oct 94.

Table of Contents
1.0 Background
2.0 Objectives
3.0 Administrative Details
4.0 Participants
5.0 Accomplishments
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E

FINAL REPORT FOR DESIGN, DEVELOPMENT, BENCHMARKING AND EVALUATION OF PARALLEL APPLICATIONS FOR HIGH PERFORMANCE EMBEDDED SYSTEMS

1.0 Background

High performance computing is coming into the mainstream due to progress made in both hardware and software support in the past few years.
For DoD applications, in particular, the trend toward leveraging off-the-shelf components and systems creates the need to address many system issues that were largely not considered when high performance computing was used mainly for scientific applications. These DoD-specific issues arise from the particular functional requirements of the intended applications, the frequency requirements for high-speed, high-volume data input and output, and real-time requirements for achieving specified throughput and latency. For benchmarking and evaluation of software systems, it is not sufficient merely to compute the total execution time of an application. It is also extremely important to study the performance of individual components of an application, the overheads stemming from interactions among the component tasks (e.g. data flow), and the overall performance of an integrated system in terms of the achievable latency and throughput.

2.0 Objectives

The objectives of this effort were to:
(a) design, develop and implement individual parallel and portable algorithms plus integrated algorithm systems for applications such as Space-Time Adaptive Processing (STAP), sensor data fusion, and target detection;
(b) design and implement efficient Input/Output (I/O), data redistribution and task assignment techniques for embedded high-performance system applications;
(c) implement and benchmark the algorithms individually and in integrated applications on the Intel Paragon and demonstrate the performance levels achieved;
(d) deliver high-quality software for distribution to DoD researchers nationwide.

3.0 Administrative Details

The sponsor of this effort was the Information Directorate of the Air Force Research Laboratory (AFRL/IF) located in Rome, NY. Funding in the amount of $359,748 was provided as part of the Common HPC Software Support Initiative (CHSSI) under the DoD High Performance Computing Modernization Program (HPCMP).
The duration of the effort was approximately 29 months, with a start date of December 24, 1996 and an end date of May 15, 1999. Syracuse University was the principal contractor while Northwestern University was a subcontractor.

4.0 Participants

The principal investigators were Drs. Pramod Varshney and Donald Weiner of Syracuse University and Drs. Alok Choudhary and Nagaraj Shenoy of Northwestern University. They were assisted by doctoral students Wei-keng Liao of Syracuse University and Xiaohui Shen of Northwestern University. Valuable contributions were made by several AFRL (Rome) personnel. Russ Brown, Mike Little, Mark Linderman, and Richard Linderman clearly explained their rationale for the changes they had implemented in the STAP algorithm chosen for parallelization. Charles Pedersen and Zen Pryk provided valuable guidance with the CHSSI and HPCMP documentation requirements. In addition, Zen Pryk assisted with the alpha and beta testing and made a major contribution to parallelization of the Ozturk algorithm by converting its FORTRAN code from interactive to batch mode. Zen also removed nonstandard FORTRAN features so that the parallelized version of the Ozturk algorithm could be compiled and run on a variety of high-performance computers.

5.0 Accomplishments

Based upon Government review of our suggestions with regard to algorithms typically employed in applications such as STAP, sensor data fusion, and target detection, it was decided to integrate the signal processing areas of space-time adaptive processing and signal detection. In particular, the following algorithms were parallelized:
1) AFRL (Rome) version of a PRI-staggered post-Doppler STAP algorithm. This algorithm, comprising more than 23,000 lines of code, included the steps of a) Doppler filter processing, b) weight computation, c) beam forming, d) pulse compression, and e) constant false alarm rate (CFAR) processing.
2) Ozturk algorithm.
This algorithm is used to analyze random data and includes the steps of a) goodness-of-fit test and b) probability distribution approximation.
3) Ordered-statistic CFAR algorithm. This CFAR algorithm is in addition to the cell averaging CFAR algorithm contained in the PRI-staggered post-Doppler STAP algorithm.

In carrying out the algorithm parallelizations, the following task/technical requirements were accomplished:
1) Efficient techniques for high-speed, high-volume I/O applicable to embedded high-performance systems were designed and implemented.
2) Data distribution and redistribution strategies for both inter-task and intra-task data communications in real-time pipelined and parallelized applications were designed and implemented.
3) Task assignment and scheduling techniques which can be used to satisfy latency and throughput requirements for high-performance embedded systems were designed and implemented.
4) A documented alpha code release was implemented in accordance with the contract schedule using algorithms that provide a representative example of all major technical, programming, documentation, installation and user application features planned for the full delivery.
5) A documented beta code release was implemented to illustrate the full system with all major functional, technical, programming, documentation, installation, and user application features to be included in the full delivery.
6) The individual algorithms, as well as the integrated applications, were implemented, demonstrated, benchmarked, and evaluated on the Intel Paragon at AFRL (Rome). The performance and optimization levels achieved were demonstrated and the final release delivered to AFRL (Rome).
7) A Software System Design Plan that presented prioritized and sequenced timelines for design, development, benchmarking, evaluation and documentation for the individual algorithms and applications chosen for parallelization was documented.
Targeted levels of completion and functionality for the alpha, beta, and final code releases, and the format and planned content for the Application Programming Interface were included.
8) All computer software developed, assembled, and acquired was delivered to the Government in accordance with its specifications.

Details of the work accomplished are documented in the publications, reports, and manuals included in the appendices attached to this report. These are itemized below:

1) Papers presented at conferences (Appendix A)

Choudhary, A., Liao, W. K., Weiner, D., Varshney, P., Linderman, M., Linderman, R., "Design of Parallel Pipelined STAP on High-Performance Computers", Proc. 1997 DoD High Performance Computing Modernization Program Users Group Meeting, San Diego, CA, June 23-26, 1997.

Choudhary, A., Liao, W. K., Weiner, D., Varshney, P., Linderman, M., Linderman, R., "Design, Implementation, and Evaluation of Parallel Pipelined STAP on Parallel Computers", Combined International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, Orlando, Florida, March 30-April 3, 1998.

Choudhary, A., Liao, W. K., Weiner, D., Varshney, P., Linderman, M., Linderman, R., "Design and Implementation of Space-Time Adaptive Processing Application ...", Proc. 1998 DoD High Performance Computing Modernization Program Users Group Meeting, Houston, Texas, June 1-5, 1998.

Liao, W. K., Choudhary, A., Weiner, D., Varshney, P., "Multi-Threaded Design and Implementation of Parallel Pipelined STAP on Parallel Computers", 1999 International Parallel Processing Symposium, Puerto Rico, April 1999.

2) Papers submitted for publication (Appendix B)

Choudhary, A., Liao, W. K., Weiner, D., Varshney, P., Linderman, M., Linderman, R., "Design, Implementation, and Evaluation of Parallel Pipelined STAP on Parallel Computers", selected for a special collection of papers on STAP and adaptive arrays to appear in an upcoming issue of the IEEE Transactions on Aerospace and Electronic Systems.
Liao, W. K., Choudhary, A., Weiner, D., Varshney, P., "I/O Implementation and Evaluation of Parallel Pipelined STAP on Parallel Computers", International Conference on High-Performance Computing (HiPC 99), Calcutta, India, Dec. 17-20, 1999.

3) Ph.D. Dissertation (Appendix C)

Liao, W. K., "Parallel Pipelined Computational Model for Space-Time Adaptive Processing", Syracuse University, June 1999.

4) Report and Users' Manual for Ozturk Algorithm (Appendix D)

5) Users' Manual for STAP (Appendix E)

Appendix A
Papers presented at conferences

Design of Parallel Pipelined STAP on High-Performance Computers

Alok Choudhary ([email protected]), ECE Department, Northwestern University, Evanston, IL 60208
Wei-Keng Liao, D. Weiner and P. Varshney, EECS Department, Syracuse University, Syracuse, NY 13244
R. Linderman and M. Linderman, Rome Laboratory, Surveillance Directorate, Rome, NY 13441

Abstract

This paper presents preliminary results for our ongoing implementation of a parallel pipelined STAP algorithm on high-performance computers. In particular, the paper describes the issues involved in parallelization, our approach to parallelization and initial results on some tasks of the STAP algorithm. Initial results are encouraging and show significant performance benefits from our approach. The results demonstrate the scalability of computations and communication.

1. Introduction

The detection of weak target returns embedded in strong ground clutter, interference, and receiver noise is a primary objective of airborne surveillance phased array radars. Space-time adaptive processing (STAP) refers to 2-dimensional adaptive filtering algorithms which take advantage of differences between the spatial and/or Doppler frequencies of the target versus those of the unwanted components of the received waveform in order to separate the target from the disturbances.
The spatial frequency of a signal is a function of its angle of arrival, while its Doppler frequency is a function of the relative radial velocity between the airborne platform and the corresponding scatterer or jammer. Unwanted signals are attenuated by using STAP algorithms to place nulls in the 2-dimensional frequency plane with respect to their directions of arrival and/or Doppler frequencies. However, high performance computers are required to meet the STAP computational requirements of real-time applications and to increase the flexibility, affordability, and scalability of radar signal processing systems.

In this paper we discuss our progress in implementing a PRI-staggered post-Doppler STAP algorithm on the Rome Laboratory Intel Paragon machine. The algorithm consists of the following steps:
1) application to the data of window and range correction multipliers,
2) calculation of 128-point FFTs for each PRI stagger and every range and channel,
3) solution of the weight vector for each Doppler bin and range gate,
4) application of the weight vector to the test cell data for each Doppler bin and range gate,
5) pulse compression of the array output data for each Doppler bin and range gate.

For our study the data cube for a coherent processing interval (CPI) was assumed to be collected from 16 channels, 128 pulses and 512 range gates. For the parallel implementation we have designed a parallel pipelined collection of tasks in which each task is itself parallel. In this paper we present some preliminary results from this implementation. In Section 2 we present the model of computation. Parallelization issues are discussed in Section 3. Section 4 presents some specific details of STAP implementation and software development. Preliminary results are presented in Section 5.
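
As a rough illustration of steps 1 and 2 of the algorithm, the sketch below applies a window to each pulse train and takes a 128-point FFT along the pulse dimension for every channel and range gate. It is not the authors' code: the Hann window, the plain-Python radix-2 FFT, the function names, and the reduced cube size (2 channels, 3 range gates, 128 pulses) are assumptions made for illustration, and range correction and PRI staggering are omitted.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n (assumed; the paper does not name its window)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def doppler_filter(cube, window):
    """Window and FFT each pulse train.

    cube[channel][range_gate] is a list of complex pulse samples;
    the result has the same layout with Doppler bins in place of pulses.
    """
    return [
        [fft([w * s for w, s in zip(window, pulses)]) for pulses in channel]
        for channel in cube
    ]


def demo_spectrum(n_pulses=128, doppler_bin=20):
    """Filter a small synthetic cube that holds one pure Doppler tone."""
    tone = [cmath.exp(2j * cmath.pi * doppler_bin * p / n_pulses)
            for p in range(n_pulses)]
    # Hypothetical reduced cube: 2 channels x 3 range gates x 128 pulses.
    cube = [[list(tone) for _ in range(3)] for _ in range(2)]
    return doppler_filter(cube, hann(n_pulses))[0][0]


spectrum = demo_spectrum()
peak_bin = max(range(len(spectrum)), key=lambda k: abs(spectrum[k]))
print(peak_bin)  # 20
```

Each (channel, range gate) pair is filtered independently, so this stage divides naturally among processors. The full 16 x 512 set of pulse trains described above would simply make the two outer loops larger.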

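The report names two detectors, cell-averaging CFAR and ordered-statistic CFAR, without giving their parameters. The sketch below shows the textbook form of both on a one-dimensional power profile. The guard and reference window sizes, the threshold scale, the rank fraction, and all function names are hypothetical choices, not values taken from the report.

```python
def reference_cells(power, i, guard, ref):
    """Return up to `ref` cells on each side of cell i, skipping `guard` cells."""
    left = power[max(0, i - guard - ref): max(0, i - guard)]
    right = power[i + guard + 1: i + guard + 1 + ref]
    return left + right


def ca_cfar(power, guard, ref, scale):
    """Cell-averaging CFAR: threshold is scale times the mean of the reference cells."""
    hits = []
    for i, p in enumerate(power):
        cells = reference_cells(power, i, guard, ref)
        if cells and p > scale * sum(cells) / len(cells):
            hits.append(i)
    return hits


def os_cfar(power, guard, ref, scale, rank):
    """Ordered-statistic CFAR: threshold is scale times a ranked reference cell.

    `rank` is a fraction in [0, 1] selecting which sorted cell to use.
    """
    hits = []
    for i, p in enumerate(power):
        cells = sorted(reference_cells(power, i, guard, ref))
        if cells:
            k = min(len(cells) - 1, int(rank * (len(cells) - 1)))
            if p > scale * cells[k]:
                hits.append(i)
    return hits


# Synthetic profile (assumed values): unit noise floor, a target at cell 20,
# and a much stronger interferer at cell 24 inside the target's reference window.
profile = [1.0] * 40
profile[20] = 30.0
profile[24] = 200.0

ca_hits = ca_cfar(profile, guard=1, ref=4, scale=5.0)
os_hits = os_cfar(profile, guard=1, ref=4, scale=5.0, rank=0.75)
print(ca_hits)  # [24]
print(os_hits)  # [20, 24]
```

On this profile the interferer inflates the averaged threshold around cell 20, so the cell-averaging detector misses the weaker target. The ranked threshold ignores the single large reference cell and reports both.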