PREFACE

This book aims at giving an impression of the way current research in algorithms, architectures and compilation for parallel systems is evolving. It focuses especially on domains where embedded systems are required, oriented either to application-specific or to programmable realisations. These are crucial in domains such as audio, telecom, instrumentation, speech, robotics, medical and automotive processing, image and video processing, TV, multimedia, radar and sonar. The domain of scientific, numerical computing is also covered.

The material in the book is based on the author contributions presented at the 3rd International Workshop on Algorithms and Parallel VLSI Architectures, held in Leuven, August 29-31, 1994. This workshop was partly sponsored by EURASIP and the Belgian NFWO (National Fund for Scientific Research), and organized in co-operation with the IEEE Benelux Signal Processing Chapter, the IEEE Benelux Circuits and Systems Chapter, and INRIA, France. It was a continuation of two previous workshops of the same name, held in Pont-à-Mousson, France, June 1990 [1], and Bonas, France, June 1991 [2]. All of these workshops have been organized in the frame of the EC Basic Research Actions NANA and NANA2, Novel Parallel Algorithms for New Real-Time Architectures, sponsored by the ESPRIT program of Directorate XIII of the European Commission. The NANA contractors are IMEC, Leuven, Belgium (F. Catthoor), K.U. Leuven, Leuven, Belgium (J. Vandewalle), ENSL, Lyon, France (Y. Robert), TU Delft, Delft, The Netherlands (P. Dewilde and E. Deprettere), and IRISA, Rennes, France (P. Quinton). The goal within these projects has been to contribute algorithms suited for parallel architecture realisation on the one hand, and on the other hand design methodologies and synthesis techniques which address the design trajectory from real behaviour down to the parallel architecture realisation of the system. As such, this clearly overlaps with the scope of the workshop and the book.

An overview of the main results presented in the different chapters, combined with an attempt to structure all this information, is available in the introductory chapter.

We expect this book to be of interest in academia, both for the detailed descriptions of research results and for the overview of the field given here, with many important but less widely known issues which must be addressed to arrive at practically relevant results. In addition, many authors have considered applications and the book is intended to reflect this fact. The real-life applications that have driven the research are described in several contributions, and the impact of their characteristics on the methodologies is assessed. We therefore believe that the book will also be of interest to senior design engineers and CAD managers in industry, who wish either to anticipate the evolution of commercially available design tools over the next few years, or to make use of the concepts in their own research and development.

It has been a pleasure for us to organize the workshop and to work together with the authors to assemble this book. We feel amply rewarded by the result of this co-operation, and we want to thank all the authors here for their effort. Through careful editing, we have spent significant effort on delivering material that is as consistent as possible.
The international aspect has allowed us to group the results of many research groups with different backgrounds and "research cultures," which is felt to be particularly enriching.

We would be remiss not to thank Prof. L. Thiele of Universität des Saarlandes, Saarbrücken, Germany, who was an additional member of the workshop's organizing committee, and F. Vanpoucke, who was a perfect workshop managing director and also did a great job in collecting and processing the contributions to this book.

We hope that the reader will find the book useful and enjoyable, and that the results presented will contribute to the continued progress of the field of parallel algorithms, architectures and compilation.

Leuven, October 1994
The editors

References

[1] E. Deprettere and A. van der Veen (eds.), "Algorithms and Parallel VLSI Architectures", Elsevier, Amsterdam, 1991.
[2] P. Quinton and Y. Robert (eds.), "Algorithms and Parallel VLSI Architectures II", Elsevier, Amsterdam, 1992.

Algorithms and Parallel VLSI Architectures III
M. Moonen and F. Catthoor (Editors)
© 1995 Elsevier Science B.V. All rights reserved.

ALGORITHMS AND PARALLEL VLSI ARCHITECTURES

F. CATTHOOR
IMEC, Kapeldreef 75, 3001 Leuven, Belgium
[email protected]

M. MOONEN
ESAT, Katholieke Universiteit Leuven, 3001 Leuven, Belgium
[email protected]

ABSTRACT. In this introductory chapter, we will summarize the main contributions of the chapters collected in this book. Moreover, the topics addressed in these chapters will be linked to the major research trends in the domain of parallel algorithms, architectures and compilation.

1 STRUCTURE OF BOOK

The contributions to the workshop and the book can be classified in three categories:

1. Parallel Algorithms: The emphasis lies on the search for more efficient and inherently parallelisable algorithms for particular computational kernels, mainly from linear algebra. The demand for fast matrix computations has arisen in a variety of fields, such as speech and image processing, telecommunication, radar and sonar, biomedical signal processing, and so on. The work is motivated by the belief that preliminary algorithmic manipulations largely determine the success of, e.g., a dedicated hardware design, because radical algorithmic manipulations and engineering techniques are not easily captured, e.g., in automatic synthesis tools. Most of the contributions here deal with real-time signal processing applications, and in many cases the research on these algorithms is already tightly linked to the potential parallel realisation options to be exploited in the architecture phase.

2. Parallel Architectures: Starting from an already parallelized algorithm or a group of algorithms (a target application domain), the key issue here is to derive a particular architecture which efficiently realizes the intended behaviour for a specific technology. In this book, the target technology will be CMOS electronic circuitry. In order to achieve this architecture realisation, the detailed implementation characteristics of the building blocks - like registers/memories, arithmetic components, logic gates and connection networks - have to be incorporated. The end result is an optimized netlist/layout of either primitive custom components or of programmable building blocks. The trend of recent years is to mix both styles. So more custom features are embedded in the massively parallel programmable machines, especially in the storage hierarchy and the network topologies.
In addition, (much) more flexibility is built into the custom architectures, sometimes leading to highly flexible, weakly parallel processors. The path followed to arrive at such architectures is the starting point for the formalisation into reusable compilation methodologies.

3. Parallel Compilation: Most designs in industry suffer from increasing time pressure. As a result, the methods to derive efficient architectures and implementations have to become more efficient and less error-prone. For this purpose, an increasing amount of research is spent on formalized methodologies to map specific classes of algorithms (application domain) to selected architectural templates (target style). In addition, some steps in these methodologies are becoming supported by interactive or automated design techniques (architectural synthesis or compilation). In this book, the emphasis will be on modular algorithms with much inherent parallelism to be mapped on (regular) parallel array styles. Both custom (application-specific) and programmable (general-purpose) target styles will be considered.

These categories correspond to the different parts of the book. An outline of the main contributions in each part is given next, along with an attempt to capture the key features of the presented research.

2 PARALLEL ALGORITHMS

In recent years, it has become clear that for many advanced real-time signal processing, adaptive systems and control applications, the required level of computing power is well beyond that available on present-day programmable signal processors. Linear algebra and matrix computations play an increasingly prominent role here, and the demand for fast matrix computations has arisen in a variety of fields, such as speech and image processing, telecommunication, radar and sonar, biomedical signal processing, and so on. Dedicated architectures then provide a means of achieving orders-of-magnitude improvements in performance, consistent with the requirements. However, past experience has shown that preliminary algorithmic manipulations largely determine the success of such a design. This has led to a new research activity, aimed at tailoring algorithmic design to architectural design and vice versa, or in other words deriving numerically stable algorithms which are suitable for parallel computation. At this stage, there is already interaction with the parallel architecture designers, who have to evaluate the mapping possibilities onto parallel processing architectures capable of performing the computation efficiently at the required throughput rate.

In the first keynote contribution, CHAPTER 1 (Regalia), a tutorial overview is given of so-called subspace methods, which have received increasing attention in signal processing and control in recent years. Common features are extracted for two particular applications, namely multivariable system identification and source localization. Although these application areas have different physical origins, the mathematical structures of the problems they aim to solve are laced with parallels, so that, e.g., parallel and adaptive algorithms in one area find an immediate range of applications in neighbouring areas. In particular, both problems are based on finding spanning vectors for the null space of a spectral density matrix characterizing the available data, which is usually expressed numerically in terms of extremal singular vectors of a data matrix.
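To make the numerical core of such subspace methods concrete, the following small sketch (purely illustrative and not taken from the chapter; the function name and tolerance are our own choices) extracts a null-space basis from the extremal singular vectors of a data matrix:

```python
import numpy as np

def nullspace_basis(X, tol=1e-10):
    """Estimate a null-space basis of the data matrix X via the SVD.

    The right singular vectors associated with singular values below
    tol * sigma_max span the numerical null space of X.
    """
    _, s, Vt = np.linalg.svd(X, full_matrices=True)
    rank = int(np.sum(s > tol * s[0]))   # numerically non-zero singular values
    return Vt[rank:].T                   # remaining right singular vectors

# Example: a 6x4 data matrix of rank 3 has a one-dimensional null space.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 4))
N = nullspace_basis(X)
print(N.shape)                 # (4, 1)
print(np.linalg.norm(X @ N))   # close to zero
```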
Algorithmic aspects of such computations are treated in subsequent chapters, namely CHAPTER 6 (Götze et al.) and CHAPTER 7 (Saxena et al.), see below.

Linear least squares minimisation is no doubt one of the most widely used techniques in digital signal processing. It finds applications in channel equalisation as well as in system identification and adaptive antenna array beamforming. At the same time, it is one of the most intensively studied linear algebra techniques when it comes to parallel implementation. CHAPTERS 2 through 5 all deal with various aspects of this. Of the many alternative algorithms that have been proposed over the years, one of the most attractive is the algorithm based on QR decomposition. To circumvent pipelining problems with this algorithm, several alternative algorithms have been developed, of which the covariance-type algorithm with inverse updating is now receiving a lot of attention.

In CHAPTER 2 (McWhirter et al.), a formal derivation is given of two previously developed systolized versions of this algorithm. The derivation of these arrays is highly non-trivial due to the presence of data contra-flow in the underlying signal flow graph, which would normally prohibit pipelined processing. Algorithmic engineering techniques are applied to overcome these problems. Similar algorithmic techniques are used in CHAPTER 3 (Brown et al.), which is focused on covariance-type algorithms for the more general Kalman filtering problem. Here also, algorithmic engineering techniques are used to generate two systolic architectures, put forward in earlier publications, from an initial three-dimensional hierarchical signal flow graph (or dependence graph). In CHAPTER 4 (Schier), it is shown how the inverse updating algorithm and systolic array treated in CHAPTER 2 may be equipped with a block-regularized exponential forgetting scheme. This makes it possible to overcome numerical problems if the input data is not sufficiently informative. Finally, in CHAPTER 5 (Kadlec) the information-type RLS algorithm based on QR decomposition is reconsidered. A normalized version of this algorithm is presented which has potential for efficient fixed-point implementation. The main contribution here is a global probability analysis which gives an understanding of the algorithm's numerical properties and allows probability statements to be formulated about the number of bits actually used in the fixed-point representation.

A second popular linear algebra tool is the singular value decomposition (and the related symmetric eigenvalue decomposition), which, e.g., finds applications in subspace techniques as outlined in CHAPTER 1. The next two chapters deal with the parallel implementation of such orthogonal decompositions. In CHAPTER 6 (Götze et al.), it is explained how Jacobi-type methods may be sped up through the use of so-called orthonormal p-rotations. Such CORDIC-like rotations require a minimal number of shift-add operations, and can be executed on a floating-point CORDIC architecture. Various methods for the construction of such orthonormal p-rotations of increasing complexity are presented and analysed.
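To situate these rotation-based methods, here is a plain cyclic Jacobi sweep for the symmetric eigenvalue problem (a generic textbook sketch with exact rotations; the chapter's contribution is precisely to replace these by cheap shift-add p-rotations):

```python
import numpy as np

def jacobi_sweeps(A, sweeps=10):
    """Diagonalize a symmetric matrix with cyclic Jacobi rotations.

    Each (p, q) step applies one plane rotation chosen to zero A[p, q];
    on a parallel array, independent (p, q) pairs execute concurrently.
    """
    A = A.astype(float).copy()
    n = A.shape[0]
    V = np.eye(n)
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < 1e-15:
                    continue
                # Rotation angle that annihilates the (p, q) off-diagonal pair.
                theta = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J
                V = V @ J
    return np.diag(A), V   # eigenvalues and eigenvectors

B = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 1.0]])
w, V = jacobi_sweeps(B)
print(np.sort(w), np.sort(np.linalg.eigvalsh(B)))  # the two should agree
```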
An alternative approach to developing parallel algorithms for the computation of eigenvalues and eigenvectors is presented in CHAPTER 7 (Saxena et al.). It is based on isospectral flows, that is, matrix flows in which the eigenvalues of the matrix are preserved. Very few researchers have so far used the isospectral flow approach to implement the eigenvalue problem in VLSI, even though, as explained in this chapter, it has several advantages from the VLSI point of view, such as simplicity and scalability.

CHAPTER 8 (Arioli et al.) deals with block iterative methods for solving linear systems of equations in heterogeneous computing environments. Three different strategies are proposed for parallel distributed implementation of the Block Conjugate Gradient method, differing in the amount of computation performed in parallel, the communication scheme, and the distribution of tasks among processors. The best performing scheme is then used to accelerate the convergence of the Block Cimmino method.

Finally, CHAPTER 9 (Cardarilli et al.) deals with RNS-to-binary conversion. RNS (Residue Number System) arithmetic is based on the decomposition of a number - represented by a large number of bits - into reduced-wordlength residual numbers. It is a very useful technique to reduce carry propagation delays and hence speed up signal processing implementations. Here, a conversion method is presented which is based on a novel class of coprime moduli and which is easily extended to a large number of moduli. In this way the proposed method allows the implementation of very fast and low-complexity architectures.
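To make the RNS idea concrete, here is a small sketch of generic residue arithmetic (the moduli are an arbitrary illustrative choice; the chapter's novel moduli class and conversion architecture are not reproduced here), showing decomposition into residues and RNS-to-binary reconstruction via the Chinese Remainder Theorem:

```python
from math import prod

# Pairwise coprime moduli chosen for illustration only.
MODULI = (13, 15, 16, 17)   # dynamic range M = 13*15*16*17 = 53040

def to_rns(x):
    """Split x into small residues; each channel is carry-free w.r.t. the others."""
    return tuple(x % m for m in MODULI)

def from_rns(residues):
    """RNS-to-binary conversion via the Chinese Remainder Theorem."""
    M = prod(MODULI)
    x = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        # Modular inverse of Mi mod m (pow with exponent -1, Python 3.8+).
        x += r * Mi * pow(Mi, -1, m)
    return x % M

a, b = 1234, 4321
# Multiplication happens independently per short-wordlength channel.
p_rns = tuple((ra * rb) % m for ra, rb, m in zip(to_rns(a), to_rns(b), MODULI))
print(from_rns(p_rns), (a * b) % prod(MODULI))  # identical results
```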
This chapter already bridges the gap with the detailed architecture realisation treated in the second category of contributions.

3 PARALLEL ARCHITECTURES FOR HIGH-SPEED NUMERICAL AND SIGNAL PROCESSING

Within this research topic, we have contributions on both customized and programmable architectures. For the application-specific array architectures, the main trend is towards more flexibility. This is visible for instance in the high degree of scalability and the different modes/options offered by the different architectures. We can make a further subdivision between the more "conventional" regular arrays with only local communication, and the arrays which are combined with other communication support, like tree networks, to increase the speed of non-local dependencies.

In the first class, two representative designs are reported in this book. In CHAPTER 10 (Riem et al.), a custom array architecture for long integer arithmetic computations is presented. It makes use of redundant arithmetic for high speed and is very scalable in word-length. Moreover, several modes are available to perform various types of multiplication and division. The emphasis in this chapter lies on the interaction with the algorithmic transformations which are needed to derive an optimized architecture, and also on the methodology which is used throughout the design trajectory. Similarly, in CHAPTER 11 (Rosseel et al.), a regular array architecture for an image diffusion algorithm is derived. The resulting design is easily cascadable and scalable, and the data-path supports many different interpolation functions. The extended formal methodology used to arrive at the end result - oriented to fixed-throughput applications - forms a red thread throughout the chapter.

Within the class of arrays extended with non-local communication, two representative designs are also reported, again including a high degree of scalability. The topic of CHAPTER 12 (Duboux et al.) is a parallel array augmented with a tree network for fast and efficient dictionary manipulations. The memory and network organisation for handling the key-record data are heavily tuned to obtain the final efficiency. Also in CHAPTER 13 (Archambaud et al.), a basic systolic array is extended with an arbitration tree to speed up the realisation of the application. In this case, it is oriented to genetic sequence comparison including the presence of "holes". In order to achieve even higher speed, a set-associative memory is included too.

For the class of programmable architectures, both massively and weakly parallel machines are available. Their use apparently depends on the application domain which is targeted. For high-throughput real-time signal processing, e.g. in image and video processing, the main trend nowadays is towards lower degrees of parallelism (4 to 16 processor elements) and more customisation to support particular, frequently occurring operations and constructs. The latter is especially apparent in the storage and communication organisation. The reduced parallelism is motivated by the fact that the amount of available algorithmic parallelism is not necessarily that big, and that the speed of the basic processors has become high enough to reduce the parallelisation factor required for the intended throughput.

Within the programmable class, the main emphasis in the book lies on the evolution of these novel, weakly parallel processor architectures for video and image processing type applications. In CHAPTER 14 (Vissers et al.), an overview is provided of the VSP2 architecture, which is mainly intended for video processing as in HDTV, video compression and the like. It supports a highly flexible connection network (cross-bar) and a very distributed memory organisation with dedicated register banks and FIFOs. In CHAPTER 15 (Roenner et al.), the emphasis lies on a programmable processor mainly targeted to image processing algorithms. Here, the communication network is more restricted but the storage organisation is more diversified, efficiently supporting in hardware both regular and data-dependent, and both local and neighbourhood operations. The two processor architectures are, however, also partly overlapping in target domain, and the future will have to show which of the options is best suited for a particular application.

Using such video or image signal processors, it is possible to construct flexible higher-level templates which are tuned to a particular class of applications. This has for instance been achieved in CHAPTER 16 (De Greef et al.), where motion-estimation-like algorithms are considered. A highly efficient communication and storage organisation is proposed which allows these overheads to be reduced considerably for the targeted applications. Real-time execution with limited board space is obtained in this way for emulation and prototyping purposes.

In addition, higher efficiency in the parallel execution within the data-path can potentially be obtained by giving up the fully synchronous operation. This is demonstrated in CHAPTER 17 (Arvind et al.), where the interesting option of asynchronously communicating micro-agents is explored. It is shown that several alternative mechanisms to handle dependencies and to distribute the control of the instruction ordering are feasible. Some of these lead to a significant speed-up.

Finally, there is also a trend to simplify the processor data-path and to keep the instruction set as small as possible (RISC processor style).
Within the class of weakly parallel processors for image and video processing, this was already reflected in the previously mentioned architectures. In CHAPTER 18 (Hall et al.), however, this is taken even further by considering bit-serial processing elements communicating in an SIMD array. The use of special instructions and a custom memory organisation nevertheless makes global data-dependent operations possible. This parallel programmable image processor is mainly oriented to wood inspection applications.

Within the class of massively parallel machines, the main evolution is also towards more customisation. The majority of the applications targeted to such machines appears to come mainly from the scientific and numerical computing fields. In CHAPTER 19 (Vankats), a new shared-memory multi-processor based on hypercube connections is proposed. The dedicated memory organisation with a directory-based cache coherence scheme is the key to improved speed.

An application of a fast DCT scheme mapped to such parallel machines is studied in CHAPTER 20 (Christopoulos et al.). Here, the emphasis lies on the influence of the algorithmic parameters and the load balancing on the efficiency of the parallel mapping. Efficient massive parallelism is only achievable for large system parameters.

The power of a "general-purpose" array of processors realized on customizable field-programmable gate arrays (FPGAs) is demonstrated in CHAPTER 21 (Champeau et al.). This combination allows the customisation to be extended further without overly limiting the functionality. An efficient realisation of parallel text matching is used as a test-case to show the advantages of the approach.

Compiler support is a key issue for all of these parallel programmable machines, so all the novel architectures have been developed with this in mind. Hence, each of the contributions in CHAPTER 14 (Vissers et al.), CHAPTER 15 (Roenner et al.), CHAPTER 18 (Hall et al.), CHAPTER 21 (Champeau et al.) and CHAPTER 19 (Vankats) devotes a section to the compilation issues. Most of these compilers can, however, benefit from the novel insights and techniques which are emerging in the compilation field, as addressed in section 4.

4 PARALLEL COMPILATION FOR APPLICATION-SPECIFIC AND GENERAL-PURPOSE ARCHITECTURES

As already mentioned, the key drive for more automated and more effective methodologies comes from the reduced design time available to system designers. In order to obtain these characteristics, methodologies generally have to be targeted towards application domains and target architecture styles. This is also true for the domain of parallel architectures. Still, a number of basic steps do recur in the methodologies, and an overview of the major compilation steps in such a targeted methodology is provided in CHAPTER 22 (Feautrier). In that contribution, the emphasis lies on array data-flow analysis, scheduling of the parallel operations on the time axis, allocation to processors, and processor code generation including communication synthesis. Even though this survey is mainly oriented to compilation on programmable machines, most of the concepts recur in the field of custom array synthesis (see also CHAPTER 10 (Riem et al.) and CHAPTER 11 (Rosseel et al.)). Still, the detailed realisation of the algorithmic techniques used for the design automation typically differs depending on the specific characteristics of the domain (see also below).
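As a minimal illustration of the scheduling and allocation steps surveyed in CHAPTER 22 (the mapping functions below are the classic textbook choice, not taken from that chapter), consider a uniform two-level loop nest executed under the affine schedule t(i,j) = i + j with processor allocation p(i,j) = i:

```python
# A minimal sketch of affine space-time mapping for the loop nest
#
#   for i in range(N):
#       for j in range(N):
#           a[i][j] = f(a[i-1][j], a[i][j-1])
#
# Both dependences, (1,0) and (0,1), advance the schedule t = i + j by
# one step, so the schedule is legal; iterations sharing the same t form
# a wavefront and may execute concurrently, one per processor p = i.

N = 4

def f(left, up):                 # placeholder for the real computation
    return left + up + 1

a = [[0] * N for _ in range(N)]
for t in range(2 * N - 1):       # global time steps (wavefronts)
    for p in range(N):           # processors, conceptually in parallel
        i, j = p, t - p
        if 0 <= j < N:
            left = a[i - 1][j] if i > 0 else 0
            up = a[i][j - 1] if j > 0 else 0
            a[i][j] = f(left, up)
print(a)
```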
The other chapters in the compilation category address specific tasks in the global methodology. Representative work in each of the different stages is collected in this book. The order in which these tasks are addressed here is not fully fixed, but still most researchers converge on a methodology which is close to what is presented here.

The first step is of course the representation of the algorithm to be mapped in a formal model, suitable for manipulation by the design automation techniques. The limitation of this model to affine, manifest index functions has been partly removed in the past few years. Important in this process is that the resulting models should still be amenable to the vast amount of compilation/synthesis techniques which operate on the affine model. This also means that array data-flow analysis should remain feasible. Interesting extensions to this "conventional" model which meet these requirements are proposed in CHAPTER 23 (Held et al.) and CHAPTER 24 (Rapanotti et al.). The restriction to linear or affine index functions can be extended to piece-wise regular affine cases by a normal form decomposition process. This allows integer division, modulo, ceiling and floor functions to be converted to the existing models, as illustrated in CHAPTER 23 (Held et al.). Moreover, so-called linearly bounded lattices can then also be handled.
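As a small illustration of this kind of rewriting (a generic strip-mining example in the spirit of such conversions, not the normal form decomposition of CHAPTER 23 itself), a modulo index function can be removed by splitting the loop so that only affine index expressions remain:

```python
# Hedged sketch: eliminating a modulo index function by loop splitting.

N, M = 12, 4
buf = list(range(M))

# Original form: non-affine access pattern buf[i % M].
out_ref = [buf[i % M] for i in range(N)]

# Piece-wise affine form: write i = M*k + r with 0 <= r < M, so that
# buf[i % M] becomes buf[r] and every index expression is affine in (k, r).
out = []
for k in range(N // M):          # assumes M divides N, for brevity
    for r in range(M):
        i = M * k + r            # affine reconstruction of i
        out.append(buf[r])       # affine access

assert out == out_ref
```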
The restrictions can be removed even further by considering the class of "integral" index functions, as studied in CHAPTER 24 (Rapanotti et al.). This allows more complicated cases to be handled, as occurring e.g. in the knapsack algorithm. By especially extending the so-called uniformisation step in the design trajectory, it is still possible to arrive at synthesizable descriptions. There is also hope to deal with part of the data-dependent cases in this way.

Finally, it is also possible to consider the problem of modelling from another point of view, namely as a matching between primitive operations for which efficient parallel implementations are known, and the algorithm to be mapped. This approach is taken in CHAPTER 25 (Rangaswami), where a functional programming style with recursion is advocated. By providing a library of mappable functions, it is then possible to derive different options for compiling higher-level functions and to characterize each of the alternatives in terms of cost.

Once the initial algorithm has been brought into this manipulatable form, it is usually necessary to apply a number of high-level algorithmic transformations to improve the efficiency of the eventual architecture realisations (see also CHAPTER 10 (Riem et al.) and CHAPTER 11 (Rosseel et al.)). Support for these is considered in CHAPTER 26 (Durrieu et al.), where provably correct small transformations allow the designer to interactively modify the original algorithm into the desired form. Also the uniformisation transformation addressed in CHAPTER 24 (Rapanotti et al.) falls in principle under this stage, but for that purpose more automated techniques have also become available lately.

Now that the algorithm has a suitable form for the final mapping stages, it is usually assumed that all index functions are uniform and manifest, and that the algorithm has been broken up into several pure loop nests. For each of these, the scheduling, allocation and code generation/communication synthesis steps then have to be performed.

Within the target domain of massively parallel machines (either custom or programmable), the notion of affine mapping functions has been heavily exploited up to now (see also CHAPTER 22 (Feautrier)). For instance, the work in CHAPTER 27 (Bouchitté et al.) considers the mapping of evaluation trees onto a parallel machine where communication and computation can coincide. This assumption complicates the process considerably, and heuristics are needed and proposed to handle several practical cases within fine- and coarse-grain architectures.

It is, however, clear from several practical designs that purely affine mappings do not always lead to optimal designs. This is clearly illustrated in CHAPTER 28 (Werth et al.) for both scheduling and communication synthesis, for the test-case of the so-called Lamport loop. Therefore, several researchers have started looking at extensions to the conventional methods. A non-unimodular mapping technique, including extended scheduling/allocation and especially communication synthesis, is proposed in CHAPTER 29 (Reffay et al.). For the Cholesky factorisation kernel, it is shown that significantly increased efficiency can be obtained, while still providing automatable methods.

Up till now, we have however still restricted ourselves to mapping onto homogeneous, locally connected parallel machines. As already demonstrated in section 3, the use of weakly parallel and not necessarily homogeneous architectures is finding a large market in high-throughput signal processing, as in video and image applications. As a result, much research has lately been spent on improved compilation techniques for these architectures too. Most of this work originates from the vast amount of know-how which has been collected in the high-level synthesis community on mapping irregular algorithms onto heterogeneous single-processor architectures. Several representative contributions in this area are taken up in this book.

In CHAPTER 30 (Schwiegershausen et al.), the problem of scheduling coarse-grain tasks onto a heterogeneous multi-processor is considered. The assumption is that several processor styles are available and that the mapping of the tasks onto these styles has already been characterized. Given that information, it is possible to formulate an integer programming problem which allows several practical applications in the image and video processing domain to be solved.
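To make the cost structure of such a formulation explicit, here is a toy instance of heterogeneous task assignment (the cost table is invented, precedence constraints are omitted, and we use exhaustive search rather than the chapter's integer programming, purely to keep the sketch self-contained):

```python
from itertools import product

# cost[t][p] = run time of (independent) task t on processor style p.
# Values are illustrative only.
cost = [
    [4, 2, 9],   # e.g. a filtering task: fastest on the DSP-like style
    [3, 7, 1],   # e.g. a block-matching task: fastest on the dedicated style
    [5, 4, 6],
    [2, 8, 3],
]
n_tasks, n_procs = len(cost), len(cost[0])

best_makespan, best_assign = float("inf"), None
for assign in product(range(n_procs), repeat=n_tasks):
    # With independent tasks, a processor's load is the sum of its tasks,
    # and the makespan is the load of the busiest processor.
    load = [0] * n_procs
    for t, p in enumerate(assign):
        load[p] += cost[t][p]
    if max(load) < best_makespan:
        best_makespan, best_assign = max(load), assign

print(best_assign, best_makespan)
```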