The SIMD Model of Parallel Computation

Robert Cypher
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598, USA

Jorge L. C. Sanz
University of Illinois at Urbana-Champaign
Department of Electrical and Computer Engineering
Urbana, IL 61801, USA

With 13 Illustrations

Springer-Verlag
New York Berlin Heidelberg London Paris Tokyo Hong Kong Barcelona Budapest

Cover illustration: Mesh connected computer. Detail from Fig. 4.1, p. 21.

Library of Congress Cataloging-in-Publication Data
Cypher, Robert.
The SIMD model of parallel computation / Robert Cypher, Jorge L. C. Sanz.
p. cm.
Includes bibliographical references and index.
ISBN-13: 978-1-4612-7606-7    e-ISBN-13: 978-1-4612-2612-3
DOI: 10.1007/978-1-4612-2612-3
1. Parallel processing (Electronic computers)  2. Computer architecture.  I. Sanz, J. L. C. (Jorge L. C.), 1955-  .  II. Title.
QA76.58.C96 1994
004'.35-dc20    93-27497

Printed on acid-free paper.

© 1994 Springer-Verlag New York, Inc.
Softcover reprint of the hardcover 1st edition 1994

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York, NY 10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use of general descriptive names, trade names, trademarks, etc., in this publication, even if the former are not especially identified, is not to be taken as a sign that such names, as understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely by anyone.

Production managed by Terry Kornak; manufacturing supervised by Vincent Scelta.
Typeset by Asco Trade Typesetting Ltd., Hong Kong.

Contents

1   Introduction
2   Parallel Computer Architectures
3   High-Level Models
4   Mesh Connected Computers
5   Algorithms for Mesh Connected Computers
6   Pyramid Machines
7   Algorithms for Pyramid Machines
8   Hypercube Computers
9   Hypercube-Derived Computers
10  Communication Primitives for Hypercube Computers
11  Algorithms for Hypercube Computers
12  Conclusions
Bibliography
Index

CHAPTER 1
Introduction

1.1 Background

There are many paradigmatic statements in the literature claiming that this is the decade of parallel computation. A great deal of research is being devoted to developing architectures and algorithms for parallel machines with thousands, or even millions, of processors. Such massively parallel computers have been made feasible by advances in VLSI (very large scale integration) technology. In fact, a number of computers having over one thousand processors are commercially available. Furthermore, it is reasonable to expect that as VLSI technology continues to improve, massively parallel computers will become increasingly affordable and common.
However, despite the significant progress made in the field, many fundamental issues still remain unresolved. One of the most significant of these is the issue of a general purpose parallel architecture. There is currently a huge variety of parallel architectures that are either being built or proposed. The problem is whether a single parallel computer can perform efficiently on all computing applications.

When considering this question, it is important to notice that there is no unique serial architecture that dominates over all of the others. The advent of special purpose workstations and dedicated architectures has provided cost-effective solutions to many demanding computational problems. Today, nobody would perform low-level image processing without a pipeline of dedicated components performing specific image functions [207]. Signal processing chip sets and systolic arrays are well-suited to many numerically intensive problems [140]. Graphics stations are exclusively dedicated to certain design problems, and in many cases, these computers are single-user oriented. In a completely analogous manner, vector computers have been heavily used by the numerical computing community. These devices are specialized attachments to central processors. It is unlikely that people will stop using them in favor of large-scale, general purpose serial computers. Thus, the profusion of parallel architectures is not very different from the situation with regard to serial architectures.

To understand the arguments for and against a single, general purpose parallel architecture, it is important to recognize the motivation for parallel processing. Parallelism may be regarded as a vehicle to accomplish two different goals. First, it may become a viable way to provide a given amount of computation at a lower cost. Second, it may be used to extend the range of problems that can be solved by computers, regardless of the cost. Of course, these two motivations may coexist, and indeed, they are the ultimate goal of many of today's research projects. Research on general purpose parallel computers could be beneficial for future systems that will provide inexpensive computer power. These computers will be aimed at supporting a very large number of simultaneous users, much in the same way as today's powerful mainframes. Software will provide users with multiprogramming environments, time-sharing operating systems, and programming languages oriented to a rich variety of applications.

On the other hand, it is unlikely that these systems will be able to provide the amount of computing power that is demanded by applications involving numerical analysis, computer vision systems, or physics simulations. Furthermore, the environment provided by a general purpose system is not the most appropriate one for the computing needs of these users. There will be other, more specialized, parallel computers operating as backend coprocessors to satisfy the computing needs of those sophisticated users. These computers will be tailored to classes of applications and will play a role similar to today's special purpose attachments, such as vector processors. The software and hardware involved in these computers will be different from those required by general purpose parallel systems. Time and cost overheads introduced by some software and hardware features are considerable and can be justified only in the presence of a general computing environment.
An insightful discussion on this topic, titled "shared memory and the corollary of modest potential," is given by L. Snyder [230].

Overall, the parallel processing field should not be polarized. It is unlikely that a single parallel computer or architecture will satisfy the computing requirements of all possible applications. In retrospect, the short, but rich, history of computing demonstrates that there is a diversity of serial machines tailored to different applications. There is, and probably will continue to be, a large zoo of parallel computers. Ultimately, the nature of the application areas and cost considerations will make some of them more useful or appealing than others. On the other hand, it is likely that some consolidation will occur as parallel architectures become better understood and more widely used. The need for consolidation is particularly acute in the area of models for parallel programming. If a small number of models of parallel computation can be agreed upon, programmers and algorithm designers can focus on these models and create applications that will be portable across a variety of parallel architectures.

In this monograph, a tour through this zoo of parallel architectures is presented. For each architecture that is studied, algorithms that are tailored to the given architecture will be presented. Although a range of application areas will be considered, a set of basic operations related to image processing, sorting and routing, and numerical computing will be examined for each of the architectures. These algorithms will be useful in their own right, but will also serve as a means of comparing the different types of parallel computers and will aid in selecting the correct architecture for a given problem area. The emphasis will be on the SIMD (single instruction-stream, multiple data-stream) model of parallel computation and its implementation on both SIMD and MIMD (multiple instruction-stream, multiple data-stream) architectures [85, 231].

1.2 Notation

N is used to represent the size of the input to a problem, such as the number of pixels in an image to be processed or the number of entries in each of two matrices to be multiplied. P is used to represent the number of processors in a parallel machine. A function F(X) is said to be O(G(X)) if there exists a constant C such that, for all sufficiently large X, F(X) ≤ C * G(X). If X is a nonnegative integer, then the Y-bit representation of X will be written as (X_(Y-1), X_(Y-2), ..., X_(0)), and the i-th bit of X will be denoted by X_(i) (where the 0-th bit is the least significant bit). Also, X^(i) is the integer obtained by complementing the i-th bit of X. The notation log X will denote the base-2 logarithm of X. The function log^(0) X = X, and for all integers i > 0, log^(i) X = log(log^(i-1) X). The function log* X equals the smallest nonnegative integer i such that log^(i) X ≤ 1.
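For readers who prefer executable definitions, the short Python sketch below restates the notation above. It is not from the book; the function names are ours, but each function follows the definitions just given (bit 0 is the least significant bit, and log denotes the base-2 logarithm).

```python
import math

def iterated_log(x, i):
    """log^(i) X: apply the base-2 logarithm i times (log^(0) X = X)."""
    for _ in range(i):
        x = math.log2(x)
    return x

def log_star(x):
    """log* X: smallest nonnegative i such that log^(i) X <= 1."""
    i = 0
    while x > 1:
        x = math.log2(x)
        i += 1
    return i

def bit(x, i):
    """X_(i): the i-th bit of X, where bit 0 is the least significant bit."""
    return (x >> i) & 1

def complement_bit(x, i):
    """X^(i): the integer obtained by complementing the i-th bit of X."""
    return x ^ (1 << i)

# Example: X = 13 is (1, 1, 0, 1) in 4-bit notation, so bit(13, 1) == 0
# and complement_bit(13, 1) == 15.  log_star(65536) == 4, since
# log 65536 = 16, log 16 = 4, log 4 = 2, and log 2 = 1.
```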
1.3 Outline

The remaining chapters are organized as follows. Chapters 2 and 3 present an overview of parallel architectures and programming methodologies, respectively. Chapters 4 through 12 provide a critical survey of various parallel architectures and algorithms, based on the topology of the connections between the processors. Specifically, Chapters 4 and 5 look at mesh connected computers, Chapters 6 and 7 focus on pyramid computers, and Chapters 8 through 11 are devoted to hypercube and related computers. For each topology, several existing and proposed parallel machines are discussed and compared. Also, an analysis of parallel algorithms for image processing and scientific and symbolic tasks is presented. The effects of architectural decisions on algorithm design are examined in detail. Finally, some conclusions are outlined in Chapter 12.

CHAPTER 2
Parallel Computer Architectures

The basic types of parallel computer architectures are examined in this chapter. The focus here will be on the physical design of the computer. In Chapter 3, the different high-level models that can be presented to a programmer of a parallel machine will be studied.

2.1 Memory Organization

The physical location of the memory in a parallel computer can be classified as being either shared or distributed. In a shared (also called "centralized") memory parallel architecture, there is a set of memory locations that are not local to any processor. In order for the processors to access these shared memory locations, they must issue read or write requests that are routed to the memory via a bus or a switching network. In addition to these shared memory locations, each processor in a shared memory architecture has a local private memory in which it can store private data, copies of shared data, and pointers to the shared memory.

In a distributed memory parallel computer, each memory location is local to some processor. A processor can access its own local memory directly. However, to access another processor's local memory, it must send a message to the processor that owns the memory. A processor and its local memory will sometimes be referred to as a processing element (PE).

2.2 Communication Medium

In both shared memory and distributed memory computers, a processor must communicate in order to access data that are not stored in its local memory. In shared memory computers, this communication occurs between processors and the shared memory, while in distributed memory computers, it occurs between pairs of processors. There are three techniques that are used for performing this communication, namely, busses, switching networks, and direct processor-to-processor links. Computers that use these communication techniques are called bus-based, switch-based, and processor-based, respectively. These distinctions are not always sharp, since a single computer can have multiple communication media. For example, some computers have both busses and direct links between pairs of processors.

Both switch-based and processor-based architectures can use either packet routing or circuit switching to deliver messages. In packet routing, messages are divided into packets that are routed to their destinations. Packets compete with other packets for resources (such as wires and buffers) in a dynamic manner. In circuit switching, an entire path between the message sender and receiver is established. All of the communication links along this path are reserved for the given sender-receiver pair. They cannot be accessed by other senders. Once the path is established, the sender may transmit messages without fear of interference.

There are also several switching modes for packet routing. In store-and-forward routing, the packet "hops" between buffers, and the head of the packet waits until the tail of the packet has been stored in the buffer. In wormhole routing [63] and virtual cut-through routing [129], each packet is divided into small units called flits. The flits follow one another in a snakelike manner from the sender to the receiver. Thus, if a packet consists of only one flit, these techniques store the entire packet after each hop, and a store-and-forward implementation is obtained. On the other hand, if a message contains a very large number of flits, the first flits will arrive at the receiver before the later flits have even been sent. As a result, the entire path between the sender and receiver will be occupied, as is the case with circuit switching. Wormhole and virtual cut-through routing behave differently when a packet encounters congestion. In wormhole routing, the entire packet is stopped in place, thus blocking all of the communication links that it occupies. In virtual cut-through routing, the tail of the packet continues to advance, and the entire packet is stored in the node where the congestion was encountered.
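The practical difference between these switching modes is easiest to see with the usual back-of-the-envelope latency model for an uncongested path: store-and-forward pays the full packet transmission time at every hop, while wormhole and virtual cut-through pay only a small per-hop cost for the leading flit plus one full packet transmission time. The Python sketch below is our illustration of that model, not an excerpt from the book, and the parameter names are assumptions.

```python
def store_and_forward_latency(hops, packet_flits, flit_time):
    # Every intermediate node buffers the whole packet before forwarding it.
    return hops * packet_flits * flit_time

def cut_through_latency(hops, packet_flits, flit_time):
    # The leading flit pays a per-hop cost; the remaining flits stream
    # behind it (wormhole or virtual cut-through, assuming no congestion).
    return hops * flit_time + packet_flits * flit_time

# Example: a 128-flit packet crossing 10 links.
print(store_and_forward_latency(10, 128, 1))  # 1280 flit times
print(cut_through_latency(10, 128, 1))        # 138 flit times
```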
In a processor-based distributed memory computer, only certain pairs of processors are connected by direct communication links. Thus, access to another processor's memory may require that a message be routed to the other processor via several intermediate processors. Some processor-based computers allow data to be transferred in both directions simultaneously along a single communication link, while others require that data be transferred in one direction at a time. Some processor-based machines, called strong communication machines, allow a single processor to send a different data item over each of its communication links simultaneously. Other processor-based machines, called weak communication machines, limit each processor to sending a single data item over a single communication link at a time.

2.3 Topology

Whether busses, switches, or direct links between processors are used for the communication, the communication network can have a wide range of topologies. In bus-based systems, a single bus or multiple busses may be used. When multiple busses are present, every bus may be connected to every processor and/or memory bank, or different busses may be connected to different subsets of processors and memory banks.

One of the most common topologies for switch-based computers is the Omega network [145]. An Omega network connects N inputs to N outputs by means of log N stages of switches. Each stage consists of N/2 switches, each of which has two inputs and two outputs. An Omega network can perform many useful permutations of the inputs to the outputs, but it cannot perform every permutation. As a result, it is possible that some collisions will occur within the network, even though each input is accessing a different output.

Another switching network topology that has been used in a parallel computer is the Benes network [19, 20]. Benes networks can be defined for various switch sizes. When it is composed of switches with two inputs and two outputs, the Benes network has 2(log N) - 1 stages, each of which consists of N/2 switches. A Benes network is capable of performing every permutation of the inputs to the outputs. However, it is time-consuming to calculate how the switches should be set in order to implement a given permutation without having collisions. Therefore, the Benes network is typically used only if the patterns of communication are known in advance and the switch settings can be calculated by the compiler.
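As a concrete illustration of why collisions can arise in an Omega network even when every input addresses a distinct output, the Python sketch below (our illustration, not taken from the book) models an 8-input Omega network in which each stage applies a perfect shuffle of the lines followed by a row of 2-by-2 switches, and routes messages with the common destination-tag rule: at stage i, a switch sends a message to its upper or lower output according to the i-th most significant bit of the destination. The identity permutation routes without conflicts, while the bit-reversal permutation forces two messages onto the same switch output.

```python
N = 8                          # inputs/outputs; must be a power of two
STAGES = N.bit_length() - 1    # log2(N) stages of 2x2 switches

def shuffle(line):
    """Perfect shuffle: rotate the (log N)-bit line number left by one bit."""
    return ((line << 1) | (line >> (STAGES - 1))) & (N - 1)

def route(perm):
    """Destination-tag routing of message i to perm[i].
    Returns the number of switch-output conflicts over all stages."""
    lines = list(range(N))     # lines[i] = line currently carrying message i
    conflicts = 0
    for stage in range(STAGES):
        bit_pos = STAGES - 1 - stage          # most significant bit first
        lines = [shuffle(line) for line in lines]
        taken = set()
        for msg in range(N):
            switch = lines[msg] // 2
            out = (perm[msg] >> bit_pos) & 1  # 0 = upper output, 1 = lower
            if (switch, out) in taken:
                conflicts += 1                # two messages want this output
            taken.add((switch, out))
            lines[msg] = 2 * switch + out
    assert lines == perm                      # every message reaches its target
    return conflicts

def bit_reverse(x):
    """Reverse the (log N)-bit representation of x."""
    return int(format(x, f"0{STAGES}b")[::-1], 2)

print(route(list(range(N))))                      # identity: 0 conflicts
print(route([bit_reverse(i) for i in range(N)]))  # bit reversal: conflicts > 0
```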
A rich class of topologies has been proposed for processor-based parallel computers. These include trees, two- and three-dimensional meshes, pyramids, hypercubes, shuffle-exchanges, and cube-connected cycles. Processor-based architectures will be emphasized in this monograph, and the topologies for these architectures will be studied in depth in the remaining chapters.

2.4 Control

An important characteristic of a parallel machine is whether it operates in an SIMD or an MIMD mode. In an MIMD architecture, different processors can perform different operations at a single time. As a result, each processor in an MIMD machine must have its own copy of its program as well as instruction fetching and decoding logic in order to interpret its program. In an SIMD architecture, all of the processors are forced to execute the same instruction at the same time. Thus, in an SIMD machine, it is possible to have only one copy of the program and a single controller that fetches the instructions, decodes them, and broadcasts the control signals to all of the processors. However, most SIMD machines offer some degree of processor
