TCSv2n4.qxd 4/24/2008 11:56 AM Page 1 F n T T C Foundations and Trends® in S 2 Theoretical Computer Science :4 Algorithms and Data Structures A 2:4 for External Memory lg o r ith Jeffrey Scott Vitter m s a n d Data sets in large applications are often too massive to fit completely inside the computer's D a internal memory. The resulting input/output communication (or I/O) between fast internal ta Algorithms and Data Structures memory and slower external memory (such as disks) can be a major performance bottleneck. S tr Algorithms and Data Structures for External Memorysurveys the state of the art in the design u c and analysis of external memory (or EM) algorithms and data structures, where the goal is to tu for External Memory r e exploit locality in order to reduce the I/O costs. A variety of EM paradigms are considered for s solving batched and online problems efficiently in external memory. fo r E x Jeffrey Scott Vitter Algorithms and Data Structures for External Memory describes several useful paradigms for ter n the design and implementation of efficient EM algorithms and data structures. The problem a l M domains considered include sorting, permuting, FFT, scientific computing, computational e m geometry, graphs, databases, geographic information systems, and text and string o r processing. y Algorithms and Data Structures for External Memory is an invaluable reference for anybody J e interested in, or conducting research in the design, analysis, and implementation of algorithms ffr e and data structures. y S c o tt V itte r This book is originally published as Foundations and Trends®in Theoretical Computer Science Volume 2 Issue 4, ISSN: 1551-305X. n now o w the essence of knowledge Algorithms and Data Structures for External Memory Algorithms and Data Structures for External Memory Jeffrey Scott Vitter Department of Computer Science Purdue University West Lafayette Indiana, 47907–2107 USA [email protected] Boston – Delft (cid:1) Foundations and TrendsR in Theoretical Computer Science Published, sold and distributed by: now Publishers Inc. PO Box 1024 Hanover, MA 02339 USA Tel. +1-781-985-4510 www.nowpublishers.com [email protected] Outside North America: now Publishers Inc. PO Box 179 2600 AD Delft The Netherlands Tel. +31-6-51115274 The preferred citation for this publication is J. S. Vitter, Algorithms and Data (cid:1)R StructuresforExternalMemory,FoundationandTrends inTheoreticalComputer Science, vol 2, no 4, pp 305–474, 2006 ISBN: 978-1-60198-106-6 (cid:1)c 2008 J. S. Vitter All rights reserved. No part of this publication may be reproduced, stored in a retrieval system,ortransmittedinanyformorbyanymeans,mechanical,photocopying,recording orotherwise,withoutpriorwrittenpermissionofthepublishers. Photocopying. In the USA: This journal is registered at the Copyright Clearance Cen- ter, Inc., 222 Rosewood Drive, Danvers, MA 01923. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by nowPublishersInc.forusersregisteredwiththeCopyrightClearanceCenter(CCC).The ‘services’foruserscanbefoundontheinternetat:www.copyright.com For those organizations that have been granted a photocopy license, a separate system of payment has been arranged. Authorization does not extend to other kinds of copy- ing, such as that for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. In the rest of the world: Permission to pho- tocopy must be obtained from the copyright owner. Please apply to now Publishers Inc., PO Box 1024, Hanover, MA 02339, USA; Tel. +1-781-871-0245; www.nowpublishers.com; [email protected] nowPublishersInc.hasanexclusivelicensetopublishthismaterialworldwide.Permission tousethiscontentmustbeobtainedfromthecopyrightlicenseholder.Pleaseapplytonow Publishers,POBox179,2600ADDelft,TheNetherlands,www.nowpublishers.com;e-mail: [email protected] (cid:1) Foundations and Trends R in Theoretical Computer Science Volume 2 Issue 4, 2006 Editorial Board Editor-in-Chief: Madhu Sudan Department of CS and EE MIT, Stata Center, Room G640 32 Vassar Street, Cambridge MA 02139, USA [email protected] Editors Bernard Chazelle (Princeton) Oded Goldreich (Weizmann Inst.) Shafi Goldwasser (MIT and Weizmann Inst.) Jon Kleinberg (Cornell University) L´aszl´o Lova´sz (Microsoft Research) Christos Papadimitriou (UC. Berkeley) Prabhakar Raghavan (Yahoo! Research) Peter Shor (MIT) Madhu Sudan (MIT) E´va Tardos (Cornell University) Avi Wigderson (IAS) Editorial Scope Foundations and Trends(cid:1)R in Theoretical Computer Science will publish survey and tutorial articles in the following topics: • Algorithmic game theory • Computational Number Theory • Computational algebra • Cryptography and information • Computational aspects of security combinatorics and graph theory • Data structures • Computational aspects of • Database theory communication • Design and analysis of algorithms • Computational biology • Distributed computing • Computational complexity • Information retrieval • Computational geometry • Operations Research • Computational learning • Parallel algorithms • Computational Models and • Quantum Computation Complexity • Randomness in Computation Information for Librarians Foundations and Trends(cid:1)R in Theoretical Computer Science, 2006, Volume 2, 4 issues. ISSN paper version 1551-305X. ISSN online version 1551-3068. Also available as a combined paper and online subscription. FoundationsandTrends(cid:1)R in TheoreticalComputerScience Vol.2,No.4(2006)305–474 (cid:1)c 2008J.S.Vitter DOI:10.1561/0400000014 Algorithms and Data Structures for External Memory Jeffrey Scott Vitter Department of Computer Science, Purdue University, West Lafayette, Indiana, 47907–2107, USA, [email protected] Abstract Data sets in large applications are often too massive to fit completely inside the computer’s internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottle- neck. In this manuscript, we survey the state of the art in the design andanalysisofalgorithmsanddatastructuresforexternal memory (or EM for short), where the goal is to exploit locality and parallelism in order to reduce the I/O costs. We consider a variety of EM paradigms for solving batched and online problems efficiently in external memory. For the batched problem of sorting and related problems like per- muting and fast Fourier transform, the key paradigms include distribu- tion and merging. The paradigm of disk striping offers an elegant way to use multiple disks in parallel. For sorting, however, disk striping can benonoptimalwithrespecttoI/O,sotogainfurtherimprovementswe discuss distribution and merging techniques for using the disks inde- pendently.WealsoconsiderusefultechniquesforbatchedEMproblems involving matrices, geometric data, and graphs. In the online domain, canonical EM applications include dictionary lookup and range searching. The two important classes of indexed data structures are based upon extendible hashing and B-trees. The paradigms of filtering and bootstrapping provide convenient means in online data structures to make effective use of the data accessed from disk. We also re-examine some of the above EM problems in slightly different settings, such as when the data items are moving, when the data items are variable-length such as character strings, when the data structure is compressed to save space, or when the allocated amount of internal memory can change dynamically. Programming tools and environments are available for simplifying the EM programming task. We report on some experiments in the domain of spatial databases using the TPIE system (Transparent Par- allel I/O programming Environment). The newly developed EM algo- rithms and data structures that incorporate the paradigms we discuss are significantly faster than other methods used in practice. Preface I first became fascinated about the tradeoffs between computing and memory usage while a graduate student at Stanford University. Over the following years, this theme has influenced much of what I have doneprofessionally,notonlyinthefieldofexternalmemoryalgorithms, which this manuscript is about, but also on other topics such as data compression,datamining,databases,prefetching/caching,andrandom sampling. The reality of the computer world is that no matter how fast com- puters are and no matter how much data storage they provide, there will always be a desire and need to push the envelope. The solution is nottowaitforthenextgenerationofcomputers,butrathertoexamine the fundamental constraints in order to understand the limits of what is possible and to translate that understanding into effective solutions. In this manuscript you will consider a scenario that arises often in large computing applications, namely, that the relevant data sets are simply too massive to fit completely inside the computer’s internal memory and must instead reside on disk. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance ix
Description: