THE BURROWS-WHEELER TRANSFORM: Data Compression, Suffix Arrays, and Pattern Matching THE BURROWS- WHEELER TRANSFORM: Data Compression, Suffix Arrays, and Pattern Matching Donald Adjeroh West Virginia University Tim Bell University of Canterbury Amar Mukherjee University of Central Florida 123 Donald Adjeroh Amar Mukherjee West Virginia University University of Central Florida Morgantown, WV 26506 Orlando, FL 32816-2362 USA USA [email protected] [email protected] T im Bell U niversity of Canterbury Christchurch New Zealand [email protected] The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching Donald Adjeroh, Tim Bell, and Amar Mukherjee ISBN-13: 978-0-387-78908-8 e-ISBN-13: 978-0-387-78909-5 Library of Congress Control Number: 2008923556 2008 Springer Science+Business Media, LLC All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed on acid-free paper. 9 8 7 6 5 4 3 2 1 springer.com Preface TheBurrows-WheelerTransformisoneofthebestlosslesscompressionmeth- ods available. It is an intriguing — even puzzling — approach to squeezing redundancy out of data, it has an interesting history, and it has applications well beyond its original purpose as a compression method. It is a relatively late addition to the compression canon, and hence our motivation to write thisbook,lookingatthemethodindetail,bringingtogetherthethreadsthat led to its discovery and development, and speculating on what future ideas might grow out of it. The book is aimed at a wide audience, ranging from those interested in learning a little more than the short descriptions of the BWT given in stan- dard texts, through to those whose research is building on what we know aboutcompression and pattern matching. The firstfewchaptersareacareful description suitable for readers with an elementary computer science back- ground (and these chapters have been used in undergraduate courses), but later chapters collect a wide range of detailed developments, some of which are built on advanced concepts from a range of computer science topics (for example, some of the advanced material has been used in a graduate com- puter science course in string algorithms). Some of the later explanations require some mathematical sophistication, but most should be accessible to those with a broad background in computer science. We have aimed to provide a detailed introduction to the current state of knowledge about the Burrows-Wheeler Transform. This ranges from ex- planations and examples of how the transform works, through analyzing the theoretical performance of the transform from various view points, to con- sidering issues relevant to implementing it on “real” systems. Each chapter (except the last one) contains a “further reading” section to guide the reader around the large collection of literature that has explored the BWT in detail, and Appendix B points to ongoing research. An important theme in this book is pattern matching and text indexing using the BWT. Because the transformed text contains a sorted version of the original text, it has considerable potential to help with locating patterns, VI Preface and we look in detail at a number of variations that have been proposed and evaluated. The BWT literature uses a variety of notation for the various structures usedinthetransform.Wherepossiblewehavetriedtousestandardnotation, butunfortunatelysomekeynotationsconflictwiththoseusedinthestandard patternmatchingliterature,andsowehavechosentocoinsomenewnotations to avoid having the same notation with two meanings, at times in the same paragraph! Appendix A gives a summary of the notation used to avoid any confusion. The BWT continues to be actively researched, and this book is merely a milestoneinitshistory.AppendixBgiveslinkstowebsitesthatwillbeworth watching for future developments of the BWT and related systems. We are also aware that despite some excellent help with checking this book, it will contain errors and require updates. An errata site is available athttp://www.cosc.canterbury.ac.nz/tim.bell/bwt/.Wewelcomefeed- back on the book, and this can be sent to the authors via the contact details on this web site. Acknowledgements Many people have contributed to this book either directly or indirectly. We have to first acknowledge the late David Wheeler, who conceived the idea on which this entire book is based. In researching the background of the BWT it has been inspiring to discover the role that this modest individual hasplayedinmakingandinfluencingthehistoryofcomputinginmanyareas, not just in data compression. Michael Burrows played an important role in developing and publishing the transform, and we have been very fortunate to receive valuable input from him while writing this book, including his insight into implementing the BWT on current and upcoming computer technology. WehavealsoappreciateddiscussionswithDr.JoyceWheeler,DavidWheeler’s wife, who has been able to help us with details of the history relating to the development of the transform. The photo of David Wheeler in Figure 1.3 was taken by Chris Hadley (University of Cambridge Computer Laboratory), and kindly supplied by Joyce Wheeler. We also particularly wish to thank Peter Fenwick, who has been heavily involved in this book from the early stages of planning through to the final stages of checking. He has been a fount of wisdom, insight and information stemming from his long history of work on the Burrows-Wheeler Transform. Our students who have worked with us on the BWT have our sincere thanks for a lot of detailed work and many useful discussions. We are partic- ularly grateful to Andrew Firth, who performed an extensive comparison of BWT-based searching methods. Much of Chapter 7 draws on this work, and wearegratefulforhispermissiontouseit.JieLin,FeiNan,MattPowell,Ravi Vijaya Satya, Tao Tao, Nan Zhang and Yong Zhang have also contributed a Preface VII numberofideasthatappearinthistext.Apreliminaryversionofsomechap- ters in this book have been used in a graduate course on string algorithms, and we are grateful to the students for their comments and suggestions. We have been fortunate to have many members of the BWT research communityassistwithadviceandhelpingcheckpartsofthebook,particularly Paolo Ferragina, Craig Nevill-Manning, Giovanni Manzini, Alistair Moffat, BillSmyth,RossanoVenturini,andIanWitten.WealsothankZiyaArnavut, MitsuharuArimuraandKunihikoSadakane,forprovidingusrequestedcopies of their papers. Many other people have contributed to practical aspects of this work: Jay Hollandhaslentushissharpeyetoassistwithproof-reading—weappreciate his careful checking, and any errors are likely to have been introduced after he checked the text! Isaac Freeman has worked extensively on drawing and editing the figures for us; Julie Faris has helped with administrative aspects, DeniseTjonKetTjongwithcomputersystemsupportandhelpwithtechnical issues, and Stacey Mickelbart has provided technical writing assistance. Amy Brais,oureditoratSpringer,hasbeenmostsupportiveinguidingusthrough the process of putting the book together. Our respective universities, West Virginia University, the University of Canterbury, and the University of Central Florida have been very supportive of this project. Some of this work has been done while traveling, and we particularlyacknowledgetheHuazhongUniversityofScienceandTechnology in Wuhan, China, which provided an excellent environment for writing. The support from National Science Foundation grants (9977336, 0207819, 0312724,0228370,0312484)ondatacompressionandcompressed-domainpat- ternmatchinghelpeddevelopourinterestintheBurrows-WheelerTransform. A DOE CAREER grant (DE-FG02-02ER25541) provided support for study- ing the use of the BWT in biological sequence analysis. The NSF grant on “U.S., New Zealand and Australia Collaboration on Research on Data Com- pression” (0331188, 0331896) brought the authors of this book together in New Zealand to conceive and brainstorm this book. The people who have made the greatest contribution, of course, are our families, who have released us for many long hours to write, re-write and fine tune this book. We are grateful to our wives, Leonie, Judith and Pampa, andotherfamilymembers(someofwhomhavesupportedusfordecades,and otherswhoareonlyjustabouttolearnwhatitmeanstohaveanacademicfor a parent), for their love and moral support throughout the project: Donald- Patrick(whowasbornduringthewritingofthebook),Elise-Cindel,Andrew, Michael, Paula, Mita, Cecilia, Don and Nuella. Morgantown, West Virginia Don Adjeroh Christchurch, New Zealand Tim Bell Orlando, Florida Amar Mukherjee Contents 1 Introduction............................................... 1 1.1 An example of a Burrows-Wheeler Transform ............... 3 1.2 Genesis of the Burrows-Wheeler Transform ................. 5 1.3 Transformation.......................................... 8 1.4 Permutation ............................................ 11 1.5 Recency................................................ 12 1.6 Pattern matching........................................ 13 1.7 Organization of this book................................. 14 1.8 Further reading ......................................... 16 2 How the Burrows-Wheeler Transform works............... 19 2.1 The forward Burrows-Wheeler Transform................... 19 2.2 The reverse Burrows-Wheeler Transform ................... 23 2.3 Special cases............................................ 29 2.4 Further reading ......................................... 31 3 Coders for the Burrows-Wheeler Transform ............... 33 3.1 Entropy coding.......................................... 33 3.2 Run-length and arithmetic coder .......................... 38 3.3 Move-to-front lists....................................... 39 3.4 Frequency counting methods.............................. 42 3.5 Inversion Frequencies (IF) ................................ 43 3.6 Distance coding ......................................... 44 3.7 Wavelet trees ........................................... 45 3.8 Other permutations...................................... 46 3.9 Block size .............................................. 47 3.10 Further reading ......................................... 48 4 Suffix trees and suffix arrays............................... 51 4.1 Suffix Trees............................................. 51 4.1.1 Basic notations and definitions ...................... 52 X Contents 4.1.2 Construction of a suffix tree ........................ 54 4.1.3 Ukkonen’s suffix tree algorithm ..................... 57 4.1.4 From implicit suffix tree to true suffix tree............ 64 4.1.5 Farach’s recursive construction...................... 66 4.1.6 Generalized suffix trees............................. 73 4.1.7 Implementation issues.............................. 74 4.2 Suffix arrays ............................................ 75 4.2.1 Traditional string sorting........................... 76 4.2.2 Suffix arrays via suffix trees......................... 78 4.2.3 Manber-Myers suffix sorting algorithm ............... 78 4.2.4 Linear-time direct suffix sorting ..................... 81 4.3 Space issues in suffix trees and suffix arrays................. 85 4.4 Further reading ......................................... 88 5 Analysis of the Burrows-Wheeler Transform ............... 91 5.1 The BWT, suffix trees and suffix arrays ................... 93 5.2 Computational complexity................................ 95 5.2.1 BWT first stage — the transform ................... 95 5.2.2 BWT second stage — coding the transformed text..... 95 5.3 BWT context clustering property.......................... 97 5.3.1 Context trees ..................................... 97 5.3.2 Estimation using context trees ......................100 5.3.3 BWT and context trees ............................103 5.4 Analysis of BWT output .................................104 5.4.1 Theoretical distribution of BWT output..............104 5.4.2 Empirical distribution of BWT output ...............105 5.5 Analysis of BWT compression performance .................119 5.5.1 Definitions and notation............................120 5.5.2 Performance using recency ranking ..................123 5.5.3 Performance without LGT..........................129 5.5.4 Performance using piecewise constant parameters......132 5.5.5 Performance on general sources via empirical entropy ..133 5.6 Relationship with other compression schemes ...............135 5.6.1 Context-based schemes.............................135 5.6.2 Symbol ranking schemes ...........................148 5.7 Further reading .........................................149 6 Variants of the Burrows-Wheeler Transform ...............153 6.1 The sort transform ......................................154 6.1.1 Forward sort transform ............................154 6.1.2 Inverse sort transform..............................155 6.1.3 Performance of the sort transform ...................159 6.2 Lexical permutation sorting...............................163 6.2.1 Sorting permutations ..............................164 6.2.2 Lexical permutation sorting algorithm ...............167 Contents XI 6.3 The extended BWT .....................................168 6.3.1 Sort order between strings..........................168 6.3.2 Performing the extended BWT......................169 6.3.3 Inverting the transform ............................170 6.4 Sort-based context similarity measurement..................173 6.4.1 Context similarity measurement and ranking..........173 6.4.2 The prefix list data structure .......................175 6.4.3 Relationship with the Burrows-Wheeler Transform.....178 6.4.4 Performance of the prefix list .......................180 6.5 Word-based compression .................................180 6.5.1 General word-based compression ....................181 6.5.2 Word-based Burrows-Wheeler Transform .............183 6.6 Further reading .........................................185 7 Exact and approximate pattern matching..................187 7.1 Exact pattern matching algorithms ........................188 7.1.1 Brute force matching ..............................189 7.1.2 The Knuth-Morris-Pratt Algorithm..................190 7.1.3 The Boyer-Moore algorithm ........................195 7.1.4 The Karp-Rabin algorithm .........................197 7.1.5 The shift-and method..............................199 7.1.6 Multiple pattern matching..........................200 7.1.7 Pattern matching with don’t-care characters ..........204 7.2 Pattern matching using the Burrows-Wheeler Transform .....207 7.2.1 Boyer-Moore pattern matching using the BWT........209 7.2.2 BWT-based exact pattern matching with binary search 209 7.2.3 BWT-based exact pattern matching with suffix arrays .214 7.2.4 Pattern matching using the FM-index................215 7.2.5 Algorithm improvements with overwritten arrays ......220 7.3 Performance of BWT-based exact pattern matching..........221 7.3.1 Compression performance ..........................222 7.3.2 Search performance................................224 7.3.3 Array construction speeds ..........................231 7.3.4 Comparison with LZ-based compressed-domain pattern matching..................................232 7.4 Approximate pattern matching............................233 7.4.1 Edit distance: dynamic programming formulation......234 7.4.2 Edit graphs.......................................236 7.4.3 Local similarity ...................................237 7.4.4 The longest common subsequence problem............239 7.4.5 String matching with k differences...................244 7.4.6 The k-mismatch problem using the BWT.............247 7.4.7 k-approximate matching using the BWT .............253 7.5 Hardware algorithms for pattern matching..................255 7.5.1 An equivalent hardware algorithm ...................256 XII Contents 7.5.2 A brief review of other hardware algorithms ..........258 7.6 Conclusion .............................................259 7.7 Further reading .........................................260 8 Other applications of the Burrows-Wheeler Transform.....265 8.1 Compressed suffix trees and compressed suffix arrays.........266 8.1.1 Compressed suffix trees ............................267 8.1.2 Compressed suffix arrays ...........................270 8.2 Compressed full-text indexing.............................275 8.2.1 Full-text indexing using CSTs and CSAs .............276 8.2.2 Searching on compressed suffix trees .................277 8.2.3 Searching on compressed suffix arrays. ...............278 8.3 Bioinformatics and computational biology ..................278 8.3.1 DNA sequence compression.........................279 8.3.2 Analysis of repetition structures.....................280 8.3.3 Whole-genome comparisons.........................281 8.3.4 Genome annotation................................282 8.3.5 Distance measure between sequences and phylogeny ...283 8.4 Test data compression ...................................284 8.4.1 Nature of test data ................................285 8.4.2 BWT-based test data compression...................286 8.5 Image compression, computer vision and machine translation .287 8.5.1 Image compression ................................287 8.5.2 Shape matching ...................................292 8.5.3 Machine translation ...............................294 8.6 Joint source-channel coding...............................296 8.6.1 General source coding via channel coding.............297 8.6.2 BWT-based joint source-channel coding ..............298 8.7 Prediction and entropy estimation .........................299 8.8 Further reading .........................................301 9 Conclusion ................................................305 A Notation...................................................309 B Ongoing work on the Burrows-Wheeler Transform.........313 B.1 BWT-related web sites ...................................313 B.2 Ph.D. theses relating to the Burrows-Wheeler Transform .....314 References.....................................................317 Index..........................................................341
Description: