Structures of String Matching and Data Compression N. Jesper Larsson Departmentof ComputerScience LundUniversity DepartmentofComputerScience LundUniversity Box118 S-22100Lund Sweden Copyright'1999byJesperLarsson CODEN:LUNFD6/(NFCS-1015)/1(cid:150)130/(1999) ISBN 91-628-3685-4 Abstract This doctoral dissertation presents a range of results concerning ef(cid:2)- cient algorithms and data structures for string processing, including several schemescontributing to sequential data compression.It com- prises boththeoretic resultsand practical implementations. We study the suf(cid:2)x tree data structure, presenting an ef(cid:2)cient rep- resentation and several generalizations. This includes augmenting the suf(cid:2)xtreetofullysupportslidingwindowindexing(includingapracti- calimplementation)inlineartime.Furthermore,weconsideravariant thatindexesnaturallyword-partitioneddata, andpresentalinear-time constructionalgorithmforatreethatrepresentsonlysuf(cid:2)xesstarting atwordboundaries,requiringspacelinearinthenumberofwords. By applying our sliding window indexing techniques, we achieve an ef(cid:2)cient implementation for dictionary-based compression based on the LZ-77 algorithm. Furthermore, considering predictive source modelling, we show that a PPM* style model can be maintained in linear time using arbitrarily bounded storage space. We also consider the related problem of suf(cid:2)x sorting, applicable to suf(cid:2)x array construction and block sorting compression. We present an algorithm that eliminates super(cid:3)uousprocessing of previoussolu- tions while maintaining robust worst-case behaviour.We experimen- tally show favourable performance for a wide range of natural and degenerate inputs,and present a complete implementation. Block sorting compression using BWT, the Burrows-Wheeler trans- form,hasimplicitstructurecloselyrelatedtocontexttreesusedinpre- dictive modelling. We show how an explicit BWT context tree can be ef(cid:2)ciently generated as a subset of the corresponding suf(cid:2)x tree and explore the central problems in using this structure. We experi- mentallyevaluatepredictioncapabilitiesofthetreeandconsiderrep- resenting it explicitly as part of the compressed data, arguing that a conscious treatment of the context tree can combine the compres- sion performance of predictivemodelling with the computational ef- (cid:2)ciency of BWT. Finally,weexploreof(cid:3)inedictionary-basedcompression,andpresent asemi-staticsourcemodellingschemethatobtainsexcellentcompres- sion,yetisalsocapableofhighdecodingrates.Theamountofmemory used by the decoder is (cid:3)exible, and the compressed data has the po- tential of supporting direct search operations. Between theoryand practice, sometalk as iftheyweretwo (cid:150)making a separation and difference between them. Yet wise men know that bothcan begained inapplyingoneselfwhole-heartedlytoone. Bhagavad-Gfl(cid:17)tafl 5:4 Short-sighted programming can fail to improve the quality of life. It can reduce it, causing economic loss or even physical harm. In a few extreme cases, bad programmingpractice can lead todeath. P.J. Plauger, Computer Language, Dec. 1990 Contents Foreword 7 ChapterOne Fundamentals 9 1.1 Basic De(cid:2)nitions 10 1.2 TrieStorageConsiderations 12 1.3 Suf(cid:2)xTrees 13 1.4 SequentialDataCompression 19 ChapterTwo SlidingWindowIndexing 21 2.1 Suf(cid:2)xTreeConstruction 22 2.2 SlidingtheWindow 24 2.3 StorageIssuesandFinalResult 32 ChapterThree IndexingWord-PartitionedData 33 3.1 De(cid:2)nitions 34 3.2 WastingSpace:AlgorithmA 36 3.3 SavingSpace:AlgorithmB 36 3.4 ExtensionsandVariations 40 3.5 SublinearConstruction:AlgorithmC 41 3.6 AdditionalNotesonPractice 45 ChapterFour Suf(cid:2)xSorting 48 4.1 Background 50 4.2 AFasterSuf(cid:2)xSort 52 4.3 TimeComplexity 56 4.4 AlgorithmRe(cid:2)nements 59 4.5 ImplementationandExperiments 63 ChapterFive Suf(cid:2)xTreeSourceModels 71 5.1 Ziv-LempelModel 71 5.2 PredictiveModelling 73 5.3 Suf(cid:2)xTreePPM*Model 74 5.4 FinitePPM*Model 76 5.5 Non-StructuralOperations 76 5.6 Conclusions 78 ChapterSix Burrows-WheelerContextTrees 79 6.1 Background 80 6.2 ContextTrees 82 6.3 TheRelationshipbetweenMove-to-frontCoding andContextTrees 86 6.4 ContextTreeBWTCompressionSchemes 87 6.5 FinalComments 89 ChapterSeven Semi-Static DictionaryModel 91 7.1 PreviousApproaches 93 7.2 RecursivePairing 94 7.3 Implementation 95 7.4 CompressionEffectiveness 101 7.5 EncodingtheDictionary 102 7.6 Tradeoffs 105 7.7 ExperimentalResults 106 7.8 FutureWork 110 AppendixA SlidingWindowSuf(cid:2)xTreeImplementation 111 AppendixB Suf(cid:2)xSortingImplementation 119 AppendixC Notation 125 Bibliography 127 Foreword O riginally,mymotivationforstudyingcomputersciencewasmost likely spawned by a calculator I bought fourteen years ago. This gad- getcouldstoreashortsequenceofoperations,includingaconditional jumptothestart,whichmadeitpossibletoprogramsurprisinglyintri- catecomputations.Isoonrealizedthatthissimplemechanismhadthe powerto replace the tedious repeated calculations I sodetested with an intellectual exercise: to (cid:2)nd a general method to solve a speci(cid:2)c problem (something I would later learn to refer to as an algorithm) thatcouldbeexpressedbypressingasequenceofcalculator keys.My fascination for this process still remains. With more powerful computers, programming is easier, and more challenging problems are needed to keep the process interesting. Ul- timately,inalgorithmtheory,the bothersofproducingan actual pro- gram are completely skipped over. Instead, the (cid:2)nal product is an explanation of how an idealized machine could be programmed to solve a problem ef(cid:2)ciently. In this abstract world, program elements are represented as mathematical objects that interact as if they were physical.Theycanbechainedtogether,piledontopofeachother,or linked together to any level of complexity. Without these data struc- tures, which can be combined into specialized tools for solving the problemat hand, producinglargeorcomplicated programswould be infeasible. However, they do not exist any further than in the pro- grammer’smind;whentheprogramistobewritten,everythingmust again betranslated into morebasic operations.Inmyresearch,Ihave 7 Foreword triedtomaintain thisconnection, seeingalgorithmtheorynotmerely as mathematics, but ultimately as a programming tool. At a low level, computers represent everything as sequences of numbers, albeit with different interpretations depending on the con- text. The main topic in this thesis is algorithms and data structures (cid:150) most often tree shaped structures (cid:150) for(cid:2)nding patterns and repeti- tions in long sequences, strings, of similar items. Examples of typical strings are texts (strings of letters and punctuation marks), programs (stringsofoperations),andgeneticdata(stringsofaminoacids).Even two-dimensional data, such as pictures, are represented as strings at a lower level. One area particularly explored in the thesis is storing strings compactly, compressing them, by recording repetition and sys- tematically introducing abbreviations forrepeating patterns. Theresultisacollectionofmethodsfororganizing,searching,and compressing data. Its creation has deepened my insights in computer science enormously, and I hope some of it can make a lasting contri- bution to the computing world as well. Numerouspeoplehavein(cid:3)uencedthis work.Obviously,mycoau- thorsfordifferentpartsofthethesis,ArneAndersson,AlistairMoffat, Kunihiko Sadakane, and Kurt Swanson, have had a direct part in its creation,butmanyothershavecontributedinavarietyofways.With- outattemptingtonamethemall,Iwouldliketoexpressmygratitude toallthecentral andperipheralmembersoftheglobalresearchcom- munity who have supported and assisted me. Thein(cid:3)uenceofmyadvisorArneAnderssongoesbeyondthework wherehe stands as an author.He broughtme into the research com- munity from his special angle, and imprinted me with his views and visions.Hisnotionsofwhatisrelevantresearch,andhowitshouldbe presented, have guided me throughthese last (cid:2)ve years. Finally,Iwishtospeci(cid:2)callythankAlistairMoffatforinvitingmeto Melbourneandcollaboratingwithmeforthreemonths,duringwhich time I was accepted as a full member of his dynamic research group. This gave me a newperspective,and a signi(cid:2)cant pushtowards com- pleting the thesis. Malm(cid:246),August1999 JesperLarsson 8 Chapter One Fundamentals T he main theme of this work is the organization of sequential data to (cid:2)nd and exploit patterns and regularities. This chapter de(cid:2)nes ba- sicconcepts,formulatesfundamental observationsandtheorems,and presentsanef(cid:2)cientsuf(cid:2)xtreerepresentation.Followingchaptersfre- quently referand relate tothe material givenin this chapter. The material and much of the text in this current work is taken primarilyfromthe following(cid:2)ve previouslypresentedwritings: (cid:149) ExtendedApplicationofSuf(cid:2)xTreesto DataCompression,presented at the IEEE Data Compression Conference 1996 [42]. A revised and updated version of this material is laid out in chapters two and (cid:2)ve, andtosomeextentin§1.3. (cid:149) Suf(cid:2)x Trees on Words, written in collaboration with Arne Andersson andKurtSwanson,publishedinAlgorithmica,March1998[4].Apre- liminary versionwaspresentedat theseventhAnnual Symposiumon Combinatorial Pattern Matching in June 1996. This is presented in chapterthree,withsomeofthepreliminariesgivenin§1.2. (cid:149) TheContextTreesofBlock SortingCompression,presented at the IEEE Data Compression Conference 1998 [43]. This is the basis of chap- tersix. (cid:149) Of(cid:3)ine Dictionary-Based Compression,written with Alistair Moffat of theUniversityofMelbourne,presentedattheIEEEDataCompression Conference1999[44].Anextendedversionofthisworkispresented inchapterseven. 9 ChapterOne (cid:149) FasterSuf(cid:2)xSorting,writtenwithKunihikoSadakaneoftheUniversity of Tokyo; technical report, submitted [45]. This work is reported in chapterfour.Someofitsmaterialhasbeenpresentedinapreliminary version as A Fast Algorithm for Making Suf(cid:2)xArrays andfor Burrows- WheelerTransformationbyKunihikoSadakane[59]. 1.1 Basic De(cid:2)nitions We assume that the reader is familiar with basic conventional de(cid:2)ni- tions regarding strings and graphs, and do not attempt to completely de(cid:2)ne all the concepts used. However, to resolve differences in the literature concerning the use of some concepts, we state the de(cid:2)ni- tions of not only our specialized concepts, but also of some more general ones. For quick reference to our specialized notations, appendix C on pages 125(cid:150)126 summarizes terms and symbols used in each of the chapters of the thesis. Notational Convention For notation regarding asymptotic growth of functionsandsimilarconcepts,weadoptthegeneraltraditionincom- puterscience;see,forinstance,Cormen,Leiserson,andRivest[20]. All logarithms in the thesis are assumed to be base two, except where otherwise stated. 1.1.1 SymbolsandStrings Theinputofeach ofthe algorithmsdescribed in this thesis is a sequence of items which we refer to as symbols. The interpretationofthesesymbolsasletters,programinstructions,amino acids, etc., is generally beyond our scope. We treat a symbol as an abstract elementthat canrepresentany kindofunitinthe actual im- plementation (cid:150)althoughwe doprovideseveral examplesof practical uses,andoftenaim oureffortsat aparticular areaofapplication. Two basic sets of operations for symbols are common. Either the symbolsareconsideredatomic(cid:150)indivisibleunitssubjecttoonlyafew prede(cid:2)nedoperations,ofwhichpairwisecomparisonisacommonex- ample(cid:150)ortheyareassumedtoberepresentedbyintegers,andthereby possibletomanipulatewithallthecommonarithmeticoperations.We adoptpredominantlythelatter approach,sinceourprimarygoalisto develop practically useful tools, and in present computers everything isalways,atthelowestlevel,representedasintegers.Thus,restricting allowedoperationsbeyondthesetofarithmeticonesoftenintroduces an unrealistic impediment. Wedenote thesize ofthe inputalphabet,thesetof possiblevalues 10
Description: