Lempel-Ziv Decoding in External Memory*

Djamal Belazzougui¹, Juha Kärkkäinen², Dominik Kempa², and Simon J. Puglisi²

¹ CERIST, Algeria
[email protected]
² Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Finland
{juha.karkkainen,dominik.kempa,simon.puglisi}@cs.helsinki.fi

* This research is partially supported by the Academy of Finland through grant 258308 and grant 250345 (CoECGR).

Abstract. Simple and fast decoding is one of the main advantages of LZ77-type text encoding, used in many popular file compressors such as gzip and 7zip. With the recent introduction of external memory algorithms for Lempel–Ziv factorization there is a need for external memory LZ77 decoding, but the standard algorithm makes random accesses to the text and cannot be trivially modified for external memory computation. We describe the first external memory algorithms for LZ77 decoding, prove that their I/O complexity is optimal, and demonstrate that they are very fast in practice, only about three times slower than in-memory decoding (when reading input and writing output is included in the time).

1 Introduction

The Lempel–Ziv (LZ) factorization [18] is a partitioning of a text string into a minimal number of phrases consisting of substrings with an earlier occurrence in the string and of single characters. In LZ77 encoding [20] the repeated phrases are replaced by a pointer to an earlier occurrence (called the source of the phrase). It is a fundamental tool for data compression [6,17,7,15] and today it lies at the heart of popular file compressors (e.g. gzip and 7zip) and information retrieval systems (see, e.g., [6,10]). Recently the factorization has become the basis for several compressed full-text self-indexes [16,8,9,5]. Outside of compression, LZ factorization is a widely used algorithmic tool in string processing: the factorization lays bare the repetitive structure of a string, and this can be used to design efficient algorithms [2,13,14,12].

One of the main advantages of LZ77 encoding as a compression technique is its fast and simple decoding: simply replace each pointer to a source by a copy of the source. However, this requires random access to the earlier part of the text. Thus the recent introduction of external memory algorithms for LZ77 factorization [11] raises the question: is fast LZ77 decoding possible when the text length exceeds the RAM size? In this paper we answer the question positively by describing the first external memory algorithms for LZ77 decoding.

In LZ77 compression, the need for external memory algorithms can be avoided by using an encoding window of limited size. However, a longer encoding window can improve the compression ratio [6]. Even with a limited window size, decompression on a machine with a small RAM may require an external memory algorithm if the compression was done on a machine with a large RAM. Furthermore, in applications such as text indexing and string processing, limiting the window size is not allowed. While most of these applications do not require decoding, a fast decoding algorithm is still useful for checking the correctness of the factorization.

Our contribution. We show that in the standard external memory model [19] the I/O complexity of decoding an LZ77-like encoding of a string of length n over an alphabet of size σ is Θ((n / (B log_σ n)) · log_{M/B} (n / (B log_σ n))), where M is the RAM size and B is the disk block size in units of Θ(log n) bits.
The lower bound is shown by a reduction from permuting, and the upper bound by describing two algorithms with this I/O complexity. The first algorithm uses the powerful tools of external memory sorting and priority queues, while the second relies on plain disk I/O only. Both algorithms are relatively simple and easy to implement. Our implementation uses the STXXL library [4] for sorting and priority queues.

Our experiments show that both algorithms scale well for large data, but the second algorithm is much faster in all cases. This shows that, while external memory sorting and priority queues are extremely useful tools, they do have a significant overhead when their full power is not needed. The faster algorithm (using a very modest amount of RAM) is only 3–4 times slower than an in-memory algorithm that has enough RAM to perform the decoding in RAM (but has to read the input from disk and write the output to disk). Our algorithms do not need a huge amount of disk space in addition to the input (factorization) and output (text), but we also describe and implement a version which can reduce the additional disk space to less than 3% of the total disk space usage, essentially with no effect on runtime.

2 Basic Definitions

Strings. Throughout we consider a string X = X[1..n] = X[1]X[2]...X[n] of |X| = n symbols drawn from the alphabet [0..σ−1] for σ = n^O(1). For 1 ≤ i ≤ j ≤ n we write X[i..j] to denote the substring X[i]X[i+1]...X[j] of X. By X[i..j) we denote X[i..j−1].

LZ77. The longest previous factor (LPF) at position i in string X is a pair LPF[i] = (p_i, ℓ_i) such that p_i < i, X[p_i..p_i+ℓ_i) = X[i..i+ℓ_i), and ℓ_i is maximized. In other words, X[i..i+ℓ_i) is the longest prefix of X[i..n] which also occurs at some position p_i < i in X. There may be more than one potential value of p_i, and we do not care which one is used.

The LZ77 factorization (or LZ77 parsing) of a string X is a greedy, left-to-right parsing of X into longest previous factors. More precisely, if the jth LZ factor (or phrase) in the parsing is to start at position i, then LZ[j] = LPF[i] = (p_i, ℓ_i) (to represent the jth phrase), and then the (j+1)th phrase starts at position i+ℓ_i. The exception is the case ℓ_i = 0, which happens iff X[i] is the leftmost occurrence of a symbol in X. In this case LZ[j] = (X[i], 0) (to represent X[i..i]) and the next phrase starts at position i+1. This is called a literal phrase and the other phrases are called repeat phrases. For a repeat phrase, the substring X[p_i..p_i+ℓ_i) is called the source of the phrase X[i..i+ℓ_i). We denote the number of phrases in the LZ77 parsing of X by z.

LZ77-type factorization. There are many variations of LZ77 parsing. For example, the original LZ77 encoding [20] had only one type of phrase: a (potentially empty) repeat phrase always followed by a literal character. Many compressors use parsing strategies that differ from the greedy strategy described above to optimize the compression ratio after entropy compression or to speed up compression or decompression. The algorithms described in this paper can be easily adapted for most of them. For purposes of presentation and analysis we make two assumptions about the parsing:

– All phrases are either literal or repeat phrases as described above.
– The total number of repeat phrases, denoted by z_rep, is O(n/log_σ n).

We call this an LZ77-type factorization. The second assumption holds for the greedy factorization [18] and means that the total size of the repeat phrases encoded using O(log n)-bit integers is O(n log σ) bits. If, furthermore, the zero length in the literal phrases is encoded with O(log σ) bits, the size of the whole encoding is O(n log σ) bits.
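To make the decoding task concrete, the following is a minimal sketch of the standard in-memory decoding mentioned in the introduction, assuming a parsing given as (p_i, ℓ_i) and (c, 0) pairs as defined above; the Phrase type and the 0-based positions are our illustration, not the paper's code.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One phrase of an LZ77-type parsing: a repeat phrase (p, len) with len > 0
// copies len symbols starting at source position p (0-based); a literal
// phrase stores the symbol itself in p and has len == 0.
struct Phrase {
    uint64_t p;   // source position, or the literal symbol if len == 0
    uint64_t len; // phrase length; 0 marks a literal phrase
};

// Standard in-memory decoding: replace each pointer to a source by a copy
// of the source. Symbols are copied one at a time because a source may
// overlap the phrase it produces.
std::string decode(const std::vector<Phrase>& parsing) {
    std::string text;
    for (const Phrase& f : parsing) {
        if (f.len == 0) {
            text.push_back(static_cast<char>(f.p));  // literal phrase
        } else {
            for (uint64_t i = 0; i < f.len; ++i)     // repeat phrase
                text.push_back(text[f.p + i]);       // random access to text
        }
    }
    return text;
}
```

The access text[f.p + i] is exactly the random access to the earlier part of the text that makes this algorithm unsuitable when the text does not fit in RAM.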
3 On the I/O complexity of LZ decoding

Given an LZ77-type factorization of a string encoded as described above, the task of LZ77 decoding is to recover the original string. In this section, we obtain a lower bound on the I/O complexity of LZ decoding by a reduction from permuting.

We do the analysis using the standard external memory (EM) model [19] with RAM size M and disk block size B, both measured in units of Θ(log n) bits. We are primarily interested in the I/O complexity, i.e., the number of disk blocks moved between RAM and disk.

Given a sequence x̄ = x_1, x_2, ..., x_n of n objects of size Θ(log n) bits each and a permutation π[1..n] of [1..n], the task of permuting is to obtain the permuted sequence ȳ = y_1, y_2, ..., y_n = x_{π[1]}, x_{π[2]}, ..., x_{π[n]}. Under the mild assumption that B log(M/B) = Ω(log(n/B)), the I/O complexity of permuting is Θ((n/B) log_{M/B} (n/B)), the same as the I/O complexity of sorting [1].

We now show that permuting can be reduced to LZ decoding. Let X be the string obtained from the sequence x̄ by encoding each x_i as a string of length h = Θ(log_σ n) over the alphabet [0..σ). Let Y be the string obtained in the same way from the sequence ȳ. Form an LZ77-type factorization of XY by encoding the first half using literal phrases and the second half using repeat phrases so that the substring representing y_i is encoded by the phrase (hπ[i]+1−h, h). This LZ factorization is easy to construct in O(n/B) I/Os given x̄ and π. By decoding the factorization we obtain XY and thus ȳ.

Theorem 1. The I/O complexity of decoding an LZ77-type factorization of a string of length n over an alphabet of size σ is Ω((n / (B log_σ n)) · log_{M/B} (n / (B log_σ n))).

Proof. The result follows by the above reduction from permuting a sequence of Θ(n/log_σ n) objects. ⊓⊔

4 LZ decoding using EM sorting and priority queue

Our first algorithm for LZ decoding relies on the powerful tools of external memory sorting and external memory priority queues.

We divide the string X into ⌈n/b⌉ segments of size exactly b (except the last segment, which can be smaller). The segments must be small enough to fit in RAM and big enough to fill at least one disk block. If a phrase or its source overlaps a segment boundary, the phrase is split so that all phrases and their sources are completely inside one segment. The number of phrases increases by at most O(z_rep + n/b) because of the splitting.

After splitting, the phrases are divided into three sequences. The sequence R_far contains the repeat phrases whose source is more than b positions before the phrase (called far repeat phrases) and the sequence R_near the other repeat phrases (called near repeat phrases). The sequence L contains all the literal phrases. The repeat phrases are represented by triples (p, q, ℓ), where p is the starting position of the source, q is the starting position of the phrase and ℓ is the length. The literal phrases are represented by pairs (q, c), where q is the phrase position and c is the character. The sequence R_far of far repeat phrases is sorted by the source position. The other two sequences are not sorted, i.e., they remain ordered by the phrase position.
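The following sketch illustrates the splitting and classification step, assuming 0-based positions and segments [0..b), [b..2b), ...; the function and container names are ours, and the real implementation streams the sequences to and from disk rather than collecting them in RAM.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Repeat  { uint64_t p, q, len; };  // source pos, phrase pos, length
struct Literal { uint64_t q; char c; };  // phrase pos, character

// Split every repeat phrase so that neither the phrase nor its source
// crosses a segment boundary, then route each piece into R_far or R_near
// depending on the distance between source and phrase.
void split_and_classify(const std::vector<Repeat>& repeats,
                        uint64_t b,
                        std::vector<Repeat>& r_far,    // source > b before phrase
                        std::vector<Repeat>& r_near) { // source within b
    for (Repeat ph : repeats) {
        while (ph.len > 0) {
            // Longest piece that stays inside one segment at both the
            // source end and the phrase end.
            uint64_t room_src = b - ph.p % b;
            uint64_t room_dst = b - ph.q % b;
            uint64_t piece = std::min({ph.len, room_src, room_dst});
            Repeat part{ph.p, ph.q, piece};
            if (ph.q - ph.p > b) r_far.push_back(part);
            else                 r_near.push_back(part);
            ph.p += piece; ph.q += piece; ph.len -= piece;
        }
    }
}
```

After this pass, R_far would be sorted by p (with EM sorting in the actual algorithm), while R_near and the literal sequence L keep their phrase order. Splitting a self-overlapping phrase into pieces is safe because decoding proceeds left to right, so the later pieces find their sources already recovered.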
During the computation, we maintain an external memory priority queue Q that stores already recovered far repeat phrases. Each such phrase is represented by a triple (q, ℓ, s), where q and ℓ are as above and s is the phrase itself as a literal string. The triples are extracted from the queue in ascending order of q. The maximum length of the phrases stored in the queue is bounded by a parameter ℓ_max. Longer phrases are split into multiple phrases before being inserted into the queue.

The string X is recovered one segment at a time in left-to-right order, and each segment is recovered one phrase at a time in left-to-right order. A segment is recovered in a (RAM) array Y[0..b) of size b. At any moment in time, for some i ∈ [0..b], Y[0..i) contains the already recovered prefix of the current segment and Y[i..b) contains the last b−i characters of the preceding segment. The next phrase, starting at Y[i], is recovered in one of three ways depending on its type:

– A literal phrase is obtained as the next phrase in the sequence L.
– A near repeat phrase is obtained as the next phrase in the sequence R_near. The source of the phrase either starts in Y[0..i) or is contained in Y[i..b), and is easily recovered in both cases.
– A far repeat phrase is obtained from the priority queue together with its full literal representation.

Once a segment has been fully recovered, we read all the phrases in the sequence R_far having their source within the current segment. Since R_far is ordered by the source position, this involves a single sequential scan of R_far over the whole algorithm. Each such phrase is inserted into the priority queue Q with its literal representation (splitting the phrase into multiple phrases if necessary).

Theorem 2. A string of length n over an alphabet of size σ can be recovered from its LZ77 factorization in O((n / (B log_σ n)) · log_{M/B} (n / (B log_σ n))) I/Os.

Proof. We set ℓ_max = Θ(log_σ n) and b = Θ(B log_σ n). Then the objects stored in the priority queue need O(log n + ℓ_max log σ) = O(log n) bits each, and the total number of repeat phrases after all splitting is O(z_rep + n/log_σ n) = O(n/log_σ n). Thus sorting the phrases needs O((n / (B log_σ n)) · log_{M/B} (n / (B log_σ n))) I/Os. This is also the I/O complexity of all the external memory priority queue operations [3]. All other processing is sequential and needs O(n / (B log_σ n)) I/Os. ⊓⊔

We have implemented the algorithm using the STXXL library [4] for external memory sorting and priority queues.

5 LZ decoding without sorting or priority queue

The practical performance of the algorithm in the previous section is often bounded by in-memory computation rather than I/O, at least on a machine with relatively fast disks. In this section, we describe an algorithm that reduces computation, based on the observation that we do not really need the full power of external memory sorting and priority queues.

To get rid of sorting, we replace the sorted sequence R_far with ⌈n/b⌉ unsorted sequences R_1, R_2, ..., where R_i contains all the phrases with the source in the ith segment. In other words, sorting R_far is replaced with distributing the phrases into the segment sequences R_1, R_2, .... If n/b is less than M/B, the distribution can be done in one pass, since we only need one RAM buffer of size B for each segment. Otherwise, we group M/B consecutive segments into a supersegment, distribute the phrases first into supersegments, and then scan each supersegment sequence to distribute its phrases into segments. If necessary, further layers can be added to the segment hierarchy. This operation generates the same amount of I/O as sorting the phrases but requires less computation, because the segment sequences do not need to be sorted.
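Below is a minimal sketch of the one-pass distribution, for the case n/b ≤ M/B where one RAM buffer per segment suffices. Writing each full buffer to a per-segment file is our stand-in for the on-disk representation, which the paper does not specify.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

struct Repeat { uint64_t p, q, len; };  // source pos, phrase pos, length

// Distribute the far repeat phrases into per-segment sequences R_1, R_2, ...
// keyed by source position. Unlike sorting, no comparisons are made and the
// order within each R_i is arbitrary.
void distribute(const std::vector<Repeat>& r_far, uint64_t b, size_t buf_cap) {
    uint64_t num_segments = 0;
    for (const Repeat& ph : r_far)
        num_segments = std::max(num_segments, ph.p / b + 1);

    std::vector<std::vector<Repeat>> buf(num_segments);  // one RAM buffer each
    auto flush = [&](uint64_t i) {
        std::string name = "R_" + std::to_string(i + 1);
        if (FILE* f = std::fopen(name.c_str(), "ab")) {
            std::fwrite(buf[i].data(), sizeof(Repeat), buf[i].size(), f);
            std::fclose(f);
        }
        buf[i].clear();
    };

    for (const Repeat& ph : r_far) {
        uint64_t i = ph.p / b;               // segment containing the source
        buf[i].push_back(ph);
        if (buf[i].size() == buf_cap) flush(i);
    }
    for (uint64_t i = 0; i < num_segments; ++i) flush(i);  // flush remainders
}
```

With more segments than M/B, the same routine would be applied in rounds: once to distribute into supersegment files, and then once per supersegment file to split it into segment files.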
In the same way, the priority queue is replaced with ⌈n/b⌉ simple queues. The queue Q_i contains a triple (q, ℓ, s) for each far repeat phrase whose phrase position is within the ith segment. The order of the phrases in the queue is arbitrary. Instead of inserting a recovered far repeat phrase into the priority queue Q, it is appended to the appropriate queue Q_i. This requires a RAM buffer of size B for each queue but, as above, a multi-round distribution can be used if the number of segments is too large. This approach may not reduce the I/O compared to the use of a priority queue, but it does reduce computation. Moreover, the simple queues allow the strings s to be of variable sizes and of unlimited length; thus there is no need to split the phrases except at segment boundaries.

Since the queues Q_i are not ordered by the phrase position, we can no longer recover a segment in a strict left-to-right order, which requires a modification of the segment recovery procedure. The sequence R_near of near repeat phrases is divided into two: R_prev contains the phrases with the source in the preceding segment and R_same the ones with the source in the same segment. As in the previous section, the recovery of a segment X_j starts with the previous segment in the array Y[0..b) and consists of the following steps (see the sketch at the end of this section):

1. Recover the phrases in R_prev (that are in this segment). Note that each source is in the part of the previous segment that is still untouched.
2. Recover the literal phrases by reading them from L.
3. Recover the far repeat phrases by reading them from Q_j (with their full literal representations).
4. Recover the phrases in R_same. Note that each source is in the part of the current segment that has been fully recovered.

After the recovery of the segment, we read all the phrases in R_j and insert them into the queues Q_k with their full literal representations.

We want to minimize the number of segments. Thus we choose the segment size to occupy at least half of the available RAM, and more if the RAM buffers for the queues Q_k do not require all of the other half. It is easy to see that this algorithm does not generate asymptotically more I/Os than the algorithm of the previous section. Thus the I/O complexity is O((n / (B log_σ n)) · log_{M/B} (n / (B log_σ n))). We have implemented the algorithm using standard file I/O (without the help of STXXL).
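The sketch below walks through the four recovery steps for one segment, assuming 0-based positions and using in-RAM std::deque/std::vector stand-ins for the on-disk sequences; the type and function names are ours. Step 1 can copy in place because left-to-right phrase order guarantees that each source lies at or to the right of the phrase's own offset and is therefore still untouched.

```cpp
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

struct Repeat    { uint64_t p, q, len; };        // source pos, phrase pos, length
struct Literal   { uint64_t q; char c; };        // phrase pos, character
struct Recovered { uint64_t q; std::string s; }; // far repeat phrase with its text

// Recover segment j into Y[0..b); on entry Y holds the previous segment.
// R_prev, L and R_same are consumed in phrase order; the phrases of Q_j may
// arrive in arbitrary order, so each is placed by its position q.
void recover_segment(uint64_t j, uint64_t b, std::vector<char>& Y,
                     std::deque<Repeat>& r_prev, std::deque<Literal>& lit,
                     const std::vector<Recovered>& q_j,
                     std::deque<Repeat>& r_same) {
    const uint64_t base = j * b;  // text position corresponding to Y[0]
    // Step 1: sources in the previous segment, at Y offset p - base + b.
    while (!r_prev.empty() && r_prev.front().q < base + b) {
        Repeat f = r_prev.front(); r_prev.pop_front();
        for (uint64_t i = 0; i < f.len; ++i)
            Y[f.q - base + i] = Y[f.p - base + b + i];
    }
    // Step 2: literal phrases of this segment.
    while (!lit.empty() && lit.front().q < base + b) {
        Y[lit.front().q - base] = lit.front().c;
        lit.pop_front();
    }
    // Step 3: far repeat phrases, carrying their full literal representation.
    for (const Recovered& f : q_j)
        for (uint64_t i = 0; i < f.s.size(); ++i)
            Y[f.q - base + i] = f.s[i];
    // Step 4: sources earlier in the same segment, fully recovered by now.
    while (!r_same.empty() && r_same.front().q < base + b) {
        Repeat f = r_same.front(); r_same.pop_front();
        for (uint64_t i = 0; i < f.len; ++i)
            Y[f.q - base + i] = Y[f.p - base + i];
    }
}
```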
6 Reducing disk space usage

The algorithm described in the previous section can adapt to a small RAM by using short segments and, if necessary, multiple rounds of distribution. However, reducing the segment size does not affect the disk space usage, and the algorithm will fail if it does not have enough disk space to store all the external memory data. In this section, we describe how the disk space usage can be reduced. The idea is to divide the LZ factorization into parts and to process one part at a time, recovering the corresponding part of the text.

The first part is processed with the algorithm of the previous section as if it was the full string. To process the later parts, a slightly modified algorithm is needed because, although all the phrases are in the current part, the sources can be in the earlier parts. Thus we have the R_j queues for all the segments in the current and earlier parts, but the Q_j queues only for the current part. The algorithm first processes all the segments in the previous parts, performing the following steps for each segment X_j:

– Read X_j from disk to RAM.
– Read R_j and for each phrase in R_j create the triple (q, ℓ, s) and write it to the appropriate queue Q_k.

Then the segments of the current part are processed as described in the previous section.

For each part, the algorithm reads all the segments in the preceding parts. The number of additional I/Os needed for this is O(np / (B log_σ n)), where p is the number of parts. In other respects, the performance of the algorithm remains essentially the same.

We have implemented this partwise processing algorithm using greedy on-line partitioning (sketched below). That is, we make each part as large as possible so that the peak disk usage does not exceed a given disk space budget. An estimated peak disk usage is maintained while reading the input. The implementation needs at least enough disk space to store the input (the factorization) and the output (the recovered string), but the disk space needed in addition to that can usually be reduced to a small fraction of the total with just a few parts.
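The following is a sketch of the greedy on-line partitioning under stated assumptions: the disk-usage estimator below (the current part's phrase records, plus the text recovered in earlier parts, plus the part's own output) is our simplification, as the paper does not spell out its estimator.

```cpp
#include <cstdint>
#include <vector>

struct Phrase { uint64_t p, q, len; };  // len == 0 marks a literal phrase

// Greedy on-line partitioning: grow the current part while the estimated
// peak disk usage stays within the budget. Returns, for each part, the
// index one past its last phrase.
std::vector<size_t> partition(const std::vector<Phrase>& parsing,
                              uint64_t budget_bytes) {
    std::vector<size_t> part_ends;
    uint64_t text_before = 0;           // text recovered in earlier parts
    uint64_t part_text = 0, part_phrases = 0;
    for (size_t i = 0; i < parsing.size(); ++i) {
        uint64_t len = parsing[i].len ? parsing[i].len : 1;
        uint64_t estimate = (part_phrases + 1) * sizeof(Phrase) // queued phrases
                          + text_before                         // earlier output on disk
                          + part_text + len;                    // this part's output
        if (estimate > budget_bytes && part_phrases > 0) {
            part_ends.push_back(i);     // close the current part before phrase i
            text_before += part_text;
            part_text = 0;
            part_phrases = 0;
        }
        part_text += len;
        ++part_phrases;
    }
    part_ends.push_back(parsing.size());
    return part_ends;
}
```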
7 Experimental Results

Setup. We performed experiments on a machine equipped with two six-core 1.9GHz Intel Xeon E5-2420 CPUs with 15MiB L3 cache and 120GiB of DDR3 RAM. The machine had 7.2TiB of disk space striped with RAID0 across four identical local disks, achieving a (combined) transfer rate of about 480MiB/s. The STXXL block size, as well as the size of the buffers in the algorithm based on plain disk I/O, was set to 1MiB. The OS was Linux (Ubuntu 12.04, 64bit) running kernel 3.13.0. All programs were compiled using g++ version 4.7.3 with -O3 -DNDEBUG options. The machine had no other significant CPU tasks running and only a single thread of execution was used for computation. All reported runtimes are wallclock (real) times.

Datasets. For the experiments we used the following files, varying in the number of repetitions and alphabet size (see Table 1 for some statistics):

– hg.reads: a collection of DNA reads (short fragments produced by a sequencing machine) from 40 human genomes (http://www.1000genomes.org/), filtered from symbols other than {A,C,G,T,N} and newline;
– wiki: a concatenation of three different English Wikipedia dumps (http://dumps.wikimedia.org/) in XML format, dated 2014-07-07, 2014-12-08, and 2015-07-02;
– kernel: a concatenation of ~16.8 million source files from 510 versions of the Linux kernel (http://www.kernel.org/);
– random255: a randomly generated sequence of bytes.

Name      | σ   | n/z
hg.reads  | 6   | 52.81
wiki      | 213 | 84.26
kernel    | 229 | 7767.05
random255 | 255 | 4.10

Table 1. Statistics of the data used in the experiments. All files are of size 256GiB. The value of n/z (the average length of a phrase in the LZ77 factorization) is included as a measure of repetitiveness.

Experiments. In the first experiment we compare the implementation of the new LZ77 decoding algorithm that uses no external memory sorting or priority queue to a straightforward internal-memory LZ77 decoding algorithm that scans the input parsing from disk and decodes the text from left to right; all copying of text from sources to phrases happens in RAM. We use the latter algorithm as a baseline since it represents a realistic upper bound on the speed of LZ77 decoding. It needs enough RAM to accommodate the output text as a whole, and thus we were only able to process prefixes of the test files up to a size of about 120GiB. In the runtime we include the time it takes to read the parsing from disk (we stream the parsing using a small buffer) and write the output text to disk. The new algorithm, being a fully external memory algorithm, can handle the full test instances. The RAM usage of the new algorithm was limited to 3.5GiB.

The results are presented in Fig. 1. In nearly all cases the new algorithm is about three times slower than the baseline. This is due to the fact that in the external memory algorithm each text symbol in a far repeat phrase is read or written to disk three times: first, when written to a queue Q_j as a part of a recovered phrase; second, when read from Q_j; and third, when the decoded text is written to disk. In comparison, the baseline algorithm transfers each text symbol between RAM and disk once: when the decoded text is written to disk. Similarly, while the baseline algorithm usually needs one cache miss to copy a phrase from its source, the external memory algorithm performs about three cache misses per phrase: when adding the source of a phrase to R_j, when adding the literal representation of a phrase to Q_j, and when copying the symbols from Q_j into their correct position in the text. The exception to the above behavior is the highly repetitive kernel testfile, which contains many near repeat phrases; these are processed as efficiently as phrases in the RAM decoding algorithm.

[Fig. 1 (plots omitted): Comparison of the new external memory LZ77 decoding algorithm based on plain disk I/O ("LZ77decode") with the purely in-RAM decoding algorithm ("Baseline") on the four test files, plotting decoding speed (in MiB of output text decoded per second) against input size (in GiB). The baseline represents an upper bound on the speed of LZ77 decoding.]

In the second experiment we compare our two algorithms described in Sections 4 and 5 to each other. For the algorithm based on the priority queue we set ℓ_max = 16. The segment size in both algorithms was set to at least half of the available RAM (and even more if it did not lead to multiple rounds of EM sorting/distribution), except in the algorithm