Higher Order Context Transformations

Michal Vasinek*, Jan Platos*
*VSB-TU Ostrava, 17. listopadu, Ostrava, 708 00, Czech Republic
[email protected], [email protected]

Abstract

The context transformation and generalized context transformation methods, which we introduced recently, reduce zero-order entropy by exchanging digrams and, as a consequence, remove mutual information between consecutive symbols of the input message. These transformations are intended to be used as a preprocessing step for zero-order entropy coding algorithms like arithmetic or Huffman coding, since especially arithmetic coding can achieve a compression rate almost of the size of the Shannon entropy.

This paper introduces a novel algorithm, based on the concept of the generalized context transformation, that allows the transformation of words longer than simple digrams. The higher order contexts are exploited using a recursive form of the generalized context transformation. It is shown that the zero-order entropy of the transformed data drops significantly, but on the other hand, the overhead given by the description of the individual transformations increases, and it becomes a limiting factor in the successful transformation of smaller files.

Introduction

Arithmetic coding [1] is a very successful compression method that is able to achieve a compression rate almost of the size of the limit given by Shannon's formula. In our work we rely on this fact and assume that the Shannon entropy of the transformed data is approximately achievable. In our previous work on context transformations [2] and generalized context transformations [3] we studied under what conditions the exchange of two different digrams beginning with the same symbol leads to a reduction of the Shannon entropy.

In the present paper we introduce a modified version of the generalized context transformation algorithm that exploits the fact that, during the digram exchange process, we obtain information about the positions of the particular digrams in the input message. These positions are later used to form the conditional distributions describing the occurrences of symbols following the transformation digrams, and the search for entropy reducing transformations is extended to trigrams and, in the same way, to other longer words.

Transformations

The generalized context transformation is an exchange of two different digrams αβ and αγ throughout the input message m over the alphabet Σ. The exact definition is as follows:

Definition 1 A generalized context transformation (GCT) is a mapping GCT(αβ ↔ αγ, m) : Σ^n → Σ^n, where Σ is the alphabet of the input message m and n is the length of the input message, that exchanges all digrams αβ for the digram αγ and vice versa.

We also considered transformations of the more general form αβ ↔ γδ, but in that case the space in which the transformation is searched is of order |Σ|^4, whereas our preferred space, the space of generalized context transformations, is only of order |Σ|^2. It is computationally much more efficient to work with GCTs.

A GCT can be applied to the message from left to right (GCT→) or from right to left (GCT←). Even though both transformations seem to be the same, and in most cases they provide the same entropy reduction, they differ when we deal with a transformation of the form GCT(αα ↔ αβ).
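To make the difference between the two directions concrete, the following minimal Python sketch (our illustration, not the authors' implementation; Latin letters stand in for the Greek symbols) applies a GCT in either direction and reproduces the m = ααα example discussed next:

    def apply_gct(m, a, b, c, left_to_right=True):
        # Exchange the digrams ab <-> ac throughout the message m;
        # a, b, c are single symbols, only the second symbol of a digram changes.
        s = list(m)
        idx = range(len(s) - 1) if left_to_right else range(len(s) - 2, -1, -1)
        for i in idx:
            if s[i] == a:
                if s[i + 1] == b:
                    s[i + 1] = c
                elif s[i + 1] == c:
                    s[i + 1] = b
        return "".join(s)

    # The two directions differ for GCT(aa <-> ab):
    print(apply_gct("aaa", "a", "a", "b", left_to_right=True))   # GCT->: "aba"
    print(apply_gct("aaa", "a", "a", "b", left_to_right=False))  # GCT<-: "abb"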
While the value of the zero-order entropy change ΔH can be precisely predicted when GCT← is applied, the opposite direction transformation GCT→ also depends on the occurrences of n-grams consisting of n repetitions of the symbol α.

For instance, suppose that the initial message is m = ααα and we apply the transformation GCT← first, leaving m′ = GCT←(αα ↔ αβ, ααα) = αββ. In the initial state there were exactly two digrams αα and three symbols α. In the final state both digrams αα were replaced by αβ, and the number of symbols β equals the number of digrams αα in the initial state of the message. The opposite direction transformation leaves m′ = GCT→(αα ↔ αβ, ααα) = αβα, and without knowledge of the distribution of n-grams of the type α^n we are unable to precisely predict the number of newly introduced occurrences of β.

Our algorithm uses the right-to-left form of the transformation, GCT←; it allows us to precisely predict how the zero-order entropy will change before an arbitrary generalized context transformation is applied. The inverse transformation of GCT← is GCT→ and vice versa: when a transformation is applied in one direction, the same transformation applied in the opposite direction is its inverse [3].

Shannon's entropy reduction

The zero-order, or Shannon [4], entropy of a random variable X is defined by:

    H(X) = −∑_{x∈Σ} p(x) log p(x)                                    (1)

where p(x) is the probability of the symbol x in the input message and log x denotes the logarithm of x to the base 2. When we use the term entropy, we always mean the Shannon entropy. Suppose that the transformation GCT←(αβ ↔ αγ) is applied. Only the probabilities of the symbols β and γ will change. The exact change of the probabilities p_i(x), where i = 0 denotes the initial state and i = 1 the state after the GCT is applied, is expressed by:

    p_1(β) = p_0(β) + p_0(α,γ) − p_0(α,β)                            (2)
    p_1(γ) = p_0(γ) + p_0(α,β) − p_0(α,γ)                            (3)

The resulting change of entropy ΔH is then expressed as the difference of the entropies of the final and initial states:

    ΔH = H_1 − H_0 = −∑_{x∈{β,γ}} p_1(x) log p_1(x) + ∑_{x∈{β,γ}} p_0(x) log p_0(x)    (4)

Any transformation with a negative ΔH reduces the zero-order entropy. This zero-order entropy reduction has an impact on the values of other information theoretic [5] quantities; in particular, the mutual information drops together with the zero-order entropy, since the mutual information is given as the relative entropy between the distribution of symbols p(x) and the conditional distribution p(x|α), and it can be interpreted as their distance. The purpose of the transformations is to decrease the average distance, given by the relative entropy, between the distribution of symbols and all the other conditional distributions.
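Equations (2)-(4) imply that ΔH can be evaluated from four probabilities alone, before any symbol of the message is touched. A minimal Python sketch (our naming, not the authors' code):

    from math import log2

    def delta_h(p_b, p_c, p_ab, p_ac):
        # p_b, p_c: probabilities of the symbols beta and gamma before the GCT;
        # p_ab, p_ac: probabilities of the digrams (alpha,beta) and (alpha,gamma).
        def term(p):
            return -p * log2(p) if p > 0 else 0.0
        p1_b = p_b + p_ac - p_ab      # equation (2)
        p1_c = p_c + p_ab - p_ac      # equation (3)
        # equation (4): H1 - H0, restricted to the two affected symbols
        return (term(p1_b) + term(p1_c)) - (term(p_b) + term(p_c))

A transformation is worth applying only when the returned value is negative.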
A generalized context transformation that reduces entropy is described by the following theorem:

Theorem 1 Suppose the generalized context transformation GCT(αβ ↔ αγ). Let p_0(β) and p_0(γ) be the probabilities of the symbols before the transformation is applied and let p_{0,max} = max{p_0(β), p_0(γ)}. After the transformation, the associated probabilities are p_1(β), p_1(γ) and p_{1,max} = max{p_1(β), p_1(γ)}. If p_{1,max} > p_{0,max}, then the generalized context transformation reduces entropy.

Proof 1 Let c = p_0(β) + p_0(γ) = p_1(β) + p_1(γ). Since the entropy function of two different letters is defined on the interval ⟨0; c⟩ and is concave with its maximum at c/2 and minima at 0 and c, p_{0,max} has to be located in the interval ⟨c/2; c⟩. On that interval, the higher the maximum, the lower the entropy; so if we increase the maximum (or, equivalently, increase the absolute value of the difference |p(β) − p(γ)|), the entropy decreases.

Algorithm

The proposed algorithm aims to reduce entropy by sorting all conditional probabilities, such that the ordering of symbols based on their probabilities is approximately the same also in the conditional distributions. For instance, if the space character is the most frequent character in the text, then after the transformation it should, at least approximately, also be the most frequent character in the conditional distribution of symbols following an arbitrary prefix w.

The algorithm consists of three stages:
• Initial - collection of the involved statistics.
• Search for and application of generalized context transformations.
• Storage.

Initial phase

In the initial phase the file is passed once. In this pass the algorithm counts the frequencies of all symbols. These frequencies are then used to sort the symbols, yielding an ordered set of indices s into the alphabet. In the proposed algorithm, alphabet symbols are one-byte values.

Algorithm 1 Preparation for the search and application stage
function HighOrderCT
    contextTree ← rootNode
    for i = 0; i < |Σ|; i++ do
        positions ← CollectPositions(s[i])
        child ← contextTree.AppendChild(s[i])
        RecursiveCT(child, positions)
    end for
end function

Algorithm 1 loops through all symbols from the most frequent one to the least frequent one. Each time it passes through the input message (file) m, it collects the positions t_i = {j | m[j] = s[i]} of the symbol α = s[i] and builds a frequency table s_α of the symbols at the positions t_{i,j} + 1 following the symbol α. The complexity of this part is O(|Σ|n), but in exchange, as will be discussed in the following section, every other search for and application of a transformation is restricted to the space of positions t_i, and in the function RecursiveCT we will not need any other pass through the whole message, only through the set of positions t_i.

Context transformation by maximal entropy reduction

Suppose that the RecursiveCT function has been called with a symbol s[i] = α. We can then form an ordered set s_α[j] of symbols ordered by the conditional frequencies (or probabilities) f(X|α). We would like the message to end in a state such that for all prefixes αw, where w is any substring whose symbols satisfy w_i ≠ α, the same symbols are ordered at the same positions: s_αw[j] = s[j]. The RecursiveCT function is given in the listing of Algorithm 2.

Algorithm 2 Recursive search for and application of context transformations
function RecursiveCT(parentNode, positions)
    α ← parentNode.Symbol()
    repeat
        β, γ, dH ← SearchCT(α)
        if dH > lim then
            break
        end if
        node ← parentNode.AppendChild(β, γ)
        pos ← ApplyCT(α, β, γ)
        RecursiveCT(node, pos)
    until true
end function

The recursive algorithm searches for transformations as long as it is able to find a transformation that reduces entropy by more than the limit given by the lim variable. The search is implemented in the function SearchCT, given in the listing of Algorithm 3.
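The control flow of Algorithms 1 and 2 can be condensed into the following Python sketch. Here search_ct and apply_ct are hypothetical stand-ins for SearchCT (Algorithm 3) and for the position-restricted digram exchange, and the dictionary-based tree is our own simplification:

    def high_order_ct(msg, symbols_by_freq, lim, search_ct, apply_ct):
        # One context-symbol node per alphabet symbol, most frequent first;
        # each node holds the chain of (beta, gamma) exchanges found below it.
        tree = {"children": []}
        for sym in symbols_by_freq:
            positions = [i for i, ch in enumerate(msg) if ch == sym]
            node = {"symbol": sym, "children": []}
            tree["children"].append(node)
            recursive_ct(node, positions, lim, search_ct, apply_ct)
        return tree

    def recursive_ct(parent, positions, lim, search_ct, apply_ct):
        alpha = parent["symbol"]
        while True:
            beta, gamma, dh = search_ct(alpha, positions)
            if dh > lim:          # no sufficiently entropy-reducing GCT left
                return
            # apply_ct exchanges beta and gamma at the successors of the given
            # positions and returns the positions (and the symbol) that seed
            # the next, higher order context.
            pos, sym = apply_ct(alpha, beta, gamma, positions)
            child = {"symbol": sym, "pair": (beta, gamma), "children": []}
            parent["children"].append(child)
            recursive_ct(child, pos, lim, search_ct, apply_ct)

Because each level recurses only into the recorded positions, the work per level shrinks with the length of the context.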
When a transformation is found, it is applied by the function ApplyCT. This function exchanges all symbols β and γ found by SearchCT; if the number of occurrences of the symbol γ is larger than that of the symbol β, it returns the former positions of the symbol γ, otherwise it returns the former positions of the symbol β. These positions will later be used in the transformation of higher order contexts.

The structure of the applied transformations is stored as a tree, the so-called context transformation tree. The root node of the tree does not represent any transformation, and its children, called context symbol nodes, are the individual context symbols s[i] selected in Algorithm 1. Each transformation is stored as a new child node of the input parameter parentNode. In the last step the function RecursiveCT is called again; the context symbol of this call is the more frequent character, together with its corresponding set of positions. The sets of positions pos_β, resp. pos_γ, are the former positions of the symbols γ, resp. β, in the context of the symbol α.

When the function RecursiveCT is called for the first time and a transformation has been found, its application corresponds to the mutual exchange of the digrams αβ and αγ in the former message. Every further call of RecursiveCT, made from itself, gets into a higher order context. Consider, for instance, the first call of RecursiveCT(node, pos): suppose that the next transformation symbols found are δ and ε; they then correspond to the exchange of trigrams of the form αγδ and αγε in the former text. Because we collect the positions of each replacement, the longer the context of a transformation is, the smaller the space (the size of the set of positions) in which the transformation is applied.

Algorithm 3 Search for the maximally entropy reducing transformation
function SearchTransformation(context symbol)
    dH ← 0
    β ← context symbol
    γ ← context symbol
    for i = 0 to |Σ| do
        if s[i] == context symbol then
            continue
        end if
        for j = i + 1 to |Σ| do
            temp_dH ← EntropyChange(s[i], s[j])
            if temp_dH < dH AND s[j] ≠ context symbol then
                dH ← temp_dH
                β ← s[i]
                γ ← s[j]
            end if
        end for
    end for
    return [β, γ, dH]
end function

The SearchTransformation function returns the maximally entropy reducing transformation. The EntropyChange function in the listing of Algorithm 3 computes dH using equation (4). The search omits transformations with s[j] = α whenever the frequency f(α,w,α) ≠ 0. This is very important, because this condition ensures the existence of the inverse transformation. The context symbol α is the symbol found in the first phase of the search algorithm, the symbol at which the transformation begins. Before the transformation starts, all positions of the context symbol are found; if we allowed a transformation of the context symbol, it would become impossible to distinguish between an α that emerges due to the transformation and an α in the former message. On the other hand, it is possible to include α in a transformation if f(α,w,α) = 0, i.e., if no word of the form αwα is a substring of the former message.

There are situations when a change of the symbol ordering occurs and it still reduces entropy. Consider two consecutive symbols of the ordered set s, s[i] and s[i+1], with probabilities p(s[i]) > p(s[i+1]); let β = s[i], γ = s[i+1] and let α be the context symbol. If p(α,β) − p(α,γ) > p(β) − p(γ), then it is convenient to apply the transformation GCT(αβ ↔ αγ) even though p(α,β) > p(α,γ) and p(β) > p(γ). Such a transformation eventually switches the order of the symbols β and γ and reduces entropy.
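As a worked illustration (with invented numbers, not taken from the paper): let p(β) = 0.10, p(γ) = 0.08, p(α,β) = 0.06 and p(α,γ) = 0.01. Then p(α,β) − p(α,γ) = 0.05 > 0.02 = p(β) − p(γ), and equations (2) and (3) give p_1(β) = 0.10 + 0.01 − 0.06 = 0.05 and p_1(γ) = 0.08 + 0.06 − 0.01 = 0.13. The new maximum 0.13 exceeds the old maximum 0.10, so by Theorem 1 the entropy decreases, even though the transformation has swapped the order of β and γ.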
In conclusion of this section we remark that symbols sorted into the order given in the initial phase do not necessarily keep the same order in the final state.

Storage

The result of the transformation is stored in two parts: the first part contains the description of the context transformation tree and the second part contains the transformed message. The transformed message has the same size as the input message; the additional overhead is given by the first part. An example of a resulting context transformation tree is visualized in Figure 1. The forward transformation is represented by the paths from the root to the leaf nodes, beginning with the leftmost path and finishing with the rightmost path.

Figure 1: Context transformation tree corresponding to the consecutive transformations: αβ ↔ αγ followed by two higher order transformations of trigrams, αββ ↔ αβγ and αβδ ↔ αβε; the next transformation found and applied is αδ ↔ αε. Then the second most frequent symbol is selected and the transformation βα ↔ βγ is applied. For the third and fifth context symbols, γ and ε, there are no entropy reducing transformations. For the transformation δα ↔ δε of the fourth context symbol δ there is one higher order transformation, δαβ ↔ δαγ.

The tree is stored using the recursive function SaveHeader(rootNode) given in the listing of Algorithm 4. In our implementation the arguments of the WRITE function are one-byte values.

Algorithm 4 Recursive header saving algorithm
function SaveHeader(treeNode)
    WRITE(treeNode.ChildrenLength())
    for i = 1 to treeNode.ChildrenLength() do
        if treeNode.IsContextSymbolNode == true then
            WRITE(treeNode.Child(i).α)
        else
            WRITE(treeNode.Child(i).β)
            WRITE(treeNode.Child(i).γ)
        end if
        SaveHeader(treeNode.Child(i))
    end for
end function

Inverse transformation

The inverse transformation GCT⁻¹ = GCT→ is applied from the beginning to the end of the transformed message. The transformations contained in the context transformation tree are applied in reversed order, beginning with the rightmost path and finishing with the leftmost path from the root node to the leaf node. As we briefly mentioned in the section about the search for entropy reducing transformations, the algorithm transforms the words w_j residing between two consecutive context symbols α. Actually, the algorithm behaves exactly as if we split the message into parts separated by the context symbol α and transformed each word independently of the others. The inverse transformation can be viewed from the same perspective: we find the first occurrence of the context symbol α, and at the position that follows α we apply the transformations from the tree in reversed order. Since we know that the transformation was taken over the symbols between two consecutive symbols α, we can apply the inverse transformation, and the next occurrence of the symbol α is the next position where the inverse transformation is applied again.
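The forward and inverse passes for a single chain of first-order transformations under one context symbol can be sketched as follows, reusing the apply_gct helper from the earlier sketch (again our own construction; the full algorithm additionally restricts higher order transformations to the recorded position sets):

    def forward(msg, alpha, pairs):
        # pairs: (beta, gamma) exchanges in the order they were applied.
        for beta, gamma in pairs:
            msg = apply_gct(msg, alpha, beta, gamma, left_to_right=False)  # GCT<-
        return msg

    def inverse(msg, alpha, pairs):
        # Undo by replaying the exchanges in reversed order, left to right (GCT->).
        for beta, gamma in reversed(pairs):
            msg = apply_gct(msg, alpha, beta, gamma, left_to_right=True)
        return msg

    m = "abacab"
    t = forward(m, "a", [("b", "c")])
    assert inverse(t, "a", [("b", "c")]) == m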
Transformation of languages to languages with lower entropy

Usually the most frequent character in a text is the space character separating the individual words in a sentence. Figure 2b gives an example of a message after the transformation of the words residing between two consecutive space characters. Several words beginning on a new line, i.e. words that follow the end-of-line character, remain untransformed; they will be transformed by some later transformation. Figure 2c then presents the final state of the message. To emphasize the differences between the states of the message, all space characters have been replaced by @ and the end-of-line characters by #.

Figure 2: Slice from the Calgary corpus book1 file: a) before transformation, b) after 10k transformations, c) final state (approx. 20k transformations).

The search algorithm can be modified to transform words residing between arbitrary distinct symbols. Suppose that we would like to transform all tags in an HTML document, so that the symbols between '<' and '>' become symbols of low entropy. In the listing of Algorithm 3, the initial context symbol would be '<' and the transformation stops when no other entropy reducing transformation exists than one modifying an occurrence of '>'. Rules of this kind can be integrated into the search algorithm very simply.

Results

Our results are summarized in Table 1. We have not focused on the compression of the context transformation tree yet; instead, we estimated an upper bound on the entropy of the tree's description using the PAQ8 algorithm created by Matt Mahoney [6]. Let H_t be the entropy of the transformed message m of size |m| and let |h_paq8| be the size of the tree compressed by PAQ8 in bits; the bits per symbol (byte) ratio F of the resulting message is then computed as:

    F = (|h_paq8| + H_t · |m|) / |m|

The bits per byte ratio has been computed for several settings of the lim variable given in the listing of Algorithm 2. The value of the limit is given in the subscript of the column name and represents a number of bits; for instance, H_2, resp. F_2, would mean the entropy, resp. the bits per byte ratio, for lim = 2.

File          H      H_0    H_4    H_8    F_0    F_4    F_8    gzip   bzip2
bib           5.201  2.355  2.938  3.214  3.545  3.529  3.551  2.509  1.975
book1         4.527  3.001  3.318  3.414  3.764  3.581  3.552  3.250  2.420
obj1          5.948  1.347  2.419  3.844  5.572  5.650  4.945  3.812  4.015
paper1        4.983  2.316  3.058  3.344  4.062  3.919  3.840  2.789  2.492
paper2        4.601  2.471  3.019  3.273  3.765  3.587  3.581  2.887  2.437
progc         5.199  2.346  3.071  3.421  4.336  4.103  3.989  2.677  2.533
progp         4.868  1.766  2.296  2.683  3.242  3.299  3.266  1.811  1.735
trans         5.532  1.473  2.289  2.667  2.784  3.141  3.205  1.610  1.528
alice29.txt   4.567  2.608  2.971  3.141  3.578  3.387  3.372  2.850  2.272
bible.txt     4.342  2.662  2.727  2.757  2.756  2.762  2.762  2.201  1.672
cp.html       5.229  1.593  2.767  3.221  4.248  3.992  3.817  2.593  2.479
kennedy.xls   3.573  3.143  3.146  3.150  3.164  3.158  3.156  1.629  1.012
world192      4.998  2.617  2.803  2.931  3.094  3.037  3.057  2.259  1.583

Table 1: Summary of the achieved results for selected files from the Calgary [7] and Canterbury [8] corpora in comparison with the standard compression methods gzip and bzip2. H represents the entropy of the input file; H_lim and F_lim are the entropy and the bits per byte ratio for the given value of lim. Values of F and the results of the standard methods are measured in bits per byte.

The case H_0 is the extreme case in which all accessible entropy reducing transformations were applied; in several cases the achieved entropy rate was better than the one achieved by the standard methods, but the large number of transformations leads to a growth of the context transformation tree and, as a consequence, the resulting file size is significantly larger. The difference F_lim − H_lim gets smaller with larger files (bible.txt, 3.85 MB) because of the relative size of the tree against the size of the message, but also due to the observation that larger files are less sensitive to the selection of the value of the entropy reduction limit lim.
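For concreteness, F can be computed as below (a trivial sketch with our own naming; the example numbers are hypothetical):

    def bits_per_byte(h_t, msg_len, tree_bits):
        # h_t: zero-order entropy of the transformed message in bits per byte,
        # msg_len: message size in bytes, tree_bits: compressed tree size in bits.
        return (tree_bits + h_t * msg_len) / msg_len

    # e.g. a 1,000,000-byte message with h_t = 3.0 and a 50,000-bit tree:
    # bits_per_byte(3.0, 1_000_000, 50_000) -> 3.05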
Conclusion

We proposed a transformation of higher order contexts based on the concept of generalized context transformations. Our algorithm is able to significantly reduce the entropy of input messages, but it has only a limited ability to efficiently store the information that is necessary to describe all the applied transformations.

The efficient storage of the context transformation tree will be one of the areas we will focus on in the future, and it is fair to say that it is the main weakness of our algorithm. On the other hand, this issue partially relates to the size of the input message: larger files, like bible.txt from the Canterbury corpus, have a relatively small context transformation tree in comparison with the overall size of the input. Under the assumption that the total number of applied transformations grows logarithmically with the size of the file, we expect the effect of the tree storage on the resulting bits per byte ratio to decrease with increasing file size.

References

[1] I. H. Witten, R. M. Neal, and J. G. Cleary, "Arithmetic coding for data compression," Commun. ACM, vol. 30, no. 6, pp. 520-540, Jun. 1987. [Online]. Available: http://doi.acm.org/10.1145/214762.214771
[2] M. Vasinek and J. Platos, "Entropy reduction using context transformations," in Data Compression Conference (DCC), 2014, March 2014, pp. 431-431.
[3] M. Vasinek and J. Platos, "Generalized context transformations - enhanced entropy reduction," in Data Compression Conference (DCC), 2015, April 2015, pp. 474-474.
[4] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379-423, 623-656, July, October 1948. [Online]. Available: http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
[5] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006.
[6] M. Mahoney, "PAQ8." [Online]. Available: http://www.mattmahoney.net/dc/#paq
[7] "Calgary corpus." [Online]. Available: http://www.data-compression.info/Corpora/CalgaryCorpus/
[8] "Canterbury corpus." [Online]. Available: http://www.data-compression.info/Corpora/CanterburyCorpus/