Compression, information theory, and grammars: a unified approach

Abraham Bookstein, Shmuel T. Klein
1990 ACM Transactions on Information Systems  
Text compression is of considerable theoretical and practical interest. It is, for example, becoming increasingly important for satisfying the requirements of fitting a large database onto a single CD-ROM. Many of the compression techniques discussed in the literature are model based. We here propose the notion of a formal grammar as a flexible model of text generation that encompasses most of the models offered before as well as, in principle, extending the possibility of compression to a much more general class of languages. Assuming a general model of text generation, a derivation is given of the well-known Shannon entropy formula, making possible a theory of information based upon text representation rather than on communication. The ideas are shown to apply to a number of commonly used text models. Finally, we focus on a Markov model of text generation, suggest an information-theoretic measure of similarity between two probability distributions, and develop a clustering algorithm based on this measure. This algorithm allows us to cluster Markov states, and thereby base our compression algorithm on a smaller number of probability distributions than would otherwise have been required. A number of theoretical consequences of this approach to compression are explored, and a detailed example is given.
doi:10.1145/78915.78917 fatcat:xyh4u47kkjhgxeafyhgedklqne