Entropy reduction of English text using variable length grouping [report]

Vincent Ast
2000 unpublished
It is known that the entropy of English text can be reduced by arranging the text into groups of two or more letters each. The higher the order of the grouping, the greater the entropy reduction. Using this principle in a computer text-compression system brings about difficulties, however, because the number of entries required in the translation table increases exponentially with group size. This experiment examined the possibility of using a translation table containing only 2 ... entries of all group sizes, with the expectation of obtaining a substantial entropy reduction with a relatively small table. An expression was derived showing that the groups which should be included in the table are not necessarily those that occur frequently, but rather those that occur more frequently than would be expected from random occurrence. This was complicated by the fact that any grouping affects the frequency of occurrence of many other related groups. An algorithm was developed in which the table starts with the 26 letters of the alphabet and the space. Entries, which consist of letter groups, complete words, and word groups, are then added one by one based on the selection criterion. After each entry is added, adjustments are made to account for the interaction of the groups. This algorithm was programmed on a computer and run on a text sample of about 7000 words. The results showed that the entropy could easily be reduced to 3 bits per letter with a table of fewer than 200 entries. With about 500 entries the entropy could be reduced to about 2.5 bits per letter. About 60% of the table was composed of letter groups, 42% of single words, and 8% of word groups, which indicated that the extra complications involved in handling word groups may not be worthwhile. A visual examination of the table showed that many entries were very much oriented to the particular sample. This may or may not be desirable, depending on the intended use of the translating system.
doi:10.15760/etd.687 fatcat:yh3pc24jzrauljzoo7zv2qzgoe
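
The table-building procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the author's program: the abstract does not give the derived selection expression, so the score used here (observed pair count times the log ratio of observed to expected count) is an assumed stand-in for "occurs more frequently than would be expected from random occurrence". The function names `build_table`, `merge_pair`, and `entropy_per_letter` are hypothetical, the initial table is taken to be the distinct characters of the sample rather than a fixed 26 letters plus space, and re-tokenizing the text after every merge is only one simple way to account for the interaction between groups.

```python
import math
from collections import Counter


def entropy_per_letter(tokens, n_letters):
    """Zeroth-order entropy of the token stream, in bits per original character."""
    counts = Counter(tokens)
    total = len(tokens)
    bits_per_token = -sum(c / total * math.log2(c / total) for c in counts.values())
    return bits_per_token * total / n_letters


def merge_pair(tokens, a, b):
    """Replace each non-overlapping adjacent occurrence of (a, b) with the group a + b."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def build_table(text, max_entries):
    """Greedily grow a translation table, one group per iteration.

    Assumed criterion (not taken from the thesis): pick the adjacent pair
    maximizing n_ab * log2(n_ab / expected), where expected is the count
    predicted if the two entries occurred independently.
    """
    tokens = list(text)      # start from single characters
    table = set(tokens)      # initial table: the characters seen in the sample
    while len(table) < max_entries:
        unigrams = Counter(tokens)
        pairs = Counter(zip(tokens, tokens[1:]))
        total = len(tokens)
        best, best_score = None, 0.0
        for (a, b), n_ab in pairs.items():
            expected = unigrams[a] * unigrams[b] / total
            score = n_ab * math.log2(n_ab / expected)
            if score > best_score:
                best, best_score = (a, b), score
        if best is None:     # no pair occurs more often than chance predicts
            break
        # Re-tokenizing updates all related counts before the next selection.
        tokens = merge_pair(tokens, *best)
        table.add(best[0] + best[1])
    return table, tokens


if __name__ == "__main__":
    sample = "the cat and the hat and the bat " * 20
    table, tokens = build_table(sample, max_entries=40)
    print(len(table), "entries")
    print(round(entropy_per_letter(tokens, len(sample)), 3), "bits per letter")
```

Because groups may span spaces, the same loop can produce letter groups, whole words, and word groups, which matches the three kinds of entries mentioned in the abstract. The entropy figure printed here is a zeroth-order estimate over table entries and is not meant to reproduce the reported results.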