A convergent gambling estimate of the entropy of English

T. Cover, R. King
1978 IEEE Transactions on Information Theory  
Abstract: In his original paper on the subject, Shannon found upper and lower bounds for the entropy of printed English based on the number of trials required for a subject to guess subsequent symbols in a given text. The guessing approach precludes asymptotic consistency of either the upper or lower bounds except for degenerate ergodic processes. Shannon's technique of guessing the next symbol is altered by having the subject place sequential bets on the next symbol of text. If $S_n$ denotes the subject's capital after $n$ bets at 27 for 1 odds, and if it is assumed that the subject knows the underlying probability distribution for the process $X$, then the entropy estimate is $H_n(X) = \left(1 - \frac{1}{n}\log_{27} S_n\right)\log_2 27$ bits/symbol. If the subject does not know the true probability distribution for the stochastic process, then $H_n(X)$ is an asymptotic upper bound for the true entropy. If $X$ is stationary, $E H_n(X) \to H(X)$, $H(X)$ being the true entropy of the process. Moreover, if $X$ is ergodic, then by the Shannon-McMillan-Breiman theorem $H_n(X) \to H(X)$ with probability one. Preliminary indications are that English text has an entropy of approximately 1.3 bits/symbol, which agrees well with Shannon's estimate.
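The estimate stated in the abstract, H_n(X) = (1 - (1/n) log_27 S_n) log_2 27, simplifies to log_2 27 - (1/n) log_2 S_n. The following Python sketch illustrates that arithmetic only. It assumes proportional betting: the subject spreads all current capital over the 27 symbols, and capital is multiplied by 27 times the fraction placed on the symbol that actually occurs. The abstract does not spell out this rule. The function names, the 27-symbol alphabet (26 letters plus space), and both example bettors are hypothetical stand-ins for a human subject, not the authors' procedure.

```python
import math
from collections import Counter

# Assumption: 27 symbols = 26 lowercase letters plus space (27-for-1 odds).
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def entropy_estimate(log2_capital, n, alphabet_size=len(ALPHABET)):
    """Entropy estimate in bits/symbol from log2 of the capital S_n.

    (1 - (1/n) log_27 S_n) * log2(27) equals log2(27) - (1/n) log2(S_n).
    """
    return math.log2(alphabet_size) - log2_capital / n


def gamble(text, bet_fn, alphabet_size=len(ALPHABET)):
    """Return log2(S_n) for initial capital S_0 = 1 under proportional betting.

    bet_fn(past) must return a dict mapping every symbol to the fraction of
    capital bet on it; fractions are positive and sum to 1.
    """
    log2_capital = 0.0
    for i, symbol in enumerate(text):
        bets = bet_fn(text[:i])
        # Payoff at alphabet_size-for-1 odds on the fraction that won.
        log2_capital += math.log2(alphabet_size * bets[symbol])
    return log2_capital


def uniform_bettor(past):
    """Hypothetical bettor with no knowledge: equal bets on every symbol."""
    return {s: 1.0 / len(ALPHABET) for s in ALPHABET}


def adaptive_bettor(past):
    """Hypothetical bettor: add-one smoothed frequencies of symbols seen so far."""
    counts = Counter(past)
    total = len(past) + len(ALPHABET)
    return {s: (counts[s] + 1) / total for s in ALPHABET}


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog " * 20
    for name, bettor in [("uniform", uniform_bettor), ("adaptive", adaptive_bettor)]:
        h = entropy_estimate(gamble(sample, bettor), len(sample))
        print(f"{name}: {h:.3f} bits/symbol")
```

A uniform bettor never gains or loses capital, so its estimate stays at log_2 27, about 4.75 bits/symbol. A bettor whose bets better match the text gains capital and drives the estimate down, which is consistent with the abstract's statement that an imperfect bettor yields an asymptotic upper bound. The input text must contain only symbols from the assumed alphabet.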
doi:10.1109/tit.1978.1055912 fatcat:qsulklzqgneybkoyeo7vj3ta5y