Reliable Editions from Unreliable Components: Estimating Ebooks from Print Editions Using Profile Hidden Markov Models [article]

A. B. Riddell
2022 arXiv pre-print
A profile hidden Markov model, a popular model in biological sequence analysis, can be used to model related sequences of characters transcribed from books, magazines, and other printed materials. This paper documents one application of a profile HMM: automatically producing an ebook edition from distinct print editions. The resulting ebook has virtually all the desired properties found in a publisher-prepared ebook, including accurate transcription and an absence of print artifacts such as end-of-line hyphenation and running headers. The technique, which has particular benefits for readers and libraries that require books in an accessible format, is demonstrated using seven copies of a nineteenth-century novel.
arXiv:2204.01638v2 fatcat:owaprowa3raijdhj62asdtr3se