Searching Page-Images of Early Music Scanned with OMR: A Scalable Solution Using Minimal Absent Words

Tim Crawford, Golnaz Badkobeh, David Lewis
2018 Zenodo  
We define three retrieval tasks requiring efficient search of the musical content of a collection of ~32k page-images of 16th-century music to find: duplicates; pages with the same musical content; pages of related music. The images are subjected to Optical Music Recognition (OMR), which inevitably introduces errors. We encode pages as strings of diatonic pitch intervals, ignoring rests, to reduce the effect of such errors. We extract indices comprising lists of two kinds of 'word'. Approximate
... ng is done by counting the number of common words between a query page and those in the collection. The two word-types are (a) normal ngrams and (b) minimal absent words (MAWs). The latter have three important properties for our purpose: they can be built and searched in linear time, the number of MAWs generated tends to be smaller, and they preserve the structure and order of the text, obviating the need for expensive sorting operations. We show that retrieval performance of MAWs is comparable with ngrams, but with a marked speed improvement. We also show the effect of word length on retrieval. Our results suggest that an index of MAWs of mixed length provides a good method for these tasks which is scalable to larger collections.
doi:10.5281/zenodo.1492391 fatcat:f6z6xurlj5aetgj6buwoqetqsa
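
The two word-types named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the input encoding (integer diatonic step numbers, with None marking a rest) is an assumption, and the MAW enumeration is a naive one written for clarity rather than the linear-time construction the abstract refers to. A word counts as a minimal absent word here if it does not occur in the sequence while both its longest proper prefix and its longest proper suffix do.

```python
from typing import Optional, Sequence, Set, Tuple

Word = Tuple  # a word is a tuple of symbols (here, diatonic intervals)


def diatonic_intervals(steps: Sequence[Optional[int]]) -> Tuple[int, ...]:
    """Encode a page as successive diatonic intervals, skipping rests.

    `steps` is an assumed input format: one integer diatonic step number
    per note, with None standing for a rest.
    """
    notes = [s for s in steps if s is not None]
    return tuple(b - a for a, b in zip(notes, notes[1:]))


def factors(seq: Sequence, k: int) -> Set[Word]:
    """All distinct contiguous subsequences of length k (the k-grams)."""
    return {tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)}


def minimal_absent_words(seq: Sequence, min_len: int, max_len: int) -> Set[Word]:
    """Naive enumeration of MAWs with lengths in [max(2, min_len), max_len]."""
    alphabet = set(seq)
    maws: Set[Word] = set()
    for k in range(max(2, min_len), max_len + 1):
        shorter = factors(seq, k - 1)   # words of length k-1 that occur
        present = factors(seq, k)       # words of length k that occur
        for prefix in shorter:          # the prefix occurs by construction
            for symbol in alphabet:
                word = prefix + (symbol,)
                # absent as a whole, but its longest proper suffix occurs
                if word not in present and word[1:] in shorter:
                    maws.add(word)
    return maws


def common_words(query_index: Set[Word], page_index: Set[Word]) -> int:
    """Similarity score: number of words shared by two page indices."""
    return len(query_index & page_index)
```

Under these assumptions, a page index is either `factors(intervals, n)` for a chosen ngram length or `minimal_absent_words(intervals, min_len, max_len)` for a mixed-length MAW index, and candidate pages would be ranked by `common_words` against the query's index.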