Subject metadata enrichment using statistical topic models

David Newman, Kat Hagedorn, Chaitanya Chemudugunta, Padhraic Smyth
2007 Proceedings of the 2007 conference on Digital libraries - JCDL '07  
Creating a collection of metadata records from disparate and diverse sources often results in uneven, unreliable and variable quality subject metadata. Having uniform, consistent and enriched subject metadata allows users to more easily discover material, browse the collection, and limit keyword search results by subject. We demonstrate how statistical topic models are useful for subject metadata enrichment. We describe some of the challenges of metadata enrichment on a huge scale (10 million metadata records from 700 repositories in the OAIster Digital Library) when the metadata is highly heterogeneous (metadata about images and text, and both cultural heritage material and scientific literature). We show how to improve the quality of the enriched metadata, using both manual and statistical modeling techniques. Finally, we discuss some of the challenges of the production environment, and demonstrate the value of the enriched metadata in a prototype portal.
doi:10.1145/1255175.1255248 dblp:conf/jcdl/NewmanHCS07 fatcat:zk5kbwnz6bf63omdkatpkw6rp4