Topic Models over Spoken Language

Niketan Pansare, Chris Jermaine, Peter Haas, Nitendra Rajput
2012 IEEE 12th International Conference on Data Mining
Virtually all work on topic modeling has assumed that the topics are to be learned over a text-based document corpus. However, there exist important applications where topic models must be learned over an audio corpus of spoken language. Unfortunately, speech-to-text programs can have very low accuracy. We therefore propose a novel topic model for spoken language that incorporates a statistical model of speech-to-text software behavior. Crucially, our model exploits the uncertainty numbers returned by the software. Our ideas apply to any domain in which it would be useful to build a topic model over data in which uncertainties are explicitly represented.
doi:10.1109/icdm.2012.90 dblp:conf/icdm/PansareJHR12 fatcat:24ofrbssnvfe7c4iotvnrgzbra