Models, Inference, and Implementation for Scalable Probabilistic Models of Text [thesis]

Ke Zhai
Unsupervised probabilistic Bayesian models are powerful tools for statistical analysis, especially in the area of information retrieval, document analysis and text processing. Despite their success, unsupervised probabilistic Bayesian models are often slow in inference due to inter-entangled mutually dependent latent variables. In addition, the parameter space of these models is usually very large. As the data from various different media sources--for example, internet, electronic books,
more » ... films, etc--become widely accessible, lack of scalability for these unsupervised probabilistic Bayesian models becomes a critical bottleneck. The primary focus of this dissertation is to speed up the inference process in unsupervised probabilistic Bayesian models. There are two common solutions to scale the algorithm up to large data: parallelization or streaming. The former achieves scalability by distributing the data and the computation to multiple machines. The latter assumes data come in a stream and updates the model gradually after seeing each data observation. It is able to scale to larger datasets because it usually takes only one pass over the entire data. In this dissertation, we examine both approaches. We first demonstrate the effectiveness of the parallelization approach on a class of unsupervised Bayesian models--topic models, which are exemplified by latent Dirichlet allocation (LDA). We propose a fast parallel implementation using variational inference on the MapRe- duce framework, referred to as Mr. LDA. We show that parallelization enables topic models to handle significantly larger datasets. We further show that our implementation--unlike highly tuned and specialized implementations--is easily extensible. We demonstrate two extensions possible with this scalable framework: 1) informed priors to guide topic discovery and 2) extracting topics from a multilingual corpus. We propose polylingual tree-based topic models to infer topics in multilingual corpora. We then propose three different inference me [...]
doi:10.13016/m2p60x fatcat:6a3iy2pulrdpvpozudhvnxhvze