Uncertainty-aware generative models for inferring document class prevalence

Katherine Keith, Brendan O'Connor
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018
Prevalence estimation is the task of inferring the relative frequency of classes of unlabeled examples in a group: for example, the proportion of a document collection with positive sentiment. Previous work has focused on aggregating and adjusting discriminative individual classifiers to obtain prevalence point estimates. But imperfect classifier accuracy ought to be reflected in uncertainty over the predicted prevalence for scientifically valid inference. In this work, we present (1) a generative probabilistic modeling approach to prevalence estimation, and (2) the construction and evaluation of prevalence confidence intervals; in particular, we demonstrate that an off-the-shelf discriminative classifier can be given a generative re-interpretation, by backing out an implicit individual-level likelihood function, which can be used to conduct fast and simple group-level Bayesian inference. Empirically, we demonstrate our approach provides better confidence interval coverage than an alternative, and is dramatically more robust to shifts in the class prior between training and testing.
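The idea of backing out an implicit likelihood and using it for group-level Bayesian inference can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes binary classes, a classifier whose predicted probabilities are calibrated on the training distribution, a uniform prior over the prevalence, and a simple grid approximation to the posterior. The function names `prevalence_posterior` and `credible_interval` and all parameters are hypothetical. By Bayes' rule, p(x | y) is proportional to p(y | x) / p_train(y), so dividing each classifier probability by the training class prior gives a per-document likelihood up to a constant that does not depend on the prevalence.

```python
import numpy as np

def prevalence_posterior(probs, train_prior, grid_size=1001):
    """Grid posterior over the positive-class prevalence theta of a group.

    probs: classifier outputs p(y=1 | x_i) for each document in the group.
    train_prior: fraction of positive examples in the training data (assumed known).
    """
    probs = np.asarray(probs, dtype=float)
    theta = np.linspace(0.0, 1.0, grid_size)

    # Implicit likelihoods p(x_i | y), up to a per-document constant.
    lik_pos = probs / train_prior
    lik_neg = (1.0 - probs) / (1.0 - train_prior)

    # Each document is a mixture: theta * p(x|y=1) + (1 - theta) * p(x|y=0).
    mix = theta[:, None] * lik_pos[None, :] + (1.0 - theta[:, None]) * lik_neg[None, :]
    with np.errstate(divide="ignore"):
        log_post = np.log(mix).sum(axis=1)  # uniform prior adds only a constant

    # Normalize in a numerically stable way.
    log_post -= log_post.max()
    post = np.exp(log_post)
    post /= post.sum()
    return theta, post

def credible_interval(theta, post, level=0.9):
    """Equal-tailed interval read off the grid posterior's CDF."""
    cdf = np.cumsum(post)
    tail = (1.0 - level) / 2.0
    last = len(theta) - 1
    lo = theta[min(np.searchsorted(cdf, tail), last)]
    hi = theta[min(np.searchsorted(cdf, 1.0 - tail), last)]
    return lo, hi

# Hypothetical group: 30 documents scored 0.9 and 70 scored 0.1,
# from a classifier trained on balanced data.
scores = [0.9] * 30 + [0.1] * 70
theta, post = prevalence_posterior(scores, train_prior=0.5)
map_estimate = theta[np.argmax(post)]
lo, hi = credible_interval(theta, post, level=0.9)
```

Because the training prior is divided out before the group-level prevalence is inferred, a sketch like this does not bake the training class balance into the estimate, which is one way to read the robustness to prior shift that the abstract reports. Note also that the posterior mode here is not the raw fraction of documents scored above 0.5, since the model accounts for classifier error.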
doi:10.18653/v1/d18-1487 fatcat:xjtnrigwjjf7zo3lks3w67rgm4