Discovering Sociolinguistic Associations with Structured Sparsity

Jacob Eisenstein, Noah A. Smith, Eric P. Xing
2011 Annual Meeting of the Association for Computational Linguistics  
We present a method to discover robust and interpretable sociolinguistic associations from raw geotagged text data. Using aggregate demographic statistics about the authors' geographic communities, we solve a multi-output regression problem between demographics and lexical frequencies. By imposing a composite ℓ1,∞ regularizer, we obtain structured sparsity, driving entire rows of coefficients to zero. We perform two regression studies. First, we use term frequencies to predict demographic attributes; our method identifies a compact set of words that are strongly associated with author demographics. Next, we conjoin demographic attributes into features, which we use to predict term frequencies. The composite regularizer identifies a small number of features, which correspond to communities of authors united by shared demographic and linguistic properties.
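To illustrate how a row-wise max-norm penalty can zero out entire rows of a coefficient matrix, here is a minimal sketch of multi-output regression solved by proximal gradient descent. It is not the authors' implementation: the squared-error loss, the optimizer, the function names (`project_l1_ball`, `prox_linf`, `fit_row_sparse`), and the synthetic data are all assumptions made for illustration. The penalty is the sum over rows of the largest absolute coefficient in each row, and its proximal step is computed from a projection onto an ℓ1 ball.

```python
import numpy as np


def project_l1_ball(v, radius):
    """Euclidean projection of a 1-D vector v onto {u : ||u||_1 <= radius}."""
    if radius <= 0:
        return np.zeros_like(v)
    a = np.abs(v)
    if a.sum() <= radius:
        return v.copy()
    s = np.sort(a)[::-1]                      # magnitudes, descending
    css = np.cumsum(s)
    k = np.arange(1, len(s) + 1)
    rho = np.nonzero(s * k > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1)   # soft-threshold level
    return np.sign(v) * np.maximum(a - theta, 0.0)


def prox_linf(v, t):
    """Proximal operator of t * ||v||_inf, via the Moreau decomposition.

    If ||v||_1 <= t the result is exactly zero, which is what removes
    a whole row of coefficients at once.
    """
    return v - project_l1_ball(v, t)


def fit_row_sparse(X, Y, lam, n_iter=500):
    """Minimize 0.5 * ||Y - X B||_F^2 + lam * sum_i max_j |B[i, j]|.

    Assumed objective (hypothetical, for illustration only).
    """
    B = np.zeros((X.shape[1], Y.shape[1]))
    step = 1.0 / np.linalg.norm(X, 2) ** 2    # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ B - Y)
        Z = B - step * grad
        B = np.vstack([prox_linf(z, step * lam) for z in Z])
    return B


# Synthetic demo: 8 input features, 3 outputs, only the first 2 features matter.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 8))
B_true = np.zeros((8, 3))
B_true[:2] = rng.standard_normal((2, 3))
Y_demo = X_demo @ B_true + 0.1 * rng.standard_normal((200, 3))
B_hat = fit_row_sparse(X_demo, Y_demo, lam=20.0)
print("rows kept:", np.nonzero(np.abs(B_hat).max(axis=1) > 0)[0])
```

In this sketch each row of `B` holds the coefficients of one input feature across all outputs, so a zeroed row means the feature is dropped for every output simultaneously. The value `lam=20.0` is arbitrary; larger values remove more rows.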
dblp:conf/acl/EisensteinSX11 fatcat:zk3w2rybjvghzgsbwluukns2zi