The Remarkable Benefit of User-Level Aggregation for Lexical-based Population-Level Predictions

Salvatore Giorgi, Daniel Preoţiuc-Pietro, Anneke Buffone, Daniel Rieman, Lyle Ungar, H. Andrew Schwartz
2018 Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing  
Nowcasting based on social media text promises to provide unobtrusive and near real-time predictions of community-level outcomes. These outcomes are typically regarding people, but the data is often aggregated without regard to users in the Twitter populations of each community. This paper describes a simple yet effective method for building community-level models using Twitter language aggregated by user. Results on four different U.S. county-level tasks, spanning demographic, health, and psychological outcomes, show large and consistent improvements in prediction accuracies (e.g. from Pearson r = .73 to .82 for median income prediction or r = .37 to .47 for life satisfaction prediction) over the standard approach of aggregating all tweets. We make our aggregated and anonymized community-level data, derived from 37 billion tweets (over 1 billion of which were mapped to counties), available for research.
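
The contrast the abstract draws between aggregating all tweets and aggregating by user can be illustrated with a minimal sketch. The abstract does not specify features, weighting, or filtering, so everything here is an assumption made for illustration: whitespace-tokenized words as features, relative frequencies as values, equal weight per user, and the hypothetical function names `tweet_level` and `user_level`.

```python
from collections import Counter, defaultdict


def _normalize(counter):
    """Turn raw word counts into relative frequencies that sum to 1."""
    total = sum(counter.values())
    return {word: n / total for word, n in counter.items()}


def tweet_level(tweets):
    """Pool every tweet in a county, then normalize once (the 'all tweets' baseline)."""
    counts = defaultdict(Counter)
    for county, _user, text in tweets:
        counts[county].update(text.lower().split())
    return {county: _normalize(c) for county, c in counts.items()}


def user_level(tweets):
    """Normalize per user first, then average the users within each county.

    Assumption: every user counts equally, however many tweets they posted.
    """
    per_user = defaultdict(Counter)
    for county, user, text in tweets:
        per_user[(county, user)].update(text.lower().split())

    county_sums = defaultdict(lambda: defaultdict(float))
    n_users = Counter()
    for (county, _user), counts in per_user.items():
        n_users[county] += 1
        for word, freq in _normalize(counts).items():
            county_sums[county][word] += freq

    return {
        county: {word: s / n_users[county] for word, s in words.items()}
        for county, words in county_sums.items()
    }


# Toy data (hypothetical): one prolific user and one occasional user.
demo = [
    ("A", "u1", "coffee"),
    ("A", "u1", "coffee"),
    ("A", "u1", "coffee"),
    ("A", "u2", "tea"),
]
print(tweet_level(demo)["A"])  # {'coffee': 0.75, 'tea': 0.25}
print(user_level(demo)["A"])   # {'coffee': 0.5, 'tea': 0.5}
```

In the toy data the prolific user dominates the pooled estimate, while the user-level estimate treats the county as two people with different vocabularies. This shows only the aggregation step; the prediction models and evaluation reported in the abstract are not reproduced here.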
doi:10.18653/v1/d18-1148 dblp:conf/emnlp/GiorgiPBRUS18 fatcat:5nlvcvwa3va4zhwvuqgqrcngym