A declarative query processing system for nowcasting

Dolan Antenucci, Michael R. Anderson, Michael Cafarella
2016 Proceedings of the VLDB Endowment  
Nowcasting is the practice of using social media data to quantify ongoing real-world phenomena. It has been used by researchers to measure flu activity, unemployment behavior, and more. However, the typical nowcasting workflow requires either slow and tedious manual searching of relevant social media messages or automated statistical approaches that are prone to spurious and low-quality results. In this paper, we propose a method for declaratively specifying a nowcasting model; this method
more » ... l; this method involves processing a user query over a very large social media database, which can take hours. Due to the human-in-the-loop nature of constructing nowcasting models, slow runtimes place an extreme burden on the user. Thus we also propose a novel set of query optimization techniques, which allow users to quickly construct nowcasting models over very large datasets. Further, we propose a novel query quality alarm that helps users estimate phenomena even when historical ground truth data is not available. These contributions allow us to build a declarative nowcasting data management system, RaccoonDB, which yields high-quality results in interactive time. We evaluate RaccoonDB using 40 billion tweets collected over five years. We show that our automated system saves work over traditional manual approaches while improving result quality-57% more accurate in our user study-and that its query optimizations yield a 424x speedup, allowing it to process queries 123x faster than a 300-core Spark cluster, using only 10% of the computational resources.
doi:10.14778/3021924.3021931 fatcat:3psxkkikc5b5rdm2rfuvzsz7yu