Dagger: A Data (not code) Debugger

El Kindi Rezig, Lei Cao, Giovanni Simonini, Maxime Schoemans, Samuel Madden, Nan Tang, Mourad Ouzzani, Michael Stonebraker
2020 Conference on Innovative Data Systems Research  
With the democratization of data science libraries and frameworks, most data scientists manage and generate their data analytics pipelines using a collection of scripts (e.g., Python, R). This marks a shift from traditional applications that communicate back and forth with a DBMS that stores and manages the application data. While code debuggers have reached impressive maturity over the past decades, they fall short in assisting users to explore data-driven what-if scenarios (e.g., split the training set into two and build two ML models). Those scenarios, while doable programmatically, are a substantial burden for users to manage themselves. Dagger (Data Debugger) is an end-to-end data debugger that abstracts key data-centric primitives to enable users to quickly identify and mitigate data-related problems in a given pipeline. Dagger was motivated by a series of interviews we conducted with data scientists across several organizations. A preliminary version of Dagger has been incorporated into Data Civilizer 2.0 to help physicians at the Massachusetts General Hospital process complex pipelines.
dblp:conf/cidr/RezigCSSMTOS20 fatcat:ke7jauf24rboxhcvdjjddvqz6y