ProvDB

Hui Miao, Amit Chavan, Amol Deshpande
2017 Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics - HILDA'17  
As data-driven methods are becoming pervasive in a wide variety of disciplines, there is an urgent need to develop scalable and sustainable tools to simplify the process of data science, to make it easier for the users to keep track of the analyses being performed and datasets being generated, and to enable the users to understand and analyze the workflows. In this paper, we describe our vision of a unified provenance and metadata management system to support lifecycle management of complex collaborative data science workflows. We argue that the information about the analysis processes and data artifacts can, and should be, captured in a semi-passive manner; and we show that querying and analyzing this information can not only simplify bookkeeping and debugging tasks but also enable a rich new set of capabilities like identifying flaws in the data science process itself. It can also significantly reduce the user time spent in fixing post-deployment problems through automated analysis and monitoring. We have implemented a prototype system, ProvDB, on top of git and Neo4j, and we describe its key features and capabilities.
doi:10.1145/3077257.3077267 dblp:conf/sigmod/MiaoCD17 fatcat:ofr25bj2trewri4ksbwtq3wxgu