TensorFlow Data Validation: Data Analysis and Validation in Continuous ML Pipelines

Emily Caveness, Paul Suganthan G. C., Zhuo Peng, Neoklis Polyzotis, Sudip Roy, Martin Zinkevich
2020 Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data  
Machine Learning (ML) research has primarily focused on improving the accuracy and efficiency of the training algorithms while paying much less attention to the equally important problem of understanding, validating, and monitoring the data fed to ML. Irrespective of the ML algorithms used, data errors can adversely affect the quality of the generated model. This indicates that we need to adopt a data-centric approach to ML that treats data as a first-class citizen, on par with algorithms and infrastructure, which are the typical building blocks of ML pipelines. In this demonstration we showcase TensorFlow Data Validation (TFDV), a scalable data analysis and validation system for ML that we have developed at Google and recently open-sourced. This system is deployed in production as an integral part of TFX [5], an end-to-end machine learning platform at Google. It is used by hundreds of product teams at Google and has received significant attention from the open-source community as well.
CCS CONCEPTS • Information systems → Data management systems.
doi:10.1145/3318464.3384707 dblp:conf/sigmod/CavenessCPP0Z20 fatcat:agjc4n4f5jgw3kfmatmmvcueeu