Interactive and Deterministic Data Cleaning

Jian He, Enzo Veltri, Donatello Santoro, Guoliang Li, Giansalvatore Mecca, Paolo Papotti, Nan Tang
2016 Proceedings of the 2016 International Conference on Management of Data - SIGMOD '16  
We present Falcon, an interactive, deterministic, and declarative data cleaning system, which uses SQL update queries as the language to repair data. Falcon does not rely on the existence of a set of pre-defined data quality rules. On the contrary, it encourages users to explore the data, identify possible problems, and make updates to fix them. Bootstrapped by one user update, Falcon guesses a set of possible SQL update queries that can be used to repair the data. The main technical challenge addressed in this paper consists in finding a set of SQL update queries that is minimal in size and at the same time fixes the largest number of errors in the data. We formalize this problem as a search in a lattice-shaped space. To guarantee that the chosen updates are semantically correct, Falcon navigates the lattice by interacting with users to gradually validate the set of SQL update queries. Besides using traditional one-hop based traverse algorithms (e.g., BFS or DFS), we describe novel multi-hop search algorithms such that Falcon can dive over the lattice and conduct the search efficiently. Our novel search strategy is coupled with a number of optimization techniques to further prune the search space and efficiently maintain the lattice. We have conducted extensive experiments using both real-world and synthetic datasets to show that Falcon can effectively communicate with users in data repairing.
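To make the lattice-shaped search space concrete, the following is a minimal sketch of how one user update might be generalized into candidate SQL update queries. It rests on an assumption the abstract does not spell out: each candidate's WHERE clause is a conjunction of equalities over a subset of the repaired row's other attributes, so candidates are ordered by set inclusion. The names `candidate_updates` and `affected`, and the table and column names used with them, are hypothetical and are not Falcon's API.

```python
from itertools import combinations


def candidate_updates(table, row, attr, new_value):
    """Enumerate candidate UPDATE queries that generalize one user repair.

    Assumption (illustrative only): a candidate's WHERE clause is a
    conjunction of equalities on a subset of the repaired row's other
    attributes. Subsets ordered by inclusion form a lattice, with the
    empty subset (update every row) as the most general query.
    """
    others = sorted(k for k in row if k != attr)
    candidates = []
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            # repr() stands in for SQL literal quoting; real code should
            # use parameterized queries instead.
            where = " AND ".join(f"{k} = {row[k]!r}" for k in subset) or "TRUE"
            sql = f"UPDATE {table} SET {attr} = {new_value!r} WHERE {where}"
            candidates.append((subset, sql))
    return candidates


def affected(rows, row, subset):
    """Return the rows a candidate would touch: those that agree with
    the repaired row on every attribute in `subset`."""
    return [r for r in rows if all(r[k] == row[k] for k in subset)]
```

With n other attributes this produces 2^n candidates, which suggests why exhaustive validation by a user is impractical and why the order in which the lattice is traversed matters.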
doi:10.1145/2882903.2915242 dblp:conf/sigmod/HeVSLMPT16 fatcat:ob7xk77gofgc5mynsfjkvhzx5i