A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2020.
Data cleaning can be naturally framed as probabilistic inference in a generative model, combining a prior distribution over ground-truth databases with a likelihood that models the noisy channel by which the data are filtered and corrupted to yield incomplete, dirty, and denormalized datasets. Based on this view, we present PClean, a probabilistic programming language for leveraging dataset-specific knowledge to clean and normalize dirty data. PClean is powered by three modeling and inference ...
arXiv:2007.11838v4 fatcat:navjwv7vpfbzhaq4mugkbucvve
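
The framing in the abstract, a prior over clean values combined with a noisy-channel likelihood, can be illustrated with a toy example. The following is a minimal Python sketch and not PClean itself: the prior table `PRIOR`, the `typo_rate` parameter, and the edit-distance channel are all assumptions chosen for illustration. It scores each candidate clean value by prior times likelihood and returns the most probable one.

```python
# Toy Bayesian cleaning of a single field. Everything here is hypothetical:
# the prior frequencies and the typo channel are illustrative assumptions,
# not values or models taken from the PClean paper.
PRIOR = {"Boston": 0.5, "Austin": 0.3, "Houston": 0.2}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character of a
                curr[j - 1] + 1,           # insert a character of b
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def likelihood(observed: str, clean: str, typo_rate: float = 0.1) -> float:
    """Unnormalized noisy channel: each edit multiplies the weight by typo_rate."""
    return typo_rate ** edit_distance(observed, clean)


def posterior(observed: str) -> dict:
    """Posterior over clean values: prior times likelihood, then normalized."""
    weights = {c: p * likelihood(observed, c) for c, p in PRIOR.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def clean_value(observed: str) -> str:
    """Return the clean value with the highest posterior probability."""
    post = posterior(observed)
    return max(post, key=post.get)
```

Calling `clean_value("Bostn")` selects the candidate whose prior and typo likelihood jointly best explain the observed string. A real system would infer over whole records and relations, not a single field with a fixed candidate list.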