PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming [article]

Alexander K. Lew, Monica Agrawal, David Sontag, Vikash K. Mansinghka
2020 arXiv   pre-print
Based on this view, we present PClean, a probabilistic programming language for leveraging dataset-specific knowledge to clean and normalize dirty data.  ...  to state-of-the-art data-cleaning systems (unlike generic PPL inference given the same runtime); and scale to real-world datasets with millions of records.  ...  This paper presents PClean, a domain-specific generative probabilistic programming language (PPL) for Bayesian data cleaning.  ... 
arXiv:2007.11838v4 fatcat:navjwv7vpfbzhaq4mugkbucvve

Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts [article]

Allen Kim, Charuta Pethe, Naoya Inoue, Steve Skiena
2021 arXiv   pre-print
The inconsistencies found by aligning pairs of scans of the same underlying work provide training data to build models for detecting and correcting errors.  ...  Substantial amounts of work are required to clean large collections of digitized books for NLP analysis, both because of the presence of errors in the scanned text and the presence of duplicate volumes  ...  Methods such as PClean work off of Bayesian principles and probabilistic programming to identify likely errors in a specific domain.  ...
arXiv:2110.11934v1 fatcat:y5bf6bzykngipdpauzxxhtrcbi

Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts

Allen Kim, Charuta Pethe, Naoya Inoue, Steve Skiena
2021 Findings of the Association for Computational Linguistics: EMNLP 2021   unpublished
The inconsistencies found by aligning pairs of scans of the same underlying work provide training data to build models for detecting and correcting errors.  ...  Substantial amounts of work are required to clean large collections of digitized books for NLP analysis, both because of the presence of errors in the scanned text and the presence of duplicate volumes  ...  Methods such as PClean work off of Bayesian principles and probabilistic programming to identify likely errors in a specific domain.  ...
doi:10.18653/v1/2021.findings-emnlp.356 fatcat:usexxnmierglzhyneynjdos7mi

A risk-based maintenance methodology of industrial systems

B J Jones
2017
The technique combined the use of probabilistic data together with objective and subjective data.  ...  At the other end of the scale is high volume, continuous manufacturing.  ...  Bayesian network modelling also allows these influencing events to change and update depending on the influencing data available at any given time, thus changing the failure rate or probability of failure  ...
doi:10.24377/ljmu.t.00005904 fatcat:ijf4no5omzei5k4pstarmqqy34