A Machine Learning Trainable Model to Assess the Accuracy of Probabilistic Record Linkage [chapter]

Robespierre Pita, Everton Mendonça, Sandra Reis, Marcos Barreto, Spiros Denaxas
2017 Lecture Notes in Computer Science  
Record linkage (RL) is the process of identifying and linking data that relates to the same physical entity across multiple heterogeneous data sources. Deterministic linkage methods rely on the presence of a set of common, uniquely identifying attributes across all sources, while probabilistic approaches use non-unique attributes and calculate similarity indexes for pairs of records. A key component of record linkage is accuracy assessment, the process of manually verifying and validating ... pairs to further refine linkage parameters and increase overall linkage accuracy. This process, however, is time-consuming and impractical when applied to large administrative data sources where millions of records are being linked. Additionally, it is potentially biased, as the gold standard used is often the intuition of the reviewer. In this paper, we discuss the evaluation of different self-training approaches (decision trees, naïve Bayes, logistic regression, random forest, linear support vector machines and gradient boosted trees) for assessing and refining the accuracy of probabilistic linkage. We used data sets extracted from large (more than 100 million individuals) Brazilian socioeconomic and public health care data sources. These models were evaluated using receiver operating characteristic plots, sensitivity, specificity and positive predictive values collected from ten-fold cross-validation. Results show that logistic regression outperforms the other classifiers and enables the creation of a generalized model achieving very accurate results.
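The evaluation measures named in the abstract (sensitivity, specificity and positive predictive value) can be illustrated with a minimal sketch. It assumes each candidate record pair carries a binary label (1 = true match, 0 = non-match) and a binary classifier decision. The function name `linkage_metrics` and the sample labels are hypothetical and are not taken from the chapter.

```python
def linkage_metrics(y_true, y_pred):
    """Compute sensitivity, specificity and PPV for binary match labels.

    y_true: reference labels per record pair (1 = true match, 0 = non-match).
    y_pred: classifier decisions for the same pairs, in the same order.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # matches correctly accepted
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-matches correctly rejected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-matches wrongly accepted
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # matches wrongly rejected

    nan = float("nan")  # returned when a denominator is zero
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        "ppv": tp / (tp + fp) if (tp + fp) else nan,
    }


# Hypothetical labels for ten candidate pairs (illustration only, not study data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = linkage_metrics(y_true, y_pred)
print(metrics)
```

In a ten-fold cross-validation setting, these measures would typically be computed on each held-out fold and then summarized across folds. How the chapter aggregates them is not stated in the abstract.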
doi:10.1007/978-3-319-64283-3_16 fatcat:etf55edz7ze7jjl534eosdpfbq