Industrial Paper: Large-scale Record Linkage of Web-based Place Entities

Vinícius M. R. Cousseau, Luciano Barbosa
2019 Anais do Simpósio Brasileiro de Banco de Dados (SBBD)  
Extracting data about entities from the Web has become commonplace in industry and academia alike. Web-based entities, however, are inherently noisy and, as such, introduce several normalization issues that must be addressed in order to maintain a clean database. Record linkage, which refers to the detection of replicated data from possibly multiple sources, is one of the most critical of those issues. This paper presents a practical approach for solving the record linkage problem in the places data domain at an industrial scale, describing both a model which reaches a normalized Gini coefficient of 0.92 and an architecture that supports large-scale processing.
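
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a normalized Gini coefficient can be computed for a binary match/non-match classifier. It assumes binary labels and uses the identity normalized Gini = 2 * AUC - 1; the abstract does not state how the authors computed their figure, so the function names `roc_auc` and `normalized_gini` and the sample data are purely illustrative.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formula.

    labels: iterable of 0/1 ground-truth values (1 = records match).
    scores: iterable of model scores, higher meaning "more likely a match".
    Tied scores receive their average rank.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # Ranks are 1-based, so the tied run covers ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            if pairs[k][1] == 1:
                rank_sum_pos += avg_rank
        i = j

    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def normalized_gini(labels, scores):
    """Normalized Gini for binary labels, assumed here to equal 2 * AUC - 1."""
    return 2.0 * roc_auc(labels, scores) - 1.0


if __name__ == "__main__":
    # Hypothetical candidate pairs: 1 = same place, 0 = different places.
    y_true = [0, 0, 1, 1]
    y_score = [0.1, 0.4, 0.35, 0.8]
    print(normalized_gini(y_true, y_score))
```

Under this definition a score of 1.0 means every matching pair is ranked above every non-matching pair, 0.0 corresponds to random ranking, and negative values indicate a ranking worse than random.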
doi:10.5753/sbbd.2019.8820 dblp:conf/sbbd/CousseauB19 fatcat:kels3x5fefcilk35aojadex7q4