Temporal Graph Record Linkage and k-Safe Approximate Match [article]

(:Unkn) Unknown, Justin Y. Shi, University, My
2020
Since the advent of electronic data processing, organizations have accrued vast amounts of data contained in multiple databases with no reliable global unique identifier. These databases were developed by different departments for different purposes at different times. Organizing and analyzing these data for human services requires linking records from all sources. RL (Record Linkage) is a process that connects records that refer to the same or a sufficiently similar entity from multiple heterogeneous databases. RL is a data- and compute-intensive, mission-critical process. The process must be efficient enough to process big data and effective enough to provide accurate matches. We have evaluated an RL system that is currently in use by a local health and human services department. We found that they were using the typical approach offered by Fellegi and Sunter with tuple-by-tuple processing, using Soundex as the primary approximate string matching method. Soundex has been found to be unreliable both as a phonetic and as an approximate string matching method. We found that their data, in many cases, have more than one value per field, suggesting that the data were queried from a 5NF database. Consider that if a woman has been married three times, she may have up to four last names on record. This query process produced more than one tuple per database/entity, apparently generating a Cartesian product of these data. In many cases, more than a dozen tuples were observed for a single database/entity. This approach is both ineffective and inefficient. An effective RL method should handle such multi-valued data without redundancy and use edit-distance for approximate string matching. However, due to its high computational complexity, edit-distance will not scale well to big data problems. We developed two methodologies for resolving the aforementioned issues: PSH and ALIM. PSH – The Probabilistic Signature Hash is a composite method that increases the speed of Damerau-Levenshtein edit-distance. It combines si [...]
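As a rough illustration of two points made in the abstract, the Python sketch below shows how multi-valued fields expand into a Cartesian product of tuples, and how a restricted Damerau-Levenshtein distance (the optimal string alignment variant) can compare value sets directly without that expansion. It is not the PSH or ALIM method, whose details are not given here. The function names (`osa_distance`, `flatten`, `best_field_distance`) and the sample record are hypothetical, chosen only for this example.

```python
from itertools import product


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions, and transpositions of
    adjacent characters. Runs in O(len(a) * len(b)) time, which is the
    cost that makes naive edit-distance hard to scale.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def flatten(record: dict) -> list:
    """Expand multi-valued fields into one tuple per combination."""
    keys = sorted(record)
    return [dict(zip(keys, combo)) for combo in product(*(record[k] for k in keys))]


def best_field_distance(values_a, values_b) -> int:
    """Compare two multi-valued fields directly: smallest pairwise distance."""
    return min(osa_distance(x, y) for x in values_a for y in values_b)


# Hypothetical entity with several values per field (assumed sample data).
record = {
    "first": ["Katherine", "Kathy"],
    "last": ["Smith", "Jones", "Brown", "Davis"],
    "phone": ["555-0100", "555-0101"],
}

tuples = flatten(record)  # 2 * 4 * 2 combinations
print(len(tuples))
print(best_field_distance(record["last"], ["Jonse"]))
```

Under these assumptions, the flattened form stores 16 tuples for a single entity, while the set-based comparison touches each stored value only once per field.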
doi:10.34944/dspace/3061 fatcat:eevkjx7iybcjjeqjndmt3zuyyq