Adaptive Filtering for Efficient Record Linkage [chapter]

Lifang Gu, Rohan Baxter
2004 Proceedings of the 2004 SIAM International Conference on Data Mining  
The process of identifying record pairs that represent the same real-world entity in multiple databases, commonly known as record linkage, is one of the important initial steps in many data mining applications. Record linkage of millions of records is a computationally expensive task. Various blocking methods have been used in record linkage systems to reduce the number of record pairs for comparison. A good blocking key is critical to the success of a blocking method and will ideally result in a lot of small blocks. However, in practice, there are almost always large blocks no matter how good the blocking key is. For example, when blocking on surname for an Anglo-Celtic population, 'Smith' and 'Taylor' are populous and result in very large block sizes. The efficiency of a blocking method is hindered by these large blocks since the resulting number of record pairs is dominated by the sizes of these large blocks. In this paper, we present an adaptive filtering algorithm to post-process large blocks to enhance the blocking efficiency. Experimental results show that our filtering algorithm can reduce the number of record pairs produced by the standard blocking method by 88% on a small real-world data set. The algorithm also reduces the number of record pairs generated by a 3-pass standard blocking method by 50% on several synthetic test data sets, with minimal loss of accuracy.
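To illustrate why large blocks dominate the comparison cost, the following is a minimal sketch of standard blocking on surname, assuming a single toy data set compared against itself. The records, the function names `standard_blocking` and `pair_count`, and the lowercase-surname key are all hypothetical choices made for illustration. This is not the paper's adaptive filtering algorithm, which the abstract does not specify.

```python
from collections import defaultdict


def standard_blocking(records, key):
    """Group record ids by blocking key (hypothetical helper)."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key(rec)].append(rec_id)
    return dict(blocks)


def pair_count(blocks):
    """Count candidate pairs: a block of size n yields n*(n-1)/2 pairs."""
    return sum(len(ids) * (len(ids) - 1) // 2 for ids in blocks.values())


# Toy data (invented for this sketch): surnames only.
records = {
    1: {"surname": "Smith"},
    2: {"surname": "Smith"},
    3: {"surname": "Smith"},
    4: {"surname": "Smith"},
    5: {"surname": "Taylor"},
    6: {"surname": "Taylor"},
    7: {"surname": "Gu"},
    8: {"surname": "Baxter"},
}

# Assumed blocking key: the lowercased surname.
blocks = standard_blocking(records, key=lambda r: r["surname"].lower())

n = len(records)
all_pairs = n * (n - 1) // 2           # pairs without any blocking
blocked_pairs = pair_count(blocks)     # pairs after standard blocking
largest = max(len(ids) for ids in blocks.values())
largest_pairs = largest * (largest - 1) // 2

print(f"without blocking: {all_pairs} pairs")
print(f"with blocking:    {blocked_pairs} pairs")
print(f"largest block contributes {largest_pairs} of {blocked_pairs} pairs")
```

Because the pair count grows quadratically with block size, the single 'Smith' block accounts for most of the remaining comparisons in this toy example, which is the situation a post-processing step on large blocks is meant to address.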
doi:10.1137/1.9781611972740.50 dblp:conf/sdm/GuB04 fatcat:ssl42uqrkzhazb7x333smkxz3a