Record linkage

Nick Koudas, Sunita Sarawagi, Divesh Srivastava
2006 Proceedings of the 2006 ACM SIGMOD international conference on Management of data - SIGMOD '06  
The Fellegi-Sunter model formalized the approach of Newcombe et al. [NKAJ59]. Given two sets of records (relations) A and B, perform an approximate join. For each pair (a, b), the comparison vector γ(a,b) contains comparison features, e.g., same last names, same SSN, etc. Γ, the range of γ(a,b), is the comparison space.

Fellegi-Sunter Issues:
- Tuning:
  - Estimates for m(γ), u(γ)?
  - Training data: active learning for M, U labels
  - Semi- or unsupervised clustering: identify M, U clusters
- Setting µ, λ?
- Defining the comparison space Γ?
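The quantities named above, m(γ), u(γ) and the comparison vector γ(a,b), can be tied together in a small sketch of a likelihood-ratio decision rule. This is only an illustration under stated assumptions: the feature names, the probability values, the conditional independence of features, and the `upper`/`lower` cutoffs are all hypothetical and not taken from the tutorial. In the Fellegi-Sunter framework the cutoffs would be derived from the chosen error levels rather than hard-coded.

```python
import math

# Hypothetical per-feature agreement probabilities (illustrative values only):
#   M_PROBS[f] = P(feature f agrees | pair is a true match, class M)
#   U_PROBS[f] = P(feature f agrees | pair is a non-match, class U)
M_PROBS = {"last_name": 0.95, "ssn": 0.99, "zip": 0.90}
U_PROBS = {"last_name": 0.01, "ssn": 0.0001, "zip": 0.05}


def comparison_vector(a, b):
    """Build gamma(a, b): one boolean agreement flag per feature."""
    return {f: a.get(f) is not None and a.get(f) == b.get(f) for f in M_PROBS}


def log_ratio(gamma):
    """Sum of per-feature log(m/u) weights.

    Assumes features are conditionally independent given M or U,
    which is a simplifying assumption of this sketch.
    """
    total = 0.0
    for feature, agrees in gamma.items():
        m, u = M_PROBS[feature], U_PROBS[feature]
        if agrees:
            total += math.log(m / u)
        else:
            total += math.log((1 - m) / (1 - u))
    return total


def classify(a, b, upper=5.0, lower=-5.0):
    """Three-way decision; `upper` and `lower` are hypothetical cutoffs."""
    score = log_ratio(comparison_vector(a, b))
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "possible match"
```

Pairs that land between the two cutoffs are left as possible matches, which is where tuning of the estimates and thresholds matters most.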
- ... ance metrics between records/fields
- Efficiency/Scalability

Soundex Encoding
- A phonetic algorithm that indexes names by their sounds when pronounced in English.
- Consists of the first letter of the name followed by three digits. The digits encode similar-sounding consonants.
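A minimal sketch of the encoding just described, assuming the common American Soundex conventions: vowels and `y` are dropped but separate repeated codes, `h` and `w` are skipped without separating them, and short codes are padded with zeros. The function name `soundex` and the handling of non-letter characters are choices made for this illustration.

```python
# Consonant groups that share a digit under American Soundex.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of `name` (empty string if no letters)."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    prev = CODES.get(first, "")  # code of the first letter still suppresses repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of same-coded consonants
        code = CODES.get(ch, "")  # vowels and y map to "" and reset the run
        if code and code != prev:
            encoded += code
        prev = code
    return (encoded + "000")[:4]  # pad with zeros, keep letter plus three digits
```

With this sketch, `soundex("Robert")` and `soundex("Rupert")` both yield `R163`, which illustrates how similar-sounding names collapse to the same key.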
