Breaking the problem into pieces: pre-clustering on-the-fly with the NGLMS

James Farrow
2017 International Journal of Population Data Science  
Objectives
The SA.NT DataLink Next Generation Linkage Management System (NGLMS) stores linked data in the form of a graph (in the computer science sense) composed of nodes (records) and edges (record relationships or similarities). This permits efficient pre-clustering techniques based on transitive closure to form groups of records which relate to the same individual (or which meet other selection criteria).

Approach
Only information known (or at least highly likely) to be relevant is extracted from the graph as superclusters. This operation is computationally inexpensive when the underlying information is stored as a graph, and for typical clusters it may be possible to perform it on-the-fly. More computationally intensive analysis and/or further clustering may then be performed on this smaller subgraph. Canopy clustering and the use of blocking to reduce pairwise comparisons are expressions of the same type of approach.

Results
Subclusters for manual review based on transitive closure are typically computationally inexpensive enough to extract from the NGLMS that they are extracted on-demand during manual clerical review activities. There is no need to pre-calculate these clusters. Once extracted, these smaller data groupings undergo further analysis for visualisation and for presentation for review and quality analysis. More computationally expensive techniques can be used at this point to prepare data for visualisation or to provide hints to manual reviewers. Extracting high-recall groups of data records for review, but providing them to reviewers grouped further into high-precision groups as the result of a second pass, has reduced the time taken by clerical reviewers at SA.NT DataLink to manually review a group by 30–40%. The reviewers are able to manipulate whole groups of related records at once rather than individual records.

Conclusion
Pre-clustering reduces the computational cost associated with higher-order clustering and analysis algorithms. Algorithms which scale by n^2 (or more) are typical in comparison scenarios. By breaking the problem into pieces the computational cost can be reduced. Typically, breaking a problem into many pieces reduces the cost in proportion to the number of pieces the problem can be broken into. This cost reduction can make techniques possible which would otherwise be computationally prohibitive.
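The pre-clustering idea described above can be illustrated with a minimal sketch. It is not the NGLMS implementation: the adjacency-set representation, the function names `build_graph`, `extract_cluster` and `pairwise_comparisons`, and the record identifiers are all assumptions made for illustration. The sketch extracts the transitive-closure group around one seed record by breadth-first search, touching only the records reachable from that seed, and then compares the number of pairwise comparisons needed with and without pre-clustering.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (record, record) link pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def extract_cluster(adjacency, seed):
    """Return every record reachable from `seed` (its transitive closure).

    Only the seed's own connected component is visited, which is why the
    extraction can plausibly be done on demand for typical clusters.
    """
    seen = {seed}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        # .get avoids adding empty entries for unknown records
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def pairwise_comparisons(n):
    """Number of unordered pairs among n records: n(n-1)/2, i.e. O(n^2)."""
    return n * (n - 1) // 2


# Hypothetical links between nine records forming three groups.
edges = [
    ("r1", "r2"), ("r2", "r3"),
    ("r4", "r5"),
    ("r6", "r7"), ("r7", "r8"), ("r8", "r9"),
]
graph = build_graph(edges)

cluster = extract_cluster(graph, "r1")

# Cost of comparing all records against each other versus comparing
# only within each pre-clustered group.
all_pairs = pairwise_comparisons(len(graph))
clusters = {frozenset(extract_cluster(graph, record)) for record in graph}
within_cluster_pairs = sum(pairwise_comparisons(len(c)) for c in clusters)
```

In this toy example, comparing all nine records pairwise needs 36 comparisons, while comparing only within the three extracted groups needs 10. With k equally sized pieces the quadratic cost falls from roughly n^2 to k(n/k)^2 = n^2/k, which matches the proportional reduction stated in the conclusion.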
doi:10.23889/ijpds.v1i1.271 fatcat:lionr3ksdnghxeorn5v3o2yuku