Towards efficient and effective entity resolution for high-volume and variable data

Xiao Chen, Gunter Saake (Martin-Luther-Universität; Universitäts- und Landesbibliothek Sachsen-Anhalt)
2020
Entity Resolution (ER), the process of identifying records that refer to the same real-world entity, faces challenges brought by big data. On the one hand, high-volume data forces ER to use blocking and parallel computation to improve efficiency and scalability. In this scenario, we identify three limitations: First, despite abundant research on parallel ER, a thorough survey that overviews the current state and exposes research gaps is missing. Second, the efficiency impact of choosing different implementation options offered by big data processing frameworks is unknown. Last, an in-depth analysis and comparison of the state-of-the-art block-splitting-based load balancing strategies is not available. Correspondingly, we first conducted a systematic literature review on parallel ER and report our findings. We then explore three Spark implementations of two scenarios of a conventional ER process and expose their respective efficiency and speed-up. Last, we theoretically analyze and compare two state-of-the-art block-splitting-based load balancing strategies, propose two improved strategies, and empirically evaluate them to identify the factors that matter for a block-splitting-based load balancing strategy.

On the other hand, facing variable data, we identify two shortcomings. First, for variable data with different types of attributes, word-embedding-based similarity calculation can provide a uniform solution, but its effectiveness may drop for attributes without semantics. Second, for variable data from broad domains, the training data required for learning-based classification may not be available, leading to expensive human labeling costs. Existing committee-based active learning approaches for ER, which aim to reduce these labeling costs, cannot provide balanced and informative initial training data and compromise the accuracy of their committees in order to obtain different classification voting results. Correspondingly, we first propose a hybrid similarity calculation approach by choosing traditional [...]
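As context for the Spark experiments mentioned above, a conventional blocking-based ER step groups records by a blocking key and compares record pairs only within each block. The following PySpark sketch illustrates that general idea only; the toy schema, the last-name blocking key, the string-similarity measure, and the 0.8 threshold are illustrative assumptions, not the implementations studied in the thesis.

```python
# Minimal sketch of a blocking-based ER step on Spark (PySpark).
# Schema, blocking key, similarity measure, and threshold are assumptions.
from itertools import combinations
from difflib import SequenceMatcher

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("er-blocking-sketch").getOrCreate()
sc = spark.sparkContext

# Toy records: (id, name). A real data set would have many attributes.
records = sc.parallelize([
    (1, "Jon Smith"), (2, "John Smith"), (3, "Mary Jones"), (4, "Marie Jones"),
])

def blocking_key(name):
    # Illustrative blocking key: lower-cased last name.
    return name.lower().split()[-1]

def similarity(a, b):
    # Placeholder string similarity; the thesis studies richer measures.
    return SequenceMatcher(None, a, b).ratio()

matches = (
    records
    .map(lambda r: (blocking_key(r[1]), r))            # assign each record to a block
    .groupByKey()                                        # collect records per block
    .flatMap(lambda kv: combinations(list(kv[1]), 2))    # pairs within one block only
    .filter(lambda p: similarity(p[0][1], p[1][1]) > 0.8)  # classify by threshold
)

print(matches.collect())
```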
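Block-splitting-based load balancing targets the skew caused by a few very large blocks: a block with n records yields n(n-1)/2 comparisons, so a single oversized block can keep one worker busy long after the others finish. The sketch below shows the generic splitting idea under an assumed per-task comparison limit; it is not one of the specific strategies analyzed or proposed in the thesis.

```python
# Generic sketch of splitting one large block's comparison workload into
# bounded sub-tasks. The task-size limit (max_pairs) is an assumption.
from itertools import combinations, islice

def split_block(records, max_pairs):
    """Yield lists of record pairs, each containing at most max_pairs
    comparisons, so sub-tasks can be distributed evenly across workers."""
    pairs = combinations(records, 2)            # n*(n-1)/2 comparisons in total
    while True:
        chunk = list(islice(pairs, max_pairs))  # take the next bounded sub-task
        if not chunk:
            break
        yield chunk

# A block with 6 records produces 15 comparisons; with max_pairs=4 this
# yields sub-tasks of sizes 4, 4, 4, 3 instead of one task of size 15.
block = ["r1", "r2", "r3", "r4", "r5", "r6"]
for i, task in enumerate(split_block(block, max_pairs=4)):
    print(f"task {i}: {len(task)} comparisons")
```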
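Word-embedding-based similarity gives variable data with heterogeneous attributes one uniform measure: embed each attribute value and compare the vectors, for example with cosine similarity. The sketch below uses a made-up toy embedding table purely for illustration; a real system would load pre-trained vectors, and the thesis's hybrid approach additionally falls back to traditional similarity measures for attributes without semantics.

```python
# Sketch of word-embedding-based attribute similarity via cosine similarity.
# The tiny embedding table below is a stand-in for real pre-trained vectors.
import numpy as np

toy_embeddings = {
    "laptop":   np.array([0.9, 0.1, 0.0]),
    "notebook": np.array([0.8, 0.2, 0.1]),
    "red":      np.array([0.0, 0.9, 0.3]),
}

def embed(text, embeddings):
    """Average the word vectors of all known tokens in an attribute value."""
    vectors = [embeddings[t] for t in text.lower().split() if t in embeddings]
    return np.mean(vectors, axis=0) if vectors else None

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = embed("red laptop", toy_embeddings)
b = embed("red notebook", toy_embeddings)
if a is not None and b is not None:
    print(cosine_similarity(a, b))  # high for semantically similar values
```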
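Committee-based active learning reduces labeling cost by training several classifiers on the currently labeled pairs and asking a human to label the unlabeled pair on which the committee disagrees most. The sketch below shows one step of such a query-by-committee loop; the toy feature vectors, logistic-regression members, committee size, and stratified bootstrap are illustrative assumptions rather than the approach proposed in the thesis.

```python
# Sketch of one committee-based (query-by-committee) selection step:
# pick the unlabeled record pair on which the committee disagrees most.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy similarity feature vectors for labeled and unlabeled record pairs.
X_labeled = np.array([[0.9, 0.8], [0.95, 0.9], [0.1, 0.2], [0.2, 0.1]])
y_labeled = np.array([1, 1, 0, 0])          # 1 = match, 0 = non-match
X_unlabeled = np.array([[0.5, 0.6], [0.05, 0.1], [0.92, 0.88]])

# Train committee members on class-balanced bootstrap samples so every
# member sees both matches and non-matches.
match_idx = np.where(y_labeled == 1)[0]
non_idx = np.where(y_labeled == 0)[0]
committee = []
for _ in range(5):
    idx = np.concatenate([
        rng.choice(match_idx, size=len(match_idx), replace=True),
        rng.choice(non_idx, size=len(non_idx), replace=True),
    ])
    committee.append(LogisticRegression().fit(X_labeled[idx], y_labeled[idx]))

votes = np.array([m.predict(X_unlabeled) for m in committee])  # (members, pairs)
match_fraction = votes.mean(axis=0)
disagreement = 1.0 - np.abs(match_fraction - 0.5) * 2          # 1.0 = split vote
print("ask a human to label pair", int(np.argmax(disagreement)))
```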
doi:10.25673/35204