Exploring Spark-SQL-Based Entity Resolution Using the Persistence Capability [chapter]

Xiao Chen, Roman Zoun, Eike Schallehn, Sravani Mantha, Kirity Rapuru, Gunter Saake
2018 Communications in Computer and Information Science  
Faced with an exploding data volume, pair-wise ER is challenged to achieve high efficiency and scalability. To tackle this challenge, parallel computing is proposed for speeding up the ER process.  ...  Due to the difficulty of distributed programming, big data processing frameworks are often used as tools to ease the realization of parallel ER, supporting data partitioning, workload balancing, and fault  ...  GECO consists of GEnerator and COrruptor, which is specifically designed for generating ER datasets.  ... 
doi:10.1007/978-3-319-99987-6_1 fatcat:2smcuytevnfsnnlbxoulugfegm

(Almost) All of Entity Resolution [article]

Olivier Binette, Rebecca C. Steorts
2022 arXiv   pre-print
using bibliographic data, all these applications have a common theme - integrating information from multiple sources.  ...  Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors  ...  -N., Vatsalan, D., and Christen, P. “GeCo: An Online Personal Data Generator and Corruptor.”  ... 
arXiv:2008.04443v3 fatcat:6tunuro7afhmbpambcn2bk32ly

Scalable and approximate privacy-preserving record linkage [article]

Dinusha Vatsalan, The Australian National University
Often, it is not permissible to exchange personal identifying data across different organizations due to privacy and confidentiality concerns or regulations.  ...  Generally, unique entity identifiers are not available in all the databases to be linked.  ...  We used our flexible data Generation and Corruption of personal data tool (GeCo) [35] to corrupt the OZ and NC databases. The GeCo tool is available online: [183].  ... 
doi:10.25911/5d739004a7846 fatcat:ib7nvtnc4jgszgyh3terw7nzpu

Towards efficient and effective entity resolution for high-volume and variable data [article]

Xiao Chen, Universitäts- Und Landesbibliothek Sachsen-Anhalt, Martin-Luther Universität, Gunter Saake
Last, an in-depth analysis and comparison of the state-of-the-art block-splitting-based load balancing strategies are not provided.  ...  On the one hand, high-volume data forces ER to use blocking and parallel computation to improve efficiency and scalability.  ...  The research approaches [Sarawagi and Bhamidipaty, 2002; Tejada et al., 2001] are the most similar to ours. They form their committees with several classifiers, which  ... 
doi:10.25673/35204 fatcat:ejgdps6glndmxjagwq5sq3hy74