
Applying Term Frequency-Based Indexing to Improve Scalability and Accuracy of Probabilistic Data Linkage

Robespierre Pita, Luan Menezes, Marcos Barreto
2018 Very Large Data Bases Conference  
In this paper, we discuss a new indexing scheme, based on term-frequency counts, deployed in our data linkage tool (AtyImo).  ...  Our results show a very high level of accuracy and a reduction in the number of pairwise comparison tasks.  ...  The scope of our work comprises the usage of term frequency-based indexing to improve the accuracy and the scalability of our probabilistic data linkage tool, AtyImo [Pita et al. 2018].  ... 
dblp:conf/vldb/PitaMB18 fatcat:uz6tygwirzg4vbnwkpyokkwbsy

CIDACS-RL: a novel indexing search and scoring-based record linkage system for huge datasets with high accuracy and scalability

George C. G. Barbosa, M. Sanni Ali, Bruno Araujo, Sandra Reis, Samila Sena, Maria Y. T. Ichihara, Julia Pescarini, Rosemeire L. Fiaccone, Leila D. Amorim, Robespierre Pita, Marcos E. Barreto, Liam Smeeth (+1 others)
2020 BMC Medical Informatics and Decision Making  
Methods We developed CIDACS-RL (Centre for Data and Knowledge Integration for Health – Record Linkage), a novel iterative deterministic record linkage algorithm based on a combination of indexing search  ...  Conclusion CIDACS-RL algorithm is an innovative linkage tool for huge datasets, with higher accuracy, improved scalability, and substantially shorter execution time compared to other existing linkage tools  ...  Authors' contributions GCGB and MSA wrote the first draft of the manuscript with contributions from RP, MEB and all other co-authors. GCGB, MSA, RP, BA and MEB discussed  ... 
doi:10.1186/s12911-020-01285-w pmid:33167998 fatcat:skcuhribsvautjjml2izdttkvi

Secure Transmission of Record after Record Linkage for Crime Detection Using AES

Rincy K Raj
2014 IOSR Journal of Computer Engineering  
The proposed system provides secure information retrieval after efficient record linkage with indexing. The AES algorithm is applied for the secure transmission of matched data.  ...  It provides data integrity, data quality and also the reuse of existing data for advanced studies. The complexity of finding matching records is high due to the increased size of databases.  ...  Canopy Clustering with TFIDF (Term Frequency/Inverse Document Frequency) forms blocks of records based on those records placed in the same canopy cluster.  ... 
doi:10.9790/0661-16592935 fatcat:g7eftjlznrg35mxdfvv3o45hse

A taxonomy of privacy-preserving record linkage techniques

Dinusha Vatsalan, Peter Christen, Vassilios S. Verykios
2013 Information Systems  
The idea is based on the concepts of term frequency (TF) and inverse document frequency (IDF), as used in information retrieval, to give weights to words according to their overall occurrence in a database  ...  It uses specialized multi-dimensional tree index data structure based blocking (kd-tree, BSP-tree, R n -tree, etc.) to improve scalability. Previous work presented by Inan et al.  ... 
doi:10.1016/ fatcat:3kzh22vpjbexrpcxss4nyg55je

A hybrid cloud model for secure record linkage of large health datasets (Preprint)

Adrian P Brown, Sean M Randall
2020 JMIR Medical Informatics  
The linking of administrative data across agencies provides the capability to investigate many health and social issues with the potential to deliver significant public benefit.  ...  A new hybrid cloud model was developed, including privacy-preserving record linkage techniques and container-based batch processing.  ...  With the same comparison space and probabilistic parameters, the accuracy of the linkage is also identical.  ... 
doi:10.2196/18920 pmid:32965236 fatcat:ph7nfxyx5fdhve6rlxt5whocvi

A Machine Learning Trainable Model to Assess the Accuracy of Probabilistic Record Linkage [chapter]

Robespierre Pita, Everton Mendonça, Sandra Reis, Marcos Barreto, Spiros Denaxas
2017 Lecture Notes in Computer Science  
) for assessing and refining the accuracy of probabilistic linkage.  ...  A key component of record linkage is accuracy assessment, the process of manually verifying and validating matched pairs to further refine linkage parameters and increase its overall accuracy.  ... 
doi:10.1007/978-3-319-64283-3_16 fatcat:etf55edz7ze7jjl534eosdpfbq

Review on Record LINKAGE and Deduplication based on Suffix Array Indexing

Warke Yaminia, Arti Mohanpurkar
2014 International Journal of Computer Applications  
An indexing technique, specifically the suffix array, is used to efficiently implement record linkage and deduplication.  ...  Record linkage is an important process for data quality, used for combining, matching and removing duplicates from two or more databases that refer to the same entities.  ...  blocking and sorted neighborhood blocking, recent blocking methods such as bigram indexing and canopy clustering provide scalable blocking methods while maintaining or improving upon record linkage accuracy  ... 
doi:10.5120/18916-0243 fatcat:4f5dnnk73fddtnu5yr7mjdmdwy

Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases [article]

Yuhang Zhang, Kee Siong Ng, Michael Walker, Pauline Chou, Tania Churchill, Peter Christen
2018 arXiv pre-print
The scalability and accuracy of the proposed algorithm are evaluated using benchmark datasets and shown to achieve state-of-the-art results.  ...  probabilistic identification of entity signatures in data.  ...  labelling algorithm that uses inverted-index data structures and parallel databases to compute transitive linkages in large graphs (tens to hundreds of millions of nodes); 4. is simple and scalable, allowing  ... 
arXiv:1712.09691v3 fatcat:ktno7nqnk5anfomslkzi6wq6wu

Online Social Network Profile Linkage [chapter]

Haochen Zhang, Min-Yen Kan, Yiqun Liu, Shaoping Ma
2014 Lecture Notes in Computer Science  
To enable this, we explore a probabilistic approach that uses a domain-specific prior knowledge to address this problem of online social network user profile linkage.  ...  Our probabilistic classifier integrating prior knowledge into Naïve Bayes performs at over 85% F1-measure for pairwise linkage, comparable to state-of-the-art approaches.  ...  Conclusion We investigate the problem of real world large-scale profile linkage and propose OPL, a probabilistic classifier to address this.  ... 
doi:10.1007/978-3-319-12844-3_17 fatcat:4iu6d7p6snbclpftipdtgrviea

Comparative Study of Record Linkage Approaches for Big Data

2021 Walailak Journal of Science and Technology  
In addition, Apache Flink is still rarely used to solve the record linkage problem of Big Data.  ...  ; fourth, the MapReduce was used in about 50 % of the selected studies to handle the parallel processing of Big Data, but due to its limitations, more recent and efficient approaches had been used, such  ...  A deterministic record linkage had been applied between two DB using direct join and probabilistic record linkage where Spark was used to perform record linkage. • Phase 4: Evaluate data mart.  ... 
doi:10.48048/wjst.2021.7221 fatcat:etnf63jobzhkpgad5z762pe4qm

Semantic based Document Clustering: A Detailed Review

Neepa Shah, Sunita Mahajan
2012 International Journal of Computer Applications  
Also, these methods reduce the dimensionality of term features efficiently for very large datasets, thus improving the accuracy and scalability of the clustering algorithms.  ...  The experimental results show the improvement in the accuracy and quality of FIHC. Also, key terms are useful as the labels of the candidate clusters.  ... 
doi:10.5120/8202-1598 fatcat:mb5hph2d6vhofmyxuyib7srgqq

Text stream mining for Massive Open Online Courses: review and perspectives

Safwan Shatnawi, Mohamad Medhat Gaber, Mihaela Cocea
2014 Systems Science & Control Engineering  
MOOCs are neither precisely defined nor sufficiently researched in terms of their properties and usage.  ...  Text mining and streaming text mining techniques which can contribute to the success of these systems are reviewed and some open issues in MOOC systems are addressed.  ...  This approach integrates bigram-based and topic-based models to achieve a better predictive accuracy over LDA or hierarchical LDA.  ... 
doi:10.1080/21642583.2014.970732 fatcat:ihfmcjgopzaudjafzeubgb4dly

Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges [chapter]

Dinusha Vatsalan, Ziad Sehili, Peter Christen, Erhard Rahm
2017 Handbook of Big Data Technologies  
PPRL for Big Data poses several challenges, with the three major ones being (1) scalability to multiple large databases, due to their massive volume and the flow of data within Big Data applications, (  ...  2) achieving high quality results of the linkage in the presence of variety and veracity of Big Data, and (3) preserving privacy and confidentiality of the entities represented in Big Data collections.  ...  Co-operation Scheme, and also funded by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF  ... 
doi:10.1007/978-3-319-49340-4_25 fatcat:a6p7w4sannbmdkq4ceridmoan4

Evaluation of Clustering around Weighted Prototype and Genetic Algorithm for Document Categorization

Garima Jain, Shailendra Kumar
2015 International Journal of Computer Applications  
The genetic algorithm is an optimization-based technique that can be applied to find the best cluster centres easily by computing fitness values of data points.  ...  The F-measure and accuracy of the genetic algorithm are better than those of clustering around weighted prototype for the Reuters-21578 dataset.  ...  Efficiency and scalability are two vital factors for applications with large-scale data. The genetic algorithm is more capable and scalable compared to clustering around weighted prototype.  ... 
doi:10.5120/ijca2015906260 fatcat:bkcodo7e2bfwzbvqiqqvvg7eu4

Linkage of Hospital Records and Death Certificates by a Search Engine and Machine Learning

Sebastien Cossin, Serigne Diouf, Romain Griffier, Philippine Le Barrois d'Orgeval, Gayo Diallo, Vianney Jouhet
2021 JAMIA Open  
Our linkage strategy was composed of a search engine to reduce the number of comparisons and machine-learning algorithms.  ...  The recall and precision of our linkage strategy were 97.5% and 99.97% for the upper threshold and 99.4% and 98.9% for the lower threshold, respectively.  ...  ACKNOWLEDGMENTS The authors wish to thank Etalab and INSEE for publishing French death records as open data.  ... 
doi:10.1093/jamiaopen/ooab005 pmid:33709061 pmcid:PMC7935495 fatcat:gpzxu4dur5gepjcqaqjftyt75i