Refining Duplicate Detection for Improved Data Quality

Yu Huang, Fei Chiang
2017 International Conference on Theory and Practice of Digital Libraries  
Detecting duplicates is a pervasive data quality challenge that hinders organizations from extracting value from their data sooner.  ...  In this paper, we propose a duplication detection framework, which adapts metric functional dependencies (MFDs) to improve the detection accuracy by relaxing the matching condition on numeric values to  ...  These data quality rules can be used as additional information to improve the accuracy of the duplicate detection task.  ... 
dblp:conf/ercimdl/HuangC17 fatcat:2poorvob4ncxriezpuqiuu3ata

Self Similarity Wide-Joins for Near-Duplicate Image Detection

Luiz Olmes Carvalho, Lucio F.D. Santos, Willian D. Oliveira, Agma J.M. Traina, Caetano Traina
2015 2015 IEEE International Symposium on Multimedia (ISM)  
Near-duplicate image detection plays an important role in several real applications.  ...  Experiments performed on real datasets show that our proposal is up to three orders of magnitude faster than the best techniques in the literature, always returning a high-quality result set.  ...  ACKNOWLEDGMENT The authors are grateful to FAPESP, CNPq, CAPES and Rescuer (EU FP7-614154 / CNPq 490084/2013-3) for their financial support.  ... 
doi:10.1109/ism.2015.114 dblp:conf/ism/CarvalhoSOTT15 fatcat:q27jlj2ugfakppqmgrhfpwwz5u

A Study on the Problem Analysis and Improvement Plan of the Data Quality Management System of National R&D Data

Sang Gi Lee, Byeonghee Lee, Hanjo Jeong
2015 Indian Journal of Science and Technology  
Repeated-Sentence Detection Method Levenshtein Distance Algorithm For the duplicated terms and phrases, we used Levenshtein distance algorithm to extract the duplicated parts.  ...  Free-Text Processing System The summary of R&D projects is one of the most important fields for detecting similar and duplicated projects.  ... 
doi:10.17485/ijst/2015/v8i23/79229 fatcat:p5ye7uwrdjc4pfhevj2p54g2ta

NADEEF/ER: generic and interactive entity resolution

Ahmed Elmagarmid, Ihab F. Ilyas, Mourad Ouzzani, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin
2014 Proceedings of the 2014 ACM SIGMOD international conference on Management of data - SIGMOD '14  
Nadeef/Er provides a rich programming interface for manipulating entities, which allows generic, efficient and extensible ER.  ...  We present Nadeef/Er, a generic and interactive entity resolution system, which is built as an extension over our open-source generalized data cleaning system Nadeef.  ...  Additionally, we will show: (3) how the data quality dashboard can help users understand duplicate data, and how users can interact with Nadeef/Er to further refine ER rules; and (4) how existing ER algorithms  ... 
doi:10.1145/2588555.2594511 dblp:conf/sigmod/ElmagarmidIOQ0Y14 fatcat:isxtkb3k65gydpugbqpfdclssq

Duplicate File Detection and Elimination

Kanupriya Joshi, Mrs. Mamta
2019 International Journal of Scientific Research in Computer Science Engineering and Information Technology  
The main objective of this research work is to detect exact and inexact duplicates by using duplicate detection and elimination rules. This approach is used to improve the efficiency of the data.  ...  The problem of detecting and eliminating duplicated files is one of the major problems in the broad area of data cleaning and data quality in system.  ...  CONCLUSION In this research work, a framework is designed to clean duplicate data for improving data quality and also to support any subject oriented data.  ... 
doi:10.32628/cseit19544 fatcat:4vfo7etsqffdpkj3dnmmk336da

A Comparison Study of Data Scrubbing Algorithms and Frameworks in Data Warehousing

Hamed Ibrahim Housien, Zhang Zuping, Zainab Qays Abdulhadi
2013 International Journal of Computer Applications  
These constraints are: dirty data, noise data, missing values, inconsistency, uncertain data, ambiguous, conflicting, duplicated records and similar columns.  ...  Its robust and better decision depends on an important and conclusive factor called Data Quality (DQ), to obtain a high data quality using Data Scrubbing (DS) which is one of data Extraction Transformation  ...  , domain format errors, irregularities, integrity constraint violation, and duplicates, to improve the quality of the data.  ... 
doi:10.5120/11752-7406 fatcat:7hajxzypsjb3zexz5i2urh3fc4

An Expert System for Quality Assurance of Document Image Collections [chapter]

Roman Graf, Reinhold Huber-Mörk, Alexander Schindler, Sven Schlarb
2012 Lecture Notes in Computer Science  
This paper presents an expert system that supports decision making for page duplicate detection in document image collections.  ...  Digital preservation workflows for automatic acquisition of image collections are susceptible to errors and require quality assurance.  ...  Our hypothesis is that automatic approach should be able to detect duplicates with reliable quality. Then this method would be a significant improvement over a manual analysis.  ... 
doi:10.1007/978-3-642-34234-9_25 fatcat:3bxk326bynervoasg3n3slgfzq

Improving Data from Electron Backscatter Diffraction Experiments using Pattern Matching Techniques

Pat Trimby, Kim Larsen, Michael Hjelmstad, Aimo Winkelmann, Klaus Mehnert
2022 Microscopy and Microanalysis  
In this paper we present new developments in the use of EBSD pattern correlation for improving the quality of EBSD datasets.  ...  The results (figure 2c) show that the dataset quality has been significantly improved, with far more information in the nanostructured matrix areas, crucially without interpolation from or duplication  ... 
doi:10.1017/s1431927622011813 fatcat:rkaleii67nejvogotzgyseoudy

Handling Duplicated Tasks in Process Discovery by Refining Event Labels [chapter]

Xixi Lu, Dirk Fahland, Frank J. H. M. van den Biggelaar, Wil M. P. van der Aalst
2016 Lecture Notes in Computer Science  
We were able to improve the quality of up to 42% of the models compared to using a log with imprecise labeling using default parameters and up to 87% using adaptive parameters.  ...  Moreover, using our refinement approach significantly increased the similarity of the discovered model to the original process with duplicate labels allowing for better rediscoverability.  ...  Data Quality and Noise/Deviation Filtering. Imprecise labels could also be seen as data quality problem, i.e., events having incorrect labels.  ... 
doi:10.1007/978-3-319-45348-4_6 fatcat:k4nrzmt6uzeazp6vfy5gf5xhzy

Reach for gold

Tobias Vogel, Arvid Heise, Uwe Draisbach, Dustin Lange, Felix Naumann
2014 Journal of Data and Information Quality  
Duplicates in a database are one of the prime causes of poor data quality and are at the same time among the most difficult data quality problems to alleviate.  ...  Finally, we provide an annealing standard for 750,000 CDs to the duplicate detection community.  ...  A duplicate detection benchmark for XML (and potentially relational) data is proposed by Weis et al. [2006].  ... 
doi:10.1145/2629687 dblp:journals/jdiq/VogelHDLN14 fatcat:ivvwovja3rgpbmsjkilwwevsf4

A Weighted PageRank-Based Bug Report Summarization Method Using Bug Report Relationships

Beomjun Kim, Sungwon Kang, Seonah Lee
2019 Applied Sciences  
For software maintenance, bug reports provide useful information to developers because they can be used for various tasks such as debugging and understanding previous changes.  ...  The experimental results show that our method outperforms the state-of-the-art method in terms of both the quality of the summary and the number of applicable bug reports.  ...  s [6] PRST effectively improved the summary quality for bug reports by using the duplicates relationships in bug reports.  ... 
doi:10.3390/app9245427 fatcat:hwsuws3prjgafimvisjgzkdpte

Utilising Code Smells to Detect Quality Problems in TTCN-3 Test Suites [chapter]

Helmut Neukirchen, Martin Bisanz
2007 Lecture Notes in Computer Science  
Therefore, a quality assessment of TTCN-3 test suites is desirable. A powerful approach to detect quality problems in source code is the identification of code smells.  ...  This paper presents a quality assessment approach for TTCN-3 test suites which is based on TTCN-3 code smells: To this aim, various TTCN-3 code smells have been identified and collected in a catalogue;  ...  Acknowledgements: The authors like to thank Jens Grabowski and the anonymous reviewers for valuable comments on improving this paper.  ... 
doi:10.1007/978-3-540-73066-8_16 fatcat:zsxc2wm2argwphjzbbuyfdbca4

EnsembleCNV: an ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data

2019 Nucleic Acids Research  
Herein, we propose a novel ensemble learning framework, ensembleCNV, to detect and genotype CNVs using single nucleotide polymorphism (SNP) array data.  ...  ; (e) provides direct CNV genotyping accompanied with confidence score, directly accessible for downstream quality control and association analysis.  ...  The boundary refinement of ensembleCNV both improves the CNV calling quality and downstream functional interpretability. (6) Prepared for CNV-GWAS.  ... 
doi:10.1093/nar/gkz068 pmid:30722045 pmcid:PMC6468244 fatcat:uq3ste3jt5a7ja3ui3kdx2sjhu

What Are Expected Queries in End-to-End Object Detection? [article]

Shilong Zhang, Xinjiang Wang, Jiaqi Wang, Jiangmiao Pang, Kai Chen
2022 arXiv pre-print
A duplicate query removal pre-process is applied to these queries so that they are distinguishable from each other.  ...  It obtains 44.5 AP on the MS COCO detection dataset with only 12 epochs.  ...  Although dominating object detection for years, this pipeline suffers from perfectly filtering out duplicated boxes without harming correct predictions.  ... 
arXiv:2206.01232v1 fatcat:5pdtvphv2zhu7k23bxtmuytgzq

A system proposal for automated data cleaning environment

Carlos Roberto Valêncio, Toni Jardini, Victor Hugo Penhalves Martins, Angelo Cesar Colombini, Márcio Zamboti Fortes
2020 ITEGAM- Journal of Engineering and Technology for Industrial Applications (ITEGAM-JETIA)  
Approaches were also demonstrated to show that besides detecting and treating information inconsistencies and duplication of positive cases, they also addressed cases of detected false-positives and the  ...  Against this backdrop, we developed an automated configurable data cleaning environment based on training and physical-semantic data similarity, aiming to provide a more efficient and extensible tool for  ...  BACKGROUND A data cleaning process should detect and remove errors and inconsistencies from one or more information sources to improve data quality.  ... 
doi:10.5935/jetia.v6i25.685 fatcat:4hxsg3z2ijduvge6u5qqumu35e