Collection statistics for fast duplicate document detection

Abdur Chowdhury, Ophir Frieder, David Grossman, Mary Catherine McCabe
2002 ACM Transactions on Information Systems  
We present a new algorithm for duplicate document detection that uses collection statistics. We compare our approach with the state-of-the-art approach using multiple collections.  ...  We show that our approach called I-Match, scales in terms of the number of documents and works well for documents of all sizes.  ...  The use of idf collection statistics allows us to determine the usefulness of terms for duplicate document detection.  ... 
doi:10.1145/506309.506311 fatcat:eilrac57nfgwnagybd3n2jyb2a

Compact Features for Detection of Near-Duplicates in Distributed Retrieval [chapter]

Yaniv Bernstein, Milad Shokouhi, Justin Zobel
2006 Lecture Notes in Computer Science  
The fact that the collections are distributed means that it is not in general feasible to prune duplicate and near-duplicate documents at index time.  ...  In this paper we introduce and analyze the grainy hash vector, a compact document representation that can be used to efficiently prune duplicate and near-duplicate documents from result lists.  ...  Chunk-based document fingerprinting is a technique for detecting near-duplicate documents that has been successfully used for applications such as filesystem-level duplicate detection [Manber, 1994] ,  ... 
doi:10.1007/11880561_10 fatcat:yckfw77htbf5pe3mq5togp6or4

An Image Based Approach for Content Analysis in Document Collections [chapter]

Reinhold Huber-Mörk, Alexander Schindler
2013 Lecture Notes in Computer Science  
Cluster analysis delivers groups of pages characterized by common properties, especially duplicated page content is detected with high reliability.  ...  The use of keypoint extractors combined with the bag of features approach is applied to scanned text documents.  ...  The authors would like to thank Sven Schlarb from the Austrian National Library (ONB) for providing data and expertise on library workflows.  ... 
doi:10.1007/978-3-642-41939-3_27 fatcat:vrb2pqj6czdhlod5sksuszkpha

Partial duplicate detection for large book collections

Ismet Zeki Yalniz, Ethem F. Can, R. Manmatha
2011 Proceedings of the 20th ACM international conference on Information and knowledge management - CIKM '11  
Experiments on several datasets show that DUPNIQ is more accurate than traditional methods for duplicate detection such as shingling and is fast.  ...  A framework is presented for discovering partial duplicates in large collections of scanned books with optical character recognition (OCR) errors.  ...  ACKNOWLEDGMENTS We thank James Allan, Bruce Croft and David Smith for discussions and Marek Blat and Logan Giorda for helping with annotations.  ... 
doi:10.1145/2063576.2063647 dblp:conf/cikm/YalnizCM11 fatcat:ogvbxvyozzbk5bl7stydmr2dwu

Online duplicate document detection

Jack G. Conrad, Xi S. Guo, Cindy P. Schriber
2003 Proceedings of the twelfth international conference on Information and knowledge management - CIKM '03  
As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical.  ...  A representative technique constructs a 'fingerprint' of the rarest or richest features in a document using collection statistics as criteria for feature selection.  ...  Recent work has focused on issues of computational efficiency and duplicate document detection (and, by extension, "deduping") effectiveness while relying on "collection statistics" to consistently recognize  ... 
doi:10.1145/956863.956946 dblp:conf/cikm/ConradGS03 fatcat:josv3o572naj7d6n7l6rqki6li

A Fast Text Similarity Measure for Large Document Collections using Multi-reference Cosine and Genetic Algorithm [article]

Hamid Mohammadi, Seyed Hossein Khasteh
2019 arXiv pre-print
In order to remove duplicate and near-duplicate documents from the index, a search engine needs a swift and reliable duplicate and near-duplicate text document detection system.  ...  The proposed method is based on the idea of using reference texts to generate signatures for text documents.  ...  to detect duplicate and near-duplicate text documents.  ... 
arXiv:1810.03102v3 fatcat:jo5lhedsrzevpc4k6i5acp43i4

Near-duplicate detection by instance-level constrained clustering

Hui Yang, Jamie Callan
2006 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '06  
This paper presents an instance-level constrained clustering approach for near-duplicate detection.  ...  For the task of near-duplicated document detection, both traditional fingerprinting techniques used in database community and bag-of-word comparison approaches used in information retrieval community are  ...  We are grateful to the USDA, US DOT, and US EPA for providing the public comment data that made this research possible. We are grateful to invaluable comments from the anonymous reviewers.  ... 
doi:10.1145/1148170.1148243 dblp:conf/sigir/YangC06 fatcat:27vmr3tstbc7hjfr75zubbazl4

Essential deduplication functions for transactional databases in law firms

Jack G. Conrad, Edward L. Raymond
2007 Proceedings of the 11th international conference on Artificial intelligence and law - ICAIL '07  
As massive document repositories and knowledge management systems continue to expand, in proprietary environments as well as on the Web, the need for duplicate detection becomes increasingly important.  ...  To our knowledge, we are the first to use principled methods to construct a test collection of transactional documents for such research purposes, one which identifies a variety of duplicate types and  ...  We thank Ely Razin and Kingsley Martin for their invaluable contribution of domain expertise.  ... 
doi:10.1145/1276318.1276368 dblp:conf/icail/ConradR07 fatcat:zaila4zxrnaflm2uvwwomxhwgu

Where and How Duplicates Occur in the Web

Alvaro Jr, Ricardo Baeza-Yates, Nivio Ziviani
2006 2006 Fourth Latin American Web Congress  
We identify duplicate and near-duplicate documents in our collections, studying the distribution of documents in clusters of duplicates.  ...  In this paper we study duplicates on the Web, using collections containing documents of all sites under the .cl domain that represent accurate and representative subsets of the Web.  ...  Duplicate Detection Algorithm In this section we present the algorithm to detect duplicate and near-duplicate documents in a collection C containing n documents.  ... 
doi:10.1109/la-web.2006.39 dblp:conf/la-web/JrBZ06 fatcat:ismrpwtzfnhmrfmeeh5vf72h5m

A Query-Dependent Duplicate Detection Approach for Large Scale Search Engines [chapter]

Shaozhi Ye, Ruihua Song, Ji-Rong Wen, Wei-Ying Ma
2004 Lecture Notes in Computer Science  
This hybrid method provides not only an effective but also scalable solution for duplicate detection.  ...  Existing methods for detecting duplicated Web pages can be classified into two categories, i.e. offline and online methods.  ...  detect duplicates in the whole collection.  ... 
doi:10.1007/978-3-540-24655-8_6 fatcat:dkdijwgmhffhpkf35dfs4v3cpi

Managing déjà vu: Collection building for the identification of nonidentical duplicate documents

Jack G. Conrad, Cindy P. Schriber
2006 Journal of the American Society for Information Science and Technology  
As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical.  ...  Harnessing the expertise of both client-users and professional searchers, we establish principled methods to generate a test collection for identifying and handling nonidentical duplicate documents.  ...  We acknowledge the support of Marilee Winiarski, who invested in our nonidentical duplicate research.  ... 
doi:10.1002/asi.20363 fatcat:6wgke3ekc5gitiuaqvuo4tonha

Chinese keyword extraction based on max-duplicated strings of the documents

Wenfeng Yang
2002 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '02  
In this paper, we designed an efficient algorithm to extract the max-duplicated strings by building PAT-tree for the document, so that the keywords can be picked out from the max-duplicated strings by  ...  The extraction based on global statistical information only can get significant keywords in the whole corpus. Max-duplicated strings contain the local significant keywords in each document.  ...  In the process of building a PAT tree for one document, all the strings with f(S) > 1 can be detected.  ... 
doi:10.1145/564376.564483 dblp:conf/sigir/Yang02 fatcat:ly3b2ijcmzgnbficxyy7bf2ema

Stability and Reproducibility of the Measurement of Plasma Nitrate in Large Epidemiologic Studies

Yushan Wang, Mary K Townsend, A Heather Eliassen, Tianying Wu
2013 North American Journal of Medicine and Science  
The measurement of nitrate cannot be widely used in epidemiologic research without the documentation of its stability and reproducibility.  ...  Data on the validity of nitrate measurement in blood samples collected in typical epidemiologic settings are needed before nitrate can be evaluated as an exposure in large epidemiologic studies.  ...  Howard Shertzer from the University of Cincinnati, Department of Environmental Health for his valuable input.  ... 
pmid:24244804 pmcid:PMC3826455 fatcat:rvkgjeicubekjfujhe732j66ba

A fast text similarity measure for large document collections using multi-reference Cosine and genetic algorithm

2019 Turkish Journal of Electrical Engineering and Computer Sciences  
One of the critical factors that make a search engine fast and accurate is a concise and duplicate free index. In order to remove duplicate and near-duplicate (DND) documents from the index, a search  ...  The proposed method is based on the idea of using reference texts to generate signatures for text documents.  ...  The precision and recall of other approaches are collected from a paper from Zhang et al. named "Effective and Fast Near Duplicate Detection via Signature-Based Compression [22].  ... 
doi:10.3906/elk-1906-30 fatcat:2azfsr6fmvdcbod2tn7xr2ggmu

Near Duplicate Document Detection using Document Image

Gaudence Uwamahoro, Zhang Zuping, Ambele Robert Mtafya, Weiqi Li, Long Jun
2016 International Journal of Multimedia and Ubiquitous Engineering  
for similar documents in a collection.  ...  We propose an algorithm based on tf-idf method with importance and discriminative power of a term within a single document to speed up search process for detecting how documents are similar in collection  ...  The future work will be concentrated to the more robust and accurate methods for near duplicate documents detection.  ... 
doi:10.14257/ijmue.2016.11.7.17 fatcat:lz3xc2mbvbfi5hazqtyirhw3su