Collection statistics for fast duplicate document detection
2002
ACM Transactions on Information Systems
We present a new algorithm for duplicate document detection that uses collection statistics. We compare our approach with the state-of-the-art approach using multiple collections. ...
We show that our approach, called I-Match, scales in terms of the number of documents and works well for documents of all sizes. ...
The use of idf collection statistics allows us to determine the usefulness of terms for duplicate document detection. ...
doi:10.1145/506309.506311
fatcat:eilrac57nfgwnagybd3n2jyb2a
Compact Features for Detection of Near-Duplicates in Distributed Retrieval
[chapter]
2006
Lecture Notes in Computer Science
The fact that the collections are distributed means that it is not in general feasible to prune duplicate and near-duplicate documents at index time. ...
In this paper we introduce and analyze the grainy hash vector, a compact document representation that can be used to efficiently prune duplicate and near-duplicate documents from result lists. ...
Chunk-based document fingerprinting is a technique for detecting near-duplicate documents that has been successfully used for applications such as filesystem-level duplicate detection [Manber, 1994], ...
doi:10.1007/11880561_10
fatcat:yckfw77htbf5pe3mq5togp6or4
An Image Based Approach for Content Analysis in Document Collections
[chapter]
2013
Lecture Notes in Computer Science
Cluster analysis delivers groups of pages characterized by common properties, especially duplicated page content is detected with high reliability. ...
The use of keypoint extractors combined with the bag of features approach is applied to scanned text documents. ...
The authors would like to thank Sven Schlarb from the Austrian National Library (ONB) for providing data and expertise on library workflows. ...
doi:10.1007/978-3-642-41939-3_27
fatcat:vrb2pqj6czdhlod5sksuszkpha
Partial duplicate detection for large book collections
2011
Proceedings of the 20th ACM international conference on Information and knowledge management - CIKM '11
Experiments on several datasets show that DUPNIQ is more accurate than traditional methods for duplicate detection such as shingling and is fast. ...
A framework is presented for discovering partial duplicates in large collections of scanned books with optical character recognition (OCR) errors. ...
ACKNOWLEDGMENTS We thank James Allan, Bruce Croft and David Smith for discussions and Marek Blat and Logan Giorda for helping with annotations. ...
doi:10.1145/2063576.2063647
dblp:conf/cikm/YalnizCM11
fatcat:ogvbxvyozzbk5bl7stydmr2dwu
Online duplicate document detection
2003
Proceedings of the twelfth international conference on Information and knowledge management - CIKM '03
As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. ...
A representative technique constructs a 'fingerprint' of the rarest or richest features in a document using collection statistics as criteria for feature selection. ...
Recent work has focused on issues of computational efficiency and duplicate document detection (and, by extension, "deduping") effectiveness while relying on "collection statistics" to consistently recognize ...
doi:10.1145/956863.956946
dblp:conf/cikm/ConradGS03
fatcat:josv3o572naj7d6n7l6rqki6li
A Fast Text Similarity Measure for Large Document Collections using Multi-reference Cosine and Genetic Algorithm
[article]
2019
arXiv
pre-print
In order to remove duplicate and near-duplicate documents from the index, a search engine needs a swift and reliable duplicate and near-duplicate text document detection system. ...
The proposed method is based on the idea of using reference texts to generate signatures for text documents. ...
to detect duplicate and near-duplicate text documents. ...
arXiv:1810.03102v3
fatcat:jo5lhedsrzevpc4k6i5acp43i4
Near-duplicate detection by instance-level constrained clustering
2006
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '06
This paper presents an instance-level constrained clustering approach for near-duplicate detection. ...
For the task of near-duplicated document detection, both traditional fingerprinting techniques used in database community and bag-of-word comparison approaches used in information retrieval community are ...
We are grateful to the USDA, US DOT, and US EPA for providing the public comment data that made this research possible. We are grateful to invaluable comments from the anonymous reviewers. ...
doi:10.1145/1148170.1148243
dblp:conf/sigir/YangC06
fatcat:27vmr3tstbc7hjfr75zubbazl4
Essential deduplication functions for transactional databases in law firms
2007
Proceedings of the 11th international conference on Artificial intelligence and law - ICAIL '07
As massive document repositories and knowledge management systems continue to expand, in proprietary environments as well as on the Web, the need for duplicate detection becomes increasingly important. ...
To our knowledge, we are the first to use principled methods to construct a test collection of transactional documents for such research purposes, one which identifies a variety of duplicate types and ...
We thank Ely Razin and Kingsley Martin for their invaluable contribution of domain expertise. ...
doi:10.1145/1276318.1276368
dblp:conf/icail/ConradR07
fatcat:zaila4zxrnaflm2uvwwomxhwgu
Where and How Duplicates Occur in the Web
2006
2006 Fourth Latin American Web Congress
We identify duplicate and near-duplicate documents in our collections, studying the distribution of documents in clusters of duplicates. ...
In this paper we study duplicates on the Web, using collections containing documents of all sites under the .cl domain that represent accurate and representative subsets of the Web. ...
Duplicate Detection Algorithm In this section we present the algorithm to detect duplicate and near-duplicate documents in a collection C containing n documents. ...
doi:10.1109/la-web.2006.39
dblp:conf/la-web/JrBZ06
fatcat:ismrpwtzfnhmrfmeeh5vf72h5m
A Query-Dependent Duplicate Detection Approach for Large Scale Search Engines
[chapter]
2004
Lecture Notes in Computer Science
This hybrid method provides not only an effective but also scalable solution for duplicate detection. ...
Existing methods for detecting duplicated Web pages can be classified into two categories, i.e. offline and online methods. ...
detect duplicates in the whole collection. ...
doi:10.1007/978-3-540-24655-8_6
fatcat:dkdijwgmhffhpkf35dfs4v3cpi
Managing déjà vu: Collection building for the identification of nonidentical duplicate documents
2006
Journal of the American Society for Information Science and Technology
As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. ...
Harnessing the expertise of both client-users and professional searchers, we establish principled methods to generate a test collection for identifying and handling nonidentical duplicate documents. ...
We acknowledge the support of Marilee Winiarski, who invested in our nonidentical duplicate research. ...
doi:10.1002/asi.20363
fatcat:6wgke3ekc5gitiuaqvuo4tonha
Chinese keyword extraction based on max-duplicated strings of the documents
2002
Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '02
In this paper, we designed an efficient algorithm to extract the max-duplicated strings by building PAT-tree for the document, so that the keywords can be picked out from the max-duplicated strings by ...
The extraction based on global statistical information only can get significant keywords in the whole corpus. Max-duplicated strings contain the local significant keywords in each document. ...
In the process of building a PAT tree for one document, all the strings with f(S) > 1 can be detected. ...
doi:10.1145/564376.564483
dblp:conf/sigir/Yang02
fatcat:ly3b2ijcmzgnbficxyy7bf2ema
Stability and Reproducibility of the Measurement of Plasma Nitrate in Large Epidemiologic Studies
2013
North American Journal of Medicine and Science
The measurement of nitrate cannot be widely used in epidemiologic research without the documentation of its stability and reproducibility. ...
Data on the validity of nitrate measurement in blood samples collected in typical epidemiologic settings are needed before nitrate can be evaluated as an exposure in large epidemiologic studies. ...
Howard Shertzer from the University of Cincinnati, Department of Environmental Health for his valuable input. ...
pmid:24244804
pmcid:PMC3826455
fatcat:rvkgjeicubekjfujhe732j66ba
A fast text similarity measure for large document collections using multi-reference Cosine and genetic algorithm
2019
Turkish Journal of Electrical Engineering and Computer Sciences
One of the critical factors that make a search engine fast and accurate is a concise and duplicate free index. In order to remove duplicate and near-duplicate (DND) documents from the index, a search ...
The proposed method is based on the idea of using reference texts to generate signatures for text documents. ...
The precision and recall of other approaches are collected from a paper from Zhang et al. named "Effective and Fast Near Duplicate Detection via Signature-Based Compression" [22]. ...
doi:10.3906/elk-1906-30
fatcat:2azfsr6fmvdcbod2tn7xr2ggmu
Near Duplicate Document Detection using Document Image
2016
International Journal of Multimedia and Ubiquitous Engineering
for similar documents in a collection. ...
We propose an algorithm based on tf-idf method with importance and discriminative power of a term within a single document to speed up search process for detecting how documents are similar in collection ...
Future work will concentrate on more robust and accurate methods for near-duplicate document detection. ...
doi:10.14257/ijmue.2016.11.7.17
fatcat:lz3xc2mbvbfi5hazqtyirhw3su