1,938 Hits in 3.5 sec

Recursive n-gram hashing is pairwise independent, at best

Daniel Lemire, Owen Kaser
2010 Computer Speech and Language  
Many applications use sequences of n consecutive symbols (n-grams). Hashing these n-grams can be a performance bottleneck.  ...  For more speed, recursive hash families compute hash values by updating previous values. We prove that recursive hash families cannot be more than pairwise independent.  ...  The authors are grateful to the anonymous reviewers for their significant contributions.  ... 
doi:10.1016/j.csl.2009.12.001 fatcat:jxagvqxgkbdxbfrinzo7udzmyu
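
The recursive idea mentioned in the snippet above (computing each n-gram hash by updating the previous one) can be illustrated with a Karp-Rabin-style polynomial hash. This is only a minimal sketch and not necessarily one of the families analyzed in the paper; the names `rolling_hashes` and `direct_hash`, the base 257, and the modulus 2^61 - 1 are assumptions chosen for illustration.

```python
def direct_hash(gram, base=257, mod=(1 << 61) - 1):
    """Hash one n-gram from scratch: O(n) work per n-gram."""
    h = 0
    for s in gram:
        h = (h * base + s) % mod
    return h


def rolling_hashes(symbols, n, base=257, mod=(1 << 61) - 1):
    """Hash every n-gram, updating the previous value in O(1) per step."""
    if n <= 0 or len(symbols) < n:
        return []
    top = pow(base, n - 1, mod)  # weight of the symbol that leaves the window
    h = direct_hash(symbols[:n], base, mod)
    out = [h]
    for i in range(n, len(symbols)):
        h = (h - symbols[i - n] * top) % mod  # drop the oldest symbol
        h = (h * base + symbols[i]) % mod     # shift and append the new symbol
        out.append(h)
    return out


data = list(b"abracadabra")
hashes = rolling_hashes(data, 3)
```

Each update costs a constant number of arithmetic operations regardless of n, which is the speed advantage the abstract refers to.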

One-Pass, One-Hash n-Gram Statistics Estimation [article]

Daniel Lemire, Owen Kaser
2014 arXiv   pre-print
The approach further is extended to a one-pass/one-hash computation of n-gram entropy and iceberg counts.  ...  In multimedia, text or bioinformatics databases, applications query sequences of n consecutive symbols called n-grams. Estimating the number of distinct n-grams is a view-size estimation problem.  ...  This relates hashing an n + 1-gram to hashing an overlapping n-gram, whereas recursive hashing involves two overlapping n-grams. Hence, we cannot say that recursive hash functions are semi-recursive.  ... 
arXiv:cs/0610010v4 fatcat:tsjrhuthmvh3zo6bnbxh47pm54

Page 5249 of Mathematical Reviews Vol. , Issue 98H [page]

1998 Mathematical Reviews  
Chapter 6 shows how, for several types of perfect hash functions, randomized algorithms significantly speed up the process of construction.  ...  System Sci. 54 (1997), no. 1, part 1, 61-78 Recursion is one of the most important features of Datalog programs, but this causes inefficiency when evaluating them.  ... 

Efficient Algorithm for Math Formula Semantic Search

Shunsuke OHASHI, Giovanni Yoko KRISTIANTO, Goran TOPIC, Akiko AIZAWA
2016 IEICE transactions on information and systems  
In this paper, we formulate three types of measures that represent distinctive features of semantic similarity of math formulae, and develop efficient hash-based algorithms for the approximate calculation  ...  Regardless of the importance of mathematical formula search, conventional keyword-based retrieval methods are not sufficient for searching mathematical formulae, which are structured as trees.  ...  and SIGURE Hash) and 112, 388 seconds for the pq-gram.  ... 
doi:10.1587/transinf.2015dap0023 fatcat:56pl6nmhrzdbrcxw3lbej2glh4

Indexing methods for approximate dictionary searching

Leonid Boytsov
2011 ACM Journal of Experimental Algorithmics  
We focus on infrequently updated dictionaries, which are used primarily for retrieval. Therefore, we consider indices that are optimized for retrieval rather than for update.  ...  Benchmark results are presented for the practically important cases of k = 1, 2, 3.  ...  In our experiments, we use the following hash functions: - An additive hash function for unigrams: h(Σ_i) = i mod |σ|; - An additive hash function for q-grams: h(s[l:l+q−1]) = Σ_{i=0}^{q−1} η^(q−i) ASCII(s[l  ... 
doi:10.1145/1963190.1963191 fatcat:7ei22m56areynav7yqp54ejwbq

Topology-Aware Hashing for Effective Control Flow Graph Similarity Analysis [article]

Yuping Li, Jiong Jang, Xinming Ou
2020 arXiv   pre-print
Given the CFGs constructed from program binaries, we extract blended n-gram graphical features of the CFGs, encode the graphical features into numeric vectors (called graph signatures), and then measure  ...  In this paper, we propose a novel fuzzy hashing scheme called topology-aware hashing (TAH) for effective and efficient CFG similarity analysis.  ...  For instance, comparing binary A with m functions and binary B with n functions would result in m * n pairwise CFG comparisons.  ... 
arXiv:2004.06563v1 fatcat:tz7o6ziq3jcrdjna7zh4i3zd5e

Rescore in a Flash: Compact, Cache Efficient Hashing Data Structures for n-Gram Language Models

Grant P. Strimel, Ariya Rastrow, Gautam Tiwari, Adrien Piérard, Jon Webb
2020 Interspeech 2020  
We introduce DashHashLM, an efficient data structure that stores an n-gram language model compactly while making minimal trade-offs on runtime lookup latency.  ...  Specifically, we show that with roughly a 10% increase in memory size, compared to a highly optimized, compressed baseline n-gram representation, our proposed data structure can achieve up to a 6x query  ...  the need for effective n-gram LM compression methods.  ... 
doi:10.21437/interspeech.2020-1939 dblp:conf/interspeech/StrimelRTPW20 fatcat:djxokflyo5dtdb2v6m7xloqgr4

Smoothed Bloom Filter Language Models: Tera-Scale LMs on the Cheap

David Talbot, Miles Osborne
2007 Conference on Empirical Methods in Natural Language Processing  
We investigate how a BF containing n-gram statistics can be used as a direct replacement for a conventional n-gram model.  ...  Our proposal takes advantage of the one-sided error guarantees of the BF and simple inequalities that hold between related n-gram statistics in order to further reduce the BF storage requirements and the  ...  Here we use a distinct set of k hash functions for each such category.  ... 
dblp:conf/emnlp/TalbotO07 fatcat:l7dt4hvvw5bu3acxycu3iqi3hq
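
To make the Bloom filter (BF) terminology in the snippet above concrete, here is a minimal sketch of a plain Bloom filter holding n-gram strings. It is not the smoothed, count-bearing structure the paper proposes; the class name `BloomFilter`, the use of seeded SHA-256 to derive the k hash functions, and the sizes below are assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k hash functions by prefixing the item with a seed (an assumption).
        data = item.encode("utf-8")
        for seed in range(self.k):
            digest = hashlib.sha256(seed.to_bytes(4, "little") + data).digest()
            yield int.from_bytes(digest[:8], "little") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


bf = BloomFilter(num_bits=1024, num_hashes=3)
bf.add("the quick brown")
```

The one-sided error mentioned in the abstract shows up here as the guarantee that any added n-gram is always reported as present, while an absent n-gram is only occasionally misreported.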

Efficient Algorithms for Similarity Measures over Sequential Data: A Look Beyond Kernels [chapter]

Konrad Rieck, Pavel Laskov, Klaus-Robert Müller
2006 Lecture Notes in Computer Science  
Kernel functions as similarity measures for sequential data have been extensively studied in previous research.  ...  This contribution addresses the efficient computation of distance functions and similarity coefficients for sequential data.  ...  authors gratefully acknowledge the funding from Bundesministerium für Bildung und Forschung under the project MIND (FKZ 01-SC40A) and would like to thank Sören Sonnenburg, Julian Laub and Mikio Braun for  ... 
doi:10.1007/11861898_38 fatcat:3oankd3oqraafakv63zqgaizmm

Combining Hashing and Abstraction in Sparse High Dimensional Feature Spaces

Cornelia Caragea, Adrian Silvescu, Prasenjit Mitra
2021 PROCEEDINGS OF THE THIRTIETH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE AND THE TWENTY-EIGHTH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE  
The commonly used "bag of words" and n-gram representations can result in prohibitively high dimensional input spaces.  ...  Experimental results on two text data sets show that the combined approach uses significantly smaller number of features and gives similar performance when compared with the "bag of words" and n-gram approaches  ...  Acknowledgments We would like to thank Doina Caragea and our anonymous reviewers for their constructive comments, which helped improve the presentation of this paper.  ... 
doi:10.1609/aaai.v26i1.8117 fatcat:qg6motu2gvgofmh7zaqso2zexm

An Efficient Indexing Approach for Continuous Spatial Approximate Keyword Queries over Geo-Textual Streaming Data

Ze Deng, Meng Wang, Lizhe Wang, Xiaohui Huang, Wei Han, Junde Chu, Albert Zomaya
2019 ISPRS International Journal of Geo-Information  
one hashing function instead of dozens.  ...  AP-tree+ utilizes the one-permutation min-wise hashing method to achieve much lower signature maintenance costs compared with the traditional min-wise hashing method because it only employs  ...  hashing function.  ... 
doi:10.3390/ijgi8020057 fatcat:aebnqztq7jc4fglkxyb5ovvc6u

Practical Queries of a Massive n-gram Database

Tobias Hawker, Mary Gardiner, Andrew Bennetts
2007 Australasian Language Technology Association Workshop  
Large quantities of data are an increasingly essential resource for many Natural Language Processing techniques.  ...  We present a software suite, "Get 1T", implementing these techniques, released as free software for use by the natural language research community, and others.  ...  Falconer, author of the free hashlib library for C, which we use for the pre-processed queries.  ... 
dblp:conf/acl-alta/HawkerGB07 fatcat:jo4o4lr35zalhcfrisg76i66sa

Distributed Kernel Matrix Approximation and Implementation Using Message Passing Interface

Taher A. Dameh, Wael Abd-Almageed, Mohamed Hefeeda
2013 2013 12th International Conference on Machine Learning and Applications  
To reduce these quadratic complexities, the proposed method first partitions the data into smaller subsets using various families of locality sensitive hashing, including random projection and spectral hashing  ...  We propose a distributed method to compute similarity (also known as kernel and Gram) matrices used in various kernel-based machine learning algorithms.  ...  To do so, we set the Hilbert curve window width as n/2^k, where n is data set size and k is the number of hash function bits in LSH.  ... 
doi:10.1109/icmla.2013.17 dblp:conf/icmla/DamehAH13 fatcat:65433bhchbhabdsneh5jo7bbbq

The power of two min-hashes for similarity search among hierarchical data objects

Sreenivas Gollapudi, Rina Panigrahy
2008 Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems - PODS '08  
In this study we propose sketching algorithms for computing similarities between hierarchical data.  ...  Furthermore, we show that propagating one min-hash results in poor sketch properties while propagating two min-hashes results in good sketches.  ...  Such sketching functions are also called locality sensitive hash (LSH) functions [7]. For a domain X of points with distance measure d, an LSH family of functions is defined as follows.  ... 
doi:10.1145/1376916.1376946 dblp:conf/pods/GollapudiP08 fatcat:2ny65cdponc6pbvspqdsddv6uy
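
For readers unfamiliar with the min-hash sketches named in the snippet above, the following is a minimal sketch of ordinary min-hashing on flat sets, where the fraction of matching signature entries estimates Jaccard similarity. It does not implement the paper's propagation of min-hashes through hierarchical data; the function names and the seeded SHA-256 hash family are assumptions for illustration.

```python
import hashlib


def _seeded_hash(seed, item):
    # One member of an assumed hash family, selected by the seed.
    digest = hashlib.sha256(seed.to_bytes(4, "little") + item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def minhash_signature(items, num_hashes=64):
    """Keep, for each hash function, the minimum hash value over the set."""
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = {"ab", "bc", "cd", "de"}
b = {"ab", "bc", "cd", "ef"}
similarity = estimate_jaccard(minhash_signature(a), minhash_signature(b))
```

The estimate becomes more accurate as `num_hashes` grows, at the cost of a longer signature per set.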

Accurate and Efficient Suffix Tree Based Privacy-Preserving String Matching [article]

Sirintra Vaiwsri, Thilina Ranbaduge, Peter Christen, Kee Siong Ng
2021 arXiv   pre-print
In this paper we propose a novel approach for accurate and efficient privacy-preserving string matching based on suffix trees that are encoded using chained hashing.  ...  Most existing privacy-preserving string comparison functions are either based on comparing sets of encoded character q-grams, allow only exact matching of encrypted strings, or they are aimed at long genomic  ...  The authors like to thank Alex Antic for discussions and contributions to the experimental design.  ... 
arXiv:2104.03018v1 fatcat:ticabrjphzga5olutll4eriz4q