Generalized substring selectivity estimation

Zhiyuan Chen, Flip Korn, Nick Koudas, S. Muthukrishnan
2003 Journal of computer and system sciences (Print)  
In providing such selectivity estimates, the correlation between different occurrences of substrings is crucial.  ...  The cross-counts generated by our methods are not exact, but they are adequate for selectivity estimation.  ...  work on substring selectivity estimation, the more general problem of selectivity estimation on Boolean substring predicates has not been studied.  ... 
doi:10.1016/s0022-0000(02)00031-4 fatcat:dsrih3esffflbgnxs4sskyzi7y

Approximate substring selectivity estimation

Hongrae Lee, Raymond T. Ng, Kyuseok Shim
2009 Proceedings of the 12th International Conference on Extending Database Technology Advances in Database Technology - EDBT '09  
We study the problem of estimating selectivity of approximate substring queries.  ...  The experimental results show that MOF is a light-weight algorithm that gives fairly accurate estimations.  ...  These concepts are not applicable to substring selectivity estimation.  ... 
doi:10.1145/1516360.1516455 dblp:conf/edbt/LeeNS09 fatcat:dkleqy5zejhdxfnqfwfl6cevfm

Supporting Similarity Operations Based on Approximate String Matching on the Web [chapter]

Eike Schallehn, Ingolf Geist, Kai-Uwe Sattler
2004 Lecture Notes in Computer Science  
To minimize the local processing costs and the required network traffic, the mapping uses materialized information on the selectivity of string samples such as q-samples, substrings, and keywords.  ...  Based on the predicate mapping similarity selections and joins are described and the quality and required effort of the operations is evaluated experimentally.  ...  The key criteria considered during evaluation are the selectivity of generated pre-selections, the quality of our selectivity estimation, and the applicability to actual data values.  ... 
doi:10.1007/978-3-540-30468-5_16 fatcat:6ubzldbpzjfm3kafbqcygdim3u

One-dimensional and multi-dimensional substring selectivity estimation

H.V. Jagadish, Olga Kapitskaia, Raymond T. Ng, Divesh Srivastava
2000 The VLDB journal  
In this paper, we use pruned count-suffix trees (PSTs) as the basic data structure for substring selectivity estimation. For the 1-D problem, we present a novel technique called MO (Maximal Overlap).  ...  Effective query optimization in this context requires good selectivity estimates.  ...  To estimate substring selectivity in multiple dimensions, we need to generalize the PST to multiple dimensions.  ... 
doi:10.1007/s007780000029 fatcat:sy35a43oovef3czwa6iiho763a

CXHist : An On-line Classification-Based Histogram for XML String Selectivity Estimation

Lipyeow Lim, Min Wang, Jeffrey Scott Vitter
2005 Very Large Data Bases Conference  
Hence, XML string selectivity estimation is a harder problem than relational substring selectivity estimation, because the correlation between path and substring statistics needs to be captured as well  ...  The main difference between the XML string selectivity estimation problem and the relational substring selectivity estimation problem is that a correlated path (whether implicitly encoded as a path ID  ... 
dblp:conf/vldb/LimWV05 fatcat:2o222gg2cfcvlju2xirro4mh24

When is an estimation of distribution algorithm better than an evolutionary algorithm?

Tianshi Chen, Per Kristian Lehre, Ke Tang, Xin Yao
2009 2009 IEEE Congress on Evolutionary Computation  
Despite the wide-spread popularity of estimation of distribution algorithms (EDAs), there has been no theoretical proof that there exist optimisation problems where EDAs perform significantly better than  ...  Here, it is proved rigorously that on a problem called SUBSTRING, a simple EDA called univariate marginal distribution algorithm (UMDA) is efficient, whereas the (1+1) EA is highly inefficient.  ...  be the populations before and after the selection at the $t$-th generation ($t \in \mathbb{N}^+$) respectively, $p_{t,i}(1)$ ($p_{t,i}(0)$) be the estimated marginal probability of the $i$-th bit of an individual to be 1 (  ... 
doi:10.1109/cec.2009.4983116 dblp:conf/cec/ChenLTY09 fatcat:qgqog6v3djf4tnvllzwkl32oti

Unsupervised, low latency anomaly detection of algorithmically generated domain names by generative probabilistic modeling

Jayaram Raghuram, David J. Miller, George Kesidis
2014 Journal of Advanced Research  
We propose a fully generative model for the probability distribution of benign (white listed) domain names which can be used in an anomaly detection setting for identifying putative algorithmically generated  ...  On the other hand, in the present day scenario, algorithmically generated domain names typically have distributions that are quite different from that of human-created domain names.  ...  (iii) If the substring is to be selected from A li n V li , then generate a character sequence according to the joint distribution P int (w|l i ).  ... 
doi:10.1016/j.jare.2014.01.001 pmid:25685511 pmcid:PMC4294760 fatcat:lpxqtbssefgljiqexfphlaouj4

Page 3295 of Mathematical Reviews Vol. , Issue 2004d [page]

2004 Mathematical Reviews  
substring selectivity estimation.  ...  The cross-counts generated by our methods are not exact, but they are adequate for selectivity estimation.  ... 

A partition-based method for string similarity joins with edit-distance constraints

Guoliang Li, Dong Deng, Jianhua Feng
2013 ACM Transactions on Database Systems  
Finally, we verify the candidates to generate the final answer. We devise efficient techniques to select substrings and prove that our method can minimize the number of selected substrings.  ...  Then for each string, we select some of its substrings, identify the selected substrings from the inverted indices, and take strings on the inverted lists of the found substrings as candidates of this  ...  The substring set W m (s, l) generated by the multimatch-aware selection method has the minimum size among all the substring sets generated by the substring selection methods that satisfy completeness.  ... 
doi:10.1145/2487259.2487261 fatcat:k3tft2ydnvc53ptmunsoqb3fxu

Sequences Dimensionality-Reduction by K-mer Substring Space Sampling Enables Effective Resemblance- and Containment-Analysis for Large-Scale omics-data [article]

Huiguang Yi, Yanling Lin, Wenfei Jin
2019 bioRxiv   pre-print
We proposed a new sequence sketching technique named k-mer substring space decomposition (kssd), which sketches sequences via k-mer substring space sampling instead of local-sensitive hashing.  ...  Kssd is more accurate and faster for resemblance estimation than other sketching methods developed so far. Notably, kssd is robust even when two sequences are of very different sizes.  ...  To address this, we generalized k-mer space sampling/shuffling to kmer substring space sampling/shuffling (kssd)--where a substring of the k-mer is selected according to a predefined pattern, so that we  ... 
doi:10.1101/729665 fatcat:e3rjbqym25anpoa4li3bq6wwfu

Interpolated Spectral NGram Language Models

Ariadna Quattoni, Xavier Carreras
2019 Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics  
First, in order to capture long-range dependencies of the data, the method must use statistics from long substrings, which results in very large matrices that are difficult to decompose.  ...  The spectral method is based on computing a Hankel matrix that contains statistics of expectations over substrings generated by the target language.  ...  The ability of the spectral method for PNFA to estimate substring expectations can be exploited in other contexts.  ... 
doi:10.18653/v1/p19-1594 dblp:conf/acl/QuattoniC19 fatcat:umjd54ixobcalfga6he45bypmu

Text classification stream-based R-measure approach using frequency of substring repetition

Mikhail F. Ashurov, Vasiliy V. Poddubny
2015 Vestnik Tomskogo gosudarstvennogo universiteta. Upravlenie, vychislitel'naya tekhnika i informatika  
An accuracy of text classification is estimated by Van Rijsbergen's effectiveness measure known as F-measure.  ...  Stream-based approach of R-measure using frequency of substring repetition in text classification is offered.  ...  To estimate feasible performance of using frequencies in classification the approach based on R-measure modification that can use frequencies of substring repetition is offered.  ... 
doi:10.17223/19988605/33/1 fatcat:zystzkvgqvcrvnhmps36bzijke

PASS-JOIN: A Partition-based Method for Similarity Joins [article]

Guoliang Li, Dong Deng, Jiannan Wang, Jianhua Feng
2011 arXiv   pre-print
We devise efficient techniques to select the substrings and prove that our method can minimize the number of selected substrings.  ...  Then for each string, Pass-Join selects some of its substrings and uses the selected substrings to find candidate pairs using the inverted indices.  ...  The substring set Wm(s, l) generated by the multi-match-aware selection method has the minimum size among all the substring sets generated by the substring selection methods that satisfy completeness.  ... 
arXiv:1111.7171v1 fatcat:ygefsyrcuzc2rap42sf25enmei

The reference string indexing method [chapter]

H. -J. Schek
1978 Lecture Notes in Computer Science  
Generally f(s) denotes the frequency of a substring s in S, and RSj denotes the set of refstrings with length j.  ...  Exploiting this assumption, a (small) set of "reference strings" is generated by a statistical analysis of collected queries or -if not available -by usage estimation with the original data.  ... 
doi:10.1007/3-540-08934-9_92 fatcat:dettile55bda7lqowqd4qthluq

Kssd: sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis

Huiguang Yi, Yanling Lin, Chengqi Lin, Wenfei Jin
2021 Genome Biology  
Here, we develop k-mer substring space decomposition (Kssd), a sketching technique which is significantly faster and more accurate than current sketching methods.  ...  First, the k-mers with p-selected-substring (green substring in 1st column) belonging to the red subspace s are selected, where the p-selected-substrings are recoded by the lexically ordered dimension  ...  Sketching: for a given sequence, k-mers with its p-selected-substring presented in the chosen k-mer substring subspace are selected and recoded into sketch (Fig. 2b).  ... 
doi:10.1186/s13059-021-02303-4 pmid:33726811 pmcid:PMC7962209 fatcat:ylta5ntqqjflno5wen5r22665u