
Efficient estimation of inclusion coefficient using hyperloglog sketches

Azade Nazi, Bolin Ding, Vivek Narasayya, Surajit Chaudhuri
2018 Proceedings of the VLDB Endowment  
We present a new estimator, BML, for inclusion coefficient based on Hyperloglog sketches that results in significantly lower error compared to the state-of-the-art approach that uses Bottom-k sketches.  ...  Efficiently estimating the inclusion coefficient - the fraction of values of one column that are contained in another column - is useful for tasks such as data profiling and foreign-key detection.  ...  EFFICIENT ESTIMATION OF INCLUSION COEFFICIENT In this section, we describe a technique to estimate the inclusion coefficient using Hyperloglog (HLL) sketches.  ... 
doi:10.14778/3231751.3231759 fatcat:chpxgos3lzhelf4jxsx4qqtfsi
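
For readers unfamiliar with the quantity named in this entry, the following is a minimal sketch of the inclusion coefficient computed exactly on small in-memory columns. It is not the BML estimator and does not use HyperLogLog sketches. The function name, the sample columns, the choice to count distinct values, and the value returned for an empty column are all assumptions made for illustration.

```python
def inclusion_coefficient(x, y):
    """Exact fraction of distinct values of column x that also occur in column y.

    Assumption: values are compared as distinct values (duplicates ignored),
    and an empty x column yields 0.0.
    """
    distinct_x = set(x)
    if not distinct_x:
        return 0.0
    return len(distinct_x & set(y)) / len(distinct_x)


# Hypothetical columns: a candidate foreign key and the key it might reference.
orders_customer_id = [1, 2, 2, 3, 7]
customers_id = [1, 2, 3, 4, 5]

# Three of the four distinct values (1, 2, 3 but not 7) are contained in customers_id.
print(inclusion_coefficient(orders_customer_id, customers_id))  # 0.75
```

A value close to 1 suggests that the first column could be a foreign key into the second, which is the profiling use case the snippet mentions.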

Dashing: Fast and Accurate Genomic Distances with HyperLogLog [article]

Daniel N Baker, Ben Langmead
2018 bioRxiv   pre-print
It uses the HyperLogLog sketch together with cardinality estimation methods that are specialized for set unions and intersections.  ...  Dashing is a fast and accurate software tool for estimating similarities of genomes or sequencing datasets.  ...  Grant support BL and DNB were supported by National Science Foundation grant IIS-1349906 to BL and National Institutes of Health/National Institute of General Medical Sciences grant R01GM118568 to BL.  ... 
doi:10.1101/501726 fatcat:bihhaa5dzfcrxj2lghdqpas3ui

New cardinality estimation algorithms for HyperLogLog sketches [article]

Otmar Ertl
2017 arXiv   pre-print
This paper presents new methods to estimate the cardinalities of data sets recorded by HyperLogLog sketches.  ...  conventional technique using the inclusion-exclusion principle.  ...  The conventional approach merges both HyperLogLog sketches using Algorithm 2 and estimates the union size using single sketch cardinality estimation.  ... 
arXiv:1702.01284v2 fatcat:vnd3wdz7qjfullgzulmkpksnzi

Dashing: fast and accurate genomic distances with HyperLogLog

Daniel N. Baker, Ben Langmead
2019 Genome Biology  
It uses the HyperLogLog sketch together with cardinality estimation methods that are specialized for set unions and intersections.  ...  Dashing is a fast and accurate software tool for estimating similarities of genomes or sequencing datasets.  ...  This work used the Extreme Science and Engineering Discovery Environment (XSEDE), supported by National Science Foundation grant number ACI-1548562.  ... 
doi:10.1186/s13059-019-1875-0 pmid:31801633 pmcid:PMC6892282 fatcat:ujldsdngora6hkkyclds5xibmq

SetSketch: Filling the Gap between MinHash and HyperLogLog [article]

Otmar Ertl
2021 arXiv   pre-print
The presented joint estimator can also be applied to other data structures such as MinHash, HyperLogLog, or HyperMinHash, where it even performs better than the corresponding state-of-the-art estimators  ...  MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications.  ...  Cardinality estimation: An important use case is the estimation of the number of elements in a set. Sketches have different efficiencies in encoding cardinality information.  ... 
arXiv:2101.00314v2 fatcat:ccybanavojgp5mviczyvljk4iq

DegreeSketch: Distributed Cardinality Sketches on Massive Graphs with Applications [article]

Benjamin W. Priest
2020 arXiv   pre-print
We present efficient algorithms for estimating both local neighborhood sizes and local triangle count heavy hitters using DegreeSketch.  ...  In our experiments we implement DegreeSketch using the celebrated hyperloglog cardinality sketch and utilize the distributed communication tool YGM to achieve state-of-the-art performance in distributed  ...  INTERSECTION ESTIMATION A naïve approach to estimating an intersection of two sets A and B using cardinality sketches might involve computing the intersection via the inclusion-exclusion principle: |A  ... 
arXiv:2004.04289v1 fatcat:b4xsbu44qngcppaept5ydr57na
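
The naïve approach named in this snippet can be sketched as follows, assuming the usual inclusion-exclusion identity |A ∩ B| = |A| + |B| - |A ∪ B|. The function name is hypothetical, exact set sizes stand in for sketch estimates, and clamping the result at zero is an added assumption rather than something stated in the snippet.

```python
def intersection_via_inclusion_exclusion(card_a, card_b, card_union):
    """Estimate |A ∩ B| from cardinality estimates of A, B, and A ∪ B.

    Assumption: the result is clamped at zero, because noisy estimates
    can otherwise produce a negative intersection size.
    """
    return max(0.0, card_a + card_b - card_union)


# Exact cardinalities stand in here for what a cardinality sketch would estimate.
a = {1, 2, 3, 4, 5}
b = {4, 5, 6, 7, 8}
print(intersection_via_inclusion_exclusion(len(a), len(b), len(a | b)))  # 2

# With noisy inputs the estimation errors add up, so a small true
# intersection can be swamped; here the raw difference is negative.
print(intersection_via_inclusion_exclusion(1.0, 1.0, 2.5))  # 0.0
```

The second call illustrates why this approach is regarded as naïve: the error of the result is driven by the errors of three separate estimates, not by the size of the intersection itself.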

All-distances sketches, revisited

Edith Cohen
2014 Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems - PODS '14  
For approximate distinct counting on streams, we compare HIP and the original estimators applied to the HyperLogLog Min-Hash sketches (Flajolet et al. 2007).  ...  We make several contributions which facilitate a more effective use of ADSs for scalable analysis of massive graphs.  ...  Specifically, we use the Coefficient of Variation (CV), which is the ratio of the standard deviation to the mean. Prior to our work, ADS-based neighborhood cardinality estimators [14, 18, 41, 28, 8]  ... 
doi:10.1145/2594538.2594546 dblp:conf/pods/Cohen14 fatcat:j3hk2oe4ujdnhlhqzbmsmsgbla

All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis [article]

Edith Cohen
2015 arXiv   pre-print
Sketches for all nodes are computed using a nearly linear computation and estimators are applied to sketches of nodes to estimate their properties.  ...  For approximate distinct counting on data streams, HIP outperforms the original estimators for the HyperLogLog MinHash sketches (Flajolet et al. 2007), obtaining significantly improved estimation quality  ...  Since we are interested in relative error, we use the Coefficient of Variation (CV), which is the ratio of the standard deviation to the mean.  ... 
arXiv:1306.3284v7 fatcat:ggl2ollxarbpxkvarv4zsnjv5q

On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference

Alexis Criscuolo
2020 F1000Research  
Recently developed MinHash-based techniques were proven successful in quickly estimating the level of similarity between large nucleotide sequences.  ...  This last point of view is assessed with a simulation study using a dedicated bioinformatics tool.  ...  This work used the computational and storage services (TARS cluster) provided by the IT department at Institut Pasteur, Paris.  ... 
doi:10.12688/f1000research.26930.1 pmid:33335719 pmcid:PMC7713896 fatcat:n35yuwexfrcyhl6n2nwsxgerdq

Tracking Normalized Network Traffic Entropy to Detect DDoS Attacks in P4 [article]

Damu Ding, Marco Savi, Domenico Siracusa
2021 arXiv   pre-print
This work overcomes such a limitation and presents two novel strategies for flow cardinality and for normalized network traffic entropy estimation that only use P4-supported operations and guarantee a  ...  The dawn of programmable data planes in Software-Defined Networks can help mitigate this issue, opening the door to the detection of DDoS attacks directly in the data plane of the switches.  ...  a novel memory-efficient strategy that takes inspiration from LogLog algorithm [13] for the estimation of flow cardinality in P4.  ... 
arXiv:2104.05117v1 fatcat:unhpwf7w7nag3nm6nq24fsy6gq

Location Analytics for Location-Based Social Networks [article]

Muhammad Aamir Saleem
2018 PhD series, Technical Faculty of IT and Design, Aalborg University  
We proposed the problem of predicting future companions in LBSNs, and an efficient, nontrivial solution, COVER; this solution mines geo-social cohorts  ...  In our algorithm, we use the HyperLogLog sketch (HLL) [10] to replace the exact sets V B (s, d) and V(s).  ...  In our approx algorithm, we use the HyperLogLog sketch (HLL) [11] to replace the exact sets B(s, d) and V(s).  ... 
doi:10.5278/vbn.phd.tech.00038 fatcat:wwovvw4mnjbe5fqno7xn4qqo4e

Referee report. For: On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference [version 1; peer review: 3 approved]

Guy Perrière
2020
Recently developed MinHash-based techniques were proven successful in quickly estimating the level of similarity between large nucleotide sequences.  ...  This last point of view is assessed with a simulation study using a dedicated bioinformatics tool.  ...  This work used the computational and storage services (TARS cluster) provided by the IT department at Institut Pasteur, Paris.  ... 
doi:10.5256/f1000research.29746.r74631 fatcat:vqkddt2fhbgwna3r2cccqjwxby

A Survey of Challenges for Runtime Verification from Advanced Application Domains (Beyond Software) [article]

César Sánchez and Gerardo Schneider and Wolfgang Ahrendt and Ezio Bartocci and Domenico Bianculli and Christian Colombo and Yliés Falcone and Adrian Francalanza and Srđan Krstić and João M. Lourenço and Dejan Nickovic and Gordon J. Pace and Jose Rufino and Julien Signoles and Dmitriy Traytel and Alexander Weiss
2018 arXiv   pre-print
Runtime verification is an area of formal methods that studies the dynamic analysis of execution traces against formal specifications.  ...  Typically, the two main activities in runtime verification efforts are the process of creating monitors from specifications, and the algorithms for the evaluation of traces against the generated monitors  ...  The authors would like to thank Fonenantsoa Maurica and Pablo Picazo-Sanchez for their feedback on parts of a preliminary version of this document.  ... 
arXiv:1811.06740v1 fatcat:4bxx5tvfpzez3jidsj22flibv4

A survey of challenges for runtime verification from advanced application domains (beyond software)

César Sánchez, Gerardo Schneider, Wolfgang Ahrendt, Ezio Bartocci, Domenico Bianculli, Christian Colombo, Yliés Falcone, Adrian Francalanza, Srđan Krstić, João M. Lourenço, Dejan Nickovic, Gordon J. Pace (+4 others)
2019 Formal methods in system design  
Runtime verification is an area of formal methods that studies the dynamic analysis of execution traces against formal specifications.  ...  Typically, the two main activities in runtime verification efforts are the process of creating monitors from specifications, and the algorithms for the evaluation of traces against the generated monitors  ...  Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution  ... 
doi:10.1007/s10703-019-00337-w fatcat:6vu5odqyjjbkvf255bsxcchane

Vector representations of text data in deep learning [article]

Karol Grzegorczyk
2019 arXiv   pre-print
Representations learned by this model can be used in downstream tasks, like part-of-speech tagging or identification of semantic relations.  ...  For document-level representations we propose Binary Paragraph Vector: a neural network model for learning binary representations of text documents, which can be used for fast document retrieval.  ...  This research was carried out with the support of the "HPC Infrastructure for Grand Challenges of Science and Engineering" Project, co-financed by the European Regional Development Fund under the Innovative  ... 
arXiv:1901.01695v1 fatcat:et6cxs45mbcipfyvblntjwpuge