Sparse and Skew Hashing of K-Mers [article]

Giulio Ermanno Pibiri
2022 bioRxiv pre-print
Motivation: A dictionary of k-mers is a data structure that stores a set of n distinct k-mers and supports membership queries. This data structure is at the heart of many important tasks in computational biology. High-throughput sequencing of DNA can produce very large k-mer sets, on the order of billions of strings; in such cases, the memory consumption and query efficiency of the data structure are a concrete challenge. Results: To tackle this problem, we describe a compressed and associative dictionary for k-mers, that is: a data structure where strings are represented in compact form and each of them is associated with a unique integer identifier in the range [0,n). We show that some statistical properties of k-mer minimizers can be exploited by minimal perfect hashing to substantially improve the space/time trade-off of the dictionary compared to the best-known solutions. Availability: The C++ implementation of the dictionary is available at https://github.com/jermp/sshash.
doi:10.1101/2022.01.15.476199 fatcat:izagu2egq5bhvbm4r6l4unqt7y
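
To make the terms in the abstract concrete, the following is a minimal sketch of an associative k-mer dictionary and of a minimizer computation. It is not the method of the paper and not the API of the sshash library: the names `minimizer` and `NaiveKmerDictionary` are hypothetical, the dictionary is an uncompressed sorted array searched by binary search (a stand-in for compact storage and minimal perfect hashing), and the minimizer is assumed to be the lexicographically smallest m-mer, whereas practical tools often order m-mers by a hash value instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the lexicographically smallest substring of
// length m (the "minimizer" under lexicographic order) of the given k-mer.
std::string minimizer(const std::string& kmer, std::size_t m) {
    assert(m > 0 && m <= kmer.size());
    std::string best = kmer.substr(0, m);
    for (std::size_t i = 1; i + m <= kmer.size(); ++i) {
        // compare(pos, len, str) < 0 means kmer[i, i+m) sorts before best.
        if (kmer.compare(i, m, best) < 0) best = kmer.substr(i, m);
    }
    return best;
}

// Hypothetical, uncompressed stand-in for an associative dictionary:
// the n distinct k-mers are kept sorted and the identifier of a k-mer
// is its rank in that order, hence a unique integer in [0, n).
class NaiveKmerDictionary {
public:
    explicit NaiveKmerDictionary(std::vector<std::string> kmers)
        : kmers_(std::move(kmers)) {
        std::sort(kmers_.begin(), kmers_.end());
        // Drop duplicates so that the stored set contains distinct k-mers.
        kmers_.erase(std::unique(kmers_.begin(), kmers_.end()), kmers_.end());
    }

    // Number n of distinct k-mers stored.
    std::size_t size() const { return kmers_.size(); }

    // Returns the identifier in [0, n) of the k-mer, or -1 if it is absent.
    std::int64_t lookup(const std::string& kmer) const {
        auto it = std::lower_bound(kmers_.begin(), kmers_.end(), kmer);
        if (it == kmers_.end() || *it != kmer) return -1;
        return static_cast<std::int64_t>(it - kmers_.begin());
    }

    // Membership query expressed through lookup.
    bool contains(const std::string& kmer) const { return lookup(kmer) >= 0; }

private:
    std::vector<std::string> kmers_;
};
```

In this sketch a lookup costs O(log n) string comparisons and every k-mer is stored in full. The abstract's claim is that a better space/time trade-off is obtained by exploiting properties of minimizers together with minimal perfect hashing, which this naive version does not attempt.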