Bloom Filter Trie – A Data Structure for Pan-Genome Storage [chapter]

Guillaume Holley, Roland Wittler, Jens Stoye
2015 Lecture Notes in Computer Science  
High-throughput sequencing technologies have become fast and cheap in recent years. As a result, large-scale projects have started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences "colored" by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored k-mers, strings of length k, and stores them in vertices. In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the Bloom Filter Trie. The data structure allows storing and compressing a set of colored k-mers, as well as efficiently traversing the graph. Experimental results show better performance compared to another state-of-the-art data structure.
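
To make the notion of colored k-mers concrete, the following is a minimal sketch of how k-mers can be extracted from a few genomes and tagged with the genomes (colors) they occur in. It uses a plain Python dictionary and is not the Bloom Filter Trie described in the paper; the function names `colored_kmers` and `successors`, the genome names, and the toy sequences are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def colored_kmers(genomes, k):
    """Map each k-mer to the set of genome names (colors) that contain it.

    `genomes` maps a genome name to a list of its sequences (assumed layout).
    """
    colors = defaultdict(set)
    for name, sequences in genomes.items():
        for seq in sequences:
            # Slide a window of length k over the sequence.
            for i in range(len(seq) - k + 1):
                colors[seq[i:i + k]].add(name)
    return dict(colors)


def successors(kmer, colors):
    """Return stored k-mers whose first k-1 characters equal the last k-1 of `kmer`.

    These are the outgoing neighbors of `kmer` in the de Bruijn graph.
    """
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in colors]


# Toy input: two hypothetical genomes with one short sequence each.
genomes = {"genome_A": ["ACGTAC"], "genome_B": ["CGTAGG"]}
index = colored_kmers(genomes, 3)

shared = sorted(kmer for kmer, cols in index.items() if len(cols) == 2)
next_of_gta = successors("GTA", index)
```

In this toy input, `shared` lists the k-mers present in both genomes, and `next_of_gta` shows a vertex where the two genomes diverge. A dictionary of sets stores every k-mer and color explicitly, so it does not provide the compression that the abstract attributes to the Bloom Filter Trie.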
doi:10.1007/978-3-662-48221-6_16 fatcat:esvzdqqy35c55k3bnr53svurmq