De Novo NGS Data Compression [chapter]

Gaetan Benoit, Claire Lemaitre, Guillaume Rizk, Erwan Drezen, Dominique Lavenier
2017 Algorithms for Next-Generation Sequencing Data  
High-throughput sequencing machines decipher billions of nucleotides from DNA molecules at unprecedented speed. This mass of data is stored in large text files structured as lists of small DNA fragments. The fragments represent random overlapping regions of one or several genomes. The overlapping fragments generate a lot of redundancy that can be exploited to compress next-generation sequencing (NGS) data. This is the main motivation for developing dedicated compression techniques for this
... e data over generic text compressors that are not able to capture this kind of redundancy. This chapter focuses on de novo NGS data compression, which remains a very challenging issue. Here, no reference genome is considered: compression and decompression are performed as standalone processes, independently of external knowledge. The chapter explains the main NGS compression techniques, including lossless and lossy compression. Additionally, it presents an evaluation of the main state-of-the-art compressors on several real NGS datasets.
doi:10.1007/978-3-319-59826-0_4 fatcat:pjctkfpul5cqvejdr5mtymfjm4