High-order statistical compressor for long-term storage of DNA sequencing data

Marek Chlopkowski, Maciej Antczak, Michal Slusarczyk, Aleksander Wdowinski, Michal Zajaczkowski, Marta Kasprzak
<span title="2016-03-24">2016</span> <i title="EDP Sciences"> <a target="_blank" rel="noopener" href="https://fatcat.wiki/container/inei2pvlvnaw7lwfzvf7meb67e" style="color: black;">Recherche opérationnelle</a> </i> &nbsp;
We present a specialized compressor designed for efficient storage of FASTQ files produced by high-throughput DNA sequencers. Since the method has been optimized for compression quality, it is especially suitable for long-term storage and for genome research centers processing huge amounts of data (measured in petabytes). The proposed compressor uses high-order statistical models for range encoding, similar to Markov models, but the whole input is considered in building a symbol context.
Compression of DNA reads is performed in LZ style using a 5th- to 7th-order model, while nucleotide quality scores are encoded with a 3rd-order model. Mathematics Subject Classification. 68P20, 68P30, 68W32, 92D20.
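To illustrate what an order-k statistical model contributes to range encoding, the following minimal Python sketch estimates the ideal code length of a nucleotide string under an adaptive order-k context model. It is an illustration under stated assumptions, not the authors' implementation: the function name `estimated_bits`, the four-letter alphabet, and the add-one count initialization are all hypothetical choices, the context here is only the preceding k symbols rather than the whole input, and no actual range coder or LZ-style matching is included.

```python
import math
from collections import defaultdict

# Assumed alphabet for this sketch; real FASTQ reads may also contain N.
ALPHABET = "ACGT"


def estimated_bits(seq, order):
    """Return the ideal adaptive code length of seq, in bits.

    Each symbol is predicted from the `order` symbols before it.
    An ideal range coder would spend about -log2(p) bits on a symbol
    whose predicted probability is p.
    """
    index = {sym: i for i, sym in enumerate(ALPHABET)}
    # One frequency table per context, starting at 1 for every symbol
    # (add-one smoothing, a hypothetical choice) so nothing has p = 0.
    counts = defaultdict(lambda: [1] * len(ALPHABET))
    bits = 0.0
    for pos, sym in enumerate(seq):
        context = seq[max(0, pos - order):pos]
        table = counts[context]
        p = table[index[sym]] / sum(table)
        bits += -math.log2(p)
        # Adapt the model after coding, as a decoder could do the same.
        table[index[sym]] += 1
    return bits


if __name__ == "__main__":
    read = "ACGT" * 50
    for k in (0, 3):
        print(f"order {k}: {estimated_bits(read, k):.1f} bits")
```

On a strongly repetitive input such as the one above, the order-3 estimate is much smaller than the order-0 estimate, which is the effect a higher-order model is meant to exploit. On short or highly variable inputs a higher order can cost more, because each context is seen too rarely to learn from.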
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1051/ro/2015039">doi:10.1051/ro/2015039</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/nc6ic6r5hvhv3ia4w64qk5sndm">fatcat:nc6ic6r5hvhv3ia4w64qk5sndm</a> </span>
