An Adaptive Difference Distribution-Based Coding with Hierarchical Tree Structure for DNA Sequence Compression

Wenrui Dai, Hongkai Xiong, Xiaoqian Jiang, L. Ohno-Machado
2013 Data Compression Conference
Previous reference-based compression methods for DNA sequences do not fully exploit the intrinsic statistics, because they consider only approximate matches. In this paper, an adaptive difference distribution-based coding framework is proposed that operates on fragments of nucleotides organized in a hierarchical tree structure. To keep the distribution of the difference sequence between the reference and target sequences concentrated, the sub-fragment size and the matching offset used for prediction are flexible to the stepped size [...]. Matching against approximate repeats in the reference is performed with a Hamming-like weighted distance measure within a local region close to the current fragment, so that the accuracy of matching and the overhead of describing the matching offset can be balanced. A well-designed coding scheme compacts both the difference sequence and the additional parameters, e.g. sub-fragment size and matching offset. Experimental results show that the proposed scheme achieves a 150% compression improvement in comparison with the best reference-based compressor, GReEn.
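The local matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact form of the Hamming-like weighted distance, so the per-position weights, the search radius, the tie-break toward small offsets, and the modulo-4 difference mapping are all assumptions made for illustration, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

# Assumed symbol codes, used only to form an illustrative difference sequence.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def weighted_hamming(a: str, b: str, weights: List[float]) -> float:
    """Sum the weights of the positions where a and b differ (equal lengths assumed)."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def best_local_match(reference: str, fragment: str, position: int,
                     radius: int,
                     weights: Optional[List[float]] = None) -> Tuple[int, float]:
    """Search offsets in [-radius, radius] around `position` in the reference.

    Returns (offset, distance) for the candidate with the smallest weighted
    distance. Ties go to the smallest |offset|, an assumed stand-in for
    keeping the cost of describing the offset low.
    """
    n = len(fragment)
    if weights is None:
        weights = [1.0] * n  # uniform weights reduce to plain Hamming distance
    best = None
    for offset in range(-radius, radius + 1):
        start = position + offset
        if start < 0 or start + n > len(reference):
            continue  # candidate window falls outside the reference
        dist = weighted_hamming(reference[start:start + n], fragment, weights)
        key = (dist, abs(offset))
        if best is None or key < best[0]:
            best = (key, offset)
    if best is None:
        raise ValueError("no candidate window fits inside the reference")
    return best[1], best[0][0]


def difference_sequence(ref_fragment: str, fragment: str) -> List[int]:
    """Per-symbol difference modulo 4; a value of 0 marks a matching position."""
    return [(CODES[t] - CODES[r]) % 4 for r, t in zip(ref_fragment, fragment)]
```

A good match yields a difference sequence dominated by zeros, which is the concentrated distribution that the entropy coder is meant to exploit. Restricting the search to a small radius keeps the offset cheap to describe, at the cost of possibly missing a better match farther away.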
doi:10.1109/dcc.2013.45 pmid:26501129 pmcid:PMC4617277 dblp:conf/dcc/DaiXJO13 fatcat:wllx4iysqvai3bupphd6rnuld4