Advantages of the D2 Statistic

C.J. Burden, J. Jing, S.R. Wilson
2011 Proceedings of the Annual International Conference on BioInformatics and Computational Biology & Proceedings of the Annual International Conference on Advances in Biotechnology (unpublished)
The D2 statistic, defined as the number of matches of words of some pre-specified length k between two sequences, is a computationally fast, alignment-free measure of biological sequence similarity. However, there is some debate about its suitability for this purpose, as it may be susceptible to single-sequence noise. We examine the extent of the problem and the effectiveness of overcoming it by using a mean-centred version of the statistic. We conclude that the D2 statistic is a useful measure of sequence similarity which can easily be extended to a mean-centred version which may perform better in some situations. Both the D2 statistic and its mean-centred version are well approximated by Gamma random variables under an i.i.d. null hypothesis, allowing for an accurate estimation of P-values.
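As a rough illustration of the quantities named in the abstract, the following Python sketch counts k-word matches between two sequences and computes one possible mean-centred variant. The abstract does not give the centring formula, so the form used here (subtracting each word's expected count under i.i.d. letters) is an assumption, as are the function names and the `letter_probs` parameter. The moment-matching helper is likewise only one common way of fitting a Gamma distribution from a null mean and variance, not necessarily the authors' procedure.

```python
from collections import Counter
from itertools import product
from math import prod


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(a, b, k):
    """Number of k-word matches between a and b (overlaps included)."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * cb[w] for w, n in ca.items())


def d2_centred(a, b, k, letter_probs):
    """Assumed mean-centred form: each word count is reduced by its
    expected value under i.i.d. letters before the products are summed."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    na, nb = len(a) - k + 1, len(b) - k + 1
    total = 0.0
    for letters in product(letter_probs, repeat=k):
        w = "".join(letters)
        pw = prod(letter_probs[c] for c in letters)
        total += (ca[w] - na * pw) * (cb[w] - nb * pw)
    return total


def gamma_moment_match(mean, var):
    """Shape and scale of the Gamma distribution with the given mean and
    variance (method of moments; an assumed fitting choice)."""
    return mean * mean / var, var / mean


if __name__ == "__main__":
    uniform = {c: 0.25 for c in "ACGT"}
    print(d2("ACGT", "ACGA", 2))
    print(d2_centred("ACGT", "ACGA", 2, uniform))
```

The centred version enumerates all words over the alphabet, so its cost grows as the alphabet size to the power k; for long words a sparse formulation would be preferable.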
doi:10.5176/978-981-08-8119-1_bicb23 fatcat:zm46xjb3hjdszjfuapusbw5iky