Weighted k-word matches: a sequence comparison tool for proteins

J. Jing, S. R. Wilson, Conrad John Burden
2011 ANZIAM Journal  
The use of k-word matches was developed as a fast alignment-free comparison method for DNA sequences in cases where long range contiguity has been compromised, for example, by shuffling, duplication, deletion or inversion of extended blocks of sequence. Here we extend the algorithm to amino acid sequences. We define a new statistic, the weighted word match, which reflects the varying degrees of similarity between pairs of amino acids. We computed the mean and variance, and simulated the distribution function for various forms of this statistic for sequences of identically and independently distributed letters. We present these results and a method for choosing an optimal word size. The efficiency of the method is tested by using simulated evolutionary sequences, and the results are compared with BLAST.

A common problem faced by biologists is to find closely related DNA or protein sequences. Sequences with a high degree of similarity are believed to be closely related in terms of evolutionary distance or to have evolved to perform functionally similar tasks. Fast algorithms are needed to search large databases to find close matches to given query sequences. The most commonly used algorithms are based on alignments, with significance scores attached to long alignments. These algorithms generally perform well, but fail when long range contiguity has been compromised, for example, by shuffling, duplication, deletion or inversion of extended blocks of sequence. An alternative alignment-free method is to use k-word matches, in which a significance score is attached to the number of exact matches of short words of prespecified length k [1, 2]. The algorithm for evaluating the number of k-word matches is extremely fast, with a run time linear in the lengths of the ...
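As an illustration of counting exact k-word matches, the following is a minimal sketch, not the authors' implementation. It assumes the match count is the number of position pairs (i, j) at which the k-word starting at position i of one sequence equals the k-word starting at position j of the other. The function names `kword_counts` and `kword_matches` are hypothetical. Tabulating word counts in a hash table for each sequence and summing the products of the counts avoids comparing all position pairs, so for fixed k the work grows linearly with the sequence lengths.

```python
from collections import Counter


def kword_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    # An empty range (sequence shorter than k) yields an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kword_matches(a, b, k):
    """Number of position pairs (i, j) whose k-words are identical.

    Assumed definition: sum over words w of count_a(w) * count_b(w).
    """
    counts_a = kword_counts(a, k)
    counts_b = kword_counts(b, k)
    # Iterate over the smaller table; a Counter returns 0 for absent words.
    if len(counts_a) > len(counts_b):
        counts_a, counts_b = counts_b, counts_a
    return sum(n * counts_b[word] for word, n in counts_a.items())


if __name__ == "__main__":
    # Toy sequences chosen only for illustration.
    print(kword_matches("ACGTACGT", "TACGAC", 2))
```

The weighted word match described in the abstract would replace the exact-equality test with a score reflecting amino acid similarity; that variant is not sketched here because its precise definition is not given in this excerpt.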
doi:10.21914/anziamj.v52i0.3916 fatcat:ekd5xjvpzfddjnkfdentv2r5sm