A Q-gram Filter for Local Alignment in Large Genomic Database

Decai Sun, Xiaoxia Wang
2016 International Journal of Hybrid Information Technology  
Fast and exact searching for sequences similar to a query sequence in genomic databases remains a challenging task in molecular biology. In this paper, the problem of finding all e-matches in a large genomic database is considered, i.e., all local alignments over a given length w with an error rate of at most e. A new database searching algorithm called QFLA is designed to solve this problem. The proposed algorithm is a full-sensitivity algorithm: a refined q-gram filter implemented on a q-gram index. First, new features are extracted from match-regions by logically partitioning both the query sequence and the genomic database. Second, a large portion of irrelevant subsequences is eliminated quickly by these new features during the search. Last, the unfiltered regions are verified by the well-known Smith-Waterman algorithm. The experimental results demonstrate that the algorithm saves time by improving filtration efficiency while keeping filtration time short.

In [11] [12] [13], the pattern is split into k + s pieces, and hence at least s of the pieces must appear in any true match. Therefore, text that contains at least s of those pieces is verified for a complete match within the stated distance. In [12, 14], the pattern is split into j pieces, and hence at least one of these pieces must appear in any true match with at most ⌊k/j⌋ errors.

2) q-gram counting approach. The q-gram counting approach uses the q-grams of two strings for filtration. In [15], q-samples are extracted from the text every h characters; text that contains a certain number of the pattern's q-samples is then verified against the stated distance. In [16] [17] [18], the text is split into overlapping, contiguous q-grams, hence the text that contains at least t ...

n denotes the total length of true matches in the database. In this section, some new features will be extracted from the match-region and presented as lemmas. Here, let q denote the length of a q-gram,
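The exact-partition filter attributed above to [11] [12] [13] can be illustrated with a minimal Python sketch. It is not the QFLA algorithm and not code from the paper; the function names `split_pattern` and `partition_filter` are hypothetical, and the sketch assumes k is the maximum number of edit errors and that the pattern has at least k + s characters. Because each error can destroy at most one piece, a text that passes fewer than s pieces cannot contain a true match, so the filter never discards a true match; texts that pass still need verification.

```python
def split_pattern(pattern, pieces):
    """Split pattern into `pieces` contiguous parts of near-equal length."""
    base, extra = divmod(len(pattern), pieces)
    parts, start = [], 0
    for i in range(pieces):
        # The first `extra` parts take one additional character.
        length = base + (1 if i < extra else 0)
        parts.append(pattern[start:start + length])
        start += length
    return parts


def partition_filter(pattern, text, k, s=1):
    """Return True if `text` is a candidate for a match with at most k errors.

    The pattern is cut into k + s pieces; a candidate must contain at
    least s of them exactly. This is a filter only: a True result still
    has to be verified (for example with a full alignment).
    """
    if len(pattern) < k + s:
        raise ValueError("pattern must have at least k + s characters")
    parts = split_pattern(pattern, k + s)
    # Count pieces (by position in the pattern) that occur exactly in text.
    hits = sum(1 for piece in parts if piece in text)
    return hits >= s


if __name__ == "__main__":
    # One substitution allowed (k = 1), so the pattern is cut into 2 pieces.
    print(partition_filter("ACGTACGTAC", "TTACGTAGGTACTT", k=1))
    print(partition_filter("ACGTACGTAC", "TTTTTTTT", k=1))
```

A larger s makes the pieces shorter but requires more of them to be present, which is the trade-off between piece specificity and the number of required hits.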
doi:10.14257/ijhit.2016.9.1.19 fatcat:al6ygd4a4ngnxbbx2a5f2h7wnu