MCAT: Motif Combining and Association Tool
Journal of Computational Biology
Motivation: De novo motif discovery in biological sequences is an important and computationally challenging problem. In the past 20 years, a myriad of algorithms have been proposed to solve this problem, with varying success. Ensemble algorithms, which combine different individual algorithms, have been introduced in previous studies, and an ensemble strategy has been shown to improve prediction accuracy. However, the performance of these tools has not yet met expectations. One reason for the low performance is a failure to adapt to complicated and large data sets. Another problem is that few motif-finding tools remain available, and many of them are not maintained.
Results: I present a novel and fast tool, MCAT (Motif Combining and Association Tool), for de novo motif discovery that combines six state-of-the-art motif discovery tools (MEME, BioProspector, DECOD, XXmotif, Weeder, and CMF). In addition, I developed an innovative motif-combining algorithm, VoteRank, a position-based algorithm that votes on, ranks, and combines candidate motifs. By testing against DNA sequences from budding yeast, fission yeast, human, fruit fly, and mouse, I showed that MCAT is able to identify exact-match motifs in DNA sequences efficiently and achieves at least a 30% improvement in prediction accuracy.
(GENERAL AUDIENCE ABSTRACT)
Finding hidden motifs in DNA or protein sequences is an important and computationally challenging problem. A motif is a short, patterned DNA or protein sequence that has a biological function. Motifs regulate gene expression, the fundamental biological process in which DNA is transcribed into RNA, which is then translated into protein. In the past 20 years, a myriad of algorithms have been developed to solve the motif-finding problem, with varying success, but it can be difficult for even a small number of these tools to reach a consensus. Because individual tools can be better suited to specific scenarios, an ensemble tool that combines the results of many algorithms can yield a more confident and complete result. I present a novel and fast tool, MCAT (Motif Combining and Association Tool), for motif discovery that combines six state-of-the-art motif discovery tools (MEME, BioProspector, DECOD, XXmotif, Weeder, and CMF). I apply MCAT to data sets of DNA sequences from various species and compare my results with those of two well-established ensemble motif-finding tools, EMD and DynaMIT.
The experimental results show that MCAT is able to identify exact-match motifs in DNA sequences efficiently and that it performs better in practice.
I would like to thank my advisor, Prof. Lenwood S. Heath, for his guidance and support during my studies at Virginia Tech. His patience, encouragement, and insightful suggestions greatly aided my research. I would also like to thank my committee members, Prof. Silke Hauf and Prof. Liqing Zhang, for their valuable time and helpful comments. I am thankful to all of my group members and former colleagues: Jeff Robertson, Zhen Guo, Christy Coghlan, and Jake Martinez for helping with the MCAT project; Doaa Altarawy for her advice at the beginning of my research; Haitham Elmarakeby for the Beacon project; and Xiao Liang for her valuable ideas and encouragement during my research.