Robust predictions of specialized metabolism genes through machine learning [article]

Bethany M Moore, Peipei Wang, Pengxiang Fan, Bryan Leong, Craig A Schenck, John P Lloyd, Melissa Lehti-Shiu, Robert Last, Eran Pichersky, Shin-Han Shiu
2018 bioRxiv pre-print
Plant specialized metabolism (SM) enzymes produce lineage-specific metabolites with important ecological, evolutionary, and biotechnological implications. Using Arabidopsis thaliana as a model, we identified distinguishing characteristics of SM and GM (general metabolism, traditionally referred to as primary) genes through a detailed study of features including duplication patterns, sequence conservation, transcription, protein domain, and gene network properties. Study of benchmark genes
... ed that SM genes tend to be tandemly duplicated, co-expressed with their paralogs, narrowly expressed at lower levels, less conserved, and less well connected in gene networks relative to GM genes. Although the values of each of these features significantly differed between SM and GM genes, any single feature was ineffective at predicting SM from GM genes. Using machine learning methods to integrate all features, a well-performing prediction model was established with a true positive rate of 0.84 and a false positive rate of 0.23. In addition, 82% of known SM genes not used to create the machine learning model were predicted as SM genes, further demonstrating its accuracy. Application of the prediction model led to the identification of 1,817 A. thaliana genes with high confidence of being SM genes, providing a global estimate of SM gene content in a plant genome.
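For readers unfamiliar with the two performance figures quoted above, the following minimal sketch shows how a true positive rate and a false positive rate are computed from binary predictions. It is an illustration only, not the authors' pipeline: the function name `rates`, the label encoding (1 = SM, 0 = GM), and the toy labels are all assumptions made for this example.

```python
def rates(y_true, y_pred):
    """Return (TPR, FPR) for binary labels; assumed encoding: 1 = SM, 0 = GM."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # SM predicted as SM
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # SM predicted as GM
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # GM predicted as SM
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # GM predicted as GM
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one SM and one GM example")
    tpr = tp / (tp + fn)  # fraction of true SM genes recovered
    fpr = fp / (fp + tn)  # fraction of GM genes wrongly called SM
    return tpr, fpr


# Hypothetical toy labels, not data from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
tpr, fpr = rates(y_true, y_pred)
print(tpr, fpr)  # 0.75 0.25
```

Under these definitions, the reported rates of 0.84 and 0.23 would mean that 84% of benchmark SM genes were recovered while 23% of benchmark GM genes were misclassified as SM.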
doi:10.1101/304873 fatcat:6neirea7vbf7fnby23fya5ruye