Evaluating a variety of text-mined features for automatic protein function prediction with GOstruct

Christopher S Funk, Indika Kahanda, Asa Ben-Hur, Karin M Verspoor
2015 Journal of Biomedical Semantics  
Most computational methods that predict protein function do not take advantage of the large amount of information contained in the biomedical literature. In this work we evaluate both ontology term co-mention and bag-of-words features mined from the biomedical literature and analyze their impact in the context of a structured output support vector machine model, GOstruct. We find that even simple literature-based features are useful for predicting human protein function (F-max: Molecular Function = 0.408, Biological Process = 0.461, Cellular Component = 0.608). One advantage of using literature features is their ability to offer easy verification of automated predictions. We find through manual inspection of misclassifications that some false positive predictions could be biologically valid predictions based upon support extracted from the literature. Additionally, we present a "medium-throughput" pipeline that was used to annotate a large subset of co-mentions; we suggest that this strategy could help to speed up the rate at which proteins are curated.
doi:10.1186/s13326-015-0006-4 pmid:26005564 pmcid:PMC4441003 fatcat:pz7pgludtjgm3ndkcl56ir26am