A text-mining system for extracting metabolic reactions from full-text articles

Jan Czarnecki, Irene Nobeli, Adrian M Smith, Adrian J Shepherd
2012 BMC Bioinformatics  
Background: Increasingly, biological text mining research is focusing on the extraction of complex relationships relevant to the construction and curation of biological networks and pathways. However, one important category of pathway, metabolic pathways, has been largely neglected. Here we present a relatively simple method for extracting metabolic reaction information from free text that scores different permutations of assigned entities (enzymes and metabolites) within a given sentence based on the presence and location of stemmed keywords. This method extends an approach that has proved effective in the context of the extraction of protein-protein interactions.

Results: When evaluated on a set of manually-curated metabolic pathways using standard performance criteria, our method performs surprisingly well. Precision and recall rates are comparable to those previously achieved for the well-known protein-protein interaction extraction task.

Conclusions: We conclude that automated metabolic pathway construction is more tractable than has often been assumed, and that (as in the case of protein-protein interaction extraction) relatively simple text-mining approaches can prove surprisingly effective. It is hoped that these results will provide an impetus to further research and act as a useful benchmark for judging the performance of more sophisticated methods that are yet to be developed.

Background

On the extraction of metabolic pathway information

An important goal of biological text mining is to extract relationships between named biological and/or medical entities. Until recently, the vast majority of research in this area has concentrated on extracting binary relationships between genes and/or proteins, most notably protein-protein interactions. However, attention is increasingly shifting towards more complex relationships, with a particular focus on biomolecular networks and pathways [1]. Yet, in spite of this new focus on networks and pathways, one of the most important sub-topics, the construction and curation of metabolic pathways, has largely been ignored. This is in contrast to the protein- and gene-centric focus of recent text-mining research: protein-protein interaction networks [2,3], signal transduction pathways [4-6], protein (synthesis, modification and degradation) [1], and regulatory networks [7,8].
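The scoring idea summarized in the abstract (ranking permutations of the entities in a sentence by the presence and location of stemmed keywords) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the keyword stems and weights in `KEYWORD_STEMS`, the `crude_stem` suffix stripper, and the rule that a keyword counts fully only when it lies between the candidate substrate and product. Enzymes are left out of the permutations for brevity.

```python
from itertools import permutations

# Hypothetical keyword stems and weights (not the paper's actual lists).
KEYWORD_STEMS = {"convert": 2.0, "catalys": 2.0, "catalyz": 2.0, "produc": 1.5}


def crude_stem(word):
    """Rough suffix stripping; a stand-in for a real stemmer (assumption)."""
    word = word.lower().strip(".,;:()")
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def score_assignment(tokens, substrate_idx, product_idx):
    """Score one (substrate, product) ordering of two metabolite mentions."""
    score = 0.0
    for i, token in enumerate(tokens):
        weight = KEYWORD_STEMS.get(crude_stem(token))
        if weight is None:
            continue
        if substrate_idx < i < product_idx:
            # Location: keyword sits between substrate and product.
            score += weight
        else:
            # Presence only: keyword occurs elsewhere in the sentence.
            score += 0.25 * weight
    return score


def best_assignment(tokens, metabolite_indices):
    """Return the highest-scoring ordered pair of metabolite token indices."""
    candidates = permutations(metabolite_indices, 2)
    return max(candidates, key=lambda pair: score_assignment(tokens, *pair))


tokens = "Glucose is converted to glucose-6-phosphate by hexokinase".split()
print(best_assignment(tokens, [0, 4]))  # (0, 4): glucose -> glucose-6-phosphate
```

A location rule this simple only favours orderings in which the substrate precedes the keyword and the product follows it; a usable system would need further rules for other sentence constructions.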
This protein/gene-centric focus is also enshrined in the BioNLP'09 shared task on event extraction, an important initiative designed to galvanize community-wide effort to address the challenges of extracting information about complex events [1]. The only system that we are aware of that has an explicit focus on extracting metabolic pathway information from free text is the template-based EMPathIE [9], which is no longer under active development (R. Gaizauskas, personal communication). The aim of EMPathIE was to extract information about metabolic reactions together with relevant contextual information (including source organism and pathway name) from specific journals. When evaluated on a corpus of seven journal articles, EMPathIE achieved 23% recall and 43% precision [10]. Certain more generic systems may also be used for the same purpose, including the GeneWays system for "extracting, analyzing, visualizing and integrating molecular pathway data" [4], and the MedScan sentence parsing system [11], capable of extracting relationships between a range of biomedical entities including proteins and
doi:10.1186/1471-2105-13-172 pmid:22823282 pmcid:PMC3475109 fatcat:leqjw2phkbe6veikxy5pcjbhkq