Text mining for systems biology

Juliane Fluck, Martin Hofmann-Apitius
2014 Drug Discovery Today  
Scientific communication in biomedicine is, by and large, still text based. Text mining technologies for the automated extraction of useful biomedical information from unstructured text that can be directly used for systems biology modelling have been substantially improved over the past few years. In this review, we underline the importance of named entity recognition and relationship extraction as fundamental approaches that are relevant to systems biology. Furthermore, we emphasize the role of publicly organized scientific benchmarking challenges that reflect the current status of text-mining technology and are important in moving the entire field forward. Given further interdisciplinary development of systems biology-orientated ontologies and training corpora, we expect a steadily increasing impact of text-mining technology on systems biology in the future.
A substantial proportion of information relevant to the modelling and simulation of physiological and pathophysiological processes is not available from databases but is instead present in unstructured scientific documents, such as journal articles, reviews and monographs. Scientific communication in biomedicine is, by and large, still text based, because we all feel the need to report scientific advancements in a way that enables us to make use of the high expressiveness of natural human language. Technologies to identify useful biomedical information in unstructured text and to extract it automatically have been developed over the past 15 years. Initially focusing on finding and extracting information from PubMed abstracts, text-mining technology has advanced with impressive speed and is focusing increasingly on the extraction of complex biological context from full-text documents. A recent review on text-mining technologies enabling integrative biology provides a good overview of some of the academic technology developments made in this context [1].
Text-mining services for systems biology have to support the process of identifying and extracting information that is relevant to system description, modelling and simulation. Modelling of complex biological processes, spanning pathways to entire diseases, can be done at various levels of granularity using a range of mathematical modelling approaches [2]. Continuous models and quantitative models based on differential equations have been applied with great success where mechanistic details and kinetic parameters are known [3].
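To make concrete what "mechanistic details and kinetic parameters" means for a differential-equation model, the following is a minimal sketch and is not taken from the review. It assumes a hypothetical one-step reaction S -> P with first-order kinetics, dS/dt = -k*S and dP/dt = k*S, and integrates it with a plain forward-Euler scheme. The function name `simulate_conversion` and all parameter values are illustrative assumptions.

```python
def simulate_conversion(s0, k, dt, steps):
    """Forward-Euler integration of dS/dt = -k*S, dP/dt = k*S.

    s0    -- initial substrate concentration (hypothetical units)
    k     -- first-order rate constant (the 'kinetic parameter')
    dt    -- integration time step
    steps -- number of Euler steps to take
    Returns a list of (time, S, P) tuples, including the initial state.
    """
    s, p = s0, 0.0
    trajectory = [(0.0, s, p)]
    for i in range(1, steps + 1):
        rate = k * s        # mass-action rate of the single reaction
        s -= rate * dt      # substrate is consumed
        p += rate * dt      # product is formed at the same rate
        trajectory.append((i * dt, s, p))
    return trajectory


# Illustrative values only: k = 0.5 per time unit, simulated for 10 time units.
traj = simulate_conversion(s0=1.0, k=0.5, dt=0.01, steps=1000)
t_end, s_end, p_end = traj[-1]
print(f"t={t_end:.1f}  S={s_end:.4f}  P={p_end:.4f}")
```

The sketch only runs because both the mechanism (one reaction) and the rate constant `k` are specified; this is the kind of information that a text-mining pipeline would have to recover from the literature before a quantitative model of this type could be parameterized.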
However, in cases where quantitative data are scarce, qualitative models, such as Boolean network models, have proven useful [4]. Another modelling approach, which can deal with limited knowledge of the mechanisms underlying systems behavior and instead focuses on relationships represented as probabilities, is the Bayesian network model, or belief net. Although Bayesian networks have been widely used in disease modelling [5], they depend on the availability of prior knowledge that can be used for the design of the Bayesian network and the computing of the prior distribution. The first generation of text-mining applications has helped build Boolean models through entity recognition and co-occurrence networks [6]. Systems biology has since developed modelling strategies that represent information on causes and correlations in more detail; for example, OpenBEL [7] and, for pathway-related knowledge, BioPAX [8]. Disease classification systems and disease ontologies have facilitated the extraction of information that is relevant to modelling in systems biology; examples range from using ICD codes [9] on electronic patient records to the application of a dedicated ontology representing knowledge of Alzheimer's disease (AD)
Corresponding author: Hofmann-Apitius, M. (martin.hofmann-apitius@scai.fraunhofer.de)
doi:10.1016/j.drudis.2013.09.012 pmid:24070668 fatcat:ajnerehdqndv5nk3l6jak42n5e