Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL)

Juliane Fluck, Sumit Madan, Sam Ansari, Alpha T. Kodamullil, Reagon Karki, Majid Rastegar-Mojarad, Natalie L. Catlett, William Hayes, Justyna Szostak, Julia Hoeng, Manuel Peitsch
2016 Database: The Journal of Biological Databases and Curation  
Success in extracting biological relationships is mainly dependent on the complexity of the task as well as the availability of high-quality training data. Here, we describe the new corpora in the systems biology modeling language BEL for training and testing biological relationship extraction systems that we prepared for the BioCreative V BEL track. BEL was designed to capture relationships not only between proteins or chemicals, but also complex events such as biological processes or disease states. A BEL nanopub is the smallest unit of information and represents a biological relationship with its provenance. In BEL relationships (called BEL statements), the entities are normalized to defined namespaces mainly derived from public repositories, such as sequence databases, MeSH or publicly available ontologies. In the BEL nanopubs, the BEL statements are associated with citation information and supportive evidence such as a text excerpt. To enable the training of extraction tools, we prepared BEL resources and made them available to the community. We selected a subset of these resources focusing on a reduced set of namespaces, namely, human and mouse genes, ChEBI chemicals, MeSH diseases and GO biological processes, as well as relationship types 'increases' and 'decreases'. The published training corpus contains 11 000 BEL statements from over 6000 supportive text excerpts. For method evaluation, we selected and re-annotated two smaller subcorpora containing 100 text excerpts. For this re-annotation, the inter-annotator agreement was measured by the BEL track evaluation environment and resulted in a maximal F-score of 91.18% for full statement agreement. In addition, for a set of 100 BEL statements, we provide not only the gold standard expert annotations, but also text excerpts preselected by two automated systems. Those text excerpts were evaluated and manually annotated as true or false supportive in the course of the BioCreative V BEL track task.
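The nanopub structure and the statement-level agreement measure described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `BelNanopub` class, the `f_score` helper, the example statements, the evidence text and the placeholder citation are hypothetical and are not taken from the corpus. The score is a plain set-overlap F-score on exact statement matches, not the BEL track evaluation environment itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BelNanopub:
    """Hypothetical container: one BEL statement plus its provenance."""
    statement: str  # BEL statement with namespace-normalized entities
    citation: str   # citation information (placeholder identifier here)
    evidence: str   # supportive text excerpt


def f_score(gold: set, predicted: set) -> float:
    """Set-based F-score, counting only exact full-statement matches."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented example statements using the 'increases' and 'decreases' relations.
example = BelNanopub(
    statement="p(HGNC:TNF) increases p(HGNC:IL6)",
    citation="PMID:0000000",  # placeholder, not a real reference
    evidence="Invented sentence standing in for a supportive text excerpt.",
)

annotator_a = {example.statement, "a(CHEBI:glucose) decreases p(HGNC:INS)"}
annotator_b = {example.statement, "a(CHEBI:glucose) increases p(HGNC:INS)"}

agreement = f_score(annotator_a, annotator_b)
```

Treating one annotator's statements as the gold set and the other's as the prediction gives a symmetric agreement value here, because the F-score is unchanged when the two sets are swapped.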
doi:10.1093/database/baw113 pmid:27554092 pmcid:PMC4995071 fatcat:e43tl6edwnfldgugogqeny3x3u