Challenges for automatically extracting molecular interactions from full-text articles

Tara McIntosh, James R Curran
2009 BMC Bioinformatics  
The increasing availability of full-text biomedical articles will allow more biomedical knowledge to be extracted automatically with greater reliability. However, most Information Retrieval (IR) and Extraction (IE) tools currently process only abstracts. The lack of corpora has limited the development of tools that are capable of exploiting the knowledge in full-text articles. As a result, there has been little investigation into the advantages of full-text document structure, and the ... developers will face in processing full-text articles.

Results: We manually annotated passages from full-text articles that describe interactions summarised in a Molecular Interaction Map (MIM). Our corpus tracks the process of identifying facts to form the MIM summaries and captures any factual dependencies that must be resolved to extract the fact completely. For example, a fact in the results section may require a synonym defined in the introduction. The passages are also annotated with negated and coreference expressions that must be resolved. We describe the guidelines for identifying relevant passages and possible dependencies. The corpus includes 2162 sentences from 78 full-text articles. Our corpus analysis demonstrates the necessity of full-text processing; identifies the article sections where interactions are most commonly stated; and quantifies the proportion of interaction statements requiring coherent dependencies. Further, it allows us to report on the relative importance of identifying synonyms and resolving negated expressions. We also experiment with an oracle sentence retrieval system using the corpus as a gold-standard evaluation set.

Conclusion: We introduce the MIM corpus, a unique resource that maps interaction facts in a MIM to annotated passages within full-text articles. It is an invaluable case study providing guidance to developers of biomedical IR and IE systems, and can be used as a gold-standard evaluation set for full-text IR tasks.

Background

Almost all known and postulated knowledge relating to biological processes is recorded in the form of semi-structured full-text articles. The volume of biomedical literature rapidly becoming available makes it very difficult for biologists to keep abreast of even their narrowest specialist fields. The traditional keyword-based Information Retrieval (IR) over abstracts often retrieves too many arti-
doi:10.1186/1471-2105-10-311 pmid:19778419 pmcid:PMC2761905 fatcat:s6wm264gxjcw7ltw64lny5on2a