Concept annotation in the CRAFT corpus

Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A Baumgartner, K Cohen, Karin Verspoor, Judith A Blake, Lawrence E Hunter
2012 BMC Bioinformatics  
Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. Results: This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all
more » ... ions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.
doi:10.1186/1471-2105-13-161 pmid:22776079 pmcid:PMC3476437 fatcat:kuulhujx4ratnkoywafxggkhmi