Quantitative Assessment of Dictionary-based Protein Named Entity Tagging

H. Liu, Z.-Z. Hu, M. Torii, C. Wu, C. Friedman
2006, Journal of the American Medical Informatics Association (JAMIA)
Objective: Natural language processing (NLP) approaches have been explored to manage and mine information recorded in biological literature. A critical step in biological literature mining is biological named entity tagging (BNET), which identifies names mentioned in text and normalizes them to entries in biological databases. The aim of this study was to provide a quantitative assessment of the complexity of BNET on protein entities through BioThesaurus, a thesaurus of gene/protein names for UniProt knowledgebase (UniProtKB) entries that was acquired using online resources.

Methods: We evaluated the complexity from several perspectives: ambiguity (i.e., the number of genes/proteins represented by one name), synonymy (i.e., the number of names associated with the same gene/protein), and coverage (i.e., the percentage of gene/protein names in text included in the thesaurus). We also normalized names in BioThesaurus, and the measures were obtained twice, once before normalization and once after.

Results: The current version of BioThesaurus has over 2.6 million names, or 2.1 million normalized names, covering more than 1.8 million UniProtKB entries. The average synonymy is 3.53 (2.86 after normalization), the average ambiguity is 2.31 before normalization and 2.32 after, and the coverage is 94.0% based on the BioCreAtive data set comprising MEDLINE abstracts containing genes/proteins.

Conclusion: The study indicated that names for genes/proteins are highly ambiguous and that there are usually multiple names for the same gene or protein. It also demonstrated that most gene/protein names appearing in text can be found in BioThesaurus.

Figure 1. The construction of BioThesaurus. Annotation fields from Genpept, PSD, RefSeq, Entrez GENE, Swiss-Prot and TrEMBL were extracted and associated with iProClass entries. Several other databases were also included, such as model organism databases, HUGO, and ENZYME. Terms obtained from the annotation fields comprised the Raw Dictionary. An automatic curation process was performed using the UMLS. We also manually inspected highly ambiguous entries in the raw dictionary and removed nonsensical terms. After curation, we obtained BioThesaurus, where terms were associated with entities from iProClass.
BioThesaurus could be used for extensive information retrieval, investigating relationships among entities sharing the same name, biological named entity tagging, and serving as a gateway for protein information exploration.
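The three measures defined in the Methods (ambiguity, synonymy, coverage) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the toy name/entry pairs, the entry identifiers E1 to E3, the list of text mentions, the choice to average over all distinct names and all entries, and in particular the `normalize` function, since the abstract does not state how names were normalized.

```python
import re
from collections import defaultdict


def normalize(name):
    # Hypothetical normalization (an assumption, not the paper's rule):
    # lowercase the name and drop every non-alphanumeric character.
    return re.sub(r"[^a-z0-9]", "", name.lower())


def thesaurus_stats(pairs, mentions, norm=None):
    """Return (average synonymy, average ambiguity, coverage).

    pairs    -- iterable of (name, entry_id) tuples from the thesaurus
    mentions -- gene/protein names found in text
    norm     -- optional function applied to every name before counting
    """
    names_by_entry = defaultdict(set)   # entry -> distinct names
    entries_by_name = defaultdict(set)  # name  -> distinct entries
    for name, entry in pairs:
        key = norm(name) if norm else name
        names_by_entry[entry].add(key)
        entries_by_name[key].add(entry)

    # Synonymy: mean number of names per entry.
    synonymy = sum(len(s) for s in names_by_entry.values()) / len(names_by_entry)
    # Ambiguity: mean number of entries per name.
    ambiguity = sum(len(s) for s in entries_by_name.values()) / len(entries_by_name)
    # Coverage: fraction of text mentions that appear in the thesaurus.
    keys = [norm(m) if norm else m for m in mentions]
    coverage = sum(k in entries_by_name for k in keys) / len(keys)
    return synonymy, ambiguity, coverage


# Toy data, invented for this example only.
PAIRS = [
    ("p53", "E1"), ("P-53", "E1"), ("TP53", "E1"),
    ("p53", "E2"),
    ("CAT", "E3"), ("catalase", "E3"),
]
MENTIONS = ["p53", "TP53", "catalase", "BRCA1"]

raw = thesaurus_stats(PAIRS, MENTIONS)
normalized = thesaurus_stats(PAIRS, MENTIONS, norm=normalize)
print("before normalization:", raw)
print("after normalization: ", normalized)
```

On this toy input, normalization merges "p53" and "P-53" into one name, so the synonymy of E1 drops while the average ambiguity rises slightly because fewer distinct names share the same set of entries. This only shows how the measures respond to merging name variants; it says nothing about the actual BioThesaurus figures.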
doi:10.1197/jamia.m2085 pmid:16799122 pmcid:PMC1561801 fatcat:slluxn6gknhx5hpvexewpnr47a