Deep Learning meets Semantic Web: A feasibility study with the Cardiovascular Disease Ontology and PubMed citations

Mercedes Argüello Casteleiro, George Demetriou, Warren J. Read, Maria Jesus Fernandez Prieto, Diego Maseda-Fernandez, Goran Nenadic, Julie Klein, John A. Keane, Robert Stevens
2016 Workshop on Ontologies and Data in Life Sciences  
Automatic identification of gene and protein names from biomedical publications can help curators and researchers to keep up with the findings published in the scientific literature. This remains a challenging information retrieval task in the realm of Big Data Analytics. Objectives: To investigate the feasibility of using word embeddings (i.e. distributed word representations) from Deep Learning algorithms together with terms from the Cardiovascular Disease Ontology
(CVDO) as a step towards identifying omics information encoded in the biomedical literature. Methods: Word embeddings were generated using the neural language models CBOW and Skip-gram with an input of more than 14 million PubMed citations (titles and abstracts) corresponding to articles published between 2000 and 2016. Then the abstracts of selected papers from the sysVASC systematic review were manually annotated with gene/protein names. We set up two experiments that used the word embeddings to produce term variants for gene/protein names: the first experiment used the terms manually annotated from the papers; the second experiment enriched/expanded the annotated terms using terms from the human-readable labels of key classes (gene/proteins) from the CVDO ontology. CVDO is formalised in the W3C Web Ontology Language (OWL) and contains 172,121 UniProt Knowledgebase (UniProtKB) protein classes related to human and 86,792 UniProtKB protein classes related to mouse. The hypothesis is that enriching the original annotated terms provides a better context, and therefore makes it easier to obtain suitable (full and/or partial) term variants for gene/protein names from word embeddings. Results: From the manually annotated papers, a list of 107 terms (gene/protein names) was acquired. As part of the word embeddings generated from CBOW and Skip-gram, a lexicon with more than 9 million terms was created. Using the cosine similarity metric, a list of the 12 top-ranked terms was generated from word embeddings for query terms present in the generated lexicon. Domain experts evaluated a total of 1968 pairs of terms and classified the retrieved terms as: TV (term variant); PTV (partial term variant); and NTV (non term variant, meaning neither of the previous two categories). In experiment I, Skip-gram finds twice as many (full and/or partial) term variants for gene/protein names as CBOW.
Using Skip-gram, the weighted Cohen's Kappa inter-annotator agreement for two domain experts was 0.80 for the first experiment and 0.74 for the second experiment. In the first experiment, suitable (full and/or partial) term variants were found for 65 of the 107 terms. In the second experiment, the number increased to 100. Conclusion: This study demonstrates the benefits of using terms from the CVDO ontology classes to obtain more pertinent term variants for gene/protein names from word embeddings generated from an unannotated corpus with more than 14 million PubMed citations. As the term variants are induced from the biomedical literature, they can facilitate data tagging and semantic indexing tasks. Overall, our study explores the feasibility of obtaining methods that scale when dealing with big data, and which enable automation of deep semantic analysis and markup of textual information from unannotated biomedical literature.
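
The retrieval step described in the Results (ranking lexicon terms by cosine similarity to a query term and keeping the top 12) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `top_k_similar`, the four-term vocabulary, and the three-dimensional vectors are all invented for illustration, whereas the study used CBOW and Skip-gram embeddings over a lexicon of more than 9 million terms.

```python
import numpy as np


def top_k_similar(query, vocab, vectors, k=12):
    """Return up to k (term, cosine similarity) pairs closest to `query`.

    `vocab` is a list of terms and `vectors` a 2-D array whose i-th row is
    the embedding of vocab[i]. Both are hypothetical stand-ins for a
    trained CBOW or Skip-gram model.
    """
    index = {term: i for i, term in enumerate(vocab)}
    if query not in index:
        # Mirrors the abstract: only query terms present in the lexicon
        # can be used for retrieval.
        return []
    q = index[query]

    # Normalise each row to unit length so that a dot product between
    # two rows equals their cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    sims = unit @ unit[q]

    # Rank by descending similarity, dropping the query term itself.
    ranked = [i for i in np.argsort(-sims) if i != q]
    return [(vocab[i], float(sims[i])) for i in ranked[:k]]


# Toy data (assumed, not from the study): three spellings of one gene
# name placed close together, plus one unrelated term.
vocab = ["il6", "interleukin-6", "il-6", "aspirin"]
vectors = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [1.0, 0.0, 0.1],
    [0.0, 0.0, 1.0],
])

neighbours = top_k_similar("il6", vocab, vectors, k=2)
```

In the study, each retrieved term would then be paired with its query term and classified by domain experts as TV, PTV, or NTV; the sketch covers only the ranking step.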
dblp:conf/odls/CasteleiroDRPMN16 fatcat:gxdpzlhwfjeu7kwzkk5y6v3bve