PMCVec: Distributed Phrase Representation for Biomedical Text Processing

Zelalem Gero, Joyce Ho
2019 Journal of Biomedical Informatics: X  
Distributed semantic representation of biomedical text can be beneficial for text classification, named entity recognition, query expansion, human comprehension, and information retrieval. Despite the success of high-quality vector space models such as Word2Vec and GloVe, they provide only unigram word representations, and the semantics of multi-word phrases can only be approximated by composition. This is problematic in biomedical text processing, where technical phrases for diseases, symptoms,
and drugs should be represented as single entities to capture the correct meaning. In this paper, we introduce PMCVec, an unsupervised technique that generates important phrases from PubMed abstracts and learns embeddings for single words and multi-word phrases simultaneously. Evaluations performed on benchmark datasets produce significant performance gains both qualitatively and quantitatively.

... 'nuclear magnetic resonance', may not be well expressed as a composition of the individual words. Therefore, it is important to build a distributed representation that captures not only single words but also multi-word phrases. Learning distributed phrase and word embeddings has been shown to be effective on a general, non-domain-specific corpus [26]. Yet one of the key challenges is to identify useful phrases. While this task is well studied, many existing works require annotation or extensive computation to achieve good performance [4, 10, 35, 37, 44]. A new unsupervised method has been proposed to collect over 700,000 common phrases that may be useful for biomedical NLP from PubMed articles [20]. Unfortunately, including all possible phrases into the ...
doi:10.1016/j.yjbinx.2019.100047 fatcat:puejryvlavaivaxrlovkmsculi