RANKING GENE-DRUG RELATIONSHIPS IN BIOMEDICAL LITERATURE USING LATENT DIRICHLET ALLOCATION

YONGHUI WU, MEI LIU, W. JIM ZHENG, ZHONGMING ZHAO, HUA XU
2011 Biocomputing 2012  
Drug responses vary greatly among individuals due to human genetic variations, which is known as pharmacogenomics (PGx). Much of the PGx knowledge has been embedded in biomedical literature and there is a growing interest to develop text mining approaches to extract such knowledge. In this paper, we present a study to rank candidate gene-drug relations using Latent Dirichlet Allocation (LDA) model. Our approach consists of three steps: 1) recognize gene and drug entities in MEDLINE abstracts;
more » ... extract candidate gene-drug pairs based on different levels of co-occurrence, including abstract level, sentence level, and phrase level; and 3) rank candidate gene-drug pairs using multiple different methods including term frequency, Chi-square test, Mutual Information (MI), a reported Kullback-Leibler (KL) distance based on topics derived from LDA (LDA-KL), and a newly defined probabilistic KL distance based on LDA (LDA-PKL). We systematically evaluated these methods by using a gold standard data set of gene-drug relations derived from PharmGKB. Our results showed that the proposed LDA-PKL method achieved better Mean Average Precision (MAP) than any other methods, suggesting its promising uses for ranking and detecting PGx relations.
doi:10.1142/9789814366496_0041 fatcat:v7gih4dywne57lg3fzekwdfvf4