Comparative Effectiveness of Knowledge Graphs- and EHR Data-Based Medical Concept Embedding for Phenotyping
Objective: Concept identification is a major bottleneck in phenotyping. Properly learned medical concept embeddings (MCEs) capture the semantic meaning of medical concepts and are therefore useful for feature engineering in phenotyping tasks. The objective of this study is to compare the effectiveness of MCEs learned from knowledge graphs and from electronic health record (EHR) data in facilitating high-throughput phenotyping. Materials and Methods: We investigated four MCEs learned from different data sources and methods.
... phs were obtained from the Observational Medical Outcomes Partnership (OMOP) common data model. Medical concept co-occurrence statistics were obtained from the OMOP database of Columbia University Irving Medical Center (CUIMC). Two embedding methods, node2vec and GloVe, were used to learn embeddings for medical concepts. We used phenotypes and their corresponding concepts, generated and validated by the Electronic Medical Records and Genomics (eMERGE) network, to evaluate how well the learned MCEs identify phenotype-relevant concepts. Results: MCEs were evaluated by Precision@k% and Recall@k% in identifying phenotype-relevant concepts from a single seed concept and from multiple seed concepts. Based on a single seed concept, Recall@500% and Precision@500% were 0.64 and 0.13 for the MCE learned from the enriched knowledge graph, compared with 0.61 and 0.12 for the hierarchical knowledge graph, 0.51 and 0.10 for the 5-year windowed EHR, and 0.46 and 0.09 for the visit-windowed EHR. Conclusion: Medical concept embedding enables scalable identification of phenotype-relevant medical concepts, thereby facilitating high-throughput phenotyping. Knowledge graphs constructed from hierarchical relationships among medical concepts yield more effective MCEs than those learned from EHR data, highlighting the need for more sophisticated use of big data to leverage MCEs for phenotyping.
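The evaluation described above can be illustrated with a minimal sketch. The abstract does not define Precision@k% and Recall@k%, so the code assumes one plausible reading: candidates are ranked by cosine similarity to the seed concept(s), and the top N are retrieved, where N is k% of the number of phenotype-relevant concepts. Under that reading Precision@500% cannot exceed 0.2 and equals Recall@500% divided by 5, which matches the reported pairs (for example, 0.64 and 0.13). The function names, the averaging of multiple seed vectors, and the toy two-dimensional embeddings are all hypothetical and are not taken from the study.

```python
import numpy as np


def rank_by_cosine(embeddings, seed_ids):
    """Rank non-seed concepts by cosine similarity to the mean seed vector.

    Assumption: multiple seeds are combined by averaging their unit vectors.
    """
    ids = list(embeddings)
    mat = np.array([embeddings[i] for i in ids], dtype=float)
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)  # unit-normalize rows
    seed = np.mean([mat[ids.index(s)] for s in seed_ids], axis=0)
    scores = mat @ seed
    order = np.argsort(-scores, kind="stable")  # highest similarity first
    seeds = set(seed_ids)
    return [ids[i] for i in order if ids[i] not in seeds]


def precision_recall_at_k_percent(ranked, relevant, k_percent):
    """Precision and recall over the top N candidates.

    Assumption: N = k% of the number of relevant concepts.
    """
    relevant = set(relevant)
    n = int(round(len(relevant) * k_percent / 100))
    top = ranked[:n]
    hits = sum(1 for c in top if c in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy embeddings (hypothetical concept IDs and vectors, for illustration only).
toy_embeddings = {
    "A": [1.0, 0.0],   # seed concept
    "B": [0.9, 0.1],
    "C": [0.8, 0.6],
    "D": [0.0, 1.0],
    "E": [-1.0, 0.0],
    "F": [0.5, 0.5],
}
toy_relevant = {"B", "F"}  # hypothetical gold-standard concepts, seed excluded

toy_ranked = rank_by_cosine(toy_embeddings, ["A"])
p100, r100 = precision_recall_at_k_percent(toy_ranked, toy_relevant, 100)
p150, r150 = precision_recall_at_k_percent(toy_ranked, toy_relevant, 150)
print(toy_ranked, (p100, r100), (p150, r150))
```

Raising k trades precision for recall: a larger cutoff admits more candidates, so more relevant concepts are recovered while the fraction of retrieved concepts that are relevant is capped at 100/k.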