Machine learning approaches for electronic health records phenotyping: A methodical review
Objective: Accurate and rapid methods for phenotyping are a prerequisite to realizing the potential of electronic health records (EHRs) data for clinical and translational research. This study reviews the literature on machine learning (ML) approaches for phenotyping with respect to the phenotypes considered, the data sources and methods used, and the contributions within the wider context of EHR-based research.

Materials and Methods: We searched for relevant articles in PubMed and Web of Science
... lished between January 1, 2018 and April 14, 2022. After screening, we collected data on 52 variables across 106 selected articles.

Results: ML-based methods were developed for 156 unique phenotypes, primarily using EHR data from a single institution or health system. Of the 106 articles, 72 leveraged unstructured data in clinical notes. In terms of methodology, supervised learning was the most prevalent ML paradigm (n = 64, 60.4%), with half of the articles employing deep learning. Semi-supervised and weakly supervised approaches were applied to reduce the burden of obtaining gold-standard labeled data (n = 21, 19.8%), while unsupervised learning was used for phenotype discovery (n = 20, 18.9%). Federated learning was applied to develop algorithms across multiple institutions while preserving data privacy (n = 2, 1.9%).

Discussion: While the use of ML for phenotyping is growing, most articles applied traditional supervised ML to characterize the presence of common, chronic conditions.

Conclusion: Continued research in ML-based methods is warranted, with particular attention to the development of advanced methods for complex phenotypes and standards for reporting and evaluating phenotyping algorithms.