Divide and Conquer: Clustering Patients With Autism by Gene Expression Profiles Using Machine Learning Algorithms [post]

Ping-I Lin, Mohammad Ali Moni, Valsamma Eapen, Susan Shur-Fen Gau
2020 unpublished
Background: Clinical heterogeneity in autism spectrum disorder (ASD) can complicate diagnosis and treatment. The identification of biomarkers may hold the key to the classification of ASD subgroups. Accumulating evidence suggests that genetic or genomic markers may facilitate the clustering of patients with ASD. The goal of the current study is to use machine learning algorithms to analyze microarray data to identify clusters with relatively homogeneous clinical features, such as language ... ion.
Methods: Whole-genome gene expression microarray data were used to predict communication quotient (SCQ) scores against all probes to select differential expression regions (DERs). Gene set enrichment analysis was performed to identify hub pathways that play a role in the severity of social communication deficits inherent to ASD. We then used two machine learning methods, random forest classification (RF) combined with partitioning around medoids (PAM) and support vector machine (SVM), to identify two clusters using the DERs. Finally, we evaluated how accurately the clusters predicted language impairment.
Results: A total of 191 DERs were identified. Cholesterol biosynthesis and metabolism pathways appear to act as hubs that connect other trait-associated pathways to influence the severity of social communication deficits inherent to ASD. Both the RF and SVM algorithms yielded a classification accuracy greater than 90% when all 191 DERs were analyzed.
Limitations: The primary limitation of the current study is the small sample size. Nevertheless, some machine learning algorithms, such as SVM, can handle a small sample with a large number of features. Additionally, model overfitting may arise because no independent sample was available for validation. Furthermore, unknown confounders may cause spurious associations between the phenotype and genomic markers.
Conclusions: The ASD subtypes defined by the presence of language impairment, a strong indicator for prognosis, can be predicted by transcriptomic profiles associated with social communication deficits and with cholesterol biosynthesis and metabolism. Our proof-of-concept study suggests that both RF and SVM are acceptable machine learning algorithms for identifying ASD subgroups characterized by clinical homogeneity related to prognosis.
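The two-cluster step mentioned in the Methods can be illustrated with a minimal sketch of partitioning around medoids (PAM). This is not the authors' pipeline: the data are synthetic, plain Euclidean distance stands in for the random-forest-derived dissimilarity, and the names `pam`, `points`, and `distance` are hypothetical, chosen only for this illustration.

```python
import random
from math import dist


def pam(distance, k):
    """Partition around medoids on a precomputed distance matrix (BUILD + SWAP)."""
    n = len(distance)

    def cost(medoids):
        # Total distance from every sample to its nearest medoid.
        return sum(min(distance[i][m] for m in medoids) for i in range(n))

    # BUILD: greedily add the sample that lowers the total cost the most.
    medoids = []
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(min(candidates, key=lambda c: cost(medoids + [c])))

    # SWAP: exchange a medoid with a non-medoid while that lowers the cost.
    improved = True
    while improved:
        improved = False
        current = cost(medoids)
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = cost(trial)
                if trial_cost < current:
                    medoids, current, improved = trial, trial_cost, True

    # Assign each sample to the index of its nearest medoid.
    labels = [min(range(k), key=lambda j: distance[i][medoids[j]]) for i in range(n)]
    return medoids, labels


# Synthetic stand-in for expression profiles: two groups of 10 samples,
# 5 features each, with shifted means (assumed values, not study data).
rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(10)]
points += [[rng.gauss(6.0, 1.0) for _ in range(5)] for _ in range(10)]

# Pairwise Euclidean distances (an assumption; the study pairs PAM with RF).
distance = [[dist(a, b) for b in points] for a in points]

medoids, labels = pam(distance, k=2)
print("medoid indices:", medoids)
print("cluster labels:", labels)
```

Because PAM only needs a dissimilarity matrix, the `distance` argument could in principle be swapped for any other pairwise dissimilarity without changing the clustering routine.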
doi:10.21203/rs.3.rs-87427/v1 fatcat:7dqbrrmoonht5dbbw3b56k42qa