Explainable Machine Learning to Identify Patient-specific Biomarkers for Lung Cancer
Masrur Sobhan, Ananda Mohan Mondal
Background: Lung cancer is the leading cause of cancer death in the USA, although it is not the most commonly diagnosed cancer. Its overall survival rate remains unsatisfactory even with cutting-edge treatment methods. Genomic profiling and the identification of biomarker genes altered in lung cancer patients may inform lung cancer therapeutics. The biomarker genes identified by most existing methods (statistical and machine learning based) characterize the whole cohort or population. As a result, different people with the same disease receive the same kind of treatment, with differing outcomes in terms of success and side effects. Identifying biomarker genes for individual patients is therefore crucial for finding efficacious therapeutics.
Methods: In this study, we propose a pipeline to identify class-specific and patient-specific key genes that may help formulate effective therapies for lung cancer patients. We used two subtypes of lung cancer, lung adenocarcinoma and lung squamous cell carcinoma, to identify subtype-specific (class-specific) and patient-specific key genes using an explainable machine learning approach, SHAP. This approach assigns a score to each gene for each patient, quantifying the attribution of each feature (gene) for each sample (patient).
Results: We applied two variations of SHAP, the tree explainer and the gradient explainer, with a tree-based classifier, XGBoost, and a deep learning-based classifier, a convolutional neural network (CNN), as the respective classification algorithms. The classification accuracies of the XGBoost and CNN models were 96.3% and 92.6%, respectively. Both SHAP explainers provided attribution scores for each gene in each sample. The class-specific top-100 genes were compared with the differentially expressed genes (DEGs), as both represent population-based biomarkers. We also found minimal overlap among the class-specific top-100 gene sets, which supports the view that these genes are truly class-specific. Similarly, we identified the patient-specific top-100 genes based on the SHAP scores. Very few genes were common among the patients, which implies that the identified patient-specific genes are particular to individual lung cancer patients.
Our results also show that many genes were shared among the top-100 genes identified for healthy individuals, which we attribute to fewer mutations and genomic alterations in these samples.
Conclusion: The proposed approach is capable of identifying key biomarker genes using SHAP. This study compared the results of two SHAP explainers in the context of tree-based and neural network-based classifiers trained on lung cancer RNA-seq data. We also showed that the class-specific SHAP genes are biologically significant by comparing them with the output of a statistical approach, differential gene expression (DGE) analysis. This study also determined patient-specific genes from SHAP scores, which may provide biological insights and aid personalized medicine. The proposed pipeline can be used to determine biologically relevant class-specific and patient-specific genes, which can support the diagnosis of cancer patients.
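
The gene-selection and overlap steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a matrix of SHAP scores (one row per patient, one column per gene) is already available, ranks genes by absolute score, and aggregates class-level importance as the mean absolute score over the patients of a class. The function names `top_k_genes`, `class_top_k`, and `jaccard`, the toy scores, and the choice of `k=2` (in place of the top 100) are all hypothetical.

```python
from itertools import combinations


def top_k_genes(scores, genes, k):
    """Return the set of k genes with the largest absolute attribution score."""
    ranked = sorted(zip(genes, scores), key=lambda gs: abs(gs[1]), reverse=True)
    return {gene for gene, _ in ranked[:k]}


def class_top_k(shap_rows, labels, genes, target, k):
    """Rank genes by mean |score| over the samples of one class (an assumed aggregation)."""
    rows = [row for row, lab in zip(shap_rows, labels) if lab == target]
    mean_abs = [sum(abs(r[j]) for r in rows) / len(rows) for j in range(len(genes))]
    return top_k_genes(mean_abs, genes, k)


def jaccard(a, b):
    """Overlap of two gene sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy SHAP scores (hypothetical): 4 patients x 5 genes.
genes = ["G1", "G2", "G3", "G4", "G5"]
labels = ["LUAD", "LUAD", "LUSC", "LUSC"]
shap_rows = [
    [0.9, -0.1, 0.05, 0.4, 0.0],
    [0.7, 0.0, -0.6, 0.1, 0.05],
    [0.0, -0.8, 0.1, 0.05, 0.5],
    [0.1, 0.6, 0.0, -0.7, 0.05],
]

# Patient-specific gene sets: one top-k set per row.
patient_sets = [top_k_genes(row, genes, k=2) for row in shap_rows]

# Class-specific gene sets: one top-k set per subtype.
class_sets = {c: class_top_k(shap_rows, labels, genes, c, k=2) for c in sorted(set(labels))}

# Pairwise overlap between patients; low values indicate patient-specific sets.
pairwise = {
    (i, j): jaccard(patient_sets[i], patient_sets[j])
    for i, j in combinations(range(len(patient_sets)), 2)
}

print(patient_sets)
print(class_sets)
print(pairwise)
```

With real data, the rows of `shap_rows` would be the per-sample attribution scores produced by a SHAP explainer, and the same overlap measure could be applied between a class-specific gene set and a list of DEGs.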