Robust and efficient identification of biomarkers from RNA-Seq data using median control chart

Md Shahjaman, Habiba Akter, Md. Mamunur Rashid, Md. Ibnul Asifuzzaman, Md. Bipul Hossen, Md. Rezanur Rahman
2019 F1000Research  
One of the main goals of RNA-seq data analysis is identification of biomarkers that are differentially expressed (DE) across two or more experimental conditions. RNA-seq uses next generation sequencing technology and it has many advantages over microarrays. Numerous statistical methods have already been developed for identification the biomarkers from RNA-seq data. Most of these methods were based on either Poisson distribution or negative binomial distribution. However, efficient biomarker
more » ... cient biomarker identification from discrete RNA-seq data is hampered by existing methods when the datasets contain outliers or extreme observations. Specially, the performance of these methods becomes more severe when the data come from a small number of samples in the presence of outliers. Therefore, in this study, an attempt is made to propose an outlier detection and modification approach for RNA-seq data to overcome the aforesaid problems of traditional methods. We make our proposed method facilitate in RNA-seq data by transforming the read count data into continuous data. Methods: We use median control chart to detect and modify the outlying observation in a log-transformed RNA-seq dataset. To investigate the performance of the proposed method in absence and presence of outliers, we employ the five popular biomarker selection methods (edgeR, edgeR_robust, DEseq, DEseq2 and limma) both in simulated and real datasets. Results: The simulation results strongly suggest that the performance of the proposed method improved in the presence of outliers. The proposed method also detected an additional 18 outlying DE genes from a real mouse RNA-seq dataset that were not detected by traditional methods. Using the KEGG pathway and gene ontology analysis results we reveal that these genes may be biomarkers, which require validation in a wet lab. Conclusions: Our proposal is to apply the proposed method for biomarker identification from other RNA-seq data.
doi:10.12688/f1000research.17351.1 fatcat:f4ro2ytibngvtlaucbw5arh6pe