Evaluation of Classical Statistical Methods for Analyzing BS-Seq Data
DNA methylation is an epigenetic change that is not only important in normal cell development, but also plays a significant role in human health and disease. Therefore, studies of DNA methylation have been actively pursued to clarify the precise role of this modification in disease etiology and its potential as a biomarker of disease. One key issue in analyzing DNA methylation data is the detection of significant differences in methylation levels between diseased individuals and healthy controls. In recent years, molecular technology has been developed to produce bisulfite sequencing (BS-Seq) data, which provide single-base resolution. For such data, the methylation count at a single site follows a binomial distribution whose success probability reflects the methylation level at that site. Traditional hypothesis-testing methods, such as Fisher's exact (FE) test, have been applied to detect differentially methylated cytosines (DMCs). Although the FE test is widely used, its "fixed margin" assumption has been called into question in such applications. Furthermore, the FE test cannot account for biological variability between samples within a group. Statistical tests that do not rely on the fixed-margin assumption exist, including the computationally efficient Storer-Kim (SK) test. However, whether such methods outperform the FE test for detecting DMCs, with or without within-group variation, is unknown. In this study, we compared the performance of several traditional hypothesis-testing methods from both statistical and biological perspectives, based on simulated data, real data, and theoretical analyses. Our results show that the unconditional SK test consistently outperforms the conditional FE test for the detection of DMCs. This advantage is especially noteworthy in studies with limited sequencing depth.
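To make the contrast between the conditional and unconditional approaches concrete, the following is a minimal Python sketch for a single cytosine, assuming the read counts of each group have been pooled into one methylated count out of a total depth. The function names (`fisher_exact_two_sided`, `plugin_unconditional_test`) and the example counts are hypothetical and chosen only for illustration. The second function is a plug-in unconditional test in the spirit of the SK test: it estimates the common methylation level under the null from the pooled data and sums the joint binomial probabilities of all outcomes at least as extreme as the observed one. It is not necessarily the exact procedure evaluated in this study.

```python
from math import comb


def fisher_exact_two_sided(m1, n1, m2, n2):
    """Two-sided Fisher's exact p-value.

    m1 of n1 reads are methylated in group 1, m2 of n2 in group 2.
    Conditions on both margins (the "fixed margin" assumption).
    """
    total_meth = m1 + m2
    denom = comb(n1 + n2, total_meth)

    def prob(k):
        # Hypergeometric probability of k methylated reads in group 1.
        return comb(n1, k) * comb(n2, total_meth - k) / denom

    p_obs = prob(m1)
    lo = max(0, total_meth - n2)
    hi = min(n1, total_meth)
    # Sum every table that is no more probable than the observed one;
    # the small relative tolerance guards against floating-point ties.
    p = sum(prob(k) for k in range(lo, hi + 1)
            if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def binom_pmf(k, n, p):
    """Binomial probability of k methylated reads out of n."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def plugin_unconditional_test(m1, n1, m2, n2):
    """Plug-in unconditional test (assumed SK-style sketch).

    Only the depths n1 and n2 are treated as fixed. The nuisance
    parameter is replaced by the pooled estimate of the methylation level.
    """
    p_hat = (m1 + m2) / (n1 + n2)
    pmf1 = [binom_pmf(i, n1, p_hat) for i in range(n1 + 1)]
    pmf2 = [binom_pmf(j, n2, p_hat) for j in range(n2 + 1)]
    # Compare |i/n1 - j/n2| with the observed difference using integer
    # cross-multiplication, which avoids floating-point comparison issues.
    observed = abs(m1 * n2 - m2 * n1)
    p = sum(pmf1[i] * pmf2[j]
            for i in range(n1 + 1)
            for j in range(n2 + 1)
            if abs(i * n2 - j * n1) >= observed)
    return min(1.0, p)


if __name__ == "__main__":
    # Hypothetical site: 8/10 methylated reads in cases, 2/10 in controls.
    print("FE p-value:           ", fisher_exact_two_sided(8, 10, 2, 10))
    print("Unconditional p-value:", plugin_unconditional_test(8, 10, 2, 10))
```

In this toy example, with a depth of 10 reads per group, the FE p-value is about 0.023 and the plug-in unconditional p-value is about 0.012, which illustrates how conditioning on both margins can make the FE test more conservative at low depth. Neither function models within-group biological variability, so the sketch applies only to the pooled-count setting.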