A Novel Over-Sampling Method and its Application to Cancer Classification from Gene Expression Data

Xuan Tho Dang, Osamu Hirose, Duong Hung Bui, Thammakorn Saethang, Vu Anh Tran, Lan Anh T. Nguyen, Tu Kien T. Le, Mamoru Kubo, Yoichi Yamada, Kenji Satou
2013 Chem-Bio Informatics Journal  
One of the most critical and frequent problems in biomedical data classification is imbalanced class distribution, where samples from the majority class significantly outnumber the minority class. SMOTE is a well-known general over-sampling method used to address this problem; however, in some cases it cannot improve or even reduces classification performance. To address these issues, we have developed a novel minority over-sampling method named safe-SMOTE. Experimental results from two gene
more » ... ression datasets for cancer classification (i.e., colon-cancer and leukemia) and six imbalanced benchmark datasets from the UCI Machine Learning Repository showed that our method achieved better sensitivity and G-mean values than both the control method (i.e., no over-sampling) and SMOTE. For example, in the colon-cancer dataset, although the sensitivity and specificity achieved by SMOTE (81.36% and 88.63%) were lower than for the control method (81.59% and 89.50%), safe-SMOTE in contrast had these values increase (81.82% and 90.50%). Similarly, the G-mean value of the control (85.45%) decreased to 84.91% when SMOTE was employed, but increased to 86.04% when using safe-SMOTE. In the leukemia dataset, SMOTE was able to improve the sensitivity and G-mean values with respect to the control; however, safe-SMOTE achieved noticeable, even greater improvements for both of these criteria.
doi:10.1273/cbij.13.19 fatcat:c3prq5oeundvjk3rgexeatfas4