Support Vector Machines Classification on Class Imbalanced Data: A Case Study with Real Medical Data
Journal of Data Science
Support vector machines (SVMs) constitute one of the most popular and powerful classification methods. However, their performance can be limited on highly imbalanced datasets: a classifier trained on an imbalanced dataset can produce a model biased towards the majority class, with a high misclassification rate for the minority class. For many applications, especially medical diagnosis, it is of high importance to accurately distinguish false negative from false positive results. The purpose of this study is to evaluate the performance of a classifier while keeping the correct balance between sensitivity and specificity, in order to enable successful trauma outcome prediction. We compare the standard (or classic) SVM (C-SVM) with resampling methods and a cost-sensitive method, called Two-Cost SVM (TC-SVM), which constitute widely accepted strategies for imbalanced datasets; the derived results are discussed in terms of sensitivity analysis and receiver operating characteristic (ROC) curves.
Despite their effectiveness on balanced datasets, standard SVMs can prove inappropriate when faced with imbalanced data. The issue of imbalanced data is recognized as a crucial problem in the machine learning community (Chawla et al., 2004). In these cases, classifiers tend to be overpowered by the majority class and to ignore the minority examples, since they assume an equal misclassification cost for all errors. Therefore, the produced models are often biased toward the majority class and perform poorly on the minority class. Furthermore, classifiers are typically designed to maximize overall accuracy, which is not an appropriate evaluation measure for imbalanced data. Consequently, in order to handle imbalanced data we should both consider improved algorithms and adopt other performance metrics, such as the geometric mean and the AUC, instead of accuracy. In parallel, for many applications, especially medical diagnosis where normal cases form the majority, the correct balance between sensitivity and specificity matters most, since we have to accurately distinguish false negative results from false positives. Numerous recent works, including preprocessing and algorithmic methods, have been proposed to deal with the crucial problem of imbalanced data.
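To make the evaluation criteria concrete, the following sketch computes the geometric mean of sensitivity and specificity and the AUC for an SVM on a synthetic imbalanced problem. This is an illustration with scikit-learn, not the study's own code or the Trauma dataset; the 90/10 class ratio is an arbitrary assumption.

```python
# Illustrative sketch (assumed setup, not the study's data): evaluating an
# SVM on an imbalanced binary problem with G-mean and AUC instead of accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, roc_auc_score

# Synthetic imbalanced data: ~90% majority (class 0), ~10% minority (class 1).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate (minority class)
specificity = tn / (tn + fp)   # true negative rate (majority class)
g_mean = (sensitivity * specificity) ** 0.5
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"G-mean={g_mean:.3f}, AUC={auc:.3f}")
```

Unlike overall accuracy, a low sensitivity on the minority class drags the geometric mean down even when the majority class is classified almost perfectly.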
These techniques can be sorted into two categories: preprocessing the data, by oversampling the minority instances or undersampling the majority instances, and algorithmic methods, including cost-sensitive learning (Batuwita and Palade, 2013). In our comparative study we use a cost-sensitive learning technique proposed by Veropoulos et al. (1999), called "TC-SVM" because it uses two different costs for the two classes. In addition, we apply two forms of resampling, namely random over-sampling and random under-sampling. Finally, we consider a combination of a widely used method, the Synthetic Minority Oversampling Technique (SMOTE) proposed by Chawla et al. (2002), with random undersampling; the corresponding results are presented in the last section. Parpoula et al. (2013) have already dealt with the analysis of a large-dimensional Trauma dataset; however, their study lies in the comparison of several high-powered data mining techniques. The motivation for the present study, applied to the medical dataset in question, is not only to enable successful trauma outcome prediction, improving the quality of the prediction model, but also to evaluate the performance of a classifier faced with imbalanced data while keeping the correct balance between sensitivity and specificity. To this end, we compare the performance of the standard SVM with the TC-SVM, random over-/under-sampling and a combination of the SMOTE method with undersampling, and we discuss the derived results in terms of sensitivity analysis. The merits of our comparative study on a real medical dataset show the effectiveness of the considered approaches. The rest of this paper is organized as follows. In Section 2, we present the theoretical background of the considered SVM classifiers.
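The three strategies named above can be sketched minimally as follows. This is an assumed illustration with NumPy and scikit-learn on synthetic data (the study itself may have used different tooling, e.g. the imbalanced-learn library); the simplified SMOTE step interpolates toward a random minority neighbour rather than using k-nearest neighbours as the original method does.

```python
# Hedged sketch of the imbalance-handling strategies discussed in the text:
# (1) Two-Cost SVM via per-class penalties, (2) random under-sampling,
# (3) a simplified SMOTE-style interpolation. Synthetic data only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_min, X_maj = X[y == 1], X[y == 0]

# (1) Two-Cost SVM (Veropoulos et al., 1999): a larger error penalty C+ on
# the minority class, expressed here through scikit-learn's class_weight.
tc_svm = SVC(kernel="rbf", C=1.0,
             class_weight={0: 1.0, 1: len(X_maj) / len(X_min)}).fit(X, y)

# (2) Random under-sampling: keep a random majority subset of minority size.
keep = rng.choice(len(X_maj), size=len(X_min), replace=False)
X_under = np.vstack([X_maj[keep], X_min])
y_under = np.r_[np.zeros(len(X_min)), np.ones(len(X_min))]

# (3) SMOTE-style over-sampling (Chawla et al., 2002), simplified:
# synthesize points on the segment between minority samples.
partners = X_min[rng.integers(len(X_min), size=len(X_min))]
X_synth = X_min + rng.random((len(X_min), 1)) * (partners - X_min)
X_smote = np.vstack([X, X_synth])
y_smote = np.r_[y, np.ones(len(X_synth))]
```

In practice (2) and (3) would feed a subsequent SVM fit, and SMOTE can be chained with random undersampling, as the study does, to moderate both the synthetic inflation of the minority class and the information loss from discarding majority examples.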
In Section 3, we present the SVM analysis and carry out a comparative study of the considered methods in terms of accuracy, geometric mean and the Area Under the ROC Curve (AUC). We also describe the performance criteria used for the evaluation of the employed methods. Finally, in Section 4, we summarize the results of our study and highlight some concluding remarks. Note that we use the terms "classic" and "standard" SVM interchangeably to mean the soft-margin SVM, and we likewise treat the terms "Gaussian", "radial" and "RBF" kernel as referring to exactly the same kernel.
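The terminological note on the kernel can be verified directly: "Gaussian", "radial basis" and "RBF" all denote k(x, x') = exp(-γ ||x − x'||²). A minimal check against scikit-learn's implementation (an illustration, not part of the study):

```python
# Check that the Gaussian/radial/RBF kernel names denote the same function
# k(x, x') = exp(-gamma * ||x - x'||^2), as computed by sklearn's rbf_kernel.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x, xp = np.array([[0.0, 1.0]]), np.array([[2.0, 3.0]])
gamma = 0.5
manual = np.exp(-gamma * np.sum((x - xp) ** 2))  # exp(-0.5 * 8) = exp(-4)
assert np.isclose(rbf_kernel(x, xp, gamma=gamma)[0, 0], manual)
```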