Performance analysis of cost-sensitive learning methods with application to imbalanced medical data

Ibomoiye Domor Mienye, Yanxia Sun
2021 Informatics in Medicine Unlocked  
Many real-world machine learning applications require building models from highly imbalanced datasets. In medical datasets, the healthy patients or samples are usually dominant, making them the majority class, while the sick patients are few, making them the minority class. Researchers have proposed numerous machine learning methods to predict medical diagnoses, but the class imbalance problem makes it difficult for classifiers to adequately learn and distinguish between the minority and majority classes. Cost-sensitive learning and resampling techniques are used to deal with the class imbalance problem. This research focuses on developing robust cost-sensitive classifiers by modifying the objective functions of well-known algorithms, such as logistic regression, decision tree, extreme gradient boosting, and random forest, which are then used to efficiently predict medical diagnoses. Unlike resampling techniques, our approach does not alter the original data distribution. First, we implement the standard versions of these algorithms to provide a baseline for performance comparison. Second, we develop their corresponding cost-sensitive versions. The proposed approaches do not require changing the distribution of the original data, as the modified algorithms account for the imbalanced class distribution during training, thereby yielding more reliable performance than resampling the data. Four popular medical datasets, namely the Pima Indians Diabetes, Haberman Breast Cancer, Cervical Cancer Risk Factors, and Chronic Kidney Disease datasets, are used in the experiments to validate the proposed approach. The experimental results show that the cost-sensitive methods yield superior performance compared to the standard algorithms.

Most machine learning algorithms used for binary classification tasks assume an even distribution of the classes. Hence, when trained with imbalanced data, the model is dominated by samples from the majority class, which degrades its performance [6]. This problem is so crucial that it is viewed as one of the ten big challenges in machine learning research [7]. Furthermore, ML algorithms assume that misclassification errors (false negatives and false positives) are equally costly [8]. However, this assumption can be dangerous in imbalanced classification problems such as medical diagnosis, fraud detection, and access control systems [9].
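To make the idea of modifying an objective function concrete, the sketch below shows one common way to render logistic regression cost-sensitive: re-weighting each sample's contribution to the log-loss by the inverse frequency of its class, so that errors on the minority class count more during training. The toy data, the inverse-frequency weighting scheme, and the hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
import math
import random

random.seed(0)

# Toy imbalanced dataset: 95 "healthy" samples (y=0) near 0,
# 5 "sick" samples (y=1) near 2. One feature for simplicity.
X = [[random.gauss(0.0, 1.0)] for _ in range(95)] + \
    [[random.gauss(2.0, 1.0)] for _ in range(5)]
y = [0] * 95 + [1] * 5

# Inverse-frequency class weights: w_c = n / (2 * n_c)  (an assumption).
n, n_pos = len(y), sum(y)
w_pos, w_neg = n / (2 * n_pos), n / (2 * (n - n_pos))

def fit(X, y, weights, lr=0.1, epochs=2000):
    """Gradient descent on the (optionally sample-weighted) logistic loss."""
    b0, b1 = 0.0, 0.0                      # intercept, slope
    for _ in range(epochs):
        g0 = g1 = 0.0
        for (x,), t, wt in zip(X, y, weights):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += wt * (p - t)             # weighted gradient terms
            g1 += wt * (p - t) * x
        b0 -= lr * g0 / len(y)
        b1 -= lr * g1 / len(y)
    return b0, b1

def prob(model, x):
    """Predicted probability of the positive (minority) class."""
    b0, b1 = model
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

plain    = fit(X, y, [1.0] * n)                           # standard baseline
weighted = fit(X, y, [w_pos if t else w_neg for t in y])  # cost-sensitive

# The cost-sensitive model is more willing to flag a borderline patient.
print(prob(plain, 1.0), prob(weighted, 1.0))
```

The same principle carries over to the tree-based learners mentioned above, where class weights enter the split criterion or the boosting loss rather than a per-sample log-loss term.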
For example, misclassifying a positive instance is often more costly than misclassifying a negative one. Resampling techniques have been used to balance the class distributions in imbalanced datasets [10]. Resampling methods aim to balance the data by undersampling the majority instances or oversampling the minority instances; sometimes both are combined. However, resampling may discard potentially valuable data or increase the computational cost with redundant instances. In essence, both undersampling and oversampling change the distribution of the classes [11]. An alternative approach, cost-sensitive learning (CSL), considers the cost associated with the misclassification of samples [12]. Rather than artificially creating balanced class distributions via sampling techniques, cost-sensitive learning addresses the imbalanced class problem by utilizing cost matrices that specify the costs associated with misclassifying the various classes [13]. By definition, cost-sensitive learning can be considered a subfield of ML that takes the cost of classification errors into account during model training [8]. Research has shown that cost-sensitive learning yields enhanced performance in applications where the dataset has a skewed class distribution [14]. Generally, ML algorithms aim to minimize error during training, and several functions can be used to compute the error or loss of a model on the training data. In cost-sensitive learning, a penalty is placed on misclassifications, and this penalty is referred to as the cost. Cost-sensitive learning aims to minimize the misclassification cost of a model on the input data; hence, instead of optimizing accuracy, the algorithm tries to minimize the total misclassification cost [15].
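The distinction between optimizing accuracy and optimizing total misclassification cost can be seen in a small worked example. The cost matrix below is a hedged illustration, not taken from the paper: a false negative (a sick patient classified as healthy) is assumed five times more costly than a false positive. Two classifiers with identical accuracy then receive very different total costs.

```python
# Cost matrix as a lookup: COST[(actual, predicted)]. The 5:1 ratio of
# false-negative to false-positive cost is an illustrative assumption.
COST = {("pos", "neg"): 5.0,   # false negative: sick patient missed
        ("neg", "pos"): 1.0,   # false positive: healthy patient flagged
        ("pos", "pos"): 0.0,
        ("neg", "neg"): 0.0}

def total_cost(y_true, y_pred):
    """Sum the misclassification cost over all predictions."""
    return sum(COST[(t, p)] for t, p in zip(y_true, y_pred))

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy ground truth: 8 healthy (majority), 2 sick (minority).
y_true = ["neg"] * 8 + ["pos"] * 2

# Classifier A misses both sick patients; classifier B trades two false
# positives for catching both. Same number of errors, different costs.
y_pred_a = ["neg"] * 10
y_pred_b = ["neg"] * 6 + ["pos"] * 2 + ["pos"] * 2

print(accuracy(y_true, y_pred_a), total_cost(y_true, y_pred_a))  # 0.8 10.0
print(accuracy(y_true, y_pred_b), total_cost(y_true, y_pred_b))  # 0.8 2.0
```

Under plain accuracy the two classifiers are indistinguishable, but a cost-sensitive objective prefers classifier B, which is the clinically safer behavior on imbalanced medical data.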
Furthermore, recent research has suggested a strong connection between cost-sensitive learning and imbalanced classification; hence, the conceptual frameworks and algorithms used for cost-sensitive learning can be employed directly for imbalanced classification tasks [16]. Some works have also demonstrated that, when attempting to solve imbalanced classification problems, cost-sensitive learning leads to superior performance [11] and is a more suitable approach than sampling techniques. Several research works have proposed methods to classify imbalanced medical data, as stated in [14] and [5]. However, most of these methods focus on data resampling, such as in [3], [17], and [18]. Even though numerous papers have been published on the classification of imbalanced medical data, the focus has been on resampling methods. This research aims to provide a general overview of the imbalanced classification problem and of ML algorithms suitable for such problems, focusing on medical data. In the process, we develop some cost-sensitive ML algorithms to conduct a comparative study with standard algorithms.
doi:10.1016/j.imu.2021.100690