Survey on deep learning with class imbalance

Justin M. Johnson, Taghi M. Khoshgoftaar
Journal of Big Data (2019) 6:27. doi:10.1186/s40537-019-0192-5
Abstract

The purpose of this study is to examine existing deep learning techniques for addressing class imbalanced data. Effective classification with imbalanced data is an important area of research, as high class imbalance is naturally inherent in many real-world applications, e.g., fraud detection and cancer detection. Moreover, highly imbalanced data poses added difficulty, as most learners will exhibit bias towards the majority class and, in extreme cases, may ignore the minority class altogether. Class imbalance has been studied thoroughly over the last two decades using traditional machine learning models, i.e., non-deep learning. Despite recent advances in deep learning, along with its increasing popularity, very little empirical work in the area of deep learning with class imbalance exists. Given deep learning's record-breaking performance results in several complex domains, investigating the use of deep neural networks for problems containing high levels of class imbalance is of great interest. Available studies regarding class imbalance and deep learning are surveyed in order to better understand the efficacy of deep learning when applied to class imbalanced data. This survey discusses the implementation details and experimental results for each study, and offers additional insight into their strengths and weaknesses. Several areas of focus include: data complexity, architectures tested, performance interpretation, ease of use, big data application, and generalization to other domains. We have found that research in this area is very limited, that most existing work focuses on computer vision tasks with convolutional neural networks, and that the effects of big data are rarely considered. Several traditional methods for class imbalance, e.g., data sampling and cost-sensitive learning, prove to be applicable in deep learning, while more advanced methods that exploit neural network feature learning abilities show promising results. The survey concludes with a discussion that highlights various gaps in deep learning from class imbalanced data for the purpose of guiding future research.

Introduction

Supervised learning methods require labeled training data, and in classification problems each data sample belongs to a known class, or category [1, 2]. In a binary classification problem with data samples from two groups, class imbalance occurs when one class, the minority group, contains significantly fewer samples than the other class, the majority group. In many problems [3-7], the minority group is the class of interest, i.e., the positive class. A well-known class imbalanced machine learning scenario is the medical diagnosis task of detecting disease, where the majority of the patients are healthy.
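To make this hazard concrete, the short sketch below (our own illustration; the 1:99 class ratio is hypothetical and not taken from the survey) shows how a degenerate model that always predicts the majority class attains 99% accuracy while never detecting a single positive case:

```python
import numpy as np

# Hypothetical binary labels with a 1:99 imbalance: 10 positive
# (minority, the class of interest) and 990 negative (majority) samples.
y_true = np.array([1] * 10 + [0] * 990)

# A degenerate "classifier" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()   # 0.99
recall = y_pred[y_true == 1].mean()    # 0.0: no positive case is ever found

print(f"accuracy = {accuracy:.2f}, minority-class recall = {recall:.2f}")
```

This is why plain accuracy is a misleading yardstick under high imbalance, and why the methods surveyed here instead alter the training distribution (data sampling), the loss (cost-sensitive learning), or both (hybrid methods).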
Cost-sensitive learning

Cost matrix for binary classification, where c_ij is the cost of predicting class i when the true class is j (positive = 1, negative = 0):

                      Actual positive    Actual negative
Predicted positive    C(1, 1) = c11      C(1, 0) = c10
Predicted negative    C(0, 1) = c01      C(0, 0) = c00

[…] a range of costs, but can be expensive and even impractical if the size of the data set or number of features is too large.
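As a minimal sketch of how such a cost matrix can be applied at prediction time (our own illustration; the concrete costs c10 = 1 and c01 = 10 are assumed, not taken from the survey), a model's estimated class probabilities can be turned into labels by minimizing expected cost instead of thresholding at 0.5:

```python
import numpy as np

# Hypothetical cost matrix C[i, j]: the cost c_ij of predicting class i
# when the true class is j, matching the table above. A false negative
# (c01) is assumed ten times as costly as a false positive (c10);
# correct predictions cost nothing.
C = np.array([[0.0, 10.0],   # predict 0 (negative): c00, c01
              [1.0,  0.0]])  # predict 1 (positive): c10, c11

def min_cost_predict(p_pos: np.ndarray) -> np.ndarray:
    """Choose, per sample, the class with the lowest expected cost.

    p_pos holds a model's estimated P(actual = 1 | x) per sample.
    Expected cost of predicting class i: C[i, 0] * (1 - p) + C[i, 1] * p.
    """
    p = np.stack([1.0 - p_pos, p_pos])   # shape (2, n): P(actual = j)
    expected_cost = C @ p                # shape (2, n): cost of predicting i
    return expected_cost.argmin(axis=0)

print(min_cost_predict(np.array([0.05, 0.2, 0.6])))  # -> [0 1 1]
```

With these costs the effective decision threshold drops from 0.5 to c10 / (c10 + c01) = 1/11, so many more borderline samples are flagged positive; cost-sensitive training methods build the same bias into the learner itself.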
Hybrid methods

Data-level and algorithm-level methods have been combined in various ways and applied to class imbalance problems [10]. One strategy performs data sampling to reduce class noise and imbalance, then applies cost-sensitive learning or thresholding to further reduce the bias towards the majority group. Several techniques that combine ensemble methods with sampling and cost-sensitive learning were presented in [28]. Liu et al. [52] proposed two algorithms, EasyEnsemble and BalanceCascade, that learn multiple classifiers by combining subsets of the majority group with the minority group, creating pseudo-balanced training sets for each individual classifier. SMOTEBoost [53], RUSBoost [54], and JOUS-Boost [55] all combine sampling with ensembles. Sun [56] introduced three cost-sensitive boosting methods, AdaC1, AdaC2, and AdaC3, which iteratively increase the impact of the minority group by introducing cost items into the AdaBoost algorithm's weight updates. Sun showed that the cost-sensitive boosted ensembles outperformed plain boosting methods in most cases.
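The sampling-plus-ensemble idea behind EasyEnsemble can be sketched as follows. This is a simplified illustration under our own assumptions (labels in {0, 1} with 1 as the minority class, scikit-learn's AdaBoostClassifier as the boosted member, and probability averaging as the combination rule), not the reference implementation from [52]:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def easy_ensemble_sketch(X, y, n_subsets=10, seed=0):
    """Train one boosted learner per pseudo-balanced subset (illustrative only)."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)   # assumed: 1 = minority class
    majority_idx = np.flatnonzero(y == 0)

    models = []
    for _ in range(n_subsets):
        # Randomly undersample the majority class down to the minority size,
        # then pair it with the full minority class.
        sampled = rng.choice(majority_idx, size=minority_idx.size, replace=False)
        idx = np.concatenate([minority_idx, sampled])
        models.append(AdaBoostClassifier(n_estimators=50).fit(X[idx], y[idx]))
    return models

def predict_ensemble(models, X):
    # Combine members by averaging their positive-class probabilities
    # (a simplification of EasyEnsemble's actual combination rule).
    p = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (p >= 0.5).astype(int)
```

BalanceCascade differs mainly in that majority samples correctly classified by the members trained so far are removed before the next subset is drawn, progressively focusing later members on the harder majority examples.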
Deep learning background

This section reviews the basic concepts of deep learning, including descriptions of the neural network architectures used throughout the surveyed works and the value of representation learning. We also touch on several important milestones that have contributed to the success of deep learning. Finally, the rise of big data analytics and its challenges are introduced, along with a discussion of the role of deep learning in solving these challenges.