When is resampling beneficial for feature selection with imbalanced wide data?

Ismael Ramos-Pérez, Álvar Arnaiz-González, Juan J. Rodríguez, César García-Osorio
2021 Expert systems with applications  
Additionally, specific results are also obtained depending on the classifier used, for example, for Gaussian SVM the best performance is obtained when the feature selection is done with SVM-RFE before  ...  This paper studies the effects that combinations of balancing and feature selection techniques have on wide data (many more attributes than instances) when different classifiers are used.  ...  This problem is even more relevant when dealing with wide data, where the number of features is extremely high.  ... 
doi:10.1016/j.eswa.2021.116015 fatcat:gfy7cwrpxnhc7crkxo3bid7mze

A Study of Data Pre-processing Techniques for Imbalanced Biomedical Data Classification [article]

Shigang Liu, Jun Zhang, Yang Xiang, Wanlei Zhou, Dongxi Xiang
2019 arXiv   pre-print
methods in most cases, thus, Feature Selection with SVM classifier is the best choice for imbalanced biomedical data learning.  ...  However, resampling and Feature Selection techniques perform poorly when using C4.5 decision tree and Linear discriminant analysis classifiers; (2) for datasets with different distributions, techniques  ...  In the meantime, considering that feature selection (FS) is also beneficial to imbalanced data learning, one of the recently developed FS approaches is also employed in this study (Yu et al. 2014).  ...
arXiv:1911.00996v1 fatcat:vrkuuh7ptbaa3p4kgd2eb47tbi

A Comparative Analysis of Data Resampling Methods on Imbalance Medical Data

Matloob Khushi, Kamran Shaukat, Talha Mahboob Alam, Ibrahim A. Hameed, Shahadat Uddin, Suhuai Luo, Xiaoyan Yang, Maranatha Consuelo Reyes
2021 IEEE Access  
Each categorical feature with n categories is converted to n binary (0-1) features [95, 96]. D.  ...  We imputed their median values for former smokers with missing entries for their age when they quit smoking.  ...
doi:10.1109/access.2021.3102399 fatcat:4foj6xyyovanfhnr5z5fcvh5py

Malicious web domain identification using online credibility and performance data by considering the class imbalance issue

Zhongyi Hu, Raymond Chiong, Ilung Pranata, Yukun Bao, Yuqing Lin
2019 Industrial management & data systems  
Findings: By applying eight well-known machine learning classifiers, the proposed integrated resampling approach is comprehensively examined using several imbalanced web domain datasets with different  ...  An integrated resampling approach is proposed to address the class imbalance issue. The performance of the proposed approach is confirmed based on real-world datasets with different imbalance ratios.  ...  We would like to thank the handling editor and two anonymous reviewers for their valuable comments and suggestions on the previous version of this paper.  ...
doi:10.1108/imds-02-2018-0072 fatcat:ugiip2ydtrdfrcrm6friudlkby

Partial Resampling of Imbalanced Data [article]

Firuz Kamalov, Amir F. Atiya, Dina Elreedy
2022 arXiv   pre-print
Imbalanced data is a frequently encountered problem in machine learning.  ...  Despite a vast amount of literature on sampling techniques for imbalanced data, there is a limited number of studies that address the issue of the optimal sampling ratio.  ...  It appears that the SVM classifier is better suited for imbalanced data when used in conjunction with data sampling. The details of the SVM-based experiments are supplied in Table 6 and Figure 3.  ...
arXiv:2207.04631v1 fatcat:shwanrkkrjhsncdg62s63kxvou

Online Defect Prediction for Imbalanced Data

Ming Tan, Lin Tan, Sashank Dara, Caleb Mayeux
2015 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering  
First, the data are imbalanced: there are much fewer buggy changes than clean changes.  ...  Accepted for publication by IEEE. © 2015 IEEE. Personal use of this material is permitted.  ...  We use four types of resampling techniques to predict for the imbalanced data: simple duplicate, SMOTE, spread subsample, and resampling with/without replacement [24].  ...
doi:10.1109/icse.2015.139 dblp:conf/icse/TanTDM15 fatcat:xav66z6k7vaw3bl3moofrncvf4

Experimental evaluation of ensemble classifiers for imbalance in Big Data

Mario Juez-Gil, Álvar Arnaiz-González, Juan J. Rodríguez, César García-Osorio
2021 Applied Soft Computing  
In this paper, in-depth experimentation with ensemble classifiers is conducted in the context of imbalanced Big Data classification, using two popular ensemble families (Bagging and Boosting) and different  ...  A common problem for classification, especially in Big Data, is that the numerous examples of the different classes might not be balanced.  ...  This material is based upon work supported by Google Cloud, United States.  ... 
doi:10.1016/j.asoc.2021.107447 fatcat:4glhtjzn4vbbndrdm64hhwxtj4

A Universal Data Augmentation Approach for Fault Localization

Huan Xie, Yan Lei, Meng Yan, Yue Yu, Xin Xia, Xiaoguang Mao
2022 International Conference on Software Engineering  
However, the input data is high-dimensional and extremely imbalanced since the real-world programs are large in size and the number of failing test cases is much less than that of passing test cases, which  ...  Then, Aeneas handles the imbalanced data issue through generating synthesized failing test cases from the reduced feature space through conditional variational autoencoder (CVAE).  ...  Aeneas is a novel approach to handle the problems of high-dimensional and extremely imbalanced data by feature selection and data synthesis, respectively.  ... 
doi:10.1145/3510003.3510136 dblp:conf/icse/XieLY00M22 fatcat:xrulttxmynckpdcqkqd76vrwpa

PSU: Particle Stacking Undersampling Method For Highly Imbalanced Big Data

Yong-Seok Jeon, Dong-Joon Lim
2020 IEEE Access  
Imbalanced classes are a common problem in machine learning, and the computational costs required for proper resampling increase with the data size.  ...  INDEX TERMS Data mining, imbalanced data, undersampling, big data, support vector machines.  ...  INTRODUCTION Dealing with imbalanced data is a crucial task in data mining studies.  ...
doi:10.1109/access.2020.3009753 fatcat:viaqhideqra7jlz5ftm2td2epi

A Heterogeneous Ensemble Learning Model Based on Data Distribution for Credit Card Fraud Detection

Yalong Xie, Aiping Li, Liqun Gao, Ziniu Liu, Shan Zhong
2021 Wireless Communications and Mobile Computing  
In this paper, we propose a heterogeneous ensemble learning model based on data distribution (HELMDD) to deal with imbalanced data in CCFD.  ...  Credit card fraud detection (CCFD) is important for protecting the cardholder's property and the reputation of banks.  ...  Resampling is a widely used method to address the problem of imbalanced classification data.  ... 
doi:10.1155/2021/2531210 fatcat:cjwjrdq43fhcbhnnxhf5zclngi

Big data preprocessing: methods and prospects

Salvador García, Sergio Ramírez-Gallego, Julián Luengo, José Manuel Benítez, Francisco Herrera
2016 Big Data Analytics  
Addressing big data is a challenging and time-demanding task that requires a large computational infrastructure to ensure successful data processing and analysis.  ...  The massive growth in the scale of data has been observed in recent years being a key factor of the Big Data scenario.  ...  This method is also designed for matrices with a low number of features.  ... 
doi:10.1186/s41044-016-0014-0 fatcat:z3lqu2yi3vey3khbdal6mu34qa

Optimization of data resampling through GA for the classification of imbalanced datasets

Filippo Galli, Marco Vannucci, Valentina Colla
2019 IJAIN (International Journal of Advances in Intelligent Informatics)  
This paper overviews a novel family of methods for the resampling of an imbalanced dataset in order to maximize the performance of arbitrary data-driven classifiers.  ...  Classification of imbalanced datasets is a critical problem in numerous contexts.  ...  These classifiers, in fact, aim at maximizing the overall performance that is achieved when coping with balanced datasets but it is not when the training dataset is imbalanced: in this latter case the  ...
doi:10.26555/ijain.v5i3.409 fatcat:bmdt43ln4jdyrg32ksgk6dnwqu

CCR: A combined cleaning and resampling algorithm for imbalanced data classification

Michał Koziarski, Michał Wożniak
2017 International Journal of Applied Mathematics and Computer Science  
Imbalanced data classification is one of the most widespread challenges in contemporary pattern recognition.  ...  In this paper we describe a novel resampling technique focused on proper detection of minority examples in a two-class imbalanced data task.  ...  One of the most important questions we have to ask when dealing with imbalanced data is what performance measure should we optimize for.  ... 
doi:10.1515/amcs-2017-0050 fatcat:me52726ub5folfedmkcp5f7b5i

Self-paced Ensemble for Highly Imbalanced Massive Data Classification [article]

Zhining Liu, Wei Cao, Zhifeng Gao, Jiang Bian, Hechang Chen, Yi Chang, Tie-Yan Liu
2019 arXiv   pre-print
Many real-world applications reveal difficulties in learning classifiers from imbalanced data.  ...  The rising big data era has been witnessing more classification tasks with large-scale but extremely imbalanced and low-quality datasets.  ...  Notice that we update hardness value in each iteration (line 4-5) in order to select data samples that were most beneficial for the current ensemble.  ...
arXiv:1909.03500v2 fatcat:l3uitgbvl5cjpj7f3eskvlstmi

A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework [article]

Gabriel Aguiar, Bartosz Krawczyk, Alberto Cano
2022 arXiv   pre-print
Additionally, we formulate open challenges and future directions for this domain. Our experimental testbed is fully reproducible and easy to extend with new methods.  ...  We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and we provide general recommendations to end-users for selecting the best algorithms for imbalanced  ...  Acknowledgements High Performance Computing resources provided by the High Performance Research Computing (HPRC) Core Facility at Virginia Commonwealth University (https://hprc.vcu.edu) were used for conducting  ... 
arXiv:2204.03719v1 fatcat:dulhr3cedrh6vd6m5m4qovffri