A Hybrid Modified Deep Learning Data Imputation Method for Numeric Datasets

Nuran Peker, Cemalettin Kubat
2021 International Journal of Intelligent Systems and Applications in Engineering  
Missing data is a major problem for both machine learning and data mining methods. Most of these methods do not work with missing data, and the performance of those that do may suffer. Imputation is a data preprocessing method that replaces missing data with appropriate values. This study aims to develop a hybrid modified imputation method based on a deep learning approach. For this purpose, we use Random Forest and Datawig deep learning imputation (RF-DLI) methods together. Datawig is a deep learning-based library that supports missing value imputation for all types of data. The RF-DLI approach imputes missing data in three steps. First, the importance of each attribute of the data set is determined with Random Forest (RF). Second, the most important 50% of the attributes are selected. Finally, each missing value is imputed with Datawig (DLI) using these most important attributes. The study uses six real-world data sets from different fields with 30% missing data. The imputation performance of RF-DLI is compared to K-Nearest Neighbors (KNN), Multiple Imputation by Chained Equations (MICE), MEAN imputation, and Principal Component Analysis (PCA) imputation approaches in terms of Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R-squared (R²) evaluation metrics. The results show that in most cases, the RF-DLI approach has better imputation performance than the other techniques mentioned. [...] most important attributes of each data set with RF and then complete the missing values in all columns of the data set with DLI, using these important attributes we obtained.
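The evaluation protocol described in the abstract (hide 30% of the values, impute them, score the imputed values against the hidden truth with MAE, RMSE, and R²) can be illustrated with a minimal sketch. It is not the authors' code and does not implement RF-DLI: it uses MEAN imputation, one of the baselines named above, on a single toy column. The helper names (`mask_values`, `mean_impute`), the toy data, the random seed, and the choice to remove values uniformly at random are all assumptions made for illustration; the abstract does not state how the missing values were generated.

```python
import math
import random


def mask_values(column, missing_rate=0.3, seed=0):
    """Hide a fraction of entries (set to None); return the masked copy and the hidden indices.

    Assumption: values are removed uniformly at random.
    """
    rng = random.Random(seed)
    n_missing = round(len(column) * missing_rate)
    masked_idx = sorted(rng.sample(range(len(column)), n_missing))
    masked = list(column)
    for i in masked_idx:
        masked[i] = None
    return masked, masked_idx


def mean_impute(column):
    """MEAN imputation: replace each missing entry with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in column]


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (undefined if y_true is constant)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy column standing in for one numeric attribute (hypothetical data).
truth = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
masked, idx = mask_values(truth, missing_rate=0.3, seed=0)
imputed = mean_impute(masked)

# Score only the entries that were hidden.
y_true = [truth[i] for i in idx]
y_pred = [imputed[i] for i in idx]
print("MAE :", mae(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("R^2 :", r2(y_true, y_pred))
```

Replacing `mean_impute` with another imputer, while keeping the masking and scoring unchanged, gives a like-for-like comparison of the kind the abstract reports. Because MEAN imputation predicts a single constant, its R² on the hidden entries cannot exceed zero, which makes it a natural lower reference point.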
doi:10.18201/ijisae.2021167931 fatcat:qbgfg3st4vci5nzqz3e5v3m2qy