Learning with Missing Data

Carlos A. Escobar, Jorge Arinez, Daniela Macias, Ruben Morales-Menendez
2020 2020 IEEE International Conference on Big Data (Big Data)  
Many real-world data sets contain missing values; therefore, learning with incomplete data sets is a common challenge faced by data scientists. Handling missing values intelligently is important for developing robust data models, since there is no perfect approach to compensate for them. Deleting the rows with empty cells is a commonly used approach, but this naive method may lead to estimates with larger standard errors due to the reduced sample size. On the other hand, imputing the missing records is a better approach, but it should be used with great caution, as it relies on specific and often unrealistic assumptions that can potentially bias results. In this paper, a new greedy-like algorithm is proposed to maximize the number of records. The algorithm can be used to generate various maximized sub-sets by varying the number of columns (features) that can be used for learning. It salvages more records than the naive method, and it avoids the bias induced by imputation, so learning algorithms can learn from real sub-sets rather than from artificial data. Finally, the proposed algorithm is applied to a case study, the COVID-19 Open Research data set (CORD-19), which was prepared and posted by The White House and a coalition of leading research groups as a call to action to the world's artificial intelligence experts to answer high-priority scientific questions. This data set contains missing records; therefore, the maximized sub-sets resulting from this analysis can be further investigated by the research community.
doi:10.1109/bigdata50022.2020.9377785 fatcat:hrgflnmkdrfxtcky7brjxiqvfq
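
The abstract does not spell out the steps of the proposed algorithm, so the following is only an illustrative sketch of the general idea: trading columns for complete records. It assumes a simple backward-greedy heuristic (drop, one at a time, the column whose removal leaves the most fully observed rows) and represents missing cells as None. The function names complete_rows and greedy_subsets are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Row = Sequence[Optional[object]]  # None marks a missing cell (assumption)


def complete_rows(rows: Sequence[Row], cols: Sequence[int]) -> List[int]:
    """Return indices of rows with no missing value in the given columns."""
    return [i for i, r in enumerate(rows) if all(r[c] is not None for c in cols)]


def greedy_subsets(
    rows: Sequence[Row], n_cols: int, min_cols: int = 1
) -> Dict[int, Tuple[Tuple[int, ...], List[int]]]:
    """Hypothetical backward-greedy heuristic, not the authors' exact method.

    Maps each column count k to (kept columns, indices of complete rows).
    The entry for k == n_cols is plain row deletion on all columns.
    """
    cols = list(range(n_cols))
    result = {len(cols): (tuple(cols), complete_rows(rows, cols))}
    while len(cols) > min_cols:
        best_cols: List[int] = []
        best_rows: Optional[List[int]] = None
        for c in cols:
            # Try removing column c and count the rows that become usable.
            trial = [x for x in cols if x != c]
            kept = complete_rows(rows, trial)
            # Strict comparison: ties go to the first candidate column.
            if best_rows is None or len(kept) > len(best_rows):
                best_cols, best_rows = trial, kept
        cols = best_cols
        result[len(cols)] = (tuple(cols), best_rows or [])
    return result


# Toy data: 4 records, 3 features, None = missing.
rows = [
    (1, 2, None),
    (3, None, 4),
    (5, 6, 7),
    (8, 9, None),
]
subsets = greedy_subsets(rows, n_cols=3)
for k in sorted(subsets, reverse=True):
    kept_cols, kept_rows = subsets[k]
    print(k, kept_cols, len(kept_rows))
```

On this toy table, row deletion over all three columns keeps a single record, while giving up one column keeps three. Each entry in the result is a sub-set made only of observed values, so no imputed data enters the learning step. Being greedy, the sketch is not guaranteed to find the largest possible sub-set for a given number of columns.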