General-Purpose Open-Source Program for Ultra Incomplete Data-Oriented Parallel Fractional Hot Deck Imputation

In Ho Cho, Jae-Kwang Kim, Yicheng Yang
2021 Zenodo  
There is a strong need for an imputation method oriented toward big data to accelerate data-driven scientific discovery in the new era of big data and powerful computing. Imputation is a statistical procedure for filling in missing data, and a wide spectrum of methods exists. Still, these methods are often not applicable to large incomplete data sets and require statistical assumptions that are difficult to meet. With support from NSF (OAC-1931380), we developed ultra data-oriented parallel fractional hot-deck imputation (UP-FHDI [1,2]), which is general-purpose, assumption-free software for handling item nonresponse in big incomplete data by leveraging the theory of FHDI and parallel computing. Here, "ultra" data means a data set with high dimensions and many instances (i.e., concurrently big-p and big-n; see Figure). UP-FHDI inherits the strength of FHDI [3], which can handle multivariate missing data by filling each missing unit with multiple observed values without requiring any prior distributional assumptions. UP-FHDI adopts a parallel file system that supports inter-processor communication and allows multiple compute servers to access the hard drive simultaneously, optimizing memory usage by keeping essential data in memory and other data on the hard drive. Meanwhile, we use the Optimal Overload IO Protection System with UP-FHDI to dynamically adjust the intensive, simultaneous IO workload during a job and avoid degrading global file system performance. Exploiting the strength of this parallel file system, we provide full details of the ultra data-oriented parallelisms in the main steps of UP-FHDI: cell construction, estimation of cell probability using expectation maximization, parallel imputation, and parallel variance estimation. The cell construction step adopts the parallel k-nearest neighbors method for deficient donor selection to break the computational bottleneck of the cell-merging scheme of serial FHDI. Sure independence screening is embedded into UP-FHDI for ultrahigh dimensional [...]
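The idea of filling each missing unit with multiple observed values can be illustrated with a toy, serial sketch in Python. This is not the UP-FHDI implementation: it assumes a single variable, imputation cells that are already given, equal fractional weights, and a deterministic choice of the first `m` donors in each cell. The function names and the parameter `m` are hypothetical and chosen only for illustration; the actual method estimates cell probabilities with expectation maximization and runs in parallel.

```python
from collections import defaultdict


def fractional_hot_deck(cells, y, m=2):
    """Toy fractional hot deck imputation for one variable.

    cells: imputation-cell label for each unit (assumed already constructed).
    y: observed values, with None marking a missing item.
    m: maximum number of donors per recipient (hypothetical parameter).
    Returns rows of (unit_index, value, fractional_weight).
    """
    # Collect the observed values (donors) available in each cell.
    donors = defaultdict(list)
    for c, v in zip(cells, y):
        if v is not None:
            donors[c].append(v)

    rows = []
    for i, (c, v) in enumerate(zip(cells, y)):
        if v is not None:
            # Observed units keep their own value with full weight.
            rows.append((i, v, 1.0))
            continue
        # Simplification: take the first m donors of the same cell.
        pool = donors[c][:m]
        if not pool:
            raise ValueError(f"cell {c!r} has no donors")
        # Equal fractional weights that sum to 1 for this recipient.
        w = 1.0 / len(pool)
        rows.extend((i, d, w) for d in pool)
    return rows


def weighted_mean(rows, n):
    """Mean estimate over n units from fractionally imputed rows."""
    return sum(v * w for _, v, w in rows) / n


cells = ["a", "a", "a", "b", "b", "b"]
y = [1.0, 3.0, None, 10.0, None, 14.0]
rows = fractional_hot_deck(cells, y, m=2)
est = weighted_mean(rows, len(y))
```

Each recipient is replaced by several weighted rows rather than one imputed value, which is what allows the uncertainty due to imputation to be carried into later variance estimation.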
doi:10.5281/zenodo.5570263 fatcat:w7qusunswrdirejywpsbschete