Association-Based Multiple Imputation in Multivariate Datasets: A Summary

Weixiong Zhang
Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073)  
Missing data are ubiquitous and inevitable in real databases and datasets. They can lead to worrisome problems, making a given dataset incomplete and undependable as well as causing various complications in applications. How to handle missing data is an important issue that needs to be addressed properly. There are two general approaches to handling missing data. The rst one ignores missing data, by discarding the records with missing values from a dataset or taking unknown attribute values as
more » ... pecial, null values and processing them during the process of answering queries. The second approach is called imputations, which attempts to ll in missing data with plausible values. There are two major existing methods of this approach. The rst uses statistical relationships among the variables under consideration, and builds statistical models for the variables, namely linear regression models for numerical values and statistical classi cation rules for categorical values. A possible value of an unspeci ed item is then estimated using the statistical models. The second method is multiple imputations that replaces each missing value by m ultiple simulated values. After the multiple imputations are obtained, each plausible value is analyzed, using EM algorithm and Gibbs sampling, to produce an inferential conclusion that include uncertainty caused by missing data. Note that the multiple imputation method can only apply to numerical attributes. In this research, we apply the framework of association rules to multiple imputations in multivariate datasets with missing numeric and categorical data. The main ideas include using association relationships among variables and statistical features of possible variable values, such as their frequencies, and combining them to predict the most possible values of missing data. The hypothesis underlying the idea of using association relationships is that a given dataset must provide a coherent picture of an object, and therefore some variables of the dataset must be related to and constrained with one another. If the association relationships among dif-
doi:10.1109/icde.2000.839427 dblp:conf/icde/Zhang00 fatcat:xz62marorfaybnfw2337abaezq