Measuring Information Quality for Privacy Preserving Data Mining

Sam Fletcher, Md Zahidul Islam
2014 Journal of clean energy technologies  
In striving for knowledge discovery in a world of ever-growing data collection, it is important that even if a dataset is altered to preserve people's privacy, the information in the dataset retains as much quality as possible. In this context, "quality" refers to the accuracy or usefulness of the information retrievable from a dataset. Defining and measuring the loss of information after meeting privacy requirements proves difficult, however. Techniques have been developed to measure the information quality of a dataset for a variety of anonymization techniques including Generalization, Suppression, and Randomization. Some measures analyze the data itself, while others analyze the results output by data mining tasks such as Clustering and Classification. This survey discusses a collection of information measures, and issues surrounding their usage and limitations.

Index Terms: Anonymization, data mining, data quality, privacy preserving data mining.

3 "Data mining" refers to using automated algorithms for finding patterns in data.
4 "Classification" refers to predicting a record's value for an attribute based on the other explanatory attributes. Decision trees and neural networks are commonly used methods for doing so [5].
5 "Clustering" refers to grouping records in such a way that similar records are grouped together and dissimilar records are grouped in separate clusters [5].
6 "Data imputation" and "data cleaning" refer to estimating missing values in a dataset and removing misinformation/noise [9].
doi:10.7763/ijcte.2015.v7.924 fatcat:7immj7yfxje25lizswocabsouu