Ontology-Enhanced Interactive Anonymization in Domain-Driven Data Mining Outsourcing

Brian C.S. Loh, Patrick H.H. Then
2010 2010 Second International Symposium on Data, Privacy, and E-Commerce  
Introduction. This thesis focuses on the data mining outsourcing scenario in which a data owner publishes data to an application service provider, who returns mining results. To ensure data privacy against an untrusted party, protection techniques are required. Anonymization, a widely used method, preserves true attribute values and supports a variety of data mining algorithms. Nevertheless, several issues emerge when anonymization is applied in a real-world outsourcing scenario. Most methods have focused on the traditional data mining paradigm; they therefore neither incorporate domain knowledge nor optimize data for domain-driven purposes. Furthermore, existing techniques limit users' control while assuming that users are naturally capable of producing Domain Generalization Hierarchies (DGH). Moreover, previous utility metrics have not considered attribute correlations during generalization.

Objective. The research objective is to create an ontology-based constrained anonymization framework that preserves meaningful and actionable models for domain-driven data mining while protecting privacy.

Framework. In contrast with existing work, this framework integrates the Unified Medical Language System (UMLS) as a source of domain ontology knowledge during DGH creation to preserve value meanings. It also allows user constraints based on attribute semantic types and relations, so that anonymization can be tailored to physicians' mining tasks. In addition, attribute correlations are determined from external domain knowledge, in the form of MEDLINE literature, to improve attribute selection during anonymization.

Results. Experiments show that ontology-based DGHs preserve semantic meaning after attribute generalization. Additionally, setting constraints preserves the attributes that are important for specific mining tasks. Finally, a correlation-based measure can improve attribute selection during anonymization for domain-driven purposes.

Conclusion. There is an urgent need for privacy-preserving methods capable of anonymizing data for domain-driven usage. The proposed framework demonstrates the benefit of integrating domain ontology knowledge and external literature in improving utility for domain-driven purposes. It is therefore expected that, by using such a framework, data owners can protect data while maintaining utility for real-world requirements.
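As a rough illustration of the generalization step described in the Framework part of the abstract, the sketch below climbs a toy DGH until a k-anonymity check passes. The hierarchy `AGE_DGH`, the function names, and the rule that a constrained attribute is simply exempt from generalization are all assumptions made for this example. None of them come from the paper, which builds its hierarchies from UMLS and selects attributes with a correlation-based measure.

```python
from collections import Counter

# Hypothetical toy DGH: each value maps to its parent (one level more general).
# The labels are illustrative only and are not taken from UMLS.
AGE_DGH = {
    "45-49": "middle-aged",
    "50-54": "middle-aged",
    "55-59": "middle-aged",
    "80-84": "aged-80-and-over",
    "middle-aged": "adult",
    "aged-80-and-over": "adult",
    "adult": "*",
}


def generalize(value, dgh, levels):
    """Climb `levels` steps up the hierarchy, stopping early at the root."""
    for _ in range(levels):
        if value not in dgh:
            break
        value = dgh[value]
    return value


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def anonymize(records, dghs, k, constrained=()):
    """Raise the generalization level of unconstrained attributes until k-anonymity holds.

    Assumed simplification: every unconstrained attribute is generalized by the
    same level, and constrained attributes are left untouched. Returns
    (level, records), or (None, original records) if no level satisfies k.
    """
    attrs = list(dghs)
    free = [a for a in attrs if a not in constrained]
    max_level = max((len(d) for d in dghs.values()), default=0)
    for level in range(max_level + 1):
        out = [
            {**r, **{a: generalize(r[a], dghs[a], level) for a in free}}
            for r in records
        ]
        if is_k_anonymous(out, attrs, k):
            return level, out
    return None, records


records = [
    {"age": "45-49", "diagnosis": "yes"},
    {"age": "50-54", "diagnosis": "no"},
    {"age": "55-59", "diagnosis": "no"},
    {"age": "80-84", "diagnosis": "yes"},
]
level, anonymized = anonymize(records, {"age": AGE_DGH}, k=2)
```

In this toy run the single record in the oldest age group forces every age up two levels before each group holds at least two records. Marking `age` as constrained would make the search fail instead, which is the kind of privacy versus utility trade-off that user constraints expose.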
[Figure: decision trees built with the Dynamic DGH and the Ontology-based DGH for (a) Cath (L = 2, K = 2), (b) Cleveland (L = 2, K = 2), and (c) Framingham (L = 2, K = 20).]

... be determined by assessing the attributes contained in each rule. For instance, when a decision tree is created, certain attributes may be missing due to generalization. If the model is then applied to a task that requires the missing attributes, it may provide low actionability.

We began by creating decision trees from each dataset with the previously set parameters, using InfoGain as the score. Two sets of models were created for each parameter setting, one with attribute constraints and one without. The attributes remaining in the models were then analyzed and compared.

Cath

As shown in Table 4.2, all attributes are preserved in the raw Cath dataset model.
At each L and K increment, fewer attributes remain in the resulting decision tree due to
doi:10.1109/isdpe.2010.7 fatcat:euaht3j43rcoxkmaqda6bjsqi4