An effective scheme for top-k frequent itemset mining under differential privacy conditions

Wenjuan Liang, Hong Chen, Jing Zhang, Dan Zhao, Cuiping Li
2020 Science China Information Sciences  
Citation: Liang W J, Chen H, Zhang J, et al. An effective scheme for top-k frequent itemset mining under differential privacy conditions. Sci China Inf Sci, 2020, 63(5): 159101.

Dear editor,

Frequent itemset mining (FIM) is important in many data mining applications [1], such as web log mining and trend analysis. However, if the data are sensitive (e.g., web browsing history), directly releasing frequent itemsets and their support may breach user
[...] The protection of user privacy while obtaining statistical information is important. Differential privacy (DP) is a strong and rigorous standard for privacy protection. In this study, we focus on effectively discovering top-k frequent itemsets under DP conditions. By adding a carefully selected amount of noise, DP ensures that the output of a computation is not sensitive to any individual tuple, and thus user privacy can be protected. The amount of noise is determined by the privacy budget ε and the sensitivity.

Several studies [2-4] have recently begun to address the issue of performing FIM while satisfying DP. In this setting the sensitivity is the size of the set of candidate frequent itemsets, which is very large. According to the Laplace mechanism (LM) [5], a large magnitude of noise must therefore be added to the released result. To improve the utility of the released result, a solution evaluated in previously published studies is to reduce the dimension of long transactions in a differentially private manner before releasing them. For instance, Ref. [4] employed random sampling to truncate long transactions before releasing them. Random truncation may cause a significant amount of information loss, which also reduces utility. Ref. [2] proposed double standards to reduce the information loss associated with truncation. Ref. [3] proposed splitting long transactions instead of truncating them; however, despite reducing information loss, this approach was relatively inefficient (detailed related work and comparisons are given in Appendixes C and F). For this reason, we aim to design an effective scheme for FIM under DP conditions.

Problem definition. FIM refers to finding the set of patterns whose support is greater than λ (the support threshold, 0 < λ < 1). Top-k FIM is designed to find the k patterns whose support is among the top k of the frequent itemsets. Top-k FIM under DP conditions is defined as follows.
Let FI_k denote the private top-k frequent itemsets. After adding a certain amount of noise that satisfies the LM or the exponential mechanism (EM) [6] to the release process, the ratio of the probabilities of outputting the same result for any pair of neighboring databases (D, D′) is bounded by exp(ε), which can be formalized as Pr(FI_k | D) ≤ exp(ε) · Pr(FI_k | D′).

The overall scheme. Our scheme comprises two processes: the first splits the transactions using count estimation, and the second releases the result based on weighted reservoir sampling and the EM. To achieve ε-differential privacy, ε is divided between the two processes: ε_1 = α · ε (0 < α < 1) is used to split the transactions, and ε_2 = (1 − α) · ε is used to generate the private release. Our scheme satisfies
doi:10.1007/s11432-018-9849-y fatcat:byhvt7vr3ff23nst2zrwbewvde
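
The budget split ε_1 = α · ε, ε_2 = (1 − α) · ε and the two noise mechanisms named in the letter (LM and EM) can be illustrated with a minimal Python sketch. It is not the authors' scheme: it omits transaction splitting and weighted reservoir sampling, and instead shows a generic textbook top-k selection that applies the EM k times without replacement. The function names (`split_budget`, `laplace_noise`, `exponential_mechanism`, `private_top_k`), the sensitivity of 1, the equal per-round budget ε/k, and the toy supports are all assumptions made for illustration.

```python
import math
import random


def split_budget(epsilon, alpha):
    """Return (eps1, eps2) = (alpha * eps, (1 - alpha) * eps), with 0 < alpha < 1."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha * epsilon, (1 - alpha) * epsilon


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Pick a key with probability proportional to exp(eps * score / (2 * sensitivity))."""
    keys = list(scores)
    top = max(scores.values())  # shift by the max score for numerical stability
    weights = [
        math.exp(epsilon * (scores[key] - top) / (2 * sensitivity)) for key in keys
    ]
    return rng.choices(keys, weights=weights, k=1)[0]


def private_top_k(supports, k, epsilon, sensitivity=1.0, rng=None):
    """Select k itemsets by k rounds of the EM, spending epsilon / k per round.

    Assumption: each support count changes by at most `sensitivity` when one
    transaction is added or removed.
    """
    rng = rng or random.Random()
    remaining = dict(supports)
    per_round = epsilon / k
    chosen = []
    for _ in range(min(k, len(remaining))):
        pick = exponential_mechanism(remaining, per_round, sensitivity, rng)
        chosen.append(pick)
        del remaining[pick]  # sample without replacement
    return chosen


if __name__ == "__main__":
    rng = random.Random(0)
    eps1, eps2 = split_budget(1.0, 0.3)  # eps1 would go to splitting, eps2 to release
    toy_supports = {"ab": 50, "ac": 40, "bc": 10}  # hypothetical support counts
    selected = private_top_k(toy_supports, k=2, epsilon=eps2, rng=rng)
    # Optionally perturb the released supports with Laplace noise (scale is an assumption).
    noisy = {s: toy_supports[s] + laplace_noise(2 / eps2, rng) for s in selected}
    print(selected, noisy)
```

A smaller ε makes the selection weights flatter, so low-support itemsets are chosen more often; a very large ε makes the selection collapse onto the true top-k. How the budget for noisy supports should be accounted for relative to the selection budget is not specified in the excerpt, so the sketch leaves that composition step to the reader.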