### Selecting the Right Correlation Measure for Binary Data

Lian Duan, W. Nick Street, Yanchi Liu, Songhua Xu, Brook Wu
2014 ACM Transactions on Knowledge Discovery from Data
Finding the most interesting correlations among items is essential for problems in many commercial, medical, and scientific domains. Although there are numerous measures available for evaluating correlations, different correlation measures provide drastically different results. Piatetsky-Shapiro provided three mandatory properties for any reasonable correlation measure, and Tan et al. proposed several properties to categorize correlation measures; however, it is still hard for users to choose the desirable correlation measure according to their needs. In order to solve this problem, we explore the effectiveness problem in three ways. First, we propose two desirable properties and two optional properties for correlation measure selection and study the property satisfaction of different correlation measures. Second, we study different techniques to adjust correlation measures and propose two new correlation measures: the Simplified χ² with Continuity Correction and the Simplified χ² with Support. Third, we analyze the upper and lower bounds of different measures and categorize them by the bound differences. Combining these three directions, we provide guidelines for users to choose the proper measure according to their needs.

For binary data, although we are, in general, interested in correlated sets of arbitrary size, most of the published work on correlation is related to finding correlated pairs [Tan et al. 2004; Geng and Hamilton 2006]. Related work on association rules [Brin et al. 1997a, 1997b; Omiecinski 2003] is a special case of correlated pairs, since each rule has a left- and a right-hand side. Given an association rule X ⇒ Y, where X and Y are itemsets, Support = P(X ∩ Y) and Confidence = P(X ∩ Y)/P(X) [Agrawal et al. 1993; Omiecinski 2003] are often used to represent its significance. However, these can produce misleading results because they lack a comparison against the expected probability under the assumption of independence. To overcome this shortcoming, Lift [Brin et al. 1997a], Conviction [Brin et al. 1997b], and Leverage [Piatetsky-Shapiro 1991] were proposed. Dunning [1993] introduced a more statistically reliable measure, Likelihood Ratio, which outperforms other correlation measures. Jermaine [2005] extended Dunning's work and examined the computational issues of Probability Ratio and Likelihood Ratio. Bate et al.
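The contrast between Support/Confidence and the independence-aware measures can be illustrated with a small sketch. The transaction data below is hypothetical (not from the paper); it is chosen so that Confidence looks high for a rule whose items are actually independent, which Lift and Leverage expose.

```python
# Sketch: Support, Confidence, Lift, and Leverage for a rule X => Y,
# computed from a small hypothetical list of binary transactions.

def rule_measures(transactions, x, y):
    n = len(transactions)
    p_x = sum(x <= t for t in transactions) / n          # P(X)
    p_y = sum(y <= t for t in transactions) / n          # P(Y)
    p_xy = sum((x | y) <= t for t in transactions) / n   # P(X ∩ Y)
    support = p_xy
    confidence = p_xy / p_x
    lift = p_xy / (p_x * p_y)      # ratio to the expectation under independence
    leverage = p_xy - p_x * p_y    # difference from the expectation
    return support, confidence, lift, leverage

# Every transaction contains "y", so Confidence({x} => {y}) = 1 even though
# x and y are independent; Lift = 1 and Leverage = 0 reveal this.
ts = [frozenset(t) for t in
      [{"x", "y"}, {"y"}, {"x", "y"}, {"y"}, {"y"}, {"x", "y"}, {"y"}, {"y"}]]
s, c, l, lev = rule_measures(ts, frozenset({"x"}), frozenset({"y"}))
```

Here Confidence is a perfect 1.0, yet Lift of exactly 1 and Leverage of exactly 0 show there is no correlation, which is the shortcoming the independence-based measures were designed to fix.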
[1998] proposed a correlation measure called Bayesian Confidence Propagation Neural Network (BCPNN), which is good at finding correlated patterns that occur rarely in the whole dataset. These correlation measures are intuitive; however, different correlation measures provide drastically different results. Although Tan et al. [2004] proposed several properties to categorize these correlation measures, there are no guidelines for users to choose the desirable correlation measure according to their needs. To solve this problem, in this article we propose several desirable properties for correlation measures and study the property satisfaction of different correlation measures.

By studying the literature related to correlation, we notice that different correlation measures are favored in different domains. In the text mining area, people use Likelihood Ratio [Dunning 1993]; BCPNN is favored in the medical domain [Bate et al. 1998]; and Leverage is used in the social network context [Clauset et al. 2004]. Our research answers the question of why different areas favor different measures.

Evaluating the performance of different correlation measures requires a ground-truth ranking list that matches human intuition; each measure would then be evaluated by checking how similar its retrieved ranking list is to the ground-truth ranking list. However, when consulting human intuition to build the ground-truth ranking list, different people have different opinions. Take the pairs {A, B} and {C, D} for example. When Event A happens, the probability of observing Event B increases from 0.01% to 10%. When Event C happens, the probability of observing Event D increases from 50% to 90%. Which correlation pattern is stronger, {A, B} or {C, D}? Different people give different answers. Therefore, there is no ground-truth ranking list with which to test the performance of each measure.
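The disagreement on {A, B} versus {C, D} is not just a matter of taste; different measures formalize it differently. The sketch below computes Lift and Leverage for both pairs. The baseline and conditional probabilities come from the text, but the antecedent marginals P(A) = P(C) = 0.1 are hypothetical assumptions added only so Leverage is computable.

```python
# Sketch: how Lift and Leverage rank the {A, B} vs. {C, D} example differently.
# P(B) = 0.0001, P(B|A) = 0.10, P(D) = 0.50, P(D|C) = 0.90 are from the text;
# P(A) = P(C) = 0.1 are hypothetical values chosen for illustration.

def lift(p_cond, p_base):
    # Lift compares the conditional probability to the baseline.
    return p_cond / p_base

def leverage(p_ante, p_cond, p_base):
    # Leverage = P(X ∩ Y) - P(X)P(Y), rewritten via the conditional P(Y|X).
    return p_ante * (p_cond - p_base)

lift_ab = lift(0.10, 0.0001)          # ≈ 1000: huge relative increase
lift_cd = lift(0.90, 0.50)            # ≈ 1.8: modest relative increase
lev_ab = leverage(0.1, 0.10, 0.0001)  # ≈ 0.01: small absolute excess
lev_cd = leverage(0.1, 0.90, 0.50)    # ≈ 0.04: larger absolute excess
```

Under these assumptions, Lift ranks {A, B} far higher because it rewards the relative jump from 0.01% to 10%, while Leverage prefers {C, D} because it rewards the absolute excess over independence. This is exactly the kind of preference difference studied below.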
Instead, people generally agree that a good correlation measure should at least tell correlated patterns from uncorrelated patterns, and it is much easier to identify ground-truth correlated patterns, especially in simulated datasets. Therefore, our evaluation emphasizes the precision of each measure in telling correlated patterns from uncorrelated patterns. Still, when using precision to measure performance, two correlation measures can both perfectly tell correlated patterns from uncorrelated patterns, achieving 100% precision, yet rank the correlated patterns differently. For example, one measure can rank {A, B} higher, while the other ranks {C, D} higher. As a complement to precision, the preference differences among measures are studied when they achieve similar precision. There are two very influential papers [Tan et al.

P2: M monotonically increases with the increase of P(S) when all the P(I_i) remain the same.
P3: M monotonically decreases with the increase of any P(I_i) when the remaining P(I_k) (where k ≠ i) and P(S) remain unchanged.
P4: The upper bound of M does not approach infinity when P(S) gets closer to 0.
P5: M gets closer to C (including negative correlation cases, whose M is smaller than C) when an independent item is added to S.
P6: The lower bound of M gets closer to the lowest possible function value when P(S) gets closer to 0.
P7: M gets further away from C (including negative correlation cases) with increased sample size when all the P(I_i) and P(S) remain unchanged.
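Property P4 can be made concrete with a quick numerical sketch. For a perfectly correlated pair with P(A) = P(B) = P(A ∩ B) = p, Lift equals p/p² = 1/p, so its upper bound blows up as p → 0 and Lift fails P4; Leverage equals p − p², which stays bounded. The closed forms below follow from the standard definitions, not from any table in the paper.

```python
# Sketch of property P4: does a measure's value blow up as the joint
# probability P(S) shrinks? Evaluated on a perfectly correlated pair
# with P(A) = P(B) = P(A ∩ B) = p.

def lift_perfect(p):
    # Lift = P(A ∩ B) / (P(A) P(B)) = p / p^2 = 1/p: unbounded as p -> 0.
    return p / (p * p)

def leverage_perfect(p):
    # Leverage = P(A ∩ B) - P(A) P(B) = p - p^2: bounded, shrinks toward 0.
    return p - p * p

lifts = [lift_perfect(p) for p in (0.1, 0.01, 0.001)]
levs = [leverage_perfect(p) for p in (0.1, 0.01, 0.001)]
```

As p shrinks, Lift grows roughly as 10, 100, 1000 while Leverage shrinks toward 0: unbounded growth on vanishingly rare patterns is exactly the behavior P4 is designed to flag.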