Comparing effect sizes across variables: generalization without the need for Bonferroni correction
László Zsolt Garamszegi
2006
Behavioral Ecology
Studies in behavioral ecology often investigate several traits and then apply multiple statistical tests to discover their pairwise associations. Traditionally, such approaches require the adjustment of individual significance levels because as more statistical tests are performed, the greater the likelihood that Type I errors are committed (i.e., rejecting H0 when it is true) (Rice 1989). Bonferroni correction, which lowers the critical P value for each particular test based on the number of tests to be performed, is frequently used to reduce problems associated with multiple comparisons (Cabin and Mitchell 2000). However, this procedure dramatically increases the risk of committing Type II errors, as it results in a high risk of not rejecting an H0 when it is false. To reach 80% statistical power, huge sample sizes are necessary to detect effects of medium (r = 0.3 or d = 0.5; sensu Cohen 1988) or small (r = 0.1 or d = 0.2; sensu Cohen 1988) strength (say, N = 128 or N = 788, respectively, for a 2-sample t-test), but sample size is often limited when studying behavior. The strict application of Bonferroni correction in the field of ecology and behavioral ecology has therefore been criticized for mathematical and logical reasons (Wright 1992). It has been advocated that the sacrificial loss of power can be avoided by choosing an experimentwise error rate higher than the usually accepted 5%, which results in a balance between the different types of errors. As another alternative, the researcher might be more interested in controlling the proportion of erroneously rejected null hypotheses, the so-called false discovery rate, than in controlling the familywise error rate (Benjamini and Hochberg 1995). Although this approach allows for increased power in large series of repeated tests, it is rarely applied in ecological studies (Garcia 2003, 2004). Recently, Nakagawa (2004) suggested reporting effect sizes together with confidence intervals (CIs) for all potential relationships to allow readers to judge the biological importance of the results and to reduce publication bias. Because of the low power of the tests, the majority of investigated relationships are expected to be nonsignificant, which is thought to make publication difficult. Such difficulty is generally assumed to cause behavioral ecologists to selectively report data (Moran 2003; Nakagawa 2004).
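The contrast between familywise (Bonferroni) control and the Benjamini-Hochberg false discovery rate procedure can be sketched numerically. The following is a minimal illustration, not code from the paper; the function name and the toy P values are hypothetical:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: reject H0 for every test
    ranked at or below the largest k with p_(k) <= (k/m) * q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = np.arange(1, m + 1) / m * q
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest qualifying rank
        reject[order[:k + 1]] = True           # reject everything ranked at or below it
    return reject

pvals = [0.01, 0.02, 0.03, 0.50]
print(benjamini_hochberg(pvals, q=0.05).sum())        # BH rejects 3 of the 4 tests
print((np.asarray(pvals) <= 0.05 / len(pvals)).sum()) # Bonferroni rejects only 1
```

With four tests, Bonferroni demands P <= 0.0125 and retains only the smallest P value, whereas the step-up procedure rejects three hypotheses at the same nominal level, illustrating the power gained by controlling the false discovery rate instead of the familywise error rate.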
The omission of nonsignificant results from publications is undesirable for both scientific and ethical reasons, which makes Bonferroni adjustment problematic. It is noteworthy that direct tests comparing effect sizes of representative samples of published and unpublished studies showed no evidence of publication bias in the biological literature (Koricheva 2003; Møller et al. 2005). However, independent of publication bias, drawing conclusions from effect sizes and the associated CIs should be encouraged. Such an approach considers the magnitude of an effect on a continuous scale, whereas conventional hypothesis testing based on significance levels tends to treat biological questions as all-or-nothing effects, depending on whether P values exceed the critical limit or not (Chow 1988; Wilkinson and the Task Force on Statistical Inference 1999; Thompson 2002). Hence, using the same data, the former approach may reveal that a particular effect is small but still biologically important, whereas the latter approach may lead the investigator to conclude that the hypothesized phenomenon does not exist in nature. Although such philosophical differences may dramatically influence our knowledge, presenting standardized effect sizes is still uncommon in ecology and evolution (Nakagawa 2004). Here, I suggest that, in addition to their presentation, the calculated effect sizes may be further used in simple analyses that can help to estimate the true effect of a predictor variable and thus support general conclusions. These analytical tools rely on the fact that the strength and direction of relationships, as reflected by standardized measures of effect size (Pearson's r, Cohen's d, or Hedges' g), are comparable and independent of the scale on which the variables were measured (e.g., Hedges and Olkin 1985; Cohen 1988; Rosenthal 1991).
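For Pearson's r, the CIs whose reporting Nakagawa (2004) recommends can be obtained via the standard Fisher z-transformation. A minimal sketch, assuming a 95% interval and a normal approximation; the function name and example values are illustrative, not taken from the paper:

```python
import math

def pearson_r_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for Pearson's r via the Fisher z-transformation
    (valid for n > 3; z_crit = 1.96 gives ~95% normal coverage)."""
    z = math.atanh(r)             # r -> z is approximately variance-stabilizing
    se = 1.0 / math.sqrt(n - 3)   # standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

lo, hi = pearson_r_ci(0.30, n=30)
print(f"r = 0.30, N = 30: 95% CI ({lo:.2f}, {hi:.2f})")  # interval spans zero
```

Note that even a "medium" effect of r = 0.3 measured on 30 individuals yields an interval spanning zero, so reporting the estimate with its CI conveys far more than the bare verdict "nonsignificant".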
Thus, if multiple traits are measured and multiple correlations are calculated, the corresponding effect sizes tabulated among the measured variables will have a certain statistical distribution with measurable attributes. Below, I present 4 simple analyses to demonstrate how such statistical attributes can be used to make general interpretations. I will confine myself to a typical sampling design from behavioral ecology, in which the experimenter is interested in explaining variation in certain traits (response variables) in the light of other (predictor) variables. Specific sampling designs can be tailored to the biological question at hand; this will be illustrated using real data on the collared flycatcher, Ficedula albicollis, from Garamszegi et al. (2004). I will also discuss the confounding effect of collinearity between variables, which may violate the assumption of statistical independence, and the potentially low power of the suggested tests.

ANALYSES OF EFFECT SIZES

First, the mean effect size from multiple pairwise tests can be calculated to test the null hypothesis that the mean underlying effect size does not differ from zero. This hypothesis will be rejected if the measured variables covary with a predictor variable consistently in the same direction. Normally, a few of the investigated relationships will be significant but the majority will not (see an example in Table 1). The classical interpretation of these results relies on the relationships that pass the filter of Bonferroni correction (i.e., strong effects). However, weak effects may also have biological importance: a meta-analysis of meta-analyses in ecology and evolution revealed small to intermediate mean effect sizes (r < 0.2) and that the amount of variance explained in biological studies appears to be very small (Møller and Jennions 2002). Therefore, neglecting small effects could be misleading, as it may cause us to overlook weak but evolutionarily important patterns.
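One way this first analysis could be carried out, assuming Pearson's r as the effect-size metric, is to Fisher z-transform each correlation and apply a one-sample t-test against zero. This is a sketch of the general idea, not the paper's exact procedure; the function and the example correlations are hypothetical, and the test assumes the effect sizes are statistically independent (see the caveat on collinearity above):

```python
import math
import statistics

def mean_effect_test(rs):
    """One-sample t-test of the mean effect size against zero, after
    Fisher z-transforming each correlation so the values are ~normal."""
    zs = [math.atanh(r) for r in rs]
    n = len(zs)
    mean_z = statistics.fmean(zs)
    se = statistics.stdev(zs) / math.sqrt(n)
    t = mean_z / se
    return math.tanh(mean_z), t, n - 1   # back-transformed mean r, t statistic, df

# Five pairwise correlations, individually weak but all in the same direction
mean_r, t, df = mean_effect_test([0.10, 0.20, 0.15, 0.30, 0.25])
print(f"mean r = {mean_r:.2f}, t({df}) = {t:.2f}")
```

Here none of the individual effects would survive a Bonferroni filter at realistic sample sizes, yet the consistent direction across tests yields a clearly nonzero mean effect, which is exactly the pattern this analysis is designed to detect.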
A consistent pattern of variation in all measured effect sizes in a certain
doi:10.1093/beheco/ark005