Allele frequency estimation
Isolation, Migration and Health
For genetic association studies with related individuals, the standard linear mixed-effect model is the most popular approach. The model treats a complex trait (phenotype) as the response variable and a genetic variant (genotype) as a covariate. An alternative approach is to reverse the roles of phenotype and genotype. This class of tests includes quasi-likelihood based score tests. In this work, after reviewing these existing methods, we propose a general, unifying 'reverse' regression framework. We then show that the proposed method can also explicitly adjust for potential departure from Hardy-Weinberg equilibrium. Lastly, we demonstrate the additional flexibility of the proposed model on allele frequency estimation, as well as its connection with earlier work on the best linear unbiased allele-frequency estimator. We conclude the paper with supporting evidence from simulation and application studies.
Weinberg equilibrium; Allele frequency estimation.
All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. https://doi.org/10.1101/470328 doi: bioRxiv preprint
Genetic association studies aim at identifying genetic variants, Gs, that influence a heritable trait, Y, of interest. To this end, allele-based association tests or allelic tests, comparing allele frequencies between case and control groups, are locally most powerful (Sasieni, 1997) in a sample of unrelated individuals. However, traditional allelic tests (i) analyze only binary outcomes, (ii) cannot easily accommodate covariates, (iii) are limited to independent samples, and (iv) have type I error control issues if there is a departure from Hardy-Weinberg equilibrium (HWE) in the study population.
HWE states that the two alleles in a genotype are independent draws from the same distribution, or, equivalently, genotype frequencies depend solely on the allele frequencies. For a bi-allelic SNP with two possible alleles A and a, let $p$ and $1 - p$ be the respective allele frequencies. Under HWE, $p_{aa} = (1 - p)^2$, $p_{Aa} = 2p(1 - p)$, and $p_{AA} = p^2$, where $p_{aa}$, $p_{Aa}$, and $p_{AA}$ are the genotype frequencies of genotypes aa, Aa and AA, respectively.
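As a small illustration of the HWE relations just stated, the following sketch maps an allele frequency to the three genotype frequencies. The function name `hwe_genotype_frequencies` and the example value 0.3 are illustrative choices, not taken from the paper.

```python
def hwe_genotype_frequencies(p):
    """Return (p_aa, p_Aa, p_AA) under HWE, given the frequency p of allele A."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p  # frequency of allele a
    # Two independent allele draws: aa, Aa (either order), AA.
    return q * q, 2.0 * p * q, p * p


# Hypothetical example: allele A has frequency 0.3.
p_aa, p_Aa, p_AA = hwe_genotype_frequencies(0.3)
total = p_aa + p_Aa + p_AA  # the three frequencies sum to one
```

Because the genotype frequencies are fully determined by the allele frequency under HWE, any observed genotype distribution that cannot be written in this form indicates a departure from HWE.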
To measure the departure from HWE or the amount of Hardy-Weinberg disequilibrium (HWD),
$$\delta = p_{AA} - p^2 \tag{1}$$
is a widely used quantity, and $\delta = 0$ corresponds to HWE (Weir, 1996).
Genotype-based association tests treat phenotype Y as the response variable and genotype G as an explanatory variable. Due to the regression nature of the framework, genotype-based association tests can easily handle continuous traits and incorporate covariates. It is commonly assumed that genotype-based association tests are robust to departure from HWE. With a sample of independent individuals, both theoretical and empirical results support this (Sasieni, 1997; Schaid and Jacobsen, 1999). However, for samples of dependent individuals, this robustness has received little discussion.
When individuals in a sample are genetically related to each other, linear mixed-effect models (LMM) have become the most popular approach for association testing (Eu-Ahsunthornwattana et al., 2014). The variance-covariance matrix of the phenotype is partitioned into a weighted sum of correlation structures due to genetic relatedness and shared environmental effects, where the weight is usually referred to as 'heritability' (Visscher et al., 2008). The genetic relatedness is typically represented by a known kinship coefficient matrix, or estimated from the available genome-wide genetic data if the pedigree information was not collected (Yang et al., 2011; Sun and Dimitromanolakis, 2012).
An alternative approach is to reverse the roles of Y and G in the regression model. O'Reilly et al. (2012) proposed MultiPhen, a method that treats the genotype G of a SNP as the response variable and the phenotype values Ys of multiple traits as predictors.
However, MultiPhen relies on ordinal logistic regression and analyzes only independent samples. MultiPhen does not require the assumption of HWE, but insight into its robustness to HWD was not provided (O'Reilly et al., 2012).
Thornton and McPeek (2007) extended the traditional allelic tests to study binary traits with related individuals. Their test was then generalized by Feng et al. (2011) and Feng (2014) to a quasi-likelihood score test for either binary or continuous traits. However, none of these methods can directly incorporate covariates. Jakobsdottir and McPeek (2013) later proposed a 'retrospective' approach, MASTOR, to study the association between G and one (approximately) normally distributed trait Y, while accommodating covariates in related individuals. All methods in this category, however, implicitly assume HWE.
In this paper, we first review and provide some insights into the aforementioned genetic association tests. We then propose a robust and flexible 'reverse' regression framework that (a) unifies several existing association methods, and (b) explicitly includes a correction factor in the variance-covariance matrix to adjust for potential departure from HWE. Further, we show that the proposed 'reverse' regression framework (c) can also be used to estimate allele frequency in complex pedigrees. Interestingly, we reveal that for the simple case of no covariates and HWE, the proposed estimator is the best linear unbiased estimator of McPeek et al. (2004). We conclude the paper with supporting evidence from simulation and application studies, and some discussion points.
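To make the reference to a best linear unbiased allele-frequency estimator more concrete, here is a minimal sketch under stated assumptions. It assumes the estimator takes the usual generalized least squares form, $(1^T \Phi^{-1} 1)^{-1} 1^T \Phi^{-1} (G/2)$, where $G$ holds the counts of allele A (0, 1 or 2) and $\Phi$ is a known, symmetric relatedness matrix (here, twice the kinship coefficients off the diagonal and ones on the diagonal for non-inbred individuals). The function names, the small solver, and the toy pedigree are illustrative and are not taken from the paper.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("relatedness matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def blue_allele_frequency(genotypes, phi):
    """GLS-type estimate of p: (1' Phi^-1 1)^-1 1' Phi^-1 (G / 2)."""
    half_counts = [g / 2.0 for g in genotypes]
    # Phi is symmetric, so the weights Phi^-1 1 also give 1' Phi^-1.
    weights = solve_linear(phi, [1.0] * len(genotypes))
    return sum(w * y for w, y in zip(weights, half_counts)) / sum(weights)


# Hypothetical sample: two full siblings (kinship 0.25, so entry 0.5)
# and one unrelated individual.
phi = [[1.0, 0.5, 0.0],
       [0.5, 1.0, 0.0],
       [0.0, 0.0, 1.0]]
p_hat = blue_allele_frequency([2, 2, 0], phi)
```

With an identity relatedness matrix the estimate reduces to the simple allele count average. In the toy example the two siblings are down-weighted relative to the unrelated individual, because their genotypes carry partly redundant information.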
... $\hat{\delta} = \hat{p}_{AA} - \hat{p}^2$, $\hat{p}_{AA} = n_2/n$, $\hat{p} = (2n_2 + n_1)/(2n)$, and $\hat{p}$ and $\hat{p}_{AA}$ are the sample frequency estimates of allele A and genotype AA, respectively. Nevertheless, existing (classical and robust) allelic tests are limited to binary Y without consideration of covariate effects.
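The sample estimates above can be computed directly from genotype counts. The sketch below assumes that $n_1$ and $n_2$ are the numbers of Aa and AA individuals and that $n$ is the total sample size, which is how the formulas read. The count `n0` (number of aa individuals), the function name, and the example counts are hypothetical additions for illustration.

```python
def hwd_estimates(n0, n1, n2):
    """Return (p_hat, p_AA_hat, delta_hat) from counts of aa, Aa and AA."""
    n = n0 + n1 + n2
    if n <= 0:
        raise ValueError("need at least one genotyped individual")
    p_hat = (2 * n2 + n1) / (2 * n)     # sample frequency of allele A
    p_AA_hat = n2 / n                   # sample frequency of genotype AA
    delta_hat = p_AA_hat - p_hat ** 2   # sample version of equation (1)
    return p_hat, p_AA_hat, delta_hat


# Hypothetical counts: 50 aa, 40 Aa and 10 AA individuals.
p_hat, p_AA_hat, delta_hat = hwd_estimates(50, 40, 10)
```

A positive `delta_hat` indicates more AA homozygotes than HWE would predict from the allele frequency, and a negative value indicates fewer.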