Hardy-Weinberg equilibrium revisited for inferences on genotypes featuring allele and copy-number variations

Andreas Recke, Klaus-Günther Recke, Saleh Ibrahim, Steffen Möller, Reinhard Vonthein
2015 Scientific Reports  
Copy number variations represent a substantial source of genetic variation and are associated with a plethora of physiological and pathophysiological conditions. Joint copy number and allelic variations (CNAVs) are difficult to analyze and require new strategies to unravel the properties of genotype distributions. We developed a Bayesian hidden Markov model (HMM) approach that allows dissecting intrinsic properties and metastructures of the distribution of CNAVs within populations, in ... haplotype phases of genes with varying copy numbers. As a key feature, this approach incorporates an extension of the Hardy-Weinberg equilibrium, allowing both a comprehensive and parsimonious model design. We demonstrate the quality of performance and applicability of the HMM approach with a real data set describing the Fcγ receptor (FcγR) gene region. Our concept, using a dynamic process to analyze a static distribution, establishes the basis for a novel understanding of complex genomic data sets.
Copy number variations (CNVs) are common and represent a source of enormous genetic complexity, which is further increased by sequence variations. CNVs have been recognized as highly important for the understanding of human disease pathogenesis 1. Nonetheless, the complexity of CNAVs poses a challenge for statistical association analyses, requiring novel approaches to reveal hidden metastructures. Most CNVs are inherited, but 10% develop de novo, either during parental meiosis (30%) or during developmental mitosis (70%). The majority of CNVs (approximately 80%) are gains 2. During mitosis, several mechanisms have been proposed to cause changes in copy numbers 3-6. Among these mechanisms, the microhomology-mediated break-induced repair (MMBIR) mechanism has been proposed to be a major source of CNVs 4,7. This repair mechanism uses either the sister DNA strand or the second chromosome as a reference to repair DNA strand breaks during replication. In microhomologous regions, annealing to the reference may be misplaced, leading to either the deletion or duplication of the affected gene region. This mechanism causes loss of heterozygosity (LOH) as its specific signature in the respective DNA sequence 4,7. A prominent example of a CNAV is in the genetic region harboring the genes for low-affinity Fcγ receptors.
Fcγ receptors are key molecules for the binding of immunoglobulins by cellular players of the immune system and mediate a plethora of downstream signaling events 8,9. These receptors are associated with susceptibility to autoimmune diseases, including systemic lupus erythematosus, rheumatoid arthritis and idiopathic thrombocytopenic purpura 10-16. Although each FcγR possesses a unique functionality, their respective genes show a very high intergenic homology. Currently, the most advanced high-throughput method to characterize CNAVs of the FcγR gene region is a multiplex ligation-dependent probe amplification (MLPA) method 16,17. This method determines the abundance counts of sequence motifs, i.e., alleles, in genomic DNA. In detail, it determines integer copy numbers for 7 genes within the FcγR gene region, ranging from 0 to more than 5 copies, and distinguishes allelic variants of 9 different single nucleotide polymorphisms (SNPs) (Fig. 1). The approach we present here was originally developed to address the complexity of this data set. For this purpose, we re-interpret these data as the static summary of a dynamic series of events. The order of events is introduced as a latent variable described by a specialized hidden Markov model (HMM), whose states are traversed in a random walk according to transition probabilities inferred from the data set. The order of events can be regarded as the order of genes and alleles along a single chromosomal strand (Fig. 2a). Inspired by the above-described mechanisms that lead to copy number variation, we introduce recursive loops that model deletion or multiplication of genes (Fig. 2b).
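The random walk with recursive loops can be illustrated with a minimal Python sketch. It is not the authors' model: the gene names, the `skip` and `repeat` probabilities, the `max_copies` cap and the function names are all hypothetical, and the probabilities are fixed by hand rather than inferred from data. The sketch also assumes, as one simple reading of the Hardy-Weinberg idea, that a diploid genotype is the sum of two independently drawn strands.

```python
import random
from collections import Counter

# Hypothetical gene order along one strand (placeholders, not the real gene set).
GENES = ["geneA", "geneB", "geneC"]

# Illustrative transition probabilities (assumed values, not inferred from data):
#   skip   = probability of bypassing the gene (deletion)
#   repeat = probability of looping back to emit the gene again (duplication)
PARAMS = {
    "geneA": {"skip": 0.05, "repeat": 0.10},
    "geneB": {"skip": 0.20, "repeat": 0.30},
    "geneC": {"skip": 0.01, "repeat": 0.05},
}


def walk_strand(rng, max_copies=5):
    """Walk once along the gene order and return the genes emitted on one strand."""
    strand = []
    for gene in GENES:
        p = PARAMS[gene]
        if rng.random() < p["skip"]:
            continue  # deletion: the gene contributes zero copies
        strand.append(gene)
        copies = 1
        # Recursive loop: keep re-emitting the gene with probability `repeat`.
        while copies < max_copies and rng.random() < p["repeat"]:
            strand.append(gene)
            copies += 1
    return strand


def sample_genotype(rng):
    """Combine two independent strands into unphased per-gene copy numbers."""
    counts = Counter(walk_strand(rng)) + Counter(walk_strand(rng))
    return {gene: counts.get(gene, 0) for gene in GENES}


rng = random.Random(0)
population = [sample_genotype(rng) for _ in range(1000)]
```

Each entry of `population` is a dictionary of integer copy numbers per gene, which is the kind of unphased summary an abundance-count assay would report. The underlying strand order is the latent variable that the sketch generates but the data do not show.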
doi:10.1038/srep09066 pmid:25765626 pmcid:PMC4357990 fatcat:a5wxkjznbfewvficf7dpwbvequ