Investigation of rare and low-frequency variants using high-throughput sequencing with pooled DNA samples

Jingwen Wang, Tiina Skoog, Elisabet Einarsdottir, Tea Kaartokallio, Hannele Laivuori, Anna Grauers, Paul Gerdhem, Marjo Hytönen, Hannes Lohi, Juha Kere, Hong Jiao
2016 Scientific Reports  
High-throughput sequencing using pooled DNA samples can facilitate genome-wide studies on rare and low-frequency variants in a large population. Some major questions concerning the pooling sequencing strategy are whether rare and low-frequency variants can be detected reliably, and whether estimated minor allele frequencies (MAFs) can represent the actual values obtained from individually genotyped samples. In this study, we evaluated MAF estimates using three variant detection tools with two sets of pooled whole exome sequencing (WES) and one set of pooled whole genome sequencing (WGS) data. Both GATK and Freebayes displayed high sensitivity, specificity and accuracy when detecting rare or low-frequency variants. For the WGS study, 56% of the low-frequency variants on the Illumina array had identical MAFs and 26% differed by one allele between the sequencing and individual genotyping data. The MAF estimates from WGS correlated well (r = 0.94) with those from Illumina arrays. The MAFs from the pooled WES data also showed high concordance (r = 0.88) with those from the individual genotyping data. In conclusion, the MAFs estimated from pooled DNA sequencing data reflect the MAFs in individually genotyped samples well. The pooling strategy can thus be a rapid and cost-effective approach for the initial screening in large-scale association studies.

In the last two decades, more than 10,000 variants associated with complex traits have been identified by genome-wide association studies (GWAS) [1]. However, most of the target sites of GWAS have been common variants (risk allele frequency > 5%) with modest or weak genetic effects, usually requiring large sample sizes for detection at the genome-wide significance level [2]. On the other hand, it is possible that common diseases are partially caused by rare and generally deleterious variants with a strong impact on the risk of disease in individual patients [3]. The majority of those low-frequency variants have not been investigated by single-nucleotide polymorphism (SNP) array-based GWAS, as the arrays primarily target common variants. High-throughput next generation sequencing (NGS) technologies have revolutionised genetic research by enabling the identification of rare and low-frequency genetic variation on a massive scale [4,5].
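The comparison summarised in the abstract (pooled MAF estimates versus MAFs from individually genotyped samples) can be illustrated with a minimal sketch. It assumes the pooled MAF is estimated simply as the fraction of reads carrying the minor allele and that the pool contains diploid individuals; the text above does not state how the variant callers compute their estimates, so this is not the authors' pipeline. All function names are hypothetical and the numbers are toy values, not data from the study.

```python
from math import sqrt


def pooled_maf(alt_reads, total_reads):
    """Assumed estimator: fraction of reads in the pool carrying the minor allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def allele_count(maf, n_individuals):
    """Nearest whole number of minor alleles among 2N chromosomes (diploid pool)."""
    return round(maf * 2 * n_individuals)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def compare_pooled_to_genotyped(sites, n_individuals):
    """Compare pooled estimates with individual genotyping.

    sites: list of (alt_reads, total_reads, genotyped_minor_allele_count).
    Returns the correlation of the MAFs and the fraction of sites whose
    estimated allele count is identical or off by exactly one allele.
    """
    pooled = [pooled_maf(alt, total) for alt, total, _ in sites]
    genotyped = [count / (2 * n_individuals) for _, _, count in sites]
    diffs = [
        abs(allele_count(maf, n_individuals) - count)
        for maf, (_, _, count) in zip(pooled, sites)
    ]
    return {
        "r": pearson_r(pooled, genotyped),
        "identical": sum(d == 0 for d in diffs) / len(sites),
        "one_allele_off": sum(d == 1 for d in diffs) / len(sites),
    }


# Toy input: 50 diploid individuals in the pool (100 chromosomes).
toy_sites = [(2, 100, 2), (3, 150, 3), (10, 200, 5), (1, 100, 4)]
result = compare_pooled_to_genotyped(toy_sites, n_individuals=50)
print(result)
```

Rounding the pooled frequency to a whole allele count is what makes "identical" and "one allele difference" well defined for a pool of known size; a real analysis would also need to account for sequencing error and uneven representation of individuals in the pool, which this sketch ignores.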
In contrast to SNP array genotyping, next generation DNA sequencing does not rely on pre-designed probes against target sequences and is therefore able to detect any variant within the studied genome. Moreover, the new technology greatly reduces the per-base-pair sequencing cost, provides high read coverage and depth, and produces an abundance of sequencing reads at both the whole-genome and exome-wide scale. It has contributed to the mapping of a number
doi:10.1038/srep33256 pmid:27633116 pmcid:PMC5025741 fatcat:rpppubn6c5hm3gymkqwogssytm