Haploid, diploid, and pooled exome capture recapitulate features of biology and paralogy in two non-model tree species [article]

Brandon M Lind, Mengmeng Lu, Dragana Vidakovic, Pooja Singh, Tom R Booker, Sam Yeaman, Sally Aitken
2020 bioRxiv   pre-print
Despite their suitability for studying evolution, many conifer species have large and repetitive giga-genomes (16-31Gbp) that create hurdles to producing high coverage SNP datasets that capture diversity from across the entirety of the genome. Due in part to multiple ancient whole genome duplication events, gene family expansion and subsequent evolution within Pinaceae, false diversity from the misalignment of paralog copies creates further challenges in accurately and reproducibly inferring evolutionary history from sequence data. Here, we leverage the cost-saving benefits of pool-seq and exome-capture to discover SNPs in two conifer species, Douglas-fir (Pseudotsuga menziesii var. menziesii (Mirb.) Franco, Pinaceae) and jack pine (Pinus banksiana Lamb., Pinaceae). We show, using minimal baseline filtering, that allele frequencies estimated from pooled individuals show a strong positive correlation with those estimated by sequencing the same population as individuals (r > 0.948), on par with such comparisons made in model organisms. Further, we highlight the use of haploid megagametophyte tissue in identifying sites that are likely due to misaligned paralogs. Together with additional minor filtering, we show that it is possible to remove many of the loci with large frequency estimate discrepancies between individual and pooled sequencing approaches, improving the correlation further (r > 0.973). Our work addresses bioinformatic challenges in non-model organisms with large and complex genomes, highlights the use of megagametophyte tissue for the identification of paralog sites when sequencing large numbers of populations, and suggests the combination of pool-seq and exome capture to be robust for further evolutionary hypothesis testing in these systems.
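The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the `max_het_samples` threshold, and all toy values are hypothetical, and it assumes that a putative paralog site is one where a haploid megagametophyte sample is called heterozygous, which should not occur at a correctly aligned single-copy locus.

```python
from math import sqrt


def individual_allele_freq(genotypes):
    """Alt-allele frequency from diploid genotypes coded 0/1/2 (None = missing)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes at this site")
    return sum(called) / (2 * len(called))


def pooled_allele_freq(alt_reads, total_reads):
    """Alt-allele frequency estimated from pooled read counts."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def pearson_r(x, y):
    """Pearson correlation between two equal-length frequency vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant vector")
    return sxy / sqrt(sxx * syy)


def flag_putative_paralog_sites(het_calls_by_site, max_het_samples=0):
    """Return sites where more than max_het_samples haploid samples look heterozygous.

    het_calls_by_site maps a site ID to a list of booleans, one per haploid sample.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    return {
        site
        for site, calls in het_calls_by_site.items()
        if sum(calls) > max_het_samples
    }


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    genotypes = {
        "site1": [0, 0, 1, 2],
        "site2": [1, 1, 2, 2],
        "site3": [0, 1, 0, 0],
    }
    pool_counts = {"site1": (18, 50), "site2": (40, 52), "site3": (6, 48)}
    haploid_het = {
        "site1": [False, False],
        "site2": [True, False],
        "site3": [False, False],
    }

    flagged = flag_putative_paralog_sites(haploid_het)
    kept = [s for s in genotypes if s not in flagged]
    ind = [individual_allele_freq(genotypes[s]) for s in kept]
    pool = [pooled_allele_freq(*pool_counts[s]) for s in kept]
    print("flagged sites:", sorted(flagged))
    print("r over kept sites:", pearson_r(ind, pool))
```

In practice the frequencies and heterozygous calls would come from variant-calling output rather than hand-written dictionaries, and a real analysis would need far more than a handful of sites for the correlation to be meaningful.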
doi:10.1101/2020.10.07.329961 fatcat:guy7dpecbrccrek2aoyuju56wi