Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs
Variable number tandem repeats (VNTRs) are genetic loci composed of consecutive repeats of short segments of DNA, with hypervariable repeat count and composition across individuals. Many genes have coding VNTR sequences, and noncoding VNTR variants are associated with a wide spectrum of clinical disorders such as Alzheimer's disease, bipolar disorder and colorectal cancer. The identification of VNTR length and composition provides the basis for downstream analyses such as expression quantitative trait loci discovery and genome-wide association studies. Disease studies that use high-throughput short-read sequencing do not resolve the repeat structures of many VNTR loci because the VNTR sequence is missing from the reference or is too repetitive to map. We solve the VNTR mapping problem for short reads by representing a collection of genomes with a repeat-pangenome graph, a data structure that encodes both the population diversity and repeat structure of VNTR loci. We developed software to build a repeat-pangenome graph from haplotype-resolved single-molecule sequencing assemblies, and to estimate VNTR length and sequence composition from the alignment of short-read sequences to the graph. Using long-read assemblies as ground truth, we determine which VNTR loci can be accurately profiled using repeat-pangenome graph analysis with short reads. This enabled us to measure the global diversity of VNTR sequences in the 1000 Genomes Project and to discover expression quantitative trait loci in the Genotype-Tissue Expression Project. This analysis reveals loci that have significant differences in length and repeat composition between continental populations. Furthermore, the repeat-pangenome graph analysis establishes an association between previously inaccessible variation and gene expression. Taken together, these results indicate that measuring VNTR sequence diversity with repeat-pangenome graphs will be a critical component of future studies on human diversity and disease.
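The idea of a graph that encodes both population diversity and repeat structure can be illustrated with a minimal sketch. It assumes the graph is a de Bruijn graph over k-mers, which the abstract does not specify, and that VNTR length can be approximated from the number of read k-mers that fall in the graph divided by a known per-position coverage. The function names (`build_repeat_graph`, `count_read_kmers`, `estimate_length`) and the `coverage_per_kmer` parameter are hypothetical and are not taken from the authors' software.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_repeat_graph(haplotypes, k):
    """Union of per-haplotype de Bruijn graphs (an assumed representation).

    Nodes are k-mers; an edge a -> b exists if b directly follows a in
    any haplotype. Repeated units collapse into cycles, and alleles from
    different genomes share nodes wherever their k-mers agree.
    """
    edges = {}
    for seq in haplotypes:
        ks = kmers(seq, k)
        for node in ks:
            edges.setdefault(node, set())
        for a, b in zip(ks, ks[1:]):
            edges[a].add(b)
    return edges


def count_read_kmers(graph, reads, k):
    """Count read k-mers that are nodes of the graph; others are ignored."""
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            if km in graph:
                counts[km] += 1
    return counts


def estimate_length(counts, k, coverage_per_kmer):
    """Rough locus length in bases from total k-mer hits.

    Assumes uniform coverage: hits / coverage approximates the number of
    k-mer positions in the locus, and a sequence with n k-mer positions
    has n + k - 1 bases.
    """
    return sum(counts.values()) / coverage_per_kmer + k - 1


# Toy locus: two assembled haplotypes with 3 and 5 copies of the unit "ACG".
k = 3
graph = build_repeat_graph(["ACG" * 3, "ACG" * 5], k)

# Toy sample with 4 copies, "sequenced" twice end to end (coverage 2).
reads = ["ACG" * 4, "ACG" * 4]
counts = count_read_kmers(graph, reads, k)
estimate = estimate_length(counts, k, coverage_per_kmer=2)
```

In this toy case both haplotypes collapse onto the same three-node cycle, so the graph alone does not record copy number; the length signal comes entirely from how many read k-mers land on the cycle. Real data would need bias correction, which is presumably where long-read assemblies as ground truth come in.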