A long read mapping method for highly repetitive reference sequences [article]

Chirag Jain, Arang Rhie, Nancy Hansen, Sergey Koren, Adam M. Phillippy
2020 bioRxiv   pre-print
About 5-10% of the human genome remains inaccessible for functional analysis due to the presence of repetitive sequences such as segmental duplications and tandem repeat arrays. To enable high-quality resequencing of personal genomes, it is crucial to support end-to-end genome variant discovery using repeat-aware read mapping methods. In this study, we highlight the fact that existing long read mappers often yield incorrect alignments and variant calls within long, near-identical repeats, as
more » ... y remain vulnerable to allelic bias. In the presence of a non-reference allele within a repeat, a read sampled from that region could be mapped to an incorrect repeat copy because the standard pairwise sequence alignment scoring system penalises true variants. To address the above problem, we propose a novel, long read mapping method that addresses allelic bias by making use of minimal confidently alignable substrings (MCASs). MCASs are formulated as minimal length substrings of a read that have unique alignments to a reference locus with sufficient mapping confidence (i.e., a mapping quality score above a user-specified threshold). This approach treats each read mapping as a collection of confident sub-alignments, which is more tolerant of structural variation and more sensitive to paralog-specific variants (PSVs) within repeats. We mathematically define MCASs and discuss an exact algorithm as well as a practical heuristic to compute them. The proposed method, referred to as Winnowmap2, is evaluated using simulated as well as real long read benchmarks using the recently completed gapless assemblies of human chromosomes X and 8 as a reference. We show that Winnowmap2 successfully addresses the issue of allelic bias, enabling more accurate downstream variant calls in repetitive sequences. As an example, using simulated PacBio HiFi reads and structural variants in chromosome 8, Winnowmap2 alignments achieved the lowest false-negative and false-positive rates (1.89%, 1.89%) for calling structural variants within near-identical repeats compared to minimap2 (39.62%, 5.88%) and NGMLR (56.60%, 36.11%) respectively. Winnowmap2 code is accessible at https://github.com/marbl/Winnowmap
doi:10.1101/2020.11.01.363887 fatcat:eq5il5khhzfdphfb7zylffnknu