Accelerating Maximal-Exact-Match Seeding with Enumerated Radix Trees [article]

Arun Subramaniyan, Jack Wadden, Kush Goliya, Nathan Ozog, Xiao Wu, Satish Narayanasamy, David Blaauw, Reetuparna Das
2020 bioRxiv   pre-print
ABSTRACT
Motivation: Read alignment is a time-consuming step in genome sequence analysis. In the read alignment software BWA-MEM and the recently published faster version BWA-MEM2, the seeding step is a major bottleneck, for instance, contributing 38% to the overall execution time in BWA-MEM2 when aligning single-end whole human genome reads from the Platinum Genomes dataset. This is because both BWA-MEM and BWA-MEM2 use a compressed index structure called the FMD-Index, which results in high memory bandwidth requirements for seeding, primarily due to its character-by-character processing of reads.
Results: We propose a memory bandwidth-aware data structure for maximal-exact-match seeding called Enumerated Radix Tree (ERT). ERT trades off memory capacity to improve seeding performance (∼60 GB index for human genome). Together with optimizations to the seeding algorithm and mate-rescue step, ERT when integrated into BWA-MEM2 speeds up overall read alignment by 1.28× and provides up to 2.1× higher seeding performance while guaranteeing identical output to the original software. Furthermore, we prototype an FPGA implementation of ERT on Amazon EC2 F1 cloud and observe 1.6× higher seeding throughput over a 48-thread optimized CPU-ERT implementation.
Availability and implementation: https://github.com/arun-sub/bwa-mem2
Contact: arunsub@umich.edu, reetudas@umich.edu
doi:10.1101/2020.03.23.003897 fatcat:ih3uy7tabjc7zjr774uqez2nva
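
The abstract's figures can be related with a back-of-the-envelope Amdahl's law estimate. The sketch below is not the authors' analysis. It assumes that the 38% seeding share and the "up to 2.1×" seeding speedup apply to the same workload, and that the rest of the pipeline is unchanged. The function name `amdahl_speedup` is illustrative only.

```python
def amdahl_speedup(fraction: float, local_speedup: float) -> float:
    """Overall speedup when `fraction` of the runtime is sped up by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Figures quoted in the abstract; pairing them is an assumption of this sketch.
seeding_fraction = 0.38  # seeding share of BWA-MEM2 execution time
seeding_speedup = 2.1    # "up to 2.1x" higher seeding performance

overall = amdahl_speedup(seeding_fraction, seeding_speedup)
ceiling = 1.0 / (1.0 - seeding_fraction)  # limit if seeding took no time at all

print(f"estimated overall speedup: {overall:.2f}x")
print(f"ceiling from seeding alone: {ceiling:.2f}x")
```

Under these assumptions the estimate is roughly 1.25×, close to the reported 1.28×. The abstract also credits optimizations to the mate-rescue step, which this simple model does not include.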