Performance Analysis of a Parallel, Multi-node Pipeline for DNA Sequencing [chapter]

Dries Decap, Joke Reumers, Charlotte Herzeel, Pascal Costanza, Jan Fostier
2016 Lecture Notes in Computer Science  
Post-sequencing DNA analysis typically consists of read mapping followed by variant calling and is very time-consuming, even on a multi-core machine. Recently, we proposed Halvade, a parallel, multi-node implementation of a DNA sequencing pipeline according to the GATK Best Practices recommendations. The MapReduce programming model is used to distribute the workload among different workers. In this paper, we study the impact of different hardware configurations on the performance of Halvade.
Benchmarks indicate that the lack of good multithreading capabilities in the existing tools (BWA, SAMtools, Picard, GATK) in particular causes suboptimal scaling behavior. We demonstrate that it is possible to circumvent this bottleneck by using multiprocessing on high-memory machines rather than multithreading. Using a 15-node cluster with 360 CPU cores in total, this results in a runtime of 1 h 31 min. Compared to a single-threaded runtime of ∼12 days, this corresponds to an overall parallel efficiency of 53%.
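The 53% figure can be checked from the numbers quoted in the abstract. The sketch below assumes the usual definition of parallel efficiency (speedup divided by core count) and treats "∼12 days" as exactly 12 days, so the result is only approximate; the function and variable names are illustrative and do not come from the paper.

```python
def parallel_efficiency(serial_minutes, parallel_minutes, cores):
    """Return (speedup, efficiency), where efficiency = speedup / cores."""
    speedup = serial_minutes / parallel_minutes
    return speedup, speedup / cores


# Figures taken from the abstract; "~12 days" is rounded to exactly 12 days here.
serial_minutes = 12 * 24 * 60      # single-threaded runtime
parallel_minutes = 1 * 60 + 31     # 1 h 31 min on the 15-node cluster
cores = 360

speedup, efficiency = parallel_efficiency(serial_minutes, parallel_minutes, cores)
print(f"speedup ~ {speedup:.0f}x, efficiency ~ {efficiency:.0%}")
```

Under these assumptions the speedup is roughly 190x, which on 360 cores gives an efficiency of about 53%, consistent with the value reported above.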
doi:10.1007/978-3-319-32152-3_22 fatcat:sq2tl5dlwrg35mcxy5fjzcir24