Kafka interfaces for composable streaming genomics pipelines

Francesco Versaci, Luca Pireddu, Gianluigi Zanetti
2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI)
Modern sequencing machines produce on the order of a terabyte of data per day, which must then go through a complex processing pipeline. The standard workflow begins with a few independent, shared-memory tools that communicate by means of intermediate files. As the amount of data produced keeps growing, this approach is proving increasingly unmanageable because of its lack of robustness and scalability. In this work we propose the adoption of stream computing to simplify the genomic pipeline, boost its performance and improve its fault tolerance. We decompose the first steps of the genomic processing into two distinct, specialized modules (preprocessing and alignment) and loosely compose them via communication through Kafka streams, to allow for easy composability and integration into existing Hadoop-based pipelines. The proposed solution is then experimentally validated on real data and shown to scale almost linearly.
doi:10.1109/bhi.2018.8333418 dblp:conf/bhi/VersaciPZ18 fatcat:3tl35kjc2rfx5jap2ex2ynqd2a
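
The loose coupling described in the abstract, two stages that share only a stream, can be illustrated with a minimal sketch. This is not the authors' implementation and it does not use the Kafka API: the `Topic` class is a hypothetical in-memory stand-in for a Kafka topic, the `SENTINEL` end-of-stream marker is an assumption of the sketch, and "alignment" is reduced to a toy exact substring search purely for illustration.

```python
import queue
import threading

SENTINEL = None  # end-of-stream marker (an assumption of this sketch)


class Topic:
    """In-memory stand-in for a Kafka topic (hypothetical; not the Kafka API)."""

    def __init__(self):
        self._q = queue.Queue()

    def send(self, record):
        self._q.put(record)

    def __iter__(self):
        # Yield records in arrival order until the end-of-stream marker shows up.
        while True:
            record = self._q.get()
            if record is SENTINEL:
                return
            yield record


def preprocess(raw_reads, out_topic):
    """Stage 1: normalize raw reads and publish them to the stream."""
    for read in raw_reads:
        out_topic.send(read.strip().upper())
    out_topic.send(SENTINEL)


def align(in_topic, reference, results):
    """Stage 2: consume reads; toy 'alignment' is the first exact-match offset (-1 if none)."""
    for read in in_topic:
        results.append((read, reference.find(read)))


def run_pipeline(raw_reads, reference):
    """Run both stages concurrently; they share nothing but the topic."""
    topic = Topic()
    results = []
    producer = threading.Thread(target=preprocess, args=(raw_reads, topic))
    consumer = threading.Thread(target=align, args=(topic, reference, results))
    consumer.start()
    producer.start()
    producer.join()
    consumer.join()
    return results
```

Because the stages interact only through the topic, either one could in principle be replaced or scaled independently, which is the composability property the abstract emphasizes. A real deployment would use Kafka producers and consumers instead of the in-process queue shown here.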