Minerva: An Alignment and Reference Free Approach to Deconvolve Linked-Reads for Metagenomics [article]

David C. Danko, Dmitry Meleshko, Daniela Bezdan, Christopher Mason, Iman Hajirasouliha
2017 bioRxiv pre-print
Emerging linked-read technologies (also known as read-cloud or barcoded short reads) have revived interest in standard short-read technology as a viable way to understand large-scale structure in genomes and metagenomes. Linked-read technologies, such as the 10X Chromium system, use a microfluidic system and a set of specially designed 3′ barcodes (also known as UIDs) to tag short DNA reads that were originally sourced from the same long fragment of DNA; these barcoded reads are subsequently sequenced on standard short-read platforms. This approach involves several compromises. Each long fragment of DNA is covered only sparsely by short reads, no information about the relative ordering of reads from the same fragment is preserved, and each 3′ barcode typically matches reads from 5-20 long fragments of DNA. However, the cost per base is far lower than for single-molecule long-read sequencing systems, far less input DNA is required, and the error rate is that of standard short reads. Linked reads pose a new set of algorithmic challenges. In this paper we formally describe one particular issue common to all applications of linked-read technology: the deconvolution of reads with a single barcode into clusters that each correspond to a single long fragment of DNA. We introduce Minerva, a graph-based algorithm which approximately solves the barcode deconvolution problem for metagenomic data (where reference genomes may be incomplete or unavailable). Additionally, we demonstrate that deconvolved barcoded reads significantly improve downstream results by improving the specificity of taxonomic assignments and the ability of topic models to identify clusters of related sequences.
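To make the barcode deconvolution problem concrete, the following is a minimal, reference-free toy sketch in Python. It is not Minerva's algorithm, which the abstract describes only as graph-based. It assumes, purely for illustration, that two reads sharing a barcode probably come from the same fragment when both share a k-mer with reads from a common other barcode. The names `kmers`, `deconvolve`, and `example_reads`, the choice k=4, and the example sequences are all hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def deconvolve(reads, k=4):
    """Toy barcode deconvolution (illustrative assumption, not Minerva).

    reads: list of (barcode, sequence) tuples.
    Returns {barcode: list of clusters}, where each cluster is a list
    of read indices guessed to come from one long fragment.
    """
    # Index every k-mer by the set of barcodes whose reads contain it.
    kmer_barcodes = defaultdict(set)
    for barcode, seq in reads:
        for km in kmers(seq, k):
            kmer_barcodes[km].add(barcode)

    # Union-find over read indices.
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge reads of one barcode that are linked to the same other barcode.
    anchor = {}  # (own barcode, other barcode) -> first read index seen
    for idx, (barcode, seq) in enumerate(reads):
        neighbors = set()
        for km in kmers(seq, k):
            neighbors |= kmer_barcodes[km]
        neighbors.discard(barcode)
        for other in neighbors:
            key = (barcode, other)
            if key in anchor:
                parent[find(idx)] = find(anchor[key])
            else:
                anchor[key] = idx

    # Collect the connected components within each barcode.
    clusters = defaultdict(lambda: defaultdict(list))
    for idx, (barcode, _) in enumerate(reads):
        clusters[barcode][find(idx)].append(idx)
    return {bc: sorted(groups.values()) for bc, groups in clusters.items()}


# Reads 0 and 1 do not overlap each other, but both overlap read 3,
# which carries a different barcode; read 2 is unrelated.
example_reads = [
    ("bc1", "AAAACC"),
    ("bc1", "CCGGGG"),
    ("bc1", "TGCATG"),
    ("bc2", "AACCCCGG"),
]
print(deconvolve(example_reads))
# {'bc1': [[0, 1], [2]], 'bc2': [[3]]}
```

Linking through other barcodes rather than through direct overlap reflects the sparse coverage noted above: reads from the same fragment rarely overlap one another. The sketch over-merges whenever a single other barcode happens to contain fragments from two unrelated regions, which a practical method would have to handle.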
doi:10.1101/217869 fatcat:ubg27u5d3fan3ohcyajytu3szu