Algorithms and tools for the analysis of high throughput DNA sequencing data

Marcel Martin, Technische Universität Dortmund, Technische Universität Dortmund
2014
High-throughput DNA sequencing technologies make it possible to determine the order of the nucleotides adenine, cytosine, guanine and thymine in DNA samples, resulting in millions of short strings (reads) over the alphabet {A, C, G, T}. Advances in biological and biomedical research rely on the ability of bioinformatics to make sense of that data with novel algorithms and tools. In this thesis, we contribute on four levels to the typical data processing pipeline in sequencing experiments
more » ... provide soware tools that implement the described algorithms. When sequenced DNA fragments are short, reads can contain adapter sequences. ese artifacts are a technical requirement of the sequencing process. We describe how to remove them with a modified semiglobal alignment algorithm that finds overlapping regions between read and adapter. e algorithm is designed to only find alignments below a given error rate threshold, where the error rate is defined as the number of errors divided by the number of aligned adapter characters. We show how to use only linear space while still keeping track of all information necessary to correctly locate and remove adapter sequences. e algorithm can remove adapters also from colorspace reads, which come from a sequencing technology that queries two adjacent nucleotides (colors) of DNA at the same time. We show how to modify the trimming procedure to get correct results. e easy-to-use cutadapt tool is introduced. It contains additional features that make pre-processing of adapter-contaminated reads simple, and is in use by many other researchers. e next step in the pipeline is read mapping, where the likely origin of reads is found on a given reference DNA. We concentrate on mapping reads from bisulfite sequencing experiments, in which sodium bisulfite is used to determine which cytosines have a methyl group attached to them. Methylation changes gene expression and is therefore biologically interesting. Bisulfite converts unmethylated cytosines into thymines. By comparing modified reads to the reference, methylation patterns can be determined. To map reads while allowing sequencing errors and also differences from bisulfite conversion, we introduce the bisulfite q-gram index, an extension of regular q-gram indices. For a given q-gram (string of length q), the index returns all positions in the reference where that bisulfite-converted q-gram may have originated. 
By efficiently simulating bisulfite conversion of the reference, the index can be constructed in time proportional to its memory usage. Simulation theoretically leads to an exponential increase in index size, but on realistic references the index is only three times the size of a regular index. We describe how to map reads with the index using the seed-and-extend paradigm, first finding short matches with the help of the index, and extending them to longer maximal error-free matches (seeds) with either a deterministic finite automaton (DFA) or an efficient bit-parallel algorithm. Seeds are then extended to an alignment that covers the full read, and parts that were not bisulfite converted are detected. We show that the number of bisulfite strings of a given length n is approximately 1.19 · 3.3^n, and we show how to compress the index by up to 25% while retaining efficient access. We finally apply the full read mapping algorithm to a dataset of 454 bisulfite sequencing data using the Verjinxer tool.

Acknowledgments

This thesis would not exist without the many other people who helped me and to whom I am very grateful. My advisor Sven Rahmann supported me throughout my work by always having time to answer my questions. His new and sometimes unusual ideas, which he brought into our discussions, often moved my thinking forward. Jens Stoye sparked my interest in bioinformatics years ago and encouraged me to carry on during a difficult phase. Heinrich Müller and Lars Hildebrand agreed without hesitation to complete the examination committee. Tobias Marschall is the ideal office mate. Through our discussions, he probably had a greater influence on me than he himself would suspect. My colleagues Marianna D'Addario and Dominik Kopczynski always had time and an open ear for me when I needed someone to talk to, about professional matters as well as other things.
Johannes Köster wrote the Snakemake program, which I used to make my research reproducible. Michael Zeschnigk helped me understand the biological part of our work. Thanks to his explanations, my work could ultimately become relevant to researchers in genetics.
doi:10.17877/de290r-439 fatcat:iqmxipdcdrhodf7ei5vumxm3ci