A Pipeline for Markers Selection Using Restriction Site Associated DNA Sequencing (Radseq)

Hanan Begali
2018 Journal of Applied Bioinformatics & Computational Biology  
The discovery and assessment of genetic variants from Next Generation Sequencing (NGS) data, including Restriction site Associated DNA sequencing (RADSeq), is an important task in bioinformatics and comparative genetics. Relative to a reference genome, these variants can be single-nucleotide polymorphisms (SNPs) or insertions and deletions (indels). Usually, the short reads are first aligned to a reference genome using NGS alignment software such as the Burrows-Wheeler Aligner (BWA). The alignment is usually stored in a BAM file, the binary form of the standard SAM (Sequence Alignment/Map) format. Analysis software such as the Genome Analysis Toolkit (GATK) or SAMTools, together with scripts written in the R programming language, can then provide an efficient solution for calling variants. In this project, we focus on RADSeq-based marker selection for Arabidopsis thaliana. RADSeq data consist of short reads that do not cover the whole reference genome. SNPs were called with GATK or SAMTools, yielding four call sets in Variant Call Format (VCF). The VCF files were then visualized with the Integrative Genomics Viewer (IGV). We found that visualizing SNPs and indels was very helpful and gave us valuable insights into marker selection. We also found that applying a chi-square test of Hardy-Weinberg Equilibrium (HWE) to all target genotypes (homozygous reference 0/0, heterozygous variant 0/1, and homozygous variant 1/1) significantly reduces the false positive rate. We show that our pipeline is efficient for RADSeq-based marker selection.

... investigating single nucleotide polymorphisms (SNPs). A SNP is a difference in a single nucleotide of DNA at a particular location in the genome. Identifying reliable markers from RADSeq data therefore requires data processing to determine candidate markers, evaluate them, and obtain reliable SNPs [5,6,7].

Pre-processing data for mapping sequences

To prepare the RAD sequences for downstream analysis, the data were pre-processed. RAD sequence data typically arrive in a raw state, in the form of FASTQ files. FASTQ files store the short reads from a high-throughput sequencing experiment before mapping, and record each sequence together with a quality score for each nucleotide [12].
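As an illustration of the FASTQ layout described above, the following is a minimal sketch of a reader that pairs each read with its per-nucleotide quality scores. It is not taken from the paper's pipeline. It assumes the common four-line record layout and Phred+33 quality encoding, and the names `FastqRecord`, `read_fastq`, and `mean_quality` are hypothetical helpers introduced only for this example.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One read: identifier, bases, and one Phred score per base."""
    name: str
    sequence: str
    qualities: List[int]


def read_fastq(handle: TextIO, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream.

    Assumption: qualities are Phred+33 encoded (ASCII code minus 33).
    """
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        header = header.strip()
        if not header:          # tolerate blank lines between records
            continue
        sequence = handle.readline().strip()
        separator = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        scores = [ord(char) - phred_offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


def mean_quality(record: FastqRecord) -> float:
    """Average Phred score of a read (0.0 for an empty read)."""
    if not record.qualities:
        return 0.0
    return sum(record.qualities) / len(record.qualities)


if __name__ == "__main__":
    demo = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
    for rec in read_fastq(demo):
        print(rec.name, rec.sequence, rec.qualities, mean_quality(rec))
```

A mean-quality value like this could serve as a simple read filter before mapping, although the source does not state which quality criteria the pipeline applies.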
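Returning to the Hardy-Weinberg filter mentioned in the abstract, the chi-square test on 0/0, 0/1, and 1/1 genotype counts can be sketched as follows. This is a hedged illustration, not the author's R scripts: it assumes a biallelic site, one degree of freedom (three genotype classes, minus one, minus one estimated allele frequency), and a significance threshold of 0.05 chosen purely for the example. The function names `count_genotypes` and `hwe_chi_square` are hypothetical.

```python
from math import erfc, sqrt
from typing import Iterable, Tuple


def count_genotypes(genotypes: Iterable[str]) -> Tuple[int, int, int]:
    """Count 0/0, 0/1 and 1/1 calls from VCF-style GT strings.

    Phased calls ("0|1", "1|0") are treated like unphased ones; missing
    or multi-allelic calls (e.g. "./.", "1/2") are ignored.
    """
    counts = {"0/0": 0, "0/1": 0, "1/1": 0}
    for genotype in genotypes:
        alleles = sorted(genotype.replace("|", "/").split("/"))
        key = "/".join(alleles)
        if key in counts:
            counts[key] += 1
    return counts["0/0"], counts["0/1"], counts["1/1"]


def hwe_chi_square(n_hom_ref: int, n_het: int, n_hom_alt: int) -> Tuple[float, float]:
    """Return (chi-square statistic, p-value) for a test of HWE."""
    total = n_hom_ref + n_het + n_hom_alt
    if total == 0:
        raise ValueError("no called genotypes at this site")
    # Estimate the reference allele frequency from the observed genotypes.
    p = (2 * n_hom_ref + n_het) / (2 * total)
    q = 1.0 - p
    observed = (n_hom_ref, n_het, n_hom_alt)
    expected = (p * p * total, 2 * p * q * total, q * q * total)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    calls = ["0/0"] * 50 + ["1/1"] * 50      # no heterozygotes at all
    chi2, p_value = hwe_chi_square(*count_genotypes(calls))
    alpha = 0.05                              # assumed threshold
    print(chi2, p_value, "deviates from HWE" if p_value < alpha else "consistent with HWE")
```

Sites that deviate strongly from HWE are flagged as candidate false positives in this sketch. The exact threshold and any small-sample correction used in the paper are not given in this excerpt.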
The pre-processing stage is the main step in preparing RAD sequences in order to continue ...

Citation: Begali H (2018) A Pipeline for Markers Selection Using Restriction Site Associated DNA Sequencing (Radseq). J Appl Bioinforma Comput Biol 7:1.
doi:10.4172/2329-9533.1000147 fatcat:mu25zfszn5gldp255haaduigju