CFSAN SNP Pipeline: an automated method for constructing SNP matrices from next-generation sequence data

Steve Davis, James B. Pettengill, Yan Luo, Justin Payne, Al Shpuntoff, Hugh Rand, Errol Strain
2015 PeerJ Computer Science  
The analysis of next-generation sequence (NGS) data is often a fragmented step-wise process. For example, multiple pieces of software are typically needed to map NGS reads, extract variant sites, and construct a DNA sequence matrix containing only single nucleotide polymorphisms (i.e., a SNP matrix) for a set of individuals. The management and chaining of these software pieces and their outputs can often be a cumbersome and difficult task. Here, we present CFSAN SNP Pipeline, which combines
more » ... which combines into a single package the mapping of NGS reads to a reference genome with Bowtie2, processing of those mapping (BAM) files using SAMtools, identification of variant sites using VarScan, and production of a SNP matrix using custom Python scripts. We also introduce a Python package (CFSAN SNP Mutator) that when given a reference genome will generate variants of known position against which we validate our pipeline. We created 1,000 simulated Salmonella enterica sp. enterica Serovar Agona genomes at 100× and 20× coverage, each containing 500 SNPs, 20 single-base insertions and 20 single-base deletions. For the 100× dataset, the CFSAN SNP Pipeline recovered 98.9% of the introduced SNPs and had a false positive rate of 1.04 × 10 −6 ; for the 20× dataset 98.8% of SNPs were recovered and the false positive rate was 8.34 × 10 −7 . Based on these results, CFSAN SNP Pipeline is a robust and accurate tool that it is among the first to combine into a single executable the myriad steps required to produce a SNP matrix from NGS data. Such a tool is useful to those working in an applied setting (e.g., food safety traceback investigations) as well as for those interested in evolutionary questions. Subjects Bioinformatics
doi:10.7717/peerj-cs.20 fatcat:vtek74m73rhudltxchndccw52m