A Novel Method for Detecting Contaminated Sample Based on Illumina Sequencing Data

Zheng Huang, Qibin Li, Wei Jin, Qijun Liao, Xiao Sun
2014 International Journal of Bioscience Biochemistry and Bioinformatics  
The Illumina sequencing platform is widely used in genetics research. Because library construction and DNA sequencing are complex and lengthy, samples can be contaminated from different sources, which can lead to false-positive SNP calls. To identify contaminated samples, we built a mappability-score model to quantitatively measure the accessibility of different parts of the human genome. By characterizing the genomic regions with a high probability of uniqueness and counting the discordant reads against genotypes in the unique regions, we could detect outliers as the contaminated samples at population scale. To test the effectiveness of our method, we manually mixed the sequencing reads of two clean samples. With prior knowledge of the mixture rate, we concluded that our method is quite sensitive for female samples contaminated even slightly by male samples, accurate for male samples with moderate contamination by female samples, and powerful for severe cross-individual contamination within the same gender. The method is easy to understand yet fairly effective for population-scale sample quality control. Index Terms: contamination, mappability score, sample quality control, unique region.
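The population-scale outlier step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how outliers are defined, so the rule used here (median plus k times the median absolute deviation) is an assumption, as are the function names, the cutoff `k`, and the toy read counts; none of them come from the paper.

```python
from statistics import median


def discordance_rate(discordant_reads, total_reads):
    """Fraction of reads in unique regions that disagree with the called genotype."""
    if total_reads == 0:
        raise ValueError("sample has no reads in the unique regions")
    return discordant_reads / total_reads


def flag_outliers(rates, k=5.0):
    """Return sample names whose rate exceeds median + k * MAD.

    The median/MAD rule and the default k are illustrative assumptions,
    not the criterion used by the authors.
    """
    values = list(rates.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    threshold = med + k * mad
    return sorted(name for name, rate in rates.items() if rate > threshold)


# Toy (discordant, total) read counts per sample; invented for illustration only.
counts = {
    "S1": (12, 10000),
    "S2": (15, 10000),
    "S3": (10, 10000),
    "S4": (14, 10000),
    "S5": (310, 10000),
}
rates = {name: discordance_rate(d, t) for name, (d, t) in counts.items()}
print(flag_outliers(rates))
```

In this toy population only `S5` lies far above the typical discordance rate, so it is the only sample flagged. A robust statistic such as the MAD is used here so that a few heavily contaminated samples do not inflate the threshold themselves.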
doi:10.7763/ijbbb.2014.v4.322 fatcat:yu4y7b6awfb6ni7zeoxitf73ou