Punchline: Identifying and comparing significant Pfam protein domain differences across draft whole genome sequences

Lisa C Crossman
2019 bioRxiv pre-print
Motivation: Draft assemblies built from short paired-end Illumina reads can be fragmented into many contigs and are affected by repeat regions, which arise from mobile element activity within the genome or from inherently repetitive gene structure. Annotating such assemblies for function and analysing their gene content can be challenging when predicted genes are fragmented across contigs. This often occurs within specific families of genes, such as longer genes with repeating domains, genes specifying several transmembrane domains and genes of unusual nucleotide content. These genes are often virulence determinants, so losing these specific types of data can seriously affect downstream studies.
Results: Rather than studying the predicted gene content of draft genomes, we examined predicted protein content using the Pfam domain complements of predicted proteins. We produced a workflow, Punchline, to study the genetic content of draft contig assemblies by looking at the complement of short domains, which are unlikely to be affected by assembly fragmentation. We investigated a dataset of Bacteroides ovatus, grouping isolates by the vertebrate host from which each organism was isolated, and identified potential host-restricted functions and host-restricted phylogenetic clustering.
Availability: https://github.com/LCrossman
doi:10.1101/686543 fatcat:xckepszw6zhfjiddeoq2n7qyum
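The idea of comparing Pfam domain complements between host groups can be illustrated with a minimal sketch. This is not the Punchline implementation: the function names, the input layout (pairs of genome and Pfam accession) and the toy accessions such as `PF_A` are all assumptions made for illustration, and the sketch only reports presence or absence without any significance testing.

```python
from collections import Counter

def domain_presence(hits):
    """Collapse (genome, accession) pairs into genome -> set of accessions."""
    presence = {}
    for genome, accession in hits:
        presence.setdefault(genome, set()).add(accession)
    return presence

def group_restricted_domains(presence, groups):
    """For each group, return the domains seen in that group and in no other,
    mapped to the number of genomes in the group that carry them."""
    by_group = {}
    for genome, domains in presence.items():
        # Updating a Counter with a set adds one per domain per genome.
        by_group.setdefault(groups[genome], Counter()).update(domains)

    restricted = {}
    for group, counts in by_group.items():
        elsewhere = set()
        for other, other_counts in by_group.items():
            if other != group:
                elsewhere.update(other_counts)
        restricted[group] = {
            acc: n for acc, n in counts.items() if acc not in elsewhere
        }
    return restricted

# Toy input (hypothetical genomes, hosts and accessions).
hits = [
    ("g1", "PF_A"), ("g1", "PF_B"),
    ("g2", "PF_A"), ("g2", "PF_A"), ("g2", "PF_C"),
    ("g3", "PF_A"), ("g3", "PF_D"),
]
groups = {"g1": "human", "g2": "human", "g3": "cow"}

presence = domain_presence(hits)
restricted = group_restricted_domains(presence, groups)
for group in sorted(restricted):
    print(group, sorted(restricted[group].items()))
```

In practice the pairs would come from a domain search over the predicted proteins of each draft assembly. Because each domain is counted once per genome, a domain split across several fragmented gene predictions still registers as present.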