De novo virus inference and host prediction from metagenome using CRISPR spacers [article]

Ryota Sugimoto, Luca Nishimura, Jumpei Ito, Nicholas F Parrish, Hiroshi Mori, Ken Kurokawa, Hirofumi Nakaoka, Ituro Inoue
2020 bioRxiv pre-print
Viruses are the most numerous biological entity, existing in all environments and infecting all cellular organisms. Compared with cellular life, the evolution and origin of viruses are poorly understood; viruses are enormously diverse and most lack sequence similarity to cellular genes. To uncover viral sequences without relying on either reference viral sequences from databases or marker genes known to characterize specific viral taxa, we developed a virus inference method based on clustered regularly interspaced short palindromic repeats (CRISPR). CRISPR is a prokaryotic nucleic acid restriction system that stores memory of previous exposure. Our method can reconstruct viral sequences targeted by CRISPR and predict their hosts using unassembled short-read metagenomic sequencing data. Analysing human gut metagenomic data, we extracted 11,391 terminally redundant CRISPR-targeted sequences, which are likely complete circular genomes of viruses or plasmids. The sequences include 257 complete crAssphage family genomes, 11 genomes larger than 200 kilobases, 766 genomes of Microviridae species, 114 genomes of Inoviridae species and many entirely novel genomes of unknown taxa. We predicted the host(s) of approximately 70% of the discovered genomes by linking protospacers to taxonomically assigned CRISPR direct repeats. These results indicate that our method is efficient for de novo inference of viral genomes and host prediction. In addition, we investigated the origin of the diversity-generating retroelement (DGR) locus of the crAssphage family. Phylogenetic analysis and gene locus comparisons indicate that the DGR is orthologous in human gut crAssphages and shares a common ancestor with that of baboon-derived crAssphage; however, the locus has likely been lost recently in multiple lineages.
doi:10.1101/2020.09.04.282665 fatcat:run2yefpe5fu3g4kl5odwex5ma
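
Two ideas in the abstract can be illustrated with a small sketch: treating a terminally redundant sequence as a candidate circular genome, and predicting hosts by linking protospacers to taxonomically assigned direct repeats. The Python below is a simplified illustration only, not the authors' pipeline. It assumes already-reconstructed sequences and exact string matching, and every function name, parameter, threshold, and label (`terminal_redundancy`, `predict_hosts`, `min_len`, `max_len`, `repeat_1`, `taxon_A`, and so on) is hypothetical.

```python
from typing import Dict, List, Set

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def terminal_redundancy(seq: str, min_len: int = 10, max_len: int = 200) -> int:
    """Length of the longest prefix that is also a suffix of seq.

    Only lengths between min_len and max_len (capped at half the sequence)
    are considered; 0 means no terminal redundancy was found. Both
    thresholds are illustrative assumptions.
    """
    upper = min(max_len, len(seq) // 2)
    for k in range(upper, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def predict_hosts(
    contig: str,
    spacers_by_repeat: Dict[str, List[str]],
    repeat_taxon: Dict[str, str],
) -> Set[str]:
    """Collect taxa whose CRISPR spacers match the contig on either strand.

    spacers_by_repeat maps a direct-repeat identifier to the spacers found
    next to it; repeat_taxon maps the same identifier to a taxon label.
    Exact matching is a simplification chosen for this sketch.
    """
    contig = contig.upper()
    hosts: Set[str] = set()
    for repeat, spacers in spacers_by_repeat.items():
        taxon = repeat_taxon.get(repeat)
        if taxon is None:
            continue  # repeat without a taxonomic assignment: no prediction
        for spacer in spacers:
            s = spacer.upper()
            if s in contig or reverse_complement(s) in contig:
                hosts.add(taxon)
                break
    return hosts


# Toy data: a 14-base terminal repeat flanking a 33-base core.
core = "ACGTTGCAAGGCTTACGGATCCGATTACAGGCT"
repeat_end = "ATGCGTACGTTAGC"
contig = repeat_end + core + repeat_end

spacers = {
    "repeat_1": ["GCAAGGCTTACGGATCC"],  # matches the forward strand
    "repeat_2": ["CCTGTAATCG"],         # matches the reverse strand
    "repeat_3": ["TTTTTTTTTT"],         # no match
}
taxa = {"repeat_1": "taxon_A", "repeat_2": "taxon_B", "repeat_3": "taxon_C"}

redundancy = terminal_redundancy(contig)
hosts = predict_hosts(contig, spacers, taxa)
```

In this toy case the contig is flagged as terminally redundant and linked to two of the three taxa. A real analysis would need to tolerate mismatches and handle the spacers' own CRISPR arrays, neither of which this sketch attempts.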