Virus discovery using current and novel methods

Barbara Franziska Mühlemann, Apollo-University Of Cambridge Repository, Terry C. Jones, Derek J. Smith
Next-generation sequencing (NGS) technology allows researchers to sequence genetic material from a wide range of sources, including patient samples, environmental samples, and ancient remains. The recovery of viruses from such datasets can provide insights into the diversity and evolution of both novel and already known viruses. This thesis focuses on two aspects of virus discovery in NGS datasets. In the first part of this thesis, I present ancient viral sequences from hepatitis B virus, human parvovirus B19, and variola virus. The sequences were recovered from NGS datasets from individuals who lived in Eurasia between ∼150 and ∼31,630 years ago, using standard sequence matching tools. The data show the past existence of viruses similar to variants circulating today. The sequences reveal a complexity of virus evolution that is not evident when considering modern sequences alone, including revised substitution rates and most recent common ancestor dates, as well as geographic movement and extinction of strains. The identification of viral sequences in NGS datasets relies heavily on sequence-based matching of unknown sequences to a database of known sequences. Comparisons are usually done at the nucleotide or amino acid level. However, those methods only work well on sequences closely related to those already present in the database. With the aim of identifying more diverged viral sequences, in the second part of this thesis, I present an algorithm to compare sequences based on predicted structural features, such as secondary structures and conserved amino acids. The algorithm is modelled after the music-matching algorithm 'Shazam'. While initial results of the algorithm are somewhat encouraging, problems remain, in particular with the identification of adequate structural features. Identifying highly diverged viral sequences thus remains a challenging problem for future work.
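To make the idea of Shazam-style matching on structural features more concrete, the following is a minimal sketch and not the algorithm from the thesis. It assumes each sequence has already been reduced to a list of (label, position) features, where labels such as "H" (helix) and "E" (strand) are hypothetical stand-ins for predicted structural features. Pairs of nearby features are hashed by their two labels and their distance, and a database sequence scores highly when many of its pairs match the query at one consistent offset. The function names, the feature encoding, and the `max_gap` parameter are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def pair_hashes(features, max_gap=50):
    """Yield ((label_a, label_b, distance), offset_a) for nearby feature pairs.

    `features` is a list of (label, position) tuples; the encoding is a
    hypothetical one chosen for this sketch.
    """
    feats = sorted(features, key=lambda f: f[1])
    for i, (label_a, pos_a) in enumerate(feats):
        for label_b, pos_b in feats[i + 1:]:
            distance = pos_b - pos_a
            if distance > max_gap:
                break  # features are sorted, so later ones are farther away
            yield (label_a, label_b, distance), pos_a


def build_index(database):
    """Map each pair hash to the (sequence name, offset) places it occurs."""
    index = defaultdict(list)
    for name, features in database.items():
        for key, offset in pair_hashes(features):
            index[key].append((name, offset))
    return index


def best_matches(index, query_features):
    """Rank database sequences by the number of offset-consistent pair hits."""
    votes = Counter()
    for key, query_offset in pair_hashes(query_features):
        for name, db_offset in index.get(key, ()):
            # Hits from a true match share the same relative offset.
            votes[(name, db_offset - query_offset)] += 1
    scores = {}
    for (name, _delta), count in votes.items():
        scores[name] = max(scores.get(name, 0), count)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    database = {
        "virusA": [("H", 10), ("E", 25), ("H", 40), ("E", 70)],
        "virusB": [("E", 5), ("E", 12), ("H", 90)],
    }
    index = build_index(database)
    # The query carries the virusA features shifted by 100 positions.
    query = [(label, pos + 100) for label, pos in database["virusA"]]
    print(best_matches(index, query))
```

Scoring by the most common offset, rather than by the raw number of shared hashes, is what lets this kind of scheme tolerate a match that starts at a different position in the query than in the database entry. How well it works depends entirely on the quality of the features fed in, which is the difficulty the abstract points to.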
doi:10.17863/cam.51285 fatcat:umob5bqeabg6ncytja7bmn4smm