Hierarchical Clustering on RNA Dependent RNA Polymerase using Machine Learning [article]

Gudipati Pavan Kumar
2021 bioRxiv pre-print
RNA-dependent RNA polymerase (RdRP) catalyzes the replication of RNA from an RNA template and is found mostly in viruses. We collected over 161 viral RdRP sequences in FASTA format from the NCBI Protein database using a Python script. Each sequence was transformed with TfidfVectorizer from the sklearn module, treating each single letter as a word, because each letter represents one amino acid. The transformed data were passed to hierarchical clustering using the scipy library and visualized using [...]. This machine learning technique is able to group similar RdRPs into one cluster. Each cluster was checked by multiple sequence alignment with NCBI COBALT. We observed that the clusters grouped similar RdRPs from various viruses. These techniques can be further improved to segment or classify various proteins. Machine learning and artificial intelligence algorithms need further improvement before they can solve problems in genomics and proteomics.
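The pipeline described in the abstract (single-letter TF-IDF followed by hierarchical clustering) might look roughly like the following minimal sketch. It is not the author's script. The toy sequences, the helper name `cluster_sequences`, the Ward linkage method, and the choice of two clusters are all assumptions made for illustration, since the abstract does not state them.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy sequences for illustration only; these are NOT real RdRP data.
sequences = {
    "toy_A1": "MGDDKLLGDDAAGDDKL",
    "toy_A2": "MGDDKLLGDDAVGDDKL",
    "toy_B1": "WPYWPYTTWPYSSWPY",
    "toy_B2": "WPYWPYTSWPYSSWPY",
}


def cluster_sequences(seqs, n_clusters=2):
    """Vectorize sequences per residue letter and cluster them hierarchically.

    Hypothetical helper: linkage method and cluster count are assumptions.
    """
    names = list(seqs)
    docs = [seqs[name] for name in names]

    # analyzer="char" makes every single letter (one amino acid) a token.
    # The default word analyzer would discard one-character tokens entirely.
    # lowercase=False keeps the one-letter amino acid codes unchanged.
    vectorizer = TfidfVectorizer(analyzer="char", lowercase=False)
    features = vectorizer.fit_transform(docs).toarray()

    # Agglomerative clustering on the dense TF-IDF matrix (Ward is assumed).
    merge_tree = linkage(features, method="ward")

    # Cut the tree so that at most n_clusters flat clusters remain.
    flat = fcluster(merge_tree, t=n_clusters, criterion="maxclust")
    return dict(zip(names, flat.tolist())), merge_tree


labels, merge_tree = cluster_sequences(sequences)
for name, label in labels.items():
    print(name, label)
```

The returned linkage matrix can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the cluster tree. Note that per-letter TF-IDF captures amino acid composition only, not residue order, which is one reason to verify clusters with an alignment tool as the abstract describes.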
doi:10.1101/2021.08.23.457366 fatcat:pg6t3yivzrc6bmwzkonnyfuw4m