Peer Review #1 of "Predicting the host of influenza viruses based on the word vector (v0.1)" [peer_review]

2017 unpublished
Newly emerging influenza viruses continue to threaten public health. A rapid determination of the host range of newly discovered influenza viruses would assist in early assessment of their risk. Here, we attempted to predict the host of influenza viruses using the Support Vector Machine (SVM) classifier based on the word vector, a new representation and feature extraction method for biological sequences. The results show that the length of the words within the word vector, the sequence type (DNA or protein) and the species from which the sequences were derived for generating the word vector all influence the performance of models in predicting the host of influenza viruses. In nearly all cases, the models built on the surface proteins hemagglutinin (HA) and neuraminidase (NA) (or their genes) produced better results than those built on internal influenza proteins (or their genes). The best performance was achieved when the model was built on the HA gene based on word vectors (three-letter words) generated from DNA sequences of the influenza virus. This yielded accuracies of 99.7% for avian, 96.9% for human and 90.6% for swine influenza viruses. Compared to the method of sequence homology best-hit searches using Basic Local Alignment Search Tool (BLAST), the word vector-based models still need further improvements in predicting the host of influenza A viruses.
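The abstract does not say how sequences are split into words or how word vectors are combined into a feature for the SVM. As a minimal sketch of one plausible reading, the Python below splits a DNA sequence into three-letter words and averages their vectors into a single fixed-length feature. The function names, the overlapping-window choice, the averaging step and the toy embedding values are all assumptions made for illustration, not the authors' method.

```python
from typing import Dict, List


def split_into_words(sequence: str, k: int = 3, overlapping: bool = True) -> List[str]:
    """Split a sequence into k-letter words (k-mers).

    Whether the manuscript uses overlapping or non-overlapping words is not
    stated in the abstract; both options are offered here as assumptions.
    """
    seq = sequence.upper()
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


def average_vector(words: List[str], embeddings: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of all words that have an embedding.

    Averaging is an assumed way to obtain one feature vector per sequence.
    """
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        raise ValueError("no word in the sequence has an embedding")
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


# Hypothetical 2-dimensional embeddings; real word vectors would be learned
# from a corpus of sequences and have many more dimensions.
toy_embeddings = {"ATG": [1.0, 0.0], "TGA": [0.0, 1.0], "GAA": [1.0, 1.0]}

words = split_into_words("ATGAA", k=3)
features = average_vector(words, toy_embeddings)
```

Under these assumptions, `features` would be the per-sequence input to a classifier such as an SVM, with the host (avian, human or swine) as the label.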
doi:10.7287/peerj.3579v0.1/reviews/1 fatcat:dixgl5lbsrg47f5hbh25kmg37a