ProLanGO: Protein Function Prediction Using Neural Machine Translation Based on a Recurrent Neural Network
With the development of next-generation sequencing techniques, it is fast and cheap to determine protein sequences but relatively slow and expensive to extract useful information from them because of the limitations of traditional biological experimental techniques. Protein function prediction has been a long-standing challenge in filling the gap between the huge number of protein sequences and their known functions. In this paper, we propose a novel method that converts the protein function prediction problem into a language translation problem, from the newly proposed protein sequence language "ProLan" to the protein function language "GOLan", and build a neural machine translation model based on recurrent neural networks to translate "ProLan" into "GOLan". We blindly tested our method by participating in the third Critical Assessment of Function Annotation (CAFA 3) in 2016, and also evaluated its performance on selected proteins whose functions were released after the CAFA competition. The good performance on the training and testing datasets demonstrates that our newly proposed method is a promising direction for protein function prediction. In summary, we propose, for the first time, a method that converts the protein function prediction problem into a language translation problem and applies a neural machine translation model to protein function prediction.
arXiv:1710.07016v1 [q-bio.QM] 19 Oct 2017
... function, protein structure) of each protein sequence becomes an urgent task, which helps us better understand not only their role in life but, more importantly, their potential biomedical and pharmaceutical applications, such as drug discovery [2, 3]. Traditional biological experimental methods for determining a protein's properties (e.g., protein function and structure) can be very slow and resource-demanding [3, 4]. Moreover, they sometimes may not faithfully reflect the protein's activity in vivo [3, 5]. Because of these limitations, computational methods that can accurately and quickly predict protein function from sequence are greatly desired. Computational protein function prediction can be used to fill the gap between the large amount of sequence data and the unknown properties of these proteins. Protein function prediction is usually treated as a multi-label classification problem.
Researchers have tried different computational methods for this problem over the last few decades [6, 7, 8, 9, 10, 11, 12]. In general, the following methods are used for protein function prediction. The first and most widely used approach uses the Basic Local Alignment Search Tool (BLAST) to search the query sequence against existing protein databases that contain experimentally determined protein function information, and then uses the function information of the homologous proteins found to predict the function of the query sequence. Examples include Gotcha, OntoBlast, and Goblet. Besides BLAST, some methods, such as PFP, use PSI-BLAST to find remote homologs. The second category comprises network-based methods. Most methods in this category use protein-protein interaction networks, based on the assumption that interacting proteins share similar functions [10, 18, 19, 20, 21, 22, 23]. Instead of protein-protein interaction networks, some methods use other kinds of networks, such as gene-gene interaction networks and domain co-occurrence networks [3, 10, 24].
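Both categories above share one idea: the functions of a query protein are inferred from annotated "neighbors", which are either homologs returned by a sequence search or partners in an interaction network. A minimal Python sketch of this idea is given below, assuming a simple weighted vote. It is not the scoring scheme of Gotcha, OntoBlast, Goblet, PFP, or any network-based method cited here; the function name `transfer_annotations`, the protein IDs, the weights, and the toy annotations are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def transfer_annotations(neighbors, annotations):
    """Score GO terms for a query protein by weighted voting over its neighbors.

    neighbors:   dict mapping neighbor protein ID -> non-negative weight
                 (e.g. an alignment score for a homolog, or 1.0 for an
                 interaction partner); hypothetical input format.
    annotations: dict mapping protein ID -> set of GO term IDs.
    Returns a dict mapping GO term -> score in [0, 1], where the score is the
    fraction of the total neighbor weight that supports the term.
    """
    total = sum(neighbors.values())
    if total <= 0:
        return {}
    scores = defaultdict(float)
    for protein, weight in neighbors.items():
        # Neighbors without known annotations contribute no votes.
        for term in annotations.get(protein, ()):
            scores[term] += weight
    return {term: score / total for term, score in scores.items()}


# Toy annotation table (illustrative only, not taken from any database).
annotations = {
    "P1": {"GO:0003677", "GO:0005634"},
    "P2": {"GO:0003677"},
    "P3": {"GO:0016301"},
}

# Homology-style use: neighbors are search hits weighted by a score.
hit_scores = {"P1": 200.0, "P2": 100.0, "P3": 100.0}
homology_prediction = transfer_annotations(hit_scores, annotations)

# Network-style use: neighbors are interaction partners with equal weight.
partners = {"P1": 1.0, "P3": 1.0}
network_prediction = transfer_annotations(partners, annotations)

print(sorted(homology_prediction.items(), key=lambda kv: -kv[1]))
```

Because each GO term receives its own score, thresholding the scores yields a set of predicted terms per protein, which matches the multi-label framing of the problem.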