End-to-end multitask learning, from protein language to protein features without alignments [article]

Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Burkhard Rost
2019 bioRxiv pre-print
Abstract
Correctly predicting features of protein structure and function from amino acid sequence alone remains a supreme challenge for computational biology. For almost three decades, state-of-the-art approaches have combined machine learning with evolutionary information from multiple sequence alignments. Exponentially growing sequence databases make it infeasible to gather evolutionary information for entire microbiomes or metaproteomics. Moreover, for many important proteins (e.g. the dark proteome and intrinsically disordered proteins), evolutionary information remains limited. Here, we introduced a novel approach combining recent advances in Language Models (LMs) with multi-task learning to successfully predict aspects of protein structure (secondary structure) and function (cellular component or subcellular localization) without using any evolutionary information from alignments. Our approach fused self-supervised pre-training of LMs on a large unlabeled dataset (UniRef50, corresponding to 9.6 billion words) with supervised training on labelled high-quality data in a single end-to-end network. We demonstrated the effectiveness of the novel concept through the successful per-residue prediction of protein secondary structure (Q3=85%, Q8=72%) and through per-protein predictions of localization (Q10=69%) and the distinction between integral membrane and water-soluble proteins (Q2=89%). In addition, multi-task predictions are 300-3000 times faster (where HHblits needs 30-300 seconds on average, our method needed 0.045 seconds). These new results push the boundaries of predictability towards grayer and darker areas of the protein space, allowing reliable predictions for proteins that were not accessible to previous methods. Furthermore, our method remains scalable because it removes the need to search sequence databases for evolutionarily related proteins.
doi:10.1101/864405 fatcat:6i5sr524zrd6nk27xf7tbas3ye