Exploring the Sequence-based Prediction of Folding Initiation Sites in Proteins
Protein folding is a complex process that can lead to disease when it fails. The very early stages of protein folding are especially poorly understood; they are likely defined by intrinsic local interactions between amino acids close to each other in the protein sequence. Here we present EFoldMine, a method that predicts, from the primary amino acid sequence of a protein, which amino acids are likely involved in early folding events. The method is based on early folding data from hydrogen-deuterium exchange (HDX) NMR pulsed labelling experiments, and uses backbone and sidechain dynamics as well as secondary structure propensities as features. The EFoldMine predictions give insights into the folding process, as illustrated by a qualitative comparison with independent experimental observations. Furthermore, in a quantitative analysis at the proteome scale, the predicted early folding residues tend to become the residues that interact the most in the folded structure, and they are often residues that display evolutionary covariation. The connection of the EFoldMine predictions with both folding pathway data and the folded protein structure suggests that the initial statistical behavior of the protein chain with respect to local structure formation has a lasting effect on its subsequent states.
Proteins perform a multitude of functions in organisms. To fulfill their function, a well-defined three-dimensional organization of the protein atoms is often required, with many proteins folding independently into such stable structures 1. Others need help from chaperones to fold 2, while some only fold upon binding their interaction partner(s) 3 or do not fold at all 4. In all cases, the protein sequence encodes its behavior and, by extension, the environmental context that is required for the protein to fold, whether that is the right temperature and/or pH 5, another biomolecule or a post-translational modification 6. Proteins that misfold, for example prions or proteins involved in amyloid formation 1, 7, can lead to disease. Different theories about how proteins fold independently have been suggested over the last decades 1, 8-10. The view that foldons form first and provide the right context for the rest of the protein to fold has recently been strongly supported by hydrogen-deuterium exchange (HDX) based mass spectrometry (MS) experiments 9, 11.
Foldons are essentially structural elements that likely form easily through favorable interactions between amino acids close to each other in the sequence. These interactions determine the initial conformational states in the pathway towards the native fold, and provide the context for other residues in the protein to fold themselves. The importance of local amino acid interactions was already pointed out decades ago based on information from folded protein structures 12, 13. The structure of a protein is, however, an end product of the folding process, and does not provide direct information about where the first local structural elements started to form. To obtain a more accurate picture of such 'early folding' residues in proteins, we recently created the Start2Fold database, which collects data from pulsed labelling and related HDX experiments 14. We showed that the DynaMine sequence-based protein backbone rigidity predictions 15, 16 give the best results in discriminating early folding residues from other regions of the protein 17. In addition, we observed that protein regions with higher backbone rigidity tend to preserve this rigidity in evolution 17. Experimental early folding data remain difficult to obtain, however, and are only available for specific proteins.
Here we present EFoldMine, a protein-sequence predictor of early folding residues trained on a set of 30 proteins for which high-quality experimental HDX data are available in Start2Fold. The 'early folding' residues in the training set were identified by NMR pulsed labelling experiments, in which protein folding is triggered from the unfolded state. These experiments can identify residues that form stable local structure very quickly, on the low millisecond timescale, under kinetic control without fast conformational exchange. Residues are only detected if their backbone amide proton is protected from exchange with water by hydrogen bond formation.
Information on the type of local structure that is formed is not available from these experiments. EFoldMine therefore identifies the residues in proteins that are inclined to form structural elements unaided during the very first stage of the folding process, prior to the formation of specific defined interatomic contacts in the folded protein. We show that EFoldMine can provide mechanistic insights into the folding process, and can indicate regions of intrinsically disordered proteins that are poised to fold. On a proteome scale, the predictions pinpoint many of the residues that form the most interactions in the final folded protein structure, and they detect residues that tend to display evolutionary covariation. These observations suggest that early folding events determined by local interactions shape the folding landscape of proteins, thereby influencing the fold the protein finally adopts.
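To make the comparison between predicted early folding residues and interactions in the folded structure concrete, the following minimal Python sketch counts per-residue contacts from one coordinate per residue and compares the mean contact count of residues above and below a score threshold. It is an illustration under stated assumptions, not the analysis used in this work: the function names, the 8 Å distance cutoff, the minimum sequence separation of 3, the use of a single point (for example the C-alpha atom) per residue, and the 0.5 score threshold are all hypothetical choices.

```python
import math


def contact_counts(coords, cutoff=8.0, min_separation=3):
    """Count, for each residue, the residues within `cutoff` angstroms.

    coords: list of (x, y, z) tuples, one per residue (assumed here to be
        C-alpha positions; this is an illustrative simplification).
    min_separation: pairs closer than this in sequence are ignored so that
        direct chain neighbours are not counted as contacts.
    """
    n = len(coords)
    counts = [0] * n
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts


def mean_contacts_by_class(counts, scores, threshold=0.5):
    """Return (mean contacts of predicted early folding residues,
    mean contacts of all other residues).

    scores: per-residue prediction scores; residues with a score at or
        above `threshold` (a hypothetical cutoff) are treated as
        predicted early folding residues.
    """
    early = [c for c, s in zip(counts, scores) if s >= threshold]
    other = [c for c, s in zip(counts, scores) if s < threshold]

    def mean(values):
        # An empty class has no defined mean.
        return sum(values) / len(values) if values else float("nan")

    return mean(early), mean(other)
```

Given per-residue coordinates and scores of equal length, `mean_contacts_by_class(contact_counts(coords), scores)` returns the two means. A higher first value would be in line with the trend described above, although a real analysis would need many structures and an appropriate statistical test.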