Analysis of the code relating sequence to conformation in globular proteins. Theory and application of expected information

Barry Robson
1974 Biochemical Journal  
1. An information theory analysis of the folding of a globular protein is proposed. 2. The folding is seen as a transfer of information between two messages, the primary sequence and the biologically active conformation. 3. It is shown how the information transferred was estimated by inspection of proteins of known primary sequence and conformation. 4. In this estimation, concerted use of subjective (Bayesian) probabilities leads to a more robust approach which can be employed whether the ... of proteins of known sequence and conformation is large or small. 5. Further, it is demonstrated that the problem then becomes a very simple algebraic formulation for information estimates. 6. Finally, it is shown how this process of information theory analysis can be reversed to predict the conformation of a protein by using its primary sequence and the above information estimates obtained from other proteins. 7. The present paper provides the theoretical basis for the derivation and application of a stereochemical alphabet (Robson & Pain, 1974a,c), and for an investigation of the effects of residues on the conformations of their neighbours (Robson & Pain, 1974b).

The possibility of predicting the native, biologically active conformation of a protein from its amino acid sequence is of considerable interest. The ability to make successful predictions would imply an understanding of the relationship between sequence and conformation and would help in solving the problem of how a globular protein folds up. Further, the ability to produce novel and artificial conformations could have a variety of applications in the biomedical and bioengineering fields. The problem of making good predictions of the overall conformation of a protein has not yet been solved despite experimental evidence (Anfinsen, 1962, 1967; Tanford, 1968) that all the information for the native conformation is carried by the amino acid sequence. Currently, the problem is being characterized in the following way. A conformation can be described either in terms of external co-ordinates (the Cartesian co-ordinates of all the constituent atoms) or internal co-ordinates (the bond lengths, valence angles between bonds and rotation angles around bonds).
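As a concrete illustration of the two descriptions, a rotation angle (an internal co-ordinate) can be recovered from the Cartesian (external) co-ordinates of four consecutively bonded atoms. The following is a minimal Python sketch and is not taken from the paper: the function name `dihedral_deg` and the example co-ordinates are hypothetical, and the sign is assumed to follow the usual convention in which a clockwise rotation of the far bond, viewed along the central bond, is positive.

```python
import math


def _sub(a, b):
    """Component-wise difference a - b of two 3-vectors."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    """Scalar product of two 3-vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    """Vector product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_deg(p0, p1, p2, p3):
    """Rotation angle (degrees) around the p1-p2 bond.

    p0..p3 are (x, y, z) tuples for four consecutively bonded atoms.
    The result lies in (-180, 180]; 0 is cis and +/-180 is trans.
    Hypothetical helper for illustration only.
    """
    b1 = _sub(p1, p0)  # first bond vector
    b2 = _sub(p2, p1)  # central bond vector (the rotation axis)
    b3 = _sub(p3, p2)  # last bond vector
    n1 = _cross(b1, b2)  # normal to the plane p0, p1, p2
    n2 = _cross(b2, b3)  # normal to the plane p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Made-up co-ordinates: central bond along z, last atom rotated a quarter turn.
    print(dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

The reverse conversion, from internal to external co-ordinates, requires chaining such rotations along the whole backbone, so it is not sketched here.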
Usually, a subset of the internal co-ordinates is used, namely, the rotation angles around single bonds that have relatively small energies associated with their distortion from equilibrium values and are therefore called 'soft' variables. Since the remaining 'hard' variables are relatively invariant, the problem reduces to one of predicting the values of the soft variables, at least as a first approximation. Further, attention is directed to those soft variables that specify the progress of the protein backbone through space, namely the rotation angles φ and ψ around the N-Cα and Cα-C' bonds respectively. The remaining rotation angles ω around the C'-N bonds of the backbone are relatively 'hard' because of partial double-bond character and are therefore frequently considered to be invariant in the planar and trans configuration. Although it is true that a small error in the predicted values of the internal co-ordinates can lead to a very large error in the predicted values of the external co-ordinates, good predictions of the soft variables, and particularly of φ and ψ, would represent a considerable advance at the present time.

In the past there have been two principal approaches to the prediction problem (see Robson, 1972, 1974, for reviews). The first approach, which may be termed analytic, involves making predictions of the values of the soft variables on the basis of statistical analysis of proteins of known sequence and conformation, the assumption being that such correlations as exist between sequence and conformation in this example will also hold in any new protein. The other approach involves the use of theoretical conformational energy calculations on the assumption that the native conformation corresponds to the deepest minimum in the conformational free-energy surface.
Although enjoying some success with local interactions between residues close together in the amino acid sequence, the analytic approach apparently cannot at the present time be extended to non-local interactions because correlations fall off with
doi:10.1042/bj1410853 pmid:4463965 pmcid:PMC1168191 fatcat:plc3d5drffb3nniwnmws3kayvm