Relations of the numbers of protein sequences, families and folds

C. T. Zhang
1997 Protein Engineering Design & Selection  
The relations among the numbers of protein sequences, families and folds have been studied theoretically. It is found that the number of families is related to the natural logarithm of the number of sequences. The logarithmic relation should not be changed regardless of what value of the homology threshold is applied in the protein sequence comparison routines. To study the relation between the numbers of families and folds, the degenerate degree of a fold has been introduced. The degenerate degree of a fold is the number of protein families which adopt the same fold. The distribution of the degenerate degrees of folds has been found to be very likely exponential. Based on the distribution, the average degenerate degree d is calculated. The number of folds is simply equal to that of families divided by the average degenerate degree of folds. It is shown that d is an increasing function of time. The current value of d is about 2. It will continue to increase and reach the value of at least 3.3 in some years. By using the above result, the numbers of protein folds for four species have been estimated. In particular, the number of folds for human proteins is estimated to be ≤5200.

Keywords: degeneracy/degenerate degree/distribution of degenerate degrees/numerical relations/protein families/protein folds/protein sequences

Introduction

Protein sequence pairs with more than 30% residue identity are clustered together into superfamilies, or 30SEQ families (Orengo et al., 1994). For convenience, the 30SEQ family is also called family hereafter in this paper. It is well established that in most cases each family adopts a unique fold structure, while in the other cases different families may adopt the same fold structure (Sander and Schneider, 1991; Holm et al., 1992; Rufino and Blundell, 1994). This implies that the number of protein folds should be less than the number of the families, which should be less than the number of proteins. Therefore, it is reasonable to ask how many folds there are in nature. In other words, there exists an upper limit for the number of the unique folds. The question was probably first raised by Chothia (1992), who estimated the figure to be about 1000. Since then, several research groups have tackled this problem again. However, different results were reported. Blundell and Johnson (1993) estimated the number to be less than 1000, in agreement with the estimate of Chothia (1992), but Alexandrov and Go (1994) and Orengo et al. (1994) reported much larger figures than previously estimated, 6700 and 7920, respectively. Recently, Wang (1996) gave a very low estimate of probably only 400.

... with the principles of stereochemistry. Ptitsyn and Finkelstein have pointed out that due to the stereochemical constraints, the possible number of globular protein folds is limited (Ptitsyn and Finkelstein, 1980; Finkelstein and Ptitsyn, 1987). This conclusion is most welcome to researchers in the area of protein structure prediction. The prediction of protein tertiary structure from amino acid sequences based on the principle of free energy minimization has not yet been successful. In this case, the knowledge-based approach to predicting the tertiary structures of proteins, such as the threading and profile methods (Bowie et al., 1991; Jones et al., 1992), seems to be one of the most promising approaches. The fact that there exists a limited number of protein folds provides a solid basis for such an approach. Therefore, further discussion on the above issue is necessary and meaningful. In this paper, the relations among the numbers of protein sequences, families and folds are studied theoretically. The numbers of folds for proteins in four species are estimated based on the theory established.

Result of analysis

Logarithmic relation

Three quantities are concerned in our case, i.e. the numbers of protein sequences, families and folds, denoted by $s$, $f_a$ and $f_o$, respectively. Note that all three quantities are functions of time. For example, $s(t)$ indicates the cumulative number of protein sequences found through the year $t$. $f_a(t)$ and $f_o(t)$ indicate the cumulative numbers of protein families and folds found through the year $t$, respectively. It is first important to study the relationships among these quantities. Suppose that there is a protein set consisting of $s$ protein sequences. Let $s$ have an increment $\Delta s$. Accordingly, $f_a$ has also an increment $\Delta f_a$. Obviously, for given $s$ we should have

$\Delta f_a \propto \Delta s$    (1)

Now, for given $\Delta s$, suppose that

$\Delta f_a \propto \frac{1}{s_0 + s}$    (2)

where $s_0$ is a constant to be determined later. Equation 2 should be explained. Since the 30SEQ families are based on the identity of residues, for given $\Delta s$, the larger the quantity $s$, the lower is the probability of finding the new family members, i.e. the smaller the quantity $\Delta f_a$. Consequently, we have

$\Delta f_a = k \frac{\Delta s}{s_0 + s}$    (3)

where $k$ is a proportionality constant. Integrating both sides from $t_0$ to $t$, we find
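As a rough numerical illustration of the logarithmic relation between families and sequences stated in the abstract, the increment rule of Equation 3 can be accumulated one sequence at a time and compared with a natural logarithm. The sketch below is not taken from the paper: the function names are hypothetical, the constants k and s_0 are arbitrary illustrative values, and the closed form k ln((s_0 + s_end)/(s_0 + s_start)) is simply what integrating the increment rule gives in the continuous limit.

```python
import math


def families_by_increments(s_start, s_end, k, s0):
    """Accumulate Delta f_a = k * Delta s / (s0 + s) with Delta s = 1.

    Returns the total growth in the number of families as the number of
    sequences goes from s_start to s_end.
    """
    total = 0.0
    for s in range(s_start, s_end):
        total += k * 1.0 / (s0 + s)
    return total


def families_closed_form(s_start, s_end, k, s0):
    """Continuous-limit integral of k ds / (s0 + s) from s_start to s_end."""
    return k * math.log((s0 + s_end) / (s0 + s_start))


# Hypothetical constants, chosen only for illustration (not from the paper).
K = 500.0
S0 = 1000.0

discrete = families_by_increments(1000, 100000, K, S0)
continuous = families_closed_form(1000, 100000, K, S0)
print(discrete, continuous)
```

With these assumed values the two numbers agree closely, which is the sense in which the number of families grows with the logarithm of the number of sequences: each further multiplication of s_0 + s by a fixed factor adds roughly the same number of new families.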
doi:10.1093/protein/10.7.757 pmid:9342141 fatcat:pyv6nbtyn5cwdgq4eb3ifes22m