Distinctive Sequence Features in Protein Coding, Genic Non-coding, and Intergenic Human DNA

Roderic Guigó, James W. Fickett
1995 Journal of Molecular Biology  
Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM 87545, USA

We have studied the behavior of a number of sequence statistics, mostly indicative of protein coding function, in a large set of human clone sequences randomly selected in the course of genome mapping (randomly selected clone sequences), and compared this with the behavior in known sequences containing genes (which we term genic sequences). As expected, given the higher coding density of the genic sequences, the sequence statistics studied behave in a substantially different manner in the randomly selected clone sequences (mostly intergenic DNA) and in the genic sequences. Strong differences in the behavior of a number of such statistics are also observed, however, when the randomly selected clone sequences are compared with only the non-coding fraction of the genic sequences, suggesting that intergenic and genic non-coding DNA constitute two different classes of non-coding DNA. By studying the behavior of the sequence statistics in simulated DNA of different C + G content, we have observed that a number of them are strongly dependent on C + G content. Thus, most differences between intergenic and genic non-coding DNA can be explained by differences in C + G content. A + T-rich intergenic DNA appears to be at the compositional equilibrium expected under random mutation, while C + G richer non-coding genic DNA is far from this equilibrium. The results obtained in simulated DNA indicate, on the other hand, that a very large fraction of the variation in the coding statistics that underlie gene identification algorithms is due simply to C + G content, and is not directly related to protein coding function. It appears, thus, that the performance of gene-finding algorithms should be improved by carefully distinguishing the effects of protein coding function from those of mere base compositional variation on such coding statistics.
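
The dependence of a coding statistic on C + G content in simulated DNA can be illustrated with a minimal Python sketch. The abstract does not list the statistics that were studied or describe the simulation procedure, so everything here is an assumption made for illustration: the statistic (frequency of stop codons in a single reading frame), the simulation model (independent bases with A = T and C = G), and the function names `simulate_dna`, `stop_codon_frequency`, and `expected_stop_frequency`.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def simulate_dna(length, gc_content, rng):
    """Draw an i.i.d. random sequence with the given C + G fraction.

    Assumed model (not taken from the paper): A and T are equally
    likely, as are C and G, and positions are independent.
    """
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    # Weights follow the order of the alphabet string "ACGT".
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


def stop_codon_frequency(seq):
    """Fraction of frame-0 codons that are stop codons."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return sum(codon in STOP_CODONS for codon in codons) / len(codons)


def expected_stop_frequency(gc_content):
    """Analytic stop-codon probability under the same i.i.d. model.

    With a = P(A) = P(T) and c = P(C) = P(G):
    P(TAA) = a^3 and P(TAG) = P(TGA) = a^2 * c.
    """
    a = (1.0 - gc_content) / 2.0
    c = gc_content / 2.0
    return a ** 3 + 2 * a * a * c


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for gc in (0.3, 0.5, 0.7):
        seq = simulate_dna(300_000, gc, rng)
        observed = stop_codon_frequency(seq)
        expected = expected_stop_frequency(gc)
        print(f"C+G = {gc:.1f}  observed = {observed:.4f}  expected = {expected:.4f}")
```

Under this model the stop-codon probability is 3/64 at 50% C + G and falls as C + G rises, so open reading frames lengthen in C + G-rich sequence even though nothing in the simulation encodes protein. This is one simple instance of a coding-related statistic varying with base composition alone.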
doi:10.1006/jmbi.1995.0535 pmid:7473716 fatcat:weltgu6lwrdubh3mjszczmbpz4