What can we learn with wavelets about DNA sequences

We use the wavelet transform to explore the complexity of DNA sequences. Long-range correlations are clearly identified and shown to be related to the sequence GC content. The significance of this observation to gene evolution is discussed.


Introduction
The possible relevance of scale invariance and fractal concepts to the structural complexity of genomic sequences is the subject of considerable increasing interest [1]. During the past few years, there has been intense discussion about the existence, the nature and the origin of long-range correlations in DNA sequences. Di erent techniques including mutual information functions [2], autocorrelation functions [3,4], power spectra [5,6], "DNA walk" representation [1,7], Zipf analysis [8] were used for statistical analysis of DNA sequences. But despite the e ort spent, there is still some continuing debate on rather struggling questions. In that respect, it is of fundamental importance to corroborate the fact that the reported long-range correlations are not just an artefact of the compositional heterogeneity of the genome organization [3,4, 9 -12]. Furthermore, it is still an open question whether the long-range correlation properties are di erent for protein-coding (exonic) and noncoding (intronic, intergenetic) sequences [ 1 -8,13]. One of the main obstacles to long-range correlation analysis is the mosaic structure of 1 DNA sequences which are well known to be formed of "patches" ("strand bias") of di erent underlying compositions [14][15][16]. These patches appear as trends in the DNA walk landscapes and are likely to introduce some breaking of scale invariance [ 9 -12]. Most of the technique used so far for characterizing the presence of long-range correlations are not well adapted to study patchy sequences. In a preliminary work [17,18], we have emphasized the wavelet transform (WT) as a very powerful technique for fractal analysis of DNA sequences. By considering analyzing wavelets that make the WT microscope blind to low-frequency trends, one can reveal and quantify the scaling properties of DNA walks. Here we report on recent results obtained by applying the so-called wavelet transform modulus maxima (WTMM) method [19,20] to various genomic sequences mainly selected in the human genome.

A wavelet-based multifractal formalism
In order to characterize the singular behavior of a distribution f at a point x o , one can use the H older (or roughness) exponent h(x o ) which basically corresponds to the highest exponent h which satisÿes: is of order smaller than h. The so-deÿned exponent h(x o ) "quantiÿes" the singularity strengh of f at x o : the higher h(x o ), the less singular f around x o . The singularity spectrum D(h) is then deÿned as the Hausdor dimension of the set of points x where the H older exponent of f is h. It thus gives a statistical characterization of the di erent singular behavior involved in f. In order to estimate locally the H older exponent, one needs a tool which must be blind to local smooth behavior, i.e., to the polynomial term P(x). The WT [19,20] is perfectly adapted to such requirements. It is a space-scale analysis which consists in expanding distributions in terms of wavelets which are constructed from a single function, the analyzing wavelet , by means of dilatations and translations. The WT is deÿned as where x o is the space parameter and a (¿ 0) the scale parameter. By choosing so that its ÿrst N moments are zero, one can easily prove [19,20] can be obtained locally from the power-law behavior of the WT, T (x o ; a) ∼ a h(xo) , in the limit a →0 + (we discard here the possible existence of oscillating singularities). In this work, we will mainly use the derivatives of the Gaussian function as analyzing wavelets: A natural way of performing a multifractal analysis of fractal functions consists in generalizing classical box-counting techniques by using wavelets instead of boxes. The WTMM method [19,20] amounts to investigate the scaling behavior of some partition functions deÿned in terms of wavelet coe cients: where q ∈ R. The sum is taken over the WT skeleton S(a) deÿned, at each scale a, by the set of all the points x i that correspond to local maxima of |T (x; a)| considered as a function of x. The main result of the WTMM method is that the D(h) singularity spectrum can be determined from the Legendre transform of the scaling exponent (q): D(h) = min q (qh − (q)). Homogeneous fractal functions that involve singularities of unique H older exponent h(x) = h, are characterized by a linear (q) spectrum (h = @ =@q). On the contrary, a nonlinear (q) curve is the signature of nonhomogeneous functions that display multifractal properties (i.e., h(x) is a uctuating quantity that depends upon x). For its ability to resolve multifractal scaling via the estimate of the entire (q) spectrum, the WTMM method [17,18] is a deÿnite step beyond the technique used so far in the literature [1][2][3][4][5][6][7] which were restricted to the estimate of the second-order exponent (2) only. The reliability of the WTMM method has been tested on various mathematical examples including fractional Brownian motions (fBm's). The fBm's B H (x) are Gaussian stochastic processes of zero mean with stationary increments. They are indexed by a parameter H (0¡H ¡1) that accounts for the presence (H = 1 2 ) or the absence (H = 1 2 ) of correlations between increments. The fBm's are statistically homogeneous fractals characterized by a single H older exponent h = H and thus by a linear (q) spectrum: (q) = qH − 1. The WTMM method has already been successfully applied to numerical and experimental data from various domains such as fully developed turbulence and fractal growth phenomena [19,20].

How to make the WT microscope blind to compositional patchiness
We concentrate our study on the statistical analysis of 121 DNA sequences selected in the human genome, with the requirement that their overall lengh L¿2000 nucleotides, so that the range of scales available to fractal scaling be large enough to make the analysis meaningful with respect to ÿnite size e ects. We took the sequences from EMBL data bank and processed seperately 47 coding (individual exons,CDS's) and 74 noncoding (individual introns) regions. To graphically portray these sequences we follow the so-called "DNA walk" analysis [7] which requires ÿrst to convert the four letter (A,C,G,T) text into a binary sequence. This can be done, for example, on the basis of purine (A,G) vs. pyrimidine (C,T) distinction, by deÿning the incremental variable that associates to position i the value (i) = 1 or −1, depending on whether the ith nucleotide of the sequence is a purine or a pyrimidine. (We refer the reader to Refs. [17,18,21] for similar analysis with the two complementary pair-base identiÿcations). The wavelet analysis of the human desmoplaskin I CDS is shown in Fig. 1 [17,18]. The patchiness of this sequence is patent on the corresponding DNA Fig. 1a: one clearly recognizes three regions of di erent strand bias. Fig. 1b shows the WT space-scale representation of this DNA signal when using the order 1 analyzing wavelet (1) . In Fig. 1c and Fig. 1d, two horizontal cuts x for a 2 = 2 7 (∼512 nucleotides). (e) Same analysis as in (d) but with the analyzing wavelet (2) .
T (1) (x; a) are shown at two di erent scales a = a 1 = 2 3 and a 2 = 2 7 which correspond (when taking into account the characteristic size of (1) ) to looking at the uctuations of the DNA walk over a characteristic lengh of the order of 32 and 512 nucleotides respectively. When progressively increasing the WT magniÿcation, one realizes that the uctuations detected at small scales actually occur around three successive linear trends. (1) not being blind to linear behavior, the WT coe cients uctuate about ÿnite constant values that correspond to the slopes of those linear trends. This phenomenon is indeed present at all scales and drastically a ects the fractal branching of the WT. In Fig. 1e, the uctuations of the WT coe cients are shown at the same coarse-scale a = a 2 but as computed with the order-2 analyzing wavelet (2) . The WT microscope now being orthogonal also to linear behavior, the WT uctuates about zero and one does not see the in uence of the strand bias anymore. Furthermore, by considering successively (3) , (4) ; : : : ; one can hope not only to restore the stationarity of the increments of the DNA signal but also to eliminate more complicated nonlinear trends.

Application of the WTMM method to the study of DNA sequences
In Fig. 2 are reported typical data coming out from the application of the WTMM method to the DNA walk corresponding to the dystrophin CDS when using the analyzing wavelet (2) . Fig. 2a shows plots of the partition functions Z(q; a) computed from the WT skeleton according to Eq. (2) vs. the scale parameter a in a log-log representation. Only plots obtained for q = 1 are shown on this ÿgure since they are quite representative of the typical features of the data for di erent values of q (−26q64). For convenience, aZ(1; a) is plotted in Fig. 2a, since from Eq. (2), it is expected to scale like a (1)+1 ; this will allow us to compare our results for DNA walks directly to the prediction for homogeneous fBms: aZ(1; a) ∼ a H where H is the Hurst exponent [17,18]. From a linear regression ÿt over a reasonable range of scales (more than a decade), one gets (1) + 1 = 0:50 ± 0:02, i.e., a value which is, up to the experimental uncertainty, quite consistent with the value H = 1 2 for uncorrelated Brownian random walks. This partial result is conÿrmed when repeating this measurement for di erent values of q; as shown in Fig. 2b, the data for the overall (q) spectrum remarkably fall on a straight line (the hallmark of homogeneous fractal signals) of slope h = @ =@q = 0:50 ± 0:01. In Fig. 2 are also reported the results of a similar WTMM analysis of the largest intron (L = 71 718) of the human retinoblastoma susceptibility gene. In Fig. 2b the data points for (q) again fall on a straight line but with a slope H = 0:58 ± 0:02 which is now signiÿcantly larger than 1 2 . Note that again the value obtained in Fig. 2a for (1) + 1 = H = 0:57 is quite compatible with the slope of (q). The dashed line in Fig. 2b corresponds to the theoretical spectrum for fBm's with a Hurst exponent H = 0:58; the remarkable agreement observed with the WTMM data conÿrms the presence of long-range correlations in the considered intronic DNA walk.
The results reported in Fig. 2 are actually quite representative of the results obtained for our statistical sample of 47 coding and 74 noncoding sequences [17,18]. When averaging the partition functions over these two statistical samples, we get Z C (q; a) and Z NC (q; a) which both scale with the exponent predicted for homogeneous fBm's, i.e., (q) = qH − 1, with a main di erence which allows us to distinguish coding from noncoding sequences namely the presence of long range correlation in the latter: H NC = 0:59 ± 0:02, while the former look like uncorrelated random walks: H C = 0:51 ± 0:02. These results are illustrated in Fig. 4a where a 1=2 Z(1; a) ∼ a H−1=2 is plotted vs. log 2 a to enlighten some possible departure of H from 1 2 .

About the Gaussian character of the uctuations in DNA walk landscapes
One of the most striking result of our WTMM analysis is the fact that the (q) spectra extracted for all the exons and introns we have considered in the human genome, are suprisingly in remarkable agreement with the theoretical prediction for Gaussian processes. Within that prospect, we have studied the probability distribution function of wavelet coe cient values P(T (2) (:; a)), as computed at a ÿxed scale a in the fractal scaling range [17,18]. The distribution obtained for both the coding DNA sequences of Fig. 1a and the largest intron contained in the human retinoblastoma susceptibility gene are shown in Fig. 3a and Fig. 3b, respectively. When increasing the scale parameter a, the distributions become wider, but when plotting ln P vs. T = (a), where (a) is the r.m.s value at scale a, all the data computed at di erent scales fall on the same parabola independently of the nature of the sequence. Thus, as explored through the WT microscope, the basic uctuations in DNA walks are likely to have Gaussian statistics. The presence of long-range correlations in the human introns is, in fact, contained in the scale dependence of (a) ∼ a H where H = 0:60 ± 0:02 as compared to the uncorrelated random-walk value H = 0:50 ± 0:02 obtained for the coding sequences.

Uncovering long-range correlations in coding DNA sequences
Because of the "period three" codon structure of coding DNA, it is natural to investigate separately the three subsequences relative to the position (1, 2 or 3) of the bases within their codons [22]. We have build up these subsequences from our  The analyzing wavelet is (2) . 35 largest CDS sequences and we have repeated the WTMM analysis. As shown in Fig. 4a, the data for a 1=2 Z(1; a) ∼ a H−1=2 when plotted vs. a in a log-log representation, display a rather at behavior for both the subsequences relative to positions 1 and 2 which indicates that the corresponding roughness exponents H C1 = 0:53 ± 0:02 and H C2 = 0:51 ± 0:02 are undistinguishable from H C and therefore from 1 2 . Surprisingly, the data for the subsequence relative to position 3 exhibit a clear linear increase with slope H C3 = 0:07 ± 0:02 which re ects the fact that H C3 = 0:57 ± 0:02, i.e., a value which is very close to the exponent estimated for introns. This observation suggests that this third coding subsequence is likely to display the same degree of long-range correlations as noncoding sequences.

Nucleotide composition e ects on the long-range correlation properties of human genes
In Fig. 4b, Fig. 4c and Fig. 5 are reported the results of a similar statistical analysis when classifying these DNA sequences into categories that correspond to a given GC content [21]. The idea of looking for a link between the long-range correlation properties and the GC content of the sequences results from the remark that the WTMM method indeed fails to distinguish a few introns from actual exons [17,18]. These introns with an exponent H close to 1 2 , actually correspond to DNA sequences with a low GC content (from 31% to 36%). As shown in Fig. 4b, when investigating the scaling behavior of a 1=2 Z(1; a) ∼ a H−1=2 for our set of introns, one notices some signiÿcant tendancy of the curves to become steeper when continuously increasing the GC content. The corresponding values of the roughness exponent H are reported in Fig. 5a. H clearly increases from value close to 1 2 at low GC content (∼30%) up to values signiÿcantly larger than 0.6 at high GC content (¿60%). In Fig. 5a are also shown the estimates of the roughness exponent for the coding sequences. Whether the CDS be poor or rich in GC, it does not seem to possess strong long-range correlations as indicated by an exponent H close to 1 2 . Fig. 5b is devoted to the results of similar analysis of the 3 coding subsequences relative to the position (1, 2 or 3) of the bases within the codons. For the ÿrst and the second subsequences, one gets results quite consistent with the estimates obtained with the overall CDS sequences: whatever the GC content, the exponent H does not signiÿcantly depart from the value 1 2 . Note that the data do not exclude a possible slow increase of H C2 . For the third subsequence, H is found to increase up to values close to 0.60 at high GC content, which brings the clue that this subsequence exhibits GC-dependent long-range correlations very much like those observed in Fig. 5a for introns (see also Fig. 4c). In order to investigate the possibility that these observations might result from the exon concatenation in the CDSs, we have analyzed individual human exons (for statistical reason, only the largest ones). These exons exhibit the same features than the CDSs, as it is exempliÿed in Fig. 4c by the apoB-100 largest exon.

Discussion
The results reported in this work clearly show that the GC content is likely to be relevant to the long-range correlation properties observed in both intronic and exonic DNA sequences. The evolution of DNA sequences in terms of GC content has attracted a lot of interest during the past few years [14][15][16][22][23][24][25][26][27]. Several mechanisms can be proposed to account for the observed long-range correlations in the GC rich intronic sequences [21].
(i) Besides punctual mutations, genomic sequences are subject to a number of insertion-deletion events of DNA fragments of widely variable sizes. These events are much less frequent in exonic regions due to the strong constraints imposed by their coding properties. The insertion-deletion mechanisms could be responsible for the observed long-range correlations [2]. However, insertions-deletions occur in low GC intronic sequences which we just showed to present no long-range correlations. Furthermore, the correlations observed between the third bases of the codons, but not between adjacent nucleotides, are unlikely to result from (rare) insertion-deletion events which generally involve several adjacent nucleotides in order to maintain the coding phase. (ii) The human genome is well known to be compartimentalized into wide speciÿc domains with uniform GC content, called isochores [23]; appreciable scatter of the average GC content is actually observed when comparing di erent domains. Another hypothesis is to consider that the processes operating to create the GC-rich isochores lead to the appearence of long-range correlations. Thanks to the functional constraints acting on the coding sequences embedded in these GC-rich regions, these processes should be less active on the exons, with a concomittant lack of long-range correlations as compared to the surrounding introns. Since these constraints are less stringent on the third base of the codons, this would explain the correlations observed between these nucleotides in high GC containing exons. In human genes the frequencies of the third base of codons are highly correlated with neighboring intronic GC content [28]. This property favors the hypothesis that the exonic correlations are produced by the same mechanisms which lead to intronic correlations.
It is likely that the observations reported here also extend to genomes of mammals and warmblooded vertebrates. The exploration of genomes of various organisms including unicellular eukaryotes and prokaryotes is currently under progress.