Characterizing long-range correlations in DNA-sequences from wavelet analysis

The fractal scaling properties of DNA sequences are analyzed using the wavelet transform. Because the wavelet transform microscope can be made blind to the "patchiness" of genomic sequences, we demonstrate and quantify the existence of long-range correlations in genes containing introns and noncoding regions. Moreover, the fluctuations in the patchy landscapes of DNA walks are found to be homogeneous with Gaussian statistics.

where xo is the space parameter and a () 0) the scale parameter.
The main advantage of using the WT for The possible relevance of scale invariance and fractal concepts to the structural complexity of genomic DNA has been the subject of considerable recent interest.Dur-  ing the past few years, there has been intense discussion about the existence and the nature of long-range corre- lations within DNA sequences [1 -6].But despite the efforts spent, it is still an open question whether the longrange correlation properties are different for intronless and intron-containing coding regions.On more fundamental ground, there is still continuing debate as to whether the reported long-range correlations really mean a lack of in- dependence at long distances or simply refIect the "patchiness" (bias in nucleotide composition) of DNA sequences [1 -6].One of the main reasons for this controversial situ- ation is that the different techniques (e.g., DNA walk, cor- relation function, and power spectrum analyses) used, so far, for characterizing the presence of long-range correla- tions are not well adapted to study patchy sequences [6].
In that respect, there have been some attempts to elimi- nate local patchiness using ad hoc methods such as the so-called "min-max" [1] and "detrended fiuctuation analy- sis" [4(c)] methods.The purpose of this Letter is to advo- cate the use of a new technique, the wavelet tranform [7] (WT), which has proven to be well suited for character- izing the scaling properties of fractal objects even in the presence of low-frequency trends [8].We will proceed to a statistical analysis of DNA sequences by applying the wavelet transform modulus maxima (WTMM) method [8 (b)] that has been recently proposed as a generalized multifractal formalism for fractal functions.
The WT is a space-scale analysis which consists of ex- panding signals in terms of wavelets that are constructed from a single function, the analyzing wavelet P, by means of dilations and translations [7].The WT of a function s(x) is defined as analyzing the regularity of a function s is its ability to eliminate polynomial behavior by an appropriate choice of the wavelet P. Indeed, if s has, at the point xo, a local scaling (Holder) exponent h(xo) E ]n, n + 1[, in the sense that, around xo, Is(x) -P"(x)I -Ix -xoI" ", where P"(x)is some order-n polynomial, then one can easily prove [8] that T&(xo, a) -a" '"~, provided the first n + 1 moments of iver are zero [fx ill(x)dx = 0, 0 m ~n] Ther.efore the WT turns out to be a very powerful tool to detect and characterize singularities, even when they are masked by a smooth behavior.This property will be of fundamental importance to break free from the intrinsic patchiness of DNA sequences without using any ad hoc recipe.In this work, we will use the derivatives of the Gaussian function as analyzing wavelets: $1~1 = d~(e ' t2)/dx~( the first N moments of /~~1 are vanishing).
The WTMM method [8(b)] is a natural generalization of classical box-counting techniques.It consists of inves- tigating the scaling behavior of some partition functions defined in terms of wavelet coefficients, Z(q, a) = P sup IT~(x, a')I -~~a1, (2)   IEg (a) -(x,a')El where q E R. The sum is taken over the WT skeleton defined at each scale a by the local maxima of ITp(x, a)l considered as a function of x, these WTMM are disposed on connected curves called maxima lines; the set X (a) of all maxima lines that exist at scale a indicates how to position the wavelets (" generalized oscillating boxes") in or- der to obtain a partition at this scale.In the framework of this wavelet-based multifractal formalism, r(q) is the Legendre transform of the singularity spectrum D(h) de- fined as the Hausdorff dimension of the set of points x where the Holder exponent is h.Homogeneous fractal functions (i.e., functions with a unique Holder exponent h) are characterized by a linear ~(q) spectrum (h = Br/Bq).
On the contrary, a nonlinear r(q) curve is the signature of nonhomogeneous functions that displays multifractal prop- erties [i.e., h(x) is a fiuctuating quantity that depends upon 0031-9007/95/74(16) /3293(4) $06.00 1995 The American Physical Society 3293 x].For some specific values of q, r(q) has a well-known meaning.For example,r(0) can be identified to the frac- tal dimension (capacity) of the set where s is not smooth; r(1) is related to the capacity of the graph of the consid- ered function; furthermore, r( 2) is related to the scaling exponent P of the spectral density: 5(f) = ~s( f)~-f with P = 2 + r(2).Thus r(2) 4 0 indicates the presence of long-range correlations.Previous studies of DNA se- quences [1 -6] have mainly focused on the estimate of the power-law exponents of the rms fluctuations of the DNA walk or the autocorrelation function, which are simply re- lated to the power spectrum exponent P and thus to r (2).
For its ability to resolve multifractal scaling via the esti- mate of the entire r(q) spectrum, the WTMM method is a definite step beyond the techniques used so far in the lit- erature.Its reliability has been tested [8(b)] on various experimental and mathematical examples including fractional Brownian motions [9] (FBM).The FBM's BH(x) are Gaussian stochastic processes of zero mean with sta- tionary increments, which are indexed by a parameter H (0 ( H ( 1) that accounts for the presence (H 4 -) or 1 2the absence (H = z) of correlations between increments.
The FBM's are statistically homogeneous fractals charac- terized by a single Holder exponent h = H and thus by a r(q) spectrum which is a linear function of slope H: As a first application of the WTMM method in a bio- logical context, this study is devoted to the statistical analysis of 70 human DNA sequences, extracted from the EMBL data bank and long enough (L ~6000 nucleotides) to make the fractal analysis meaningful with respect to finite-size effects [4(b)].We processed separately the en- tire genes, the coding (individual exons, cDNA s) and the noncoding (individual introns, flanks) regions, provided the length of these subsequences exceed 2000 nucleotides.
To graphically portray these sequences, we followed the strategy originally proposed in [1], which consists of transforming them into random walks by defining an incremental variable that associates to the position i the value y(i) = 1 for purine (A, 6) or -1 for pyrimidine (C, T) The graph of the D. NA walk defined by the cu- mulative variable s(x) = g;, g(i) is plotted in Fig. 1 (a)   for the human desmoplakin I cDNA.The patchiness of this DNA sequence is patent; one clearly recognizes three regions of different strand bias.Figure 1(b) shows the WT space-scale representation of this DNA signal when using order 1 analyzing wavelet Pl' .This WT displays a treelike structure from large to small scales, that looks qualitatively similar to the fractal branching observed in the WT representation of Brownian or turbulent signals  of s(x), i.e. , the fluctuations over a characteristic length of the order of ai = 32 nucleotides.When increasing the WT magnification in Fig. 1(d), one realizes that these fluctuations actually occur around three successive linear trends; P~') not being blind to linear behavior, the WT coefficients fluctuate about nonzero constant behavior that correspond to the slopes of those linear trends.
Even though this phenomenon is more pronounced when progressively increasing the scale parameter a, it is indeed present at all scales and drastically affects the fractal branching of the WT.In Fig. 1(e) at the same coarse scale a = a2 as in Fig. 1(d), the fluctuations of the WT coefficients are shown as computed with the order-2 wavelet /~2).The WT microscope being now orthogonal also to linear behavior, the WT coefficients fluctuate about zero and one does not see the inhuence of the strand bias anymore.Furthermore, by considering successively Pi3), P~"), . . ., one can hope to eliminate more complicated nonlinear trends, with the ultimate goal of filtering the fractal underlying structure that might be responsible for the presence of long-range correlations in DNA sequences [10].
In Figs.2(a) -2(c) typical data are reported coming from the quantitative application of the WTMM method to the DNA walk graph corresponding to an intron-containing sequence (L = 73 326) which has been widely stud- ied in previous works [1,6(b),6(c)]: the human beta-globin intergenomic sequence (gene bank name HHUMHBB).First, let us mention that the patchy structure of this sequence is a little trickier than the one of the cDNA sequence in Fig. 1(a), in so far as it is not so easily amenable to ad hoc detrending methods such as the min-max procedure [1].Figures 2(a) and 2(b) display plots of IogzZ(q, a) vs log&a for some values of q; two sets of data are represented corresponding to computations performed with P(') ( ~) and Pl ) (o), respectively.While the scaling behavior expected from Eq. ( 2) seems to operate over a wide range of scales when using P( ), it does not show up so clearly for the data obtained with P(') for which some continuous (nonlinear) crossover phe- nomenon are observed as the signature of the breaking of the scale invariance by the extent patchiness of the DNA walk [6 (b)].The overall r(q) spectrum obtained with P( ) using a linear regression fit of the data is compared in Fig. 2(c) to the corresponding spectrum (U) derived for the same sequence but after randomly shuffling the nucleotides.The data for both the true and the shuffled HHUMHBB sequences remarkably fall on straight lines which are the hallmark of homogeneous fractal functions.
The corresponding unique Holder exponents can be estimated by fitting the slope h of these straight lines; the value extracted for the original sequence h = 0.60 ~0.02 of length I = 2000).We checked the reliability of this measurement by reproducing this WTMM analysis with higher order analyzing wavelets (P ",P' ').Let us point out that the value h = 0.60 is clearly lower than previous estimates reported in the literature, e.g. , h = 0.71 in [1] or h = 0.67 in [6(c)].Note that if we had used the inap- propriate analyzing wavelet P 'l, we would have obtained a similar biased estimate h = 0.70 ~0.03.Therefore the WTMM method not only corroborates the existence of long-range correlations in this intron-containing sequence [~( 2) ) 0], but it also provides a very efficient methodol- ogy to correct some systematic overestimates reported in previous works.In Fig. 2(c) the ~(q) spectrum computed when analyzing (using P( )) the coding DNA sequence portrayed in Fig. 1(a) is also shown for comparison.
The data (6) are almost indistinguishable from those previously obtained from the shuffled sequence; the ~(q) spectrum is linear with a slope h = 0.49 ~0.02, as expected for homogeneous fractal fluctuations that do not display long-range correlations.
Actually the linearity of the r(q) spectrum turns out to be a general result for all the DNA walks that we considered.In Figs.2(d) -2(f) the results of a systematic in- vestigation of our statistical sample of 70 human genomic sequences are summarized under the form of histograms of Holder exponent values.In order to test the robust- ness of our results with respect to the rule used to map the 2. WTMM analysis of protein coding and noncoding DNA sequences.(a) HHUMHBB: log2Z(q, a) vs log2a for different values of q; the data correspond to the analyzing wavelets P"' ( ~) and P'2' (o).(b) HHUMHBB: log, Z(q, a) vs log, a for q = 1.5; the symbols ( ~) and (o) have the same meaning as in (a); the symbol (Q) corresponds to the data for the shuftled sequence (see text) when using P"', the dashed and solid lines represent the corresponding least-squares fit estimates of ~(q).( c) ~(q) vs q for the HHUMHBB sequence when using P'" ( ~), the HHUMHBB sequence (o), the shuftled HHUMHBB sequence (G), and the human desmoplakin I cDNA (4) when using P"'.Histograms of Holder exponent values extracted from the WTMM analysis (with P'") of 70 human DNA sequences, ( ) cDNA's and ( ) intron-containing sequences, the DNA walk graphs are constructed on the basis of the following identifications: (A, G) vs (C, T) in (d), (C, G) vs (A, T) in (e), and (A, C) vs (G, T) in (f).
the entire genes; they display a rather pronounced peak for h = 0.63.Despite some slight overlap, the histograms ob- tained for individual exons and cDNA are systematically shifted towards lower h values and markedly peaked at h = 0.50.These results are in qualitative agreement with the conclusions of [1,4,5] and suggest that the experimental finding of long-range correlations in the human noncoding sequences is very likely to be a general characteristic feature of nucleotide organization in DNA.
Let us stress that the r(q) spectra extracted from both coding and noncoding DNA walks are remarkably well fit- ted by the theoretical spectrum for FBM's [r(q) = qH- I].Within that prospect, we studied the probability distri- bution function of wavelet coefficient values P(Tpn) (., a)), as computed at a fixed scale a in the scaling range.The distributions obtained for both the coding DNA sequence (a)a", where h = 0.60 ~0.02 as compared to the uncorrelated random walk value h = 0.50 ~0.02 obtained for the coding sequence.
To conclude, we have emphasized the WT as a very powerful and reliable tool to characterize the fractal scaling organization of DNA sequences.Our main experimental finding is that the fIuctuations in the patchy DNA walks are homogeneous with Gaussian statistics.Those for noncod- ing and intron-containing DNA sequences display longrange correlations and are well modeled by FBM's with 1 H & 2. In contrast, those for coding sequences cannot be distinguished from classical (uncorrelated steps) Brown- ian motions.Actually, much more can be learned from the WTMM analysis, especially as far as the nature of the over- all superimposed patchy structure of the DNA sequences  is concerned.For instance, very instructive information is contained in the way the scaling range [Figs. 2(a) and 2(b)] depends upon the shape of the analyzing wavelet and the value of q, whether there exists some finite char- acteristic scale above which either the scaling is broken or some crossover to a different scaling regime is observed.This information is likely to provide decisive tests for the validity of various models proposed to account for the long-range correlations in DNA sequences.On a more fundamental ground, further application of the WTMM analysis to DNA sequences of different evolutionary cate- gories [3] looks promising for future understanding of the role played by introns, repetitive motifs, and noncoding in- tergenic regions in the nonequilibrium dynamical process Fig. 1(b).When focusing the WT microscope at small scale a = ai in Fig.1(c), since P~') is orthogonal to con- stants, one filters the local (high frequency) fluctuations FIG.1.WT analysis of the human desmoplakin I cDNA (L = 8499).(a) DNA walk displacement s(x) (excess of purines over pyrimidines) vs nucleotide distances x.(b) WT of s(x) computed with the analyzing wavelet Pi' ' T~(i)(x, a) is coded, independently at each scale a, using 32 grey levels from white [minT&~i~(x, a)] to black [maxT&ii~(x, a)]; small scales are at the top.(c) T&~i~(x, a = a&) vs x for a& = 32.(d) T&ii~(x, a = a&) vs x for a2 = 512.(e) Same analysis as in (d) but with the analyzing wavelet Pi2'.

[ 7 ( 2 )
= 0.20 ~0.02] is significantly different from the expected value h = 0.50 ~0.02 [~(2) = 0.00 ~0.02] ob- tained for the uncorrelated random shuffled sequence (the error bars have been estimated from the fluctuations of h 17 observed when splitting the original signals in samples FIG.2.WTMM analysis of protein coding and noncoding DNA sequences.(a) HHUMHBB: log2Z(q, a) vs log2a for different

of Fig. 1 (
a) and the largest intron (L = 33 895) contained in the human retinoblastoma susceptibility gene are shown in Figs.3(a) and 3(b), respectively.When plotting lnP vs T~/o.(a), where cr(a) is the rms value at scale a, all the data points computed at different scales fall on the same parabola independently of the nature of the sequence.Thus the fIuctuations in the DNA walks are likely to have Gaussian statistics.The presence of long-range correla- tions in the intron sequence is in fact contained in the scale dependence of the rms o.