Relaxing the Hypotheses of Symmetry and Time-Reversibility in Genome Evolutionary Models

Various genome evolutionary models have been proposed these last decades to predict the evolution of a DNA sequence over time, essentially described using a mutation matrix. By essence, all of these models relate the evolution of DNA sequences to the computation of the successive powers of the mutation matrix. To make this computation possible, hypotheses are assumed for the matrix, such as symmetry and time-reversibility, which are not compatible with mutation rates that have been recently obtained experimentally on genes ura3 and can1 of the Yeast Saccharomyces cerevisiae. In this work, authors investigate systematically the possibility to relax either the symmetry or the time-reversibility hypothesis of the mutation matrix, by investigating all the possible matrices of size 2*2 and 3*3. As an application example, the experimental study on the Yeast Saccharomyces cerevisiae has been used in order to deduce a simple mutation matrix, and to compute the future evolution of the rate purine/pyrimidine for $ura3$ on the one hand, and of the particular behavior of cytosines and thymines compared to purines on the other hand.


Introduction
Due to mutations or recombination, some variations occur in the frequency of each codon, and these codons are thus not uniformly distributed into a given genome. Since the late '60s, various genome evolutionary models have been proposed to predict the evolution of a DNA sequence as generations pass. Mathematical models allow the prediction of such an evolution, in such a way that statistical values observed in current genomes can be at least partially recovered from hypotheses on past DNA sequences. Moreover, it can be attractive to study the genetic patterns (blocs of more than one nucleotide: dinucleotides, trinucleotides...) that appear and disappear depending on mutation parameters.
A first model for genomes evolution has been proposed in 1969 by Thomas Jukes and Charles Cantor [5]. This first model is very simple, as it supposes that each nucleotide has the probability m to mutate to any other nucleotide, as described in the following mutation matrix,¨˚m m m m˚m m m m˚m m m m˚‹ ‹ ‚ .
In that matrix, the nucleotides are ordered as pA, C, G, T q, so that for instance the coefficient in row 3, column 2 represents the probability that the nucleotide G mutates into a C during the next time interval, i.e., P pG Ñ Cq. As diagonal elements can be deduced by the fact that the sum of each row must be equal to 1, they are omitted here.
This first attempt has been followed up by Motoo Kimura [6], who has reasonably considered that transitions (A ÐÑ G and T ÐÑ C) should not have the same mutation rate than transversions (A ÐÑ T , A ÐÑ C, T ÐÑ G, and C ÐÑ G), this model being refined by Kimura in 1981, with three constant parameters to make a distinction between natural A ÐÑ T , C ÐÑ G and unnatural transversions, leading to:˚c a b c˚b a a b˚c b a c˚‹ ‹ ‚ .
These efforts have been continued by Tamura, who proposed in [9,10] a simple method to estimate the number of nucleotide substitutions per site between two DNA sequences, by extending the model of Kimura (1980). The idea is to consider a two-parameter method, for the case where a GC bias exists. Let us denote by πGC the frequency of this dinucleotide motif. Tamura supposes that πG " πC " πGC 2 and πA " πT " 1´πGC 2 , which leads to the following rate matrix:˚κ All these models are special cases of the GTR model [11], in which the mutation matrix has the form (using obvious notations):˚f AC πC fAGπG fAT πT fAC πA˚fCGπG fCT πT fAGπA fCGπC˚πT fAT πA fCT πC πG˚‹ ‹ ‚ .
Non-reversible and non-symmetric models have, for their part, been considered in practical inferences since at least a decade for phylogenetic studies, see for instance [7,2,12] Furthermore, in the nonlinear case, a mutation can exhibit a chaos, see for instance [15]. As they are more regarded for their interest in practical inference investigations than on the theoretical side, they will not be developed in this article. 26  40  Transitions  46  65  Transversions  121  85   Table 1: Summary of sequenced ura3 and can1 mutations [8] Due to mathematical complexity [13,14], matrices theoretically investigated to model evolution of DNA sequences are thus limited either by the hypotheses of symmetry and time-reversibility or by the desire to reduce the number of parameters under consideration. These hypotheses allow their authors to solve theoretically the DNA evolution problem, for instance by computing directly the successive powers of their mutation matrix. However, one can wonder whether such restrictions on the mutation rates are realistic. Focusing on this question, we used in [1] a recent research work of Lang and Murray [8], in which the perbase-pair mutation rates of the Yeast Saccharomyces cerevisiae have been experimentally measured (see Table 1), allowing us to calculate concrete mutation matrices for genes ura3 and can1. We deduced in [1] that none of the existing genomes evolution models can fit such mutation matrices, implying the fact that some hypotheses must be relaxed, even if this relaxation implies less ambitious models: current models do not match with what really occurs in concrete genomes, at least in the case of this yeast. Having these considerations in mind, the data obtained by Lang and Murray have been used in [1] in order to predict the evolution of the rates or purines and pyrimidines in the particular case of ura3. Mathematical investigations and numerical simulations have been proposed, focusing on this particular gene and its associated matrix of size 2ˆ2 (purines vs. pyrimidines), and of size 3ˆ3 (cytosines and thymines compared to purines). Instead of focusing on two particular matrices, this extension of [1] investigates systematically all the possible mutation matrices of sizes 2ˆ2 and 3ˆ3. Thus, the study is finalized in this article, by investigating all the possible cases, and discussing about their mathematical and biological relevance.
The remainder of this research work is organized as follows. First of all the case of mutation matrices of size 2ˆ2 is recalled in Section 2 and applied to the ura3 gene taking into account purines and pyrimidines mutations. A simulation is then performed to compare this non reversible model to the classical symmetric Cantor model. The next sections deal with all the possible 6-parameters models of size 3ˆ3. In Section 3, a complete theoretical study is led encompassing all the particular situations, whereas in Section 4 an illustrative example focusing on the evolution of the purines, cytosines, and thymines triplet is given for ura3. We finally conclude this work in Section 5.

General Model of Size 2ˆ2
In this section, a first general genome evolution model focusing on purines versus pyrimidines is proposed, to illustrate the method and as a pattern for further investigations. This model is applied to the case of the yeast Saccharomyces cerevisiae.

A convergence result
Let R and Y denote respectively the occurrence frequency of purines and pyrimidines in a sequence of nucleotides, and M "ˆa b c d˙t he associated mutation matrix, with a " and thus M "ˆa 1´a c 1´c˙. The initial probability is denoted by P0 " pR0 Y0q, where R0 and Y0 denote respectively the initial frequency of purines and pyrimidines. So the occurrence probability at generation n is Pn " P0M n , where Pn " pRpnq Y pnqq is a probability vector such that Rpnq (resp. Y pnq) is the rate of purines (resp. pyrimidines) after n generations. The following theorem states the time asymptotic behavior of the probabilit Pn.
We recall the following result was proved in [1]: Consider a DNA sequence under evolution, whose mutation matrix is M " a 1´a c 1´c˙w ith a " P pR Ñ Rq and c " P pY Ñ Rq.
• If a " 1, c " 0, then the frequencies of purines and pyrimidines do not change as the generation pass.
• Else the value Pn " pRpnq Y pnqq of purines and pyrimidines frequencies at generation n is convergent to the following limit: rem 2.1. Note that the case a ‰ 1´c, resp. a ‰ c, translates the non symmetry property, resp. the time reversibility property.
Proof. To prove the theorem, we have to determinate M n for evey n P N. A division algorithm leads to the existence of a polynomial of degree n´2, denoted by QM P Rn´2rXs, and to an, bn P R such that X n " QM pXqχM pXq`anX`bn, when χM is the characteristic polynomial of M . Using both the Cayley-Hamilton theorem and the equality given above, we thus have In order to determine an and bn, we must find the roots of χM . As χM pXq " X 2T rpM qX`detpM q and due to (2.1), we can conclude that 1 is a root of χM , which thus has two real roots: 1 and x2. As the roots sum is equal to -tr(A), we conclude that x2 " a´c.
If x2 " a´c " 1, then a " 1 and c " 0 (as these parameters are in r0, 1s), so the mutation matrix is the identity and the frequencies of purines and pyrimidines into the DNA sequence does not evolve. If not, evaluating (2.2) in both X " 1 and X " x2, we thus obtain # 1 " an`bn, pa´cq n " anpa´cq`bn.
Considering that a´c ‰ 1, we obtain an " pa´cq n´1 a´c´1 , bn " a´c´pa´cq n a´c´1 .
Using these last expressions into the equality linking M , an, and bn, we thus deduce the value of Pn " P0M n , where If a " 0 and c " 1, then M "ˆ0 ; p1, 0qu, then the limit of M n can be easily found using (2.3).
All the studided cases for M n lead to Theorem 2.1.

Numerical Application
For numerical application, we will consider mutations rates in the ura3 gene of the Yeast Saccharomyces cerevisiae, as obtained by Gregory I. Lang [5], we obtain the evolution depicted in Figure 1.  Computation of probability a. P pR Ñ Rq " p1´mq . The use of Table 1 and  where m " 3.0552ˆ10´7 as mentionned previously.
Using the value of m for the ura3 gene leads to 1´a " 2.83391ˆ10´7 and c " 2.89714ˆ10´7, which can be used in Theorem 2.1 to conclude that the rate of pyrimidines is convergent to 49.45% whereas the rate of purines converge to 50.55%. Numerical simulations using data published in [8] are given in Figure 2, leading to a similar conclusion.

A First Genomes Evolution Model of size 3ˆhaving 6 Parameters without Time-reversibility hypothesis
In order to investigate the evolution of the frequencies of cytosines and thymines in the gene ura3, a model of size 3ˆ3 compatible with real mutation rates of the yeast Saccharomyces cerevisiae is now presented.

Formalization
Let us consider a line of yeasts where a given gene is sequenced at each generation, in order to clarify explanations. The n´th generation is obtained at time n, and the frequences of purines, cytosines, and thymines at time n are respectively denoted by PRpnq, PC pnq, and PT pnq.
Let a be the probability that a purine is changed into a cytosine between two generations, that is: a " P pR Ñ Cq. Similarly, denote by b, c, d, e, f the respective probabilities: P pR Ñ T q, P pC Ñ Rq, P pC Ñ T q, P pT Ñ Rq, and P pT Ñ Cq. Contrary to existing approaches, P pR Ñ Cq is not supposed to be equal to P pC Ñ Rq, and the same statement holds for the other probabilities. For the sake of simplicity, we will suppose in all that follows that a, b, c, d, e, f are not time dependent. Let be the mutation matrix associated to the probabilities mentioned above, and Pn the vector of occurrence, at time n, of each of the three kind of nucleotides. In other words, Pn " pPRpnq PC pnq PT pnqq. Under that hypothesis, Pn is a probability vector: @n P N, • PRpnq, PC pnq, PT pnq P r0, 1s, • PRpnq`PC pnq`PT pnq " 1, Let P0 " pPRp0q PC p0q PT p0qq P r0, 1s 3 be the initial probability vector. We have obviously: PRpn`1q " PRpnqP pR Ñ Rq`PC pnqP pC Ñ Rq`PT pnqP pT Ñ Rq, with similar equalities for PC pn`1q and PT pn`1q so that In all that follows we wonder if, given the parameters a, b, c, d, e, f as in [8], one can determine the frequency of occurrence of any of the three kind of nucleotides when n is sufficiently large, in other words if the limit of Pn is accessible by computations.

Resolution
This section, that is a preliminary of the convergence study, is devoted to the determination of the powers of matrix M in the general case and some particular situations

Determination of M n in the general case
The characteristic polynomial of M is equal to p " ad`ae`af`bc`bd`bf`ce`cf`de, detpM q " 1´s`p.
The discriminant of the polynomial of degree 2 in the factorization of χM is equal to ∆ " ps´2q 2´4 p1´s´pq " s 2´4 p. Let x1 and x2 the two roots (potentially complex or equal) of χM , given by x1 "´s`2´a s 2´4 p 2 and x2 "´s`2`a s 2´4 p 2 .
Let n P N, n ě 2. As χM is a polynomial of degree 3, a division algorithm of X n by χM pXq leads to the existence and uniqueness of two polynomials Qn and Rn, such that where the degree of Rn is lower than or equal to the degree of χM , i.e., RnpXq " anX 2b nX`cn with an, bn, cn P R for every n P N. By evaluating (3.3) in the three roots of χM , we find the system $ & % 1 " an`bn`cn x n 1 " anx 2 1`bn x1`cn x n 2 " anx 2 2`bn x2`cn This system is equivalent to If we suppose that x1 ‰ 1, x2 ‰ 1, and x1 ‰ x2, then standard algebraic computations give cn " 1´an´bn.
Using for i " 1, 2 and n P N the following notation, and since x2´x1 " ? ∆, the system above can be rewritten as where I3 is the identity matrix of size 3, an, bn, and cn are given by (3.5), and M 2 is given by

Determination of M n in particular situations
Formulations of (3.5) only hold for x1 ‰ x2, x1 ‰ 1, and x2 ‰ 1. We now investigate these latter cases.
Note that, as we deal with a stochastic process, the module of the eigenvalues of M are smaller than 1, so |x1| ď 1 and |x2| ď 1.
Suppose that x 1 " 1 Then´s " a s 2´4 p ðñ s " p " 0. So a " b " c " d " e " f " 0, and the mutation matrix is equal to the identity of size 3. Conversely, if a " b " c " d " e " f " 0, then x1 " 1.
In that situation, the system does not evolve.

Convergence study in particular situations
The case where x1 " 1 has already been discussed, it implies that a " b " c " d " e " f " 0, and so the system does not evolve. The other particular situations are invastigated in the two following theorems. thm 3.4. Suppose that x2 " 1 and x1 ‰ 1 (or equivalently p " 0). Then the system is well formulated if and only if M 2`s ps´2qM´ps´1q 2 I3 ‰ 0. In that situation, we have: • either s Ps0, 2r, and so pPRpnq PC pnq PT pnqq ÝÑ pPRp0q PC p0q PT p0qqˆ1 s 2 r´M 2s p3´sqM`ps´1qp2s´1qI3s.
Several cases can be deduced from this equality.
• If s Ps0, 2r, then M n is bounded if and only if M 2`s ps´2qM´ps´1q 2 I3 " 0. In that condition, M n ÝÑ 1 s 2 r´M 2`s p3´sqM`ps´1qp2s´1qI3s. • Finally, if s ą 2, then as s " a`b`c`d`e`f and a, b, c, d, e, f P r0, 1s, we have necessarily at least three coefficients in a, b, c, d, e, f that are non zero. So at least one product in abc, abd, abe, abf, acd, ace, acf, ade, adf, aef, bcd, bce, bcf, bde, bdf, bef, cde, cdf, cef, def is strictly positive. This is impossible, as p " ad`ae`af`bc`bd`bf`ce`cf`de is equal to 0. thm 3.5. Suppose that x1 " x2 ‰ 1 (or equivalently s 2 " 4p). Then the probabilities PRpnq, PC pnq, and PT pnq of occurrence at time n of a purine, cytosine, and thymine on the considered nucleotide, converge to the following values: • PRpnq ÝÑ 4 s 2 pce`cf`deq, s 2 pae`af`bf q, • PT pnq ÝÑ 4 s 2 pad`bc`bdq.
Proof. In that case ∆ " 0, meaning that (3.10) holds. Since x1 P "´1 2 , 1˘, one gets the following limits, and finally pan, bn, cnq converges to pa8, b8, c8q with Using these values in (3.6), we can determine the limit of M n , which is a8M 2`b 8Mc 8I3, where I3 is the identity matrix of size 3. All computations done, we find M22 " s 2 4´p`a e`af`bf , M23 " ad`bd`bc, M31 " ce`de`cf , M32 " ae`af`bf , and M33 " s 2 4´p`a d`bc`bd. However, since x1 " x2, we have ∆ " s 2´4 p " 0 and so M n ÝÑ 4 s 2¨c e`cf`de ae`af`bf ad`bc`bd ce`cf`de ae`af`bf ad`bc`bd ce`cf`de ae`af`bf ad`bc`bd‚ ,

Application in Concrete Genomes Prediction
We consider another time the numerical values for mutations published in [8]. Gene ura3 of the Yeast Saccharomyces cerevisiae has a mutation rate of 3.80ˆ10´1 0 /bp/generation [8].
As this gene is constituted by 804 nucleotides, we can deduce that its global mutation rate per generation is equal to m " 3.80ˆ10´1 0ˆ8 04 " 3.0552ˆ10´7. Let us compute the values of a, b, c, d, e, and f . The first line of the mutation matrix is constituted by 1´a´b " P pR Ñ Rq, a " P pR Ñ T q, and b " P pR Ñ Cq. P pR Ñ Rq takes into account the fact that a purine can either be preserved (no mutation, probability 1´m), or mutate into another purine (A Ñ G, G Ñ A). As the generations pass, authors of [8] have counted 0 mutations of kind A Ñ G, and 26 mutations of kind G Ñ A. Similarly, there were 28 mutations G Ñ T and 8: A Ñ T , so 36: R Ñ T . Finally, 6: A Ñ C and 9: G Ñ C lead to 15: R Ñ C mutations. The total of mutations to consider when evaluating the first line is so equal to 77.  In that situation, s " a`b`c`d`e`f " 205m 77 « 8.134ˆ10´7, and p " 207488m 2 118657 «

Final Remarks
In this document, a formulation of the non symmetric discrete model of size 2ˆ2 has been proposed, which studies a DNA evolution taking into account purines and pyrimidines mutation rates. A simulation has been performed, to compare the proposal to the well known Jukes and Cantor model. Then all non-symmetrical models of size 3x3 that have 6 parameters have been studied theoretically. They have been tested with numerical simulations, to make a distinction between cytosines and thymines in the former proposal. These two models still remain generic, and can be adapted to a large panel of applications, replacing either the couple (purines, pyrimidines) or the tuple (purines, cytosines, thymines) by any categories of interest.
Remark that the ura3 gene is not the unique example of a DNA sequence of interest such that none of the existing nucleotides evolution models cannot be applied due to a complex mutation matrix. For instance, a second gene called can1 has been studied too by the authors of [8]. Similarly to gene ura3, usual models cannot be used to predict the evolution of can1, whereas a study following a same canvas than what has been proposed in this research work can be realized.
In future work, biological consequences of the results produces by these models will be systematically investigated. Then, the most general non symmetric model of size 4 will be regarded in some particular cases taken from biological case studies, and the possibility of mutations non uniformly distributed will then be regarded. Finally, this 4ˆ4 general case will be investigated using Perron-Frobenius based approaches instead of using methods directly inspired by linear algebra, in order to obtain the most global results on mutation matrices.