Diffusive Realization of a Lyapunov Equation Solution, and Its FPGA Implementation

In Yakoubi [2010] and Lenczner et al. [2010] we developed a theoretical framework of diffusive realization for state realizations of certain linear operators, namely solutions to operator linear differential equations posed in one-dimensional bounded domains. We also illustrated the theory and developed a numerical method for a Lyapunov equation arising from optimal control theory of the heat equation. However, the principles of our numerical method were only sketched; here we provide more details. We not only report validation results for the method, but also describe our experience implementing it on a Field Programmable Gate Array (FPGA), with the aim of promoting embedded real-time computation.


INTRODUCTION
One of the main recognized advantages of the diffusive realization of a linear operator is its very low computational cost; see the papers of G. Montseny and of D. Matignon, e.g. Laudebat et al. [2004] and Hélie et al. [2007], for representations of various pseudodifferential operators and for their approximation. Those of C. Lubich and collaborators, e.g. López-Fernández et al. [2005], apply a similar idea to convolution operators, and they introduced optimized adaptive numerical methods. Notice that fast operator realization is essential for real-time control. Until now, the realization of a linear operator u → z = P u by the diffusive realization method has been addressed for causal operators whose kernel p is explicitly known and analytic in its second variable, see Montseny [2005]. Here, we cover the case where P is the solution of an operator equation, so p is neither explicitly given nor analytic. Our approach is presented and illustrated with the example of P solution to the Lyapunov equation
AP + PA = −Q in ω = (0, 1),   (1)
understood for all u vanishing at the boundary, where A denotes the second-order derivative operator ∂²/∂x² and Q is another linear operator. This problem comes from optimal filtering or control theory of the heat equation with Dirichlet boundary conditions. Our new method was announced in Lenczner and Montseny [2005], it was fully developed in Lenczner et al. [2010], and the theoretical part, with some numerical results, was presented in Yakoubi [2010]. The present paper focuses mainly on the numerical method, which has not been published yet. A second aim of this paper is implementation in view of embedded real-time computation. Field Programmable Gate Arrays (FPGAs) are today the best choice for real-time, embedded, massive, and low-cost computation. The main drawback of these processors, compared to usual computers, is that they require a good expertise in digital electronics for application implementation.
Helpful dedicated software exists to assist FPGA implementation but, until today, due to FPGA complexity, the solutions it produces are often not very efficient. So, we have elaborated a solution from scratch to execute our algorithm in a small FPGA, namely the Spartan3A by Xilinx.
The paper is organized as follows. The diffusive realization of P is recalled in Section 2; then, in Sections 3 and 4, we propose a numerical method and present related numerical results. Last, we discuss our FPGA implementation in Section 5.

DIFFUSIVE REPRESENTATION OF P
In this section we recall the diffusive realization of the solution P to the Lyapunov equation (1) published in Lenczner et al. [2010]. We consider the kernel formulation of the operator P,
P u(x) = ∫_0^1 p(x, y) u(y) dy,
and its unique decomposition P u = z^+ + z^- into causal and anti-causal parts,
z^+(x) = ∫_0^x p(x, y) u(y) dy and z^-(x) = ∫_x^1 p(x, y) u(y) dy.
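As a minimal numerical sketch of this causal/anti-causal splitting, the following Python fragment evaluates z^+ and z^- by a simple rectangle rule for an illustrative kernel (the Green's function of the Dirichlet Laplacian on (0,1), chosen here only for concreteness, not the solution of (1)):

```python
import numpy as np

# Illustrative kernel: p(x, y) = min(x, y) * (1 - max(x, y)), the Green's
# function of -d^2/dx^2 on (0, 1) with Dirichlet boundary conditions.
def p(x, y):
    return np.minimum(x, y) * (1.0 - np.maximum(x, y))

def split_realization(u, x_nodes):
    """Approximate z+(x) = int_0^x p(x,y)u(y)dy and z-(x) = int_x^1 p(x,y)u(y)dy
    by the rectangle rule on the uniform grid x_nodes."""
    h = x_nodes[1] - x_nodes[0]
    z_plus = np.zeros_like(x_nodes)
    z_minus = np.zeros_like(x_nodes)
    for n, xv in enumerate(x_nodes):
        z_plus[n] = h * np.sum(p(xv, x_nodes[:n]) * u[:n])   # causal part
        z_minus[n] = h * np.sum(p(xv, x_nodes[n:]) * u[n:])  # anti-causal part
    return z_plus, z_minus

x = np.linspace(0.0, 1.0, 101)
u = np.sin(np.pi * x)
zp, zm = split_realization(u, x)
```

By construction z^+ + z^- recovers the full quadrature of P u, which makes the decomposition easy to check numerically.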
Throughout this paper, we shall use the superscripts + or − to refer to causal or anti-causal operators, and the convention ∓ = −(±).
We recall that the kernel p is the unique solution to the boundary value problem −∆p = q in the square (0, 1)², with p = 0 on the boundary of the square, where q is the kernel of Q. The realizations of z^+ and z^- may be formulated thanks to the diffusive representation, see Lenczner and Montseny [2005] and Montseny [2005], in the form
z^±(x) = ∫_R μ^±(x, ξ) ψ^±(x, ξ) dξ,   (2)
where both ψ^+ and ψ^- store a part of the history of the input data u. They are respectively solutions to the forward and backward ordinary differential equations in x, parametrized by ξ ∈ R,
∂_x ψ^+ = θ^+(ξ) ψ^+ + u, ψ^+(0, ξ) = 0,   (3)
∂_x ψ^- = −θ^-(ξ) ψ^- − u, ψ^-(1, ξ) = 0.   (4)
We notice that ψ^+ and ψ^- are defined independently of P. Conversely, the coefficients μ^+ and μ^-, called diffusive symbols, depend on P but not on u.
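The kernel boundary value problem −∆p = q can be sketched numerically with a standard 5-point finite-difference scheme; the grid size and the dense solve below are illustrative choices for a small example, not the Petrov-Galerkin discretization used later:

```python
import numpy as np

# Sketch: finite-difference solution of -Delta p = q on (0,1)^2 with p = 0
# on the boundary, using the 5-point Laplacian assembled by Kronecker products.
def solve_kernel(q_func, n=30):
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1.0 - h, n)           # interior nodes only
    X, Y = np.meshgrid(x, x, indexing="ij")
    I = np.eye(n)
    T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # 1D second difference
    A = (np.kron(T, I) + np.kron(I, T)) / h**2              # 2D 5-point Laplacian
    p = np.linalg.solve(A, q_func(X, Y).ravel())
    return x, p.reshape(n, n)

# The kernel q used in the numerical-results section:
q = lambda x, y: 2*(1 - 3*x)*(1 - y)*y**2 + 2*(1 - x)*x**2*(1 - 3*y)
x, P = solve_kernel(q)
```

For this particular polynomial q the truncation error of the 5-point scheme vanishes (p is cubic in each variable), so even this coarse grid reproduces the kernel to solver precision.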
The functions ξ → θ^+(ξ) and ξ → θ^-(ξ) parametrize two closed paths in the complex plane, satisfying the cone condition and encircling the singularities of the Laplace transforms P^+ and P^- defined hereafter. The diffusive symbols are derived in several steps. First, the functions
y → p(x, x − y) and y → p(x, x + y),   (5)
corresponding to the causal and the anti-causal parts of the impulse response, are analytically extended to R^+. Then, we assume that the Laplace transforms P^+ and P^-, with respect to y, of the extended causal and anti-causal parts of the impulse response are well defined in C^+, and that they admit holomorphic extensions vanishing at infinity. Finally, we show that the diffusive symbols are given by
μ^±(x, ξ) = (θ^±)'(ξ)/(2iπ) P^±(x, θ^±(ξ)).   (6)

DIFFUSIVE REALIZATION APPROXIMATION
The approximations presented in this section are formulated in the particular case of the Lyapunov equation (1).
In Subsection 3.1, we recall a Petrov-Galerkin method to approximate the symbols μ^±. Then, computational algorithms for the history functions ψ^± and for z^± are derived in Subsections 3.2 and 3.3.

Remark 2
We have used a spectral method to discretize both the x- and y-directions. In the y-direction we actually need to use global basis functions, so that they can be analytically extended. In contrast, there is no particular restriction regarding approximations in the x-direction; for instance a local basis, such as a finite element basis, might be used.

Discretization of ψ with respect to x
Two x-discretizations have been considered. They are based on two different interpolations of discrete inputs (u_n)_n located at regularly spaced nodes (x_n)_n separated by a distance h. In the interval [x_n, x_{n+1}), the first one is piecewise constant, u(x) = u_n, and the second one is piecewise linear and continuous, u(x) = u_n + ((u_{n+1} − u_n)/h)(x − x_n). We detail the calculation for both the causal and the anti-causal parts, since one is not trivially deduced from the other. To proceed, we first consider the integral forms of (3-4), i.e.
ψ^+(x, ξ) = ∫_0^x e^{θ^+(ξ)(x − y)} u(y) dy and ψ^-(x, ξ) = ∫_x^1 e^{θ^-(ξ)(y − x)} u(y) dy.
In particular, at a point x = x_{n+1}, we have
ψ^+(x_{n+1}, ξ) = e^{θ^+(ξ)h} ψ^+(x_n, ξ) + ∫_{x_n}^{x_{n+1}} e^{θ^+(ξ)(x_{n+1} − y)} u(y) dy,
and similarly for ψ^-. For the piecewise constant interpolation of u, the integrals turn out to be equal to ((e^{θ^±(ξ)h} − 1)/θ^±(ξ)) u_n. So, writing ψ^±_{n,k} for the approximation of ψ^±(x_n, ξ_k), the recurrence relations read
ψ^+_{n+1,k} = α^+_k ψ^+_{n,k} + β^+_k u_n and ψ^-_{n,k} = α^-_k ψ^-_{n+1,k} + β^-_k u_n,   (15)
with α^±_k = e^{θ^±(ξ_k)h} and β^±_k = (e^{θ^±(ξ_k)h} − 1)/θ^±(ξ_k). Notice that this recurrence relation was already found by Casenave [2009] for the causal part.
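A minimal Python sketch of the causal recurrence (15) follows; the contour samples θ_k and the input are illustrative placeholders, since in the method they come from the optimized paths discussed below:

```python
import numpy as np

# Sketch of recurrence (15) for the causal history functions, assuming
# piecewise-constant interpolation of the input u between the nodes.
def forward_history(u, h, theta):
    """psi[n, k] approximates psi+(x_n, xi_k), with psi+(0, .) = 0."""
    alpha = np.exp(theta * h)                  # alpha_k = e^{theta_k h}
    beta = (np.exp(theta * h) - 1.0) / theta   # beta_k = (e^{theta_k h} - 1)/theta_k
    psi = np.zeros((len(u), len(theta)), dtype=complex)
    for n in range(len(u) - 1):
        psi[n + 1] = alpha * psi[n] + beta * u[n]
    return psi

h = 0.01
x = np.arange(0.0, 1.0 + h / 2, h)
u = np.sin(np.pi * x)
theta = -1.0 + 1j * np.linspace(-5.0, 5.0, 9)  # illustrative contour samples
psi = forward_history(u, h, theta)
```

Note that for a genuinely piecewise-constant input the recurrence is exact: unrolling two steps gives psi[2] = alpha*beta*u[0] + beta*u[1], the exact integral of the interpolant.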
For the piecewise linear interpolation, an additional coefficient γ^±_k = (e^{θ^±(ξ_k)h} − 1 − θ^±(ξ_k)h)/(θ^±(ξ_k)² h) appears, so the recurrence relations are rewritten as
ψ^+_{n+1,k} = α^+_k ψ^+_{n,k} + β^+_k u_n + γ^+_k (u_{n+1} − u_n) and ψ^-_{n,k} = α^-_k ψ^-_{n+1,k} + β^-_k u_{n+1} − γ^-_k (u_{n+1} − u_n).
López-Fernández et al. [2005] have developed a computation of ψ, in the linear interpolation case, which presents similarities with ours. Using the above recurrence relations, we establish the final approximations z^+_{n+1} and z^-_n of z^+ and z^- at the input nodes. We notice that a direct application of the residue theorem allows eliminating the terms without exponentials.

Approximation of the integrals in z^+ and z^-
◮ Piecewise constant interpolation of u. Starting from the recurrence relation (15), and similarly for the anti-causal part, we evaluate the integral (2) thanks to the trapezoidal rule with 2M + 1 quadrature nodes (ξ_k)_k regularly spaced at a distance h_ξ, which yields the final approximations
z^+_{n+1} ≈ h_ξ Σ_{k=−M}^{M} μ^+(x_{n+1}, ξ_k) ψ^+_{n+1,k} and z^-_n ≈ h_ξ Σ_{k=−M}^{M} μ^-(x_n, ξ_k) ψ^-_{n,k}.
◮ Piecewise linear interpolation of u. We follow the same route, with the recurrences of the linear case, to find the corresponding approximations.
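The final quadrature step can be sketched as a weighted sum over the contour nodes; the symbol samples and history values below are illustrative stand-ins, not the true diffusive symbols:

```python
import numpy as np

# Sketch of the final approximation: trapezoidal quadrature over 2M+1
# contour nodes applied to z_n ~ h_xi * sum_k mu_k * psi_{n,k}.
def realize(psi, mu, h_xi):
    """psi: (N+1, 2M+1) history samples; mu: (2M+1,) symbol samples on the contour."""
    w = np.full(psi.shape[1], h_xi)  # trapezoidal weights on a uniform xi-grid
    w[0] *= 0.5
    w[-1] *= 0.5
    return np.real(psi @ (mu * w))   # z_n = Re sum_k w_k mu_k psi_{n,k}

M = 8
h_xi = 0.25
mu = np.exp(-np.linspace(-M, M, 2 * M + 1) ** 2)  # illustrative symbol samples
rng = np.random.default_rng(0)
psi = rng.normal(size=(11, 2 * M + 1)) + 1j * rng.normal(size=(11, 2 * M + 1))
z = realize(psi, mu, h_xi)
```

The cost per output node is O(M), which is the source of the method's efficiency compared with the O(N) direct quadrature.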

Balance of Error Estimates
We replace μ^± (or P^±) and ψ^± by their approximations μ^{N±} (or P^{N±}) and by the x-discretization of ψ^± in the integral (2). We establish that the resulting approximation of the realization z^± in (2) can be written as a linear combination of inverse Laplace transforms L^{-1}.
◮ In the piecewise linear interpolation case, and ◮ in the piecewise constant interpolation case, these linear combinations involve terms L^{-1}[F^±_{2n}(−θ^±)](x^±), for all n ∈ {0, 1, ..., N}, with N = 1/h − 1 the number of intervals in the x-variable, J^+_n = {0, ..., n − 1}, and F^±_{2n}(−θ^±) = P^{N±}(x_n, −θ^±(ξ))/θ^±(ξ)². Following Weideman and Trefethen [2007], we approximate the inverse Laplace transforms at x^± = ±(x_n − x_j) and x^± = ±(x_n − x_{j+1}) using a numerical integration formula along the integration contours (7-8). The optimization of these contours is based on a balance between the truncation error estimate and the discretization error estimate for the numerical integration of the Laplace inversion at points x^± ∈ I = {h, 2h, ..., 1}, excluding the point 0; at x^± = 0 the value is known from (17). The ratio between the upper and lower bounds of the set I, i.e. Λ = 1/h, is very large for a fine mesh, so the numerical inversion of the Laplace transform is relatively expensive. To avoid this problem, we observed an improvement by changing the formulas obtained in Weideman and Trefethen [2007] and fixing Λ = 6. The minimum quadrature interval lengths are set accordingly, where α* = arg max_{α ∈ (0, π/2)} π(π − 2α)/L^H(α) ≈ 1.0641 is the unique maximizer for the hyperbolic path.
Then, it is shown that for the optimal contour parameters, the quadrature error estimates e^P_M and e^H_M, for the parabolic (P) and hyperbolic (H) paths, decrease exponentially with respect to the number M of points on the integration contours: e^P_M = C_P e^{−AM} and e^H_M = C_H e^{−BM}. The two constants C_P and C_H are uniform with respect to M, and the decay rates depend on the quadrature interval length. We notice that hyperbolic paths always yield faster computation than parabolic ones. Moreover, the errors e^C_h and e^L_h in the exact integral (2), for the constant (C) and linear (L) interpolations of u, are linear and quadratic in h; so there exist two constants C_C and C_L such that e^C_h = C_C h and e^L_h = C_L h². For each pair of approximations, the quadrature and the interpolation errors can be balanced by equating the related errors e^X_M and e^Y_h. This yields the four relations between h and M stated in Table 1, where E(.) stands for the integer part, and allows the numerical method to be parameterized by h only.
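The error-balancing argument can be sketched in a few lines; the decay rate B and the constants below are illustrative, since the actual values come from the optimized contour parameters:

```python
import numpy as np

# Sketch of the error balance: with a quadrature error ~ C e^{-B M} and an
# interpolation error ~ h (constant case) or h^2 (linear case), equating the
# two gives M as a function of h alone, growing only logarithmically in 1/h.
def nodes_for_step(h, order, B=1.0, C=1.0):
    """Number of contour nodes M balancing C*e^{-B*M} = h**order."""
    return int(np.ceil(np.log(C / h**order) / B))

for h in (0.25, 0.05, 0.01):
    M_const = nodes_for_step(h, order=1)  # piecewise-constant interpolation
    M_lin = nodes_for_step(h, order=2)    # piecewise-linear interpolation
    print(h, M_const, M_lin)
```

This is why small M (compared to N = 1/h − 1) suffice in practice: halving h only adds a constant number of contour nodes.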

NUMERICAL RESULTS
In our presentation of numerical experiments, we discuss only causal parts; similar results have been observed for anti-causal parts. We have considered the kernel q(x, y) = 2(1 − 3x)(1 − y)y² + 2(1 − x)x²(1 − 3y) and the input variable u(x) = sin(jπx) with j ∈ N*. The numbers of basis functions of the Petrov-Galerkin method are fixed at sufficiently large values, N₁ = N₂ = 15, so that the error on the diffusive symbols μ is negligible in comparison with the other error sources. In Figure 1 we report, in logarithmic scale, the relative errors between the exact realization z^+ and its approximations z^{h+}, parameterized by h only, for u(x) = sin(πx). The error is measured in the discrete L²(ω)-norm, and the discretization step h ranges from 0.005 to 0.25. The error decay rate is proportional to h for the piecewise constant (C) interpolation, and proportional to h² for the piecewise linear (L) interpolation. We notice that balancing the errors e^X_M and e^Y_h yields the same global error for both the parabolic (P) and the hyperbolic (H) paths, which is why both errors are plotted as a single curve.
To speed up the computation, we take into account the fact that, for a real-valued operator P, the contributions of conjugate quadrature nodes are complex conjugates, so only half of the integrals need to be computed,
which reduces, by a factor of 2, the number of quadrature points. Figure 2 provides a comparison of computation times between the direct (D) quadrature method, P u(x_n) ≈ h Σ_j p(x_n, y_j) u(y_j) for all n, and the diffusive realization methods with hyperbolic (H) and parabolic (P) contours, with piecewise constant interpolation of the input variable u(x) = sin(10πx), which presents ten oscillations. The gain in using the diffusive realization over the direct quadrature increases for finer spatial discretizations. The implementation is done in Matlab version 7.9, hence the timing results have to be taken with caution.

FPGA IMPLEMENTATION
The core of the computation is Algorithm 1, which updates, for each input u_n and each quadrature node, the history function through the recurrence
ψ^+_{n+1,k} = α^+_k ψ^+_{n,k} + β^+_k u_n, with ψ^+_{0,k} = 0,
before accumulating the output. Note that the implementation of the anti-causal part is done in a similar way, and will not be described. Consequently, we will drop all superscripts "+" without any risk of confusion.
In order to process the highest number of operations in parallel, the quantization of numbers must be optimized. To do so, we first require that all variables belong to a neighborhood of 1. This is achieved by scaling β, γ, μ, u, ψ, and z by factors corresponding to estimates of their largest values: β* = β/γ_max, γ* = γ/γ_max, μ* = μ/μ_max, u* = u/u_max, ψ* = ψ/(β_max u_max), and z* = z/(u_max β_max μ_max). We notice that α does not require any scaling, and that β has been scaled by γ_max since the latter is larger than β_max. Then, Algorithm 1 has been rewritten based on the scaled variables, and the impact of quantization has been evaluated. In Table 2, we report absolute and relative errors, in the maximum norm, on the output z for quantizations varying between 11 and 16 bits.
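The quantization study can be sketched in software by rounding every coefficient and every intermediate result to a fixed number of fractional bits and rerunning the recurrence; the coefficients below are illustrative real-valued stand-ins for the scaled variables:

```python
import numpy as np

# Sketch of a fixed-point quantization study: round all data and intermediate
# results to 'bits' fractional bits and compare the output with full precision.
def quantize(v, bits):
    scale = 2.0 ** bits
    return np.round(v * scale) / scale

def run(u, alpha, beta, mu, bits=None):
    q = (lambda v: v) if bits is None else (lambda v: quantize(v, bits))
    alpha, beta, mu = q(alpha), q(beta), q(mu)
    psi = np.zeros(len(alpha))
    z = np.zeros(len(u))
    for n in range(len(u)):
        psi = q(alpha * psi + beta * q(u[n]))  # history update (recurrence)
        z[n] = q(np.dot(mu, psi))              # quadrature accumulation
    return z

rng = np.random.default_rng(0)
u = np.sin(np.pi * np.linspace(0.0, 1.0, 64))
alpha = rng.uniform(0.1, 0.9, 8)   # illustrative scaled coefficients
beta = rng.uniform(0.1, 0.9, 8)
mu = rng.uniform(-1.0, 1.0, 8)
err = np.max(np.abs(run(u, alpha, beta, mu, bits=12) - run(u, alpha, beta, mu)))
```

Such a software model lets the word length be chosen before committing to a hardware design, in the spirit of Table 2.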

Number of bits   Maximum error   Relative error
16               9.21e-5         1.05%
15               2.08e-4         2.51%
14               1.38e-4         1.51%
13               4.82e-4         6.79%
12               6.87e-4         9.23%
11               1.4e-3          18.69%
Table 2. Error with respect to the quantization

Regarding FPGA hardware implementations, three variants have been studied, namely a sequential, a parallel, and a pipelined architecture. The parallel and the pipelined solutions have been implemented. Before discussing them, we underline some features of our algorithm. We observe that each couple (π_{n,k}, ν_{n,k}) = (α_k ψ_{n,k} + β_k u_n, α_k ψ_{n,k} + γ_k u_n) of complex numbers can be computed independently, and that the same real number u_n is used for their evaluation. The dependency flow related to the computation of the vector (ψ_{n+1,k})_k is depicted in Figure 3. Given u_n and ψ_{n,k}, eleven multiplications are required to determine ψ_{n+1,k}. Due to the limited number of multipliers available in an FPGA, for large N and M a fully parallel implementation is not always possible, and the limited resources must be shared. To optimize resource sharing, a finite state machine was used. To design it, we deduce from the algorithm a number of states depending on the number of multipliers in a branch and on the number of required multiplications in a step. In our implementation of a state, the output of each multiplier is stored in a register before being entered as an input of a subsequent addition or subtraction, and the multiplier inputs are refreshed with new data. Finally, the state machine is designed to control all state evolutions and the RAM, to load data into the multipliers, and to store the results. Assuming infinite resources, the time necessary to build a complete vector (z_n)_n is N × n_z × t_clock, where n_z = 4 is the number of time steps necessary to compute a new occurrence z_n, and t_clock is the clock period.
Our implementation of the pipelined architecture takes advantage of the already underlined independence of the couples (π_{n,k}, ν_{n,k}). For a given input u_n, the computation of the sequence of couples (π_{n,k}, ν_{n,k})_k and the update of z_n are executed through the pipeline. Thus, in steady-state operation (i.e. except during the initial and final periods), the computation of one couple takes one clock period. When the computation of ψ_{n+1,k} is complete, u_{n+1} is taken as the new input. Finally, the time required to build a realization (z_n)_n is N × M × t_clock.
We have successfully implemented the parallel and the pipelined architectures on a Xilinx Spartan3A XILINX [2010], comprising 200 kgates, 16 multipliers with 18-bit input data and 36-bit output data, 16 RAM blocks of 16 kbits each, and a 100 MHz clock. For a vector of data (u_n)_n with N = 8 components, for a quadrature formula with M = 4 nodes, and for 9-bit encoded integers, the parallel computation (with n_z = 9) and the pipelined computation (with n_z = 4) of a vector (z_n)_n take 0.72 µs and 0.32 µs respectively, which fits the theoretical values. The same computation with a C program on a laptop computer with an x86 1.6 GHz processor takes about 60 µs. This yields a speed-up of about 10².
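The quoted timings follow directly from the two cost models above, as the following arithmetic check shows:

```python
# Timing models from the text: parallel time N * n_z * t_clock and pipelined
# time N * M * t_clock, with the Spartan3A figures quoted above.
t_clock = 1.0 / 100e6               # 100 MHz clock -> 10 ns period
N, M = 8, 4

parallel_us = N * 9 * t_clock * 1e6  # n_z = 9 states per output (parallel design)
pipeline_us = N * M * t_clock * 1e6  # one couple per clock in steady state
```

With these figures the models reproduce the measured 0.72 µs and 0.32 µs, confirming that the hardware runs at its theoretical throughput.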

CONCLUSION
Until now, the diffusive realization of operators has been applied to operators with analytically known kernels. From the references in the field, it is known to be a very efficient method requiring little computation for real-time realizations, since small values of M (compared to N) are generally enough to yield good approximations. In Lenczner et al. [2010] we introduced a mathematical framework allowing for its derivation when an operator is the solution to a linear operator partial differential equation. A complete justification of the numerical method was not included there, so it constitutes the main focus of the present paper. Here, our general approach is presented through the example of a Lyapunov equation arising in optimal control theory of the one-dimensional heat equation. In view of real-time applications, we have also implemented this method on an FPGA with a parallel and with a pipelined architecture. The theoretical (optimal) computation time for the pipelined implementation is found to be N × M × t_clock, and is confirmed by our experiment. Further extensions of this method are now in development: we are studying how to encompass Riccati equations coming from more general partial differential equations in higher-dimensional domains.