Wavelet-in-time multigrid-in-space preconditioning of parabolic evolution equations

Abstract. Two space-time variational formulations of linear parabolic evolution equations are considered, one is symmetric and elliptic on the trial space while the other is not. In each case, a space-time Petrov–Galerkin discretization using suitable tensor product trial and test functions leads to a large linear system of equations. The well-posedness of this system with respect to parabolic norms induces a canonical preconditioner for the algebraic equations that arise after a choice of basis. For the iterative resolution of this algebraic system with parallelization in the temporal direction we propose a sparse algebraic wavelet-in-time transformation on possibly nonuniform temporal meshes. This transformation approximately block-diagonalizes the preconditioner, and the individual spatial blocks can then be inverted, for instance, by standard spatial multigrid methods in parallel. The performance of the preconditioner is documented in a series of numerical experiments.


1. Introduction and problem formulation.
1.1. Introduction. We present a preconditioner for space-time simultaneous Petrov–Galerkin discretizations of linear parabolic evolution equations that is suitable for large-scale parallel-in-time computation. The basic rationale for the design of the preconditioner is the ability to use existing spatial finite element codes and preconditioners for elliptic problems, such as multigrid. In the context of large-scale parallel computation we therefore stipulate that the space-time solution vector is distributed across computing nodes along the temporal dimension, so that a node, or a group of nodes, holds only a few temporal snapshots of the solution out of many; by many we mean hundreds or thousands of snapshots. Space-time simultaneous iterative computation then calls for reducing the communication between these groups of nodes as far as possible, in particular when it comes to preconditioning. This is achieved by a transformation to a temporal multiscale basis.
We consider two different space-time variational formulations of the linear parabolic evolution equation (Section 1.3). The key difference between them is that one is symmetric and elliptic while the other is not. The lack of symmetry has implications for the choice of discrete trial and test spaces and for the choice of the iterative solver (Sections 2.1–2.4). The symmetric formulation, on the other hand, involves the inverse of the spatial operator, which renders its numerical implementation more challenging. However, we will show that both formulations admit the same canonical preconditioner, induced by the continuous norm on the trial space (Section 2.4). Throughout, we assume the regime of small Péclet numbers; that is, we assume that the symmetric part of the generator is elliptic and the antisymmetric part is relatively small.
As we shall explain (Section 3.2.2), the space-time preconditioner involves the construction of a temporal piecewise polynomial multiscale basis with compact support which can be rescaled to a Riesz basis both in the Lebesgue space $L^2$ and in the Sobolev space $H^1$. Many such bases of wavelet type are known, but their construction is typically based on dyadic refinement, posing restrictions on the temporal mesh. The classical fast Fourier transform suffers from the same drawback. For piecewise linear spline spaces we provide an algebraic algorithm that works on arbitrary temporal meshes (Section 3.3). The algorithm proceeds as follows: a) identify the basis functions in the top frequency band; b) approximately orthogonalize them against those of lower frequency; c) split off the resulting detail space; d) repeat. The approximate iterative orthogonalization in the second step makes it possible to produce compactly supported wavelet-like functions. Owing to the purely algebraic approach, no complications appear at the boundary of the interval. Extensions to splines of higher polynomial degree are of interest but will not be discussed here.
The sparsity of the multiscale transformation to the wavelet-like basis in the temporal variable is crucial for reducing the interprocess communication cost. Moreover, the Riesz basis property simultaneously in $L^2$ and $H^1$ allows us to transform and approximate the nonlocal canonical preconditioner as a block-diagonal one, where each block corresponds to a spatial Helmholtz problem with an imaginary frequency. For the purpose of preconditioning, this block can be replaced by multigrid versions of positive definite Helmholtz problems with real frequency while maintaining control over the quality of the preconditioner (Section 3.2). This block-diagonal preconditioner can then be applied in parallel. We emphasize that parallelization is achieved in the temporal direction, but the effect of the multiscale wavelet-like transformation in time is such that the individual spatial blocks of the preconditioner do not correspond to small temporal subintervals, but rather to wave packets of different temporal scales.
In Section 4 we present numerical experiments that document the quality of the preconditioner and its performance in the space-time simultaneous resolution of parabolic evolution equations. We conclude in Section 5.
In the Appendix we provide A) a Matlab code for the construction of the wavelet-like basis, and B) a generalized LSQR algorithm that is used in the complete space-time algorithm.
1.2. Problem class. Let $V$ and $H$ be real separable Hilbert spaces with continuous and dense embedding $V \hookrightarrow H$. The Hilbert space $H$ is identified with its (continuous) dual $H'$ by the Riesz isomorphism. This results in the Gelfand triple $V \hookrightarrow H \cong H' \hookrightarrow V'$ with continuous and dense embeddings. We write $\|\cdot\|_V$ for the norm on $V$. The scalar product on $H$ and (its continuous extension to) the duality pairing between $V'$ and $V$ is denoted by $(\cdot,\cdot)$. For the duality pairing on other Banach spaces we write $\langle\cdot,\cdot\rangle$. We write $L^2(J;V)$ for the Bochner space of square-integrable $V$-valued functions on $J$, and $H^1(J;V)$ for the Bochner–Sobolev subspace of weakly differentiable functions $w$ on $J$ with derivative $d_t w$ in $L^2(J;V)$, see [22] or [39, Chapter 1]. Similar notation is employed for other instances of Hilbert spaces.
Let $J = (0,T)$ be a nonempty bounded interval. Let $A(t) : V \to V'$, (a.e.) $t \in J$, be a family of bounded linear operators. We assume that, for some fixed constants $C > 0$, $\alpha > 0$, and $\gamma_0 \ge 0$, the family $A(t)$, $t \in J$, satisfies the conditions:
a) for all $\chi, \tilde\chi \in V$, the mapping $t \mapsto \langle A(t)\chi, \tilde\chi\rangle$ is Lebesgue measurable;  (1)
b) $|\langle A(t)\chi, \tilde\chi\rangle| \le C \|\chi\|_V \|\tilde\chi\|_V$ for all $\chi, \tilde\chi \in V$ and a.e. $t \in J$;  (2)
c) $\langle A(t)\chi, \chi\rangle \ge \alpha \|\chi\|_V^2 - \gamma_0^2 \|\chi\|_H^2$ for all $\chi \in V$ and a.e. $t \in J$.  (3)
Let $f \in L^2(J;V')$ and $g \in H$ be given. The subject of this paper is the abstract linear parabolic evolution equation
$d_t u + A u = f$ on $J$, $\quad u(0) = g$,  (4)
where the equalities are understood in $V'$ and $H$, respectively. Equivalently, one could write $d_t u + Au = f$ with equality in $L^2(J;V')$, where $Au$ stands for the map $t \mapsto A(t)u(t)$, $t \in J$. In order to make the concept of a solution precise, we introduce the Bochner–Sobolev spaces
$X := L^2(J;V) \cap H^1(J;V')$ and $Y := L^2(J;V) \times H$.  (5)
A generic element of $Y$ has the form $v = (v_1, v_0)$. A norm on $X$ is given by
$|||w|||_X := \big( \|w\|_{L^2(J;V)}^2 + \|d_t w\|_{L^2(J;V')}^2 \big)^{1/2}$.  (6)
We recall the continuous embedding $X \hookrightarrow C^0([0,T];H)$. In particular, $w(t_0)$ is well-defined in $H$ for any $w \in X$ and $0 \le t_0 \le T$, and $w \mapsto w(t_0)$ is continuous, see [22, Section 5.9.2] or [39, Chapter 1]. An equivalent norm (with equivalence constants depending on $T$, see Section 2.5) is therefore given by
$\|w\|_X := \big( |||w|||_X^2 + \|w(T)\|_H^2 \big)^{1/2}$,  (7)
and henceforth, $X$ is understood to be equipped with this norm. With the obvious corresponding inner product, it is a Hilbert space. The choice of the norm can be motivated as follows. Suppose $A$ is the Riesz isomorphism $V \to V'$. Then the identity $\|d_t w + A w\|_{L^2(J;V')}^2 + \|w(0)\|_H^2 = \|w\|_X^2$ can be verified by expanding the square and using integration by parts. Therefore, in this special case, the mapping $w \mapsto (d_t w + A w, w(0))$ is an isometry from $X$ to $Y$. Under the stated assumptions on $A$ and the data $f$ and $g$, there exists a unique solution $u \in X$ to (4), and it depends continuously on the data [39, Chapter 3, §4.7].
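For the record, the expansion behind this motivation can be written out as follows (a reconstruction from the definitions above; $(\cdot,\cdot)_{V'}$ denotes the inner product of $V'$, for which $\|A w\|_{V'} = \|w\|_V$ when $A$ is the Riesz isomorphism):

```latex
\begin{aligned}
\|d_t w + A w\|_{L^2(J;V')}^2 + \|w(0)\|_H^2
  &= \int_J \|d_t w\|_{V'}^2 + \|A w\|_{V'}^2 + 2\,(d_t w, A w)_{V'} \,\mathrm{d}t
     + \|w(0)\|_H^2 \\
  &= |||w|||_X^2 + \int_J 2\,\langle d_t w, w\rangle \,\mathrm{d}t + \|w(0)\|_H^2 \\
  &= |||w|||_X^2 + \big( \|w(T)\|_H^2 - \|w(0)\|_H^2 \big) + \|w(0)\|_H^2
   = \|w\|_X^2 ,
\end{aligned}
```

since $(d_t w, A w)_{V'} = \langle d_t w, A^{-1} A w\rangle = \langle d_t w, w\rangle$ and $\int_J 2\langle d_t w, w\rangle\,\mathrm{d}t = \|w(T)\|_H^2 - \|w(0)\|_H^2$ by integration by parts.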
In view of the usual transformation $u \mapsto v := [t \mapsto u(t)e^{-\gamma_0^2 t}]$, which is an automorphism on $X$, we assume from now on without loss of generality that $\gamma_0 = 0$.
1.3. Space-time variational formulations. Our first space-time variational formulation is from [46]. It is based on testing $d_t u + Au = f$ by $v_1 \in L^2(J;V)$ and appending the initial condition $u(0) = g$ using a Lagrange multiplier $v_0 \in H$. Specifically, define the bounded linear operator $B : X \to Y'$ by
$\langle B w, v\rangle := \int_J \langle d_t w + A w, v_1\rangle \, dt + (w(0), v_0)$  (8)
and the bounded linear functional $F \in Y'$ by
$\langle F, v\rangle := \int_J \langle f, v_1\rangle \, dt + (g, v_0)$.  (9)
Here and in the following we omit the dependence of the integrands on $t$. The space-time variational formulation then reads:
Find $u \in X$ : $\langle B u, v\rangle = \langle F, v\rangle$ for all $v \in Y$.  (10)
Our second variational formulation is an instance of the Brézis–Ekeland–Nayroles variational principle [15,41]. To state it, we introduce the notations
$\hat A := \tfrac12(A + A')$, $\quad \check A := \tfrac12(A - A')$, $\quad C := d_t + \check A$,  (11)
where $\hat A$ and $\check A$ denote the symmetric and the anti-symmetric part of $A$, respectively. Define the bounded linear operator $\bar B : X \to X'$ by
$\langle \bar B w, v\rangle := \int_J \langle \hat A w, v\rangle + \langle C w, \hat A^{-1} C v\rangle \, dt + (w(T), v(T))$  (12)
for $w, v \in X$, and the bounded linear functional $\bar F \in X'$ by
$\langle \bar F, v\rangle := \int_J \langle f, v + \hat A^{-1} C v\rangle \, dt + (g, v(0))$.  (13)
The second space-time variational formulation then reads:
Find $u \in X$ : $\langle \bar B u, v\rangle = \langle \bar F, v\rangle$ for all $v \in X$.  (14)
The essential property of this space-time variational formulation, in contrast to (10), is that the trial space and the test space coincide, and the operator $\bar B$ is symmetric, that is, $\langle \bar B w, v\rangle = \langle \bar B v, w\rangle$ for all $w, v \in X$. Moreover, $\bar B$ is elliptic on $X$:
$\langle \bar B w, w\rangle \ge \gamma \|w\|_X^2$ for all $w \in X$, with some constant $\gamma > 0$,  (15)
see [2, Proposition 3.2.26], so the Lax–Milgram lemma shows that there is a unique solution $u$ to (14), and it depends continuously on the data. It is instructive and straightforward to verify that $u$ from (4) satisfies the variational formulation (14).
For completeness, we mention another space-time variational formulation [7,8,27,16,40], which is based on testing $d_t u + Au = f$ by $v \in X$ with $v(T) = 0$, performing integration by parts in time, and putting the adjoint of $A$ onto the test function. This exposes $u(0)$, which is then replaced by the known initial datum $g$. The result is the space-time variational formulation: Find $u \in L^2(J;V)$ such that
$\int_J \langle u, -d_t v + A' v\rangle \, dt = \int_J \langle f, v\rangle \, dt + (g, v(0))$ for all $v \in X$ with $v(T) = 0$.
It can be treated analogously to the first one. Yet another space-time variational formulation, based on fractional temporal derivatives, was given in [37], with the effect that only $H$ and $V$ appear in the definitions of the trial and test spaces, but not $V'$. Finally, time-periodic problems with the periodicity condition $u(0) = u(T)$ require only minor modifications and will not be discussed here.
In the remainder of the paper we focus entirely on space-time full tensor product discretizations of the variational formulations (10) and (14). We point out, however, that they can serve as a basis for space-time adaptive and space-time compressive discretizations. Some references in that direction are given in the next section.
Those works do not exploit the fact that parabolic evolution equations define well-posed operator equations between Bochner spaces, and thus admit well-posed space-time variational formulations. This was first done in [7,8] using conforming discretizations of a space-time variational formulation of the heat equation with piecewise polynomials in time. In [46] and subsequently [16,17,34,37], the applicability of adaptive wavelet methods to parabolic evolution equations was shown. By means of a reduction to a boundary integral equation [20], stable space-time compressive algorithms were constructed in [18]. Well-posed space-time variational formulations and stable a priori Petrov–Galerkin discretizations were derived and implemented in [36,40,4,5], space-time compressive variants in [2,3,6]. Quasi-optimality in space-time norms and space-time adaptivity were investigated in [49].

2. Conforming space-time tensor product discretization.
2.1. Space-time tensor product subspaces. The discrete space-time trial and test spaces are built from nested finite-dimensional univariate temporal subspaces $E_L \subset H^1(J)$, $F_L \subset L^2(J)$, and spatial subspaces $V_L \subset V$, parameterized by an integer $L \ge 0$. We then define the discrete trial and test spaces (recall $X$ and $Y$ from (5))
$X_L := E_L \otimes V_L$ and $Y_L := (F_L \otimes V_L) \times V_L$,  (16)
where $\otimes$ denotes the algebraic tensor product. If $\bigcup_{L\ge0} E_L$ is dense in $H^1(J)$, and analogously for the families $F_L$ and $V_L$, then the unions of the discrete trial and test spaces are dense in $X$ and $Y$, respectively. We assume henceforth that all subspaces are nontrivial and satisfy the discrete inf-sup condition
$\gamma_L := \inf_{0\ne w\in X_L} \sup_{0\ne v\in Y_L} \dfrac{\langle B w, v\rangle}{\|w\|_X \|v\|_Y} > 0$,  (17)
where $B$ is as in (8). Note that at this stage, $\gamma_L > 0$ may not be bounded away from zero as $L \to \infty$. By ellipticity (15) of the symmetric operator $\bar B$, its discrete inf-sup constant $\bar\gamma_L$ automatically satisfies $\bar\gamma_L \ge \gamma > 0$ when $X_L$ is used simultaneously as the trial and as the test space.
In the quantification of the well-posedness of the discrete problems, the quantity
$\Gamma_L := \sup_{0\ne w\in X_L} \sup_{0\ne v\in Y_L} \dfrac{\langle B w, v\rangle}{\|w\|_X \|v\|_Y}$  (18)
will also play a role. It is clear that $\Gamma_L \le \|B\|$. Similarly, the corresponding sup-sup quantity $\bar\Gamma_L$ for $\bar B$ satisfies $\bar\Gamma_L \le \|\bar B\|$. The discrete inf-sup condition (17) necessitates $\dim X_L \le \dim Y_L$. We discuss the cases $\dim X_L = \dim Y_L$ and $\dim X_L \le \dim Y_L$ in Sections 2.2 and 2.3, respectively.

2.2. Discrete variational formulation.
In this section we assume that $X_L \subset X$ and $Y_L \subset Y$ have the same dimension. The existence of such subspaces satisfying the discrete inf-sup condition (17) is guaranteed by the bounded invertibility of $B : X \to Y'$. Practical instances are more subtle due to the nonsymmetric contribution of the temporal derivative; however, in [5] it was shown that collocation Runge–Kutta time-stepping schemes applied to spatial semi-discretizations admit an interpretation as a discrete space-time variational formulation, see (19) below, with discrete trial and test spaces of the form (16). In addition, discrete stability $\gamma_L \ge \gamma > 0$ in (17) may be achieved for all $L \ge 0$ under mild assumptions on the discretization parameters.
Given such $X_L$ and $Y_L$, it is straightforward to define the discrete solution by the discrete nonsymmetric space-time variational formulation
Find $u_L \in X_L$ : $\langle B u_L, v\rangle = \langle F, v\rangle$ for all $v \in Y_L$.  (19)
With the assumption that the discrete inf-sup condition (17) holds for $X_L$ and $Y_L$, a unique solution $u_L \in X_L$ to (19) exists by standard finite element method theory. For any nontrivial subspace $X_L \subset X$, symmetry and ellipticity (15) of $\bar B$ guarantee its ellipticity on $X_L$. The Lax–Milgram theorem then provides a unique solution to the discrete symmetric space-time variational formulation
Find $\bar u_L \in X_L$ : $\langle \bar B \bar u_L, v\rangle = \langle \bar F, v\rangle$ for all $v \in X_L$.  (20)
Again, we emphasize that $X_L$ is used as the trial and as the test space simultaneously. We will content ourselves with the full tensor product subspace $X_L$ as in (16), but we mention that space-time compressive discretizations $X_L$ and $Y_L$ could be used for (19) and for (20), see [3].
Let $\Phi$ be a basis for $X_L$, and $\Psi$ a basis for $Y_L$. Here, and where appropriate, we omit the subscript $L$ for readability. We write $u_L = \Phi^T \mathbf u$ with a vector of coefficients $\mathbf u \in \mathbb R^\Phi$ indexed by the elements of $\Phi$, and define the system matrix $\mathbf B \in \mathbb R^{\Psi\times\Phi}$ with the components $\mathbf B_{\psi\phi} = \langle B\phi, \psi\rangle$, as well as the load vector $\mathbf F \in \mathbb R^\Psi$ with the components $\mathbf F_\psi = \langle F, \psi\rangle$, where $(\phi, \psi) \in \Phi \times \Psi$. Then the discrete variational formulation (19) is equivalent to the algebraic equation
$\mathbf B \mathbf u = \mathbf F$.  (21)
The discrete inf-sup condition (17) implies that $\mathbf B$ is injective, so that (21), and also (24) below, is uniquely solvable. Defining $\bar{\mathbf B} \in \mathbb R^{\Phi\times\Phi}$ and $\bar{\mathbf F} \in \mathbb R^\Phi$ analogously, and writing $\bar u_L := \Phi^T \bar{\mathbf u}$, the symmetric discrete variational formulation (20) is equivalent to the algebraic equation
$\bar{\mathbf B} \bar{\mathbf u} = \bar{\mathbf F}$.  (22)
2.3. Minimal residual variational formulation. For details and proofs for this section we refer to [3,2]. Contrary to the preceding section, we now assume that $\dim Y_L \ge \dim X_L$, both finite but possibly unequal. The motivation for considering this case is the fact that the discrete inf-sup condition (17), and indeed discrete stability $\gamma_L \ge \gamma > 0$, is then easier to achieve, for one is allowed to choose any discrete test space $Y_L$ that is "large enough". In this case the discrete variational formulation (19) is meaningless. Instead, we define the discrete solution as the minimizer of the functional residual by
$u_L := \operatorname{argmin}_{w_L \in X_L} \| F - B w_L \|_{(Y_L)'}$.  (23)
In the particular case $\dim X_L = \dim Y_L$ this formulation reduces to (19), so it is appropriate to use the same symbol $u_L$ here. Under the inf-sup condition (17) there exists a unique solution $u_L \in X_L$ to (23), it depends linearly and continuously on the load functional $F$, and the operator norm of the discrete solution mapping $F \mapsto u_L$ is bounded by $1/\gamma_L$. Defining $\mathbf B$, $\mathbf F$, and $\mathbf u$ as in the preceding section, the functional residual minimization (23) is equivalent to the algebraic equation
$\mathbf B^T \mathbf N^{-1} \mathbf B \mathbf u = \mathbf B^T \mathbf N^{-1} \mathbf F$,  (24)
where $\mathbf N$ is the symmetric positive definite matrix such that $\mathbf v^T \mathbf N \mathbf v = \|\Psi^T \mathbf v\|_Y^2$ for all coefficient vectors $\mathbf v \in \mathbb R^\Psi$. Note that (24) has the form of generalized
Gauss normal equations. If $\dim X_L = \dim Y_L$, then it is equivalent to (21), and therefore henceforth we consider (21) as a special case of (24) and do not discuss it separately. We will solve (24) iteratively with the generalized LSQR algorithm provided in Appendix B.
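The structure of (24) can be illustrated on a tiny dense example (an illustrative sketch only: $\mathbf B$, $\mathbf N$, $\mathbf F$ are random stand-ins rather than assembled FEM matrices, and in practice one avoids forming $\mathbf N^{-1}$ explicitly and uses the generalized LSQR of Appendix B instead):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 10, 6                      # dim Y_L > dim X_L
B = rng.standard_normal((m, n))   # injective with probability one
Q = rng.standard_normal((m, m))
N = Q @ Q.T + m * np.eye(m)       # SPD Gram matrix of the Y_L norm
F = rng.standard_normal(m)

# Normal equations B^T N^{-1} B u = B^T N^{-1} F of the functional residual
# minimization min_w || F - B w || in the dual norm induced by N^{-1}.
Ninv = np.linalg.inv(N)
u = np.linalg.solve(B.T @ Ninv @ B, B.T @ Ninv @ F)

def res(w):
    r = F - B @ w
    return r @ Ninv @ r           # squared dual-norm residual

# The gradient of the residual vanishes at u ...
assert np.allclose(B.T @ Ninv @ (F - B @ u), 0)
# ... and random perturbations only increase the residual.
for _ in range(100):
    assert res(u) <= res(u + 1e-3 * rng.standard_normal(n)) + 1e-10
```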
Instead of the exact $\mathbf N$ in (24), a spectrally equivalent approximation $\tilde{\mathbf N}$ may be used. If $\tilde{\mathbf u}$ denotes the corresponding solution, and $\tilde u_L := \Phi^T \tilde{\mathbf u}$, then a quasi-optimality estimate holds in which the constants of the spectral equivalence between $\tilde{\mathbf N}$ and $\mathbf N$ enter. Our objective now is to devise a preconditioner for the symmetric algebraic equations (22) and (24). This is the subject of the next section.

2.4. Parabolic space-time preconditioners.
As in the preceding Sections 2.2 and 2.3, assume $X_L \subset X$ and $Y_L \subset Y$ are nontrivial finite-dimensional subspaces that satisfy the discrete inf-sup condition (17), with $\dim X_L \le \dim Y_L$, possibly unequal. The algebraic equations (22) and (24) stem from a space-time Petrov–Galerkin discretization of a parabolic evolution equation, so that direct solution is likely impossible due to the large size of the system. Two basic methods for their approximate iterative solution are therefore the preconditioned Richardson iteration and the preconditioned conjugate gradient method. It may be preferable to apply the Richardson iteration to the symmetric formulation (22), due to the fact that even the forward application of $\bar{\mathbf B}$ requires solutions of elliptic problems. The preconditioned conjugate gradient method applied to the normal equations (24) is equivalent to the generalized LSQR method given in the Appendix.
Both iterative schemes necessitate a good preconditioner, that is, a symmetric positive definite matrix $\mathbf M$ such that the condition numbers (27) are small (or the eigenvalues are clustered) [26].
To exhibit such a preconditioner we exploit the mapping properties of the operators $B$ and $\bar B$. Given a basis $\Phi \subset X_L$, we define $\mathbf M \in \mathbb R^{\Phi\times\Phi}$ as the matrix such that $\mathbf w^T \mathbf M \mathbf w = \|\Phi^T \mathbf w\|_X^2$ for all coefficient vectors $\mathbf w$. Its components are then
$\mathbf M_{\varphi\phi} = (\varphi, \phi)_{L^2(J;V)} + (d_t\varphi, d_t\phi)_{L^2(J;V')} + (\varphi(T), \phi(T))$.  (26)
The variational characterization of the extremal singular values shows that the sets of singular values $\sigma$ of the preconditioned matrices satisfy
$\sigma(\mathbf N^{-1/2} \mathbf B \mathbf M^{-1/2}) \subset [\gamma_L, \Gamma_L]$ and $\sigma(\mathbf M^{-1/2} \bar{\mathbf B} \mathbf M^{-1/2}) \subset [\bar\gamma_L, \bar\Gamma_L]$,
and the intervals are the smallest possible, see [2, Proof of Proposition 4.2.3], and Section 2.1 and thereafter for the definition of these quantities. In practice, only $\mathbf M^{-1}$ and $\mathbf N^{-1}$ will be needed, but not their square roots. It follows that
$\operatorname{cond}_2(\mathbf M^{-1} \mathbf B^T \mathbf N^{-1} \mathbf B) \le (\Gamma_L/\gamma_L)^2$ and $\operatorname{cond}_2(\mathbf M^{-1} \bar{\mathbf B}) \le \bar\Gamma_L/\bar\gamma_L$  (27)
for the condition numbers with respect to the Euclidean norm.
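The relation between the preconditioned singular values and the discrete inf-sup quantities can be checked on small random matrices (an illustrative sketch; $\mathbf B$, $\mathbf M$, $\mathbf N$ here are arbitrary stand-ins, not assembled space-time matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 5  # dim Y_L >= dim X_L

B = rng.standard_normal((m, n))
def spd(k):
    Q = rng.standard_normal((k, k))
    return Q @ Q.T + k * np.eye(k)
M, N = spd(n), spd(m)            # SPD Gram matrices of trial and test norms

def inv_sqrt(S):
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

# gamma_L and Gamma_L are the extreme singular values of N^{-1/2} B M^{-1/2}.
sv = np.linalg.svd(inv_sqrt(N) @ B @ inv_sqrt(M), compute_uv=False)
gamma_L, Gamma_L = sv.min(), sv.max()

# Square roots are avoided in practice: the same numbers arise as generalized
# eigenvalues of B^T N^{-1} B w = lambda M w, with lambda = sigma^2.
K = B.T @ np.linalg.solve(N, B)
Lc = np.linalg.cholesky(M)
lam = np.linalg.eigvalsh(np.linalg.solve(Lc, np.linalg.solve(Lc, K).T))
assert np.allclose(np.sort(lam), np.sort(sv ** 2))

# Spectral condition number of the preconditioned normal equations.
assert np.isclose(lam.max() / lam.min(), (Gamma_L / gamma_L) ** 2)
```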
In Section 3 below we discuss the Kronecker structure of the system matrices $\mathbf B$ and $\bar{\mathbf B}$, the norm matrix $\mathbf N$, and the preconditioner $\mathbf M$. We then describe the structure of the inverses $\mathbf N^{-1}$ and $\mathbf M^{-1}$, and their practical approximations based on a wavelet-in-time multigrid-in-space transform.
2.5. Norm equivalence on the trial space. As noted above, the two norms $\|\cdot\|_X$ from (7) and $|||\cdot|||_X$ from (6) are equivalent. Since the norm $|||\cdot|||_X$ gives rise to a simpler preconditioner than (26), it is of interest to quantify this norm equivalence. Obviously, $|||\cdot|||_X \le \|\cdot\|_X$. For the reverse comparison we need to find the constant $C_X$ in the estimate $\|w(T)\|_H^2 \le C_X |||w|||_X^2$, $w \in X$. We shall do this only in the (relevant) situation that $H$ is infinite-dimensional and the embedding $V \hookrightarrow H$ is compact. In this case there exists a countable orthonormal basis $\{\sigma_k\}_{k\in\mathbb N}$ for $H$, which is also an orthogonal basis for $V$ and $V'$. Set $\lambda_k := \|\sigma_k\|_V$. The indexing is assumed to be such that $0 < \lambda_1 \le \lambda_k$ for all $k \in \mathbb N$. Then $1/\lambda_1$ is the norm of the embedding $V \hookrightarrow H$, appearing in the Friedrichs inequality when $V$ and $H$ are Sobolev spaces. Further, by orthonormality of the basis in $H$, the estimate decouples into the individual modes. This motivates the reduced question: given $\lambda > 0$, find the smallest $C > 0$ such that for all $f \in H^1(J)$ there holds
$|f(T)|^2 \le C \big( \lambda^{-2} \|f'\|_{L^2(J)}^2 + \lambda^2 \|f\|_{L^2(J)}^2 \big)$.
Herein, we can without loss of generality suppose that $f(T) = 1$. Then the question becomes that of minimizing the right-hand side. Applying the variational principle leads to the boundary value problem $-\lambda^{-2} f'' + \lambda^2 f = 0$ with the boundary conditions $f'(0) = 0$ and $f(T) = 1$. This problem can be solved explicitly, yielding the catenary $f(t) = \cosh(\lambda^2 t)/\cosh(\lambda^2 T)$ as the solution. A direct computation then shows $C = 1/\tanh(\lambda^2 T)$. This is a decreasing function of $\lambda$, so the worst mode is $\lambda_1$, and the best constant $C_X = 1/\tanh(\lambda_1^2 T)$ diverges as $T \to 0$. In summary, if $T$ is large enough, then the last term in (26) can be omitted without sacrificing the quality of the preconditioner $\mathbf M$.
3. Wavelet-in-time multigrid-in-space approximation. In order to simplify the further exposition of the main ideas relating to preconditioning, we shall assume from now on that in the parabolic evolution equation (4), the generator $t \mapsto A(t)$ is time-independent. Hence we write $A$ instead of $A(t)$. While the results are stable with respect to slight perturbations of this assumption, the discretization of the operators (8) and (12) becomes technical, and the quality of the proposed preconditioner may deteriorate for large variations of $t \mapsto A(t)$.
The assumption that $A$ is time-independent allows us to equip the space $V$ with the norm $\|\chi\|_V := \langle A\chi, \chi\rangle^{1/2}$, which by (1)–(3) is equivalent to the original norm.
In order to avoid in practice the computation of the dual norm $\|\cdot\|_{V'}$ that appears in the definition (7) of the norm $\|\cdot\|_X$, for any nontrivial finite-dimensional subspace $V_L \subset V$ we introduce $\|\chi\|_{V_L'}$ as the norm of the functional $\langle\chi,\cdot\rangle \in V'$ restricted to $V_L$, and the measure of self-duality $0 < \kappa_L \le 1$ as the largest constant satisfying
$\kappa_L \|\chi\|_{V'} \le \|\chi\|_{V_L'}$ for all $\chi \in V_L$.  (28)
Equivalently, $\kappa_L^{-1}$ is the norm of the $H$-orthogonal projector from $V$ to $V_L \subset V$, see [3, Lemma 6.2] and the discussion therein, as well as [10] for a quantitative analysis for finite element subspaces.
3.1. Kronecker product structure of discretized operators. Recall from (16) that the discrete trial and test spaces $X_L \subset X$ and $Y_L \subset Y$ are of the tensor product form $X_L = E_L \otimes V_L$ and $Y_L = (F_L \otimes V_L) \times V_L$. This has implications for the structure of the system matrices $\mathbf B$ and $\bar{\mathbf B}$ introduced in Section 2.2.
Let now $\Theta \subset E_L$, $\Xi \subset F_L$, and $\Sigma \subset V_L$ be bases for the respective spaces. Then $\Phi := \Theta \otimes \Sigma$ and $\Psi := (\Xi \otimes \Sigma) \times \Sigma$ are bases for $X_L$ and $Y_L$, respectively. The system matrix $\mathbf B \in \mathbb R^{\Psi\times\Phi}$ from Section 2.2, whose components are $\mathbf B_{\psi\phi} = \langle B\phi, \psi\rangle$, then has the block form
$\mathbf B = \begin{pmatrix} \mathbf C_t \otimes \mathbf M_x + \mathbf M_t^{FE} \otimes \mathbf A_x \\ \mathbf a_0^T \otimes \mathbf M_x \end{pmatrix}$,
where a) the "temporal FEM" matrices $\mathbf C_t, \mathbf M_t^{FE} \in \mathbb R^{\Xi\times\Theta}$ and the vector $\mathbf a_0 \in \mathbb R^\Theta$ have the components
$(\mathbf C_t)_{\xi\theta} := \int_J \theta'\xi \, dt$, $\quad (\mathbf M_t^{FE})_{\xi\theta} := \int_J \theta\xi \, dt$, $\quad (\mathbf a_0)_\theta := \theta(0)$,
and b) the usual "spatial FEM" mass and stiffness matrices $\mathbf M_x, \mathbf A_x \in \mathbb R^{\Sigma\times\Sigma}$ have the components
$(\mathbf M_x)_{\varsigma\sigma} := (\sigma, \varsigma)$, $\quad (\mathbf A_x)_{\varsigma\sigma} := \langle A\sigma, \varsigma\rangle$.
Let $\mathbf M_t^E$ and $\mathbf M_t^F$ denote the temporal mass matrices for $\Theta \subset E_L$ and $\Xi \subset F_L$, respectively, and $\mathbf A_t^E$ the temporal stiffness matrix for $\Theta \subset E_L$. Let $\bar{\mathbf A}_x$ be the symmetric part of $\mathbf A_x$. Then the preconditioner $\mathbf M$ from Section 2.4 satisfies the spectral equivalence
$\mathbf M \eqsim \bar{\mathbf M} := \mathbf M_t^E \otimes \bar{\mathbf A}_x + (\mathbf e_T \mathbf e_T^T) \otimes \mathbf M_x + \mathbf A_t^E \otimes (\mathbf M_x \bar{\mathbf A}_x^{-1} \mathbf M_x)$,  (33)
where $\mathbf e_T \in \mathbb R^\Theta$ has the components $(\mathbf e_T)_\theta := \theta(T)$. Note that $\mathbf e_T \mathbf e_T^T$ in (33) is a rank-one matrix in $\mathbb R^{\Theta\times\Theta}$ of the same size as $\mathbf M_t^E$ and $\mathbf A_t^E$. Employing the estimate (28) for the dual norm term of (33), together with the observation $\|\Sigma^T \mathbf c\|_{V_L'}^2 = \mathbf c^T \mathbf M_x \bar{\mathbf A}_x^{-1} \mathbf M_x \mathbf c$, shows the equivalence in (33). The norm-measuring matrix $\mathbf N$ from Section 2.3 is the block-diagonal matrix
$\mathbf N = \operatorname{diag}\big( \mathbf M_t^F \otimes \bar{\mathbf A}_x, \; \mathbf M_x \big)$.  (35)
Finally, the matrix $\bar{\mathbf B}$ from Section 2.2 has the form
$\bar{\mathbf B} = \mathbf M_t^E \otimes \bar{\mathbf A}_x + (\mathbf e_T \mathbf e_T^T) \otimes \mathbf M_x + \mathbf Z$,
where $\mathbf Z \in \mathbb R^{\Phi\times\Phi}$ is a symmetric matrix with the components $\mathbf Z_{\varphi\phi} := \int_J \langle \hat A^{-1} C\varphi, C\phi\rangle \, dt$. Details for a concrete example of $\mathbf Z$ are discussed in Section 4.1.3.

3.2. Multigrid-in-space. As indicated in Section 2.4, we require a practical way to apply the inverses of $\mathbf N$ and of the preconditioner $\mathbf M$.
3.2.1. Inverse of N. From (35) we have the representation
$\mathbf N^{-1} = \operatorname{diag}\big( (\mathbf M_t^F)^{-1} \otimes \bar{\mathbf A}_x^{-1}, \; \mathbf M_x^{-1} \big)$
for the inverse of $\mathbf N$. Using an $L^2$-orthonormal basis for the temporal test space, the factor $(\mathbf M_t^F)^{-1}$ is the identity, and the exact inverse $\bar{\mathbf A}_x^{-1}$ can be replaced by a multigrid iteration.
3.2.2. Inverse of M. Based on the discussion in Section 2.5 and the equivalence (33), instead of $\mathbf M$ we will consider the simpler matrix
$\tilde{\mathbf M} := \mathbf M_t \otimes \bar{\mathbf A}_x + \mathbf A_t \otimes (\mathbf M_x \bar{\mathbf A}_x^{-1} \mathbf M_x)$
as the preconditioner. For the sake of readability, we omit the superscript $E$ in the notation for the "temporal FEM" matrices. Let $\mathbf V_t$ be a transformation matrix of the same size as $\mathbf A_t$ and $\mathbf M_t$ such that $\mathbf V_t^T \mathbf M_t \mathbf V_t$ and $\mathbf V_t^T \mathbf A_t \mathbf V_t$ are spectrally equivalent to some diagonal matrices $\mathbf J_t$ and $\mathbf D_t$, respectively, where the constants of the equivalence should be close to one. We refer to the ratio of those constants as the condition number of the transformation. The canonical choice of $\mathbf V_t$ is the matrix collecting in its columns the $\mathbf M_t$-orthonormal eigenvectors of the generalized eigenvalue problem $\mathbf A_t \mathbf v = \lambda \mathbf M_t \mathbf v$, in which case $\mathbf J_t := \mathbf V_t^T \mathbf M_t \mathbf V_t$ is the identity and $\mathbf D_t := \mathbf V_t^T \mathbf A_t \mathbf V_t$ has the eigenvalues on the diagonal. In this case, the condition number of the transformation is one, but the drawback is that $\mathbf V_t$ is a dense matrix. A sparse alternative $\mathbf V_t := \mathbf T_t^T$ is given by the inverse wavelet(-like) transformation introduced in Section 3.3 below, and we take $\mathbf J_t$ as the diagonal of $\mathbf V_t^T \mathbf M_t \mathbf V_t$ and $\mathbf D_t$ as the diagonal of $\mathbf V_t^T \mathbf A_t \mathbf V_t$. The condition numbers are investigated in Section 3.3.2. We suppose now that $\mathbf V_t$ is such that
$\mathbf V_t^T \mathbf M_t \mathbf V_t \eqsim \mathbf J_t$ and $\mathbf V_t^T \mathbf A_t \mathbf V_t \eqsim \mathbf D_t$  (38)
with some nonnegative diagonal matrices $\mathbf J_t$ and $\mathbf D_t$, where $\mathbf J_t$ is positive definite (specifically, we use (48)). Let $\mathbf I_x$ denote the identity matrix of the same size as $\mathbf M_x$ and $\mathbf A_x$. We set $\mathbf T := \mathbf V_t^T \otimes \mathbf I_x$; then $\mathbf T^T = \mathbf V_t \otimes \mathbf I_x$. To obtain a computationally accessible approximation of $\tilde{\mathbf M}^{-1}$, we begin with the change of temporal basis
$\tilde{\mathbf M}^{-1} \approx \mathbf T^T \big( \mathbf J_t \otimes \bar{\mathbf A}_x + \mathbf D_t \otimes (\mathbf M_x \bar{\mathbf A}_x^{-1} \mathbf M_x) \big)^{-1} \mathbf T$.  (39)
This requires the approximation of the inverse of the block-diagonal matrix
$\mathbf J_t \otimes \bar{\mathbf A}_x + \mathbf D_t \otimes (\mathbf M_x \bar{\mathbf A}_x^{-1} \mathbf M_x)$,  (40)
where each block has the form
$j \bar{\mathbf A}_x + d\, \mathbf M_x \bar{\mathbf A}_x^{-1} \mathbf M_x = j\, \mathbf H_x^\gamma$ with $\mathbf H_x^\gamma := \bar{\mathbf A}_x + \gamma^2 \mathbf M_x \bar{\mathbf A}_x^{-1} \mathbf M_x$  (41)
with some nonnegative reals $j$ and $d$, where $\gamma := \sqrt{d/j}$, and we recall that $\bar{\mathbf A}_x$ denotes the symmetric part of the spatial FEM stiffness matrix $\mathbf A_x$. Thus, the inversion of the space-time matrix (40) is equivalent to a sequence of independent spatial problems involving the matrices $\mathbf H_x^\gamma$. Intuitively, these spatial blocks measure the energy content of the part of the solution associated to the temporal basis function of frequency $\gamma$. Now, the identity
$\mathbf H_x^\gamma = (\bar{\mathbf A}_x + i\gamma \mathbf M_x) \bar{\mathbf A}_x^{-1} (\bar{\mathbf A}_x - i\gamma \mathbf M_x)$,  (42)
where $i$ is the imaginary unit, is easily verified. Thus, solving (41) is equivalent to solving two Helmholtz problems with the imaginary parameter $\pm i\gamma$, with one forward application of $\bar{\mathbf A}_x$ between them. In order to avoid complex arithmetic, one can observe that the right-hand side of the identity is the Schur complement of a $2\times2$ block matrix that admits iterative solvers that are robust in $\gamma$ [54]. Recalling, however, that $\mathbf M$ is merely a preconditioner, we estimate $\mathbf H_x^\gamma$ by the composition of two Helmholtz problems with the real frequency $\gamma$, up to a factor of two: $\mathbf H_x^\gamma \le (\bar{\mathbf A}_x + \gamma \mathbf M_x) \bar{\mathbf A}_x^{-1} (\bar{\mathbf A}_x + \gamma \mathbf M_x) \le 2 \mathbf H_x^\gamma$. Indeed, one only has to verify $2\gamma \mathbf M_x \le \mathbf H_x^\gamma$, but this follows by changing to the eigenvector basis of the generalized eigenvalue problem $\bar{\mathbf A}_x \mathbf p = \mu \mathbf M_x \mathbf p$ together with the estimate $2\gamma \le \mu + \gamma^2/\mu$. Therefore, in the inverse of (40), which is again block-diagonal, we replace each spatial block by the corresponding real double Helmholtz problem, that is, by the application of the mapping
$\mathbf w \mapsto j^{-1} (\bar{\mathbf A}_x + \gamma \mathbf M_x)^{-1} \bar{\mathbf A}_x (\bar{\mathbf A}_x + \gamma \mathbf M_x)^{-1} \mathbf w$.  (43)
The exact solution of these positive definite Helmholtz problems can be replaced by one of the following strategies:
1. The multigrid iteration, which is robust in $\gamma > 0$ [43].
2. The (incomplete) LU or Cholesky factorization of the matrix $\bar{\mathbf A}_x + \gamma \mathbf M_x$. This is simpler to implement than the multigrid iteration but may not be applicable to very fine spatial discretizations.
3. The preconditioner from [14, Section 4.2], again robust in $\gamma$. In fact, it easily extends to a preconditioner for the compound matrix $\mathbf H_x^\gamma$ in (42) and can therefore replace the mapping (43).
4. A wavelet transform and a diagonal matrix, as in the temporal direction (38), which is robust in $\gamma$ if the wavelets form Riesz bases for $V$ and $H$.
As a further simplification, $\gamma$ may be rounded to one of a set of reference values $\{\gamma_i^{\rm ref}\}$, for instance logarithmically equispaced, such that for any occurring $\gamma$ there is a nearby reference value. This approximation makes it possible to speed up the application of the preconditioner through the recycling of matrices in the 1st strategy or of factorizations in the 2nd strategy, at the cost of another factor of two in the spectral equivalence (39) and additional memory consumption. In the numerical experiments we will use the 1st and the 2nd strategy with $\gamma_i^{\rm ref} := 2^i$. In view of parallelization, we reiterate here that (40) is a block-diagonal matrix.
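The identity (42) and the factor-two spectral equivalence of the real double Helmholtz composition can be checked numerically (an illustrative sketch with random SPD stand-ins for $\bar{\mathbf A}_x$ and $\mathbf M_x$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 30, 7.5

# Random SPD stand-ins for the symmetrized stiffness A and the mass M.
def spd(k):
    Q = rng.standard_normal((k, k))
    return Q @ Q.T + k * np.eye(k)
A, M = spd(n), spd(n)

# H = A + gamma^2 M A^{-1} M and its complex factorization, as in (42).
H = A + gamma**2 * M @ np.linalg.solve(A, M)
F = (A + 1j * gamma * M) @ np.linalg.solve(A, A - 1j * gamma * M)
assert np.allclose(F.imag, 0) and np.allclose(F.real, H)

# Real double Helmholtz: G = (A + gamma M) A^{-1} (A + gamma M) = H + 2 gamma M,
# and H <= G <= 2H in the sense of quadratic forms (a factor of two at worst).
G = (A + gamma * M) @ np.linalg.solve(A, A + gamma * M)
assert np.allclose(G, H + 2 * gamma * M)
Lc = np.linalg.cholesky(H)
lam = np.linalg.eigvalsh(np.linalg.solve(Lc, np.linalg.solve(Lc, G).T))
assert lam.min() > 1 - 1e-8 and lam.max() < 2 + 1e-8
```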
In summary, the following approximations have been made to obtain a tractable version of the inverse of $\mathbf M$: replace the $\|\cdot\|_{V'}$ norm by the $\|\cdot\|_{V_L'}$ norm using (28); omit the final-time term in (33) as quantified in Section 2.5; block-diagonalize with (38); round the frequency $\gamma$ to one of the reference values; approximate the inverse of each block through one of the four strategies above.

3.3. Construction of the inverse wavelet transform.
In this section we construct the inverse wavelet transform $\mathbf T_t$ such that the sparse matrix $\mathbf V_t := \mathbf T_t^T$ has the properties required in Section 3.2.2. The main idea is to identify the "most energetic" hat functions that lie in the top energy bandwidth of relative width $\eta > 0$, approximately orthogonalize them against the remaining ones, thus defining the fine-scale subspace, and to proceed recursively. A similar construction was given and analyzed in [51,52], but on a predefined hierarchy of meshes.

3.3.1. Construction.
Let $\mathbb T$ denote the collection of finite temporal meshes $T = \{0 =: t_0 < t_1 < \ldots < t_N := T\}$. Recall that $J = (0,T)$. For any $T \in \mathbb T$ let $\Theta(T)$ denote the set of hat functions (hats) on $T$, that is, the set of piecewise linear splines $\theta_t$ on $T$ that evaluate to one on some $t \in T$ and to zero on all other nodes. We let $s(T)$ denote the left-most node $s \in T \setminus \{0, T\}$ for which the energy $\varepsilon(\theta_s) := \|\theta_s\|_{H^1(J)}^2 / \|\theta_s\|_{L^2(J)}^2$ is maximized among the non-boundary hats $\theta_t \in \Theta(T) \setminus \{\theta_0, \theta_T\}$, and write $\varepsilon(T) := \varepsilon(\theta_{s(T)})$ for that maximal energy. A wavelet-like basis for the span of $\Theta(T)$ is constructed as follows. Assume given: 1. $T \in \mathbb T$ with $\#T \ge 3$. 2. A relative energy bandwidth parameter $\eta > 1$ (we will use $\eta := 1.9$).
Step 1. Given $T \in \mathbb T$, identify fine nodes $T_+ \subset T$ and coarse nodes $T_- \subset T$ by:
1. Set $T^0 := T$ and $n := 0$.
2. While $\#T^n \ge 3$ and $\varepsilon(T^n) \ge \eta^{-1} \varepsilon(T^0)$ do: set $T^{n+1} := T^n \setminus \{s(T^n)\}$ and increase $n$ by one.
3. Define $T_- := T^n$ and $T_+ := T \setminus T_-$.
Please note that, after the $n$-th iteration, $\Theta(T^{n+1})$ does not equal $\Theta(T^n) \setminus \{\theta_{s(T^n)}\}$, because removing a node from $T^n$ modifies the hats associated with the two neighboring nodes of $s(T^n) \in T^n$ by widening their support. Yet, obviously, $\Theta(T^n)$ and $\{\theta_{s(T^n)}\} \cup \Theta(T^{n+1})$ span the same space. On a uniform mesh $T$, in particular, the algorithm collects every second node of $T \setminus \{0, T\}$ in $T_-$ and every other in $T_+$. Also, each intermediate $\Theta(T^n)$ constitutes a partition of unity on the interval $J$, and so do the coarse hats $\Theta(T_-)$.
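The node selection of Step 1 can be sketched in a few lines (an illustrative Python rendition; the function names are ours, and the Matlab code in the Appendix is the reference implementation):

```python
import numpy as np

def hat_energy(T, i):
    """Rayleigh quotient ||theta'||^2 / ||theta||^2 for the hat at interior
    node T[i]: ||theta'||^2 = 1/hl + 1/hr and ||theta||^2 = (hl + hr)/3."""
    hl, hr = T[i] - T[i - 1], T[i + 1] - T[i]
    return (1.0 / hl + 1.0 / hr) / ((hl + hr) / 3.0)

def split_nodes(T, eta=1.9):
    """Step 1: repeatedly remove the left-most maximal-energy interior node
    while its energy stays within the relative bandwidth eta of the initial
    maximum; removed nodes form T_+, the remaining mesh is T_-."""
    T = list(T)
    eps0 = max(hat_energy(T, i) for i in range(1, len(T) - 1))
    fine = []
    while len(T) >= 3:
        energies = [hat_energy(T, i) for i in range(1, len(T) - 1)]
        i = 1 + int(np.argmax(energies))      # left-most maximizer
        if energies[i - 1] < eps0 / eta:
            break
        fine.append(T.pop(i))                 # node joins T_+
    return fine, T                            # (T_+, T_-)

# On a uniform mesh the algorithm selects every second interior node.
mesh = np.linspace(0.0, 1.0, 9)
Tp, Tm = split_nodes(mesh)
assert sorted(Tp) == [0.125, 0.375, 0.625, 0.875]
assert Tm == [0.0, 0.25, 0.5, 0.75, 1.0]
```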
Step 2. Let $\Theta_+(T)$ denote the set of hats that are removed from $\Theta(T^n)$ over all iterations $n$ during Step 1. In general, the set of obtained fine hats $\Theta_+(T)$ is not the same as $\{\theta_t \in \Theta(T) : t \in T_+\}$, because neighboring nodes of $T$ may have been selected into $T_+$. Now, the basis $\Theta_+(T) \cup \Theta(T_-)$ is a two-scale basis for the span of $\Theta(T)$. To obtain a wavelet-like basis $\Psi_+(T)$ for the detail space spanned by $\Theta_+(T)$ we perform an approximate orthogonalization of the fine hats $\Theta_+(T)$ against the coarse hats $\Theta(T_-)$ using the mapping
$P_{T_-} \theta := \theta - \sum_{\zeta \in \Theta(T_-)} \dfrac{(\theta, \zeta)_{L^2(J)}}{(\zeta, 1)_{L^2(J)}} \, \zeta$,  (44)
which can be interpreted as a Jacobi preconditioned Richardson iteration for the exact $L^2$ orthogonalization. The wavelet-like basis $\Psi_+(T)$ is defined by
$\Psi_+(T) := \{ [P_{T_-}]^\nu \theta : \theta \in \Theta_+(T) \}$,
where $\nu \in \mathbb N$ was given. An application of $P_{T_-}$ to a compactly supported $\theta$ increases its support by at most the support of the few coarse hats that are nonzero there.
Once the algorithm terminates, we have constructed a multiscale basis for the span of the original hats $\Theta(T)$. Since the number of levels is not known in advance, $K+1$ denotes the coarsest level, while $0$ denotes the finest.
Clearly, the mapping $P_{T_-}$ in (44) operates only in the subspace spanned by $\Theta(T_-)$, leaving its orthogonal complement in $L^2(J)$ unchanged. Since the hats $\Theta(T_-)$ constitute a partition of unity on $J$, we have $P_{T_-} 1 = 1 - \sum_{\zeta \in \Theta(T_-)} \zeta = 0$. Therefore, $P_{T_-}\theta$ has at least one vanishing moment, so it is indeed a wavelet-like oscillating function. In the limit $\nu \to \infty$, the function $[P_{T_-}]^\nu \theta$ is orthogonal to the span of $\Theta(T_-)$.
Some typical wavelet-like basis functions are shown in Figure 3. On the uniform mesh, there is a striking similarity to the $\psi^{2,2\nu}$ family of wavelets of [19]. But unlike those, our wavelet-like functions do not have $2\nu$ vanishing moments, hence are not exactly the same.
In the Appendix we provide a Matlab code that constructs the coefficient transformation matrix $\mathbf T_t$ such that the mass matrix with respect to the multiscale basis is $\mathbf T_t \mathbf M_t \mathbf T_t^T$, where $\mathbf M_t$ is the mass matrix with respect to the original hat basis $\Theta(T)$. The mapping $\mathbf w \mapsto \mathbf T_t^T \mathbf w$ is the inverse wavelet transformation, as it takes a vector of coefficients $\mathbf w$ with respect to the multiscale wavelet-like basis $\Psi(T)$ and returns a vector of coefficients with respect to the hat basis $\Theta(T)$.
The transformation can be performed in a matrix-free fashion with the linear complexity known from wavelet pyramid schemes. However, we have in mind the situation where it is applied (from the right) to a distributed matrix that stores spatial vectors in its columns. This is because (T_t ⊗ I_x) Vec(U) = Vec(U T_t^T) is precisely the type of operation required in (39), where U is a matrix of suitable size and Vec(·) stacks the columns of its argument one after the other into a long vector; see Section 4.2.4. We therefore believe that it may be beneficial, at least at first, to formulate and test the transformation in matrix form.
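The identity can be checked directly with a small NumPy sketch (toy dimensions and random stand-ins for T_t and U; Vec(·) is column-major flattening):

```python
import numpy as np

rng = np.random.default_rng(0)
nx, nt = 5, 4                        # spatial / temporal sizes (toy values)
Tt = rng.standard_normal((nt, nt))   # stand-in for the wavelet transform matrix
U  = rng.standard_normal((nx, nt))   # columns = spatial vectors per temporal dof

vec = lambda A: A.flatten(order="F") # Vec: stack the columns

lhs = np.kron(Tt, np.eye(nx)) @ vec(U)
rhs = vec(U @ Tt.T)
print(np.allclose(lhs, rhs))         # True
```

This is the standard identity vec(A X B) = (B^T ⊗ A) vec(X) with A = I_x and B = T_t^T; the right-hand side never forms the Kronecker product.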

Condition numbers of the basis.
We compute the condition number of the wavelet-like basis Ψ(T_k) constructed in Section 3.3.1 for a test sequence of temporal meshes T_k, k = 1, ..., 11. Each T_k contains a uniform temporal mesh on [0, T] with 2^k + 1 nodes and additionally the nodes 1/2 ± 2^{−r}, r = 1, ..., 20, which provide geometric refinement towards t = 1/2. The condition number in L^2(J) and in H^1(J) is the ratio of the Riesz basis constants of the normalized wavelet-like basis in L^2(J) and in H^1(J), respectively. These constants are precisely the extremal eigenvalues of {T_t M_t T_t^T} and {T_t A_t T_t^T}, where {X} denotes the rescaled matrix and M_t and A_t are the mass and the stiffness matrix with respect to the original hat basis Θ(T_k), cf. Section 3.3. Setting V_t := T_t^T, the equivalence (38) is satisfied with those constants. Additionally, we measure the percentage of nonzeros in the transformation matrix T_t.
The results as a function of the number ν of orthogonalization steps are shown in Figure 1 for T_k with k = 11. It appears that ν = 2 is the appropriate choice, because for larger ν the condition number in L^2(J) remains stable while that in H^1(J) increases slightly. Moreover, the percentage of nonzeros grows roughly linearly with ν, as long as ν is small.
The dependence of the condition number on the number of nodes in T_k, k = 1, ..., 11, is shown in Figure 2 for ν = 1, 2, 3 orthogonalization steps. The condition number in L^2(J) is robust as the number of nodes increases, while that in H^1(J) exhibits a slight growth. Again, ν = 2 seems to be a good choice.
4. Numerical results. In Section 4.1 we document the quality of the space-time preconditioners using a series of numerical examples in one spatial dimension. In Section 4.2 we apply the preconditioner to the space-time resolution of a model parabolic evolution equation in two spatial dimensions with parallelization in time.

4.1. Quality of the space-time preconditioner. In this section we investigate the quality of the proposed preconditioner on small-scale examples by computing the condition numbers of the preconditioned system.

Model problem.
In order to investigate the performance of the space-time preconditioners in detail, we consider the model problem of convection-diffusion on the interval D := (0, 1). We fix the end time T := 1. Specifically, we assume for (1)–(3) that H := L^2(D), V := H^1_0(D), and A := −∂_x^2 + β ∂_x with some constant β ∈ R. The symmetric part of A is then −∂_x^2, and the skew-symmetric part is β ∂_x. Of particular interest is the performance of the preconditioner with respect to the drift velocity β.

Computation of the condition numbers.
The condition number κ_2 of the preconditioned systems (27) with different approximate variants of the inverses is computed by means of a power iteration for the maximal and the minimal eigenvalue of the associated generalized eigenvalue problems. The condition number is then obtained as the ratio of the maximal to the minimal eigenvalue. For instance, in order to approximate the maximal eigenvalue, the power iteration is started with an all-ones vector. The estimate e_max^n then converges to the maximal eigenvalue, and we stop the iteration when the relative improvement is less than 10^{−4}. In order to compute the minimal eigenvalue, more work is required. Inverting the relevant operator exactly results in significantly lower condition numbers, which are, however, still not robust in β.
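A minimal sketch of the power-iteration part with the relative-improvement stopping criterion might look as follows in Python (the function name and the toy diagonal matrix are illustrative only; the actual computation applies the preconditioned operator of the generalized eigenvalue problem):

```python
import numpy as np

def power_iteration(apply_A, n, tol=1e-4, max_iter=10_000):
    """Approximate the maximal eigenvalue of a symmetric positive definite
    operator by power iteration, starting from an all-ones vector and stopping
    once the relative improvement of the estimate drops below tol."""
    q = np.ones(n)
    e_old = 0.0
    for _ in range(max_iter):
        v = apply_A(q)
        e = q @ v / (q @ q)      # Rayleigh-quotient estimate e_max^n
        q = v / np.linalg.norm(v)
        if abs(e - e_old) <= tol * abs(e):
            return e
        e_old = e
    return e

A = np.diag([1.0, 2.0, 3.0, 10.0])
print(power_iteration(lambda x: A @ x, 4))   # close to 10
```

The minimal eigenvalue would be approximated analogously by an inverse iteration, which is where the extra work of inverting the operator enters.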

Number of temporal intervals
Optimization problems constrained by evolution equations like (4) provide a major motivation for space-time simultaneous discretizations, due to the coupling of the forward-in-time state equation with the backward-in-time adjoint equation; see for instance [45]. Consider the minimization of the standard tracking functional, where u is the control variable, y_d is the desired state, y is the state variable constrained by the parabolic evolution equation, and λ > 0 is a regularization parameter. Here A is given by (49) with zero drift β = 0. Introducing the dual variable p for the constraint, forming the first-order optimality conditions, eliminating the control variable, and discretizing the resulting linear system leads to the saddle-point problem (53) for some right-hand side vector that is of no importance here. Here, M_0 := M_t^E ⊗ M_x is the matrix corresponding to the space-time L^2 inner product. The large block matrix A is symmetric but indefinite, and therefore the system is typically solved using the preconditioned MinRes method [12, 53]. It is of particular interest to obtain computationally accessible preconditioners for (53) that are robust simultaneously in the discretization parameters and in the regularization parameter λ > 0. Having used the symmetric space-time variational formulation, we are in the situation that each block of the matrix is symmetric, so that the block-diagonal matrix built from these blocks is a good preconditioner for A, see [54, Section 4.1]. Replacing B by the spectrally equivalent M we obtain the block preconditioner.

Fig. 6. Condition number of the preconditioned system matrix for the nonsymmetric space-time variational formulation as a function of the number of temporal intervals, see Section 4.1.4. Left: Wavelet-in-time and exact computation in space. Each line from bottom to top corresponds to a value of drift velocity β = 0, 0.01, ..., 0.1. Right: For β = 0, wavelet-in-time multigrid-in-space preconditioner with different multigrid parameters: V- or W-cycle and the number of pre-/post-smoothing steps.
[30, Section 4.1]. The pre-smoother is defined by the Gauss-Seidel iteration q → (L + D)^{−1}(−Uq + f) and the post-smoother by q → (U + D)^{−1}(−Lq + f) for the equation (L + D + U)q = f with strictly lower/upper triangular L/U and diagonal D. This ensures symmetry of the multigrid procedure when the number of pre- and post-smoothing steps is the same. As the prolongation we use the canonical embedding of V_k into V_{k+1}; the restriction is its transpose. In order to approximate the action of (43), the multigrid procedure with one pre-smoothing step is applied with the matrix A_x + γM_x, then A_x is applied, then again the multigrid procedure. For four configurations of the multigrid method, determined by whether the V- or the W-cycle is executed and whether none or one post-smoothing step is performed, the resulting condition numbers for the space-time preconditioned system are visualized in Figure 6 (right). Based on those measurements we recommend the V-cycle with one pre- and one post-smoothing step, since the resulting condition number remains below ten and the resulting preconditioner is symmetric.
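The pre- and post-smoothing sweeps can be sketched in Python as follows (a NumPy stand-in with a toy SPD matrix; in a real implementation the triangular systems are solved by forward/backward substitution):

```python
import numpy as np

def gauss_seidel_pre(A, q, f):
    """Forward sweep q <- (L + D)^{-1} (f - U q) for A = L + D + U."""
    return np.linalg.solve(np.tril(A), f - np.triu(A, k=1) @ q)

def gauss_seidel_post(A, q, f):
    """Backward sweep q <- (U + D)^{-1} (f - L q)."""
    return np.linalg.solve(np.triu(A), f - np.tril(A, k=-1) @ q)

# One pre- plus one post-sweep per step is a symmetric smoother; iterating it
# solves A q = f for this small SPD example.
A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
f = np.array([1.0, 2.0, 3.0])
q = np.zeros(3)
for _ in range(50):
    q = gauss_seidel_post(A, gauss_seidel_pre(A, q, f), f)
print(np.allclose(A @ q, f))   # True
```

Because the backward sweep is the adjoint of the forward sweep, pairing one of each yields the symmetry of the overall multigrid cycle noted above.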

Resolution of parabolic evolution equations.
In this section we describe the basic form of the complete algorithm for the space-time resolution of parabolic evolution equations based on the nonsymmetric variational formulation (10), and discuss some of its properties. See [2, 3, 4, 5, 6] for further numerical experiments. We briefly comment on the symmetric formulation (14) in Section 4.2.6. The temporal interval is partitioned into N = 2^5, 2^6, ..., 2^13 temporal elements of equal size. As temporal discretization E_L ⊂ H^1(J) and F_L ⊂ L^2(J) we take the spaces of continuous piecewise affine and of piecewise constant functions with respect to this partition, respectively, which results in N + 1 temporal degrees of freedom. This discretization, which is in fact equivalent to Crank–Nicolson time-stepping, is not unconditionally space-time stable; more precisely, [5] quantifies the stability in (17) in terms of CFL_L := Λ_L^2 T/N with Λ_L := sup_{χ∈V_L\{0}} ‖χ‖_{H^1}/‖χ‖_{L^2}. The computations are performed on an SGI UV 2000 shared memory machine with 160 cores on 16 CPUs of the Laboratoire J.-L. Lions. The spatial finite element C++ code used was originally developed in the context of [13]. Dense and sparse linear algebra data structures are provided by the Boost uBLAS library with the preprocessor macro NDEBUG defined; row- or column-major ordering is used as appropriate. All LU factorizations (disregarding matrix symmetry) are done through UMFPACK 4.4 linking to OpenBLAS 0.2.14 on default settings and with access to one core per process, whether in parallel or serial computation. Parallelism (described below) is achieved with the Boost MPI 1.57 library wrapping OpenMPI 1.8.3. The code is compiled using g++ 4.9.2 with the -O2 optimization flag.
The inverse of N is computed (only once per MPI process) by LU factorization of the spatial mass/stiffness matrix, see Section 3.2.1. Unless specified otherwise, the application of M^{−1} is approximated as in Section 3.2.2 using strategy 2 described following (43) with γ_i^{ref} = 2^i for integer i. Every MPI process keeps its own list of required i, necessitating no more than 11 LU factorizations per process (this number increases as log_2 N in our setup and decreases with the number of processes). The temporal transformation V_t is the inverse wavelet transform from Section 3.3 with ν = 2 orthogonalization steps; the difference to ν = 1 in the timings is marginal. All LU factorizations are included in the timings shown.
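The per-process caching of shifted factorizations, with the requested shift γ rounded to the nearest power of two on a logarithmic scale, can be sketched as follows (class and method names are invented for illustration, and np.linalg.inv stands in for the LU factorization that UMFPACK would keep for reuse):

```python
import numpy as np

class ShiftedSolverCache:
    """Per-process cache of solvers for A_x + gamma_i M_x with gamma_i = 2^i."""
    def __init__(self, A, M):
        self.A, self.M, self.cache = A, M, {}

    def solve(self, gamma, b):
        i = int(round(np.log2(gamma)))   # nearest power of two on log scale
        if i not in self.cache:
            # stand-in for an LU factorization stored for later reuse
            self.cache[i] = np.linalg.inv(self.A + 2.0**i * self.M)
        return self.cache[i] @ b

A = np.array([[2.0, -1.0], [-1.0, 2.0]])   # toy stiffness matrix
M = np.eye(2)                              # toy mass matrix
solver = ShiftedSolverCache(A, M)
b = np.array([1.0, 1.0])
solver.solve(3.9, b)
solver.solve(4.1, b)      # both shifts round to i = 2: factorization reused
print(len(solver.cache))  # 1
```

The cache size grows only with the number of distinct rounded shifts a process encounters, which is how the bound of about log_2 N factorizations per process arises.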

4.2.2. Complete algorithm for the nonsymmetric formulation. The nonsymmetric space-time resolution algorithm for parabolic evolution equations proceeds as follows.
1. Assemble the "temporal FEM" matrices C_t^{FE}, M_t^{FE}, M_t^E, A_t^E, M_t^F, and the "spatial FEM" mass and stiffness matrices M_x, A_x, as in Section 3.1. Compute the inverse wavelet transform T_t as in Section 3.3.1. Assemble the space-time load vector F as in [4, Section 7.2].
2. Define the functions w → Bw and d → B^T d as in Section 3.1.
3. Define the functions that approximate the actions d → N^{−1}d and w → M^{−1}w of the preconditioners as in Sections 3.2.1 and 3.2.2, respectively.
4. Compute an approximate solution to the Gauss normal equations (24) using the generalized LSQR algorithm provided in the Appendix.
4.2.3. Numerical convergence analysis. We document the convergence of the discrete space-time solution after 10 LSQR iterations with respect to the number of temporal and spatial degrees of freedom in the norms of L^∞(J; L^2(D)), L^2(J; H^1(D)), and L^2(J; L^2(D)) in Figure 7. The discrete solution computed with 20 LSQR iterations at (2^13 + 1) × 32 385 ≈ 265 million degrees of freedom is taken as the reference solution for error estimation. Since the trial space is defined on a uniform mesh both in space and time, the low regularity of the exact solution implies low convergence rates, as can be expected from quasi-optimality (25); see for instance [46, Section 7] for a discussion.
The error of the discrete solution with 8 001 spatial degrees of freedom as a function of the LSQR iteration number is depicted in Figure 8 (left) for different temporal resolutions. At the first iteration, the influence of the temporal discretization on the discrete inf-sup constant (55) can be seen: increasing the temporal resolution increases the inf-sup constant γ_L, improving the condition number (27) of the preconditioned algebraic system and leading to a smaller error. In Figure 8 (right) we show the same data as a function of walltime, computed on 64 MPI processes; see Section 4.2.5.
Figure 9 shows the same computation but with a multigrid iteration replacing the exact LU factorization for the Helmholtz problems (43). The mesh hierarchy is the one generated by the refinement starting from the coarsest mesh with 4 elements. The timings do not include the assembly of the matrices on the different levels or their lexicographic reordering. As suggested by the 1D computations in Section 4.1.4, we apply the V-cycle with one pre- and one post-smoothing Gauss-Seidel step. The frequency γ is again rounded to the next power of two (on a logarithmic scale). The LU factorization and the multigrid iteration show similar behavior both in terms of the error and of the computational time, except most notably in the initialization phase of the LSQR algorithm, where all the necessary LU factorizations are computed.

4.2.4. Data structures and parallelization. The space-time tensor product discretization of the trial space described in Section 3.1 implies that the discrete solution vector u, whose components we index by the basis functions (θ, σ) ∈ Θ × Σ, can be written as a rectangular array U ∈ R^{Σ×Θ} where each column corresponds to a spatial vector of size #Σ associated with one of the temporal basis functions θ ∈ Θ. We write Vec(U) := u. As anticipated in the introduction, for parallel computation we evenly distribute this rectangular array columnwise across the MPI processes. For example, for 2^10 + 1 temporal degrees of freedom and 64 MPI processes, the first one hosts 17 columns, all the others 16 columns each. Mutatis mutandis, this applies to the space-time part of the load vector F and to all the vector quantities in the LSQR algorithm. The initial datum part of the load vector F (and similar quantities in the LSQR algorithm) is assembled and kept on each MPI process. The "spatial FEM" matrices are computed, stored, and LU factorized on demand by each MPI process independently.
The above data layout is tailored to the identity (S_t ⊗ S_x) Vec(U) = Vec(S_x U S_t^T), where S_x and S_t are matrices of appropriate size. Using this identity in the application of B, N^{−1}, etc., avoids the explicit formation of Kronecker products of matrices. This remains true for the preconditioners M and N even if the generator A is time-dependent. Moreover, since the application of S_x is columnwise, it is performed in parallel. It is thus only in the multiplication by "temporal FEM" matrices and in the wavelet transform that communication between the MPI processes is required. In particular, the application of the approximate inverse of the block-diagonal preconditioner (40) is performed in parallel without any interprocess communication.

Parallel scaling.
In this section we comment on the parallel scaling of the complete space-time algorithm from Section 4.2.2. In order to focus on the parallelization in the temporal direction, we fix the number of spatial degrees of freedom to 1 953. We compare the run time of the space-time algorithm on 16, 32, 64, and 128 MPI processes with that of a sequential implementation of the implicit Euler time-stepping scheme on the same temporal mesh (the timings for Crank–Nicolson time-stepping are, of course, similar). In each step, the linear system is solved by LU factorization. We point out that the factorization is performed anew for each time step, although the matrix remains the same. This simulates the situation of a diffusion coefficient with nontrivial temporal dependence or of a nonuniform time step size; otherwise, many more MPI processes and time steps, or a finer spatial discretization, would be required to compete with the time-stepping scheme in terms of walltime.
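For reference, the time-stepping baseline can be sketched as follows (a NumPy toy; np.linalg.solve factorizes the matrix anew at every call, mimicking the per-step LU factorization described above):

```python
import numpy as np

def implicit_euler(M, A, u0, tau, nsteps):
    """Sequential implicit Euler for M u' + A u = 0: each step solves
    (M + tau A) u^{n+1} = M u^n, refactorizing the matrix every step as in
    the comparison above (modeling a time-dependent coefficient or a
    nonuniform step size)."""
    u = u0.copy()
    for _ in range(nsteps):
        u = np.linalg.solve(M + tau * A, M @ u)
    return u

n = 20
# 1D Laplacian stencil as a toy stiffness matrix, identity as mass matrix
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
M = np.eye(n)
u0 = np.ones(n)
uN = implicit_euler(M, A, u0, tau=0.1, nsteps=50)
print(np.linalg.norm(uN) < np.linalg.norm(u0))   # True: the scheme is stable
```

In the actual comparison each step additionally assembles the right-hand side from the load vector; only the elements not common to both methods are timed.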
To compare the run times of the space-time algorithm and of the time-stepping scheme we do not include the mass/stiffness matrix and load vector assembly time, and other elements that are common to both methods. We report the walltime of the LSQR algorithm with exactly 20 iterations (a rather generous number, see Figures 8 and 9) for the space-time algorithm, and essentially the N matrix factorizations that are necessary for time-stepping.
For the wavelet transform in time we use a preliminary parallel implementation of the pyramid scheme, on which we intend to elaborate in a forthcoming work. The applications of T_t and T_t^T in the approximation of M^{−1} together require approximately a quarter of the walltime of each LSQR iteration. Our implementation of the wavelet transform through matrix representation and matrix multiplication instead of the pyramid scheme does not scale as well, and we therefore omit the details.
The results are displayed in Figure 10 as a function of the number of MPI processes (left) and as a function of the number of temporal elements (right). The parallel space-time algorithm shows a decent speedup compared to the sequential time-stepping, for instance by a factor of 16 for N = 2^12 temporal elements on 128 MPI processes. Concerning weak scaling, a computation with 8x the number of MPI processes and 8x the number of temporal intervals takes approximately 2x longer.

4.2.6. On the implementation of the symmetric formulation. The main potential of the symmetric formulation (14) lies in the fact that the discretization is uniformly stable in X_L ⊂ X. Moreover, the Lagrangian from which it derives [2, Section 3.2.4] assumes its minimum at the exact solution, and can therefore drive a posteriori refinement [48]. The different spatial resolutions may be associated with the single-scale temporal hat functions (complicating the application of the preconditioner) or directly with the temporal wavelets. In any case, the application of the discretized parabolic operator involves the inverse of (the symmetric part of) the generator A, say the Laplacian, which typically has to be computed on a yet finer spatial mesh up to a certain accuracy. This limits the accuracy of the resulting discrete solution, but these spatial problems are independent and can therefore be solved in parallel. Details are delegated to future work.

Conclusions.
We have developed a wavelet-in-time multigrid-in-space preconditioner for the algebraic linear systems arising from space-time Petrov-Galerkin discretizations of linear parabolic evolution equations. The sparsity of the wavelet-in-time transformation is crucial to reduce the interprocess communication cost when the parallelization is done along the temporal direction. This transformation approximately block-diagonalizes the canonical preconditioner given by the continuous space-time norms, and allows the resulting spatial blocks to be inverted in parallel using, for instance, standard spatial multigrid methods. We have presented several numerical experiments documenting the excellent performance of the preconditioner in the regime of small Péclet numbers, including a first application to robust preconditioning of optimality systems from optimal control constrained by parabolic PDEs, as well as parallel-in-time computations.

Fig. 1. Condition number of the wavelet-like basis (left) and percentage of nonzeros in the transformation matrix (right) as a function of the number of orthogonalization steps, see Section 3.3.2. Without orthogonalization, the condition number in L^2 is very large.

Fig. 2. Condition number of the wavelet-like basis as a function of the size of the test mesh for ν = 1, 2, 3 (left to right) orthogonalization steps.

Fig. 3. Typical wavelet-like basis functions.

Fig. 4. Condition number of the preconditioned system matrix for the symmetric space-time variational formulation as a function of the number of temporal intervals, see Section 4.1.3. Each line from bottom to top corresponds to a value of drift velocity β = 0, 10, ..., 100. Left: Wavelet-in-time and exact computation in space. Right: Same preconditioner with an additional β-dependent term.

4.2.1. Model problem, discretization and setup. As a specific model problem we take the heat equation posed on the L-shaped domain D = (−1, 1)^2 \ [0, 1)^2 in d = 2 spatial dimensions, with the initial datum g(x) := (1 − |x_1|)(1 − |x_2|), right-hand side f(t) := 2t, and homogeneous Dirichlet boundary conditions. The fact that the initial datum does not satisfy the boundary conditions leads to a solution with low space-time regularity, which is reflected in the convergence rates, see Section 4.2.3. The spatial domain is partitioned into two pairs of congruent triangles such that |x_1| − |x_2| has constant sign on each triangle. Thereafter, global uniform red refinement is applied, resulting in 3, 21, 105, 465, 1 953, 8 001, 32 385 internal degrees of freedom of the P1 Lagrangian finite element discretization.

Fig. 10. Walltime of the LSQR algorithm from the start until the completion of 20 iterations, for 1 953 spatial degrees of freedom, see Section 4.2.5. Left: as a function of the number of MPI processes for different temporal resolutions; the dashed line indicates weak scaling. Right: as a function of the temporal resolution for different numbers of MPI processes, compared with a sequential implementation of the implicit Euler time-stepping scheme.