Stability of over-relaxations for the Forward-Backward algorithm, application to FISTA


Introduction. Let $H$ be a Hilbert space and let $f$ and $g$ be two convex, l.s.c. functions from $H$ to $\mathbb{R} \cup \{+\infty\}$ such that $f$ is differentiable with $L$-Lipschitz continuous gradient, and $g$ is "simple", meaning that its "proximal map" $x \mapsto \arg\min_{y \in H} \, g(y) + \frac{\|x - y\|^2}{2\tau}$ can be easily computed. We consider the minimization problem $\min_{x \in H} F(x) := f(x) + g(x)$, and we assume that $F$ is coercive (i.e. $F(x) \to +\infty$ when $\|x\| \to +\infty$), which implies that this problem has at least one solution (and possibly an infinite set of solutions). Among the many algorithms which exist to tackle such problems, the proximal splitting algorithms, which perform alternating descents in $f$ and in $g$, are frequently used because of their simplicity and relatively small per-iteration complexity. One can mention the Forward-Backward (FB) splitting, the Douglas-Rachford splitting and the ADMM (alternating direction method of multipliers) (see for instance [9,13,10,11,8]), which have all been proved to be efficient in many imaging problems such as denoising, inpainting, deconvolution, color transfer and many others.
This work focuses on variants of the so-called "Fast Iterative Soft Thresholding Algorithm" (FISTA), which is an accelerated variant of the Forward-Backward algorithm proposed by Beck and Teboulle [4], built upon ideas of Nesterov [16] and Güler [12]. More precisely, the ergodic convergence rate and the stability to perturbations of the convergence of the iterates of these over-relaxed algorithms are studied.
The FB is a descent algorithm which defines a sequence $(x_n)_{n \in \mathbb{N}}$ by performing an explicit descent step in $f$ and an implicit one in $g$. It can then be shown that there exists $C > 0$ such that for all $n \geq 1$, $F(x_n) - F(x^*) \leq \frac{C}{n}$, where $x^*$ is a minimizer of $F$. Moreover the sequence $(x_n)_{n \in \mathbb{N}}$ weakly converges in $H$. See for instance [17] or [4] for a simple derivation of this rate. Combettes and Wajs [9] proved that if the sum of the norms of the errors made at each step on the proximal operator and on the gradient is finite, the iterates of FB still converge to a minimizer of $F$. This paper proposes a convergence analysis of a class of over-relaxations of FB, or inertial Forward-Backward (iFB) algorithms, when the proximal operator and the gradient are computed with errors at each step. This class includes FISTA and, in a sense, interpolates between FB and FISTA. Each algorithm of the class is mainly defined by a parameter $d \in [0, 1]$. The choice $d = 1$ corresponds to FISTA and the case $d = 0$ corresponds to FB.
Two definitions of perturbation of the proximal operator are considered (see e.g. [19,22,1] and references therein).
Numerically, FISTA seems less stable than FB: this has led for instance Beck and Teboulle to introduce a monotone version MFISTA of FISTA in [3], the claim being that MFISTA is more robust than FISTA. Our purpose is to show that another way to accelerate FB ensuring a better stability than FISTA is to slow down the over-relaxation depending on the assumptions on perturbations.
Three types of convergence are studied for these algorithms:
1. The convergence rate of $F(x_n) - F(x^*)$.
2. The convergence rate of $F(z_n) - F(x^*)$, where $z_n$ is a convex combination of $(x_k)_{k \leq n}$, which is referred to as ergodic convergence.
3. The weak convergence of $(x_n)_{n \in \mathbb{N}}$.
Our work takes some inspiration from the paper by Schmidt et al. [20], where the authors investigate the stability of FISTA. One of the key results of the paper, which enables us to derive theorems about the convergence of this class of algorithms, is Proposition 3.3, which can be seen as a generalization of Proposition 2 in [20]. The convergence rate results we obtain also include those of [20] and [22].
Moreover, our work is the first one providing stability results for FISTA in terms of convergence of the iterates. Indeed, our work is also based on the paper [7] of the second author of the present article, where the convergence of the iterates of FISTA is proved. The extension of the approach of [7] is done using ideas of the work by Moudafi and Oliny [15], with the notion of $\varepsilon$-enlargements.
The main contributions of the paper can be summarized as follows. If the perturbations are small enough to ensure the optimal decay of FISTA ($O(1/n^2)$), the iterates of FISTA weakly converge. If the perturbations are larger, slowing down the over-relaxation of FB can ensure the weak convergence of the iterates. Moreover, the ergodic convergence of these over-relaxations can be better than both the classical and the ergodic convergence of FISTA.
The rest of the paper is organized as follows. In Section 1, we recall the main notations and definitions used in this paper to analyze iFB and the specific case of FISTA. In Section 2, we introduce the different notions used to approximate proximal operators, and we give some basic facts.
In Section 3, we study the convergence rate (in terms of the values of the functional) for iFB. We show that, for suitable parameter choices and depending on the noise assumptions, iFB may lead to faster ergodic convergence rates than classical FB and FISTA. In Section 4, we show the convergence of the iterates of the different schemes considered in the previous section. We then discuss the obtained results and put them in perspective with the existing literature in Section 5. In Section 6, we give some numerical simulations that confirm the theoretical results of the paper. Most of the proofs of the results presented in the paper are postponed to Appendix A, for ease of reading.
defined even if this solution is not unique.
The set of nonnegative integers is denoted by $\mathbb{N}$ and the set of positive integers is denoted by $\mathbb{N}^*$. A key tool of FISTA is the proximal map. To any proper, convex and l.s.c. function $h$ is associated the proximal map $\mathrm{Prox}_h$, which is a function from $H$ to $H$ defined by $\mathrm{Prox}_h(x) = \arg\min_{y \in H} \, h(y) + \frac{1}{2}\|x - y\|^2$. This function is uniquely defined and it generalizes the projection onto a closed convex set to convex functions. In the sequel, $\gamma$ denotes a positive real number such that $\gamma \leq \frac{1}{L}$, where $L$ is the Lipschitz constant of $\nabla f$, and $T$ denotes the mapping from $H$ to $H$ defined by $T(x) := \mathrm{Prox}_{\gamma g}(x - \gamma \nabla f(x))$. The idea of FB is to apply this mapping iteratively from any $x_0 \in H$, using Krasnosel'skii-Mann iterations, to get weak convergence to a minimizer $x^*$ of $F$.
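As a minimal sketch of the two objects just introduced, the following Python snippet implements the proximal map of $\tau\|\cdot\|_1$ (soft thresholding, a classical example of a "simple" function) and the usual forward-backward mapping $T(x) = \mathrm{Prox}_{\gamma g}(x - \gamma\nabla f(x))$. The toy functions $f(x) = \frac{1}{2}\|x - b\|^2$ and $g = \lambda\|\cdot\|_1$, the data `b`, and all function names are assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal map of tau * ||.||_1: shrink each coordinate toward 0 by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def forward_backward_step(x, grad_f, prox_g, gamma):
    """One application of T: explicit step on f, then implicit (prox) step on g."""
    return prox_g(x - gamma * grad_f(x), gamma)

# Hypothetical toy problem: f(x) = 0.5 * ||x - b||^2 (so L = 1), g = lam * ||.||_1.
b = np.array([3.0, -0.5, 1.2])
lam = 1.0
grad_f = lambda x: x - b
prox_g = lambda v, gamma: soft_threshold(v, gamma * lam)

# Plain FB: iterate T from x_0 = 0 with gamma = 1 = 1/L.
x = np.zeros(3)
for _ in range(50):
    x = forward_backward_step(x, grad_f, prox_g, gamma=1.0)
```

On this toy problem the minimizer of $F$ is the soft thresholding of `b` at level `lam`, so the iterates can be compared with a closed-form answer.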
The idea of inertial Forward Backward (iFB) and of FISTA is to apply this mapping using a suitable extragradient rule to accelerate the convergence.
The iFB is defined by a sequence $(t_n)_{n \in \mathbb{N}^*}$ of real numbers greater than or equal to 1 and a point $x_0 \in H$. The sequences $(x_n)_{n \in \mathbb{N}}$, $(y_n)_{n \in \mathbb{N}}$ and $(u_n)_{n \in \mathbb{N}}$ are defined by $y_0 = u_0 = x_0$ and, for all $n \geq 1$, $x_n = T(y_{n-1})$, $u_n = x_{n-1} + t_n (x_n - x_{n-1})$ and $y_n = \left(1 - \frac{1}{t_{n+1}}\right) x_n + \frac{1}{t_{n+1}} u_n$. The point $y_n$ may also be defined from the points $x_n$ and $x_{n-1}$ by $y_n = x_n + \alpha_n (x_n - x_{n-1})$ with $\alpha_n := \frac{t_n - 1}{t_{n+1}}$. For suitable choices of $(t_n)_{n \in \mathbb{N}^*}$ the sequence $(F(x_n))_{n \in \mathbb{N}}$ converges to $F(x^*)$, i.e. the sequence $(w_n)_{n \in \mathbb{N}}$ defined by $w_n := F(x_n) - F(x^*)$ tends to 0 when $n$ goes to infinity. In their seminal work [4], Beck and Teboulle introduced FISTA by choosing the specific sequence (1.6) $t_1 = 1$ and $\forall n \geq 1$, $t_{n+1} = \frac{1 + \sqrt{1 + 4 t_n^2}}{2}$. In many articles, see for example [2], authors call FISTA the previous algorithm with $t_n = \frac{n+1}{2}$. More recently, Chambolle and Dossal [7] proposed the choice $t_n = \frac{n+a-1}{a}$ with $a > 2$. In the sequel we will consider these three choices as different versions of a single algorithm that we will call FISTA.
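The inertial scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the extrapolation weight is taken to be the usual FISTA weight $\alpha_n = (t_n - 1)/t_{n+1}$ (which gives $\alpha_n = \frac{n-1}{n+2}$ for $t_n = \frac{n+1}{2}$), the power rule $t_n = \left(\frac{n+a-1}{a}\right)^d$ is the family used later in the paper, and the quadratic toy problem with $g = 0$, as well as all function names, are hypothetical.

```python
import numpy as np

def t_beck_teboulle(n):
    """Beck-Teboulle rule: t_1 = 1 and t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2."""
    t = 1.0
    for _ in range(n - 1):
        t = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    return t

def t_classical(n):
    """Common simplified choice t_n = (n + 1) / 2."""
    return (n + 1.0) / 2.0

def t_power(n, a=3.0, d=1.0):
    """t_n = ((n + a - 1) / a)^d; d = 1 is the FISTA variant of [7], d = 0 is FB."""
    return ((n + a - 1.0) / a) ** d

def inertial_fb(x0, T, t, n_iter):
    """Exact inertial FB: x_n = T(y_{n-1}), y_n = x_n + alpha_n (x_n - x_{n-1})."""
    x_prev = np.array(x0, dtype=float)
    y = x_prev.copy()
    for n in range(1, n_iter + 1):
        x = T(y)
        alpha = (t(n) - 1.0) / t(n + 1)  # assumed standard extrapolation weight
        y = x + alpha * (x - x_prev)
        x_prev = x
    return x_prev

# Hypothetical toy problem: F(x) = 0.5 * sum(D * x^2), g = 0, L = 1, gamma = 1.
D = np.array([1.0, 0.1])
F = lambda x: 0.5 * np.sum(D * x ** 2)
T = lambda y: y - D * y
x0 = np.array([1.0, 1.0])
x_fb = inertial_fb(x0, T, lambda n: 1.0, 200)   # d = 0: plain FB
x_fista = inertial_fb(x0, T, t_power, 200)      # d = 1, a = 3
```

Passing the sequence $t$ as a function keeps the three FISTA variants and the intermediate over-relaxations interchangeable in the same loop.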
Several proofs use bounds on the local variation of the sequence $(x_n)_{n \in \mathbb{N}}$, which we will denote by $(\delta_n)_{n \in \mathbb{N}}$: $\delta_n := \frac{1}{2}\|x_n - x_{n-1}\|^2$. The sequence $(v_n)_{n \in \mathbb{N}}$ denoting the distance between $u_n$ and a fixed minimizer $x^*$ of $F$ will also be useful: $v_n := \frac{1}{2}\|u_n - x^*\|^2$. To complete this part dedicated to notations, we define a sequence $(\rho_n)_{n \in \mathbb{N}}$, associated to $(t_n)_{n \in \mathbb{N}^*}$, whose positivity will ensure the convergence of the iFB iterations: (1.9) $\rho_n := t_{n-1}^2 - t_n^2 + t_n$.
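The positivity of $\rho_n$ is easy to probe numerically. The sketch below evaluates $\rho_n = t_{n-1}^2 - t_n^2 + t_n$ for the power family $t_n = \left(\frac{n+a-1}{a}\right)^d$ used later in the paper; the particular values of $a$ and $d$, and the function names, are illustrative assumptions.

```python
def t_power(n, a, d):
    """t_n = ((n + a - 1) / a)^d, so that t_1 = 1 for every a and d."""
    return ((n + a - 1.0) / a) ** d

def rho(n, a, d):
    """rho_n = t_{n-1}^2 - t_n^2 + t_n, defined for n >= 2."""
    return t_power(n - 1, a, d) ** 2 - t_power(n, a, d) ** 2 + t_power(n, a, d)

# d = 1, a = 3: direct algebra gives rho_n = (n + 3) / 9, which is positive.
rho_fista = [rho(n, 3.0, 1.0) for n in range(2, 1001)]

# d = 0.5, a = 3: an intermediate over-relaxation.
rho_half = [rho(n, 3.0, 0.5) for n in range(2, 1001)]

# d = 1, a = 1.5 (a < 2): rho_n = (0.75 - 0.5 n) / 2.25, negative for n >= 2.
rho_small_a = [rho(n, 1.5, 1.0) for n in range(2, 1001)]
```

For $d = 0$ the sequence is constant, $t_n = 1$, and $\rho_n = 1$ for all $n$, which corresponds to plain FB.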
In [7], the following result is shown on the iterates of FISTA: Theorem 1.1. Let $a > 2$ be a real number, and for all $n \in \mathbb{N}$ let $t_n = \frac{n+a-1}{a}$. Then the sequence $(x_n)_{n \in \mathbb{N}}$ given by FISTA weakly converges to a minimizer of $F$, and there exists a real number $C$ depending on $F$ and $x_0$ such that, for all $n \geq 1$, $F(x_n) - F(x^*) \leq \frac{C}{n^2}$. Note that only the convergence of $F(x_n)$ is shown in [4], where FISTA is introduced. In the present paper, we will use ideas from [7] to prove the convergence of the iterates $x_n$ for the considered schemes.
2. Inexact computations of the proximal point. In this section, we introduce the different notions used to approximate a proximal operator in this work. As recalled in the previous section, if $F$ is a proper, convex and l.s.c. function, and $\lambda > 0$, we can define the proximal map $\mathrm{Prox}_{\lambda F}$ by $\mathrm{Prox}_{\lambda F}(y) = \arg\min_{x \in H} \, F(x) + \frac{1}{2\lambda}\|x - y\|^2$. Let us denote by
The first order optimality condition for a convex minimization problem yields
We now introduce the notion of $\varepsilon$-subdifferential of $F$ at the point $z \in \mathrm{dom}\, F$ as: $\partial_\varepsilon F(z) = \{ y \in H : F(x) \geq F(z) + \langle x - z, y \rangle - \varepsilon \ \ \forall x \in H \}$. It is worth noticing that it holds:
This is a generalization of the subdifferential:
We can introduce different kinds of approximations of the proximal operator computation [19,22].
Definition 2.1. We say that $z \in H$ is a type 1 approximation of $\mathrm{Prox}_{\lambda F}(y)$ with $\varepsilon$ precision, and we write $z \approx_1 \mathrm{Prox}_{\lambda F}(y)$, if and only if
Another useful notion of approximation is obtained by relaxing the last equation in (2.3):
Definition 2.2. We say that $z \in H$ is a type 2 approximation of $\mathrm{Prox}_{\lambda F}(y)$ with $\varepsilon$ precision, and we write $z \approx_2 \mathrm{Prox}_{\lambda F}(y)$, if and only if $\frac{y - z}{\lambda} \in \partial_\varepsilon F(z)$. Notice that if $z \approx_2 \mathrm{Prox}_{\lambda F}(y)$, then $z \approx_1 \mathrm{Prox}_{\lambda F}(y)$ (see Proposition 1 in [22]). Condition (2.8) can be written equivalently as:
Recalling that the proximity operator of $F$ is defined as $(\mathrm{Id} + \partial F)^{-1}$, the admissible approximations of type 1 can be interpreted as a kind of $\varepsilon$-enlargement of the proximity operator [6].
Indeed, if R is a monotone operator, we can generalize the notion of approximate subdifferential with the one of ε enlargement [6]: Definition 2.3.
Notice that if R = ∂f with f a convex function, then one has ∂ ε f (x) ⊂ R ε (x) ∀x ∈ H. This inclusion may be strict (see [6] for examples).
Another definition of approximation of prox is used in [9] to study the stability of the Forward-Backward algorithm.
Definition 2.4. We say that $z \in H$ is a type 0 approximation of $\mathrm{Prox}_{\lambda F}(y)$ with $\varepsilon$ precision, and we write $z \approx_0 \mathrm{Prox}_{\lambda F}(y)$, if and only if
Unfortunately our analysis does not handle such an approximation. We end this section with a technical lemma taken from [2] that allows us to consider approximations of types $i = 1$ and $i = 2$ in the same setting.
Lemma 2.5. If $x \in H$ is a type 1 approximation of $\mathrm{Prox}_{\lambda F}(y)$ with $\varepsilon$ precision, then there exists $r$ such that $\|r\| \leq \sqrt{2\lambda\varepsilon}$ and $\frac{y - x - r}{\lambda} \in \partial_\varepsilon F(x)$. The proof of this lemma is that of Lemma 2 in [2]. Notice that when $r = 0$, we recover the definition of a type 2 approximation.
Proof. Let $a_\lambda(x) = \frac{1}{2\lambda}\|x - y\|^2$. Then:
Now that we have introduced all this material, we can formulate the main results of the paper in the next two sections.
3. Convergence rates of inertial FB in presence of perturbations. Application to FISTA. The sketch of the approximate over-relaxation of FB (FISTA when t n is well chosen) used in the paper is given in Algorithm 1.
Algorithm 1 Approximate inertial FB algorithm. Let $(t_n)_{n \in \mathbb{N}^*}$ be a nondecreasing sequence of nonnegative real numbers such that $t_1 = 1$, and let $x_0 \in H$. The sequences $(x_n)_{n \in \mathbb{N}}$, $(y_n)_{n \in \mathbb{N}}$ and $(u_n)_{n \in \mathbb{N}}$ are defined by $y_0 = u_0 = x_0$ and for all $n \geq 1$,
The point $y_n$ may also be defined from the points $x_n$ and $x_{n-1}$ by
This section presents the convergence rate results (in terms of the values of the functional).
In all the sequel, we will use specific choices of sequences $(t_n)_{n \in \mathbb{N}^*}$: $\forall n \in \mathbb{N}^*$, $t_n = \left(\frac{n+a-1}{a}\right)^d$. One can notice that the choice $d = 1$ and $a > 2$ corresponds to the version of FISTA proposed in [7] and satisfies H1. The choice $d = 0$ corresponds to Forward-Backward. This condition ensures that for all $n \in \mathbb{N}$, $n \geq 2$, $\rho_n = t_{n-1}^2 - t_n^2 + t_n > 0$, which is a key property for all the following results. More precisely:
Lemma 3.2. If $(a, d)$ satisfies condition H1 then $\forall n \in \mathbb{N}$, $n \geq 2$,
Proof. See Subsection A.1.
All the following theorems derive from the next proposition, which can be seen as a generalization of Proposition 2 of [20,2] to any sequence $(t_n)_{n \in \mathbb{N}^*}$ ensuring the positivity of the sequence $(\rho_n)_{n \geq 2}$, with the additional and crucial term $\sum_{n=2}^{N} \rho_n w_{n-1}$. Lemma 3.2 gives a lower bound on $\rho_n$.
Proposition 3.3. Consider Algorithm 1 with $i \in \{1, 2\}$ and any sequence $(t_n)_{n \in \mathbb{N}^*}$ such that $t_1 = 1$ and the sequence $(\rho_n)_{n \geq 2}$ is positive. Then for all $n \geq 1$, we have
This proposition relates the quantity $w_n = F(x_n) - F(x^*)$ to the initialization choice (distance to the minimizer) and to the numerical errors $\varepsilon_n$ and $e_n$.
For suitable choices of the sequence $(t_n)_{n \in \mathbb{N}^*}$, Proposition 3.3 leads to the following theorem:
Theorem 3.4. Consider that $(a, d)$ satisfies condition H1 and that $\forall n \in \mathbb{N}^*$, $t_n = \left(\frac{n+a-1}{a}\right)^d$. Then for all $n \geq 1$, we have
This applies to FISTA ($d = 1$). If the sequences $A_{i,n}$ and $B_n$ are uniformly bounded, we get convergence rates for the over-relaxation:
Theorem 3.5. Consider Algorithm 1 with $i \in \{1, 2\}$, consider that $(a, d)$ satisfies condition H1 and that $\forall n \in \mathbb{N}^*$, $t_n = \left(\frac{n+a-1}{a}\right)^d$. Assume that the following assumptions hold: there exist positive real numbers $A_1$, $A_2$ and $B$ such that 1.
Proof. The first two points are direct consequences of Theorem 3.4. The third point is a consequence of the coercivity of the function $F$. Indeed, under the hypotheses of the Theorem, Theorem 3.4 shows that the sequence $(\|x^* - u_n\|)_{n \in \mathbb{N}}$ belongs to $\ell^\infty(\mathbb{N})$, which implies that the sequence $(\|u_n\|)_{n \in \mathbb{N}}$ belongs to $\ell^\infty(\mathbb{N})$. Since $(w_n)_{n \in \mathbb{N}}$ tends to 0 and since the function $F$ is coercive, we get that the sequence $(\|x_n\|)_{n \in \mathbb{N}}$ belongs to $\ell^\infty(\mathbb{N})$. From the definition of $u_n$ (1.3) it follows that $(t_n \|x_n - x_{n-1}\|)_{n \in \mathbb{N}}$ belongs to $\ell^\infty(\mathbb{N})$. Remarking that $t_n = \left(\frac{n+a-1}{a}\right)^d$ and $\delta_n = \frac{1}{2}\|x_n - x_{n-1}\|^2$ concludes the proof of this third point. The fourth one is a consequence of the fact that, under the hypotheses of the Theorem, there exists $C > 0$ such that
Hence by convexity of the function $F$,
which concludes the proof of the Theorem.
The previous theorem ensures that the over-relaxed algorithm behaves in the same way with small perturbations and with no perturbations. In the following Corollaries we focus on consequences of Theorems 3.4 and 3.5 when the perturbations are too large to ensure the optimal decay of FISTA. We will focus on the convergence of $w_n := F(x_n) - F(x^*)$ and on ergodic convergence, i.e. the convergence of $w^e_n = F(z_n) - F(x^*)$ where $z_n$ is defined in (3.19).
One can remark that the results are similar for classical and ergodic convergence. The first result with $i = 1$ is similar to the one of Schmidt et al. [20] and the second one to that of Salzo et al. [19] (although [19] only considers the special case $e_n = 0$). The case $\alpha = 1$ is treated by Theorem 3.5. One can observe that while the convergence rate is good for $\alpha$ close to 1, it is not that good for $\alpha$ close to 0. If $\alpha > 0$, the sequences $(\|e_n\|)_{n \in \mathbb{N}}$ and $(\varepsilon_n)_{n \in \mathbb{N}}$ belong to $\ell^1(\mathbb{N})$. Proposition 1 of [20] implies that the ergodic convergence of Forward-Backward satisfies in this case:
Hence for $\alpha < \frac{1}{2}$, the bounds we can achieve on FISTA are worse than what can be achieved with the ergodic convergence of Forward-Backward. The next corollary shows that, for $\alpha \in ]0, 1[$ and a suitable choice of $d$, one can perform better than both FISTA and FB. This corollary is a direct consequence of Theorem 3.5.
Then choosing $t_n = \left(\frac{n+a-1}{a}\right)^\alpha$ in Algorithm 1,
Let us assume that there exist 3 positive real numbers $A_1$, $A_2$ and $B$ such that 1.
Then the sequence $(n^d \delta_n)_{n \in \mathbb{N}}$ belongs to $\ell^1(\mathbb{N})$.
Proof. See Subsection A.3. This bound on $(n^d \delta_n)_{n \in \mathbb{N}}$ will play a key role in proving the convergence of the iterates, since it will be part of the final bound.
Now that the convergence of the values of the functional has been addressed, we can turn to the convergence of the iterates for Algorithm 1.

4. Convergence of iterates.
In this section, we present the theorem stating the convergence of the iterates of inertial Forward-Backward algorithms. The next theorem and corollary are generalizations of Theorem 1.1 in the case of errors in Algorithm 1.

5. Discussion. With Theorems 3.4 and 3.5, we propose generalizations of the results of Salzo et al. [19] and of Schmidt et al. [2] to any choice of $d \in ]0, 1[$ for two different ways to define the approximate proximal operator, with an extension to errors in the gradient computation in the case of [19]. More precisely, in both articles the authors consider classical FISTA, where the sequence $t_n$ is defined according to the rule of Beck and Teboulle [4] or with the other classical choice $t_n = \frac{n+1}{2}$, corresponding to $\alpha_n = \frac{n-1}{n+2}$. Following Chambolle et al. [7], for $d = 1$ we consider $t_n = \frac{n+a-1}{a}$ with $a > 2$ because this sequence ensures the weak convergence of the iterates, but it turns out that the results of [19,2,4] are similar with this choice of FISTA parameters. The weaker assumptions on errors needed for $i = 2$ in Algorithm 1 to get a similar decay of $w_n$ confirm the fact that the definition of approximation of the proximal operator given by $i = 1$ is stronger than for $i = 2$. Considering $d \in ]0, 1[$ has two advantages:
1. The ergodic convergence is better than that of FISTA for some perturbation levels.
2. The weak convergence of the iterates may be achieved for perturbation levels for which the weak convergence of the iterates of FISTA is not ensured.
More precisely, we show that:
1. If the perturbations on the proximal operator and on the gradient are small enough to ensure the optimal decay rate of FISTA ($O(1/n^2)$), then the iterates of FISTA weakly converge.
2. If the perturbation level is too high to ensure the optimal decay rate of FISTA, it may be better to slow down the over-relaxation to limit the amplification of perturbations due to over-relaxation. A lower over-relaxation may stabilize the algorithm, ensuring a better ergodic convergence and a weak convergence of the iterates.
3. For large perturbations, ergodic convergence behaves better than classical convergence.
Theorem 4.1 and Corollary 4.2 extend the convergence result of [7] to the case when errors occur in the FISTA algorithm. Notice that this extension is based on ideas proposed in [15], the notion of $\varepsilon$-enlargements having a key role. This stability result for the convergence of the iterates of FISTA indicates that, provided the errors are sufficiently controlled, the iterates still converge. This is an interesting property, in particular in the case of nested algorithms.
Since no strong convergence has been proved for FB or FISTA, the question of the convergence rate of the iterates $(x_n)_{n \in \mathbb{N}}$ does not have any meaning in a general setting. Nevertheless, the question may be interesting in finite dimension, where weak convergence implies strong convergence. Unfortunately there is no chance to prove any convergence rate of the iterates for FB or FISTA. If we consider the minimization problem $\inf_x (f(x) + g(x))$ with $f(x) = \|x\|^p$, $p > 2$, and $g(x) = 0$, then FB is a simple gradient descent and FISTA is an inertial gradient descent. It can be shown that any sequence $(x_n)_{n \in \mathbb{N}}$ defined by FB satisfies $\|x_n\| > C_1 n^{-\frac{1}{p-2}}$ where $C_1$ depends on $\|x_0\| > 0$, and that any sequence $(x_n)_{n \in \mathbb{N}}$ defined by FISTA satisfies $\|x_n\| > C_2 n^{-\frac{2}{p-2}}$ where $C_2$ depends on $\|x_0\| > 0$. It follows that the convergence to the minimizer 0 may be very slow for large values of $p$.
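The slow decay of the FB iterates on this example can be observed with a few lines of Python. This is a hedged one-dimensional sketch: the step size $\gamma = 0.1$, the starting point $x_0 = 1$, the exponent $p = 4$ and the normalization of the gradient step as $x - \gamma x |x|^{p-2}$ (the form of the update (6.1) used in the numerical section) are assumptions chosen for illustration.

```python
def gradient_descent_power(p, gamma, x0, n_iter):
    """Plain gradient descent x <- x - gamma * x * |x|^(p - 2) in dimension one."""
    x = x0
    for _ in range(n_iter):
        x = x - gamma * x * abs(x) ** (p - 2)
    return x

p, gamma, n = 4.0, 0.1, 10_000
x_n = gradient_descent_power(p, gamma, 1.0, n)

# If x_n decays like n^(-1/(p-2)), this rescaled quantity stays bounded away from 0.
scaled = x_n * n ** (1.0 / (p - 2.0))
```

With $p = 4$ the recursion gives $1/x_{n+1}^2 \geq 1/x_n^2 + 2\gamma$, so $x_n \sqrt{n}$ stays below $1/\sqrt{2\gamma}$ while remaining of that order, which is consistent with the rate $n^{-\frac{1}{p-2}}$.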
6. Numerical experiments. Theorem 3.5 ensures that, for some noise levels, the bound on the ergodic convergence rate may be better for an over-relaxation of Forward-Backward that is not FISTA. The fact that the bound is lower does not guarantee that the decay of $F(z_n) - F(x^*)$ is better for a suitable over-relaxation. A classical example of an algorithm whose bound on the convergence rate is not tight is the original FISTA. It has long been known that in most experiments the sequence $F(x_n) - F(x^*)$ oscillates and that for most values of $n$ the bound $\frac{\|x_0 - x^*\|^2}{(n+1)^2}$ given in [4] is not tight.
To test these bounds, we need to be able to bound the errors on gradient and on the proximal operator at each step. We propose two examples to illustrate that result. The first one is a simple gradient descent and the second one is a toy example of 1D inpainting using wavelets.
We first consider the specific case of $H = \mathbb{R}^2$, $f = \|\cdot\|_2^p$ and $g = 0$. In this case the proximal operator of $g$ is the identity. The inertial FB can be stated as follows: (6.1) $x_n = y_{n-1} - \gamma y_{n-1} \|y_{n-1}\|^{p-2}$ and $y_n = x_n + \alpha_n (x_n - x_{n-1})$. We consider the following perturbed algorithm: (6.2) $x_n = y_{n-1} - \gamma y_{n-1} \|y_{n-1}\|^{p-2} + e_n$ and $y_n = x_n + \alpha_n (x_n - x_{n-1})$, where $e_n$ is a perturbation.
In our experiments the sequence $(e_n)_{n \geq 1}$ is a sequence of random vectors such that $\forall n \geq 1$, $\|e_n\| = \frac{C}{n^\beta}$ for a given $C$ and $\beta$, and whose directions are uniformly spread on the sphere. The minimizer of $F$ is 0 and the minimum of the function is 0. In the next figure, several choices of $\beta$ are tested and three choices of $d$ are compared: $d = 0$ (FB), $d = 1$ (FISTA) and $d = 0.5$, which is another inertial FB (iFB). The value of $F(z_n) - F(x^*)$ is also given for the last value of $n$.
For each experiment the starting point x 0 is set to (1, 0) and the curves are a mean over 1000 trajectories. Each trajectory is oscillating but the mean of 1000 trajectories is more stable and most of the time decreasing.
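A minimal Python sketch of this first experiment is given below. It follows the perturbed update (6.2) with $\|e_n\| = C/n^\beta$ and uniformly distributed directions, but several ingredients are assumptions made for illustration and are not specified in the text: the values $p = 4$, $C = 10^{-2}$, $\gamma = 0.1$, $a = 3$, the weights $\alpha_n = (t_n - 1)/t_{n+1}$ with $t_n = \left(\frac{n+a-1}{a}\right)^d$, the reduced number of trajectories, and all function names.

```python
import numpy as np

def perturbed_ifb_values(d, beta, p=4.0, C=1e-2, gamma=0.1, a=3.0,
                         n_iter=300, seed=0):
    """Run the perturbed update (6.2) in R^2 and return F(x_n) = ||x_n||^p."""
    rng = np.random.default_rng(seed)
    t = lambda n: ((n + a - 1.0) / a) ** d
    x_prev = np.array([1.0, 0.0])          # starting point x_0 = (1, 0)
    y = x_prev.copy()
    values = np.empty(n_iter)
    for n in range(1, n_iter + 1):
        theta = rng.uniform(0.0, 2.0 * np.pi)
        e = (C / n ** beta) * np.array([np.cos(theta), np.sin(theta)])
        x = y - gamma * y * np.linalg.norm(y) ** (p - 2) + e
        alpha = (t(n) - 1.0) / t(n + 1)    # assumed extrapolation weight
        y = x + alpha * (x - x_prev)
        x_prev = x
        values[n - 1] = np.linalg.norm(x) ** p
    return values

def mean_curve(d, beta, n_traj=20, **kwargs):
    """Average F(x_n) over several independent noise realizations."""
    runs = [perturbed_ifb_values(d, beta, seed=s, **kwargs) for s in range(n_traj)]
    return np.mean(runs, axis=0)

# FB (d = 0), iFB (d = 0.5) and FISTA (d = 1) for one noise decay exponent.
curves = {d: mean_curve(d, beta=1.0) for d in (0.0, 0.5, 1.0)}
```

Averaging over seeded trajectories mirrors the averaging described above while keeping the run reproducible; the ergodic sequence $z_n$ is not computed here because its definition (3.19) is not reproduced in this text.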
One can observe in Figure 1 that, as stated in Theorem 3.5, the choice $d = 0.5$ may be better than FB ($d = 0$) and FISTA ($d = 1$) depending on $\beta$. For high values of $\beta$, which means small perturbations, FISTA is better; for small values of $\beta$, which means large perturbations, FB is better; and for intermediate values of $\beta$, the choice $d = 0.5$ for inertial FB gives better results. Notice that when considering ergodic convergence, iFB gives the best result except in the case of weak noise (where FISTA does a better job). Secondly we consider a simple example of 1D inpainting. We consider a piecewise regular 1D signal $x_0 \in \mathbb{R}^N$ and a random masking operator $M$, and we want to estimate $x_0$ from $y = M x_0$ by solving (6.3) $\min_x \frac{1}{2}\|y - Mx\|_2^2 + \lambda \|Tx\|_1$, where $T$ is a Daubechies wavelet transform.
To solve (6.3) we consider $f = \frac{1}{2}\|y - M \cdot\|_2^2$ and $g = \lambda\|T \cdot\|_1$ and use FB, FISTA and iFB with $d = 0.5$. Here the proximal operator of $g$ is a soft thresholding in the wavelet domain. The iFB can be stated as follows: (6.4) $x_n = y_{n-1} - \gamma S(y_n - \gamma M y_n, \gamma\lambda)$ and $y_n = x_n + \alpha_n (x_n - x_{n-1})$, where $S(x, t)$ is the soft thresholding in the wavelet domain with threshold equal to $t$. The parameter $\gamma$ is set to 0.99. We consider the following perturbed algorithm: (6.5) $x_n = y_{n-1} - \gamma S(y_n - \gamma M y_n, \gamma\lambda) + e_n$ and $y_n = x_n + \alpha_n (x_n - x_{n-1})$, where $(e_n)$ is a sequence of random vectors such that $\forall n \geq 1$, $\|e_n\| = \frac{C}{n^\beta}$ for a given $C$ and $\beta$, and whose directions are uniformly spread on the unit sphere of $\mathbb{R}^N$. Several values of $\beta$ have been tested. For each algorithm the associated curve is a mean over 50 trajectories. For small values of $\beta$, FB is more stable, and for high values of $\beta$ FISTA is faster, but for intermediate values of $\beta$, iFB with $d = 0.5$ may be better than both of them. Notice that when considering ergodic convergence, iFB gives the best result except in the case of weak noise (where FISTA does a better job). We can observe that the set of values of $\beta$ for which the choice $d = 0.5$ is better than FB and FISTA is not the same as in the first example (see Figures 2 and 3).
gives the best result (except in the case of weak noise). iFB performs even better as soon as the ergodic convergence is considered. This can be explained by the fact that iFB oscillates around the solution, and thus averaging brings an improvement.
Appendix A. Appendices. We detail here most of the proofs of the results presented in the paper.
A.1. Proof of Lemma 3.2. Proof. The first inequality comes from a direct calculation. Let us remark that
Hence
which concludes the proof of the Lemma. One can remark that the condition $a > 2$ ensures that for all $d \in ]0, 1]$, $a > (2d)^{\frac{1}{d}}$.
On the right, iFB is compared with FISTA for the same amount of noise: it can be noticed that FISTA is still oscillating around the solution, and therefore has not converged yet.
A.2. Proof of Proposition 3.3. The proof of Proposition 3.3 makes use of several lemmas. The first one is a generalization of Lemma 1 in [5] dealing with inexact computation of the proximal operator. The original lemma, which can also be found in many other references such as [21] or [4,5], is at the core of the proof of the convergence rate of FISTA since it provides an inequality mixing values of $F$ at some points and distances between points. This lemma is a consequence of the fact that the function defining the proximal operator is strongly convex.
Proof.
Since $\gamma \in ]0, \frac{1}{L}]$, we have by convexity of $f$ that:
Since $\hat{x} = T^{\varepsilon}_{e}\,\bar{x}$, we have from Lemma 2.5 that there exists $r$ with $\|r\| \leq \sqrt{2\gamma\varepsilon}$ and:
We thus have for any $x \in H$ that:
Adding (A.2) and (A.4), we get:
The result of the Lemma follows from the fact that $F(x) = f(x) + g(x)$, and that:
The following lemma is a generalization of Lemma 5 in [7]. It uses the previous one and the convexity of $F$ to bound $t_N^2 w_N + \sum_{n=2}^{N} \rho_n w_{n-1}$ when the proximal operator is inexact. The bound depends explicitly on the errors and on $x^* - u_n$, which will be bounded using other lemmas following the ideas of Schmidt et al. [2].
Lemma A.2. If the sequence $(t_n)_{n \in \mathbb{N}}$ satisfies $t_1 = 1$, and $\gamma \leq \frac{1}{L}$, then for any $N \geq 2$, $t_n \left\langle e_n + \frac{r_n}{\gamma}, \, x^* - u_n \right\rangle$, with $\forall n \geq 1$, $\|r_n\| \leq \sqrt{2\gamma\varepsilon_n}$.
Proof. Applying Lemma A.1 to $\bar{x} = y_n$, $\hat{x} = x_{n+1}$ and $x = \left(1 - \frac{1}{t_{n+1}}\right) x_n + \frac{1}{t_{n+1}} x^*$, we find
Using the convexity of $F$ it follows
Using the definitions of $w_n$ and $v_n$, this inequality can be stated
Summing these inequalities from $n = 0$ to $n = N - 1$ leads to
Proof. The proof is almost exactly the same as the one proposed in Section 6.2.1 in [2]. It relies on a technical lemma which we recall here (and whose proof is given in [2]).
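Lemma A.4. Assume that the nonnegative sequence $\{a_n\}$ satisfies the following recursion for all $n \geq 1$: $a_n^2 \leq S_n + \sum_{k=1}^{n} \lambda_k a_k$, with $S_n$ a non-decreasing sequence, $S_0 \geq a_0^2$ and $\lambda_k \geq 0$ for all $k$. Then, for all $n \geq 1$, it holds: $a_n \leq \frac{1}{2}\sum_{k=1}^{n} \lambda_k + \left( S_n + \left( \frac{1}{2}\sum_{k=1}^{n} \lambda_k \right)^2 \right)^{1/2}$.
From (A.7), using the fact that $w_n \geq 0$ for all $n$ and that $2 v_n = \|u_n - x^*\|^2$, we get: $t_n^2 \varepsilon_n$ and (A.18) $\lambda_n = t_n \left\| e_n + \frac{r_n}{\gamma} \right\|$
we get:
and we conclude using the fact that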

Proof of Proposition 3.3. Using Lemmas A.2 and A.3, we get: $t_n \left\| e_n + \frac{r_n}{\gamma} \right\| \|x^* - u_n\|$ and (A.20)
Hence:
Proof of Corollary 3.8. Applying Lemma A.1 to $\bar{x} = y_n = x_n + \alpha_n (x_n - x_{n-1})$, $\hat{x} = x_n$, and $x = x_n$ leads to
which can be written, with the definitions of $w_n$ and $\delta_n$,
Multiplying this inequality by $(n + a)^{2d}$ and summing from $n = 1$ to $n = N$ leads to
Here the upper bound of $(n + a)^{2d} - (n + a - 1)^{2d}$ by $2d(n + a)^d$ is the same as the one used in Lemma 3.2. Theorem 3.4 ensures that there exists $C > 0$ such that $\|x_n - x_{n-1}\| \leq \frac{C}{n^d}$, which gives
If $i = 2$, $r_{n+1} = 0$, and under the hypotheses of the Corollary
which is uniformly bounded under the hypotheses of the Corollary. Hence, by Theorem 3.5, the right-hand side of the inequality in (A.23) is uniformly bounded independently of $N$, which ensures that the sequence $(n^d \delta_n)_{n \in \mathbb{N}}$ belongs to $\ell^1(\mathbb{N})$. It also follows that the sequence $(n^{2d} \delta_n)_{n \in \mathbb{N}}$ is uniformly bounded.
depending on $d$ such that $\forall j \in \mathbb{N}$,
Proof. For $d = 0$ and for all $n$, $\alpha_n = 0$, thus one can choose $C(d) = 0$. We split the proof into two cases, depending on whether $d = 1$ or $d \in (0, 1)$. Case 1: We first consider the case $d = 1$. We can observe that condition H1 implies that $a > 2$. Let us define for all $j \geq 1$ and for all $k \geq j$
and $\beta_{j,k} = 1$ if $j > k$.
Remarking that for $a \geq 1$, $(l + a)^d \leq a^d (l + 1)^d$, and then that $-a^d$
Since for all $x \in \mathbb{R}$, $x \leq e^{x-1}$, it follows that
We now bound the sum on the right-hand side of the previous inequality:
With the change of variables $u = (t + 1)^{1-d}$ it follows that
where $\frac{d}{1-d} > 0$, and we can integrate by parts:
where the expression in the bracket is exactly equal to $\frac{1}{K} e^{-K(j+1)^{1-d}} (j + 1)^d$. Let us remark that

Equations (A.26) and (A.27) lead to
It follows that there exists $j_0$ depending on $d$ such that for all $j \geq j_0$ we get $A \leq B + \frac{A}{2}$, which can be stated:
With the previous inequalities it follows that there exists $j_0$ depending on $d$ such that for all $j \geq j_0$,
To deal with small $j$ one can use the fact that for every pair $(j, k)$, $\beta_{j,k} \leq 1$, and that for all $j \leq j_0$, $\beta_{j,k} \leq \beta_{j_0,k}$, which implies that for all $j \leq j_0$
where the right-hand side of the inequality is uniformly bounded. This concludes the proof of the lemma.
We detail here the proof of
we have by using the definition of $y_n$ (A.30) $\Phi_n - \Phi_{n+1} = \delta_{n+1} + \langle y_n - x_{n+1}, x_{n+1} - x^* \rangle - \alpha_n \langle x_n - x_{n-1}, x_{n+1} - x^* \rangle$. Then, using the monotonicity of $\partial g$, we deduce that for any $z_{n+1} \in \partial g(x_{n+1})$ and for any $z^* \in \partial g(x^*)$
By definition of $x^*$, (A.31) $-\nabla f(x^*) \in \partial g(x^*)$, and using (3.4) and Lemma 2.5, there exists $r_n$ with $\|r_n\| \leq \sqrt{2\gamma\varepsilon_n}$ such that: (A.32) $\frac{y_n - x_{n+1} - \gamma \nabla f(y_n) - \gamma e_n - r_n}{\gamma} \in \partial_{\varepsilon_n} g(x_{n+1})$.