Combining asymptotic linearized inversion and full waveform inversion

S U M M A R Y A method for combining the asymptotic operator designed by Beylkin (Born migration operator) for the solution of linearized inverse problems with full waveform inversion is presented. This operator is used to modify the standard L2 norm that measures the distance between synthetic and observed data. The modified misfit function measures the discrepancy of the synthetic and observed data after they have been migrated using the Beylkin operator. The gradient of this new misfit function is equal to the cross-correlation of the single scattering data with migrated/demigrated residuals. The modified misfit function possesses a Hessian operator that tends asymptotically towards the identity operator. The trade-offs between discrete parameters are thus reduced in this inversion scheme. Results on 2-D synthetic case studies demonstrate the fast convergence of this inversion method in a migration regime. From an accurate estimation of the initial velocity, three and five iterations only are required to generate high-resolution P-wave velocity estimation models on the Marmousi 2 and synthetic Valhall case studies.


I N T RO D U C T I O N
Full Waveform Inversion (FWI) has reached today sufficient maturity for being routinely included in the industrial seismic imaging workflow. The method is based on the iterative minimization of the misfit between recorded and synthetic seismograms. The synthetic seismograms are computed through the solution of partial differential equations describing the seismic wave propagation. The minimization of the misfit function is performed over a set of parameters of these equations, which control the wave propagation. This simple formalism allows one to quantitatively estimate physical parameters such as P-and S-wave velocities, density, attenuation and anisotropy coefficients, with a resolution of the order of the wavelength.
Appearing in the early work of Lailly (1983) and Tarantola (1984), FWI was initially thought of as intractable for realistic applications. At that time, the lack of long offsets and low frequencies in the recorded seismic data were responsible for a large gap between the low and high part of the spatial spectrum (wavenumber) which could be reconstructed (Jannane et al. 1989). Besides, the computational cost of the method was seen as a major limitation (Gauthier et al. 1986).
The simultaneous development of wide-aperture broadband acquisition systems and high-performance computing facilities during the last three decades allowed for the successful application of FWI to real data in the 2-D acoustic and elastic approximations Ravaut et al. 2004;Gao et al. 2006;Brossier et al. 2009;Prieux et al. 2011;Plessix et al. 2012;Prieux et al. 2013a,b) as well as in the 3-D acoustic approximation (Sirgue et al. 2008;Plessix & Perkins 2010;Vigh et al. 2010;Warner et al. 2013). An overview of the FWI method and its application to synthetic and real case study data has been proposed by .
The two key ingredients involved in the FWI process are firstand second-order derivatives of the misfit function, namely the gradient and the Hessian operators. These ingredients are used at each iteration to construct the update that corrects the current subsurface model estimation. The gradient is built as the cross-correlation of the data residuals with the signals that would be scattered by localized perturbations missing in the initial subsurface model. In this sense, the model update provided by the gradient is an interpretation of the unexplained part of the seismograms (the residuals) in terms of single scattering effects. However, the gradient does not contain any information about the true amplitude of the missing perturbations, and the single scattering assumption is violated as soon as multiscattered arrivals have been recorded. In addition, when multiple classes of parameters are involved in the reconstruction process, trade-offs are expected as the signal scattered by a local perturbation of one parameter class may resemble the signal scattered by a perturbation of another parameter class. A review of these difficulties, especially in the multiparameter inversion framework, is given in Operto et al. (2013) and Alkhalifah & Plessix (2014).
Better accounting for the influence of the Hessian operator in the inversion process should help to mitigate these difficulties. Since the pioneering work of Pratt et al. (1998), numerous studies have been proposed on effects of the inverse Hessian operator. Applying this operator to the gradient restores the amplitude of the perturbations and corrects for artefacts generated by second-order reflections. In addition, it helps refocus the model update and mitigates the tradeoffs between parameter classes, as the Hessian operator accounts for spatial and parameter trade-offs.
These well-identified properties have led the FWI community to improve the Hessian operator approximation in the inversion schemes, progressively going from pre-conditioned steepest descent using diagonal approximations of the Gauss-Newton operator  or of the pseudo-Hessian operator (Shin et al. 2001), to non-linear conjugate gradient methods (Fichtner et al. 2008), l-BFGS approximations (Brossier et al. 2009) and truncated Newton methods (Métivier et al. , 2014Castellanos et al. 2015). The latter method is based on an approximate solution of the linear system associated with the Newton equations. This approximate solution is computed through few iterations of a linear conjugate gradient solver. This only requires the ability to compute efficiently the action of the Hessian operator on a given vector (matrix-free paradigm). This operation requires two additional solutions of the forward problem using second-order adjoint methods (Métivier et al. 2012).
Despite the importance of the inverse Hessian operator in FWI, the aforementioned techniques still rely on a quite crude approximation of this operator. Diagonal Gauss-Newton and pseudo-Hessian approaches only account for the diagonal elements. Few gradients (from 3 to 20 in practice) are used to construct the l-BFGS approximation. The number of internal iterations involved in the solution of the Newton equations should not exceed 10-20 for the truncated Newton method to remain computationally feasible. These numbers have to be compared to the actual size of the Hessian operator in realistic applications: after discretization, the Hessian operator is a square matrix of size N = O(10 6 ) in 2-D and N = O(10 9 ) in 3-D. The imperfect illumination of the subsurface causes this matrix to be poorly conditioned, making difficult to approximate the solution of the Newton equations in such a small number of iterations.
In this study, an alternative approach is investigated. The underlying idea is to consider a modification of the standard FWI misfit function that yields a misfit function with a Hessian operator tending asymptotically towards the identity. The minimization of this modified misfit function is a better conditioned problem compared to the original one. In particular, in the domain of validity of this modified misfit function, the trade-offs between parameters should be reduced.
The modification of the misfit function consists in projecting the observed and synthetic data on the model space, through a migration operator. The misfit function is thus computed as the L 2 norm of residuals in the model space, instead of being defined in the data space. The migration operator that is used is the one proposed by Beylkin & Burridge (1990), developed in the framework of inversion as an approximate inverse of the Born modelling operator (Beylkin 1985(Beylkin , 1987Bleistein et al. 1985;Bleistein 1987;Beylkin & Burridge 1990;ten Kroode 2012).
The dominant part of the Hessian operator of the modified misfit function (the Gauss-Newton part) is a normal operator associated with a modified Born modelling operator. As a result of the introduction of the migration operator in the misfit function, this modified Born modelling operator matrix is equal to the Born modelling operator left-multiplied by the Beylkin operator. This operator tends asymptotically towards the identity. As a consequence, the normal operator associated with this modified Born operator tends asymptotically towards the identity.
A first attempt to use the Beylkin operator in the framework of least-squares inversion has been proposed by Jin (1992), merging two apparent disconnected methods. The underlying idea was to introduce the asymptotic inverse in the misfit function as a data weighting operator to produce a modified misfit function with a nearly diagonal Hessian operator. Unfortunately, as the weighting operator is a mapping from the data space to the model space, the resulting misfit function depends on the space location, which prevents the use of standard non-linear optimization techniques. A similar idea has also been exploited by Sevink & Herman (1996) for accelerating the solution of a least-squares migration problem.
The strategy proposed here may be interpreted as applying a migration algorithm to the data residuals, meaning that the modelled and the recorded data are compared in the migrated (spatial) domain. This shares some similarities with the Differential Semblance Optimization (DSO) methods introduced by Symes & Carazzone (1991) as a particular Migration Velocity Analysis (MVA) method (Symes 2008). In these methods, however, the philosophy is different. The data are migrated using a particular background velocity model in an extended model space. The quality of the background velocity model is assessed using a semblance criterion defined in this extended image domain (invariance along the offset for instance). This semblance criterion is used to define a new misfit function in the extended image domain, and the background velocity model is updated to reduce this misfit.
In the method presented in this study, the velocity model used to perform the migration is kept constant, equal to the initial guess, throughout the iteration process. The migration of the data residuals could be interpreted as a change of metric for comparing synthetic and observed data. The residuals are projected in the model space along the geodesics defined by selected ray paths. The minimization is performed to fit the converted data in this new space, and the migration is not used to assess the quality of the background model. The quality of this model may be poor: in this case the asymptotic approximation of the Hessian operator will be far from the identity operator. However, throughout the iterative process, FWI may still allow to correct the subsurface model that is inverted.
In Section 2, the main ideas from the work of Beylkin (1985), leading to the definition of an approximate inverse of the Born operator, are exposed. The Section 3 presents how the asymptotic migration operator is used within this context to modify the FWI misfit function yielding a better conditioned minimization problem. Numerical experiments on synthetic case studies are presented in Section 4 in a 2-D acoustic frequency-domain context for the reconstruction of the pressure wave velocity only (mono-parameter inversion). The advantages and the limits of the approach are presented. The introduction of the Beylkin operator within the misfit function acts as a scattering-angle-based filtering on the seismic data. A stronger weight is given to reflection data, detrimental to data associated with diving waves. This hampers the reconstruction of the long wavelength of the models. On the other hand, starting from an accurate initial velocity model, the gain in convergence speed is substantial. This study thus suggests that the method presented here may be well suited in a quantitative migration context, for the reconstruction of high-resolution estimates of the subsurface parameters, as it is the case for instance in Métivier (2011) and Métivier et al. (2011). In addition, it shows that far from the high-frequency regime, the asymptotic approximation has still a significant effect on the conditioning of the FWI problem.

Definitions and notations
The domain of investigation is defined as where the dimension is denoted by n ∈ N (in practice n = 2 or n = 3). In the acoustic approximation, the frequency-domain pres- where the angular frequency is denoted by ω, the space variable by x ∈ R n , the density by ρ(x), the source by s(x, ω) and the bulk modulus by κ(x). In the following, the source will be considered as a Dirac impulse such that Therefore, the source is parameterized only by its position x s . The receivers are assumed to be located on a part of the border of the domain denoted by ∂ r . Similarly, the sources are assumed to be located on ∂ s . The generic subsurface parameter notation m(x) is used, such that The data are defined as a function of the angular frequency ω, the receivers and source positions x r and x s respectively, taking its values in the complex plane C. This functional space is defined as D. The synthetic data operator is a mapping from the model space M to the data space D such that The operator d cal is referred to as the forward problem operator in the sequel. This operator depends non-linearly on m, as where the solution of eq. (2) for a source located in x s and a subsurface model m(x) is denoted by p(x r , ω; x s , m).

Linearized inversion
The first step of linearized inversion consists in a linearization of the forward problem operator. The subsurface model m(x) is decomposed as the sum of a background model m 0 (x) and a perturbation of this background δm(x) such that where δm(x) is such that The first-order Taylor development of the forward problem operator is written as where . M denotes the L 2 norm in the model space M . The operator ∂d cal /∂m(m) is known as the Born modelling operator or Jacobian operator. It is denoted by J(m) throughout this study.
Considering that a data set d obs has been recorded, and a background model m 0 is known, the linearized inverse problem can be written as the reconstruction of δm such that The strategy proposed by Beylkin for solving this problem includes three steps.
(i) Use the first-order Born approximation to derive an analytic expression for J(m 0 ) depending on the Green functions G 0 (x, ω, x s ) and G 0 (x, ω, x r ) associated with eq. (2).
(ii) Compute an asymptotic approximation of J(m 0 ), denoted by J (m 0 ), through the asymptotic approximation of the Green functions G 0 (x, ω, x s ) and G 0 (x, ω, x r ).
(iii) Compute an approximate left inverse of J (m 0 ), denoted by B(m 0 ). The latter operator is roughly equivalent to a preservedamplitude migration operator.
The approximate solution m * of the linearized inverse problem computed by Beylkin is thus These three steps are reviewed in the following subsections.

First-order Born approximation
The first-order Born approximation is commonly used in seismic inverse problems for the linearization of the forward problem [see for instance Symes (1995)]. In this context, the pressure wavefield p(x, ω; x s , m) is decomposed into a reference pressure wavefield p 0 (x, ω; x s , m) solution of eq. (2) in the reference model m 0 and a perturbation δp(x, ω; x s , m), such that The perturbed wavefield δp(x, ω; x s , m) is solution of eq. (2) with a new source term depending on model perturbation δm and the reference wavefield p 0 (x, ω; x s , m): From eq. (13), we deduce that Eq. (14) is an integral representation of the Born modelling operator J(m 0 ), acting as a mapping from the model space M to the data space D, such that ∇G 0 (x r , ω, y)∇ G 0 (y, ω, x s )dy. The high-frequency approximation is introduced by considering the simple asymptotic approximation of the Green function G 0 (x, ω, y): where only one phase is considered. In complex media, more than one phase may be considered. In this introductory study, only the first arrival (from the source point y to the scattering point x in the above expression) is taken into account, although multi-arrival extensions are possible. In the expression (16), the phase term T(x; y) is the solution of the eikonal equation. It can be computed using a fast marching method (Vidale 1988;Podvin & Lecomte 1991) or a fast sweeping method (Zhao 2005) as only the first arrival is considered in this work. As no intrinsic attenuation is accounted for, the amplitude term A(x; y) is related to the geometrical spreading.
The geometrical spreading at a point x for a source located in x s can be evaluated using the true ray connecting x and x s and the paraxial quantities around this ray. Details on how to compute in practice these quantities are given in Appendix A. From the expression (16), the gradient of the Green function ∇G 0 (x, ω; y) satisfies iωT (x;y) . (17) Following the high-frequency approximation, the first term is neglected and the gradient of the Green function is approximated as Plugging expressions (16) and (18) into eq. (14) yields the relation where the following compact notations for ray quantities are used: A(x r , y, x s ) = A(x r ; y)A(y; x s ), T (x r , y, x s ) = T (x r ; y) + T (y; x s ).
The scattering matrix W(x r , y, x s ) is defined as and the diagonal matrix M 0 (y) is In the expression of the scattering matrix, the angle θ (x r , y, x s ) is the illumination angle at the scattering point y, formed by the pair of rays connecting the scattering point y to receiver position x r and source position x s as illustrated in Fig. 1. This angle is related to phases T(x r ; y) and T(y; x s ) through the equation

Computation of B(m 0 )
The asymptotic approximation of the linearized forward operator J (m 0 ) is In this section, the computation of an approximate left inverse B(m 0 ) to J (m 0 ) is presented. The operator B(m 0 ) is a mapping from the data space to the model space Beylkin (1985) defines the operator B(m 0 ) as a weighted adjoint of J (m 0 ) : where the weighting function h(x r , x, x s ) is yet to be described. As in Beylkin (1985), the integral operator F given by the composition of B(m 0 ) and J (m 0 ) is introduced: The expression of F at the scattering point x is given by an integral over scattering positions y, expressed as The most singular part of the operator should be dominant in the asymptotic approximation. The expression of F is thus localized around the scattering point x. This localization amounts to neglect the smoothest terms of the operator (Beylkin 1987). To this purpose, the first-order development of the function T(x r , x, x s ) is considered, as well as the zero-order development of the functions depending on y at point x in the integral (28). This yields the following simplified operator: We recognize the expression of a wavenumber in the argument of the exponential: Following the strategy proposed in Forgues (1996), the following change of variables is considered: where In this expression, the angles formed with the vertical axis by the ray connecting the receiver (respectively the source) to the point x are denoted by φ r (x r , x) (respectively φ s (x s , x)), as presented in Fig. 1. The inverse of this change of variables is denoted by Applying this change of variables to eq. (30) yields Furthermore, the scattering matrix W(x r , x, x s ) only depends on the illumination angle θ in the new parameterization (Forgues 1996), a justification of this parameterization selection: The following choice for the function h(x r , x, x s ) is introduced: (let us recall that n ∈ N denotes the dimension of the model space and is equal to 2 or 3 in practice). The matrix is assumed to be invertible. This yields Straightforward simplifications yield where the symbol . denotes the Fourier transform operator. Beylkin (1985) demonstrates that the operator P such that is a pseudo-differential operator which can be represented as the following sum: where the operator I is the identity, and the operators T i belong to increasingly smooth classes of pseudo-differential operators. This expansion depends on the angles min and max that characterize the illumination of the subsurface through the acquisition system. In the limit case we recognize in the double integral of the right-hand side of the expression (42) the Fourier transform in polar coordinates of the Dirac delta function (Forgues 1996). In this case P = I . The result from Beylkin may be interpreted as follows: the pseudo-differential operator P tends asymptotically to the identity as the illumination of the subsurface increases to a complete coverage. From this, it can be deduced that Physically, the incomplete illumination of the subsurface yields a pseudo-differential operator associated with a filtered Dirac function instead of an exact Dirac function. The properties of this filtered Dirac function are studied in Lambaré et al. (2003). The build-up of the operator B is thus an approximate left inverse to the linear operator J . The explicit computation of the operator B is presented in Appendix B in the 2-D mono-parameter case.

The FWI problem
The FWI problem is defined as the non-linear least-squares minimization problem where the complex modulus operation is denoted by the symbol |.|. This problem is solved through local non-linear optimization algorithms. In the framework of these methods, a sequence m k (x) is built from an initial guess m 0 (x), such that where the scalar parameter α k ∈ R is computed through a linesearch or a trust-region globalization process (Bonnans et al. 2006;Nocedal & Wright 2006). In the particular case of the Newton method, the parameter α k is set to 1 and the model update m k is given by In this expression, the operators H(m k ) and ∇f(m k ) are respectively the Hessian and the gradient of the functional f(m). These two key components of the FWI method can be expressed in terms of the Born modelling operator J(m) as

Link with the linearized inverse problem
The usual approximation of the Newton method consists of considering only the first term of the right-hand side in the expression of the Hessian operator. This is referred to as the Gauss-Newton approximation. In this context, the model update at iteration k is the solution of the linear system (50) This equation can be rewritten in terms of the Born modelling operator and the model update only: Eq. (51) is important, as it reveals that, in the Gauss-Newton approximation, the model update at iteration k is computed as the least-squares solution of the system: The first iteration of the Gauss-Newton method thus amounts to compute the subsurface model as where m 0 satisfies in the least-squares sense.
The connection with the linearized inversion can be made here. Eq. (54) is similar to the linearized inverse problem (10), where m 0 is replaced by δm. These two model updates are two approximate solutions of the same problem. They differ in the approximation used to compute them. In the linearized inverse method proposed by Beylkin, the preserved-amplitude migration operator B(m 0 ) is used. In the context of the Gauss-Newton method, the solution of this equation in the least-squares sense is computed numerically .

Introducing the asymptotic inverse within the FWI process
This statement raises the question whether the preserved amplitude migration operator B(m 0 ) could be used in the FWI process to accelerate its convergence. A first attempt to treat this question was proposed by Jin (1992) and later on pursued by Forgues (1996), Thierry et al. (1999a,b) and Lambaré et al. (2003). In these studies, the inverse problem is defined as the least-squares minimization of the misfit function This problem may be seen as an intermediate between the linearized inversion and FWI. The forward problem is linearized through the ray-Born approximation (Moser 2012); however, the solution is computed in the least-squares sense. The idea introduced by Jin (1992) is to add a proper scaling in the definition of the associated misfit function, so that the Hessian of this new misfit function should be close from the identity in the asymptotic regime. The scaling acts on the data and corresponds to a weighting function Q(x, x r , x s , ω) derived from the one introduced by Beylkin in its definition of the asymptotic inverse of the linearized forward operator. However, since the weighting Q(x, x r , x s , ω) depends on the space variable x ∈ R n , the resulting misfit function actually depends on the space variable x. This raises substantial difficulties, as for numerical optimization methods to be used, the misfit function should be positive and scalarvalued.
A new approach is proposed in this study, where we make both numerical and asymptotic approximation collaborating for the model reconstruction, while Jin (1992) and following workers were using only the asymptotic regime for least-squares migration. This approach consists in minimizing the misfit function g(m) defined as where the integral operator B(m 0 ) is the Beylkin operator: an asymptotic approximation of the left inverse of J(m 0 ). Following the Jin's interpretation, the integration of the Beylkin operator may be seen as the definition of particular weights for each triplet source, receiver and frequency, depending on the scattering point which is considered. The final misfit is the integration in space of all these local misfit functions. The expression for the gradient of g(m) is Compared to the gradient of the standard misfit function f(m) (eq. 49), the difference induced by the integration of the preconditioning operator relies on the multiplication of the residuals by B(m 0 ) and its adjoint B(m 0 ) * . These two multiplications can be interpreted as migration and demigration operations on residuals. Indeed, the asymptotic operator of Beylkin is similar by guest on April 15, 2015 http://gji.oxfordjournals.org/ Downloaded from to the Born migration operator. This mapping from the data space to the model space is performed following the geodesics defined by rays connecting sources and receivers to the scattering point of the investigated domain. In this work, a unique ray connects each scattering point to each sources and receivers. This is the consequence of the first-arrival assumption. The adjoint operation is a demigration, which is obtained by the application of the adjoint Beylkin operator to a function of the model space. The gradient of the misfit function is thus computed as the correlation between the data generated by localized perturbation in the medium in the single scattering approximation and the residuals obtained after a migration/demigration process. As it is demonstrated in the numerical experiments, this process acts as an illumination angle based filter of the residuals, which emphasizes small illumination angles detrimental to large illumination angles.
In addition, at the first iteration In the previous section, following the work of Beylkin, the latter operator has been proved to be a pseudo-differential operator that tends towards the identity as the illumination of the subsurface increases. As a pseudo-differential operator, this property holds for its adjoint. In addition, the composition of two pseudo-differential operators that tends to the identity tends also towards the identity. Therefore, H g GN (m 0 ) can be approximated as where the operators T i belong to classes of increasingly smooth operators. The introduction of the integral operator B(m 0 ) in the misfit function thus yields a better conditioned inverse problem, as the Hessian operator associated with the new misfit function should be closer to the identity. The definition of g(m) given in eq. (56) assumes implicitly that the pre-conditioning operator B(m 0 ) is computed once for all in the initial model m 0 . In fact, one may select a suitable model m b as the background model where the pre-conditioning operator B(m b ) is built. Of course, a more elaborate strategy could consist in introducing another non-linearity in g(m) by letting the preconditioning operator depend on the current subsurface model m through an updated background model. However, this brings an additional complexity to the problem, not only in terms of computational effort but also in terms of optimization strategy, as the computation of the gradient of g(m) should be modified accordingly. This study is thus restricted to the case where the pre-conditioning operator is computed in the initial model m 0 .

Experimental settings and implementation
The aim of the three experiments proposed in this section is to compare standard L 2 norm-based FWI results with the results obtained when the modified misfit function g(m) introduced in the previous section is used. This comparison is performed in the framework of 2-D acoustic frequency-domain FWI for the reconstruction of the P-wave velocity only. The modelling engine for the solution of the 2-D frequency-domain acoustic equations is based on the mixed-grid finite-difference scheme of Hustedt et al. (2004). The direct solver MUMPS (Amestoy et al. 2000) is used to solve the associated linear system. This approach benefits from the efficiency of direct solvers for handling multiple right-hand sides associated with multiple seismic sources.
The optimization scheme which is used to minimize the misfit functions is the l-BFGS algorithm (Nocedal 1980). The choice of this algorithm is made as it only requires the computation of the gradient of the misfit function and shows however improved convergence properties regarding standard steepest descent or non-linear conjugate gradient algorithms. The parameter l sets the number of gradients stored to build the l-BFGS approximation of the inverse Hessian, which is 20 in this study (Nocedal & Wright 2006). The adjoint state method is used to compute the gradient of the misfit functions (Chavent 1974;Plessix 2006). From the implementation point of view, the l-BFGS optimization is performed with the SEIS-COPE optimization toolbox, which is interfaced within the code used to achieve these experiments .
The strategy retained for implementing the asymptotic preconditioning strategy consists in computing the Beylkin operator in discrete form off-line. While the first case study is small enough for this operator to be computed explicitly on the same grid as the one used by the finite-difference scheme for the wave propagation modelling, in the second and third case studies (Marmousi2 and Valhall experiments), a compression strategy is applied to reduce the memory requirement and the computation cost. Indeed, there is no need for using a fine sampling to discretize the operator B(m 0 ). From its expression in eq. (B17), we see that it is composed of a product of smoothly varying terms multiplied by an oscillatory function. This oscillatory function is an imaginary exponential of the traveltime functions. Again, the traveltime functions are smoothly varying functions, which do not require a fine sampling for an accurate discrete approximation (Thierry et al. 1999a;Alkhalifah 2011). This allows to reduce the tremendous memory requirement that would be associated with a discretization of B(m 0 ) using the fine finitedifference grid of 25 m. In practice, in the second and third case studies, we sample the function B(m 0 ) with a discretization steph equal to 100 m both from the model side and the acquisition side, which correspond to a coarsening ratio of 4 in the model dimension and 2 in the acquisition dimension as the original spatial sampling for sources and receivers is equal to 50 m.

A canonical test case: reconstruction of two inclusions in a smooth background model
A simple synthetic example is first presented, based on the smooth P-wave background model presented in Fig. 2(a). The domain is 775 m deep and 2275 m wide. Two Gaussian perturbations are added to this model [ Fig. 2(b)], centred at the depth z = 500 m. The largest is horizontally centred at x = 1100 m (perturbation 1). The smallest is located on the right at x = 1750 m (perturbation 2). The maximum The following normalization is applied to compare the diagonal dominance pattern of the operators. For a given matrix A, the normalized matrixÃ is defined entry wise as The results are presented in Fig. 3. The refocusing of largeamplitude elements near the diagonal operated by the action of the operator B(m 0 ) can be clearly observed. While the standard Gauss-Newton operator presents a hardly visible band matrix pattern [ Fig. 3(a)], the operator B(m 0 )J (m 0 ) presents a strongly diagonal dominance pattern and appears as a narrow-band matrix [ Fig. 3(b)]. The normal matrix associated with this matrix preserves this pattern and is also narrow-band [ Fig. 3(c)].
To complement the analysis of these operators, the distribution of their eigenvalues is presented in Fig. 4. The spectrum of the standard Gauss-Newton operator presents a rapid decrease after the 200th eigenvalue. In contrast, the spectrum of the operator B(m 0 )J (m 0 ) and its associated normal operator H g G N (m 0 ) presents a slower decrease. This indicates the better conditioning of these operators compared to the standard Gauss-Newton operator.
The right most perturbation introduced in the background model is smaller in size than the other and is not centred with respect to the acquisition system. Therefore, its imprint in the data should be significantly smaller than the imprint of the largest perturbation. It is thus expected that, in the first iterations, the L 2 norm-based gradient focusses mainly on the first perturbation and neglects the second perturbation. The modification of the misfit function through the Beylkin operator should balance this focusing. The projection of the residuals in the model space should allow to give more emphasis to the data associated with the smaller perturbation. It is thus expected that the gradient of the modified misfit function g(m) provides an update accounting for the two perturbations. As discussed in the previous section, the modification of the misfit function amounts to only modifying the computation of the data residuals. In  standard L 2 norm residuals, are compared with the migrated/demigrated residuals, Fig. 5 is organized by frequencies: the 10 panels correspond to the 10 different frequencies from 3 to 12 Hz. Each panel is a presentation of the real part of the residuals in the frequency domain in the receiver/source plane. The top panels [ Fig. 5(a)] present the residuals associated with the standard L 2 norm, while the bottom panels [ Fig. 5(b)] present the residuals after migration/demigration. For the L 2 norm, at low frequency, the residuals seem to be associated only with the first inclusion. The imprint of the second inclusion is present only at higher frequencies, from 9 to 12 Hz. Along all the bandwidth, signals associated with receivers and sources at a larger distance from the targets are also visible. These signals correspond to larger offset data recorded by the acquisition system. In contrast, the imprint of the two inclusions is clearly visible in the residuals associated with the modified misfit function g(m) from lower frequencies. Already at 4 Hz, the effect of the second inclusion is visible in the residuals. Another difference is that the long-offset residuals are filtered. The migration/demigration operation results in an emphasis on the information associated with near-offset data. The illumination angle filtering resulting from the application of the operator B(m 0 ) * B(m 0 ) to the data residuals can be related to the mathematical definition of the Beylkin operator and its adjoint, given in Appendix B. From the formulas (B17) and (B18), it can be seen that a multiplication by a factor (cos (θ) + 1) 2 is applied to each residual components, which implies the observed emphasis of small illumination angles.
The gradients associated with the standard L 2 norm-based misfit function and the modified misfit function are presented in Fig. 6. For a fair comparison, the same scaling is applied to both of the gradients. As expected, the L 2 norm-based gradient [ Fig. 6(a)] mainly focusses on the first inclusion, which is centred with respect to the acquisition system, and is larger in size. In contrast, the gradient of the modified misfit function provides a more balanced update for the large and small inclusions [ Fig. 6(b)]. On the other hand, positive/negative oscillations around the inclusions are introduced, which resembles limited bandwidth effects observed in the reconstruction provided by asymptotic linearized inverse methods [see for instance Thierry et al. (1999a)]. The balance in the inclusion reconstruction and these oscillations are the imprint of the introduction of the Beylkin operator in the misfit function.

Application to the Marmousi model
The properties of the modified misfit function are analysed on the acoustic Marmousi2 model (Martin et al. 2006). The exact and initial models are presented in Fig. 7. The domain size is 3 km deep and 17 km wide. The initial model is obtained by a Gaussian smoothing of the exact model with a correlation length equal to 375 m. This yields an accurate initial model. A fixed-spread surface  acquisition system with 336 sources and receivers located at 50 m depth and spread along the upper water layer from x = 0.15 km to x = 16.9 km each 50 m is used. The frequency bandwidth for the data goes from 3 to 7.5 Hz. The P-wave velocity in the water layer is supposed to be known. It is kept constant, equal to 1500 m s −1 , throughout the inversion process. The spatial discretization h for the finite-difference scheme is set to 25 m both in vertical and horizontal directions, which yields a 141 × 681 point discrete model. As mentioned preliminary, the operator B(m 0 ) is discretized over a coarse grid with a discretization steph = 100 m. A fine frequency discretization is used, with a sampling of 0.25 Hz, which yields a data set with 19 discrete frequencies from 3 to 7.5 Hz. Numerical experiments presented in the sequel show that this fine discretization is required for the asymptotic pre-conditioning to be accurate.
The standard L 2 norm residuals are compared with the migrated/demigrated residuals for five frequencies (3, 4, 5, 6 and 7 Hz) in Fig. 8. As in the previous experiment, for each frequency, a panel presents the real part of the residuals in the source/receiver plane. The migration/demigration process operates a strong filtering of the residuals associated with long-offset data and focusses on near-offset data. This is again the impact of the angle illumination filtering associated with the Beylkin operator. This may result in emphasizing reflection data at the expense of diving-wave data.
In Fig. 9, the model update provided by the solution of the linearized inversion problem following the Beylkin strategy is compared with the gradient in the initial model of the standard FWI misfit and the modified misfit introduced in this study. Spatial aliasing corrupts the update δm. This is due to the sampling of B(m 0 ) on the coarse grid. However, it is still visible that δm is a migration result, with clearly delineated interfaces, in shallow and deep parts of the model. The main structure of the exact model appears. The gradient provided by standard FWI only focusses on smooth updates of the shallow part. No modification is provided deeper than 2 km. This gradient is driven by diving-wave data that sample the shallow part of the model and provide long-wavelength updates of the model. In contrast, the gradient associated with the modified misfit function resembles strongly the model update δm. However, the spatial aliasing is removed, and the amplitudes are balanced between shallow and deep parts of the model. In addition, a smooth, low-wavenumber correction is provided in the shallow zone between z = 0.5 km, z = 1.5 km and x = 1, x = 6 km. The gradient of the modified misfit function thus seems to be mainly based on the model update δm but also incorporates information resembling the one provided by standard FWI, through the fact that the calculated data in the residuals are computed with a full two-way wave equation rather than with the linear ray+Born approximation. The effect of the frequency sampling on the accuracy of the Beylkin operator B(m 0 ) is presented in Fig. 10. Coarsening the frequency sampling to 0.5 and 1 Hz yields important artefacts on the gradient of the modified misfit function g(m). This effect is consistent with the formulas (42)   infinity. Approximating this integral with a too coarse frequency sampling decreases the accuracy of the approximation.
The effect of the compression used to compute the Beylkin operator B(m 0 ) is also presented in Fig. 11. Not surprisingly, if the operator is approximated on a too coarse grid in space, spatial aliasing will not be totally removed in the gradient of the modified misfit function g(m). The coarsening which is used should be actually chosen consistently with the expected resolution. Here, as the frequency content of the data does not exceed 7.5 Hz, a spatial discretization of 100 m is enough for obtaining a smooth P-wave velocity gradient.
The inversion results obtained with the modified misfit function after 3 and 10 l-BFGS iterations are presented in Fig. 12. These results have to be compared with those obtained using the standard L 2 norm after 3, 10 and 40 l-BFGS iterations, presented in Fig. 13. A diagonal pre-conditioning based on the pseudo-Hessian operator is included in the l-BFGS algorithm for the standard FWI method (Shin et al. 2001). Details on practical implementation of this pre-conditioning strategy can be found in .
While in standard FWI the estimation first focusses on the shallow part and progressively reconstructs the deeper parts of the model, the modification introduced by the Beylkin operator allows to reconstruct deep structures in the very first iterations. This is due to the focusing of the method on smaller illumination angles and reflection data. A high-resolution estimation of the subsurface model is yielded directly in the first iterations. The progress between the third and the tenth iterations indicates a reconstruction of amplitudes rather than the building of interfaces. As a counterpart, the shallow part of the model, sampled both by diving and reflected waves, is less constrained when using the modified misfit function. The estimation in this zone is less accurate than the ones obtained with a standard FWI method. This seems a limitation of the approach, caused by the filter applied to diving waves induced by the asymptotic operator.
To further investigate this issue, vertical P-wave velocity profiles are presented in Fig. 14 is not recovered. The results obtained after 10 l-BFGS iterations using the modified misfit function show that the trend is recovered both in shallow and deep parts around the initial model profile. This suggests that the low-frequency part of the model is not updated. The reflectors are thus located at their proper locations, but a lack of lowwavenumber updates is visible. This analysis is confirmed in Fig. 15 where the frequency content of the vertical profiles is presented. The profiles obtained from 40 iterations of pre-conditioned l-BFGS recover the low-frequency part of the exact profiles. The profiles obtained after 10 iterations of l-BFGS using the modified misfit function clearly present a gap in the low-wavenumber part of the spectrum.

Application to the synthetic Valhall model
The numerical experiments are completed by an application on the synthetic Valhall model presented in Fig. 16, together with the initial model which is used. The model is deeper than the Marmousi 2 model; its lateral extension is smaller: it is 5 km deep and 8.7 km wide. It presents interesting low-velocity layers associated with the presence of gas and a strong curved reflector located above z = 3 km. As in the Marmousi 2 experiment, a fixed-spread surface acquisition system is used with 173 sources and receivers located at 50 m depth and spread along the horizontal axis from x = 0 km to x = 8.65 km each 50 m. The frequency bandwidth for the data goes from 3 to 7 Hz. The frequency sampling is equal to 0.25 Hz, as in the Marmousi 2 experiment, which yields a data set with 17 discrete frequencies. The spatial discretization h is set to 25 m both in vertical and horizontal directions, which yields a 207 × 349 point discrete model. Compared to the Marmousi2 case study, the reduction of the maximum offset of the acquisition system and the good quality of the initial velocity model emphasize the importance of reflected waves rather than diving waves for recovering the exact P-wave velocity model. It is expected that in this situation, the approach based on the modified misfit function g(m) provides good results.
In Fig. 17, the residuals associated with the standard L 2 normbased FWI are compared to the residuals after migration and demigration using the Beylkin operator B(m 0 ) and its adjoint. The same presentation as for the previous case studies is used: the five panels are associated with the frequencies 3, 4, 5, 6 and 7 Hz. Each panel is a presentation of the residuals in the source/receiver plane. The migration/demigration operation focusses again on near-offset data, detrimental to far-offset data. In Fig. 18, the model update δm associated with the Beylkin migration (eq. 66) is compared to the gradient in the initial model associated with the standard FWI misfit function and the modified misfit function. As a migration result, the model update δm [ Fig. 18(a)] yields a good reconstruction of the interfaces present in the exact velocity model presented in Fig. 16. Because of the coarse sampling of B(m 0 ), the update suffers from spatial aliasing, although this is less visible than in the Marmousi 2 experiment. The FWI gradient [ Fig. 18(b)] focusses mainly on the shallow part of the model. A smooth, low-wavenumber update of the zone above the first kilometre is provided, associated with transmitted energy. Two dark wings also appear connecting both lateral extremities of the acquisition to the zone in the centre. These artefacts are due to the incompleteness of the acquisition. The modified misfit gradient [ Fig. 18(c)] appears as a smoother version of δm. The spatial aliasing effect is removed. The amplitude of the model update is corrected. A very shallow reflector at z = 0.5 km also appears, which is not visible in δm.
The results obtained after 5 and 10 l-BFGS iterations using the modified misfit function g(m) are presented in Fig. 19. After five iterations, the main interfaces already appear clearly in the estimated model. The gas layers are accurately reconstructed. The main reflector at z = 3 km is well delineated from x = 1 km to x = 7 km. Additional iterations (from 5 to 10) allow to restore more accurately the amplitude and to improve slightly the resolution of the estimation. The results obtained after 5, 10 and 20 l-BFGS iterations using the standard FWI misfit function with a diagonal pre-conditioning strategy are presented in Fig. 20. At least 20 iterations are required for the standard FWI method to be able to reconstruct satisfactorily the gas layers in the middle of the domain [Fig. 20(d)]. However, the deepest part of the model, including the main reflector at z = 3 km, is less precisely estimated in this case. In comparison, this reflector is more accurately estimated when the modified misfit function g(m) is used. Here, the standard FWI method faces an important difficulty as increasing the number of iteration does not seem to improve further the solution. After 40 iterations, the main reflector appears more clearly only between x = 1 km, x = 2 km and x = 5 km, x = 7 km. The reconstruction of the gas layers starts to reveal some instabilities, despite the synthetic data are not corrupted by any noise. Note for instance the high-amplitude artefact appearing in red at z = 2.5 km, x = 5 km [Fig. 20(e)]. This instability is not observed when the modified misfit function g(m) is used.  To complement this analysis, P-wave velocity vertical profiles of the reconstructed models are presented in Fig. 21. The profiles are extracted at x = 2.1 km, x = 4.4 km and x = 6.5 km. The reconstructed models obtained after 10 l-BFGS iterations using the modified misfit function g(m) and 20 l-BFGS iterations using standard FWI with a diagonal pre-conditioning are compared to the exact and the initial models. For the three logs, the model reconstructed using the modified misfit function is the closest to the exact profile, especially in depth (z > 2 km). This is consistent with the observations made on the 2-D reconstructions. However, as for the Marmousi2 case study, the updates obtained using the modified misfit function seems to lack low-wavenumber content. This observation is supported by the 1-D spectral analysis of these updates presented in Fig. 22.

D I S C U S S I O N
The results presented in this study provide important information. First, it should be noted that despite working with FWI far from the high-frequency regime, with a seismic data bandwidth comprised between 3 and 7.5 Hz, the asymptotic approximation still brings valuable information on the propagation operators involved in the inversion process. In particular, the left inverse devised by Beylkin from the asymptotic approximation of the Born modelling operator is still a relatively good left pre-conditioner for the 'true' Born modelling operator, computed with the full wavefield. This is emphasized in the first case study, where these operators are presented (Fig. 3). This result indicates that the asymptotic approximation is meaningful and could be used to build pre-conditioners even far from the high-frequency regime.
Second, introducing the Beylkin operator within the FWI scheme through the definition of the modified misfit function (56) is not without having some consequences on the inversion process. The filtering applied on the data residuals associated with wide scattering angles should be underlined. The method amounts to significantly focus on the short-offset data rather than long-offset data, as can This makes the method similar to a migration process rather than a full waveform inversion process as long-offset data maybe not accounted for accurately.
Let us recall the expression of the modulus of the reconstructed wavenumber model |k| in the context of diffraction tomography (Devaney 1982),  where θ is the illumination angle, f the dominant frequency of the data and c(x) the local velocity. This equation shows that the loss of sensitivity to long-offset data decreases the capability of the method to retrieve low-wavenumber components of the P-wave velocity model. From the expression of the Beylkin operator, one could think, however, that this effect is counterbalanced by the frequency weighting strategy embedded in this operator. Indeed, from eq. (B17), one can see that a scaling factor equal to the inverse of the frequency is applied to the data, which gives more weight to low-frequency data rather than higher frequency data in the inversion. This makes possible to enhance the reconstruction of low-wavenumber components of the P-wave velocity model, according to the expression (67). This strategy, named as spectral shaping, is presented in Lazaratos et al. (2011) and Plessix (2013). However, the numerical results presented here tend to indicate that the compensation induced by this frequency weighting is not sufficient: the illumination angle based filtering has a stronger influence on the reconstruction process.
Nevertheless, if the low-wavenumber components of the solution is contained in the initial model, it appears that the modification of the misfit function improves significantly the convergence speed of the minimization process. This suggests that the modified misfit function could be used for instance for accelerating non-linear quantitative migration algorithms, such as in Métivier et al. (2011) andMétivier (2011) for the reconstruction of the acoustic impedance around a well with walkaway data, with a given velocity model. In Zhou et al. (2014a,b), a method is proposed for the coupled reconstruction of the low-wavenumber components of the P-wave velocity model and the high-wavenumber components of the P-impedance. At each iteration of the process, a non-linear minimization has to be performed to compute the reflectivity model with the current velocity model: the method proposed here could also be applied to accelerate this process.
In terms of practical implementation, the computation of the Beylkin operator amounts to the computation of a large-scale dense matrix. This requires a significant computational effort and has a strong impact on its memory requirement. A coarsening strategy has to be applied in accordance with the frequency content of the data which is inverted. Practicing an off-line computation of the operator as a prior stage to the inversion reduces the computational cost as the quantities from the Beylkin operator are computed only once. As a counterpart, this requires the ability of storing the dense matrix on the coarse grid. A trade-off could be found by recomputing some of the matrix entries at each iteration. Another possibility also consists in using compression strategies, such as the ACA method to devise H-matrix approximation of the Beylkin operator (Bebendorf 2008).
Finally, the approach is presented here in the frequency domain for the sake of simplicity. However, the strategy can be extended to time-domain inversion. In the work of Beylkin (1985), the derivation of the migration operator was initially performed in the time domain. In practice, the frequency-domain inversion requires a fine sampling of the frequency integration domain (see Fig. 10). The computational gain associated with frequencydomain approach for FWI relies on the possibility of exploiting data redundancy and invert for few discrete frequencies. Therefore, in this case, it seems that there are no advantages to keep a frequency-domain approach. The method could thus be considered for accelerating the convergence of time-domain least-squares migration algorithms, which are known to be intensively time consuming.

CONCLUSION
This study presents a strategy for combining concepts inherited from linearized inversion with non-linear least-squares inversion scheme such as FWI. In particular, the preserved-amplitude migration operator derived by Beylkin (1985) from the adjoint of the asymptotic approximation of the Born modelling operator is used to define a modified misfit function. The gradient associated with this new misfit function shares some similarities with the standard L 2 norm-based FWI gradient. It is the cross-correlation of the wavefield diffracted by any single localized perturbation of the discrete medium with residuals. The difference with standard FWI is that the residuals on which the method applies are first migrated/demigrated using the Beylkin operator and its adjoint. The Hessian of this modified misfit function is shown to tend asymptotically towards the identity operator. As an inverse problem, the minimization of this misfit function is thus a better posed problem, as the trade-off between parameters is reduced.
It is observed numerically that despite working far from the highfrequency regime, the modified misfit function has indeed a diagonal dominant Hessian operator. However, the modification of the misfit function through the Beylkin operator implies a focussing of the method on small illumination angle data, detrimental to large-offset data. As a consequence, the sensitivity of the misfit function to diving waves is reduced, and the method is efficient only in a migration regime. This indicates that this strategy could be employed for accelerating non-linear quantitative migration algorithms.
The results which have been obtained suggests that the asymptotic approximation should be an appropriate tool to better condition the by guest on April 15, 2015 http://gji.oxfordjournals.org/ Downloaded from FWI problem. Future studies will be led to investigate the use of the Beylkin operator for building directly an approximate inverse of the Hessian operator. This pre-conditioner could be used in the framework of a truncated Newton method. This strategy is expected to be efficient especially for the multiparameter case for which trade-offs between different classes of parameters (for instance Pwave velocity and density in the acoustic approximation) make the computation of a decoupled estimation a strongly difficult task.
It is important to emphasize that the asymptotic approximation can be extended to account for anisotropy, attenuation and the propagation of elastic waves. Extension to multi-arrival can also be performed to improve the accuracy and the impact of an asymptotic pre-conditioning. Therefore, the approach which is developed here is not limited to the acoustic case and is hopefully a preliminary step towards the definition of more general pre-conditioners for FWI.

A C K N O W L E D G E M E N T S
This study was partially funded by the SEISCOPE consortium (http://seiscope2.osug.fr), sponsored by BP, CGG, CHEVRON, EXXON-MOBIL, JGI, PETROBRAS, SAUDI ARAMCO, SCHLUMBERGER, SHELL, SINOPEC, STATOIL, TOTAL and WOODSIDE. This study was granted access to the HPC resources of the Froggy platform of the CIMENT infrastructure (https://ciment.ujf-grenoble.fr), which is supported by the Rhône-Alpes region (GRANT CPER07_13 CIRA), the OSUG@2020 labex (reference ANR10 LABX56) and the Equip@Meso project (reference ANR-10-EQPX-29-01) of the programme Investissements d'Avenir supervised by the Agence Nationale pour la Recherche, and the HPC resources of CINES/IDRIS under the allocation 046091 made by GENCI. The authors thank all the actors of the SEISCOPE consortium research group for their active and constant support, especially Stéphane Operto for helpful and animated discussions.

A P P E N D I X A : A M P L I T U D E A N D T R AV E LT I M E C O M P U TAT I O N F O L L O W I N G T H E R AY T H E O RY
The following quantities have to be computed for the asymptotic approximation (i) the traveltime maps T(x; r), T(x; y); (ii) the amplitude maps A(x; r), A(x; y). The traveltime maps are computed through the eikonal solver of Podvin & Lecomte (1991), based on a fast-marching method. The computation of the amplitude maps depends on the geometrical spreading j(x; r), j(x; y). In 2D, considering a homogeneous medium around the source, and a proper source calibration, the amplitude satisfies [see for instance Forgues (1996)]. The computation of the geometrical spreading can be performed following the paraxial ray theory. The ray coordinates q(τ ), p(τ ) satisfy the Ordinary Differential Equations (ODE) ⎧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ dq dτ = p, by guest on April 15, 2015 http://gji.oxfordjournals.org/

Downloaded from
Considering a perturbation of the ray q (τ ) = q(τ ) +q(τ ), p (τ ) = p(τ ) +p(τ ), where and the quantity u(x) is the squared slowness For the computation of the amplitude map A(x; r), point source initial conditions are chosen q 1 = 0,q 2 = 0, p 1 = p 2 (x; r ),p 2 = −p 1 (x; r ). (A7) The geometrical spreading j(x; r) is given by The workflow for computing j(x; r) is thus as follows: (i) Compute the traveltimes from the receiver r using the eikonal solver.
(ii) Perform a back ray tracing from the point x to the receiver r using the gradient of the traveltime ∇T(x; r).
(iii) Store the components p 1 and p 2 at x and at the receiver r.
(iv) Get the initial condition for the paraxial ray tracing and solve the paraxial equations (A4).

A P P E N D I X B : C O M P U TAT I O N O F T H E B E Y L K I N I N T E G R A L O P E R AT O R B A N D I T S A D J O I N T B *
The integral operator B is defined through its kernel, which depends on the function h(x r , x s , ω) defined in eq. (38). For the sake of simplicity, the explicit formulation of B is given for the 2D mono-parameter case (constant density is assumed). The computation of h(x r , x s , ω) requires to develop the expression of the Jacobian associated with the change of variable f defined in eq. (32). This Jacobian has been denoted by (Df)(|k|, , θ ). The definition of k in eq. (31) implies |k| = ω ∇T (x, x r ) + ∇T (x, x s ) 2 = ω ∇T (x, x r ) 2 + ∇T (x, x r ) 2 + 2∇T (x, x r ) · ∇T (x, x s ).
The definition of the phase vectors T(x, x r ) and T(x, x s ) yields In addition, ∇T (x, x r ).∇T (x, x s ) = cos θ (x r , x, x s ) v 2 P (x) .