Fast Solution Methods

The standard boundary element method applied to the time-harmonic Helmholtz equation yields a numerical method of O(N^3) complexity when the fully populated system of linear equations is solved directly. Strategies to reduce this complexity are discussed in this paper. The O(N^3) complexity of the direct solution is first reduced to O(N^2) by using iterative solvers. Krylov subspace methods as well as preconditioning strategies are reviewed. Based on numerical examples, the influence of different parameters on the convergence behavior of the iterative solvers is investigated. It is shown that preconditioned Krylov subspace methods yield a boundary element method of O(N^2) complexity. A further advantage of these iterative solvers is that they do not require the dense matrix to be set up; only matrix-vector products need to be evaluated, which can be done efficiently using a multilevel fast multipole method. Based on real-life problems it is shown that the computational complexity of the boundary element method can be reduced to O(N log^2 N) for a problem with N unknowns.


Introduction
The standard boundary element method, that is, the calculation and storage of the dense matrices H and G and the direct solution of the system of linear equations, requires O(N^2) memory locations and O(N^3) arithmetic operations. This makes the method hardly applicable to problems of practical interest. However, the fact that the BEM requires only a discretization of the boundary Γ of the fluid domain Ω is advantageous when solving radiation or scattering problems involving complex-shaped radiators or scatterers.
The following chapter discusses possibilities to increase the efficiency of the BEM so that it becomes applicable to problems of practical interest. Two directions are pursued: firstly, the iterative solution of the system of linear equations, and secondly, the efficient evaluation of the product of the matrices H and G with a vector, as needed within the iterative solution process.

Iterative Solution
Iterative methods have been widely used for solving linear systems in various fields. For the linear system Ax = b, an iterative method generates successive approximations x_i at each step i from an initial approximation x_0 in a systematic way, until the residual norm ||r_i|| = ||b − Ax_i|| has decreased sufficiently. If the iterative process converges rapidly, iterative methods are much faster than direct methods such as Gaussian elimination.
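As a concrete illustration of such an iteration, the following sketch applies the classical Jacobi method to a small system. The 3x3 matrix and right-hand side are illustrative choices, not taken from the text; the matrix is strictly diagonally dominant, so the iteration converges, and the loop stops once the residual norm falls below a tolerance.

```python
# Minimal sketch of an iterative solver (Jacobi iteration) for Ax = b.
# Illustrative data: a strictly diagonally dominant 3x3 system.

def jacobi(A, b, x0, tol=1e-10, max_iter=200):
    """Return (x, iterations) for the Jacobi iteration on Ax = b."""
    n = len(b)
    x = list(x0)
    for it in range(1, max_iter + 1):
        # update every component from the current approximation
        x_new = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
                 for i in range(n)]
        # residual norm ||b - A x_new|| decides when to stop
        r = [b[i] - sum(A[i][j] * x_new[j] for j in range(n)) for i in range(n)]
        x = x_new
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            return x, it
    return x, max_iter

A = [[4.0, 1.0, 0.0], [1.0, 5.0, 1.0], [0.0, 1.0, 3.0]]
b = [5.0, 7.0, 4.0]          # exact solution is x = (1, 1, 1)
x, its = jacobi(A, b, [0.0, 0.0, 0.0])
```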
Iterative methods are classified into two groups: stationary methods and nonstationary methods. In the former, the coefficients used to update the approximation x_i do not depend on the iteration count. The Jacobi, Gauss-Seidel, Successive Over-Relaxation (SOR), and Symmetric Successive Over-Relaxation (SSOR) methods, among others, belong to this group. The latter methods use iteration-dependent coefficients, so that the computations can exploit information obtained at each iteration. The Krylov subspace methods are the most popular in this group.
The Krylov subspace methods, which were selected as one of the top 10 algorithms of the 20th century in SIAM News [19], are very effective for large systems of linear equations. A Krylov method successively computes the approximations x_i while increasing the dimension of the Krylov subspace, which is spanned by the Krylov sequence generated by the initial residual r_0 and the system matrix A. The Krylov subspace methods become even more powerful when they are combined with preconditioning techniques that improve the iterative convergence. One special feature of these methods is that the system matrix A is not needed explicitly; only the products of the matrix A with vectors (matrix-vector products) are required. Therefore, the Krylov methods can be combined with a fast evaluation of these products that never generates A.
This chapter outlines the Krylov subspace iterative methods and their application to the BEM. For details on iterative methods see, for example, References [8,10,53,66].

The Krylov Subspace Methods
Consider the linear system of equations

Ax = b. (12.1)

The initial residual r_0 is expressed as

r_0 = b − Ax_0, (12.2)

where x_0 denotes the initial approximate solution of x. The Krylov subspace methods are iterative methods in which the approximate solution x_n is taken to satisfy the condition

x_n = x_0 + z_n, z_n ∈ K_n(A; r_0), (12.3)

where K_n(A; r_0) denotes the n-dimensional Krylov subspace spanned by the Krylov sequence {A^i r_0}_{i≥0}:

K_n(A; r_0) = span{r_0, Ar_0, . . . , A^{n−1} r_0}. (12.4)

The iteration residual r_n is then expressed as

r_n = b − Ax_n = r_0 − Az_n. (12.5)

Krylov subspace methods require an approach for the identification of the approximate solution x_n, as follows.
• The minimum norm residual approach: identify x_n for which the norm ||b − Ax|| is minimal over x_0 + K_n(A; r_0):

||r_n|| = min_{x ∈ x_0 + K_n(A; r_0)} ||b − Ax||. (12.6)

• The Ritz-Galerkin approach: identify x_n such that r_n is orthogonal to the current subspace:

w^H r_n = 0, w ∈ K_n(A; r_0), (12.7)

where w^H denotes the conjugate transpose of w.

Lanczos, Arnoldi and Bi-Lanczos Types
In the Krylov subspace methods, an orthogonal basis of the Krylov subspace is required for the stability of the computation. The Krylov subspace methods are classified by the algorithms used to produce an orthogonal basis and by the approaches for identifying the approximate solution x_n mentioned above. If A is a Hermitian matrix, one can compute an orthogonal basis of the Krylov subspace K_n(A; r_0) using only three-term recurrence relations. This process is called the Lanczos process. The methods based on the Lanczos process have the advantage that the operation count and the required memory per iteration step do not increase with the number of iterations.
If A is a non-Hermitian matrix, such as those obtained with the BEM, one cannot compute an orthogonal basis with short recurrence relations. One has to keep all previously computed basis vectors in order to obtain the next one. This process is called the Arnoldi process. The methods based on the Arnoldi process can be applied to any regular matrix, but they have the major disadvantage that the operation count and the required memory per iteration step grow linearly with the number of iterations.
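The growing cost of the Arnoldi process can be seen directly in a minimal sketch: every new basis vector must be orthogonalized against all previously computed ones. The small nonsymmetric matrix and starting residual below are illustrative choices.

```python
# Sketch of the Arnoldi process (modified Gram-Schmidt) building an
# orthonormal basis of the Krylov subspace K_m(A; r0).  Note the inner loop
# over ALL previous basis vectors: this is the linearly growing work and
# storage per step mentioned in the text.  Pure-Python illustration.

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def arnoldi(A, r0, m):
    """Return the orthonormal basis Q = [q_0, ..., q_{m-1}] of K_m(A; r0)."""
    norm = dot(r0, r0) ** 0.5
    Q = [[x / norm for x in r0]]
    for _ in range(m - 1):
        w = matvec(A, Q[-1])
        for q in Q:                       # orthogonalize against all previous q
            h = dot(q, w)
            w = [wi - h * qi for wi, qi in zip(w, q)]
        h_next = dot(w, w) ** 0.5         # (a breakdown check would go here)
        Q.append([wi / h_next for wi in w])
    return Q

A = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [1.0, 0.0, 4.0]]   # nonsymmetric
Q = arnoldi(A, [1.0, 0.0, 0.0], 3)
```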
There is an inexpensive process for computing bases for non-Hermitian matrices, in which bi-orthogonal bases are computed instead of an orthogonal basis. In addition to the Krylov subspace K_n(A; r_0), one considers a second Krylov subspace K_n(A^H; r*_0) and constructs bi-orthogonal bases {v_i} and {u_j} for the two subspaces, satisfying the bi-orthogonality relation u_i^H v_j = 0 for i ≠ j, where r*_0 denotes the shadow residual, satisfying r*_0^H r_0 ≠ 0. This process is called the Bi-Lanczos process; the bi-orthogonal bases can be computed with only three-term recurrence relations, as in the Lanczos process. The main drawback of the methods based on this process is the possibility of breakdown in the iterative process, which does not occur in methods based on the Lanczos process. Since the approaches of Equations (12.6) and (12.7) for identifying x_n cannot be directly used with the Bi-Lanczos algorithm, the following approach is used instead.
• The Petrov-Galerkin approach: identify x_n such that r_n is orthogonal to the other subspace:

w^H r_n = 0, w ∈ K_n(A^H; r*_0). (12.8)

A short description of the representative methods is given below.
Conjugate Gradient (CG) [38]: The most basic method, based on the Lanczos process. It is used for linear systems with positive-definite Hermitian matrices, for which it is extremely effective. It cannot be directly used for the BEM.
Generalized Minimal Residual (GMRes) [52]: This method is based on the Arnoldi process and is known as one of the most robust methods. It is applicable to linear systems with non-Hermitian matrices and is often used for sound field analyses with the BEM. The major drawback is that the operation count and the memory required per iteration step grow linearly with the number of iterations. One therefore restarts the process every l iterations to overcome this limitation (GMRes(l)). It is quite difficult to choose an appropriate value for l, since GMRes may converge slowly, or fail to converge entirely, if l is too small.
Bi-Conjugate Gradient (BiCG) [26,41]: This method is based on the Bi-Lanczos process and can be applied to systems with non-Hermitian matrices. Unlike the methods based on the Arnoldi process, the operation count and the required memory per iteration step do not grow with the number of iterations. However, BiCG requires not only the original system matrix but also its conjugate transpose. Further drawbacks are that the method may break down, and that it may converge very irregularly.

BiCG-based Methods
BiCG is an efficient alternative to the methods based on the Arnoldi process, although it has the disadvantages mentioned above. Many variants of BiCG, here called BiCG-based methods, have been developed to remedy these disadvantages. None of the methods described below needs the conjugate transpose of the system matrix. The convergence behavior of this group of methods, including BiCG itself, is hardly understood theoretically.
Conjugate Gradient Squared (CGS) [63]: CGS often converges, or diverges, much faster than BiCG, because the residual polynomial of CGS is the square of that of BiCG. The convergence behavior may be irregular for large problems due to round-off errors.

Biconjugate Gradient Stabilized (BiCGStab) [65]: This method has smoother convergence behavior than CGS, while retaining its rapid convergence. It can be regarded as the product of BiCG and GMRes(1).
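A minimal pure-Python sketch of the BiCGStab iteration, following the standard van der Vorst recurrences, may make the method concrete. The 3x3 nonsymmetric system is an illustrative choice, and the code omits the breakdown safeguards a production solver would need.

```python
# Hedged sketch of BiCGStab for Ax = b.  Only matrix-vector products with A
# are needed (no conjugate transpose), as emphasized in the text.

def bicgstab(A, b, x0, tol=1e-10, max_iter=100):
    matvec = lambda v: [sum(a * x for a, x in zip(row, v)) for row in A]
    dot = lambda u, w: sum(p * q for p, q in zip(u, w))
    axpy = lambda a, u, w: [a * ui + wi for ui, wi in zip(u, w)]  # a*u + w
    x = list(x0)
    r = axpy(-1.0, matvec(x), b)           # r = b - A x
    r_hat = list(r)                        # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * len(b)
    p = [0.0] * len(b)
    for _ in range(max_iter):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        rho = rho_new
        p = axpy(beta, axpy(-omega, v, p), r)   # p = r + beta*(p - omega*v)
        v = matvec(p)
        alpha = rho / dot(r_hat, v)
        s = axpy(-alpha, v, r)
        if dot(s, s) ** 0.5 < tol:              # already converged on s
            x = axpy(alpha, p, x)
            break
        t = matvec(s)
        omega = dot(t, s) / dot(t, t)           # the "stabilizing" GMRes(1) step
        x = axpy(alpha, p, axpy(omega, s, x))   # x += alpha*p + omega*s
        r = axpy(-omega, t, s)
        if dot(r, r) ** 0.5 < tol:
            break
    return x

A = [[4.0, 1.0, 0.0], [0.0, 3.0, 1.0], [1.0, 0.0, 5.0]]   # nonsymmetric
b = [5.0, 4.0, 6.0]                                        # solution (1, 1, 1)
x = bicgstab(A, b, [0.0, 0.0, 0.0])
```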
Generalized Product-type Biconjugate Gradient (GPBiCG) [75]: This method is derived from a generalization of the BiCG-based methods. It is effective when the eigenvalues of the system matrix are complex.

Methods Based on Normal Equations
There is another class of Krylov subspace methods for non-Hermitian matrices, based on applying the methods for Hermitian matrices to the normal equations. Popular methods of this class are CGNR and CGNE, in which CG is applied to the systems (A^H A)x = A^H b and (AA^H)(A^{−H}x) = b, respectively. These methods have the disadvantage that the matrices of the normal equations are more ill-conditioned than the original matrices, so the convergence may be slow.
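A short sketch of CGNR, that is, CG applied to the (here real-valued) normal equations A^T A x = A^T b, is given below; the 2x2 nonsymmetric system is an illustrative choice. Note that only products with A and A^T are needed, and A^T A is never formed explicitly.

```python
# Hedged sketch of CGNR: CG on the normal equations A^T A x = A^T b,
# for a real nonsymmetric matrix A.  Illustrative data only.

def cgnr(A, b, x0, tol=1e-10, max_iter=100):
    m, n = len(A), len(A[0])
    matvec = lambda v: [sum(a * x for a, x in zip(row, v)) for row in A]
    rmatvec = lambda v: [sum(A[i][j] * v[i] for i in range(m)) for j in range(n)]
    dot = lambda u, w: sum(p * q for p, q in zip(u, w))
    x = list(x0)
    r = rmatvec([bi - ai for bi, ai in zip(b, matvec(x))])   # r = A^T (b - A x)
    p = list(r)
    rr = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(p)
        w = rmatvec(Ap)                      # w = A^T A p, applied as two products
        alpha = rr / dot(Ap, Ap)             # note dot(Ap, Ap) == dot(p, w)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * wi for ri, wi in zip(r, w)]
        rr_new = dot(r, r)
        if rr_new ** 0.5 < tol:
            break
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

A = [[2.0, 1.0], [0.0, 3.0]]     # nonsymmetric
b = [3.0, 3.0]                   # solution (1, 1)
x = cgnr(A, b, [0.0, 0.0])
```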

Preconditioning
Preconditioning is a well-known technique for improving convergence when solving linear systems with iterative methods. A matrix M = M_1 M_2, which approximates the system matrix A in some way, transforms the original system into a system with more favorable properties for iterative solution:

M_1^{−1} A M_2^{−1} (M_2 x) = M_1^{−1} b.

In general, a good preconditioner M should meet the following requirements: the preconditioned system should be easy to solve, and the preconditioner should be cheap to construct and apply. For a detailed review of preconditioning techniques, see the survey paper [10]. Below, we describe some preconditioning techniques that can be used for the BEM.

Diagonal Preconditioning
The preconditioner M consists of just the diagonal of the matrix A. This preconditioning is applicable with almost no extra memory and time, and it is easy to apply to almost all kinds of BEM. It has been reported to be effective for the BEM in some fields, for example the thermal and elastic BEM [47], but not very effective for the acoustic BEM [43,72].
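Diagonal preconditioning amounts to a simple row scaling, as the following hedged sketch shows; the 2x2 matrix and right-hand side are illustrative.

```python
# Sketch of diagonal (Jacobi) preconditioning: M = diag(A), so applying
# M^{-1} divides each row of A and each entry of b by the diagonal entry.
# The solution of the scaled system is unchanged.

def diag_precondition(A, b):
    """Return (M_inv_A, M_inv_b) for the left-preconditioned system."""
    d = [A[i][i] for i in range(len(A))]
    MA = [[aij / d[i] for aij in A[i]] for i in range(len(A))]
    Mb = [bi / di for bi, di in zip(b, d)]
    return MA, Mb

A = [[10.0, 1.0], [2.0, 0.5]]
b = [11.0, 2.5]                  # solution of both systems is (1, 1)
MA, Mb = diag_precondition(A, b)
```

After scaling, the preconditioned matrix has a unit diagonal, which is exactly the effect exploited by the iterative solver.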

Incomplete Factorizations
Preconditioners based on incomplete factorizations of the system matrix, such as the incomplete LU (ILU) and incomplete Cholesky (IC) factorizations, have been widely used. In ILU preconditioning, the system matrix A is incompletely factorized into a preconditioner M = LU ≈ A, where L and U are lower and upper triangular matrices, respectively. If the system matrix is Hermitian and positive definite, the IC factorization, a special case of the ILU factorizations, can be applied. The closer the ILU factorization is to the complete factorization, the better the convergence, but the higher the cost. Indeed, the complete LU factorization is equivalent to Gaussian elimination, and the factors L and U obtained by complete factorization generally have many fill-in (nonzero) entries. Various dropping strategies for discarding fill-in are used to construct ILU preconditioners. ILU(k) allows fill-in only up to a certain level k; ILU(0) allows fill-in only at positions where the corresponding entries of A are nonzero. ILU(k) has some disadvantages; for example, the computational cost increases rapidly with k, and the storage and computational cost cannot be predicted in advance. ILUT(τ, p) [51] is a threshold-based ILU factorization employing a dual dropping strategy, and it is more powerful than ILU(k): the threshold τ for dropping elements of small magnitude and the maximum number p of fill-in elements in each row of L and U are used to control the computational costs of constructing and applying the preconditioner.
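The following sketch shows the core of a threshold-based incomplete factorization on a small dense matrix: a standard IKJ elimination followed by dropping of small entries. It illustrates only the threshold part of ILUT's dual dropping strategy (the fill-count limit p is omitted), and with τ = 0 it reduces to the complete LU factorization.

```python
# Hedged sketch of a threshold-based incomplete LU factorization.
# L (unit lower triangular) and U are stored together in one matrix LU.
# Illustrative 3x3 data; with tau = 0.0 this is the complete factorization.

def ilut_threshold(A, tau):
    """Return the combined LU factors with threshold dropping."""
    n = len(A)
    LU = [row[:] for row in A]
    for i in range(1, n):
        for k in range(i):
            if LU[i][k] == 0.0:
                continue
            LU[i][k] /= LU[k][k]                 # multiplier (entry of L)
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]  # eliminate, creating fill-in
        # threshold dropping: discard entries small relative to the row
        norm = sum(abs(v) for v in LU[i]) / n
        LU[i] = [v if abs(v) >= tau * norm else 0.0 for v in LU[i]]
    return LU

A = [[4.0, 1.0, 0.0], [2.0, 3.0, 1.0], [0.0, 1.0, 5.0]]
LU = ilut_threshold(A, 0.0)      # tau = 0: exact LU factors of A
```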
ILU preconditioners are usually applied to sparse linear systems. They are very expensive when applied directly to a system with a dense system matrix, as typically obtained with the BEM. Some techniques have been proposed to overcome this limitation; see Section 12.2.3.

Sparse Approximate Inverses (SPAI)
The basic idea of this class is that a matrix M^{−1} ≈ A^{−1} is explicitly computed and used as a preconditioner. The computational cost of the naive construction and application of M^{−1} is very high, since M^{−1} is usually dense. In SPAI preconditioning, a sparse matrix M^{−1} ≈ A^{−1} is computed as the solution of the constrained minimization problem

min_{M^{−1} ∈ S} ||I − AM^{−1}||_F,

where S is a set of sparse matrices and ||·||_F denotes the Frobenius norm of a matrix. This results in n independent least-squares problems, one for each column of M^{−1}. The computational cost mainly depends on the choice of the sparsity pattern S. To use this approach for dense linear systems, S must be sufficiently sparse.
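For the simplest conceivable sparsity pattern, S = {diagonal matrices}, the least-squares problems decouple completely and have a closed-form solution, as the following hedged sketch shows; practical SPAI uses richer, often adaptively chosen patterns.

```python
# Hedged SPAI sketch with sparsity pattern S = {diagonal matrices}: column j
# of M^{-1} has a single unknown m_j, and min ||e_j - m_j * A[:, j]|| is a
# one-dimensional least-squares problem per column.  Illustrative matrix.

def spai_diagonal(A):
    """Return the diagonal entries of the M^{-1} minimizing ||I - A M^{-1}||_F."""
    n = len(A)
    m = []
    for j in range(n):
        col = [A[i][j] for i in range(n)]
        # closed-form 1-D least squares: m_j = (e_j . col) / (col . col)
        m.append(col[j] / sum(c * c for c in col))
    return m

A = [[4.0, 1.0], [0.0, 2.0]]
m = spai_diagonal(A)             # m = [4/16, 2/5] = [0.25, 0.4]
```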

Application to the BEM
The Krylov subspace methods require the repeated calculation of matrix-vector products of the system matrix A with vectors, which accounts for a large share of the operations in the iterative process. If A is an N × N dense matrix, as generally obtained with the BEM, the operation count for one matrix-vector product is O(N^2). Thus, if the number of iterations is much smaller than N, the Krylov methods markedly reduce the total operation count for solving the system compared with direct methods, which require O(N^3) operations. A further reduction of the operation count can be achieved by methods for the efficient evaluation of the matrix-vector products, such as the fast multipole method (FMM). The required memory is also reduced sharply by such methods, because it is no longer necessary to store the entire system matrix A. For more details on the efficient evaluation of matrix-vector products, see Section 12.3.

Iterative Methods for the BEM
The BEM yields a linear system with a non-Hermitian matrix, to which the iterative methods based on the Arnoldi or Bi-Lanczos process, and those based on the normal equations, are applicable. For an advanced BEM that does not store the entire system matrix, it is difficult to apply methods that require the conjugate transpose of the system matrix, such as BiCG, QMR, and the methods for the normal equations.

Preconditioners for the BEM
Not many preconditioners are effective for non-Hermitian dense matrices, because preconditioners have usually been developed for sparse matrices. For preconditioners for dense linear systems, see References [14,15,67,70] and the references therein. Moreover, even fewer preconditioners are applicable to the advanced BEM that does not store the entire system matrix.
One class of preconditioners for the BEM is based on the splitting of the boundary integral operators [5,14]. The system matrix A is split into two matrices A_near and A_far, the former of which is sparse and contains the interactions between nearby elements. An ILU factorization is computed for the matrix A_near, and the result is used as a preconditioner for the original system with A. This technique is very well suited to the advanced BEM that does not store the entire matrix A, because this type of BEM generates A_near in the same manner as the standard BEM anyway. The technique has proved very effective in several investigations in which practical acoustic problems were calculated with the advanced BEM using the regular grid method (RGM) [59] and the FMM [59,72]. For the standard BEM, a simple substitute for A_near is the tri-diagonal band of A together with the anti-diagonal corner elements a_1n and a_n1 [5].
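The splitting idea can be sketched as follows: entries of A coupling geometrically distant "elements" (here simply points on a line) are discarded, and the resulting sparse A_near would then be factorized as the preconditioner. Geometry and matrix are illustrative; a real BEM implementation would use the cluster structure of the fast method rather than a plain distance cutoff.

```python
# Hedged sketch of building A_near for an operator-splitting preconditioner:
# keep only the interactions between nearby points, zero out the rest.
# Illustrative 1-D geometry and kernel; not the BEM kernel of the text.

def near_field_matrix(A, points, cutoff):
    """Zero out all entries of A coupling points farther apart than cutoff."""
    n = len(A)
    return [[A[i][j] if abs(points[i] - points[j]) <= cutoff else 0.0
             for j in range(n)] for i in range(n)]

points = [0.0, 0.1, 1.0, 1.1]              # two well-separated pairs of points
A = [[1.0 / (1.0 + abs(p - q)) for q in points] for p in points]
A_near = near_field_matrix(A, points, 0.5)  # keeps the two 2x2 diagonal blocks
```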

Convergence Behavior for the BEM
The rate at which an iterative method converges relates directly to its computational time. For non-Hermitian matrices, the convergence of iterative methods is not well understood and depends greatly on the properties of the matrices [8].
The following discussion addresses general tendencies of the convergence, using examples computed with the collocation BEM with constant elements. The multilevel fast multipole algorithm (MLFMA) is used for the efficient evaluation of the matrix-vector products; for more details on the MLFMA, see Section 12.3. The stopping criterion

||b − Ax_i|| / ||b|| ≤ ε

is used for the linear system, where ||·|| denotes the Euclidean norm; ε = 10^{−6} is used unless noted otherwise. Instead of the number of iterations, the number of matrix-vector products is counted, because the matrix-vector products account for the greater part of the iterative process and different iterative methods require different numbers of matrix-vector products per iteration. For example, CGS requires two matrix-vector multiplications per iteration, whereas GMRes requires one.
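Counting matrix-vector products rather than iterations is easy to realize by wrapping the operator in a counter, as in this small sketch; matrix and vectors are illustrative.

```python
# Sketch of counting matrix-vector products: the matvec is wrapped in a
# counter, so different Krylov methods can be compared by the quantity
# that dominates their cost.

class CountingOperator:
    def __init__(self, A):
        self.A = A
        self.count = 0

    def matvec(self, v):
        self.count += 1
        return [sum(a * x for a, x in zip(row, v)) for row in self.A]

op = CountingOperator([[2.0, 0.0], [0.0, 3.0]])
y = op.matvec([1.0, 1.0])        # first product
z = op.matvec(y)                 # second product
```

An iterative solver would receive `op.matvec` instead of the matrix itself, which also matches the matrix-free setting of the fast multipole method.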

Effect of Formulation
The convergence behavior is affected by the formulation of the BEM: the basic formulation (singular formulation, SF), the normal derivative formulation (hypersingular formulation, HF), and the formulation of Burton and Miller (BM) [13], which avoids the fictitious eigenfrequency difficulties. Figure 12.2 shows the history of the residual norms for an exterior problem with a vibrating cube; the fictitious eigenfrequencies for SF and for HF are also shown in the upper part of the figure. The convergence with HF and with BM is slower than that with SF at all frequencies. This is because the matrices obtained with HF generally have eigenvalues that are less clustered than those obtained with SF (i.e., HF matrices are ill-conditioned), and BM matrices inherit this ill-conditioning from the HF matrices. It has been reported that the hypersingular equation can be reduced to a weakly singular one by using a high-order Galerkin BEM, which improves the matrix condition and the convergence behavior with BM [37,74].

Effect of Boundary Shapes
The shape of the boundary can greatly affect the convergence behavior: there is a tendency for the convergence to be more rapid for smoother surfaces [43,72]. Figure 12.4 shows the history of the residual norms for two interior models (Figure 12.3) with nearly the same number of DOF and the same ratio of element width to wavelength. Convergence is slower for the problem with a complex shape (auditorium) than for the one with a simple shape (cube) when both problems have the same boundary condition.

Effect of Boundary Conditions
The larger the absorption coefficient α, the faster the convergence [43,72], as seen in Figure 12.4. This is a general tendency, independent of the calculated problems and iterative methods. As shown in Figure 12.5, even a little absorption greatly improves the convergence compared with the case of perfectly rigid boundaries.

Effect of Preconditioning
The following refers to the effect of ILUT(τ, p) [51], one of the most powerful preconditioners, on the convergence behavior with the Burton-Miller formulation. In particular, it focuses on the effect of the parameter p, which greatly affects the memory requirements. This preconditioner is referred to as ILUT(p) with τ = 10^{−5} in the following. Figure 12.7 shows the history of the residual norms of three iterative methods for an exterior problem with a vibrating engine (Figure 12.6). In this case, none of the unpreconditioned methods converges at all, while ILUT preconditioning remarkably improves the convergence of all methods. In addition, the effect of ILUT(p) increases with p. As for GMRes(∞) (no restarting), even a small p greatly improves the convergence, making GMRes(∞) the fastest of the methods considered. From the viewpoint of convergence behavior, GMRes(∞) with ILUT preconditioning is therefore generally recommended for calculations in the Burton-Miller formulation. Figure 12.8 shows the effect of the restart number l of GMRes(l) on the iteration residual. The residual norms stagnate after every restart point [59,72], so that GMRes(l) converges more slowly than CGS and GPBiCG, depending on ILUT(p); see Figure 12.7. Hence, restarting in GMRes(l) should be avoided where possible in order to reduce the computational time. If restarting cannot be avoided due to memory restrictions, preconditioned BiCG-based methods may converge faster than GMRes(l). For fast convergence of GMRes(l) with ILUT(p), one should consider the trade-off between the two parameters l and p under the given memory restriction.

Efficient Evaluation of the Matrix-Vector Product
Within the iterative solution of the dense linear system arising from the solution of the Helmholtz equation using the Boundary Element Method, namely when calculating the matrix-vector product z = (I − A)u, one of the following quantities has to be evaluated:

z_j = u_j − ∫_Γ φ_j(y) ∫_Γ k(y − x) u(x) ds_x ds_y (12.13)

when using a Galerkin method,

z_j = u_j − ∫_Γ k(y_j − x) u(x) ds_x (12.14)

when using a collocation method, or

z_j = u_j − Σ_{i=1}^N ω_i k(y_j − x_i) u_i (12.15)

when using a Nyström method, for all j = 1, . . . , N, with N being the number of unknowns of the given problem. Since the kernel k, which contains the fundamental solution G(x, y), has non-local support, this operation costs O(N^2) arithmetic operations. Several possibilities to reduce this complexity have been proposed in the literature.
One approach consists in using a suitable wavelet basis together with a dropping strategy to sparsify the dense matrix A [12,21,27,39]. The most challenging part of this approach is the direct evaluation of the matrix A in the wavelet basis.
A second approach consists in a low-rank approximation of A together with efficient H-matrix techniques [9,35]. Such an approach is based only on the smoothness properties of the kernel function, which need not be known explicitly. This is a very important point for the numerical implementation: a change of the kernel function, as long as the smoothness properties are preserved, does not require a complete recoding of the numerical algorithms.
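The flavor of such low-rank techniques can be shown with a small sketch of adaptive cross approximation (ACA), which builds a low-rank approximation of a well-separated block from a few of its rows and columns, using only the matrix entries themselves. The kernel 1/(x − y) and the point sets below are illustrative choices, not from the text.

```python
# Hedged ACA sketch: approximate a well-separated kernel block by a sum of
# rank-1 cross terms chosen at the largest remaining entry (full pivoting).
# The remainder R measures the approximation error.

def aca(block, rank):
    """Return (cols, rows, remainder); approximation = sum_k outer(cols[k], rows[k])."""
    R = [row[:] for row in block]
    rows, cols = [], []
    for _ in range(rank):
        # pivot: entry of largest magnitude in the remainder
        i, j = max(((r, c) for r in range(len(R)) for c in range(len(R[0]))),
                   key=lambda t: abs(R[t[0]][t[1]]))
        piv = R[i][j]
        if piv == 0.0:
            break
        col = [R[r][j] / piv for r in range(len(R))]
        row = R[i][:]
        for r in range(len(R)):
            for c in range(len(R[0])):
                R[r][c] -= col[r] * row[c]     # subtract the rank-1 cross
        cols.append(col)
        rows.append(row)
    return cols, rows, R

xs = [0.0, 0.1, 0.2, 0.3]
ys = [2.0, 2.1, 2.2, 2.3]                      # well separated from xs
block = [[1.0 / (x - y) for y in ys] for x in xs]
cols, rows, R = aca(block, 2)
err = max(abs(v) for r in R for v in r)        # small for separated sets
```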
A third approach is based on a suitable approximation of a specific kernel function that separates the x and y dependencies of the kernel. For two points x and y with |x − y| ≥ d > 0 the kernel in Equation (12.16) is a smooth function, so an approximation k(x, y) ≈ h(y − y_c) μ(y_c − x_c) g(x − x_c) can be used to replace the original kernel function. Such an approximation is in general tied to a specific kernel k(·); hence, changing the kernel function generally requires a complete recoding of the numerical algorithms. Kernel approximations of this type form the basis of methods such as the regular grid method [11], panel clustering [36,56], and the fast multipole method [18,31]. The latter method, especially its multilevel variant, appears to be the most widely accepted, as it covers the very low to high frequency range. An overview of the state of the art of the fast multipole method can be found in the textbook [32].

Basic Concept of the Fast Multipole Method
The use of the Fast Multipole Method for the solution of boundary value problems for the Laplace equation goes back to Greengard and Rokhlin [31]. The application of this method to two-dimensional scattering problems can be found in [49]. It was again Rokhlin [50] who extended it to the three-dimensional case. The Fast Multipole Method for the three-dimensional Helmholtz equation will be derived here following the work of Anderson [7] and Rahola [48], which is often referred to as the "fast multipole method without multipoles".
The Fast Multipole Method was first used to calculate particle interactions. From this point of view (integration replaced by summation), Equations (12.13) to (12.15) can be regarded as the communication of every point y_j on the surface with all other points x_i on the surface; hence the direct evaluation of these equations has O(N^2) complexity. Avoiding the evaluation of the interactions of all pairs (y_j, x_i) reduces the complexity of the algorithm. This can be achieved by splitting the direct path y_j − x_i: sets of points z_i are introduced at which the information of the assigned points x_i is aggregated. Thereafter, only the interactions between aggregation points are evaluated. Finally, the information is redistributed from the points z_j to the associated points y_j on the surface Γ after all interactions have been calculated. This situation is depicted in Figure 12.9.
The idea behind the Fast Multipole Method is to replace (approximate) the kernel function k(x − y), for all well-separated points x and y with |x − y| ≥ d > 0, by a degenerate (separated) expansion of the form

k(x − y) ≈ g(x − z_1) μ(z_1 − z_2) h(z_2 − y), (12.17)

so that the far-field part of the matrix-vector product can be written as a product V B W u, where V, B and W are sparse matrices. In the following, the application of the Fast Multipole Method to Equation (12.14) will be considered; its application to the Galerkin method can be found in [25].

Series Expansion of the Fundamental Solution
The fundamental solution

G(x, y) = e^{ik|x−y|} / (4π|x − y|) (12.18)

can, for x − y = a + c with |c| < |a|, be expanded into the series

G(x, y) = ik/(4π) Σ_{l=0}^∞ (2l+1) h_l^{(1)}(k|a|) j_l(k|c|) P_l(â·ĉ) (12.19)

with the spherical Hankel function h_l^{(1)}, the spherical Bessel function j_l and the Legendre polynomial P_l of order l. The notation x̂ = x/|x|, |x| ≠ 0, has been used for vectors on the unit sphere. Using the partial wave expansion of the plane wave (see [1, 10.1.47])

e^{ikc·ŝ} = Σ_{l=0}^∞ i^l (2l+1) j_l(k|c|) P_l(ĉ·ŝ) (12.20)

and making use of the orthogonality of the Legendre polynomials yields the identity

j_l(k|c|) P_l(â·ĉ) = 1/(4πi^l) ∫_{S^2} e^{ikc·ŝ} P_l(ŝ·â) do(ŝ). (12.21)

Equation (12.21) makes it possible to replace the function j_l(k|c|) P_l(â·ĉ) in Equation (12.19), which depends on the product â·ĉ, by a product of two functions each depending only on â or on ĉ. Combining Equations (12.21) and (12.19) results in

G(x, y) = ik/(16π^2) ∫_{S^2} e^{ikc·ŝ} Σ_{l=0}^∞ i^l (2l+1) h_l^{(1)}(k|a|) P_l(ŝ·â) do(ŝ). (12.22)

By setting a = z_1 − z_2 and c = (z_2 − y) + (x − z_1), the fundamental solution Equation (12.18) can be expressed in the form (12.22) (12.23) for all points x and y satisfying the admissibility condition

|(z_2 − y) + (x − z_1)| < |z_1 − z_2|. (12.24)

In the same manner the series expansion of the gradient of the fundamental solution is obtained (12.25). Thus, for all pairs (x, y) with |(z_2 − y) + (x − z_1)| < |z_1 − z_2| the kernel function k(x − y) can be replaced by

k(x − y) = ∫_{S^2} e^{ik(z_2−y)·ŝ} [ ik/(16π^2) Σ_{l=0}^∞ i^l (2l+1) h_l^{(1)}(k|z_2 − z_1|) P_l(ŝ·(z_2 − z_1)/|z_2 − z_1|) ] e^{ik(z_1−x)·ŝ} do(ŝ). (12.26)

This rather technical representation of the kernel function is not yet in the desired form of Equation (12.17), as the infinite summation and the integration cannot be interchanged. After truncation of the infinite summation, however, the bracketed term in Equation (12.26) becomes

μ_M(z_2 − z_1, ŝ) = ik/(16π^2) Σ_{l=0}^M i^l (2l+1) h_l^{(1)}(k|z_2 − z_1|) P_l(ŝ·(z_2 − z_1)/|z_2 − z_1|). (12.27)

To apply this expansion to the fast evaluation of Equation (12.14) for all points y_j on Γ, the surface Γ has to be split according to the admissibility condition Equation (12.24). Therefore, the elements of the surface triangulation T are grouped into clusters τ_j with radius ρ_j and center z_j. Thereafter, the far field F(τ_j) and the near field N(τ_j) of a cluster τ_j are defined: the far field contains all clusters τ_i satisfying the admissibility condition with respect to τ_j, and the near field contains the remaining clusters. The parameter η ∈ (0, 1) defines the number of buffered clusters.
The series expansion of the kernel k(x − y_j) for a point y_j ∈ τ_j can now be used for all x ∈ τ_i with τ_i ∈ F(τ_j). The interaction of the cluster τ_j with the remaining clusters N(τ_j), the near field of τ_j, has to be calculated directly using standard boundary element techniques.
Using this splitting of the elements of the surface triangulation with respect to y_j ∈ τ_j, Equation (12.14) becomes

z_j = u_j − Σ_{τ_i ∈ N(τ_j)} ∫_{τ_i} k(y_j − x) u(x) ds_x − Σ_{τ_i ∈ F(τ_j)} ∫_{τ_i} k(y_j − x) u(x) ds_x.

The second term forms the j-th row of the sparse matrix A_near. The contribution of all clusters τ_i belonging to the far field of τ_j can be approximated using Equation (12.26); changing the order of integration and summation yields

Σ_{τ_i ∈ F(τ_j)} ∫_{τ_i} k(y_j − x) u(x) ds_x ≈ ∫_{S^2} e^{ik(z_j−y_j)·ŝ} Σ_{τ_i ∈ F(τ_j)} μ_M(z_j − z_i, ŝ) Ψ_{τ_i}(z_i, ŝ) do(ŝ)

with

Ψ_{τ_i}(z_i, ŝ) = ∫_{τ_i} e^{ik(z_i−x)·ŝ} u(x) ds_x.

The function Ψ_{τ_i}(z_i, ŝ) is often referred to as the far field pattern of the cluster τ_i. All far field patterns Ψ_{τ_i}(z_i, ŝ) with τ_i ∈ F(τ_j) are translated and summed into the near field pattern of the cluster τ_j,

Υ_{τ_j}(z_j, ŝ) = Σ_{τ_i ∈ F(τ_j)} μ_M(z_j − z_i, ŝ) Ψ_{τ_i}(z_i, ŝ).

Thus, the contribution of the far field of the cluster τ_j to the j-th row of the matrix-vector product can be approximated using the near field pattern Υ_{τ_j}(z_j, ŝ), and Equation (12.14), the j-th row of z = (I − A)u, finally becomes

z_j ≈ u_j − Σ_{τ_i ∈ N(τ_j)} ∫_{τ_i} k(y_j − x) u(x) ds_x − ∫_{S^2} e^{ik(z_j−y_j)·ŝ} Υ_{τ_j}(z_j, ŝ) do(ŝ). (12.29)

To use Equation (12.29) for the evaluation of a matrix-vector product, in a first step the far field pattern Ψ_{τ_i}(z_i, ŝ) must be calculated for all clusters τ_i ∈ T. This corresponds to the evaluation of a surface integral over a smooth function; hence, standard Gaussian quadrature can be applied. In a second step, the near field pattern Υ_{τ_j}(z_j, ŝ) of a cluster τ_j is obtained by translating the previously calculated far field patterns of all η-admissible clusters of τ_j; for this reason the operator μ_M(z_j − z_i, ŝ) is referred to as the translation operator. In a third step, the contribution to the sound pressure at the point y_j on the surface Γ results from the integration over the unit sphere. For a numerical implementation of the above equations, a suitable truncation of the infinite sum in Equation (12.23) and the numerical integration over the unit sphere, as well as the errors introduced by both, require some attention.

Truncation of the Series Expansion and Integration on the Unit Sphere
For the numerical implementation of the Fast Multipole Method, the infinite sum over l in Equation (12.23) has to be truncated at a certain integer M. One would expect the Fast Multipole Method to become more accurate with increasing M. This holds true only in exact arithmetic; numerically, the sum over l diverges as M tends to infinity. In exact arithmetic the growth of h_l^{(1)}(k|a|) is implicitly balanced by the decay of j_l(k|c|) P_l(ĉ·â), but this factor has been replaced by

j_l(k|c|) P_l(ĉ·â) = 1/(4πi^l) ∫_{S^2} e^{ikc·ŝ} P_l(ŝ·â) do(ŝ).
As the integration over the unit sphere has to be carried out numerically, using a quadrature rule with nodes ξ_i and weights w_i (i = 1, . . . , K_I), the values of j_l(k|c|) P_l(ĉ·â) are reproduced only to finite precision, and the truncation index M has to be chosen according to the accuracy of the numerical integration. As the value of K_I determines the efficiency of the algorithm, the smallest possible value should be used. A detailed analysis of the truncation error was presented by Darve [22]. He shows (Propositions 1 and 2) that for η ≤ 2/√5 there exist constants C_1, C_2, C_3 and D_1, D_2, D_3 such that, if M ≥ C_1 + C_2 k|c| + C_2 log(k|c|) + C_3 log ε^{−1} terms are kept in the sum in Equation (12.33) and the quadrature rule integrates exactly the first K ≥ D_1 + D_2 k|c| + D_2 log(k|c|) + D_3 log ε^{−1} spherical harmonics on S^2, then the error introduced by truncation and numerical integration when replacing the fundamental solution by the truncated series expansion of Equation (12.22) is bounded by ε. Darve further states: "Practically a good choice for K is K ≥ 2M". Based on numerical tests [20,34,62], several authors have given empirical formulas for the value of M of the type

M = k|c| + 1.6 d log(k|c| + π), (12.34)

where the value of d controls the number of significant digits of the approximation. For numerical implementations, the value of K is often fixed to K = 2M; consequently, the quadrature rule has to be exact up to the first K spherical harmonics. Using a Gaussian quadrature rule Q_M which integrates exactly the first 2M spherical polynomials (see [64, Theorem 2.7-1] for details), K_I = (M + 1)(2M + 1) sample points of each far field and near field pattern, as well as of each translation operator, need to be stored.
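The semi-empirical rule (12.34) and the resulting number of quadrature points are easily evaluated; the following sketch assumes the natural logarithm in (12.34) and rounds M up to an integer.

```python
# Sketch of the semi-empirical truncation rule M = k|c| + 1.6*d*log(k|c| + pi)
# (natural logarithm assumed) and the sample count K_I = (M + 1)(2M + 1)
# of a Gaussian rule exact for the first 2M spherical polynomials.

import math

def truncation_number(kc, d):
    """Truncation index M for given k|c| and digit parameter d."""
    return math.ceil(kc + 1.6 * d * math.log(kc + math.pi))

def quadrature_points(M):
    """Number of stored samples per pattern and translation operator."""
    return (M + 1) * (2 * M + 1)

M = truncation_number(10.0, 3)   # k|c| = 10, roughly 3 significant digits
K_I = quadrature_points(M)
```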

Aspects of the Numerical Implementation
To evaluate Equation (12.35) at each collocation point y_j, j = 1, …, N with less than O(N²) arithmetic complexity using the Fast Multipole Method, the elements of the surface triangulation must be grouped into clusters. It is assumed that each element Δ_i of the surface triangulation T can be uniquely assigned to a cluster τ_i with radius ρ_i and center z_i. The maximal cluster radius is defined as ρ_max = max_i ρ_i. The constant M_0 ensures that a sufficient number of coefficients is kept even at small wavenumbers k; commonly M_0 = 4 is a good choice. For further details on the treatment of the low frequency range using the Fast Multipole Method we refer to [23,24,30,40,69]. Integration on the unit sphere is performed using a quadrature rule Q_M that integrates exactly the first 2M spherical polynomials. The partition of the surface Γ is based on Equation (12.24).

Two clusters τ_i and τ_j are called η-admissible with respect to the parameter η ∈ (0, 1) if

  ρ_i + ρ_j ≤ η |z_i − z_j|   (12.37)

holds true, i.e. the scaled sum of their radii is smaller than the distance between their cluster centers. Thereafter, the interaction of all η-admissible clusters can be calculated using Equation (12.26), and for a collocation point y_j ∈ τ_j Equation (12.35) is evaluated accordingly. Using the notation introduced above, the matrix-vector product now reads

  A u ≈ A_near u + W B V u ,   (12.38)

where A_near, V, B and W are sparse matrices. The matrix A_near contains all nearby interactions that must still be calculated using standard boundary element techniques and the original kernel function Equation (12.16). The remaining interactions are evaluated using the Fast Multipole Method. It can be shown by counting the non-zero elements in A_near, V, B and W that the effort for evaluating one matrix-vector product is O(N^{3/2}), cf. [22]. Therefore, a further reduction of the complexity of the algorithm is highly desirable: obviously, still O(N_C²) interactions of clusters have to be calculated.
However, one finds that many of the η-admissible clusters would stay η-admissible if their radii were larger. In other words, the interaction of larger parts of the surface could have been calculated using the series expansion. Hence, to increase the efficiency of the Fast Multipole Method the size of the clusters should be enlarged as long as they stay η-admissible, and the evaluation of their interactions should be performed once they have reached the largest possible radius. This leads to the so-called Multilevel Fast Multipole Algorithm (MLFMA).
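The grouping into near field and far field pairs described above can be sketched as follows. The admissibility test implements the condition of Equation (12.37); the cluster data structures are hypothetical stand-ins:

```python
import numpy as np

def eta_admissible(ci, ri, cj, rj, eta=0.7):
    # Eq. (12.37): the sum of the cluster radii must stay below
    # eta times the distance of the cluster centers.
    return ri + rj <= eta * np.linalg.norm(np.asarray(ci) - np.asarray(cj))

def split_interactions(clusters, eta=0.7):
    # clusters: list of (center, radius) tuples. Returns the index pairs
    # treated by the multipole expansion (far) and those that must be
    # evaluated by standard BEM integration (near).
    near, far = [], []
    for i, (ci, ri) in enumerate(clusters):
        for j, (cj, rj) in enumerate(clusters):
            if i != j and eta_admissible(ci, ri, cj, rj, eta):
                far.append((i, j))
            else:
                near.append((i, j))
    return near, far
```

A cluster is never admissible with itself, so the diagonal pairs always remain in the near field part A_near.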

The Multilevel Fast Multipole Algorithm
When looking at the Fast Multipole Method described in Section 12.3.1 it turns out that there are pairs of clusters τ_i and τ_j which would satisfy the admissibility condition Equation (12.37) even with larger radii ρ_i and ρ_j; in other words, more interactions of points on the surface could have been evaluated at once using the Fast Multipole Method. The main idea of the Multilevel Fast Multipole Algorithm is to enlarge the radii of the clusters adaptively using a hierarchy of clusters, a so-called cluster tree. The admissibility condition Equation (12.37) is then applied at every level l of the tree. Only those clusters interact on level l which are η-admissible while their fathers on the next higher level are not η-admissible. The interaction of the remaining clusters is calculated at a higher level. The shift of the far and near field patterns Ψ_τ and Υ_τ through the different levels of the cluster tree is undertaken using a shift relation, and the resulting algorithm attains O(N log² N) complexity [18,22,62].

Two important differences between the multilevel and the single-level algorithm must be addressed. Firstly, the splitting of the elements of the surface triangulation must be replaced by a hierarchy of such splittings based on a cluster tree. Secondly, as the cluster radii differ on different levels, the expansion length M in Equation (12.36) must be adapted, and in addition to the shift of the near and far field patterns, interpolation and filtering of the data are necessary when passing information between levels. Using the notation B_l for the translation operator on level l, I_l^{l+1} for the shift and interpolation operator from level l to level l + 1, and F_{l+1}^{l} for the shift and filter operator from level l + 1 to level l, the approximation of the product of a vector u with the system matrix A can be written in factorized form. Similar to the single-level version, first the far field pattern of the vector u is evaluated.
In the multilevel version, however, the translation operator B_l translates far field patterns into near field patterns on level l only for those admissible clusters on level l whose fathers on level l + 1 are not admissible. Then the far field pattern of level l is shifted to level l + 1. Once the highest level of the cluster tree is reached, the near field pattern of level l is shifted to level l − 1 and accumulated onto the pattern of level l − 1. When the lowest level of the cluster tree is reached, the accumulated near field pattern is converted into a sound pressure on the surface Γ.
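The rule that a pair of clusters interacts on level l only if it is admissible there while the father pair is not can be sketched as follows. The ancestor lists and the value of η are illustrative assumptions:

```python
import numpy as np

def admissible(ci, ri, cj, rj, eta):
    # Admissibility condition of Eq. (12.37).
    return ri + rj <= eta * np.linalg.norm(np.asarray(ci) - np.asarray(cj))

def interaction_level(anc_i, anc_j, eta=0.7):
    # anc_i, anc_j: per-cluster lists of (center, radius) from the leaf level
    # upwards. Since the radii grow towards the root, the pair is translated
    # on the highest level that is still eta-admissible (its father pair
    # fails the test); None means the pair stays in the near field part.
    for l in reversed(range(min(len(anc_i), len(anc_j)))):
        (ci, ri), (cj, rj) = anc_i[l], anc_j[l]
        if admissible(ci, ri, cj, rj, eta):
            return l
    return None
```

Searching from the root downwards returns the coarsest admissible level, which is exactly the level on which the MLFMA applies the translation operator for this pair.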

Construction of the Cluster Tree
A cluster tree is a hierarchy of sets containing the elements of the surface triangulation. Starting on the coarsest level, where the elements of the surface triangulation form a single set, the sets of clusters on the next finer level are obtained by successive subdivision of the previous sets. The newly obtained sets are referred to as sons and become the father sets for the corresponding sets of the next finer level. As each set is subdivided into two subsets, see Figure 12.10, a binary cluster tree is obtained [29]. The process is stopped when the number of elements in a set falls below a given value. A different strategy, where the subdivision is obtained by successive subdivision of a cube in ℝ³ yielding an oct-tree [22,55], will not be considered here.
Once the cluster tree is built up, for each of its levels the radius ρ and the center z of a cluster τ are defined as the radius and center, respectively, of the smallest open ball B(ρ, z) containing τ entirely. Further, the clusters have to satisfy τ_i ∩ τ_j = ∅ for i ≠ j and Γ = ∪_i τ_i. It is pointed out that the elements of T within a cluster do not necessarily have to be adjacent.
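A minimal sketch of such a binary subdivision over the element centroids might look as follows. The smallest enclosing ball is approximated here by a centroid-centered bounding ball, and all names are illustrative:

```python
import numpy as np

def build_cluster_tree(centroids, indices=None, max_leaf=4):
    # Binary cluster tree over element centroids: split the current set at
    # the median along its longest bounding-box axis until a set holds at
    # most max_leaf elements. Each node stores a bounding-ball center and
    # radius (a simple approximation of the smallest enclosing ball).
    if indices is None:
        indices = np.arange(len(centroids))
    pts = np.asarray(centroids)[indices]
    center = pts.mean(axis=0)
    radius = float(np.linalg.norm(pts - center, axis=1).max())
    node = {"center": center, "radius": radius, "elems": indices, "sons": []}
    if len(indices) > max_leaf:
        axis = int(np.argmax(pts.max(axis=0) - pts.min(axis=0)))
        order = indices[np.argsort(pts[:, axis])]
        half = len(order) // 2
        node["sons"] = [build_cluster_tree(centroids, order[:half], max_leaf),
                        build_cluster_tree(centroids, order[half:], max_leaf)]
    return node
```

The median split guarantees τ_i ∩ τ_j = ∅ for the sons of one father and that the union of all leaves covers the whole element set, mirroring the conditions stated above.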

Interpolation and Filtering on the Sphere
The proper handling of the shifting of near and far field patterns through the various levels of the cluster tree is of crucial importance for the efficiency of the Multilevel Fast Multipole Algorithm. As the radius of the clusters changes from level to level, the number of coefficients to be kept according to Equation (12.34) also differs for each level of the cluster tree. To obtain the complexity estimates of the algorithm given above, the number of coefficients M in Equation (12.33) must be adapted to the radius of the clusters on each level. Hence, on lower levels fewer coefficients are needed than on higher levels. Consequently, the number of sample points on the unit sphere needed for the numerical integration also changes. Thus, the values at the new sample points needed when moving up and down the tree must be interpolated from the values of the previous level. As a consequence of [22, Propositions 1 and 2], this operation must not introduce spherical harmonics of higher order than required, so that the interpolation error remains of the same order as the truncation error. This can be guaranteed by using spherical harmonic analysis for the interpolation of the sample points.
Spherical harmonic analysis consists of expanding a function f, given at a set of points (ϕ_i, θ_j), i = 0, …, N and j = 0, …, 2N on the unit sphere, in terms of spherical harmonics Y_n^m, that is, calculating a set of coefficients α_nm such that

  f(ϕ_i, θ_j) = Σ_{n=0}^{N} Σ_{m=−n}^{n} α_nm Y_n^m(ϕ_i, θ_j)

holds true. The required function values at a new set of points (ϕ̃_i, θ̃_j), i = 0, …, Ñ and j = 0, …, 2Ñ are then obtained using the above expansion,

  f(ϕ̃_i, θ̃_j) = Σ_{n=0}^{Ñ} Σ_{m=−n}^{n} α_nm Y_n^m(ϕ̃_i, θ̃_j) .   (12.42)

The values of N and Ñ in Equation (12.42) determine whether the operation is an interpolation or a filtering: in the case where Ñ is larger than N, Equation (12.42) is referred to as interpolation, otherwise as filtering. Using the orthogonality properties of the spherical polynomials, the coefficients α_nm are given by Equation (12.43) with the substitution t = cos θ.

The numerical implementation of the interpolation and filtering consists of several steps. First, the integration over the ϕ-direction can be seen as a Fourier transformation. Integration with respect to t in Equation (12.43) and summation over n in Equation (12.42) then yield the new Fourier coefficients β̃_im. An algorithm for large M, M̃ that is more efficient than the direct evaluation of the double summation was presented in [71]. It is based on the Christoffel-Darboux formula [1] and the application of a one-dimensional Fast Multipole Method to evaluate the arising matrix-vector product in an efficient manner. The data at the new sample points (ϕ̃_i, θ̃_j) is finally obtained by an inverse Fourier transformation of the β̃_im. With a proper choice of M and M̃, the required Fourier transformations can be implemented efficiently using fast algorithms.
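The Fourier step in the ϕ-direction can be illustrated in one dimension: zero-padding the Fourier coefficients interpolates band-limited data to a finer equispaced grid, while truncating them filters to a coarser one. This is only a one-dimensional analogue of the β̃ step above, not the full spherical transform:

```python
import numpy as np

def resample_azimuth(values, n_new):
    # Trigonometric resampling of equispaced real samples: np.fft.irfft with
    # a different output length zero-pads (interpolation) or truncates
    # (filtering) the half-spectrum; the rescaling keeps the amplitude.
    n = len(values)
    return np.fft.irfft(np.fft.rfft(values), n_new) * (n_new / n)
```

Interpolating 8 samples of cos(2ϕ) to 16 points reproduces cos(2ϕ) on the finer grid exactly, because the data is band-limited; higher harmonics are never introduced, which is precisely the property required of the interpolation above.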

Implementation of the Algorithm
The application of the Multilevel Fast Multipole Algorithm for the Helmholtz equation is divided into two parts. In the first part, commonly referred to as the setup-step, the cluster tree is constructed, the admissible cluster pairs on all levels are determined, and the near field matrix A_near as well as the expansion lengths M_l with the associated quadrature points on the unit sphere are computed. Details on the choice of the values for n_elem, n_level and M_l can be found in [55,73]. Often the matrix V in Equation (12.38) is not calculated explicitly, and the numerical integration over the unit sphere, see Equation (12.28), is performed directly during each matrix-vector product.
The evaluation of a matrix-vector product v = (I − A)u using the Multilevel Fast Multipole Algorithm, often referred to as the apply-step, is realized by evaluating the far field patterns at the leaf clusters, shifting them up the tree, translating between admissible clusters on every level, and shifting the accumulated near field patterns back down. The functions interpol and filter transfer data given on a set of points Ξ_{M_l} to a set of points Ξ_{M_{l+1}} and vice versa, as discussed in Section 12.3.7.
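The structure of the apply-step can be sketched as follows. The per-level operators are hypothetical dense stand-ins; in a real implementation they are sparse and the interpolation and filtering act on the quadrature grids Ξ_{M_l}:

```python
import numpy as np

def mlfma_apply(u, A_near, V, W, levels):
    # Skeleton of v = (I - A)u. levels[l] holds the translation matrix "B"
    # of level l, the interpolation operator "interp" (level l -> l+1) and
    # the filter operator "filter" (level l+1 -> l); all are stand-ins.
    far = {0: V @ u}                          # far field pattern at the leaves
    for l in range(len(levels) - 1):          # upward pass: shift/interpolate
        far[l + 1] = levels[l]["interp"] @ far[l]
    near = None
    for l in reversed(range(len(levels))):    # downward pass
        contrib = levels[l]["B"] @ far[l]     # translate admissible pairs
        near = contrib if near is None else contrib + levels[l]["filter"] @ near
    return u - (A_near @ u + W @ near)        # convert pattern, add near field
```

With a single level this reduces to the single-level product u − (A_near + W B V)u of Equation (12.38); each additional level adds one interpolation on the way up and one filtering on the way down.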
The implementation of the algorithm above shows that a large part of the memory required for the multipole part is needed to store the data μ_M(z_j − z_i, Ξ_{M_l}) = μ_M(z, Ξ_{M_l}) with z = [z_1, z_2, z_3], which is the translation operator that translates far field patterns into near field patterns. This is a drawback of the use of a binary cluster tree compared to an oct-tree. A closer look at the structure of the data shows that it depends on the absolute values of the components of z only. Hence, using a suitable permutation P, the values μ_M([±|z_{1,2}|, ±|z_{2,1}|, ±|z_3|], Ξ_{M_l}) can be calculated from the values of μ_M([|z_1|, |z_2|, |z_3|], Ξ_{M_l}), so that the memory requirement becomes negligible. To increase the probability that the difference of two cluster centers z = z_j − z_i differs only by the signs of its components, the positions of the cluster centers can be restricted to points on a regular grid [ih, jh, kh]. This yields a situation quite similar to that of an oct-tree obtained by successive subdivision of a cube. If necessary, a further compression of the data μ_M can be achieved by recalculating μ_M(z, Ξ_{M_l}) as it is needed. An efficient algorithm for this task is given in [68]. There the authors use a one-dimensional Fast Multipole Method to evaluate Equation (12.45) with α_l = (2l + 1) i^l h_l^{(1)}(k|z|) in an efficient manner. The evaluation of μ_M at O(M²) locations then needs only O(M² log(1/ε)) arithmetic operations, where ε determines the accuracy of the approximation. Thereafter, only the coefficients α_l, which depend on the modulus of z only, need to be stored.
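The series underlying the translation operator can be evaluated directly, which also makes the sign symmetry exploited above visible: μ_M depends only on |z| and on the inner product ẑ·ŝ. The sketch below uses standard recurrences instead of library special functions; the function names and the form μ_M(z, ŝ) = Σ_{l=0}^{M} (2l+1) i^l h_l^{(1)}(k|z|) P_l(ẑ·ŝ) are a plausible reading of Equation (12.45), not a verbatim reproduction:

```python
import numpy as np

def spherical_hankel1(lmax, x):
    # h_l^{(1)}(x) for l = 0..lmax via the upward recurrence
    # h_{l+1} = (2l+1)/x * h_l - h_{l-1}, which is stable for h^{(1)}.
    h = np.empty(lmax + 1, dtype=complex)
    h[0] = -1j * np.exp(1j * x) / x
    if lmax >= 1:
        h[1] = -(1.0 / x + 1j / x**2) * np.exp(1j * x)
    for l in range(1, lmax):
        h[l + 1] = (2 * l + 1) / x * h[l] - h[l - 1]
    return h

def translation_operator(z, s_hat, k, M):
    # mu_M(z, s_hat) = sum_{l=0}^{M} (2l+1) i^l h_l^{(1)}(k|z|) P_l(z_hat.s_hat);
    # the Legendre polynomials are generated by their three-term recurrence.
    z = np.asarray(z, dtype=float)
    r = np.linalg.norm(z)
    t = float(np.dot(z / r, np.asarray(s_hat, dtype=float)))
    h = spherical_hankel1(M, k * r)
    p_prev, p = 1.0, t                     # P_0 and P_1
    total = h[0] * p_prev
    if M >= 1:
        total += 3 * 1j * h[1] * p
    for l in range(1, M):
        p_prev, p = p, ((2 * l + 1) * t * p - l * p_prev) / (l + 1)
        total += (2 * (l + 1) + 1) * (1j ** (l + 1)) * h[l + 1] * p
    return total
```

Flipping the sign of one component of z together with the corresponding component of ŝ leaves |z| and ẑ·ŝ, and hence μ_M, unchanged, which is exactly the symmetry used to compress the stored operators.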

An Example
The Multilevel Fast Multipole Algorithm described above will be used to demonstrate that the boundary element method can be applied efficiently to large scale problems.
The objective of the following numerical example is the prediction of the quality of an anechoic chamber in the low frequency range. The fact that the acoustic lining of such a chamber is not sufficiently absorbing at low frequencies creates several inconveniences. First, from a practical point of view, the remaining reflections off the walls perturb experimental results. Secondly, from a numerical point of view, neither the geometry of the lining nor the sound propagation within the absorbing material can be neglected. This excludes the use of the model of a rectangular cavity whose walls are equipped with a local admittance condition to account for the acoustic lining. Therefore the numerical model must respect the real geometry of the lining, and the modelling of the absorbing material requires special attention. Here an admittance matrix, instead of a scalar value, taking into account sound propagation within a specific part of the lining and its vicinity, was used to represent the behaviour of the acoustical treatment. Hence, on the air-lining interface the sound pressure p at a specific part of the surface is coupled with the surface velocity v_ν on that part and its vicinity via

  v_ν = Y p   (12.46)

with the dense and frequency dependent matrix Y. The approximation that only the surface velocity in the vicinity is considered can be seen as a localisation of the non-local behaviour of the absorbing material. Numerical results will be compared with experimental data obtained from measurements carried out in the large anechoic chamber of the LMA, see Figure 12.11. The 1.5 dB region in the 20 to 200 Hz frequency range was obtained by measuring the sound pressure radiated by a bass-reflex box. For a detailed description of the experiment we refer to [58]. The anechoic chamber at the LMA has inner dimensions of 5.4 × 6.3 × 11.4 m³ measured from wedge tip to wedge tip.
Each of the 3720 wedges consists of a rectangular parallelepiped measuring 0.3 × 0.3 × 0.4 m³ forming the base of the wedge and a tapering section of 0.7 m in length. The wedges are made of melamine foam. To calculate the admittance matrix Y, a central wedge and its 8 adjacent wedges have been used. As all of the wedges are identical, only a single admittance matrix Y(ω) ∈ ℂ^{324×36} is needed. However, to account for the anisotropy of the melamine foam, three different matrices Y_i, representing the three different material orientations, have actually been used. These matrices were precalculated in the frequency range of 20 to 200 Hz with a frequency resolution of 1 Hz using a finite element method. The geometry of the wedges has been modelled using 9 elements, resulting in a mesh size of ≈ 0.3 m. Using linear discontinuous basis functions [42] yields a linear system with N = 138 280 unknowns, and standard boundary element methods, requiring ≈ 300 GB of memory to hold the system matrix, are hardly applicable. Therefore a six-level fast multipole method was applied. The leaves of the cluster tree contained up to n_elem = 4 surface elements. The parameter η in Equation (12.37) was set to η = 0.7. The memory requirement of the algorithm at 200 Hz is given in the left sub-table of Table 12.2. It can be seen that the matrices of the near field interactions occupy almost all of the required memory. The matrix named "work" represents the working space needed by the multipole algorithm to hold the far and near field patterns on the different levels. The expansion length M in Equation (12.36) varied from 5 to 16 depending on the level. The memory required to hold the translation operator μ_M, matrix B in Table 12.2, is indeed negligible, as stated in Section 12.3.8, due to the performed compression. Each stored operator has been reused ≈ 1500 times.
The computational resources, i.e. the time and the number of floating point operations required to perform a matrix-vector product, are given in the right sub-table of Table 12.2. It is pointed out that due to the admittance boundary condition the product A_near u reads as A_near u = H_near u − G_near Y u. Therefore the near field parts of the matrices H and G must be stored separately. Performing a single product (I − A)u took 72.0 s on an SGI Origin3800 and required 6.4·10⁹ floating point operations. In contrast, the standard BEM would require at least N² ≈ 19.1·10⁹ floating point operations. The computational resources given above have been measured using a performance counter library.
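The reason for keeping H_near and G_near separate can be made explicit in a short sketch: with a dense Y, forming H_near − G_near Y would destroy the sparsity of the near field matrices, while applying the factors one after the other does not. All matrices below are illustrative dense stand-ins:

```python
import numpy as np

def near_field_product(H_near, G_near, Y, u):
    # A_near u = H_near u - G_near (Y u): applying Y to u first keeps the
    # sparse factors H_near and G_near untouched; the dense product
    # H_near - G_near @ Y is never formed explicitly.
    return H_near @ u - G_near @ (Y @ u)
```

This is the usual trade-off of factorized matrix-vector products: one extra multiplication per application in exchange for avoiding the fill-in of an explicit dense combination.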
The linear system was solved using the GMRes solver [52]. Depending on the frequency, 80 to 120 iterations were necessary to obtain a residual of ε = 10⁻⁶. The total solution time was 2.5 to 4 hours per frequency on an SGI Origin3800 of the Center for Information Services and High Performance Computing at the Technische Universität Dresden, Germany. Numerical results are compared with experimental data in Figure 12.12 for two different source positions. A solid dot represents the numerical result. The bounds of the 1.5 dB region obtained from the experimental data are represented by a solid line. For both source positions numerical and experimental results agree well for frequencies above 100 Hz.
The example shows that the BEM can be applied to large scale problems. Especially when the boundary of the fluid domain has a complex shape, the boundary element method is competitive with the finite element method, since meshing the complex geometry of the fluid domain can be avoided.

Conclusion
Fast solution methods have been discussed to overcome the O(N³) complexity of the standard boundary element method when using a direct solution of the system of linear equations. By the use of an iterative solver the complexity can be reduced to O(N²), provided the number of required iterations is much smaller than the number of unknowns N. Krylov subspace methods are the most suitable class of iterative solvers in this context. Their convergence is considerably accelerated by preconditioning strategies. Preconditioners based on a splitting of the system matrix are effective for the BEM; the sparse matrix corresponding to the near field interaction is factorized using an incomplete LU decomposition. The use of the hypersingular formulation leads to ill-conditioned matrices that cause slow convergence. Furthermore, the properties of the boundary of the domain have a significant influence on the required number of iterations. In general it can be stated that complex shapes and low absorption cause slow convergence. A further reduction of the complexity can be achieved through the use of fast BEMs which avoid the explicit set-up of the dense system matrix. The multilevel fast multipole method seems to be the most widely accepted of these fast methods. Based on the truncation of the series expansion of the fundamental solution, the fast multipole method yields an approximate factorization of the system matrix. The application of this method on multiple levels of a cluster tree enables the evaluation of a matrix-vector product with O(N log² N) complexity. Therefore, the iterative solution of the linear system together with the multilevel fast multipole method yields a very efficient numerical method. This higher efficiency makes the accelerated BEM applicable to real life problems.

Table 12.2: Memory requirement (left table) and computational resources needed to perform one matrix-vector product (right table) when using a six-level fast multipole algorithm. The first line, labeled "Equation (12.46)", represents the application of the boundary condition. The last column of the right table gives the performance of the algorithm with respect to the peak performance of the computer's processor.