A Robust Denoising Process for Spatial Room Impulse Responses with Diffuse Reverberation Tails

: Spatial room impulse responses (SRIRs) measured using spherical microphone arrays are seeing increasingly widespread use in reproducing room reverberation effects on three-dimensional surround sound systems (e.g., higher-order ambisonics) through multi-channel SRIR convolution. However, such measured impulse responses inevitably present a non-negligible noise ﬂoor, which may lead to a perceptible “inﬁnite reverberation effect” when convolved with an input sound. Furthermore, individual sensor noise and momentary measurement artefacts may additionally corrupt the resulting impulse response. This paper presents a robust SRIR denoising procedure applica-ble to impulse responses with diffuse late reverberation tails, which can be modeled by a stochastic process. In such cases, the non-decaying frequency-dependent noise ﬂoor may be replaced by a synthesized incoherent tail parameterized by the SRIR’s energy decay envelope. It is shown that performing such tail re-synthesis in the spherical harmonic domain, using an independent zero-mean Gaussian noise for each component, preserves both the reverberation tail’s frequency-dependent decay as well as its spatial coherence properties. The proposed process is then evaluated through its application to SRIRs measured in real-world conditions, and ﬁnally some aspects of performance and consistency veriﬁcation are discussed. V C 2020 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/ 4.0/) . https://doi.org/10.1121/10.0001070

A robust denoising process for spatial room impulse responses with diffuse reverberation tails ABSTRACT: Spatial room impulse responses (SRIRs) measured using spherical microphone arrays are seeing increasingly widespread use in reproducing room reverberation effects on three-dimensional surround sound systems (e.g., higher-order ambisonics) through multi-channel SRIR convolution. However, such measured impulse responses inevitably present a non-negligible noise floor, which may lead to a perceptible "infinite reverberation effect" when convolved with an input sound. Furthermore, individual sensor noise and momentary measurement artefacts may additionally corrupt the resulting impulse response. This paper presents a robust SRIR denoising procedure applicable to impulse responses with diffuse late reverberation tails, which can be modeled by a stochastic process. In such cases, the non-decaying frequency-dependent noise floor may be replaced by a synthesized incoherent tail parameterized by the SRIR's energy decay envelope. It is shown that performing such tail re-synthesis in the spherical harmonic domain, using an independent zero-mean Gaussian noise for each component, preserves both the reverberation tail's frequency-dependent decay as well as its spatial coherence properties. The proposed process is then evaluated through its application to SRIRs measured in real-world conditions, and finally some aspects of performance and consistency verification are discussed. Spherical microphone arrays (SMAs) enable the directional analysis of a given sound field by sampling it over the Q transducer positions on their surface. A natural choice of representation for a function defined on such a surface S 2 is the SHD, whose basis functions Y l;m are analogues of the trigonometric functions in the application of Fourier expansion theory on the sphere (Driscoll and Healy, 1994), xðf ; X; tÞY l;m ðXÞdX; (1) where X ¼ ðh; /Þ is a point on the surface of a sphere with fixed radius r ¼ a [in conformity with ISO 8000-2:2009(E) (2009)], xðf ; X; tÞ is the time-frequency domain representation of the sound field on the sphere, and Y l;m ðXÞ are the spherical harmonics of order l 2 Z þ and degree m 2 ½Àl; l. This transform thus defines the SHD signal coefficients X l;m ðf ; tÞ for each component or mode (l, m). Using a SMA, the integral in Eq.
(1) is discretized and can be approximated by a weighted sum over the microphone positions; the specific weights are chosen such that the sum approaches the ideal integral of Eq.
The discrete transform can be simply written in matrix form x SHD ðf ; tÞ ¼ Yxðf ; tÞ; (2) where xðf ; tÞ is the column vector containing the timefrequency representation of the signal measured at each transducer position X q , Y is the ðL þ 1Þ 2 Â Q encoding matrix of elements y q;n ¼ a q Y l;m ðX q Þ (with indices n ¼ l 2 þ l þ m þ 1 up to a maximum encoding order L, and a q the aforementioned array weights), and x SHD ðf ; tÞ is the column vector of resulting SHD coefficients. The array's sampling configuration must then lead to an encoding matrix with K non-vanishing singular values such that K ¼ ðL þ 1Þ 2 Q (Noisternig et al., 2011), thereby effectively limiting the maximum achievable order L for a given SMA. Finally, in order to obtain an arrayindependent representation of the measured sound field, a subsequent correction for the so-called mode strengths (or holographic functions) of the SMA must be applied. Such is the case in the widespread higher-order ambisonics format, where the center of the sphere is used as the reference point and for which the correcting filters are determined accordingly (Daniel and Moreau, 2004 ................................... process (Schroeder, 1962), which has been shown to be valid assuming sufficiently high echo density and modal overlap is achieved (Polack, 1988). These conditions lead to a lower time limit for echo density, known as the "mixing time," and a lower frequency limit for modal overlap, known as the "Schroeder frequency." Beyond these limits, the late reverberation field is considered to be fully "diffuse," i.e., it behaves as a spatially isotropic distribution of a statistically significant number of stochastically independent (Kuttruff, 2000) and therefore incoherent (Cremer et al., 1982) plane waves. Such a field can be synthesized in the form of a zero-mean Gaussian noise filtered by an exponentially-decaying energy envelope (Jot et al., 1997). This envelope is parameterized by a frequency-dependent decay coefficient dðf Þ [usually represented as the 60 dB reverberation time, T 60 ðf Þ ¼ 3 ln ð10Þ=dðf Þ] and an initial power spectrum P 0 ðf Þ; these parameters can be extracted by analysis of the energy decay relief (EDR), a time-frequency extension of the Schroeder energy decay curve (EDC) (Jot et al., 1997). Non-decaying background noise present in a measured impulse response can therefore be replaced by a synthesized zero-mean Gaussian noise filtered by a prolongation of the energy decay envelope (Jot et al., 1997). As a result, the final signal-to-noise ratio (SNR) is limited only by the quantization noise floor for the chosen synthesis bit depth, P QN ¼ 20 log 10 ð2 Àd Þ dB, where d is the signal bit depth. Guski and Vorl€ ander (2014) have since presented a variety of other noise compensation methods, but these focus more on regularizing the broadband EDC calculation in order to improve the accuracy of extracted room acoustics parameters (e.g., T 60 , C 80 clarity, etc.), rather than faithfully re-synthesizing the reverberation tail for convolution applications. Some of their techniques resemble that proposed by Cabrera et al. (2011) for auralizing measured RIRs; all have so far only been presented in the case of monophonic RIRs with single-slope decays. Furthermore, such decay envelope adjustment methods require strict conditions on the content of the background noise (as noted by Cabrera et al., 2011): since the signal is re-used as is, it must approach a constantpower stationary white noise. This is typically not the case in many "real-world" measurement conditions.
RIRs measured with a SMA are commonly known as spatial room impulse response (SRIR) or directional room impulse response (DRIR), although the latter usually refers to RIRs representing a particular direction (e.g., through beamforming). Preliminary extensions of Jot's tail resynthesis process to the SRIR case were presented by Carpentier et al. (2013) (using a reference diffuse field simulated by large numbers of incident plane waves to denoise the individual SMA transducer signals) and Noisternig et al. (2014) (in the SHD using independent zero-mean Gaussian noise realizations per component), both once again in the single-slope decay case. The current work builds upon and further details these methods, allowing for multiple-slope decays (such as those observed in certain coupled-volume configurations) and demonstrating that tail re-synthesis in the SHD guarantees preservation of the late reverberation's spatial coherence properties.

II. PROPOSED DENOISING PROCESS
The different parts of the proposed denoising framework are presented in this section, and the sequencing of the individual steps is outlined schematically in Fig. 1. The exponential sweep method (ESM) measurement and subsequent inverse-sweep convolution are based on Farina (2000) and performed on each SMA transducer signal independently; between these two steps we introduce an artefact reduction procedure described in Sec. II A. The SHD encoding is based on the theory presented in Sec. I, and the EDR analysis is an extension of Jot et al. (1997), detailed in Sec. II B 1. Finally, the main focus of this work is on the spatial coherence and mixing time analysis (Sec. II B 2) and reverberation tail re-synthesis (Sec. II B 3).

A. Measurement artefact reduction
Measuring impulse responses using the ESM in socalled real-world conditions is inevitably subject to three main risk factors: the presence of constant, stationary background noise (including transducer self-noise), any nonstationarity of the measurement conditions (temperature, humidity, etc.), and the occurrence of non-stationary noise events. The first is what is assumed in previous work on noise reduction and what is aimed to be removed in the tail re-synthesis procedure. The second can lead to timevariance in the impulse responses which would require postprocessing correction techniques using a priori information on the measurement conditions, and will not be considered in this study. The third is what we will refer to here as "measurement artefacts," i.e., short-term sonic events occurring during the measurement, and reducing their impact is the aim of this section.
As noted by Farina (2000), averaging a repetition of several sweeps is a simple way to increase the SNR, since the ensemble mean of any incoherent stationary noise will tend to zero as the number of repetitions increases. However, any short-term non-stationary noise events present in the repetitions will inevitably end up in the noise floor of the average, and therefore also in the noise floor of the RIR obtained by convolution with the time-reversed and amplitude-corrected excitation signal. This is especially troublesome when considering Schroeder-type reverse-integrated analysis such as the EDR, since these artefacts will not only accumulate in the reverse-integration of the noise floor, they will also deviate substantially from the theoretical profile of a reverse-integrated constant-power noise floor (see Sec. II B 1 below). Attempting to reduce the relative amplitude of the artefacts by greatly increasing the number of repetitions is not only impractical but also increases the risk of breaking the long-term stationarity condition on the measurement environment. Therefore, we seek to minimize the influence of these short-term non-stationary noise events by comparing the magnitude spectrograms of the individual sweep repetitions among each other. Non-negligible positive deviations from the mean magnitude spectrogram are thus used as a discriminating criterion in order to identify artefacts. This maximum allowed deviation is defined as nðf ; tÞ ¼ lðf ; tÞ þ arðf ; tÞ, where lðf ; tÞ is the mean magnitude spectrogram, rðf ; tÞ is the standard deviation over the available repetitions, and a is an empirically-set deviation factor used as a control parameter. Artefact magnitude values identified as greater than nðf ; tÞ in each realization are then replaced with the corresponding mean magnitude over the remaining repetitions.
This process is applied independently to the ESM measurement signals recorded by each SMA transducer. Some example results are illustrated and discussed in Sec. III A.

B. Reverberation tail analysis and re-synthesis
In this section, we first review the EDR analysis procedure used to extract the reverberation decay parameters, and then present a characterization of the SRIR's mixing time using a measure of the sound field's coherence, before showing that re-synthesizing the reverberation tail as a zeromean Gaussian noise in the SHD preserves the late field's spatial properties.

EDR analysis
The EDR is a time-frequency extension of Schroeder's reverse-integrated broadband EDC, from which frequencydependent decay envelope parameters can be extracted by analyzing each frequency bin individually (Jot et al., 1997). We begin our analysis by identifying the exponential decay section of the reverse-integrated curve presented by the EDR at each frequency bin. In dB scale (such that exponential sections become linear), this curve is first segmented using an adaptive Ramer-Douglas-Peucker (RDP) algorithm (Prasad et al., 2012) in order to help identify the different sections (early reflections, exponential decay, and noise floor). Developed for dominant point detection in digital image processing, the adaptive RDP algorithm determines the point with the maximum deviation from a linear regression over the curve, and calculates a related tolerance factor.
If the maximum deviation is greater than the tolerance factor, the algorithm is recursively applied to either side of the maximally deviating point. Such adaptive segmentation is crucial to the robustness of the sectioning and fitting procedures described below.
The noise floor limit point fP noise ; t lim g can be found by fitting the theoretical dB-scale profile of a reverse-integrated constant-power noise to the curve segments (see the shaded area In Fig. 2). Additional headroom above this noise profile is then adaptively determined (see below) to ensure the limiting point fP noise ; t lim g belongs to the exponential decay section of the curve, thereby avoiding discontinuities when prolonging the reverberation envelope for tail re-synthesis. Finally, any non-exponentially decaying early reflection regimes are discarded by selecting an appropriate starting segmentation point (t start , see Fig. 2) using a criterion on the local slopes of the curve segments up until t lim (early segments to discard are assumed to be shorter and have significantly different local slopes than those belonging to exponential decays). The exponential decay section is thus delimited by t start and t lim and the reverberation time (T 60 ) and initial power (P 0 ) values can be determined by fitting an ideal decay envelope model.
In the case of single-slope decay, the envelope parameters can be found by performing a linear regression on the identified decay section of the dB-scale curve. For multiple-slope decays, such as those observed in certain configurations of coupled volumes (Cremer et al., 1982), a parameter-space search can be performed in order to fit the model to the measured decay (Xiang et al., 2011). In general, if we consider the global energy envelope of a system of C coupled volumes to be a sum of C exponential decays FIG. 2. (Color online) EDR analysis schematic for a given frequency bin. The reverse-integrated decay curve is first segmented (black points). The noise floor (shaded area) is then identified, along with the noise floor limiting point fP noise ; t lim g (dotted and dashed lines, respectively). Early decay sections are avoided by identifying t start (dashed-dotted line), and the exponential decay model is fitted between t start and t lim .
where d i ðf Þ are the frequency-dependent decay coefficients, related to the T 60 ðf Þ by T 60 ðf Þ ¼ 3 ln ð10Þ=dðf Þ, and K denotes the parameter vector containing the P 0;i and d i values, then the ideal integrated decay curve is given by (see, also, Jot et al., 1997) A model error mod can be defined as a simple meansquared error, where N fit ¼ n e À n s þ 1, with n s is the discrete time index such that t n s ¼ t start and similarly n e such that t n e ¼ t lim . This error can then be used as a loss function (or inversely as a likelihood) in order to perform the parameter search using an expectation-maximisation (EM) or maximum-likelihood algorithm. At each frequency bin, the parameter space is of dimension 2 C, since for each exponential decay both P 0;i ðf Þ and d i ðf Þ must be estimated. To optimize the EM and avoid the detection of false local likelihood maxima, the algorithm is initialized using linear regressions performed on EDC segments defined by re-applying the adaptive RDP algorithm between t start and t lim .
The model error can additionally be used to adjust the headroom above the fitted ideal noise profile mentioned above. The procedure described above (segmentation, noise fitting, start point detection, and decay parameter search) is reiterated for several headroom values, and the result with the highest overall likelihood (lowest error) is chosen. The likelihood function used in this work is based on the Akaike Information Criterion (Akaike, 1974) and can be written L ¼ 2 log ð1= mod Þ À 2C þ log ðN fit Þ, where again C is the number of coupled decays, and log ðN fit Þ is a regularization term used to promote fits made over longer decay sections (i.e., for two fits with equal likelihood, the one made over a longer section of the EDR bin will be preferred).

Coherence analysis and mixing time estimation
As mentioned in Sec. I B, replacing the non-decaying noise floor with a reverberation tail synthesized as an exponentially-decaying zero-mean Gaussian noise assumes that the late sound field described by the impulse response is fully diffuse. This leads to the classic time-frequency limits for stochastic modeling of room reverberation, respectively, the mixing time and Schroeder frequency (Polack, 1988). The exploration of strategies for denoising in the modal domain below the Schroeder frequency is left to future work; in this paper we will apply the tail re-synthesis process across all frequencies, and note that for most reverberant spaces the Schroeder frequency is low enough that the human auditory system is largely insensitive to the modal reverberation below it. [This can be seen by comparing Schroeder's measure f Sch % 2000 ffiffiffiffiffiffiffiffiffiffiffiffi ffi T 60 =V p , where T 60 is a broadband measure of the reverberation time and V is the volume of the space (Schroeder and Kuttruff, 1962), to equal-loudness contours such as those given by the ISO 226:2003 (2003) standard.] Defining the mixing time, however, is crucial to the present work. Since a diffuse field must be spatially incoherent (Cremer et al., 1982), we propose using a measure of the sound field's spatial coherence, or rather its "level of incoherence," in order to estimate the moment the SRIR becomes maximally incoherent. Furthermore, in Sec. II B 3 we will show that re-synthesizing the late reverberation tail in the SHD guarantees that the resulting sound field will preserve these coherence properties.
Several measures of "diffuseness" have been proposed that directly exploit various characteristics of the SHD. The DirAc measure (Ahonen and Pulkki, 2009) uses the zerothand first-order components to define a sound intensity vector and analyze its temporal variation. Jarrett et al. (2012) use SHD inter-component coherence to define a "signal-to-diffuse ratio" (SDR) that is evaluated with respect to a directional signal with a given direction of arrival (DOA). Finally, the COMEDIE measure (Epain and Jin, 2016) exploits the eigen-decomposition of the SHD signal covariance matrix, which will approach the identity matrix in the case of a fully diffuse field.
The COMEDIE measure was chosen for this work since it can be adapted to analyze spatial coherence without regard to the underlying spatial power distribution (which the DirAc measure, for example, assumes to be perfectly isotropic), by using a normalized covariance calculation analogous to a coherence function. This ensures that analysis accuracy is maintained even when the late reverberation field deviates from ideal isotropic diffuseness conditions, and as such we will refer to the result as a measure of "incoherence" rather than diffuseness. Finally, the COMEDIE measure also increases in accuracy with SHD order (whereas the DirAc measure is limited to first-order signals), has a relatively lightweight implementation, and is independent from additional analyses (whereas the SDR measure requires an estimation of the DOA).
In typical "large" mixing spaces, the COMEDIE incoherence profiles tend to quickly reach a stable maximum, as shown in Fig. 3 for the Kraftzentrale event venue in Duisburg, Germany (an industrial-era factory hall approximately 84 000 m 3 in volume). Estimating the mixing time then corresponds to identifying the moment the SRIR reaches its maximum incoherence. The idea here is to first characterize the maximum incoherence and then find when the SRIR reaches this maximum in a definitive manner after an initial period of instability due to coherent early reflections. This can be done by means of an appropriately-sized moving average, a reverse-cumulative average, and a reverse-cumulative standard deviation, where i ¼ 1; 2; …; ðN d À N w þ 1Þ with N d the length of the incoherence data dðt i Þ and w a chosen averaging kernel of length N w . In this work (see Fig. 3), a 24-point Gaussian kernel was used to calculate l mov on incoherence data obtained using a 1024-sample, 87.5% overlapping short-term Fourier transform, and mathematical expectations estimated by a subsequent 8-frame average (at a 48 kHz sampling rate, this corresponds to a 40.0 ms average for incoherence points and a 101 ms total average for l cov ). The mixing time is then determined by where the time values t inc satisfy ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi l mov ðt inc Þ À l cum ðt inc Þ ½ 2 q 2r cum ðt inc Þ: Additional checks can subsequently be performed to ensure that no l mov values are below a certain threshold from l cum after this time (e.g., corresponding to late-arriving discrete echoes), adjusting t mix to a satisfying t inc value if necessary. Further validation tests on the value of the maximum incoherence may also be included (e.g., an incoherence maximum below 0.5 may not be considered "maximally incoherent").
We now need to define a condition for re-synthesizing the reverberation tail using a zero-mean Gaussian noise: if the SRIR reaches its mixing time before decaying below the noise floor, the stochastic model can be used as first proposed by Jot et al. (1997). However, whereas the mixing time is a broadband property, the EDR analysis described above returns a frequency-dependent noise floor limiting time t lim ðf Þ. To get a global value for the noise floor limiting time, we use the t lim ðf Þ values determined for the SHDencoded SRIR's Y 0;0 (omnidirectional) component and perform a perceptually-weighted average over the audible frequency range. This average is weighted according to the ITU-R 468 standard noise filter and then evaluated over Bark-scale frequency bands in order to avoid the overweighting of higher-frequency bins due to the linear frequency scale of the Fourier transform.
We denote the resulting value t lim , and the condition can then be written t mix < t lim . If it is verified, tail resynthesis may be performed using a zero-mean Gaussian noise as described below, with the perceptual considerations above ensuring that any t lim ðf Þ values smaller than t mix should have a limited perceptual impact [future work is planned to further strengthen this aspect, e.g., by taking into account the corresponding P noise ðf Þ values]. If the condition is not verified, however, alternative methods of noise reduction must be considered (see the conclusion in Sec. IV below).
We now show that re-synthesizing the reverberation tail as a zero-mean Gaussian noise in the SHD preserves the spatial coherence properties of the late reverberation field. In the SHD, the signal measured by a SMA in the presence of a perfectly isotropic diffuse field is of the form Uðf ; X; tÞY l;m ðXÞdX; where P diff ðf ; tÞ is the diffuse field power envelope, Uðf ; X; tÞ ¼ e iuðf ;X;tÞ with uðf ; X; tÞ the independent and uncorrelated plane wave phase such that jUðf ; X; tÞj ¼ 1 8 f ; X; t and E Uðf ; X; tÞU Ã ðf ; X 0 ; tÞ È É ¼ d X;X 0 (with d representing the Kronecker delta and E{Á} mathematical expectation), and b l ðf Þ are the aforementioned array mode strengths (or holographic functions). It can be shown that  (Epain and Jin, 2016) and mixing time estimation for a 4th-order SHD SRIR measured at the Kraftzentrale event venue in Duisburg, Germany, using mh Acoustics Eigenmike V R . The calculated incoherence curve is smoothed using a Gaussian kernel moving average (l mov ). An inverse cumulative average (l cum ) and standard deviation (r cum ) is further used to identify the onset of the maximum incoherence and thereby estimate the mixing time (t mix ). this leads to a spatial coherence of c diff l;m;l 0 m 0 ðf ; tÞ ¼ 0 8 ðl; mÞ 6 ¼ ðl 0 ; m 0 Þ (Jarrett et al., 2012) due to the orthogonality of the spherical harmonics and the spatial independence of the plane wave phases.
On the other hand, synthesizing a zero-mean Gaussian noise of power P diff l;m ðf ; tÞ and random phase U l;m ðf ; tÞ ¼ e iu l;m ðf ;tÞ per SHD component gives a cross-power spectral density of and therefore the same field spatial coherence as a diffuse field, In other words, a diffuse field is fully incoherent in the SHD, and subsequently synthesizing a zero-mean Gaussian noise per SHD component produces an identically incoherent field.
It can also be shown (provided again that a normalized covariance is used) that synthesizing a zero-mean Gaussian noise per SHD component leads to a SHD covariance matrix that approaches the identity matrix in the same way as an incoherent field of N ) ðL þ 1Þ 2 independent plane waves, as originally demonstrated by Epain and Jin (2016) for the COMEDIE diffuseness measure using an isotropic field. As such, the use of individual power envelopes per SHD component does not guarantee an ideally isotropic diffuse field in and of itself, but it does guarantee at least a fully incoherent field. Individual power envelopes are furthermore necessary to account for both the order-dependent frequency response of the SHD components (Daniel and Moreau, 2004) as well as any deviations from perfect isotropy, in which case imposing fully diffuse power envelopes could introduce discontinuity artefacts at the noise floor limit points when prolonging the reverberation tail (see Sec. III B below).

C. Summary of denoising process
The full denoising process (outlined in Fig. 1) can thus be summarized as follows: (1) Measurement artefact reduction. The procedure described in Sec. II A is applied to the raw ESM recording signal of each SMA microphone channel.
(2) Inverse-sweep convolution and SHD transform. The resulting "cleaned" ESM measurement is convolved with a time-reversed and amplitude-corrected version of the excitation sweep signal as per Farina (2000) to obtain a RIR for each microphone channel. This multichannel RIR is then transformed to the SHD according to the theory outlined in Sec. I A.
(3) Mixing time analysis. Coherence analysis is performed in the SHD, leading to an estimation of the mixing time as presented in Sec. II B 2. (4) EDR analysis and validation of diffuse field hypothesis.
EDR analysis is performed per SHD component in order to extract the reverberation tail decay envelope parameters [T 60 ðf Þ and P 0 ðf Þ] and noise floor limit points fP noise ; t lim gðf Þ. The t lim ðf Þ values obtained for the omnidirectional Y 0;0 component are averaged over the audible frequency range in order to estimate the broadband noise floor limiting time and confirm (or invalidate) the diffuse field hypothesis required for tail resynthesis using a zero-mean Gaussian noise. Note that since the coherence analysis does not account for the isotropy condition of diffuse fields, a verification of P 0 values per order may be necessary to ensure that their variance is not "too large." (5) Tail re-synthesis. The late reverberation tail is resynthesized using a zero-mean Gaussian noise per SHD component, which preserves spatial incoherence as shown above. For every SHD component channel, each frequency bin of the re-synthesized tail is made to decay according to the corresponding parameters extracted from the SRIR, and is then used to replace the corresponding SRIR frequency bin starting at t lim ðf Þ.

III. APPLICATION RESULTS
In this section we show the effects of applying the denoising process described above to SRIRs measured in various locations and conditions. A qualitative overview of the results is first presented, followed by a brief discussion of methods leading to a more quantitative assessment of the procedure's performance. Figure 4 illustrates the application of the artefact reduction method described in Sec. II to a single microphone channel of an ESM measurement performed at the Christuskirche in Karlsruhe, Germany (a late 19th-century church with a large open dome-like nave). Figure 4(a) shows several impulsive artefacts occurring over the course of the ESM measurement signal (averaged over four repetitions), while Fig. 4(b) illustrates how these turn into repeated inverse-sweep artefacts when the ESM measurement signal is convolved with the time-reversed and amplitudecorrected excitation signal as per Farina (2000). Figures 4(c) and 4(d) show the effect of the artefact reduction procedure on the ESM measurement signal and resulting RIR, respectively. Finally,Figs. 4(e) and 4(f) highlight the timefrequency points identified as artefacts as well as their magnitude differences before and after reduction.

A. Measurement artefact reduction
The spectrograms shown in Fig. 4 are obtained by performing a moving time average over 8 frames of short-term Fourier transform magnitudes (with 87.5% overlapping frames of 1024 samples at a 48 kHz sampling rate, this corresponds to a total averaging length of 40 ms).
The removal of impulsive measurement noises and the subsequent reduction in inverse-sweep-type artefacts revealed in Fig. 4 is crucial in ensuring that the reverseintegration of the RIR's noise floor approaches the theoretical profile fitted to identify the noise floor limit point fP noise ; t lim gðf Þ, as in Fig. 2 (see Sec. II B 1). To further illustrate this, Fig. 5 compares the EDR profile from the Christuskirche SRIR's Y 0;0 component for one frequency (2461 Hz) before and after application of the artefact reduction process: not only is the SNR increased but the number of "accidents" in the curve (or deviations from the theoretical reverse-integration of a constant power) is also decreased.
Finally, in an attempt to quantify the amount of artefact reduction, we define an artefact-to-total-energy ratio as the total artefact energy removed by replacing detected outlying points with the mean magnitude value (according to the definition given in Sec. II A) versus the total signal energy in a given frame Xðf k ; tÞ ¼ Xðf k ; tÞ À lðf k ; tÞ; jXðf k ; tÞj > nðf k ; tÞ 0: ( Thus gðtÞ represents the relative total energy removed in each time frame during artefact reduction. In the current example (the Christuskirche SRIR), this measure averaged to g ¼ 0:129 or À17.8 dB over the four sweep repetitions.

B. Reverberation tail re-synthesis
The re-synthesis of the late reverberation tail is evaluated here in two steps: the consistency of the EDR analysis is first verified on deliberately noised simulations, and then results obtained on an SRIR measured in real-world conditions are subsequently presented.

Simulated late reverberation tails
The consistency of the EDR analysis process presented in Sec. II B 1 can be simply verified by applying it to simulated late reverberation tails to which a constant-power diffuse noise floor has been deliberately added. To simplify the evaluation of the analysis results, the simulation and subsequent analysis are performed in a broadband manner, i.e., without introducing or taking into account any frequency dependence in either P 0 or T 60 . In a filter-bank view, this can be seen as reducing the evaluation to that of a single frequency bin.
The reverberation tails are simulated by first synthesizing a zero-mean Gaussian noise per SHD component, to which an exponential energy decay envelope is then applied. Two types of reverberation tails are created: a perfectly diffuse late field is obtained by setting equal P 0 and T 60 values over all SHD components, and a slightly anisotropic deviation from ideal diffuseness is simulated by introducing random fluctuations of up to 10 dB over the P 0 values and up to 10% over the T 60 values. Table I displays the error measurements obtained between the EDR analysis results and the original values used in the simulations (P 0 ¼ 0 dB, T 60 ¼ 3 s, fP noise ; t lim g ¼ fÀ50 dB; 2:5 sg). These show that not only does the proposed process guarantee preservation of the late reverberation tail's spatial coherence properties, as demonstrated in Sec. II B 3, it will additionally tend to preserve a perfectly diffuse field (to within 3 Â 10 À4 60:05 dB initially, subject to an evolution of 0:360:7% over a 60 dB decay). Furthermore, the procedure is able to reproduce deviations from ideal isotropy without substantial increases in errors; this ensures that discontinuity artefacts will be avoided when replacing the noise floor from t lim onwards in realworld SRIR measurements, where these types of deviations are inevitable.
Finally, it should be noted that the large errors in the t lim results are due to the fact that the EDR analysis process is deliberately conservative with respect to the detection of the noise floor (see Sec. II B 1), both in order to avoid the influence of the reverse-integration and to further avoid discontinuities when replacing the noise floor.

Measured SRIR
We now present results obtained with the SRIR measured at the Kraftzentrale venue presented in Sec. II B 2. Figure 6 illustrates the effect of the tail re-synthesis procedure on the EDR of the omnidirectional Y 0;0 component; the arbitrary dynamic range for synthesis is chosen to match that of the signal bit depth (193 dB at 32 bits) at the most perceptually important frequencies (again using the ITU-R  468 standard), although Fig. 6 is shown over 130 dB to match the depth of human hearing. As mentioned throughout this paper, the crucial condition for successfully denoising SRIRs by reverberation tail re-synthesis is that the late field's coherence properties must be preserved. To confirm that the proposed denoising procedure achieves this, Fig. 7 shows the COMEDIE incoherence profile for the Kraftzentrale SRIR: the incoherence maximum reached at t lim (dotted line) is successfully extended and maintained beyond the average t lim (dashed line). Note that the COMEDIE incoherence increases slightly from t mix to t lim , which may be due to the method's additional sensitivity to ideally diffuse signals versus large numbers of plane waves, as initially noted by Epain and Jin (2016).
The model decay error mod used in the EDR analysis procedure [Eq. (5), Sec. II B 1], can be examined in order to assess the performance of the denoising process on a given SRIR measurement, at least in terms of the quality of the fitted envelope which prolongs the reverberation tail. Indeed, if the measurement artefact reduction successfully ensures that the noise floor of the SRIR approaches an ideal constant-power background, the procedure should accurately detect the noise floor limit point fP noise ; t lim g per frequency bin and per SHD component, thereby allowing the decay error to be minimized. Conversely, should any of these steps perform less than ideally (indicating perhaps that the measurement does not satisfy some or all of the modeling assumptions made along the way), the decay error can be expected to increase. Figure 8 shows the maximum and mean mod over all SHD components at each frequency bin for the SRIR measured at the Kraftzentrale. Most frequency bins below $10 kHz have a maximum error of less than 1 dB/frame, and the mean error only becomes important when approaching 20 kHz. As can be seen from Fig. 6(a), this corresponds to frequencies where the SRIR's SNR rapidly decreases, and consequently fitting an exponential decay becomes increasingly difficult (reasons for this include the Eigenmike transducers' inherent frequency response and SNR at high frequencies, as well as the effect of air absorption on the RIR).
The model decay error therefore provides a general overview of the proposed procedure's performance and rapidly reveals any debilitating issues in a given application. However, it does not give an accurate representation of the perceptual quality of the final denoised SRIR, as it cannot predict the effect a certain error value (e.g., the $2 dB/frame error at $1.75 kHz in Fig. 8) will have on the resulting perceived "sound" of the SRIR. Further work should attempt to combine the model decay error with the time-frequency decay envelope in order to assess the perceptual "importance" of a given mod value (e.g., errors at frequencies with high initial powers and slow decays are much more perceptually important than errors at frequencies with low initial powers and fast decays).
Finally, it should also be noted that some ambiguity may arise when working with multiple-slope decays, in which case the "knee" between the exponential decay and constant-power noise floor (due to the EDR's reverseintegration formalism) may be mistaken as a separate curve, resulting in an erroneously low mod . Improvements with respect to this point are currently being pursued, including verifying the results of the modeling process against coupled-volume theory (e.g., Cremer et al., 1982).

IV. CONCLUSION
This paper has addressed the problem of removing the non-decaying noise floor inevitably present in SRIRs measured with SMAs and replacing it with a valid extension of the exponentially-decaying late reverberation tail. Building on previous research showing that this is possible for socalled "mixing" spaces by synthesizing the late reverberation as a zero-mean Gaussian noise and parameterizing its decay envelope by analyzing the EDR, we have demonstrated that performing this synthesis in the SHD guarantees preservation of the late field's spatial incoherence. Additionally, we have shown that including an artefact reduction step before inverse-sweep convolution of the ESM measurement signal improves identification of the noise floor during EDR analysis. As a collateral development, we have also proposed an estimate of the mixing time using a measure of SRIR incoherence in the SHD, and briefly discussed error and performance metrics and analysis for the proposed method.
Further work on this topic can be organized around three main themes. First, the question of appropriately determining the number of coupled decays to consider in multislope cases must be addressed to avoid over-fitting and ensure that the detected model satisfies coupled-volume theory. Second, cases where the late reverberation field presents highly anisotropic energy distributions must be further investigated, as the spatial symmetry of the SHD will not enable proper re-synthesis using the method presented in this paper. Finally, techniques must be developed for cases where the reverberation tail cannot be considered incoherent before reaching the noise floor, i.e., spaces that cannot be considered traditionally mixing and whose late reverberation cannot be modeled as a stochastic process.