A Non-Local Dual-Domain Approach to Cartoon and Texture Decomposition

This paper addresses the problem of cartoon and texture decomposition. Microtextures are characterized by their power spectrum, and we propose to extract cartoon and texture components from the information provided by the power spectrum of image patches. The contribution of texture to the spectrum of a patch is detected as statistically significant spectral components with respect to a null hypothesis modeling the power spectrum of a non-textured patch. The null-hypothesis model is built upon a coarse cartoon representation obtained by a basic yet fast filtering algorithm of the literature. Hence the term "dual domain": the coarse decomposition is obtained in the spatial domain and is an input of the proposed spectral approach. The statistical model is also built upon the power spectra of patches with similar textures across the image. The proposed approach therefore falls within the family of non-local methods. Experimental results are shown in various application areas, including canvas pattern removal in fine art paintings and periodic noise removal in remote sensing imaging.


Introduction
Decomposing an image as the sum of a geometric and a textural component is a problem of image analysis that has led to a great deal of work. In this problem, known as cartoon and texture decomposition, the cartoon component is piecewise smooth, made of the geometric shapes of the image, and the texture component is made of stationary or quasi-stationary oscillatory patterns filling the shapes. Of course, the problem, based on perceptual clues, is intrinsically ill-posed, and any definition of cartoon and texture is subject to discussion. After an overview of the literature, we introduce a new algorithm for cartoon and texture decomposition based on a dual-domain approach [1]: a spatial-domain filtering gives a coarse decomposition, which permits estimating statistics of the power spectrum of cartoon patches. Estimating these statistics is based on the similarity of textured patches across the image, in the same spirit as the non-local approaches [2]. Therefore, the resulting algorithm performs a non-local dual-domain cartoon and texture decomposition.

Related works
Most authors define the cartoon component as a bounded-variation function whose total variation is controlled, which discards oscillatory patterns and enforces its piecewise-constant nature, giving cartoon components with sharp edges. In contrast, the definition of the texture component is the subject of discussion, and a huge literature has emerged in the last decades. Variational approaches aim at decomposing the input image as the sum of the cartoon and texture components by minimizing some energy, often defined as the sum of the total variation (or some variant of it) of the cartoon part and of some norm of the texture part. This latter norm may be defined as the L^2-norm [3], the L^1-norm [4], Meyer's G-norm [5,6], a Gabor norm [7], or norms enforcing a sparse representation in an adapted basis [8], just to cite a few. The authors of [9] address the problem of structured noise, which can be seen as a texture covering the whole image domain. Structured noise is indeed modeled as a stationary process, and a variational approach to denoising is proposed in [9]. Recently, it has been proposed [10] to model the texture part as the sum of a small number of base textures in a low patch-rank approach. According to the authors of [11], the low patch-rank approach outperforms previous methods, which tend to smooth the cartoon component by transferring high-frequency information from edges to the texture component instead of the cartoon component. It is, however, a global approach, which is not suited to natural images where textures show different patterns of a different nature. A local approach based on the block nuclear norm is proposed in [11]. The interested reader is referred to the complete discussions in [7,10,11,12].
All these methods are slow, because it is difficult to find a numerical solution of the above-mentioned variational problems. A fast non-linear filter for cartoon and texture decomposition is proposed by the authors of [12] (software available in [13]). It is based on the relative reduction rate of the local total variation across scales, which is a good feature to measure whether an image point belongs to a textured region or not. Since this spatial filter gives quite a blurry cartoon part, a follow-up paper [14] (software in [15]) makes use of banks of filters adapted to image edges. A scale factor governs the trade-off between textures and shapes, large-size texture details being considered as shapes rather than texture. Recent papers investigate the use of a guidance signal for image filtering [16,17,18] (after [19]), which makes it possible to remove small-scale details without affecting other structures. Applications of this approach include texture removal; the difference with the aforementioned methods is probably that the retrieved cartoon component is quite sketchy.
Many of these works are based on the fact that textures are well represented in the Fourier domain. It is indeed well known that microtextures are characterized by their power spectrum, as discussed in, e.g., [20] concerning texture feature extraction, and in [21], after [22], for texture rendering. The power spectrum of a texture even generally reduces to a few spikes of various amplitudes, which justifies the sparse representations of [8,10]. Some authors make use of this property to design notch filters to remove structured noise covering the whole image domain [23,24,25,26,27]. In these papers, the difficulty is to automatize the detection of texture spikes. It is proposed in [26] to detect texture spikes as deviations from the expected power spectrum of natural images, which decreases proportionally to some power of the inverse of the frequency [28]. The expected power spectrum is simply obtained by averaging patches covering the whole image as in [24], which is not possible when considering localized textures. The method was adapted to a-contrario detection in [27]. Removing canvas patterns in fine art paintings can also be considered as a form of textured noise removal. For instance, the authors of [29] remove canvas patterns in paintings by fitting a parametric model of the canvas to the texture part given by the algorithm of [12].

Contribution
The proposed contribution consists in locally detecting texture components as statistically significant Fourier coefficients under a null hypothesis modeling non-textured patches. Our statistical model is based on two ingredients: first, the Gaussian nature of image patches, which is the basis of several recent papers on denoising [30,31,32,33]; second, the use of a coarse cartoon and texture decomposition which permits estimating the statistics of non-textured patches in a non-local way. Section 2 gives the expectation and variance of the power spectrum components of a random Gaussian non-textured patch. Since the probability distribution of patches is not constant across the whole image domain, these statistics are empirically estimated in a non-local way: we select a sample of patches with similar power spectra and extract their cartoon parts with the coarse yet fast algorithm of [12,13] in order to estimate the above-mentioned statistics. This approach is in line with the well-known non-local means methodology [2], in which similar patches are assumed to be independent realizations of the same random patch. Section 3 explains how to build a statistical test to detect texture spectrum spikes in each image patch, based on the null model of Section 2. A non-linear filter separating cartoon and texture components is thus designed, and implemented using the discrete windowed Fourier transform.
Since texture is extracted in the Fourier domain from coarse information coming from a spatial filter, this is a dual-domain approach, a term coined by the authors of [1] when proposing to refine denoising in the frequency domain after a first estimation in the spatial domain. Note that the present contribution is not, strictly speaking, a dual-domain filter in the sense of [1].
Experiments are discussed in Section 4. In order to reproduce the experiments presented here, we make a Matlab code available on the following webpage, together with a companion document showing additional experiments: https://members.loria.fr/FSur/software/NoLoDuDoCT/
The expectation and variance of a random variable \(V\) are denoted by \(E(V)\) and \(\operatorname{var}(V)\), respectively. The covariance of two random variables \(V\) and \(V'\) is denoted by \(\operatorname{cov}(V, V')\). The empirical estimate of a statistic \(S\) is denoted by \(\widehat{S}\).

The power spectrum of Gaussian patches
The Fourier transform of a Gaussian patch follows a complex normal distribution, and the power spectrum coefficients follow a non-central generalized chi-square distribution. While accurate approximations of this distribution are available [34,35], their computational cost is not suitable for our use. It is a standard trick to use the more tractable Gaussian approximation, justified by the central limit theorem.

Expectation and variance of the power spectrum
We assume that any random patch \(p_0 = c + n\) is the sum of a random non-textured component \(c\) and a Gaussian noise \(n\) of variance \(\sigma^2(x)\) at pixel \(x\), statistically independent of \(c(x)\). Since we aim at building a statistical model for the absence of texture (null hypothesis), we assume that \(p_0\) does not show any texture, although this does not impact the calculation of the expectation and variance in this section.
The Fourier transform being linear, \(\widehat{p_0} = \widehat{c} + \widehat{n}\), and the power spectrum of the patch \(p_0\) thus writes, for any frequency \(\boldsymbol{\xi} = (\xi, \eta)\):
\[
|\widehat{p_0}(\boldsymbol{\xi})|^2 = |\widehat{c}(\boldsymbol{\xi})|^2 + |\widehat{n}(\boldsymbol{\xi})|^2 + 2\operatorname{Re}\left(\widehat{c}(\boldsymbol{\xi})\,\overline{\widehat{n}(\boldsymbol{\xi})}\right).
\]
We now calculate the expectation and variance of \(|\widehat{p_0}(\boldsymbol{\xi})|^2\) for any \(\boldsymbol{\xi}\). For better readability, we drop \(\boldsymbol{\xi}\) in the following equations.

Concerning the expectation, \(E(\widehat{c}\,\overline{\widehat{n}}) = E(\widehat{c})\,E(\overline{\widehat{n}}) = 0\) since \(c\) and \(n\) are independent and \(E(\widehat{n}) = 0\). Moreover, \(E(|\widehat{n}|^2) = \sum_x \sigma^2(x) = \sigma^2\), so that:
\[
E(|\widehat{p_0}|^2) = E(|\widehat{c}|^2) + \sigma^2. \tag{3}
\]

Concerning the variance, since \(\operatorname{Re}(z) \leq |z|\) for any complex number \(z\), and \(\operatorname{var}(V) \leq E(V^2)\) for any random variable \(V\), the variance of the cross term \(2\operatorname{Re}(\widehat{c}\,\overline{\widehat{n}})\) is bounded by \(4\sigma^2 E(|\widehat{p_0}|^2)\). Concerning \(\operatorname{var}(|\widehat{n}|^2)\), the fourth-order moments \(E(n_{x,y}\, n_{x',y'}\, n_{x'',y''}\, n_{x''',y'''})\), where the summation is over \((x,y)\), \((x',y')\), \((x'',y'')\), and \((x''',y''')\) spanning the patch domain, are equal to \(\sigma^4\) if \((x,y,x',y') = (x'',y'',x''',y''')\), or if \((x,y,x',y') = (x''',y''',x'',y'')\), or if \((x,y,x'',y'') = (x',y',x''',y''')\), and vanish otherwise. This gives \(\operatorname{var}(|\widehat{n}|^2) \leq 2\sigma^4\), the latter inequality holding thanks to the Bessel inequality.

To sum up the discussion, we can conclude that:
\[
\operatorname{var}(|\widehat{p_0}|^2) \leq \operatorname{var}(|\widehat{c}|^2) + 2\sigma^4 + 4\sigma^2 E(|\widehat{p_0}|^2). \tag{9}
\]
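The two moment formulas above can be sanity-checked numerically. The following sketch (an illustration of ours, not part of the proposed algorithm) simulates a fixed smooth patch plus heteroscedastic Gaussian noise and compares the empirical mean and variance of one power spectrum coefficient with Equations 3 and 9; the patch size, noise levels, and tested frequency are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 8
# deterministic "cartoon" patch (a smooth ramp) and per-pixel noise std;
# the values are arbitrary, chosen only for this illustration
c = np.outer(np.linspace(0.0, 1.0, L), np.ones(L))
sigma_map = 0.1 + 0.05 * rng.random((L, L))
sigma2 = np.sum(sigma_map**2)        # noise total variance, sum_x sigma^2(x)

xi = (2, 0)                          # one arbitrary frequency
n_trials = 20000
ps = np.empty(n_trials)
for t in range(n_trials):
    noise = sigma_map * rng.standard_normal((L, L))
    ps[t] = np.abs(np.fft.fft2(c + noise)[xi])**2

expected = np.abs(np.fft.fft2(c)[xi])**2 + sigma2   # Equation (3); c is deterministic
bound = 2 * sigma2**2 + 4 * sigma2 * expected       # Equation (9); var(|c_hat|^2) = 0 here
```

The empirical mean of `ps` should match `expected` up to Monte Carlo error, and its empirical variance should stay below `bound`.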

Estimating the statistics of a non-textured random patch
In Section 3, texture components are detected in the power spectrum of a patch as deviations from a Gaussian model of the power spectrum of non-textured random patches. We first need to empirically estimate the expectation and variance of the power spectrum of a non-textured patch \(p_0\). We can see from Equations 3 and 9 that this requires estimating the expectation and variance of the power spectrum of the cartoon component \(c\), as well as the noise total variance \(\sigma^2 = \sum_x \sigma^2(x)\).

Estimating cartoon statistics
Although the approach of [12] gives a coarse cartoon output which is often quite smooth, it is fast to compute. We therefore take it as an initial guess of the cartoon component of any patch in the image. Smoothness is even an advantage here: indeed, it makes it possible to assume that noise has been removed from the cartoon component. In order to estimate both the expectation \(E(|\widehat{c}|^2)\) and the variance \(\operatorname{var}(|\widehat{c}|^2)\) for any random patch \(p = c + n\), we make use of a non-local estimation approach. More precisely, for a patch \(p_x\) centered at any pixel \(x\), we extract from the input image the set \(S\) of patches whose power spectra are the \(N\) nearest neighbors of \(|\widehat{p}_x|^2\) in the sense of the Euclidean norm. These patches are thus likely to share the same texture and to be realizations of the same random process. Weighted sample estimates of the expectation \(E(|\widehat{c}|^2)\) and variance \(\operatorname{var}(|\widehat{c}|^2)\) are calculated based on the cartoon components of the patches in \(S\). The weights take dissimilarities into account and are proportional to \(e^{-\| |\widehat{c}|^2 - |\widehat{s}|^2 \| / \lambda(x)}\) for any patch \(s \in S\), \(\lambda(x)\) being a normalizing parameter defined later.
In practice, the proposed procedure is made of two steps. First, the power spectra of image patches and of the corresponding cartoon components are extracted from the whole image. In order to deal with shape edges and avoid mixing texture information from two adjacent regions, the contribution of each pixel is weighted by a non-decreasing function of its similarity to the central pixel in the cartoon domain. Indeed, it has been shown in several papers [36,37] that adapting patches to the shape is a good way to keep non-local methods from mixing information from overlapping shapes. Moreover, in order to deal with border discontinuities and avoid spectral leakage, we also multiply patches by a Gaussian window. Therefore, denoting by \(I\) the input image, by \(D = \{1, \dots, X\} \times \{1, \dots, Y\}\) its domain, and by \(C_c\) its coarse cartoon representation obtained by [12], we actually compute modified windowed Fourier transforms defined at each pixel \(x\) as:
\[
I_x(\boldsymbol{\xi}) = \sum_{y \in D} g_\alpha(y - x)\, g_\beta\big(C_c(y) - C_c(x)\big)\, I(y)\, e^{-2i\pi \langle \boldsymbol{\xi},\, y - x \rangle / L},
\]
\[
C_x(\boldsymbol{\xi}) = \sum_{y \in D} g_\alpha(y - x)\, g_\beta\big(C_c(y) - C_c(x)\big)\, C_c(y)\, e^{-2i\pi \langle \boldsymbol{\xi},\, y - x \rangle / L},
\]
where \(g_\alpha\) and \(g_\beta\) are Gaussian functions of standard deviation \(\alpha\) and \(\beta\), \(g_\alpha\) being truncated to define an \(L \times L\) patch. The corresponding power spectra are denoted by \(P^I_x\) and \(P^C_x\): by definition, \(P^I_x(\boldsymbol{\xi}) = |I_x(\boldsymbol{\xi})|^2\) and \(P^C_x(\boldsymbol{\xi}) = |C_x(\boldsymbol{\xi})|^2\) for any \(\boldsymbol{\xi} \in \{1, \dots, L\} \times \{1, \dots, L\}\). We extract these power spectra in such a way that their overlap is limited in order to avoid bias when estimating the sample mean and sample variance. In practice, we impose that the patch center \(x\) lies on a subsampled grid \(D_s\) made of regularly distributed pixel locations at a distance of \(s\) pixels.
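As an illustration, the modified windowed Fourier transform can be sketched in a few lines (function and parameter names are ours; the paper's actual implementation is the authors' Matlab code):

```python
import numpy as np

def modified_wft(img, cartoon, x, L=32, alpha=None, beta=20.0):
    """Windowed Fourier transform of the L x L patch centered at x, weighted
    by a truncated spatial Gaussian g_alpha and by a range Gaussian g_beta on
    the coarse cartoon values (the bilateral-style weighting described above).
    Phases are taken relative to the patch corner, which leaves the power
    spectrum unchanged."""
    if alpha is None:
        alpha = L / 5.0          # default relation alpha = L/5 from the paper
    r = L // 2
    patch = img[x[0] - r:x[0] + r, x[1] - r:x[1] + r]
    cpatch = cartoon[x[0] - r:x[0] + r, x[1] - r:x[1] + r]
    u = np.arange(-r, r)
    g_alpha = np.exp(-(u[:, None]**2 + u[None, :]**2) / (2.0 * alpha**2))
    g_beta = np.exp(-(cpatch - cartoon[x])**2 / (2.0 * beta**2))
    return np.fft.fft2(g_alpha * g_beta * patch)
```

The power spectrum of the patch is then `np.abs(modified_wft(...))**2`; on a constant image, all the energy concentrates in the zero-frequency bin, as expected.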
The second step consists in estimating, at any pixel \(x \in D\), the expectation and variance of the power spectrum of cartoon components based on the \(P^I_{x_i}\) and \(P^C_{x_i}\) calculated above for \(x_i \in D_s\). At any \(x \in D\), these statistics are estimated over the set of the \(N\) cartoon power spectra \(P^C_{x_1}, \dots, P^C_{x_N}\), the \(x_i\) being defined such that \(\{P^I_{x_1}, \dots, P^I_{x_N}\}\) is the set of the \(N\) nearest neighbors of the query power spectrum \(P^I_x\). We use weighted estimates to take the dissimilarity of the nearest power spectra into account. For any \(\boldsymbol{\xi}\),
\[
\widehat{E}\big(P^C_x(\boldsymbol{\xi})\big) = \sum_{i=1}^{N} w_i\, P^C_{x_i}(\boldsymbol{\xi}), \tag{14}
\]
\[
\widehat{\operatorname{var}}\big(P^C_x(\boldsymbol{\xi})\big) = \sum_{i=1}^{N} w_i \Big(P^C_{x_i}(\boldsymbol{\xi}) - \widehat{E}\big(P^C_x(\boldsymbol{\xi})\big)\Big)^2, \tag{15}
\]
where the weights \(w_i\) satisfy
\[
w_i \propto e^{-\|P^I_x - P^I_{x_i}\| / \lambda(x)} \quad \text{with} \quad \sum_{i=1}^{N} w_i = 1, \tag{16}
\]
\(\| \cdot \|\) being the Euclidean norm of the components of frequency larger than \(2/L\) cycles per pixel (to remove the effect of very low frequency changes), and \(\lambda(x)\) being a normalization parameter equal to the median of the distances of the \(N\) nearest neighbors of \(P^I_x\). The goal of this normalization, similarly to the so-called contextual dissimilarity measure [38], is to adapt the weights to the non-homogeneous distribution of the power spectra.
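This non-local weighted estimation can be sketched as follows (a simplification of ours: brute-force nearest-neighbor search instead of FLANN, and the low-frequency exclusion in the norm is omitted for brevity):

```python
import numpy as np

def nonlocal_cartoon_stats(P_query, P_img_all, P_cart_all, N=20):
    """Weighted sample mean and variance of cartoon power spectra over the N
    nearest image-patch power spectra, in the spirit of Eqs. 14-16.
    P_img_all, P_cart_all: (n_patches, n_freqs) arrays of flattened spectra."""
    d = np.linalg.norm(P_img_all - P_query, axis=1)   # Euclidean distances
    idx = np.argsort(d)[:N]                           # N nearest neighbors
    lam = max(np.median(d[idx]), 1e-12)               # contextual normalization
    w = np.exp(-d[idx] / lam)
    w /= w.sum()                                      # weights sum to 1
    mean = w @ P_cart_all[idx]
    var = w @ (P_cart_all[idx] - mean)**2
    return mean, var
```

When all candidate cartoon spectra are identical, the weighted mean recovers that common spectrum and the weighted variance vanishes, as it should.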
The nearest neighbors of the current power spectrum are found over the whole image by using the efficient FLANN library [39].
It should be noted that, since a patch is here the result of the pixelwise multiplication of the underlying image and an analysis window, the noise amplitude must be assumed to vary from pixel to pixel, even if a Gaussian white noise affects the image to be decomposed.

Estimating noise variance
We see from Equation 3 that the noise total variance \(\sigma^2\) can be estimated as the empirical mean of \(|\widehat{p_0}(\boldsymbol{\xi})|^2 - |\widehat{c}(\boldsymbol{\xi})|^2\) over the frequencies \(\boldsymbol{\xi}\). The non-textured patch \(p_0\) is unknown, and only a coarse estimation of \(c\) is available. As a consequence, for any \(\boldsymbol{\xi}\), we assume that \(\widehat{\sigma}^2 \simeq \widehat{E}(P^I_x(\boldsymbol{\xi})) - \widehat{E}(P^C_x(\boldsymbol{\xi}))\). This equation mainly differs from the preceding one in that the power spectrum \(P^I_x\) is affected by texture components. Since textures have a sparse Fourier representation, these components only affect a few Fourier coefficients, and averaging the power spectrum difference over a range of frequencies smoothes them out. Moreover, we average over high frequencies, since low and medium frequencies are expected to contain residual image content not purely made of noise, cf. [40,41]. Consequently, we estimate \(\sigma^2\) as:
\[
\widehat{\sigma}^2 = \frac{1}{M} \sum_{|\boldsymbol{\xi}| > H} \Big( \widehat{E}\big(P^I_x(\boldsymbol{\xi})\big) - \widehat{E}\big(P^C_x(\boldsymbol{\xi})\big) \Big), \tag{17}
\]
where \(H\) is a frequency threshold and \(M\) is the number of discrete frequencies above \(H\). In practice, we take \(H = 0.5\) cycles per pixel.
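This high-frequency averaging is straightforward to code (a sketch of ours of Eq. 17, using NumPy's frequency grid conventions):

```python
import numpy as np

L = 32
fx = np.fft.fftfreq(L)                        # frequencies in cycles per pixel
freqs = np.hypot(fx[:, None], fx[None, :])    # radial frequency grid |xi|

def estimate_noise_variance(E_PI, E_PC, freqs, H=0.5):
    """Average the image-minus-cartoon expected power spectrum difference
    over the frequencies above H cycles per pixel (sketch of Eq. 17)."""
    mask = freqs > H
    M = int(mask.sum())
    return float(np.sum(E_PI[mask] - E_PC[mask]) / M)
```

If the two expected spectra differ by a constant offset at every frequency, the estimator returns exactly that offset.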

Estimating statistics of the power spectrum of a noisy non-textured patch
Based on Equation 3, we eventually estimate the expectation of the power spectrum \(P_x\) of a non-textured noisy patch. Concerning the variance, we use the upper bound given by Equation 9.
Consequently,
\[
\widehat{E}\big(P_x(\boldsymbol{\xi})\big) = \widehat{E}\big(P^C_x(\boldsymbol{\xi})\big) + \widehat{\sigma}^2, \tag{18}
\]
\[
\widehat{\operatorname{var}}\big(P_x(\boldsymbol{\xi})\big) = \widehat{\operatorname{var}}\big(P^C_x(\boldsymbol{\xi})\big) + 2\,\widehat{\sigma}^4 + 4\,\widehat{\sigma}^2\, \widehat{E}\big(P_x(\boldsymbol{\xi})\big). \tag{19}
\]

A non-local dual-domain cartoon and texture decomposition

The preceding section gives the expectation and variance of the power spectrum components of non-textured noisy patches, empirically estimated with a tentative coarse cartoon and texture decomposition (Equations 18 and 19). We are therefore able to detect texture components as deviations from this model, since these components have a larger amplitude than expected under a null hypothesis assuming a non-textured patch.
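For reference, the estimates of Equations 18 and 19 translate directly into a small helper (the function name is ours; it works on scalars or NumPy arrays alike):

```python
def null_model_stats(E_PC, var_PC, sigma2):
    """Equations 18-19: expectation and variance of the power spectrum of a
    noisy non-textured patch, from the cartoon-spectrum estimates and the
    estimated noise total variance."""
    E_P = E_PC + sigma2
    var_P = var_PC + 2.0 * sigma2**2 + 4.0 * sigma2 * E_P
    return E_P, var_P
```

For instance, with \(\widehat{E}(P^C_x) = 1\), \(\widehat{\operatorname{var}}(P^C_x) = 0\) and \(\widehat{\sigma}^2 = 2\), the null model has expectation 3 and variance \(0 + 2 \cdot 4 + 4 \cdot 2 \cdot 3 = 32\).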

Statistical hypothesis testing
Given an image patch \(p\), a significant spectrum component is detected at frequency \(\boldsymbol{\xi}\) provided that the p-value \(\Pr_{H_0}\big(|\widehat{p_0}(\boldsymbol{\xi})|^2 \geq |\widehat{p}(\boldsymbol{\xi})|^2\big)\) is below some significance level \(\alpha\). The null hypothesis \(H_0\) is that of a non-textured random patch \(p_0\). Under a Gaussian distribution assumption, a texture component is therefore detected at \(\boldsymbol{\xi}\) as soon as:
\[
G\left( \frac{P^I_x(\boldsymbol{\xi}) - \widehat{E}\big(P_x(\boldsymbol{\xi})\big)}{\sqrt{\widehat{\operatorname{var}}\big(P_x(\boldsymbol{\xi})\big)}} \right) < \alpha,
\]
where \(P^I_x\) is the power spectrum of the image patch centered at \(x\), \(\widehat{E}(P_x(\boldsymbol{\xi}))\) and \(\widehat{\operatorname{var}}(P_x(\boldsymbol{\xi}))\) are given by Equations 18 and 19, respectively, and \(G\) is the tail distribution function (complementary cumulative distribution function) of the standard normal distribution. Here, the estimation of the variance being an upper bound of the actual variance, some significant detections may be missed.
Since a statistical test is performed at every frequency, we use the Bonferroni correction [42] to keep the familywise false alarm rate over each patch of size \(L \times L\) under \(\varepsilon\), thus:
\[
\alpha = \frac{\varepsilon}{L^2},
\]
where \(\varepsilon\) is the familywise error rate, set here to 5%. We do not take into account the correlation matrix of the power spectrum components, since estimating the correlations would require a large number of samples, which is out of reach here. It should be noted that a recent paper addresses the estimation of the correlation matrix of patches [30].
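The Bonferroni-corrected test can be sketched as follows (the function name is ours; the Gaussian tail \(G\) is computed through the complementary error function):

```python
import math
import numpy as np

def texture_mask(P_img, E_P, var_P, eps=0.05):
    """Binary detection mask over an L x L patch spectrum: 1 where a component
    is significantly larger than expected under the non-textured null model
    (Gaussian tail G, Bonferroni correction over the L^2 frequencies)."""
    n_tests = P_img.size                    # number of tested frequencies
    z = (P_img - E_P) / np.sqrt(var_P)
    tail = 0.5 * np.vectorize(math.erfc)(z / math.sqrt(2.0))   # G(z)
    return (tail < eps / n_tests).astype(np.uint8)
```

A single coefficient ten standard deviations above its expectation is flagged, while coefficients at their expected value are not.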

Cartoon and texture components
Section 3.1 gives a framework to detect texture components in the spectrum of any patch. For any patch centered at pixel \(x\), we thus define a very simple bandpass filter \(M_x\): \(M_x(\boldsymbol{\xi}) = 1\) if a significant texture component is detected at frequency \(\boldsymbol{\xi}\), and \(M_x(\boldsymbol{\xi}) = 0\) otherwise. This gives a non-linear bandpass filter through a windowed Fourier transform of the input image \(I\). If \(\widetilde{I}(x, \boldsymbol{\xi})\) denotes the windowed Fourier transform of \(I\), such that:
\[
\widetilde{I}(x, \boldsymbol{\xi}) = \sum_{y \in D} g_\alpha(y - x)\, I(y)\, e^{-2i\pi \langle \boldsymbol{\xi},\, y - x \rangle / L},
\]
then the texture component \(T\) is retrieved as the inverse windowed Fourier transform of the product \(M_x(\boldsymbol{\xi})\, \widetilde{I}(x, \boldsymbol{\xi})\). The cartoon component \(C\) is simply such that \(I = C + T\).

Algorithm 1: Non-local dual-domain cartoon and texture decomposition.
1. Compute a tentative coarse cartoon representation (denoted by \(C_c\)) from \(I\) with [12], scale-factor parameter \(\sigma_{CT}\);
2. for any pixel \(x \in D_s\) do
3.   Calculate the modified windowed Fourier transforms \(I_x(\boldsymbol{\xi})\) and \(C_x(\boldsymbol{\xi})\) for any frequency \(\boldsymbol{\xi}\),
4.   and the corresponding power spectra \(P^I_x(\boldsymbol{\xi}) = |I_x(\boldsymbol{\xi})|^2\) and \(P^C_x(\boldsymbol{\xi}) = |C_x(\boldsymbol{\xi})|^2\);
5. end
6. for any pixel \(x \in D\) do
7.   Find the \(N\) nearest image-patch power spectra \(P^I_{x_1}, \dots, P^I_{x_N}\) to \(P^I_x\) with the FLANN library [39];
8.   Calculate the sample mean \(\widehat{E}(P^C_x(\boldsymbol{\xi}))\) (Eq. 14) and sample variance \(\widehat{\operatorname{var}}(P^C_x(\boldsymbol{\xi}))\) (Eq. 15) based on the \(x_i\);
9.   Calculate the estimation of the total variance \(\widehat{\sigma}^2\) (Eq. 17);
10.  Calculate the estimations of the expectation \(\widehat{E}(P_x(\boldsymbol{\xi}))\) (Eq. 18) and variance \(\widehat{\operatorname{var}}(P_x(\boldsymbol{\xi}))\) (Eq. 19) of a non-textured patch;
11.  Detect texture components at \(\boldsymbol{\xi}\) through statistical hypothesis testing (Section 3.1);
12. end
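A per-patch version of this bandpass filtering can be sketched as follows (a simplification of ours: the recombination of overlapping patches across the grid \(D_s\) is omitted for brevity, and names are hypothetical):

```python
import numpy as np

def patch_texture(patch, mask, alpha=None):
    """Sketch of the bandpass filtering of Section 3.2 on a single patch:
    windowed Fourier transform, keep only the frequencies flagged by the
    detection mask M_x, inverse transform."""
    L = patch.shape[0]
    if alpha is None:
        alpha = L / 5.0
    u = np.arange(L) - L // 2
    g = np.exp(-(u[:, None]**2 + u[None, :]**2) / (2.0 * alpha**2))
    W = np.fft.fft2(g * patch)              # windowed Fourier transform
    return np.real(np.fft.ifft2(mask * W))  # keep only detected components
```

Two limiting cases check the construction: an all-zero mask returns a zero texture component, and an all-one mask returns the windowed patch unchanged.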

Algorithm and parameters
Algorithm 1 summarizes the proposed non-local dual-domain cartoon and texture decomposition. The parameter σ_CT of [12], used to obtain the tentative coarse cartoon, is set so that textures are discarded from the cartoon, but is small enough to prevent discarding high-frequency details. In Section 4, σ_CT = 2 unless otherwise specified. The patch width L and the parameter α satisfy α = L/5; their values define the trade-off between shape and texture. Any pattern showing a low frequency with respect to 1/α will not be detected as a texture, basically because it does not give a spike in patch power spectra. A trade-off has to be made between the ability to detect large-period patterns (which requires a large L) and the need to have a large enough sample of non-overlapping patches for statistical estimation (which bounds L). We extract patches on a subsampling grid D_s of step s = L/4. Concerning the experiments of Section 4, a typical value is L = 32 pixels, which permits detecting texture patterns of frequency larger than 1/16 cycles per pixel. Note that such a patch size is much larger than typical values in patch-based methods dedicated to noise removal. The parameter β prevents mixing texture information with shapes whose gray value is larger or smaller than that of the current shape. Because a Gaussian function is used, a difference of the cartoon components larger than 2β indeed multiplies the corresponding gray value by \(e^{-2}\). The value of β obviously depends on the contrast of the image; low-contrast images require smaller values. We set β = 20 in Section 4 unless otherwise stated. The number of nearest neighbors N is set to 20.
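For reference, the default parameter values discussed above can be gathered in one place (a convenience sketch of ours; the relations α = L/5 and s = L/4 are from the text):

```python
L = 32                       # patch width (pixels)
params = {
    "sigma_CT": 2.0,         # scale factor of the coarse filter [12]
    "L": L,                  # patch width
    "alpha": L / 5,          # std of the spatial Gaussian window g_alpha
    "s": L // 4,             # subsampling step of the grid D_s
    "beta": 20.0,            # std of the cartoon-range Gaussian g_beta
    "N": 20,                 # number of nearest neighbors
    "eps": 0.05,             # familywise error rate
    "H": 0.5,                # noise-estimation threshold (cycles per pixel)
}
# detectable texture frequencies are above roughly 1/(L/2) cycles per pixel
min_detectable_freq = 1.0 / (L / 2)   # 1/16 for L = 32
```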

Experimental results
This section presents some experiments. We deal with color images by designing the bandpass filter of Section 3.2 in the luminance channel, the same filter being eventually applied to the three RGB channels. The texture components shown in this section actually correspond to the 8-bit image 128 + 3T, so that they appear much more salient in the figures. The reader is invited to zoom in on the images in the electronic version of the paper.
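This display convention can be written as a one-liner (the clipping of out-of-range values to [0, 255] is our assumption):

```python
import numpy as np

def texture_display(T):
    """8-bit rendering 128 + 3*T of a texture component, as in the figures."""
    return np.clip(128.0 + 3.0 * T, 0.0, 255.0).astype(np.uint8)
```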
Since the cartoon and texture decomposition problem is based on perceptual clues, no ground truth is available. Overall, we would like the cartoon edges to be preserved and the stationary patterns to be in the texture part. As mentioned in [12], the distinction between texture and cartoon is not clear-cut, because at a certain scale a texture element becomes a geometric shape. We visually compare the proposed algorithm with two state-of-the-art algorithms, namely the block nuclear norm based algorithm [11], which is claimed to outperform variational methods such as [10], and the directional non-linear filter proposed in [14], which certainly achieves the best trade-off between the quality of the cartoon / texture separation and computation time. We use the software implementations kindly provided by the authors of [11] and [15]. Note that the implementation of [11] only gives gray-level outputs. The default parameters of [11] are used, and the scale-factor parameter σ_CT in [15] is adjusted to obtain the best visual results. Our algorithm is implemented as a publicly available Matlab code.
To begin with, we illustrate the proposed algorithm with a famous image containing several different textures. Figure 1 shows the original image, the cartoon component given by the fast non-linear filter of [12] (this is the coarse decomposition used in our algorithm), and the output of the proposed approach. As we can see, while the texture component has mostly been taken out, the cartoon component of [12] is quite blurry. Comparing with the outputs of [11] and [15], we can see that the proposed algorithm tends to better separate texture and cartoon, with less blurry shape edges and a larger part of the texture component well separated from the cartoon, as, for instance, in the tablecloth, in the chair in the background, and even in narrow shapes such as the folds of the scarf. Most edges cannot be seen in the texture part, except, for instance, for periodic patterns on the face. This is a positive feature of the proposed algorithm: it means that high-frequency components corresponding to edges are correctly detected as non-significant spectrum components. Here, the scale-factor parameter σ_CT of [15] is difficult to set: σ_CT = 2 keeps textures in the cartoon part, and σ_CT = 3 (not shown here, see the companion document) gives a quite smooth cartoon with edge information in the texture. Figure 2 illustrates the behavior of the algorithm for several patches. We can see that similar patches (in the sense of power spectrum similarity) are weighted according to the texture similarity. The numerical values indicated here are the normalized weights w_i / Σ_j w_j of Equation 16; they would all be equal to 1/N = 0.05 if the patches were evenly distributed. Moreover, we can see that the expected cartoon power spectrum actually varies from patch to patch, which illustrates that they come from different Gaussian processes.
The hypothesis testing framework allows obtaining the localization of the texture spikes in the Fourier domain, and detects no significant texture components if no textures are present, as in the third patch.

[Figure: first row: original image and cartoon component of [12]; second row: output of the proposed algorithm; third row: output of [11]; fourth row: output of [15] (σ_CT = 2); from left to right: cartoon, texture (except for the first row), and close-up.]

Compared to the other methods, our cartoon component appears more realistic, probably because the noise component is kept in it, while the other methods minimize the total variation of the cartoon part, either explicitly or implicitly, thus smoothing out the noise.

[Figure: bottom left: output of [11]; bottom right: output of [15].]

Figure 5 shows a detail of Norman Rockwell's painting Our Heritage. It is an application of cartoon and texture decomposition to canvas pattern removal. The canvas of a painting is sometimes visible and needs to be removed for further processing, as explained in [29]. Although this problem has been the subject of works from the eighties [43], it is still an open question [29,44]. It can be seen as a cartoon and texture problem, as the canvas is locally not visible where the painting layer is too thick. Here, σ_CT has been set to 3 for obtaining the tentative coarse cartoon with [12], and β = 30 since the image is quite contrasted. As we can see, the proposed algorithm is interesting in this application domain, since it removes the quasi-periodic component of the canvas pattern (contrary to [11] and in a better way than [12]) while still keeping sharp edges.

[Figure: bottom left: output of [11]; bottom right: output of [15].]

Figure 6 shows an application to remote sensing image processing, a field that dates back to the dawn of robotic probe imaging [46].
Such images are often impaired by electrical interference during image acquisition, miscalibrated sensors, or missing data, causing line dropout, striping, banding, or more complex background noise. In this image, the patch size L has been set to 64, since we want to capture texture information such as the checkerboard noise pattern, which has a periodicity of approximately 20 pixels. The contrast being low, β is set to 10. As we can see, the proposed algorithm removes most of the background noise (contrary to [15]) while keeping a quite sharp output (in contrast to [11]).
As explained in Section 1, assessing the separation between cartoon and texture is difficult since it depends on perceptual clues. Although such an experiment may look somewhat contrived, it is possible to test the algorithms discussed here on synthetic data. Figure 7 shows cartoon and texture decompositions of a synthetic image, built by adding to a piecewise constant 8-bit image a texture image whose quadrants are made of three sine-wave textures, the fourth quadrant being kept untextured. As we can see, the outputs of [11] and [15] show somewhat mixed cartoon and texture, cartoon edges being noticeable in the retrieved texture. This phenomenon is less visible with the proposed algorithm. Here, the proposed algorithm and the one of [11] are used with their default parameters, and σ_CT = 2 in [15].

[Figure 6: Mariner 6. The contrast is augmented for better visibility. First row: original image, output of [12], details of the original image with the "checkerboard" noise clearly visible. Second row: output of the proposed algorithm. Third row: output of [11]. Fourth row: output of [15]. Second, third, and fourth rows, from left to right: cartoon, texture, and close-up view of the cartoon component.]

[Figure 7: cartoon retrieved by [11], texture retrieved by [11], cartoon retrieved by [14], texture retrieved by [14].]

Table 1 gives a quantitative assessment with the root-mean-square error (RMSE) between the retrieved cartoon (resp. texture) component and the ground-truth cartoon (resp. texture) component. The RMSE criterion confirms the visual impression (the lower the values, the better the decomposition). Although the cartoon component of [11] looks sharper, some parts have a grayer aspect than in the original image, which explains the larger RMSE. With the proposed algorithm, the RMSEs are equal for the cartoon and texture components because the sum of both components exactly matches the input image. The texture normalization used in the software program [15] may explain why the RMSEs of cartoon and texture differ for this method in this particular experiment.

To conclude this discussion, Table 2 gives computation times on a desktop computer equipped with a 12-core Intel Xeon E5-2650 v4 @ 2.20GHz CPU and 64 GB of memory. We can see that the proposed method, implemented in non-optimized Matlab software, is much faster than the one of [11]. In all cases, the computation time of the preprocessing step [12] in the proposed algorithm is less than 0.3 seconds.

[Table 2: image sizes and computation times in seconds —
(Fig. 1): 567 × 787, 35.5, 332.1, 6.5;
Manhattan (Fig. 3): 575 × 1024, 43.7, 518.9, 9.2;
Kodim08 (Fig. 4): 512 × 768, 29.7, 375.5, 6.3;
Rockwell (Fig. 5): …]

Conclusion
This paper presents a novel patch-based approach to cartoon and texture decomposition. Fourier components of textures are detected in patches through statistical hypothesis testing, the null model being estimated in a non-local way over a tentative coarse cartoon representation. The proposed algorithm achieves a good compromise between computation time and accuracy.