Parsimonious Gaussian Process Models for the Classification of Hyperspectral Remote Sensing Images

A family of parsimonious Gaussian process models for classification is proposed in this letter. A subspace assumption is used to build these models in the kernel feature space. By constraining some parameters of the models to be common between classes, parsimony is controlled. Experimental results are given for three real hyperspectral data sets, and comparisons are done with three other classifiers. The proposed models show good results in terms of classification accuracy and processing time.


I. INTRODUCTION
Thanks to the development of different Earth observation missions, the availability of hyperspectral images with high spatial resolution has increased over the last decade.The fine spectral resolution improves the discrimination of more materials while the high spatial resolution allows the analysis of small structures in the image.Such remote sensing images provide valuable information about landscapes over large areas, on a regular temporal basis and at a relatively low cost.This detailed information is then used in various thematic applications, such as ecological science, urban planing, hydrological science or military applications [1], [2], [3].
One commonly used technique to extract information from remote sensing images is classification [4].It consists in assigning a label, or a thematic class, to each pixel of the image.Several methods have been developed for images with moderate spectral resolution [5].However, because of the increasing number of spectral variables in hyperspectral remote sensing images, their classification has become a more and more challenging problem [6].For instance, modelbased classification approaches try to fit the class conditional probability distribution by an ad-hoc parametric model, e.g., Gaussian distribution [4].However, the estimation of the parameters is difficult when the spectral dimension of the data increases [7], or leads to intractable processing time when non Gaussian models are used.In addition, high spatial resolution images cannot be statistically modeled easily, because spatial details of the image lead to model each class as a mixture of Mathieu Fauvel is with the UMR 1201 DYNAFOR INRA & Institut National Polytechnique de Toulouse, e-mail: mathieu.fauvel@ensat.fr.
distributions [8].Alternatively, non-parametric methods such as random forest (RF) classifier have been investigated for the classification of hyperspectral images [9], [10].However, with the increasing size of the images in the spectral domain, RF requires a sufficient number of training samples to build uncorrelated trees and thus to reach good classification accuracy.This construction is difficult with a reduced sample set and a high number of variables.Unfortunately, the collection of training samples can be a difficult task in remote sensing, resulting in a small training set.
Since the introductory paper in 2005 [11], kernel methods have shown very good abilities in classifying hyperspectral remote sensing images [12].The use of a kernel function that defines a measure of similarity between two pixel-vectors, makes them robust to the spectral dimension or the non-Gaussianity of the data.The learning step usually involves a constrained optimization problem, where few hyperparameters have to be optimized.Regularization of the decision function is in general included in the optimization problem, making kernel methods robust to the small sample size problem.The support vector machine (SVM) is the most used kernel classifier among the available kernel methods.From its original formulation [13], several methods have been proposed, ranging from the spatial-spectral classification [6] to the semi-supervised classification [12], successfully applied in various domains.The learning step of SVM consists in estimating a separating hyperplane in the kernel feature space, i.e., a linear decision function.In general, most of kernel methods solve a linear problem in the kernel feature space, see for instance kernel Fisher's discriminant analysis or kernel principal component analysis in [12].
Mixing Bayes decision rule and kernel function, M. M. Dundar and A. Landgrebe proposed a kernel quadratic discriminant classifier (KDC) for the analysis of hyperspectral images [14].It was a first attempt to build a kernel classifier from a quadratic classifier (Gaussian Mixture Model).In order to make the problem tractable, they assumed covariance matrices of all classes to be equal in the kernel feature space.This assumption was proposed by the authors to deal with illconditioned kernel matrices.Indeed, unlike the SVM optimization process, no regularization is included in the computation of the decision function with KDC.This function is based on computing the inverse of the centered kernel matrix, which is per construction non-invertible: this n × n matrix is estimated with the original kernel matrix, from which a combination of rows and lines is removed.
Since then, techniques have been proposed to extend KDC to non-equal covariance matrices.Pseudo inverse and ridge regularization have been proposed in [15].Xu et al. also proposed a KDC where the estimation of the covariance matrix in the feature space is regularized by dropping the smallest eigenvalues from the computation [16].Similar techniques were used in [17] in the context of small sample size problems.
Although good results in terms of classification accuracy have been reported for the different KDC, several drawbacks can be identified.For instance, [16] and [17] require the estimation of a large number of hyperparameters, while the "equal covariance matrix" assumption in [14] might be too restrictive in practical situations.
In this paper, a family of parsimonious Gaussian process models is reviewed and 5 additional models are proposed to provide more flexibility to the classifier in the context of hyperspectral image analysis.These models allow to build from a finite set of training samples, a Gaussian mixture model in the kernel feature space, where each covariance matrix is free.They assume that the data of each class live in a specific subspace of the kernel feature space.This assumption reduces the number of parameters needed to estimate the decision function and makes the numerical inversion tractable.A closed-form expression is given for the optimal parameters of the decision function.This work extends the models initially proposed in [18], [19].In particular, the common noise assumption is relaxed, leading to a new set of models for which the level of noise is specific to each class.Furthermore, a closed-form expression for the estimation of the parameters enables a fast estimation of the hyperparameters during the cross-validation step.The contributions of this letter are threefold.1) The definition of new parsimonious models.2) A comparison in terms of classification accuracy and processing time of the proposed models with state-of-the-art classifiers of hyperspectral images.3) A fast cross-validation strategy for learning optimal hyperparameters.
The remainder of the letter is organized as follows.Section II presents the family of parsimonious Gaussian process models as well as the 5 new models.Section III focuses on experimental results obtained on three real hyperspectral data sets.Finally, conclusions and perspectives are discussed in Section IV.

II. CLASSIFICATION WITH PARSIMONIOUS GAUSSIAN PROCESS MODELS
In this section, it is shown how a Gaussian mixture model (GMM) can be computed in the feature space.It makes use of Gaussian processes as conditional distributions of a latent process.Then estimators are derived and the proposed models are compared to those available in the literature.
A. Gaussian process in the kernel feature space Let S = (x i , y i ) n i=1 be a set of training samples, where x i ∈ R d , is a pixel and y i ∈ {1, . . ., C} its class, and C the number of classes.In this work, the Gaussian kernel function is  Common between groups Free Common of another kernel is possible).Its associated feature space F is infinite dimensional.Therefore the conventional multivariate normal distribution used in GMM cannot be defined in F.
To overcome this, let us assume that φ(x), conditionally on y = c, is a Gaussian process on J ⊂ R with mean µ c and covariance function Σ c .We note φ(x) cj the projection of φ(x) on the eigenfunction q cj using the following scalar product φ(x), q cj = J φ(x)(t)q cj (t)dt.
Hence, for all r ≥ 1, random vectors on R r defined by [φ(x) 1 , . . ., φ(x) r ] are, conditionally on y = c, multivariate normal vectors.In R r , it is now possible to use the GMM decision rule for class c [20]: where λ cj is the j th eigenvalue of Σ c sorted in decreasing order, q cj its associated eigenfunction and π c the prior probability of class c.The classification of If the Gaussian process is not degenerated (i.e., λ cj = 0, ∀j), r has to be large to get a good approximation of the Gaussian process.Unfortunately, only a part of the above equation can be computed from a finite training sample set:

B. Parsimonious Gaussian process
To make the above computational problem tractable, it is proposed to use a parsimonious Gaussian process model in the feature space for each class.These models assume that each class is located in a low-dimensional subspace of the kernel feature space.In [18], it was assumed the noise level is common to all classes (Definition 1).In this paper, these models are extended to the situation where the noise level can be dependent on the class (Definition 2).
Definition 1 (Parsimonious Gaussian process with common noise): A parsimonious Gaussian process with common noise is a Gaussian process φ(x) for which, conditionally to y = c, the eigen-decomposition of its covariance operator Σ c is such that A1.It exists a dimension r < +∞ such that λ cj = 0 for j ≥ r and for all c = 1, . . ., C. A2.It exists a dimension p c < min(r, n c ) such that λ cj = λ for p c < j < r and for all c = 1, . . ., C.
Definition 2 (Parsimonious Gaussian process with class specific noise): A parsimonious Gaussian process with class specific noise is a Gaussian process φ(x) for which, conditionally to y = c, the eigen-decomposition of its covariance operator Σ c is such that A3.It exists a dimension r c < r such that λ cj = 0 for all j > r c and for all c = 1, . . ., C. When r = +∞, it is assumed that r c = n c − 1. A4.It exists a dimension p c < r c such that λ cj = λ c for j > p c and j ≤ r c , and for all c = 1, . . ., C.
Assumptions A1 and A3 are motivated by the quick decay of the eigenvalues for a Gaussian kernel [21].Hence, it is possible to find r < +∞ (or r c ) such as λ cr ≈ 0. Assumptions A2 and A4 express that the data of each class live in a specific subspace of size p c , the signal subspace, of the feature space.The variance in the signal subspace for class c is modeled by parameters λ c1 , . . ., λ cpc and the variance in the noise subspace is modeled by λ or λ c .This model is referred to by pGP 0 for the common-noise assumption or npGP 0 for the class-specific noise assumption.
From these models, it is possible to derive several submodels.Table I lists the different sub-models that can be built from pGP 0 and npGP 0 .For models pGP 1 and npGP 1 , it is additionally assumed that data of each class share the same intrinsic dimension, i.e., p c = p, ∀c ∈ {1, . . ., C}.In models pGP 2 and npGP 2 , variance in the signal subspace F c is assumed to be equal for all eigenvectors, i.e., λ cj = λ c , ∀j ∈ {1, . . ., p c }.For models pGP 4 and npGP 4 , it is assumed that the intrinsic dimension is common to every class and the variance is common between them, i.e., λ cj = λ c j , ∀j ∈ {1, . . ., p} and c, c ∈ {1, . . ., C}.In term of parsimony, pGP 0 is the least parsimonious model while pGP 6 is the most parsimonious one for models with common noise.pGP 0 is also more parsimonious than npGP 0 .A visual illustration of pPGP 1 in R 2 is shown in Fig. 1.
In the following, only model npGP 0 is discussed.Similar results can be obtained for other models and a discussion of model pGP 0 can be found in [18], [19].Proposition 1: Eq. ( 1) can be written for npGP 0 as Computation of eq. ( 2) is now possible since p c < n c , ∀c ∈ {1, . . ., C}.In the following, it is shown that the estimation of parameters and the computation of eq. ( 2) can be done using only kernel evaluations, as in standard kernel methods.

C. Model inference
Centered Gaussian kernel function according to class c is defined as kc (x i , With these notations, the following results hold for npGP 0 .Proposition 2: For c = 1, . . ., C and under model npGP 0 , eq. ( 2) can be computed with: where Estimation of p c is done by looking at the cumulative variance for sub-models pGP 0,2,5 .In practice, p c is estimated such as the percentage of the cumulative variance is higher than a given threshold t.For other sub-models, p is a fixed parameter given by user.Finally, all parameters can be inferred from K c .

D. Link with existing models
The associated of equal covariance matrices in [14] corresponds to our model pGP 4 with an additional equality constraint on the eigenfunction.By using parsimonious Gaussian process, we are able to provide more flexibility in the feature space by allowing covariance matrix of each class to be different, and for the 5 new models, by allowing the noise in each class to be different.Furthermore, the storage complexity of [14] is O(n 2 ), since it works on the full kernel matrix, while the storage complexity of our models is O(n 2 c ), usually very much lower.Furthermore, the eigendecomposition of the kernel matrix is of complexity O(n 3 ) for [14], while it is reduced to O(n 3 c ) with our models.Authors of [14] also implement a ridge regularization to stabilize the generalized eigenvalue problem.Numerically, it is equivalent to set small eigenvalues to a constant term, which is similar to A2 and A4 in Definition 1 and Definition 2. In the same way, KDC models proposed in [15] use ridge regularization, but they are constructed for each class, i.e, each class has a separate covariance matrix.The main difference between ridge regularization and our models is that, with ridge regularization, eigenvectors corresponding to very small values are still computed and used in the decision function while our models only use the p c first ones.Note that KDC was also extended to indefinite kernel functions in [15] by using Moore-Penrose inverse, i.e., eigenvalue thresholding.
Covariance regularization techniques proposed in [22] were used in [16] and [17].In addition to kernel hyperparameter, two additional hyperparameters have to be tuned.In practice, even for moderate size problem it can be very time consuming.Our models only have two hyperparameters to tune, and one can be computed for a moderate numerical cost, as it is explained in II-E.
The model proposed in [16] is similar to npGP 0 .The authors proposed to estimate the p first eigenvalues for each class and set the noise term to the value of the (p + 1) th eigenvalue.In our model, the noise term is estimated as the mean value of the remaining eigenvalues.Additional flexibility is provided by our models since the size of the signal subspace can be class dependent.

E. Estimation of the hyperparameters
For each proposed model, there are two hyperparameters to tune: the scale γ of the Gaussian kernel and the size p c of F c or the percentage of cumulative variance t.In this work, the v-fold cross-validation (CV) strategy is employed.For the last two parameters, it is possible to use a strategy that speedup the computing time.The most demanding part in terms of processing time of the proposed models is the eigendecomposition of K c .But for a given value of γ and a given fold of the CV, it is possible to compute the eigendecomposition of K c only once.From the decomposition, all the model parameters for every values of p c or t are available at not cost since they are derived from the eigenvectors and eigenvalues of K c .It allows efficient computation of the CV error estimate for pairs of hyperparameters.This fast computation of the CV error is possible because the model parameters are obtained through an explicit formulation, contrary to SVM for which a optimization procedure is required and need to be restarted when a new set of hyperparameters is tested.

A. Data sets and benchmarking methods
Three hyperspectral data sets have been used in these experiments.
University of Pavia: The data set has been acquired by the ROSIS sensor during a flight campaign over Pavia, northern Italy.103 spectral channels were recorded from 430 to 860 nm. 9 classes have been defined for a total of 42,776 referenced pixels.
Kennedy Space Center: The data set has been acquired by the AVIRIS sensor during a flight campaign over the Kennedy Space Center, Florida USA.224 spectral channels were recorded from 400 to 2500 nm.Because of water absorption the final data set contains 176 spectral bands.13 classes have been defined for a total of 4,561 referenced pixels.
Heves: The data set has been acquired by the AISA Eagle sensor during a flight campaign over Heves, Hungary.It contains 252 bands ranging from 395 to 975 nm.16 classes have been defined for a total of 360,953 referenced pixels.
For each data set, 50 training pixels per class were randomly selected and the remaining referenced pixels were used for validation.20 repetitions were done for which a new training set have been generated randomly.The range of each spectral variable has been stretched between 0 and 1.
Reported results are the average Kappa coefficient and the average processing time in seconds (including selection of hyperparameters, training process and prediction process).In order to test the statistical significance of the observed differences, a Wilcoxon rank-sum test has been computed between each pair of methods.
For comparison, SVM and RF classifiers have been tested using the Scikit-learn Python package [23].Furthermore, the KDC of [14] has been implemented.The parsimonious models have been implemented in Python, and codes can be download here: https://github.com/mfauvel/PGPDA.All hyperparameters of each method have been selected using a 5-fold cross validation.

B. Discussion
Results are reported in Table II.In terms of accuracy, one of the proposed parsimonious models performs the best for each data set.SVM performs the best for two data sets and KDC performs the best for one data set.In particular, for Heves data set, (n)pGP 1 provides the best results.For University of Pavia and Kennedy Spectral Center data sets, SVM provides the best results but the differences with (n)pGP 1 are not statistically significant.
RF usually provides lower accuracy.However, RF is the fastest algorithm, by far.KDC performs the worst in terms of processing time, because each class involves a n × n kernel matrix.Parsimonious models perform on average as fast as SVM and are much faster than KDC.
For the proposed models, best results are obtained by (n)pPG 1 , which are the least parsimonious models.They have free variance inside the signal subspace but common dimension of subspaces.(n)pPG 0 performs slightly worse than SVM and KDC but better than RF.All the other models are less accurate.There is no difference in terms of processing time between parsimonious models, since they all rely on the eigendecomposition of K c .

IV. CONCLUSIONS
Parsimonious Gaussian process models have been proposed in this letter.They allow the computation of a kernel quadratic discriminant classifier with limited training samples.The main assumption considered in this work is that relevant information for the discrimination task is located in a smaller subspace of the kernel feature space.Sub-models are derived by constraining some properties of the models to be common between classes, thus enforcing the parsimony.Moreover, new models have been discussed for which the noise level is specific to the class.
The proposed models have been compared in terms of classification accuracy and processing time with three other classifiers, on three real hyperspectral data sets.Results show that two proposed models ((n)pPG 1 ) are very effective both in terms of accuracy and in terms of processing time.However, other proposed models do not provide as good classification accuracy, and would not be appropriate for practical situations.From our experimental results, pPG 1 and npPG 1 behave similarly in terms of computation time or classification accuracy.
The comparison with KDC shows that (n)pPG 1 models are competitive, with better classification accuracy and smaller processing time.They offer a good alternative to the conventional KDC method.
and its associated mapping function is denoted φ : R d → F (the use xi) − µ c , qcj 2 λcj + ln(λcj) non computable quantity where r c = min(n c , r) and n c is the number of training samples of class c.Consequently, the decision rule cannot be computed in the feature space if r > n c , for c = 1, . . ., C.

2 Fig. 1 .
Fig. 1.Visual illustration of model npGP 1 .Dimension of Fc is common to both classes, they have specific variance inside Fc and they have specific noise level.

TABLE I LIST
OF SUB-MODELS OF THE PARSIMONIOUS GAUSSIAN PROCESS MODEL.

TABLE II EXPERIMENTAL
RESULTS IN TERMS OF ACCURACY AND PROCESSING TIME.THE VALUES REPORTED CORRESPOND TO THE AVERAGE RESULTS OBTAINED ON 20 REPETITIONS.BOLDFACE RESULTS INDICATE BEST RESULTS FOR A GIVEN DATA SET.MULTIPLE BOLDFACE RESULTS INDICATE THAT DIFFERENCES BETWEEN THEM ARE NOT STATISTICALLY SIGNIFICANT.