Discrete non-parametric kernel estimation for global sensitivity analysis

This work investigates the discrete kernel approach for evaluating the contribution of the variance of discrete input variables to the variance of model output, via analysis of variance (ANOVA) decomposition. Until recently only the continuous kernel approach has been applied as a metamodeling approach within sensitivity analysis framework, for both discrete and continuous input variables. Now the discrete kernel estimation is known to be suitable for smoothing discrete functions. We present a discrete non-parametric kernel estimator of ANOVA decomposition of a given model. An estimator of sensitivity indices is also presented with its asymtotic convergence rate. Some simulations on a test function analysis and a real case study from agricultural have shown that the discrete kernel approach outper-forms the continuous kernel one for evaluating the contribution of moderate or most in ﬂ uential discrete parameters to the model output.


Introduction
In the literature many works about reliability analysis approaches in general, and sensitivity analysis (SA) methods more specially, are related to different problems such as the important case of non-independent random inputs [6] and have various application domains such as maritime industry [19] or environment [14].In most cases, a mathematical modeling of the studied system is frequently revealed to be useful when the variations of input parameters in a model imply a large variability of the results with some impacts on their accuracy.In this context, the probabilistic way is of interest to encompass the variation in the input parameters of the model.SA methods are then useful to conduct such a study since they aim to evaluate how the variation of input parameters contributes to the variation of the output of a model.Particularly, works in SA have highlighted the encountered interesting aspect concerning the evaluation of the influence of discrete (categorical or ordinal) inputs.Indeed, in system reliability studies, several models involving in various engineering contexts have input discrete variables.And, one of the reliability engineering issues is to accurately evaluate the influence of such parameters.
Amongst various SA approaches, let us consider a well-known method based on the analysis of variance (ANOVA) decomposition of model f for quantifying the influence of input X i;i ¼ 1;2;…;k A T on the output Y A R. That method consists of the calculation of sensitivity indices given by [18] such that The measure of first order S i evaluates the contribution of the variation of X i to the total variance of Y, the measure of second order S ij evaluates the contribution of the interaction of X i and X j on the output, and so on.Various statistical tools as splines, generalized linear or additive model, polynomial are useful in a metamodeling approach for providing an estimation of conditional expectation Eð Y j X i Þ and, consequently, of the main effect sensitivity measure S i [4].In the framework of the non-parametric smoothing, some methods as the continuous kernel-based estimation [16] or the State-Dependent Parameter estimation [13] are good choices for estimating EðY j X i Þ.About the two estimation methods, [15,20] are respectively one of the original references of nonparametric and state-dependent parameter estimates.Nowadays [11] have shown that continuous kernel estimation is equal or better than the SDP estimation in terms of performance.However until recently in the literature the continuous kernel estimation is evenly applied on continuous input variables as on discrete ones while discrete kernel estimation suitable for discrete functions is now known [7].The discrete associated kernel method was developed for smoothing discrete functions as probability mass functions (pmf) or count regression functions on a discrete support T such as T ¼ N, the set of positive integers, or T ¼ Z, the set of integers.For a fixed target x on discrete support T and a smoothing parameter h 4 0, this method is based on the definition of the discrete associated kernel K x;h ðÁÞ which is a pmf of random variable (rv) K x;h with support S x satisfying These three assumptions, fulfilled by both continuous and discrete kernels, insure good asymptotic properties for the corresponding kernel estimator [10].Thus, for ða; xÞ A N Â T and h 40, an example of discrete associated kernel is the discrete symmetric triangular one with rv K a;x;h on support S x ¼ fx À a; …; x À 1; x; x þ1; … ; x þ ag with a pmf given by with Pða; hÞ From the discrete kernel methodology, a discrete non-parametric estimator of EðY j X i Þ was proposed by [1] adapted from the continuous version of [12,22] as follows: with the arbitrary sequence of smoothing parameters h ¼ hðnÞ 4 0 fulfilling lim n-1 hðnÞ ¼ 0 and K x;h ðÁÞ a discrete associated kernel as defined previously.
In this paper the non-parametric regression estimator b m n using a discrete symmeric triangular kernel is investigated as a novel approach in SA methods for providing estimated sensitivity indices for discrete input variables X i .Thus, the discrete kernel estimation approach is studied as a contribution to reliability analysis for model with discrete input parameters.To illustrate the performance of discrete kernel approach in comparison to continuous kernel approach, some simulations are realized using Ishigami test function and an application is proposed on a real case from agricultural.That latter concerns the evaluation of the influence of some parameters on the environmental impacts generated during the Hemp Crop production by farmers [2].

Non-parametric discrete triangular regression
This section presents first a review of the non-parametric univariate regression estimator using symmetric discrete triangular kernel with the asymptotic expansion of its global squared error as presented by [3].Herein, the optimal convergence rate of the discrete triangular regression estimator is added.
Assume that ðX 1 ; Y 1 Þ; ðX 2 ; Y 2 Þ; …; ðX n ; Y n Þ are n independent copies of (X,Y) defined on Tð D ZÞ Â R. We are interested in the non-parametric regression model where mðÁÞ ¼ EðY j X ¼ ÁÞ is an unknown regression function and the random covariate X is independent of the unobservable error variable ϵ's assumed to have zero mean and finite variance.For a A N, a fixed point x A T and a smoothing parameter h 4 0, let us consider the discrete non-parametric estimator b m n of m defined in (2) using a discrete triangular symmetric kernel such that b m n ða; x; hÞ ¼ First, about some asymptotic properties of estimator b m n ða; x; hÞ in (2), the asymptotic part of its mean integrated squared error MISE [21] defined by MISEf b m n ðx; a; hÞg ¼ X This last expression is obtained by calculating asymptotic bias and variance of b m n ðx; a; hÞ in (2) using the following expansions of the modal probability and variance of the discrete symmetric triangular kernel: Finally, we get the following inequality: where AMISE f b m n ðx; a; h opt Þg tends to 0 as n-1.Thus, for a A N, the optimal asymptotic root MISE of estimator b m n with kernel Note that the discrete kernel estimation and the resulting asymptotic expansions of estimator's bias and variance depend on two preconditions: discrete random variable and smooth hypothesis.For x A T, a discrete associated kernel satisfying assumptions (A1)-(A3) has asymptotically the same behavior that a Dirac type kernel D x ðyÞ; yA S x , such that D x ðyÞ ¼ 1 at y¼x and 0 for any ya x.That explains also the good asymptotic properties of the corresponding estimator.

Non-parametric kernel estimator for sensitivity analysis
This section aims at building the estimator of ANOVA decomposition of the model where each term is defined by Non-parametric kernel estimation of such model originates in the work of [11] for continuous case.The multidimensional version of non-parametric regression estimator b m n is presented for the calculation of Sobol indices when measuring the contribution of two or more variables to the variance of Y.

Multivariate non-parametric regression
Let us consider x ¼ ðx 1 ; x 2 ; …; where the multivariate associated kernel is defined as a product of univariate associated kernel K ½i x i ;h ii with its corresponding rv K ½i x i ;h ii on support S x i ;h ii , for all i ¼ 1; 2; …; d.Therefore, according to assumptions (A1), (A2) and (A3) for univariate associated kernel, the multivariate associated kernel of support where K x;H denotes the rv with pmf K x;H and both aðx; HÞ ¼ ða 1 ðx; HÞ; …; a d ðx; HÞÞ > and Bðx; HÞ ¼ ðb ij ðx; HÞÞ i;j ¼ 1;…;d tend, respectively, to null vector 0 and null matrix 0 d as H-0 d [17].
For a ¼ ða 1 ; a 2 ; …; x i ;h ii ðÁÞ with an optimal bandwidth matrix H opt such as Remark 2. The asymptotic convergence rates Oðn À 1=2 Þ in univariate case and Oðn À 1=ðd þ 1Þ Þ in multivariate case only hold for discrete random variables which satisfied proper smooth hypothesis.The previous convergence rates do not hold for continuous random variables where the asymptotic root MISE of nonparametric regression estimator is Oðn À 2=5 Þ in univariate case and Oðn À 2=ð4 þ dÞ Þ in multivariate case [11].
The data-driven bandwidth matrix selection procedure is an extension of univariate cross-validation criterion to multivariate case called least squared cross-validation criterion (LSCV).Thus, the optimal bandwidth matrix is obtained by  5) by excluding X k and H is a set of bandwidth matrices H.

Kernel estimator of ANOVA decomposition
From Eq. ( 4), the estimator of f 0 can be obtained by which is the arithmetic average of Y k ; k ¼ 1; 2; …; n.The terms of first order f i in the decomposition of the response model equation in (4) are estimated by In the same way, the terms of second order f ij in (4) are estimated as follows: where K x;H ðÁÞ is the multivariate associated kernel defined in Eq. ( 5) as a product of univariate associated kernel K ½i x i ;h ii ðÁÞ and K ½j x j ;h jj ðÁÞ.Now we can obtain the estimated terms of the variance decomposition.From Eq. (3) we first express the decomposition of the total variance of model output Y such as where each variance term is given by We then get the estimated variance terms by It finally ensues the calculation of the estimated Sobol indices such as for main effect sensisitivity indices in Eq. ( 1) we get

Simulations on Ishigami test function
In this section we propose to evaluate the application of both discrete and continuous kernel estimation procedures to a test function.We used the terms in the ANOVA decomposition calculated as follows.In the discrete case is the discrete uniform distribution with I A the indicator function of any given event A that takes the value 1 if the event A occurs and 0 otherwise.Then, the variance terms in decomposition result in
The kernel estimator b m n in (2) using discrete symmetric triangular kernel is applied in comparison to the continuous version of b m n using the Gaussian kernel on support S x ¼ R given by In practice we will use the parameter value a¼ 1 for discrete symmetric triangular kernel since it was proved to be the best in terms of performance [8].
The ANOVA decomposition of Y for Ishigami function is given by Based on the decomposition of model output Y in ( 3) and ( 6), we have We successively obtain For interaction term between different parameters, we have It results in the following decomposition of the variance of Y by using the expressions in (7), Finally, the main effect sensitivity indices in (1) are S 1 ¼ 0:42, S 2 ¼ 0:19, S 3 ¼ 0 and S 13 ¼ 0:26 when considering I ¼5 and J ¼0.1.Note that these values are obviously some approximations of the main effect sensitivity measures of the continuous versions of the same variables defined on ½Àπ; π.Thus, in the continuous case, we have S 1 ¼ 0:40, S 2 ¼ 0:29, S 3 ¼ 0 and S 13 ¼ 0:31.
Fig. 1 illustrates the discrete kernel regression applied for estimating the univariate conditional moment EðY j X i Þ with discrete inputs X i .One can see that the inputs X 1 and X 2 have a main effect on Y while X 3 has a null main effect with a globally flat pattern of EðY j X 3 Þ.

Results
The discrete triangular symmetric regression estimator and the continuous Gaussian kernel regression estimator are applied.In addition for first order indices, a comparison is realized by using symmetric discrete triangular kernel with modified parameter a 0 A N such that for x A T ¼ fÀ3; À 2; À1; 0; 1; 2; 3g ( This modification was proposed by [9] as a solution to boundary bias effect.
To evaluate the performance of estimators, we use a MC strategy: (i) the random generation of a number N¼100 samples ðX 1 ; X 2 ; X 3 Þ of size n A f250; 500; 750; 1000g, (ii) for each sample, the software calculation of the average first order Sobol indices S i ¼ ð1=NÞ P N j ¼ 1 S ðjÞ i ; i ¼ 1; 2; 3; and their confidence interval by considering the 5% and 95% percentiles.
The error criterion used is the mean absolute error (MAE) defined as where ŜðjÞ i is the j-th adjustment of main effect sensitivity indice S i .At last, note that for both discrete and continuous kernel estimators, the bandwidth choice is realized using cross-validation (CV) procedure defined as follows.For a given discrete kernel K x;h with x A T and h 4 0, the CV procedure is useful for finding an optimal bandwidth h cv ¼ arg min h 4 0 CV ðhÞ minimizing the function h↦ CVðhÞ such that The leave-one-out kernel estimator b m n; À i ðX i ; hÞ of b m n ðx; hÞ is calculated by excluding X i such as The score function CV is an estimator asymptotically unbiased of the term depending on parameter h 4 0 in the mean integrated squared error of estimator b m n ðx; hÞ.Looking at the results presented in Table 1, the discrete triangular symmetric kernel estimator is competing to the continuous Gaussian kernel estimator in terms of MAE for evaluating the main effect of parameters X 1 and X 2 .In contrast the main effect of parameter X 3 close to 0 is better estimated by the continuous kernel estimator than the discrete kernel estimator.Further, Table 2 shows some results regarding the sensitivity indices for interaction terms between different parameters.Similar to Table 1 the discrete triangular kernel estimator outperforms the continuous Gaussian one and both estimators detect the strong interaction between parameters X 1 and X 3 .Finally, as the sample size n increases, the errors provided by the two estimators converge monotonically towards 0 except for the interaction term X 1 X 3 due to the number of simulations N¼ 100.

Potential boundary bias for estimated indices S i
The boundary bias of kernel estimator occurs when there is large probability mass close to the boundary.The accuracy of the estimated sensitivity indices did not seem to be improved by applying discrete kernel estimator with symmetric triangular kernel a 0 ¼ 1 used for solving boundary bias.Looking at Table 1, the values of criterion MAE i were globally closed by using discrete symmetric triangular kernels with a ¼ 1 and a 0 ¼ 1.However, the sensitivity indices values were over-estimated using modified parameter a 0 ¼ 1 while they were under-estimated using parameter a ¼ 1 in comparison with the analytical values, in particular for the main effect S 1 .Finally, the potential impact of boundary bias was not clearly perceptible but this may be worth exploring, especially regarding at a largest sample size n Z 1000.

Effects of simulation number N
For estimating first order Sobol indices, in what follows we are interested in the influence of the number of simulation by increasing N from 100 to 200.Figs. 2 and 3 reports the comparison of MAEðS i Þ; i ¼ 1; 2; 3 , for N¼100 and 200, respectively, and sample sizes n A f100; 200; …; 2000g.In Fig. 2 corresponding to N¼100, the behavior of the curves of criterion MAE i did not show clearly which estimator provided the better performance for approximating S 1 while discrete estimator outperformed the continuous for S 2 and continuous estimator is better for S 3 .Further, the error criterion MAE i for the three parameters was not monotonically decreasing as sample size n was increasing.By increasing the number of simulations N from 100 to 200 in Fig. 3, one can see more clearly than in the previous figure that the convergence rates of the discrete kernel estimator were globally faster than that of the continuous kernel estimator for both S 1 and S 2 but not for S 3 .Further, the error criterion MAE i was now monotonically decreasing regarding at parameter X 3 and only for discrete kernel estimator regarding at parameter X 1 but not at all regarding at X 2 .Thus the results would be better by increasing N 4 200.However, an optimal number of simulation N which would provide a monotonically decreasing error criterion for discrete and continuous kernel estimators must be found.For this study we will keep N¼100 in the following section for the illustration on the real case.
Note that about the computational effort, it is very little (around 17 s for one simulation with sample size n ¼ 1000) for both continuous and discrete kernels.Further, there is not any big difference between the two kernel approaches if the same number of samples are used.

Simulations on a real case study
The real case study aims at evaluating the input parameters of processes involving in the Hemp Crop production which mostly influence the outcomes of a Life cycle assessment (LCA).LCA is a tool which aims at assessing environmental impacts over all whole life cycle of a product, i.e. from the extraction of raw materials to end of life as well as recycling by including fabricating and utilization.The processes involving at each step of a life cycle system are described by some numerical or analytical models.The outcomes considered herein are environmental impacts on human health (human toxicity) and marine or terrestrial ecosystem (eutrophication, ecological toxicity).Three models are used in this study (fuel consumption model for agricultural operations, exhaust emissions models from engines used for agricultural operations, and direct field emissions models from the crop) for the Hemp Crop production corresponding to a total of 52 input parameters, amongst them some discrete parameters uniformly distributed on support N such as the release year of agricultural engine, the engine rated power or the clay content of the soil.Note that a parameter name allocation method is presented but it is a qualitative parameter with coded values 1 or 2 .For more details, the complete study is presented in [2].In this part, the first order Sobol indices of some discrete input parameters are only studied.For the second order Sobol indices, a complete work is in progress on this real case study which requires a multivariate estimation mixing continuous and discrete kernels.
Some direct MC simulations are used in comparison to kernel methods for estimating the conditional expectation EðY j XÞ where the response Y is an environmental impact indicator.Similarly to the previous part, a MC strategy is applied: to obtain similar results of Sobol indices, we use N¼ 100 replications of sample size n ¼1000 for kernel methods while we use N¼ 500 of sample size n ¼5000 for direct MC simulations.Table 3 presents the results of main effect sensitivity Sobol indices of the three most influential (discrete and continuous) input parameters obtained.Note that the discrete kernel method is applied only on discrete parameters.
In this case study, the estimated values of Sobol first order indices cannot be compared to theoretical values, we can just compare between them the Sobol indice values provided by the three methods.It appears that the three methods provide some results of same order.In particular, the discrete kernel estimation provides some values greater than continuous kernel estimation, except for the allocation method parameter which is not influential.Thus, the results provided by discrete kernel method are confirmed.

Concluding remarks
This work is interested in discrete kernel estimation approach as a novel approach in reliability analysis suitable when discrete input parameters involved in a model.It pursues various works in sensibility analysis framework on the application of (continuous) non-parametric kernel method for estimating sensitivity indices calculated from ANOVA decomposition.The studied discrete nonparametric kernel method is appropriate only for discrete or ordinal variables, not for real continuous cases.The discrete approach is proposed as a competing approach more suitable for discrete input parameters than continuous non-parametric kernel method.First, the theoretical asymptotical convergence rate of the discrete kernel estimator is better than that of the continuous kernel estimator.Then, the realized simulations point out that the discrete approach is faster than continuous one in the sense of average MAE for moderate or most influential input parameters.However, the discrete kernel approach seems to be limited when estimating the main influence of discrete input parameters having a weak contribution to the variance of the model output, in comparison to continuous approach which provides better estimation in this case.The boundary bias was treated in this work but may be something worth exploring in the future as well, to further verify and improve the estimation accuracy of the proposed approach.In addition, a minimum number of simulations depending on sample sizes needs to be found to insure that the error criterion used is monotonically decreasing.
The aspect of curse of dimensionality is not included in this work.Thus, some future prospects will be to investigate multivariate estimation mixing continuous and discrete kernels when evaluating the influence of both discrete (count and categorical) and continuous input variables.
k an estimate computed as b m d n in Eq. (

Fig. 2 .Fig. 3 .
Fig. 2. Comparison of MAE for N ¼100 repetitions on Sobol indices for input parameters ðX 1 ; X 2 ; X 3 Þ of Ishigami test function by using non-parametric discrete kernel estimations (green line) in comparison with non-parametric continuous kernel estimation (blue line).(For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

Table 1
Average first order Sobol indices S i calculated by discrete and continuous kernel estimations applied to Ishigami test function.

Table 2
Average second order Sobol indices S ij calculated by discrete and continuous kernel estimations applied to Ishigami test function.

Table 3
Average first order Sobol indices S i calculated by kernel estimations and Monte Carlo simulation applied Hemp Crop production system, parameters having integer positive values are written in bold.