Discrete semiparametric regression models with associated kernel and applications

This work is concerned with a semiparametric associated kernel estimator for count explanatory variables. The proposed semiparametric estimator is a multiplicative combination between a parametric model and a discrete nonparametric kernel estimator of Nadaraya–Watson type. In this semiparametric approach, the parametric model plays the role of the start function and the nonparametric kernel estimator is a correction factor of the parametric estimate. Some asymptotic properties of the discrete semiparametric kernel regression estimator are pointed out; in particular, we show its asymptotic normality and the order of the optimal bandwidth. The parametric part is illustrated by some nonlinear and generalised linear models; for the nonparametric estimator, we apply the discrete general triangular associated kernel providing bias reduction. The usefulness of the discrete semiparametric kernel regression estimator is shown on three practical examples in comparison with logistic, generalised linear and additive models.


Introduction
Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be the observations of the variables $(X, Y)$ in $S \times \mathbb{R}$ connected through the model
$$y_i = m(x_i) + \varepsilon_i, \quad i = 1, \ldots, n,$$
where $m : S \to \mathbb{R}$ is an unknown regression function to be estimated and $\varepsilon_i$ is the residual, assumed to be a real random variable with mean $E(\varepsilon_i) = 0$ and variance $\mathrm{Var}(\varepsilon_i) = \sigma^2 < \infty$.
To estimate the conditional mean function $m$, several approaches are available; one can consider the classical parametric regression models, or some nonparametric techniques such as generalised additive models (GAMs, see Hastie and Tibshirani 1990) or local polynomials. Here, we are concerned with some multiplicative or additive procedures of nonparametric kernel and parametric [...] where $K_{x,h}$ is the discrete random variable of p.m.f. $K_{x,h}(\cdot)$. In addition, the finite differences $g^{(k)}$, $k \in \mathbb{N} \setminus \{0\}$, of any count function $g : \mathbb{N} \to \mathbb{R}$ are used instead of the usual differentiation on $\mathbb{R}$: the first-order finite difference (Equation (1)), from which the finite difference of second order (Equation (2)) may be derived.
Within the semiparametric context, let us consider $m$ as a discrete weighted parametric regression function given by
$$m(x) = l(x; \theta)\,\omega(x), \qquad (3)$$
where $l(x; \theta)$ is a nonrandom function relative to the parameter $\theta$ and $x \mapsto \omega(x)$ is a positive nonparametric weight function. The discrete semiparametric associated kernel regression estimator results from a parametric estimation $\hat{l}(x) \equiv l(x; \hat{\theta})$ of $l$ multiplied by a nonparametric Nadaraya–Watson estimation $\hat{\omega}_n$ of $\omega$ as follows:
$$\hat{\omega}_n(x) = \sum_{i=1}^{n} \frac{y_i}{\hat{l}(x_i)} \, \frac{K_{x,h}(x_i)}{\sum_{j=1}^{n} K_{x,h}(x_j)},$$
where $h = h(n) > 0$ is an arbitrary sequence of smoothing parameters that fulfils $\lim_{n \to \infty} h(n) = 0$ and $K_{x,h}(\cdot)$ is a suitably chosen discrete associated kernel function. Then, the discrete semiparametric estimator of $m$ in Equation (3) is given by
$$\hat{m}_n(x) = \hat{l}(x) \times \hat{\omega}_n(x). \qquad (4)$$
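The multiplicative construction described above can be sketched in a few lines of Python (not the language used in the paper). The names `semiparametric_estimate`, `toy_kernel` and `start`, as well as the toy weights and data, are assumptions made purely for illustration; the weight function stands in for any discrete associated kernel $K_{x,h}$.

```python
def semiparametric_estimate(x, xs, ys, start, kernel):
    """Parametric start times a Nadaraya-Watson smoother of the ratios y_i / start(x_i)."""
    weights = [kernel(x, xi) for xi in xs]
    total = sum(weights)
    if total == 0:
        raise ValueError("no observation receives positive kernel weight at x")
    # Nonparametric correction factor: weighted mean of observed / parametric ratios.
    omega_hat = sum(w * yi / start(xi) for w, xi, yi in zip(weights, xs, ys)) / total
    return start(x) * omega_hat


def toy_kernel(x, y, h=0.5):
    """Hypothetical weights: 1 at the target point, h on its direct neighbours, 0 elsewhere."""
    d = abs(y - x)
    if d == 0:
        return 1.0
    return h if d == 1 else 0.0


def start(x):
    """Assumed positive parametric start (stands in for a fitted model)."""
    return 2.0 + 0.5 * x


xs = list(range(10))

# Case 1: the data follow the start exactly, so the correction factor is 1 everywhere.
ys_exact = [start(x) for x in xs]
fit_exact = [semiparametric_estimate(x, xs, ys_exact, start, toy_kernel) for x in xs]

# Case 2: one observation departs from the start; only its neighbourhood is corrected.
ys_bumped = list(ys_exact)
ys_bumped[5] += 1.0
fit_bumped = [semiparametric_estimate(x, xs, ys_bumped, start, toy_kernel) for x in xs]

print(fit_exact[5], fit_bumped[5], ys_bumped[5])
```

When the start already matches the data, the estimator returns the start unchanged; when an observation departs from it, the fitted value moves part of the way towards that observation, which is the role of the correction factor.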
Concerning the parametric model, the smoothness of the function $l(x; t)$ with respect to $t$ is required, and the estimator $\hat{\theta}$ of $\theta$ is obtained, for example, by the generalised least-squares method. In the situation where the parametric function $l(x; \theta)$ is mis-specified, the estimator $\hat{\theta}$ converges in probability to a certain value $\theta_0$ such that $l(x; \theta_0) \equiv l_0(x)$ is the best approximant to $m(x)$ with respect to the Kullback–Leibler distance of $l(x; \theta)$ from the true function $m(x)$; see Abdous et al. (2010) and references therein for more details.
In this work, we establish the asymptotic normality of the discrete semiparametric kernel estimator. We use some parametric (logistic and generalised linear) models as start functions and a discrete associated kernel that provides bias reduction. The usefulness of the constructed discrete semiparametric regression model is illustrated on three practical data sets of agriculture, economy and agronomy in comparison with classical parametric regression models. The first example concerns the study of average daily fat (kg/day) yields from the milk of a single cow for each of the 35 first weeks denoted x i (Kokonendji, Senga Kiessé and Demétrio 2009b). The quantity of fat in the milk increases during the first 14 weeks and decreases thereafter. The fitted curve comes from a generalised linear model (GLM): it is a normal model with a logarithmic link (McCullagh and Nelder 1989). This model does not fit well to data. In particular, it does not detect the plateau associated with observations x = 19, 20, . . . , 27 (Figure 1). We will compare these results with those obtained by using our semiparametric model and GAM.
The second example given in Table 1 is a sales data set with multiple $y_i$ at a given $x_i$ (Kokonendji et al. 2009b). We analyse the amount of daily sales of a new product during the first 24 days. The 151 observations $(x_i, y_i)$, $i = 1, \ldots, 24$, represent the day $x_i$ and the corresponding mean of sales numbers $y_i \in \{y_{Ai}, y_{Bi}, \ldots, y_{Hi}\}$. The number of sale centres for each state (A, B, ..., H) is not available except for the state H, where this number is equal to one. We apply the GLM and GAM in comparison with the semiparametric model for fitting the sales data (Figure 2).
The third example deals with volume data from a forest beech tree (Table 2) provided by the French national research agency project 'EMERGE' (Compatible volume/biomass and nutrient content equations for fuelwood and forest resource; tools for sustainable and clear management; Rivoire et al. 2010). On the stem of this tree, from the base (ca. 53 cm in diameter) to the tip (0 cm), 15 measurements have been taken with a diameter tape. Cumulative stem volumes, denoted $y$, have been calculated for every possible diameter $x \in \{0, 1, \ldots, 53\}$ (cm) based on cone frustum volumes. More exactly, at the base of the tree, where the diameter is close to 53 cm, the cumulative volume is 0, whereas at the tip of the tree, the diameter is close to zero and the cumulative volume is the total stem volume. We apply the GAM, the semiparametric model and the parametric logistic model, since the tree data distribution has a sigmoidal form (Figure 3).
This motivates the recommendation of a discrete model that focuses on ordinal covariates and has the same nature. Hence, the nonparametric correction in all three examples is available only for discrete predictors, even if the parametric models treat predictors as continuous variables. Through these three applications, we point out that the discrete semiparametric associated kernel approach may produce better explanations of real data, with both a satisfying amount of smoothing and goodness of fit. The remainder of this paper is organised as follows. Section 2 is concerned with the bias, variance and asymptotic normality of the discrete semiparametric kernel regression estimator. Section 3 presents the results of the three applications; the optimal order of the bandwidth is also shown there under some assumptions for the discrete associated kernels used. Finally, Section 4 presents the concluding remarks.

Asymptotic properties
This section is concerned with the usual asymptotic results for the discrete semiparametric associated kernel regression estimator $\hat{m}_n$ in Equation (4). In particular, we demonstrate its asymptotic normality; one can refer to Martins-Filho et al. (2008) for the asymptotic normality of the semiparametric estimator proposed by Glad (1998).
We state the bias and variance of $\hat{m}_n$ shown in Abdous et al. (2010). For $x \in \mathbb{N}$, let $l_0(x)$ be a fixed parametric start in Equation (3). Under assumptions A1 and A2, the discrete semiparametric estimator $\hat{m}_n$ in Equation (4) admits the bias and variance stated in Equations (5) and (6), where $f > 0$ is the p.m.f. of the regressor $X$ and $f^{(1)}$, $m^{(1)}$ and $m^{(2)}$ are the finite differences of $f$ and $m$ as given in Equations (1) and (2). Hence, the consistency of the discrete semiparametric estimator $\hat{m}_n$ in Equation (4) is obtained through the asymptotic behaviour of its mean-squared error (MSE). Indeed, under assumptions A1 and A2, the asymptotic expansions of the bias in Equation (5) and of the variance in Equation (6) tend to 0 as $h = h(n) \to 0$ and $n \to \infty$, since we assume $\mathrm{Var}(K_{x,h}) = O(n^{-1/2})$. This assumption will be developed at the end of this section. For the asymptotic normality, we need to recall the Lyapounov central limit theorem for triangular arrays (Wesolowski 1994).
then S n = X n,1 + · · · + X n,k n converges in distribution to the normal law with the mean zero and the variance 2 The notation ' d − →' stands for convergence in distribution. Now, we are able to formulate the following theorem. Theorem 2.2 For any fixed x ∈ N, under assumptions A1 and A2, the semiparametric estimator m n (x) converges in distribution to the normal law as follows: Proof For x ∈ N and h > 0, let us consider the semiparametric estimatorm n in Equation (4) and the sequencef n (x) = (1/n) n j =1 K x,h (X j ). Using the discrete Taylor expansion ofl(x)/l(X i ) around l 0 (x)/ l 0 (X i ), we havê are of order o p (h 2 ), and it ensues the following equalities: For calculating the expectation of Equation (7), we begin with the first term A n . Under assumptions A1 and A2 and using the discrete Taylor expansion such that The expectations of the second and third terms B n and C n in Equation (7) are given by

It results in $E[\{\hat{m}_n(x) - m(x)\} \times \hat{f}_n(x)] = E\{A_n(x; h)\} + o(h^2)$.
Then, for the variance of Equation (7), we have with σ 2 = Var( i ) < ∞. This result is essentially due to the second term in Equation (8) given by which is a sum of i.i.d. random variables; thus, we have E{A 1n (x; h)} = 0 and, under assumptions A1 and A2, T. Senga Kiessé and M. Rivoire tends to 0 as n → ∞ and h = h(n) → 0. Indeed, let y ∈ S x \ {x}, we can find a constant η = η(y) > 0 such that and for y = x, we deduce the asymptotic modal probability Pr(K x,h = x) → 1 when h → 0. The other terms in the variance of Equation (7) provide the order o(h 2 ); we omit to detail here all these calculations. Rather, by applying the Lyapounov central limit theorem on A 1n , we have . Finally, by considering the convergence off n to f states by Abdous and Kokonendji (2009) with μ ≡ μ(x; h, n) and 2 ≡ 2 (x; h). Hence, the desired result is obtained.
Remark 1 As a result, our estimator achieves the $O(n^{-1/2})$ convergence rate; in addition, one can assume $\mathrm{Var}(K_{x,h}) = O(n^{-1/2})$ and replace assumption A2 with this. A more thorough treatment of the optimal order of the bandwidth $h$ assuming $\mathrm{Var}(K_{x,h}) = O(n^{-1/2})$ will be presented in Section 3.2 for the discrete associated kernels applied in this work.

Applications
This section presents the illustrations on the average daily fat data, the sales data and the cumulative stem volume data. The data are fitted by the logistic model and the GLM with parameter $\theta = (\theta_1, \theta_2, \theta_3)$ in comparison to the GAM and the semiparametric model using general discrete triangular associated kernels. The measure of error used is the root mean square error (RMSE), computed from the $n$ observations $y_j$ and their adjustments $\hat{y}_j$, $j = 1, \ldots, n$. In the following, we first present the parametric models (logistic and GLM) used as start functions for the discrete semiparametric model.
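For reference, the usual root mean square error can be computed as below. This is a sketch of the standard definition only: the tables report the RMSE in %, and the scaling used to obtain a percentage is not stated in the text, so none is applied here. The function name `rmse` and the toy numbers are illustrative assumptions.

```python
import math


def rmse(observed, fitted):
    """Standard root mean square error between observations and their adjustments."""
    if len(observed) != len(fitted) or not observed:
        raise ValueError("observed and fitted must be non-empty and of equal length")
    n = len(observed)
    return math.sqrt(sum((y - y_hat) ** 2 for y, y_hat in zip(observed, fitted)) / n)


# Toy values: a single residual of size 2 among three observations.
example = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(example)
```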

Parametric models
The GLM represents a normal model for the response variable $Y_i$ with a logarithmic link. It has a linear predictor based on a combination of explanatory variables. The nonlinear model corresponds to a logistic one for the situation of population growth towards a limited value. It is given by
$$y_i = \frac{\theta_1^L}{1 + \exp\{(\theta_2^L - x_i)/\theta_3^L\}}.$$
The fixed effect parameter $\theta_1^L$ is the asymptote towards which the population grows. The parameter $\theta_2^L$ is the midpoint and corresponds to the time at which $y_i = \theta_1^L/2$. The parameter $\theta_3^L$ is the scale and represents the distance on the time axis between the midpoint and the point where the response is $\theta_1^L/(1 + e^{-1})$.
Then, let us present an example of the discrete associated kernel constructed from a new discrete probability distribution introduced by Kokonendji and Zocchi (2010). It is a generalisation of the symmetric discrete triangular distributions (Kokonendji et al. 2007). We show the optimal order of the bandwidth parameter $h$ such that $\mathrm{Var}(K_{x,h}) = O(n^{-1/2})$ for these discrete associated kernels (Remark 1).
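The interpretation of the three logistic parameters (asymptote, midpoint, scale) can be checked numerically. The sketch below assumes the usual three-parameter logistic curve that matches this description; the function name `logistic_start` and the parameter values are hypothetical and are not estimates from the paper's data.

```python
import math


def logistic_start(x, asym, mid, scale):
    """Three-parameter logistic curve: asymptote asym, midpoint mid, scale scale."""
    return asym / (1.0 + math.exp((mid - x) / scale))


# Hypothetical parameter values, chosen only to illustrate the interpretation.
asym, mid, scale = 2.0, 30.0, 5.0

at_mid = logistic_start(mid, asym, mid, scale)            # half of the asymptote
at_scale = logistic_start(mid + scale, asym, mid, scale)  # asym / (1 + e^{-1})
print(at_mid, at_scale)
```

A negative scale gives a decreasing sigmoid, which is the shape one would expect for a cumulative volume that falls as the diameter grows.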

Discrete associated kernel
Let $a_1$ and $a_2$ be fixed integers and $h_1$ and $h_2$ be the smoothing parameters. For any fixed $x \in \mathbb{Z}$, consider the random variable $DT_{x;a_1,a_2,h_1,h_2}$ defined on the supports $S_{a_1,x} = \{x - 1, x - 2, \ldots, x - a_1\}$ and $S_{x,a_2} = \{x, x + 1, \ldots, x + a_2\}$, whose p.m.f. is defined up to a normalising constant (Kokonendji and Zocchi 2010). The mean $E(DT_{x;a_1,a_2,h_1,h_2})$ and the variance are then obtained from this p.m.f.; the variance is
$$\mathrm{Var}(DT_{x;a_1,a_2,h_1,h_2}) = W(a_1, a_2, h_1, h_2) - [V(a_1, a_2, h_1, h_2)]^2.$$
Note that an R package for general discrete triangular distributions is available (Senga Kiessé, Libengué, Zocchi and Kokonendji 2010).
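To make the two-sided construction concrete, here is a small Python sketch of a general discrete triangular p.m.f. on the supports given above. The functional form used (weights $1 - (k/(a_1+1))^{h_1}$ on the left arm and $1 - (k/(a_2+1))^{h_2}$ on the right arm, then normalised) is an assumption recalled from the literature on these distributions, not a formula taken from this text, and the function names are hypothetical. It is not the R package mentioned above.

```python
def general_triangular_pmf(x, a1, a2, h1, h2):
    """Return {y: Pr(DT = y)} on {x - a1, ..., x + a2} under the assumed weight form."""
    weights = {}
    for k in range(1, a1 + 1):        # left arm: x - 1, ..., x - a1
        weights[x - k] = 1.0 - (k / (a1 + 1)) ** h1
    for k in range(0, a2 + 1):        # target x and right arm: x, ..., x + a2
        weights[x + k] = 1.0 - (k / (a2 + 1)) ** h2
    norm = sum(weights.values())      # normalising constant
    return {y: w / norm for y, w in weights.items()}


def mean_and_variance(pmf):
    """First moment and variance computed directly from a p.m.f. dictionary."""
    mean = sum(y * p for y, p in pmf.items())
    second = sum(y * y * p for y, p in pmf.items())
    return mean, second - mean ** 2


x = 10
pmf_small = general_triangular_pmf(x, a1=3, a2=1, h1=0.1, h2=0.1)
pmf_large = general_triangular_pmf(x, a1=3, a2=1, h1=3.0, h2=3.0)
mean_small, var_small = mean_and_variance(pmf_small)
mean_large, var_large = mean_and_variance(pmf_large)
print(sorted(pmf_small), var_small, var_large)
```

With a longer left arm than right arm, the mean sits slightly below the target $x$, and the variance shrinks as the bandwidths decrease, so small bandwidths concentrate the mass on $x$.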
First, to show the optimal order of the bandwidth assuming $\mathrm{Var}(DT_{x;a,h}) = O(n^{-1/2})$, we consider the symmetric discrete triangular associated kernels $K_{x;a,h}$ with one arm $a = a_1 = a_2$ and one smoothing parameter $h = h_1 = h_2$. For $h$ sufficiently small and $a \in \mathbb{N}$ fixed, an expansion in $h$ is available, and it yields the leading term of order $O(n^{-1/2})$ of the bias in Equation (5). Hence, the bandwidth $h$ is of optimal order $O(n^{-1/2})$. This result can be generalised to the bandwidths $h_i$, $i = 1, 2$, for the discrete triangular associated kernels $K_{x;a_1,a_2,h_1,h_2}$.
Then, for the bandwidth selection, one can directly use the optimal values of $(h_1, h_2)$ that minimise the integrated squared error (ISE)
$$\mathrm{ISE}(h_1, h_2) = \sum_{x \in \mathbb{N}} \{\hat{m}_n(x) - m_0(x)\}^2,$$
where $m_0$ is the observed value. Another method would be to minimise the integrated MSE of the proposed discrete semiparametric regression estimator; thus, the bandwidth selection might be realised by a cross-validation score function. One can refer to Chiu (1991) for kernel density estimation and to Kokonendji, Senga Kiessé and Balakrishnan (2009a) for semiparametric kernel estimation of a p.m.f. Here, we do not investigate these different approaches and just propose some small and large values of $h_1$ and $h_2$, equal to 0.1 and 3.0, to point out the influence of both bandwidth parameters on goodness of fit, degree of smoothing and boundary bias. To reduce this bias, we fix one bandwidth parameter and vary the other; in this way, we have an influence on both smoothing and fitting. Another possibility is to transform the arms to reduce bias, as proposed by Kokonendji and Zocchi (2010). However, we do not exclude other discrete kernels, such as the binomial or Poisson kernels proposed in density estimation, because of their advantage for small or moderate sample sizes (Senga Kiessé 2009), even if this advantage no longer holds for regression.
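The link between the condition on the kernel variance and the order of $h$ can be explored numerically. The sketch below assumes the symmetric discrete triangular p.m.f. with weights proportional to $(a+1)^h - |y - x|^h$ on $\{x - a, \ldots, x + a\}$; this form is an assumption recalled from the literature, and the function name is hypothetical. It only illustrates that the variance behaves roughly linearly in $h$ for small $h$, which is what makes a variance of order $n^{-1/2}$ correspond to a bandwidth of the same order; it is not the expansion used in the paper.

```python
def symmetric_triangular_variance(a, h):
    """Variance of the assumed symmetric discrete triangular kernel with arm a, bandwidth h."""
    # Unnormalised weights for displacements k = -a, ..., a around the target.
    weights = {k: (a + 1) ** h - abs(k) ** h for k in range(-a, a + 1)}
    norm = sum(weights.values())
    # The distribution is symmetric about the target, so the variance is E[k^2].
    return sum(k * k * w for k, w in weights.items()) / norm


a = 3
bandwidths = [0.004, 0.002, 0.001]
variances = [symmetric_triangular_variance(a, h) for h in bandwidths]
# Halving h should roughly halve the variance when h is small.
ratios = [variances[i] / variances[i + 1] for i in range(len(variances) - 1)]
print(variances, ratios)
```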
Finally, both arms $a_1$ and $a_2$ are small in practice, equal to 1, 2 or 3. Therefore, in what follows, we consider the general discrete triangular distributions with $a_1 = 3$ and $a_2 = 1$. We recommend these discrete distributions as associated kernels for our proposed estimator because of the advantages provided by the two smoothing parameters $(h_1, h_2)$.
Figure 1 indicates the difference between the discrete semiparametric general triangular kernel model with $(a_1, a_2) = (3, 1)$ and the GAM, on the one hand, and the GLM, on the other hand. Indeed, the first two detect the plateau associated with the observations $x = 19, 20, \ldots, 27$, while the third does not. The results given in Table 3 show that the best discrete semiparametric adjustment is obtained using bandwidth parameters $h_1 = h_2 = 0.1$ (giving the smallest RMSE); however, there is a lack of smoothing. The value of the RMSE increases and the degree of smoothing is improved when the values of $h_1$ and $h_2$ increase to 3.0. Then, we fix one of the bandwidth parameters and change the other. For $h_1 = 0.1$ fixed and $h_2$ varying to 3.0, the RMSE increases, but we keep a good estimation at the right boundary $x = 35$ with a satisfying amount of smoothing, which is improved in comparison with the case $h_1 = h_2 = 0.1$. For $h_2 = 0.1$ fixed and $h_1$ varying to 3.0, we obtain a similar result by keeping a good adjustment on the left boundary $x = 1$.
Table 3. RMSE (in %) calculated from the GLM, GAM and discrete semiparametric model with general triangular associated kernels on average daily fat data.

Sales data
Table 4 and Figure 2 present the results corresponding to sales data. Similar to the previous example, the values $h_1 = h_2 = 0.1$ for the discrete semiparametric general triangular model with $(a_1, a_2) = (3, 1)$ give the smallest RMSE but not the most satisfying amount of smoothing. In comparison to the parametric model, both a satisfying degree of smoothing and a satisfying fit are obtained with $h_1 = h_2 = 3.0$. Furthermore, in general, the logistic model seems to underestimate the $y$-values, contrary to the semiparametric associated kernel model and the GAM.
In the two previous applications, the discrete semiparametric estimator using the triangular associated kernel with $(a_1, a_2) = (3, 1)$ and $(h_1, h_2) = (0.1, 3.0)$ and the GAM are close in terms of goodness of fit and smoothing. Concerning the semiparametric model with the discrete general triangular kernel $(a_1, a_2) = (3, 1)$, the use of smoothing parameters $h_1 = h_2 = 3.0$ provides the most interesting results, considering the desired compromise between good smoothing and good fitting. Thus, relatively large bandwidths are recommended when smoothing is the concern, in spite of the fact that the optimal bandwidth is of order $O(n^{-1/2})$; a bandwidth of this optimal order would be recommended when goodness of fit is the concern. For the last example, we directly apply the semiparametric model using these values of parameters with a logistic model as the start function.
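The compromise between goodness of fit and smoothing discussed here can be reproduced on synthetic data. The sketch below is not the paper's computation: the data are an artificial zigzag around a linear trend, the start function is taken as known, the kernel weights follow an assumed general triangular form with $(a_1, a_2) = (3, 1)$, and the roughness measure (sum of squared second differences) is a hypothetical choice made for illustration.

```python
import math


def dt_weight(x, y, a1, a2, h1, h2):
    """Unnormalised general triangular weight of y for target x (assumed form)."""
    d = y - x
    if -a1 <= d < 0:
        return 1.0 - (-d / (a1 + 1)) ** h1
    if 0 <= d <= a2:
        return 1.0 - (d / (a2 + 1)) ** h2
    return 0.0


def fit(xs, ys, start, h1, h2, a1=3, a2=1):
    """Semiparametric fit: start(x) times a kernel-weighted mean of y_i / start(x_i)."""
    fitted = []
    for x in xs:
        w = [dt_weight(x, xi, a1, a2, h1, h2) for xi in xs]
        # The normalising constant of the kernel cancels in this ratio.
        omega = sum(wi * yi / start(xi) for wi, xi, yi in zip(w, xs, ys)) / sum(w)
        fitted.append(start(x) * omega)
    return fitted


def rmse(ys, fitted):
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(ys, fitted)) / len(ys))


def roughness(values):
    """Sum of squared second differences: larger means a less smooth curve."""
    return sum((values[i + 1] - 2 * values[i] + values[i - 1]) ** 2
               for i in range(1, len(values) - 1))


def start(x):
    return 10.0 + 0.5 * x  # assumed parametric start


xs = list(range(1, 21))
ys = [start(x) + (0.8 if x % 2 == 0 else -0.8) for x in xs]  # synthetic zigzag data

results = {}
for h in (0.1, 3.0):
    fitted = fit(xs, ys, start, h, h)
    results[h] = (rmse(ys, fitted), roughness(fitted))
    print(h, results[h])
```

On this toy example the small bandwidth follows the observations closely (lower RMSE, rougher curve), while the large bandwidth smooths more at the cost of a larger RMSE, in line with the behaviour reported for the two data sets.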

Tree data
Here, the performance of the discrete semiparametric logistic kernel regression model $\hat{m}_n$ is illustrated on a tree data set (Table 2) whose distribution has a sigmoidal form, in comparison with the purely logistic model and the GAM. In Figure 3, the fitted curve for the logistic model does not describe the variations of the distribution well; it results in an error RMSE = 5.378%. The semiparametric logistic model indicates that the use of the discrete general triangular kernel with bandwidth parameters $h_1 = h_2 = 3.0$ and arms $(a_1, a_2) = (3, 1)$ provides a better amount of smoothing and a better adjustment to the data. Thus, in contrast to the parametric model, the capacity of the semiparametric estimator to detect the variations of the distribution to be estimated is clearly shown through the role of the nonparametric correction factor $\omega(x)$, $x = 0, 1, \ldots, 55$ (RMSE = 1.341%). In addition, the semiparametric associated kernel model gives good estimations at the left and right boundary points. The semiparametric model used is similar to the GAM (RMSE = 1.757%) in terms of performance, and thus the corresponding fitted points are close to the observations in Figure 3; however, the GAM does not adjust well at the right boundary.

Concluding remarks
This paper has investigated discrete semiparametric kernel estimators with logistic and normal models as the known start functions. The general discrete triangular associated kernel with left and right bandwidth parameters was used, which provided control on both the goodness of fit and the degree of smoothing. Thus, the constructed semiparametric estimators outperformed the parametric models in the three examples given. They yielded satisfying adjustments, amounts of smoothing and reductions of boundary bias, which are often required. Several parametric models may be used as start functions in the semiparametric procedure, such as nonlinear Gompertz or log-linear models for count data. Similarly, other discrete associated kernels, such as the binomial kernel, may be useful. Finally, the introduction of count explanatory variables and an optimal choice of bandwidth parameters may be studied for the discrete semiparametric associated kernel estimator.