Discrete triangular associated kernel and bandwidth choices in semiparametric estimation for count data

This work deals with the semiparametric kernel estimation of probability mass functions that are assumed to be modified Poisson distributions. The semiparametric approach is based on the discrete associated kernel method, which is appropriate for modelling count data; in particular, the well-known discrete symmetric triangular kernels are used. Two data-driven bandwidth selection procedures are investigated, and an explicit expression of the optimal bandwidth, not available until now, is provided. Moreover, some asymptotic properties of the cross-validation criterion adapted to discrete semiparametric kernel estimation are studied. Finally, to measure the performance of the semiparametric estimator for each type of bandwidth parameter, applications are carried out on three real count data-sets from sociology and biology.


Introduction
Until recently, the traditional approach to estimating a count data distribution has been essentially parametric. This approach classically starts from a structured count distribution such as the Poisson model; however, the resulting estimate is not always adequate and it becomes necessary to modify the initial distribution. In this work, the estimation approach assumes that any count random variable $X$ having a probability mass function (p.m.f.) $f(x) = \Pr(X = x) > 0$ on the support $\mathbb{N}$ can be written as a modified Poisson distribution:

$$f(x) = p(x;\theta)\,\omega(x), \quad x \in \mathbb{N}, \qquad (1)$$

where $p(x;\theta) = \theta^{x}\exp(-\theta)/x!$ is the p.m.f. of a Poisson distribution with mean parameter $\theta > 0$, and $\omega(x) > 0$ is a nonparametric function playing the role of a correction factor. Equation (1) is an assumption related to the works on weighted Poisson distributions (WPDs), which are investigated as alternatives to the parametric Poisson model classically applied to count data; WPDs allow one to take into account counting phenomena such as over/underdispersion or zero-inflation/deflation [1]. Thus, the advantage of Equation (1) is that it expresses the deviation from the classical Poisson model and thereby intrinsically accounts for the special features of count data mentioned above. Let $X_1, X_2, \ldots, X_n$ be a sample of independent observations with an unknown count distribution $f$ given in Equation (1).

*Corresponding author. Email: tristan.senga@nancy.inra.fr, sengatristan@yahoo.fr
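A discrete semiparametric estimator of $f$ is proposed in [2] as a parametric estimation $\hat p(x) = p(x;\hat\theta)$ of $p$ followed by a nonparametric kernel estimation $\hat\omega_n(x)$ of $\omega(x) = f(x)/p(x)$ given by

$$\hat\omega_n(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{K_{x,h}(X_i)}{\hat p(X_i)},$$

such that we have

$$\hat f_n(x) = \hat p(x)\,\hat\omega_n(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{\hat p(x)}{\hat p(X_i)}\,K_{x,h}(X_i), \quad x \in \mathbb{N}. \qquad (2)$$

The estimator $\hat\theta = n^{-1}\sum_{i=1}^{n} X_i$ is the sample mean, which is the maximum likelihood estimator of the Poisson mean $\theta$; the bandwidth $h = h(n) > 0$ is an arbitrary sequence of smoothing parameters that fulfils $\lim_{n\to\infty} h(n) = 0$; and the discrete associated kernel $K_{x,h}(\cdot) = \Pr(K_{x,h} = \cdot)$ of the random variable $K_{x,h}$ is a p.m.f. with support $S_x$ satisfying the hypotheses $H_1$ and $H_2$. Note that the continuous version of $\hat f_n$ can be found in [3].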
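The two-step construction above (Poisson start, then kernel correction) can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes the estimator has the form $\hat f_n(x) = n^{-1}\sum_i \{\hat p(x)/\hat p(X_i)\}K_{x,h}(X_i)$, it uses the discrete symmetric triangular kernel of Section 2, and the function names and the sample are hypothetical. It also assumes a strictly positive sample mean.

```python
import math


def poisson_pmf(x, theta):
    """Poisson p.m.f. p(x; theta) = theta^x exp(-theta) / x!, for theta > 0."""
    return math.exp(x * math.log(theta) - theta - math.lgamma(x + 1))


def triangular_kernel(y, x, h, a):
    """Discrete symmetric triangular kernel Pr(K_{a;x,h} = y)."""
    if abs(y - x) > a:
        return 0.0
    norm = (2 * a + 1) * (a + 1) ** h - 2 * sum(k ** h for k in range(1, a + 1))
    return ((a + 1) ** h - abs(y - x) ** h) / norm


def semiparametric_estimate(x, sample, h, a):
    """Assumed form: (1/n) sum_i [p_hat(x) / p_hat(X_i)] K_{x,h}(X_i)."""
    n = len(sample)
    theta_hat = sum(sample) / n  # sample mean, the Poisson maximum likelihood estimate
    weights = (triangular_kernel(xi, x, h, a) / poisson_pmf(xi, theta_hat)
               for xi in sample)
    return poisson_pmf(x, theta_hat) * sum(weights) / n


sample = [0, 1, 1, 2, 2, 2, 3, 3, 4, 6]  # made-up counts, for illustration only
estimates = {x: semiparametric_estimate(x, sample, h=0.1, a=1) for x in range(9)}
```

As $h \to 0$ the kernel concentrates on $x$, so the estimate approaches the empirical frequency of $x$ in the sample, whatever the Poisson start.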
This paper continues the work on the estimator $\hat f_n$ in Equation (2), using the well-known example of discrete symmetric triangular associated kernels introduced by Kokonendji et al. [4]. Under some assumptions, a result on the pointwise consistency of $\hat f_n$ is formulated, followed by a proposition on the global consistency of $\hat f_n$ using discrete triangular kernels. Concerning the smoothing parameter $h > 0$, some data-driven bandwidth selection procedures are investigated. The well-known cross-validation procedure, which consists in minimizing a score function, has so far only been applied to $\hat f_n$; here, we study this function further by establishing asymptotic properties such as its bias and variance, which leads to another mathematical result. In addition, for the bandwidth choice, the minimization of an approximate global-squared error of $\hat f_n$ is investigated; an explicit expression of the optimal bandwidth, not available until now for $\hat f_n$, is thus obtained by using discrete triangular kernels. For the choice of bandwidth parameters in continuous nonparametric kernel estimation, one can refer to [5][6][7]. Let us also remark that a Bayesian local approach is developed by Zougab et al. [8,9] for bandwidth selection in discrete nonparametric associated kernel estimation of a p.m.f. Finally, concerning count data, the problem of their semiparametric regression is treated by Abdous et al. [10].
We illustrate our investigations with three real count data-sets. The first data-set, used earlier by Kokonendji et al. [2, Table 4, p. 12], comes from a sociological experiment concerning the number of days per week on which alcohol was consumed. The second application concerns count data characterizing the development of the spiralling whitefly, an insect pest of plants, collected in the Republic of Congo-Brazzaville [11]. This insect causes damage such as sucking the sap, decreasing photosynthetic activity and drying up the leaves, and Congolese biologists are searching for a suitable model of the data related to its growth. The third application uses wood cell count data from time series: it concerns the annual wood formation dynamics of two silver fir trees. Indeed, the study of wood formation has become an innovative and fast-growing field in plant sciences over the last decade, since wood is a major component of the biosphere and plays a key role in ecosystem functioning, representing for example one of the strongest sinks of CO2, which is a major contributor to climate change [12]. Modelling wood development therefore involves crucial issues and has become an important problem in plant sciences.
The remainder of this paper is organized as follows. Section 2 gives details about discrete triangular associated kernels, with some expansions of their modal probability and variance. Some results on the consistency of the semiparametric estimator $\hat f_n$ in Equation (2) of the p.m.f. $f$ are also given. Moreover, the choice of optimal $h$-values is studied for $\hat f_n$ according to the two procedures mentioned previously; a result on the asymptotic bias and variance of the cross-validation criterion is established. Then, Section 3 presents applications to count data-sets from sociology and biology; in particular, the bootstrap method is applied to the insect pest data for a robust evaluation of the proposed bandwidth choices. Finally, Section 4 contains some concluding remarks. The proofs of the mathematical results are postponed to the appendix.

Semiparametric kernel estimator
This section first presents a class of discrete symmetric kernels [4,13]. Then, the two optimal bandwidth choices for the estimator $\hat f_n$ in Equation (2) are presented.

Discrete symmetric triangular associated kernels
Let $a$ be a fixed integer and $h > 0$ be a smoothing parameter. For any fixed $x \in \mathbb{N}$, consider the random variable (r.v.) $K_{a;x,h}$ of the discrete symmetric triangular associated kernel, defined on the support $S_{a,x} = \{x, x \pm 1, \ldots, x \pm a\}$, whose p.m.f. is given by

$$\Pr(K_{a;x,h} = y) = \frac{(a+1)^{h} - |y-x|^{h}}{P(a,h)}, \quad y \in S_{a,x},$$

where $P(a,h) = (2a+1)(a+1)^{h} - 2\sum_{k=1}^{a} k^{h}$ is the normalizing constant. The parameter $a$ controls the number of observations falling in the set $S_{a,x}$, while $h$ is directly the smoothing parameter. This class of kernels satisfies assumptions $H_1$-$H_2$, which implies $\Pr(K_{a;x,h} = x) \to 1$ and $\Pr(K_{a;x,h} = y) \to 0$ for $y \in S_{a,x} \setminus \{x\}$, as $h \to 0$. However, these assumptions remain quite general and do not allow further investigations such as providing an explicit expression of the optimal bandwidth or specifying the convergence of a kernel $K_{x,h}$ to $x$. To study the choice of the optimal bandwidth further, expansions of the modal probability and variance of the kernel $K_{a;x,h}$ are therefore provided in what follows. For $h$ sufficiently small, we have

$$\Pr(K_{a;x,h} = x) = 1 - 2h\,A(a) + O(h^{2}), \quad A(a) = a\log(a+1) - \sum_{k=1}^{a}\log k, \qquad (3)$$

and

$$\mathrm{Var}(K_{a;x,h}) = 2h\,V(a) + O(h^{2}), \quad V(a) = \frac{a(2a^{2}+3a+1)}{6}\log(a+1) - \sum_{k=1}^{a} k^{2}\log k. \qquad (4)$$

The asymptotic behaviours pointed out in Equations (3) and (4) allow the discrete symmetric triangular associated kernel to tend to the Dirac-type kernel $D_{x,h} \equiv D_x$ of the r.v. $D_{x,h}$ given by

$$D_{x,h}(y) = \begin{cases} 1 & \text{if } y = x, \\ 0 & \text{otherwise,} \end{cases}$$

for any $x \in \mathbb{N}$ and any $h \geq 0$, such that $S_x = \{x\}$, $\Pr(D_{x,h} = x) = 1$ and $\mathrm{Var}(D_{x,h}) = 0$. In fact, the expansions in Equations (3) and (4) are useful attempts to describe behaviours of the discrete triangular kernel that are more specific than $H_1$-$H_2$.
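The small-$h$ behaviour of the triangular kernel can be checked numerically. The sketch below is illustrative only: it assumes the p.m.f. $\{(a+1)^h - |y-x|^h\}/P(a,h)$ on $S_{a,x}$, and it compares the modal probability and the variance with the first-order coefficients $a\log(a+1) - \sum_k \log k$ and $\{a(a+1)(2a+1)/6\}\log(a+1) - \sum_k k^2\log k$ obtained from a Taylor expansion in $h$. The function names are hypothetical.

```python
import math


def triangular_pmf(y, x, h, a):
    """Pr(K_{a;x,h} = y) = [(a+1)^h - |y-x|^h] / P(a,h) on {x-a, ..., x+a}."""
    if abs(y - x) > a:
        return 0.0
    norm = (2 * a + 1) * (a + 1) ** h - 2 * sum(k ** h for k in range(1, a + 1))
    return ((a + 1) ** h - abs(y - x) ** h) / norm


def kernel_variance(x, h, a):
    """Variance of K_{a;x,h}, computed directly from the p.m.f."""
    support = range(x - a, x + a + 1)
    mean = sum(y * triangular_pmf(y, x, h, a) for y in support)
    return sum((y - mean) ** 2 * triangular_pmf(y, x, h, a) for y in support)


def modal_coefficient(a):
    """First-order coefficient of 1 - Pr(K = x), divided by 2h."""
    return a * math.log(a + 1) - sum(math.log(k) for k in range(1, a + 1))


def variance_coefficient(a):
    """First-order coefficient of Var(K), divided by 2h."""
    squares = a * (a + 1) * (2 * a + 1) / 6  # sum of k^2 for k = 1..a
    return squares * math.log(a + 1) - sum(k * k * math.log(k) for k in range(1, a + 1))


x, a = 5, 2
for h in (0.1, 0.01, 0.001):
    modal = triangular_pmf(x, x, h, a)
    # Both ratios should approach the two coefficients printed below as h shrinks.
    print(h, (1 - modal) / (2 * h), kernel_variance(x, h, a) / (2 * h))
print("limits:", modal_coefficient(a), variance_coefficient(a))
```

The case $a = 0$ reduces to a point mass at $x$, which matches the Dirac-type kernel.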
The second example concerns the discrete associated kernel deduced from [14], whose modal probability and variance can also be expressed as in Equations (3) and (4). The third example concerns a discrete kernel proposed by Wang and Van Ryzin [15] such that

$$\Pr(K_{x,h} = y) = \begin{cases} 1 - h & \text{if } y = x, \\ \tfrac{1}{2}(1-h)\,h^{|y-x|} & \text{if } y \in \mathbb{Z} \setminus \{x\}, \end{cases} \quad h \in (0,1).$$

The modal probability and variance of this kernel can be expressed as in Equations (3) and (4) with $A \equiv 1/2$ and $V \equiv (1+3h)/2$. Lastly, the expansions in Equations (3) and (4) are not available for the standard asymmetric discrete kernels constructed from the usual p.m.f.s (binomial, Poisson or negative binomial), which satisfy only assumption $H_1$ [16]. These kernels are therefore not considered in this work, since no explicit optimal bandwidth can be derived for them. However, the binomial kernel (which outperforms the other standard kernels) remains of interest for small or moderate sample sizes in comparison with discrete triangular kernels; note that discrete standard kernels do not tend asymptotically to the Dirac kernel, i.e. $\Pr(K_{x,h} = x) \nrightarrow 1$ as $h \to 0$, in contrast with discrete triangular kernels.
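The values $A \equiv 1/2$ and $V \equiv (1+3h)/2$ quoted for the Wang and Van Ryzin kernel can be checked by hand. The derivation below is a sketch under two working assumptions: that this kernel has its usual geometric form, with mass $1-h$ at $y = x$ and $\tfrac12(1-h)h^{|y-x|}$ elsewhere for $h \in (0,1)$, and that Equations (3) and (4) read $\Pr(K = x) = 1 - 2hA + O(h^2)$ and $\mathrm{Var}(K) = 2hV + O(h^2)$.

```latex
\begin{align*}
\Pr(K_{x,h} = x) &= 1 - h = 1 - 2h \cdot \tfrac{1}{2}
  && \Rightarrow\ A \equiv \tfrac{1}{2}, \\
\mathrm{Var}(K_{x,h}) &= 2 \sum_{k \ge 1} k^{2} \cdot \tfrac{1}{2}(1-h)\,h^{k}
  = (1-h)\,\frac{h(1+h)}{(1-h)^{3}}
  = \frac{h(1+h)}{(1-h)^{2}} \\
  &= h\,\bigl(1 + 3h + O(h^{2})\bigr)
  = 2h \cdot \frac{1+3h}{2} + O(h^{3})
  && \Rightarrow\ V \equiv \frac{1+3h}{2}.
\end{align*}
```

The variance computation uses the symmetry of the kernel about $x$ (so that its mean is $x$) and the identity $\sum_{k \ge 1} k^{2}h^{k} = h(1+h)/(1-h)^{3}$ for $|h| < 1$.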
Remark 2.1 It would be of interest to define discrete associated kernel such that

Next, we study a first data-driven bandwidth selection procedure using the expansions in Equations (3) and (4).

Global-squared error
Let $p_0(x) = p(x;\theta_0)$ be a fixed Poisson start p.m.f. in Equation (1), i.e. $f = p_0\,\omega$, such that $\hat\theta$ converges to $\theta_0$ in probability. We first formulate the following result on the pointwise consistency of $\hat f_n$; the proof is given in the appendix.
Deriving an explicit expression of the optimal bandwidth requires the global error of $\hat f_n$, the mean integrated squared error (MISE), defined as

$$\mathrm{MISE} = E\Big[\sum_{x \in \mathbb{N}} \{\hat f_n(x) - f(x)\}^{2}\Big]. \qquad (5)$$

By using the expansions in Equations (3) and (4), one obtains expansions of the pointwise bias and variance of the estimator $\hat f_n$, in which $\omega^{(2)}$ denotes the finite difference of second order of $\omega$ (see also [16]). The leading term of the resulting expansion of the MISE is an approximate MISE, denoted AMISE. The following proposition on the global consistency of $\hat f_{n,K_{a;x,h}}$ ensues.
Proposition 2.2 For $(a, x) \in \mathbb{N} \times \mathbb{N}$ and $h > 0$, let us consider the semiparametric estimator $\hat f_n$ using the discrete symmetric triangular kernel $K_{a;x,h}$. The MISE of $\hat f_n$ admits the expansion referred to as Equation (6), whose leading term is $\mathrm{AMISE}(a; n, h, f)$. The bandwidth $\hat h_{opt} = \arg\min_{h>0} \mathrm{AMISE}(a; n, h, f)$ is then obtained by solving the equation $d\{\mathrm{AMISE}(a; n, h, f)\}/dh = 0$, which yields Equation (7). Moreover, the following asymptotic relationship holds: $\hat h_{opt} \sim k_0\, n^{-1}$, where $k_0$ does not depend on $n$.

A comparison can be made with the discrete nonparametric kernel estimator $\tilde f_n$ of the p.m.f. $f$ proposed by Kokonendji and Kiessé [16] such that

$$\tilde f_n(x) = \frac{1}{n}\sum_{i=1}^{n} K_{x,h}(X_i), \quad x \in \mathbb{N}. \qquad (8)$$

The expression of $\tilde f_n$ can be deduced from that of $\hat f_n$ in Equation (2) by assuming $p \equiv 1$ in Equation (1), and thus $\hat p \equiv 1$ in Equation (2). The bias of $\tilde f_n$ therefore involves the finite difference $f^{(2)}$ instead of $p_0\,\omega^{(2)}$, for $x \in \mathbb{N}$, while its variance is identical to that of the semiparametric estimator $\hat f_n$. Hence, by calculating the MISE of $\tilde f_n$ as in Equation (5), one obtains the optimal bandwidth parameter $\tilde h_{opt}$ minimizing the corresponding AMISE, given in Equation (9). Finally, the comparison between $\hat h_{opt}$ in Equation (7) and $\tilde h_{opt}$ in Equation (9) depends on the finite differences $f^{(2)}$ and $p_0\,\omega^{(2)}$, since $f^{(2)} - p_0\,\omega^{(2)}$ can be either positive or negative. Note that the comparison between the estimators $\hat f_n$ and $\tilde f_n$, which have the same variance, is related to their respective biases; the leading term of the difference between these biases, which involves the factor $\mathrm{Var}(K_{a;x,h})$, can also be either positive or negative depending on the parametric start function $p_0$.
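The remark that the nonparametric estimator is the special case $p \equiv 1$ of the semiparametric one can be illustrated with a single function. This is a hedged sketch: `kernel_estimate` and its optional `start_pmf` argument are hypothetical names, the triangular kernel is assumed to have the p.m.f. $\{(a+1)^h - |y-x|^h\}/P(a,h)$, and the sample is made up.

```python
def triangular_kernel(y, x, h, a):
    """Discrete symmetric triangular kernel Pr(K_{a;x,h} = y)."""
    if abs(y - x) > a:
        return 0.0
    norm = (2 * a + 1) * (a + 1) ** h - 2 * sum(k ** h for k in range(1, a + 1))
    return ((a + 1) ** h - abs(y - x) ** h) / norm


def kernel_estimate(x, sample, h, a, start_pmf=None):
    """(1/n) sum_i [p(x) / p(X_i)] K_{x,h}(X_i); start_pmf=None means p = 1."""
    p = start_pmf if start_pmf is not None else (lambda _: 1.0)
    total = sum(triangular_kernel(xi, x, h, a) / p(xi) for xi in sample)
    return p(x) * total / len(sample)


sample = [3, 4, 4, 5, 7]  # made-up counts, all at least a = 2 away from the boundary
nonparametric = {x: kernel_estimate(x, sample, h=0.2, a=2) for x in range(11)}
total_mass = sum(nonparametric.values())
```

With this symmetric kernel and observations at least $a$ away from 0, the nonparametric estimate sums to one over $\mathbb{N}$; the semiparametric estimate with a non-constant start generally does not.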

Cross-validation function
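A second bandwidth value can be obtained by applying the cross-validation procedure, which consists in minimizing the cross-validation (CV) score function, i.e. $\hat h_{cv} = \arg\min_{h>0} CV(h)$, such that

$$CV(h) = A_n - 2B_n,$$

with

$$A_n = \sum_{x \in \mathbb{N}} \hat f_n^{2}(x), \qquad B_n = \frac{1}{n}\sum_{i=1}^{n} \hat f_{n,-i}(X_i),$$

where $\hat f_{n,-i}$ is computed as $\hat f_n$ by excluding $X_i$, and $\hat\theta_{0,-i}$ is computed as $\hat\theta_0$ by excluding $X_i$ [2]. Indeed, let us consider the first two terms, which depend on $h$, in the following expression of the MISE in Equation (5):

$$\mathrm{MISE} = E\Big\{\sum_{x \in \mathbb{N}} \hat f_n^{2}(x)\Big\} - 2E\Big\{\sum_{x \in \mathbb{N}} \hat f_n(x) f(x)\Big\} + \sum_{x \in \mathbb{N}} f^{2}(x).$$

The expression $A_n$ is an unbiased estimator of the first term $E\{\sum_{x \in \mathbb{N}} \hat f_n^{2}(x)\}$, and the expression $B_n$ is an estimator of the second term $E\{\sum_{x \in \mathbb{N}} \hat f_n(x) f(x)\}$. A similar procedure is presented in [16] for a bandwidth $\tilde h_{cv}$ of the nonparametric estimator $\tilde f_n$ of the p.m.f. $f$ in Equation (8). The mean and variance of CV for fixed $h > 0$ are provided in the following theorem under assumptions $H_1$-$H_2$.
where $K_{x,h}$ is a discrete associated kernel satisfying $H_1$-$H_2$.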
Proof See the appendix.
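The cross-validation procedure can be sketched as a grid search. The code below is an illustration under stated assumptions, not the authors' implementation: it takes the score to be the usual least-squares form $CV(h) = \sum_x \hat f_n^2(x) - (2/n)\sum_i \hat f_{n,-i}(X_i)$, recomputes the Poisson mean on each leave-one-out sample, uses the discrete triangular kernel, and searches a hypothetical finite grid of candidate bandwidths. The sample must be a list.

```python
import math


def poisson_pmf(x, theta):
    """Poisson p.m.f.; theta = 0 is treated as a point mass at 0."""
    if theta == 0:
        return 1.0 if x == 0 else 0.0
    return math.exp(x * math.log(theta) - theta - math.lgamma(x + 1))


def triangular_kernel(y, x, h, a):
    """Discrete symmetric triangular kernel Pr(K_{a;x,h} = y)."""
    if abs(y - x) > a:
        return 0.0
    norm = (2 * a + 1) * (a + 1) ** h - 2 * sum(k ** h for k in range(1, a + 1))
    return ((a + 1) ** h - abs(y - x) ** h) / norm


def semiparametric_estimate(x, sample, h, a):
    """Assumed form: (1/n) sum_i [p_hat(x) / p_hat(X_i)] K_{x,h}(X_i)."""
    n = len(sample)
    theta_hat = sum(sample) / n
    total = sum(triangular_kernel(xi, x, h, a) / poisson_pmf(xi, theta_hat)
                for xi in sample)
    return poisson_pmf(x, theta_hat) * total / n


def cv_score(sample, h, a):
    """Least-squares cross-validation score A_n - 2 B_n (assumed form)."""
    n = len(sample)
    # The estimate vanishes outside [min - a, max + a]; the support is N.
    grid = range(max(0, min(sample) - a), max(sample) + a + 1)
    a_n = sum(semiparametric_estimate(x, sample, h, a) ** 2 for x in grid)
    # Leave-one-out term: the Poisson mean is recomputed without X_i.
    b_n = sum(semiparametric_estimate(xi, sample[:i] + sample[i + 1:], h, a)
              for i, xi in enumerate(sample)) / n
    return a_n - 2 * b_n


def select_bandwidth(sample, a, candidates):
    """Return the candidate bandwidth with the smallest CV score."""
    return min(candidates, key=lambda h: cv_score(sample, h, a))


sample = [0, 1, 1, 2, 2, 2, 3, 3, 4, 6]  # made-up counts
h_cv = select_bandwidth(sample, a=1, candidates=[0.05, 0.1, 0.2, 0.5, 1.0])
```

A grid search always returns a value, so it does not reveal cases where the score has no interior minimum; in practice the score should be inspected over the grid.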

Remark 2.2
(i) Similar to Theorem 2.3, the bias and variance of the cross-validation criterion can be calculated for the nonparametric estimator $\tilde f_n$ in Equation (8). This results in the same expression for $E\{CV(h)\}$, while the difference comes from the variance of the score function CV in the nonparametric case.
(ii) It would be of interest to compare $\mathrm{MISE}(\hat f_{n,\hat h_{cv}})$ and $\mathrm{MISE}(\hat f_{n,\hat h_{opt}})$. To derive such a result, for some sufficiently large region $H_n$, some information about $\sup_{h \in H_n} |CV(h) + \sum_{x \in \mathbb{N}} f^{2}(x) - \mathrm{MISE}(h)|$ would be necessary. That will be the subject of a forthcoming article.

About choice of parameter 'a' for discrete triangular kernel
Looking at the expression of the MISE in Equation (6) for the semiparametric estimator $\hat f_n$ using the discrete triangular kernel, we are not able to calculate theoretically the optimal parameter $a \in \mathbb{N}$ minimizing $\mathrm{MISE}\{\hat f_{n,K_{a;x,h}}\}$. Hence, Figure 1 illustrates the function $a \mapsto \mathrm{MISE}(a)$ of $\hat f_n$ with the discrete triangular kernel for a simulated p.m.f. which is a mixture of two Poisson distributions $\mathcal{P}(\mu)$ with respective means $\mu_1 = 0.5$ and $\mu_2 = 10$.
For $h \in \{0.1, 0.5, 1\}$, the optimal value $a_{opt} = \arg\min_{a \in \mathbb{N}} \mathrm{MISE}(a)$ is small and equal to 1, 2 or 3, while for $h = 0.01$ we have $a_{opt} \geq 5$. It appears that the optimal $a_{opt}$ decreases as the sample size $n \in \{50, 100, 150, 200\}$ increases. Note that the case $a = 0$ for the discrete triangular kernel results in a kernel of Dirac type. In addition, Figure 2 presents the function $a \mapsto \hat h_{opt}(a)$ in Equation (7). For a fixed sample size $n \in \{50, 100, 150, 200\}$, the optimal $h$-value tends to 0 as $a \in \mathbb{N}$ increases. Finally, we propose to consider $a \in \{1, 2, \ldots, 5\}$ for the applications in the following section. This procedure is proposed for choosing the parameter $a \in \mathbb{N}$ since we are not able to calculate an explicit expression of this parameter minimizing the global-squared error MISE in Equation (6), as is done with $\hat h_{opt}$ in Equation (7) for the bandwidth parameter $h > 0$. A general rule for the choice of $a \in \mathbb{N}$ would be an interesting topic for future work.

Applications
In this section, we present the results of the applications of the semiparametric estimator $\hat f_n \equiv \hat f_{n,K_{a;x,h}}$ using discrete triangular kernels to three real count data-sets. The performance of $\hat f_{n,K_{a;x,h}}$ with respect to each optimal bandwidth parameter $\hat h_{cv}$ and $\hat h_{opt}$ is evaluated by using the following descriptive measure of degree-of-fit, the integrated squared error (ISE):

$$\mathrm{ISE} = \sum_{x \in \mathbb{N}} \{\hat f_{n,K_{a;x,h}}(x) - f_0(x)\}^{2},$$

with $f_0$ being the empirical frequency estimate of the observations. Concerning discrete triangular associated kernels, we consider the values $a \in \{1, 2, \ldots, 5\}$ of the integer parameter.
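The degree-of-fit measure can be computed as follows. This is a minimal sketch that assumes the ISE is the sum over all count values of the squared difference between an estimate and the empirical frequencies $f_0$; the function names are hypothetical and both p.m.f.s are represented as dictionaries mapping counts to probabilities.

```python
from collections import Counter


def empirical_frequencies(sample):
    """Empirical frequency estimate f_0(x) = #{i : X_i = x} / n."""
    n = len(sample)
    return {x: count / n for x, count in Counter(sample).items()}


def integrated_squared_error(estimate, reference):
    """Sum over x of {estimate(x) - reference(x)}^2; missing points count as 0."""
    points = set(estimate) | set(reference)
    return sum((estimate.get(x, 0.0) - reference.get(x, 0.0)) ** 2 for x in points)


f0 = empirical_frequencies([0, 0, 1, 2])  # made-up sample
ise = integrated_squared_error({0: 1.0}, f0)
```

Since the reference is the empirical frequency itself, a bandwidth close to 0 drives this ISE towards 0; the measure therefore describes fit to the observed sample rather than estimation error for the unknown $f$.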

Data of alcohol consumption
A randomly selected sample of $n = 399$ Dutch respondents was asked to keep a diary for two consecutive weeks in which they recorded their daily alcohol consumption. For $a = 1$, the bandwidth $\hat h_{cv}$ is clearly better (in terms of ISE) than $\hat h_{opt}$; for $a \in \{2, 3, 4, 5\}$, $\hat h_{opt}$ performs similarly to or better than $\hat h_{cv}$ (Table 1). Both $\hat h_{cv}$ and $\hat h_{opt}$ tend to 0 as the parameter $a \in \mathbb{N}$ increases. Thus, in this example, the semiparametric estimator $\hat f_{n,K_{a;x,h}}$ with $h = \hat h_{cv}$ is appropriate for a small value of $a$, while $\hat f_{n,K_{a;x,h}}$ with $h = \hat h_{opt}$ becomes more appropriate for larger values of this parameter.

Data of insect growth
Experimental plantations were set up on several host plants, including fruit trees well known in Congo: safou (Dacryodes edulis), hura (Hura crepitans), mango (Mangifera indica), citrus (Citrus paradisi) and avocado (Persea americana). These plantations used young trees 5-6 months old under controlled conditions of temperature and humidity, and the observations were made using a binocular loupe. The development of the insect parasite studied is described by several count variables observed in days, such as the preimaginal development time from egg to adult stage (Table 2). As in the previous example on alcohol consumption, the semiparametric estimator $\hat f_{n,K_{a;x,h}}$ with $\hat h_{cv}$ is appropriate for $a = 1$, while $\hat f_{n,K_{a;x,h}}$ with $\hat h_{opt}$ is generally more appropriate for $a \in \{2, 3, 4, 5\}$. There is an exception for the safou fruit tree, probably due to the small number of observed points $x$. Note that for the hura fruit tree, the cross-validation procedure does not converge (a well-known problem for this method) when $a = 1$, while an optimal value $\hat h_{opt}$ is available.

Bootstrap method
We pursue our study with a more robust evaluation, which consists in resampling the observations, using the citrus tree species as an example. From the data of this tree species, we draw $n \in \{25, 50, 75\}$ bootstrap samples, to which we apply our estimator and the methods for the optimal choice of the bandwidth parameter. We then calculate the averages $\overline{\mathrm{ISE}}$, $\bar h_{opt}$ and $\bar h_{cv}$ of the ISE and of the optimal bandwidth parameters, respectively. In Table 4, the results confirm the conclusions formulated previously: the semiparametric estimator $\hat f_n$ with the optimal $\hat h_{cv}$-value provides better estimates for $a \in \{1, 2\}$, while $\hat f_n$ with the $\hat h_{opt}$-value is better for $a \in \{3, 4, 5\}$. Note that the average ISE is an approximation of the global-squared error MISE in Equation (5), since $\mathrm{MISE} = E(\mathrm{ISE})$.
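The averaging step of the bootstrap evaluation can be sketched generically. The helper below is hypothetical: it resamples the observations with replacement and averages any scalar statistic of the resample. In the setting above that statistic would be the ISE or a selected bandwidth; the sample mean is used here only to keep the example short. The parameters `n_boot` (number of resamples), `size` (resample size) and `seed` are assumptions, since the text does not specify how the values 25, 50 and 75 enter the scheme.

```python
import random


def bootstrap_average(sample, statistic, n_boot, size=None, seed=0):
    """Average of statistic(resample) over n_boot resamples drawn with replacement."""
    rng = random.Random(seed)  # seeded generator, for reproducible resampling
    size = len(sample) if size is None else size
    values = [statistic(rng.choices(sample, k=size)) for _ in range(n_boot)]
    return sum(values) / n_boot


def sample_mean(xs):
    """Placeholder statistic; replace by an ISE or a bandwidth selector."""
    return sum(xs) / len(xs)


counts = [0, 1, 1, 2, 2, 2, 3, 3, 4, 6]  # made-up counts
average = bootstrap_average(counts, sample_mean, n_boot=200, size=25, seed=1)
```

Passing the statistic as a function keeps the resampling logic separate from the estimator, so the same helper can average the ISE, $\hat h_{opt}$ or $\hat h_{cv}$.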

Data
We illustrate our investigations using count data from time series on the wood formation of silver fir, one of the most widespread European conifer species. Wood derives from the cambium, a tissue consisting of a thin layer of cells able to divide, located between the wood and the bark of trees. In conifers, a newly produced wood cell (called a tracheid) undergoes two differentiation phases: (1) first, its radial diameter increases, which is the cell-diameter enlargement phase; (2) second, cell-wall thickening and lignification begin, which is the secondary cell-wall formation phase [17,18]. Once the differentiation process is complete, programmed cell death takes place, giving a mature and functional wood cell, i.e. a dead slender tracheid able to transport water (in trees, water is transported from the roots to the leaves through the wood) and to confer mechanical support to the stem. Under temperate climate conditions, wood formation follows an annual pattern: it is active during the hot season and inactive during the cold season. All the tracheids produced during a year are organized as juxtaposed radial files and form an annual tree-ring, which adds to the tree-rings formed in previous years (Figure 3). Here, we use the number of cells counted weekly, from April to November, in the cell-diameter enlargement phase of wood formation along the radial files of the forming tree-ring of two silver fir trees: silver fir tree 1 in 2007 and silver fir tree 2 in 2008 (Table 5).
In general, the best estimates in terms of ISE are provided by the optimal bandwidth $h^{0}_{opt}$. Moreover, the value $h^{0}_{opt}$ decreases when the value of the parameter $a \in \{1, 2\}$ increases. For the unimodal data-set related to silver fir tree 1, the bandwidth $h^{0}_{opt}$ is better approximated by $\hat h_{cv}$ than by $\hat h_{opt}$ using $a = 1$ (Table 6). For the bimodal data-set related to silver fir tree 2, the bandwidth $h^{0}_{opt}$ is better approximated by $\hat h_{opt}$ than by $\hat h_{cv}$ using $a \in \{1, 2\}$ (Table 7). For the two wood cell count data-sets, the values of the three types of optimal bandwidth become closer when the parameter $a \in \{1, 2\}$ increases. Thus, one can see that the performance of each bandwidth $\hat h_{opt}$ and $\hat h_{cv}$ also depends on the start functions in these examples.

Concluding remarks
In this work, an expression of the optimal bandwidth $\hat h_{opt}$, unavailable until now in the discrete case, is developed for the semiparametric (and nonparametric) kernel estimator with discrete triangular associated kernels. An explicit optimal bandwidth is thus provided, like the one available for continuous kernel density estimation. The proposed expression of $\hat h_{opt}$ depends on both the parameter $a \in \mathbb{N}$ of the discrete triangular kernel and the sample size $n$; of course, this new optimal bandwidth goes to 0 when the sample size $n$ goes to $\infty$. The performance of this new optimal bandwidth is comparable to that of the optimal bandwidth $\hat h_{cv}$ provided by the cross-validation procedure when the parameter $a \in \mathbb{N}$ increases; in particular, for the examples studied in this work, $\hat h_{opt}$ clearly outperforms $\hat h_{cv}$ for $a \geq 3$. However, this first attempt to obtain an explicit optimal bandwidth for an estimator using the discrete triangular kernel cannot be generalized to other discrete kernels, in particular those constructed from usual discrete distributions such as the binomial or Poisson. An interesting perspective would therefore be to find a general rule for expressing the optimal bandwidth in discrete kernel estimation, as is available in the continuous case. Finally, following the idea developed in the current paper, work is in progress to express an optimal bandwidth for discrete nonparametric or semiparametric triangular kernel estimators of the count regression function.