Weighted Poisson and Semiparametric Kernel Models Applied to Parasite Growth

This work deals with parametric and semiparametric modeling approaches for count data distributions related to the development of the spiraling whitefly, an insect pest collected in Brazzaville, Republic of Congo. In this study, the count data distributions are assumed to be modified Poisson probability mass functions. For the discrete semiparametric associated kernel estimator investigated, almost sure consistency and asymptotic normality are shown under some assumptions. Some weighted Poisson distributions (WPD) are applied in comparison with the semiparametric approach for finite samples characterizing the growth of the spiraling whitefly. Finally, the discrete semiparametric estimation is simple and effective for estimating any count distribution, while WPD are more meaningful in practice.


Introduction
The spiraling whitefly (Aleurodicus dispersus Russell) is an insect pest which causes damage to plants by sucking the sap, decreasing photosynthetic activity and drying up the leaves. This insect comes originally from Central America and the Caribbean islands, and is now present in Congo-Brazzaville. Congolese biologists are searching for a suitable method for modeling data related to the growth of this insect. Thus, some experimental populations were raised on plantations of several host plants, among them some fruit trees well-known in the Congo, such as safou (Dacryodes edulis), mango (Mangifera indica) or citrus (Citrus paradisi); see Kiyindou et al. (1999), Mizère et al. (2008) and Mizère (2007). These plantations consisted of young trees (5 to 6 months old) under varying conditions of temperature and humidity, and the observations were made using a binocular loupe. The development of the insect parasite studied is described by the following count explanatory variables observed in days: the preimaginal development time from egg to adult stage, the total number of days of egg laying and the longevity of the adult insect. These count data deviate from the equidispersion assumption; the standard framework provided by the Poisson model is therefore not sufficient, and count models suitable for under- or overdispersed data are needed. In order to express the deviation from classical Poisson models, any count data distribution $f$ on the set $\mathbb{N}$ of non-negative integers can be formulated as a weighted Poisson distribution (WPD) such that

$$f(x) = p(x;\mu)\,\omega(x), \quad x \in \mathbb{N}, \qquad (1)$$

where $p(x;\mu) = \mu^{x}\exp(-\mu)/x!$ is the Poisson probability mass function (p.m.f.) with mean parameter $\mu > 0$ and $\omega(x)$ is the nonnegative normalized Poisson weight function.
When the discrete function $\omega$ does not represent the real recording mechanism, or is not well-specified, it is better to allow the count data to yield an estimate of this weight function by a nonparametric method. This opens the way for semiparametric modeling, which consists of the construction of an estimate $\hat p$ of the standard Poisson p.m.f. $p$ multiplied by a nonparametric kernel estimate of the function $\omega = f/\hat p$. The nonparametric estimate plays the role of a correction factor of the parametric estimate and intrinsically takes into account special features of the counting phenomenon such as overdispersion (or underdispersion) and zero-inflation (or zero-deflation); see Kokonendji et al. (2009). For comparison, several WPD are investigated as alternatives to the parametric Poisson model classically applied for count data by specifying different discrete Poisson weight functions $\omega$. Indeed, these weighted versions of the standard Poisson distribution allow us to take into account the counting phenomena mentioned previously. More precisely, some truncated and translated Poisson distributions are investigated. Finally, the semiparametric estimation procedure and WPD are applied to count datasets related to the growth of the spiraling whitefly in plantations of citrus trees. The advantages provided by each method are investigated with respect to the goodness-of-fit, the new information on insect growth and the meaningfulness of the results in these applications.

The rest of the paper is organized as follows. Section 2 presents the discrete semiparametric kernel estimator using the Poisson p.m.f. as the start function; WPD are then also presented. Basic properties of the discrete kernel estimator studied are shown; in particular, mathematical results on the strong consistency and asymptotic normality of the estimator are formulated. Section 3 contains the results of applications of the parametric and semiparametric methods. Concluding remarks are given in Section 4.

Semiparametric estimation models and weighted Poisson distributions
Let us recall some notions about discrete semiparametric kernel estimation and WPD.

Semiparametric kernel estimation
For the semiparametric procedure, the discrete Poisson weight function $\omega(\cdot)$ in (1) is not specified; thus, a discrete nonparametric kernel estimator of $\omega(\cdot)$ is used in addition to a parametric estimate of $p(\cdot;\mu)$.

Estimator
Let $X_1, X_2, \ldots, X_n$ be a sample of independent observations with an unknown count distribution $f$ as in (1). A discrete semiparametric estimator of $f$ is proposed by Kokonendji et al. (2009) as the combination of a parametric estimator $\hat p(x) = p(x;\hat\mu)$ of $p$ followed by a nonparametric kernel estimator $\hat\omega_n(x)$ of $\omega(x) = f(x)/\hat p(x)$, such that we have

$$\hat f_n(x) = p(x;\hat\mu)\,\hat\omega_n(x), \quad x \in \mathbb{N}. \qquad (2)$$

The estimator $\hat\mu = n^{-1}\sum_{i=1}^{n} X_i$ is the sample mean, the bandwidth $h = h(n) > 0$ is an arbitrary sequence of smoothing parameters that fulfills $\lim_{n\to\infty} h(n) = 0$, and the discrete associated kernel $K_{x,h}(\cdot)$ of the random variable $\mathcal{K}_{x,h}$ is a p.m.f. with support $S_x$ (included in $\mathbb{N}$) satisfying two hypotheses, H1 and H2. These two quite general assumptions can be replaced by the hypotheses H1′ and H2′. Indeed, one can verify that the hypotheses H1′-H2′ lead to H1-H2 and are also less general. Note that the expressions for $A$ and $V$ are related to the chosen discrete kernel $K_{x,h}$ but do not depend on $x$ and $h$, as we will see in the following example.
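As an illustration of the construction described above, the following Python sketch computes a semiparametric estimate at a point. It rests on two assumptions that are not stated explicitly in this section: the correction factor is taken as the average of $K_{x,h}(X_i)/p(X_i;\hat\mu)$ over the sample, and the kernel is a discrete symmetric triangular kernel with arm $a$, in the form commonly given in the discrete kernel literature. The function names are hypothetical.

```python
import math

def poisson_pmf(x, mu):
    """Poisson p.m.f. p(x; mu) = mu^x exp(-mu) / x! for integer x (requires mu > 0)."""
    if x < 0:
        return 0.0
    return math.exp(x * math.log(mu) - mu - math.lgamma(x + 1))

def triangular_kernel(y, x, h, a=1):
    """Assumed discrete triangular kernel on {x-a, ..., x+a}, with h > 0.

    K(y) = ((a+1)^h - |y-x|^h) / norm, where norm makes the 2a+1 values sum to one.
    """
    d = abs(y - x)
    if d > a:
        return 0.0
    norm = (2 * a + 1) * (a + 1) ** h - 2 * sum(k ** h for k in range(a + 1))
    return ((a + 1) ** h - d ** h) / norm

def semiparametric_estimate(x, sample, h, a=1):
    """Poisson start p(x; mu_hat) times a kernel-based correction factor (assumed form)."""
    n = len(sample)
    mu_hat = sum(sample) / n  # sample mean, as in the text
    correction = sum(
        triangular_kernel(xi, x, h, a) / poisson_pmf(xi, mu_hat) for xi in sample
    ) / n
    return poisson_pmf(x, mu_hat) * correction

# Hypothetical toy sample, for illustration only.
toy_sample = [0, 0, 1, 1, 1, 2, 3]
print([round(semiparametric_estimate(x, toy_sample, h=0.3), 4) for x in range(5)])
```

With a very small bandwidth the kernel concentrates on its target, so under these assumptions the estimate approaches the empirical frequency. Near $x = 0$ part of the kernel support falls outside $\mathbb{N}$, so the estimated values need not sum exactly to one.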

Remark 1.
Other examples of discrete associated kernels satisfying H1′-H2′ are the Dirac and Aitchison-Aitken kernels given as examples by Kokonendji & Senga Kiessé (2011). For the Dirac kernel, which is a particular case of an associated kernel without smoothing parameter, i.e. $h = 0$, the modal probability at $x$ is equal to 1 and thus $A = 0$.
For the Aitchison-Aitken kernel, the modal probability at $x$ is equal to $1 - h$ and thus $A = 1$.
Now we propose a data-driven bandwidth selection procedure for the estimator $\hat f_n$.
Bandwidth choice. The bandwidth is generally chosen to minimize the mean integrated squared error (MISE) of $\hat f_n$, so that an ideal parameter value is the minimizer of $\mathrm{MISE}(h)$ over $h > 0$. Thus, for a given discrete kernel $K_{x,h}$ with $x \in \mathbb{N}$ and $h > 0$, an optimal bandwidth parameter $h_{cv} = \arg\min_{h>0} \mathrm{CV}(h)$ is obtained by minimizing the cross-validation estimator $\mathrm{CV}(h)$. This criterion involves the leave-one-out kernel estimator $\hat f_{n,-i}(x) = (n-1)^{-1}\sum_{j \neq i} K_{x,h}(X_j)$ of $\hat f_n(x)$, with $\hat\mu_{-i}$ computed as $\hat\mu$ by excluding $X_i$. This estimator is asymptotically unbiased for $\mathrm{MISE}_{cv}(h)$.
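The bandwidth selection step can be sketched as a grid search. The sketch below assumes the usual least-squares cross-validation criterion, $\mathrm{CV}(h) = \sum_{x} \hat f_n(x)^2 - (2/n)\sum_{i} \hat f_{n,-i}(X_i)$, and a leave-one-out version of the semiparametric estimate that recomputes the mean without $X_i$; both are assumptions, since the exact criterion is not reproduced in this section. The kernel is the assumed triangular form with arm 1, and all names and data are hypothetical.

```python
import math

def poisson_pmf(x, mu):
    """Poisson p.m.f. for integer x (requires mu > 0)."""
    if x < 0:
        return 0.0
    return math.exp(x * math.log(mu) - mu - math.lgamma(x + 1))

def kernel(y, x, h):
    """Assumed discrete triangular kernel with arm 1, for h > 0."""
    d = abs(y - x)
    if d > 1:
        return 0.0
    return (2 ** h - d ** h) / (3 * 2 ** h - 2)

def estimate(x, data, h):
    """Semiparametric estimate at x computed from `data` only (assumed form)."""
    mu = sum(data) / len(data)
    corr = sum(kernel(xj, x, h) / poisson_pmf(xj, mu) for xj in data) / len(data)
    return poisson_pmf(x, mu) * corr

def cv_score(data, h):
    """Assumed least-squares cross-validation criterion."""
    n = len(data)
    # The estimate vanishes beyond max(data) + 1 because the kernel has arm 1.
    squared = sum(estimate(x, data, h) ** 2 for x in range(max(data) + 2))
    # Leave-one-out term: the mean is recomputed without the i-th observation.
    loo = sum(estimate(data[i], data[:i] + data[i + 1:], h) for i in range(n))
    return squared - 2.0 * loo / n

def select_bandwidth(data, candidates):
    """Return the candidate bandwidth with the smallest CV score."""
    return min(candidates, key=lambda h: cv_score(data, h))

# Hypothetical durations in days, for illustration only.
durations = [27, 28, 29, 29, 29, 30, 30, 31, 32, 29]
grid = [0.05 * k for k in range(1, 21)]
print(select_bandwidth(durations, grid))
```

A grid search is used here only for simplicity; any one-dimensional minimizer over $h > 0$ would serve. The sketch requires every leave-one-out subsample to have a positive mean.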

Asymptotic properties
First, the basic properties of the estimator $\hat f_n$, such as its bias and variance, have already been established by Kokonendji et al. (2009); here, we take into account the novel assumptions H1′-H2′, which lead to expressions for the bias and variance in which $p_0 = p(x;\mu_0)$ is the Poisson p.m.f. with mean $\mu_0$, $\hat\mu$ converges to $\mu_0$ and $\omega^{(2)}$ is the finite difference of second order of $\omega$. It follows that $\mathrm{Bias}(\hat f_n(x)) \to 0$ and $\mathrm{Var}(\hat f_n(x)) \to 0$ when $h = h(n) \to 0$ and $n \to \infty$. Therefore, the pointwise and global consistencies of $\hat f_n$ can be deduced easily by showing, respectively, that the mean squared error (MSE) and the MISE both tend to 0 as $h \to 0$ and $n \to \infty$, since we have:

$$\mathrm{MSE}(x) = \mathrm{Bias}^2(\hat f_n(x)) + \mathrm{Var}(\hat f_n(x)), \qquad \mathrm{MISE} = \sum_{x \in \mathbb{N}} \mathrm{Bias}^2(\hat f_n(x)) + \sum_{x \in \mathbb{N}} \mathrm{Var}(\hat f_n(x)).$$

Next, a mathematical result on the almost sure consistency of the estimator $\hat f_n$ is formulated, followed by another result concerning its asymptotic normality. The proofs of the two results are postponed to the Appendix.

In the following section, we are interested in WPD when the discrete Poisson weight function $\omega(\cdot)$ in (1) is well-specified. Thus, the modeling approach developed is completely parametric.

Weighted Poisson distributions
Let $X$ be a r.v. having a Poisson p.m.f. $p(x;\mu)$ with mean parameter $\mu > 0$. The r.v. $X^{\phi}$, said to be the weighted version of $X$, has a p.m.f. given by

$$p_{\phi}(x;\mu) = \frac{\phi(x)}{E(\phi_{\mu}(X))}\,p(x;\mu), \quad x \in \mathbb{N}, \qquad (3)$$

where $\phi(x)$ is a nonnegative weight function on $\mathbb{N}$ and the denominator is the normalizing constant depending on $\mu$ such that $0 < E(\phi_{\mu}(X)) < \infty$. The discrete weight function $\phi(x) = \phi(x;\lambda)$ can depend both on the parameters $\lambda$ and $\mu$, where $\lambda$ represents the recording mechanism. Clearly, the standard Poisson distribution is a WPD with unit weight function $\omega(x) = 1$, $\forall x \in \mathbb{N}$. In addition, the weighted variable $X^{\phi}$ is said to be overdispersed (underdispersed) when the Fisher dispersion indicator $I(X^{\phi}) = \mathrm{Var}(X^{\phi})/E(X^{\phi})$ is greater (smaller) than 1, while the Poisson variable is equidispersed, with $I(X) = 1$. Let us finally remark that, by comparing equation (3) to equation (1), we have $\omega(\cdot) = \phi(\cdot)/E(\phi_{\mu}(X))$. In the following we give some examples of WPD.
• The second model considered is $\mathrm{WPD}_3(\mu;\lambda,k)$. We have $\mathrm{WPD}_3(\mu;\lambda,k) \to \mathrm{PT}(\mu;\lambda)$ as $k \to \infty$, where PT is the translated Poisson p.m.f. with parameters $\mu$ and $\lambda$. This WPD is also underdispersed.
• The third model is the zero-modified weighted distribution $\mathrm{ZMW}(\mu;k,p_0)$.

Let us give some details about the possible interpretation of the parameters and the method for their estimation. The integer parameter $k$ serves to construct a family of distributions $p_{\phi}(x;\mu,k)$ which converge to $p_{\phi}(x;\mu)$, and it has no particular biological interpretation; the parameter $\mu$ is the mean of the Poisson p.m.f. These two parameters can be estimated by maximum likelihood. The parameter $p_0$ is the theoretical zero proportion and can be estimated by the empirical zero proportion. Finally, the parameter $\lambda$ is the absolute minimum time it takes for an insect to become an adult parasite; thus the host plant with the lowest $\lambda$-value is the most favorable to the development of the parasite. This last parameter is estimated using the method of moments. In the applications, our main concern is the estimated value of $\lambda$ because it is useful for controlling reproduction of this specific insect species. For more details on modeling count data phenomena and WPD, see Kokonendji et al. (2008) and Mizère (2006).
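The weighted Poisson construction and the Fisher dispersion indicator can be illustrated numerically. The sketch below normalizes $\phi(x)\,p(x;\mu)$ over a truncated range of counts and compares the unit weight with a zero-truncation weight, which corresponds to the truncated Poisson model used later for the longevity data. The value $\mu = 2.5$, the cut-off `x_max` and the function names are hypothetical choices for illustration.

```python
import math

def poisson_pmf(x, mu):
    """Poisson p.m.f. for integer x >= 0 (requires mu > 0)."""
    return math.exp(x * math.log(mu) - mu - math.lgamma(x + 1))

def weighted_poisson_pmf(phi, mu, x_max=200):
    """Return [p_phi(0), ..., p_phi(x_max)] with p_phi(x) proportional to phi(x) p(x; mu).

    The normalizing constant is approximated by a sum truncated at x_max.
    """
    unnormalized = [phi(x) * poisson_pmf(x, mu) for x in range(x_max + 1)]
    constant = sum(unnormalized)
    return [u / constant for u in unnormalized]

def fisher_index(pmf):
    """Fisher dispersion indicator Var / E for a p.m.f. given as a list over 0, 1, 2, ..."""
    mean = sum(x * p for x, p in enumerate(pmf))
    variance = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))
    return variance / mean

mu = 2.5  # hypothetical mean parameter
standard = weighted_poisson_pmf(lambda x: 1.0, mu)
zero_truncated = weighted_poisson_pmf(lambda x: 1.0 if x >= 1 else 0.0, mu)

print(fisher_index(standard))        # unit weight: equidispersed
print(fisher_index(zero_truncated))  # zero truncation: below one, i.e. underdispersed
```

For the zero-truncated case the indicator has the closed form $1 + \mu - \mu/(1 - e^{-\mu})$, which is always below one, so the numerical value can be checked against it.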

Applications
In this section, some diagnostic checks are used to choose between the parametric and semiparametric models. Then, the results are given for the application of each method (classical Poisson, WPD and semiparametric $\hat f_n$) on count datasets related to the growth of the Congolese spiraling whitefly.
Note that, for the discrete triangular kernel semiparametric estimator, the parameter $a \in \mathbb{N}$ is equal to 1, 2 or 3 in practice. We propose here to fix $a = 1$ since the global error MISE increases with $a \in \mathbb{N}$ (Kokonendji et al., 2007). For example, Figure 1 illustrates the comparative behaviors of the function $a \mapsto \mathrm{MISE}(a; n, h, f)$ of $\hat f_n$ with a discrete triangular kernel $K_{a;x,h}$ for the simulated p.m.f. $f(x) = 0.4\,\mathrm{Pn}(x; 0.5) + 0.6\,\mathrm{Pn}(x; 10)$, $x \in \mathbb{N}$, which is a mixture of two Poisson distributions $\mathrm{Pn}(x;\mu)$ with respective means $\mu_1 = 0.5$ and $\mu_2 = 10$. For fixed $h > 0$ and sample sizes $n$, the optimal value $a_{opt} = \arg\min_{a \in \mathbb{N}} \mathrm{MISE}(a)$ is less than or equal to 3; note that the case $a = 0$ for the discrete triangular kernel results in a naive kernel of Dirac type.

Model diagnostics for semiparametric estimation
The estimated discrete Poisson weight function $\hat\omega_n$ in (2) provides useful information for model diagnostics. This Poisson weight function should equal one if the Poisson p.m.f. is indeed the true p.m.f. Therefore, the adequacy of the model can be checked by examining a plot of the weight function: we are interested here in plotting the log weight function $\log\hat\omega_n(x) = \log\{\hat f_n(x)/p(x;\hat\mu)\}$ to see how far away it is from zero. Thus, a simple graphical goodness-of-fit diagnostic emerges by plotting $x$ against the statistic $Z(x)$ and checking whether the $Z(x)$-values fall within the confidence band $\pm 1.96$. For the data shown in Figures 2 and 3, this suggests that it would be of interest to consider parametric Poisson (or also WPD) models. For the longevity data, only 44.4% of $Z(x)$-values belong to the confidence band $\pm 1.96$ (Figure 4). This means semiparametric methods should work better than parametric methods (see also Table 3 later).
The following section provides the detailed results about the performances of standard and weighted Poisson models in comparison with the semiparametric kernel estimator.

Parametric and semiparametric results
In this section, the performance of each model applied is evaluated by using the practical integrated squared error (ISE), which is a descriptive measure-of-fit computed from $f_0$, the observed frequency, and $\hat f$, the estimated frequency obtained from the application of WPD or $\hat f_n$. For count data, we can also measure performance through a chi-squared ($\chi^2$) distance between the observed and estimated frequencies. For the associated degrees of freedom (df), see Loader (1999, page 92). With that definition, the values of df are not integers but real numbers. Here, these computed values of df are rounded to integers; they are often called the effective number of parameters. Thus, model comparison using the computed df is made through the Akaike information criterion (AIC) and the $\chi^2$ goodness-of-fit test.
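The two measures of fit can be illustrated with a short Python sketch. It assumes the usual forms, namely $\sum_x \{\hat f(x) - f_0(x)\}^2$ for the ISE and $\sum_x \{f_0(x) - \hat f(x)\}^2/\hat f(x)$ for the $\chi^2$ distance; these forms, the function names and the toy frequencies are assumptions for illustration and are not taken from the tables of this paper.

```python
def integrated_squared_error(observed, fitted):
    """Assumed practical ISE: sum of squared differences between frequencies."""
    return sum((f - o) ** 2 for o, f in zip(observed, fitted))

def chi_squared_distance(observed, fitted):
    """Assumed Pearson-type distance; cells with zero fitted frequency are skipped."""
    return sum((o - f) ** 2 / f for o, f in zip(observed, fitted) if f > 0)

# Hypothetical observed and fitted frequencies over three count values.
observed = [2, 5, 3]
fitted = [2.5, 4.0, 3.5]

print(integrated_squared_error(observed, fitted))
print(chi_squared_distance(observed, fitted))
```

Skipping cells with zero fitted frequency is one possible convention; pooling sparse cells before computing the distance is another common choice.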
Preimaginal duration (Table 1). For the discrete semiparametric triangular kernel estimator $\hat f_n$ with $a = 1$, the cross-validation procedure provides an optimal $h$-value $h_{cv} = 0.38$. The descriptive measure-of-fit ISE indicates that the semiparametric model is better than the models $\mathrm{WPD}_3(\mu; 20, 30)$ and $\mathrm{PT}(\mu;\lambda)$, which both have close performances. In particular, let us note that the modal frequency at $x = 29$ is equal to 18 for observations while it is equal to 14.970 and around 10.8 for the semiparametric and Poisson models, respectively. Looking at the $\chi^2$ goodness-of-fit test, for the models PT and $\mathrm{WPD}_3$ we have $\chi^2_0$-values equal to 13.954 and 13.995 (df = 7) with p-values equal to 0.052 and 0.051, respectively; in comparison, for the semiparametric model we have $\chi^2_0 = 2.540$ but a smaller df equal to 6, with p-value = 0.864. Finally, looking at the Akaike information criterion, for PT and $\mathrm{WPD}_3$ we have AIC values around $-28.7$; the value is $-30.523$ for $\hat f_n$.
Total number of days of egg laying. The observations of these data have mean 0.439 and variance 0.323, so the Fisher dispersion indicator is $I = 0.736 < 1$; $\mathrm{ZMW}(\mu; k, p_0)$ and the standard $\mathrm{P}(\mu)$ are applied to these data (see Table 2). Concerning $\mathrm{ZMW}(\mu; 30, \hat p_0)$, the observed zero proportion $\hat p_0 = 0.585$ is smaller than the zero proportion $\exp(-0.439) = 0.644$ expected under the Poisson model $\mathrm{P}(\mu)$. For the discrete semiparametric triangular kernel estimator $\hat f_n$ with $a = 1$, we have $h_{cv} = 0.05$. Finally, the quality of fit of $\hat f_n$ is better (in terms of ISE) than that of $\mathrm{ZMW}(\mu; 30, \hat p_0)$, which is itself better than $\mathrm{P}(\mu)$. However, $\hat f_n$ and $\mathrm{ZMW}(\mu; 30, \hat p_0)$ have ISE-values of the same order. Looking at the $\chi^2$-test, for the models $\mathrm{ZMW}(\mu; k, p_0)$ and $\mathrm{P}(\mu)$ we have $\chi^2_0$ values equal to 0 and 8.677 with p-values equal to 1 and 0.003, respectively, for the same df equal to 1; in comparison, for the semiparametric estimator $\hat f_n$, we have $\chi^2_0 = 0$ and p-value = 0 with a null value of df. Since zero degrees of freedom looks unusual, let us also consider the AIC: ZMW, P and $\hat f_n$ have AIC values of $-8.675$, $-15.120$ and $-13.051$, respectively. The AIC and $\chi^2_0$ values do not lead to the same conclusion concerning the performance of the different models applied; however, under either criterion a parametric model (ZMW or P, depending on the criterion used) performs better than the semiparametric model.
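The dispersion indicator and the comparison of zero proportions quoted for the egg-laying data can be checked with a few lines of arithmetic. The sketch uses only the rounded summary values reported above, so the last digit of the results may differ slightly from the published figures; the variable names are hypothetical.

```python
import math

# Rounded summary statistics quoted in the text for the egg-laying data.
mean, variance = 0.439, 0.323
observed_zero_proportion = 0.585

dispersion = variance / mean     # Fisher dispersion indicator I
poisson_zero = math.exp(-mean)   # zero probability under a Poisson model with this mean

print(f"I = {dispersion:.3f}")
print(f"Poisson zero probability = {poisson_zero:.4f}")
print("zero-deflated relative to Poisson:", observed_zero_proportion < poisson_zero)
```

An observed zero proportion below the Poisson value, together with $I < 1$, is the kind of feature that a zero-modified model is designed to accommodate.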
Longevity. Table 3 presents the longevity (in days) of the adult insect studied; the data have mean 2.463 and variance 2.425, with a Fisher dispersion indicator of $I = 0.985 < 1$, which is almost identical to one. Here $\mathrm{WPD}_2(\mu; k)$ and the truncated $\mathrm{P}_0(\mu)$ are used since there is no observed value at day 0. The discrete semiparametric triangular kernel estimator $\hat f_n$ with $a = 1$ and $h_{cv} = 0.07$ outperforms the $\mathrm{WPD}_2(\mu; 30)$ and $\mathrm{P}_0(\mu)$ models in the measure ISE. In particular, using $\hat f_n$ reduces boundary bias since it gives a good adjustment at the left boundary $x = 1$. Looking at the $\chi^2$-test, $\hat f_n$ has a $\chi^2_0$ value equal to 0.259 (with df = 2 and p-value = 0.878) while $\mathrm{WPD}_2$ and $\mathrm{P}_0$ have $\chi^2_0$ values around 6.13 (with df = 3 and p-value = 0.105). Finally, $\mathrm{WPD}_2$, $\mathrm{P}_0$ and $\hat f_n$ have AIC values of $-13.617$, $-8.732$ and $-13.712$, respectively.

Concluding remarks
The discrete semiparametric kernel approach, in addition to being simple and effective for estimating any unknown count distribution, was intended to work well even if the unknown p.m.f. cannot be approximated well by the Poisson distribution. This semiparametric modeling intrinsically took into account the special features of count data via the discrete nonparametric weight function $\omega$, and it provided some interesting measures of fit for diagnostics. However, WPD opened the way for more practical interpretation and discussion of counting phenomena observed in the data. Thus, concerning the examples treated in this work, the biologist had to focus on stopping the reproduction of Aleurodicus to fight against its spread, because the minimum development time of this insect pest was $\hat\lambda = 20$ days for citrus trees. Finally, WPD were more meaningful in these applications than semiparametric modeling since they provided new information on insect growth.

Appendix A: Proofs
The proof of Theorem 1 requires the use of the following lemma (see Hoeffding, 1963).