On Modeling Wood Formation Using Parametric and Semiparametric Regressions for Count Data

Understanding how wood develops has become an important problematic of plant sciences. However, studying wood formation requires the acquisition of count data difficult to interpret. Here, the annual wood formation dynamics of a conifer tree species were modeled using generalized linear and additive models (GLM and GAM); GAM for location, scale, and shape (GAMLSS); a discrete semiparametric kernel regression for count data. The performance of models is evaluated using bootstrap methods. GLM was useful to describe the wood formation general pattern but had a lack of fitting, while GAM, GAMLSS, and kernel regression had a higher sensibility to short-term variations.


Introduction
Wood is the most abundant biological component of the biosphere and plays a key role in ecosystem functioning, representing, for example, one of the strongest sink of CO 2 (Tans and White, 1998), which is a major contributor to climate change. Wood also plays a crucial economical role, being one of the most important products of the world trade (FAO, 2007). Understanding how wood develops therefore implies crucial issues, and the study of wood formation has become an innovative and fast-growing field in plant sciences over the last decade (Gricar et al., 2011).
Wood derives from the cambium, a thin layer of cells between the wood and the bark which divides to produce the new wood cells inward (Lachaud et al., 1999). Conifers represent the easiest case of wood formation because one type of cells, called tracheids, composed 90% to 95% of their wood. A cell newly produced by division in the cambium and destined to become a functional tracheid undergoes a differentiation program in order to acquire the particular morphological and physiological characteristics of this specialized cell type ( Fig. 1) (Plomion et al., 2001): (1) first, its diameter enlarges mainly in the radial direction under water turgor pressure; (2) second, its thick, rigid, and waterproof secondary wall mainly composed of cellulose, hemicellulose, and lignin is deposited. Under temperate climate, wood formation presents an alternation between active and inactive periods related to the alternation of the hot and cold seasons. All the tracheids formed during the active period (growing season) are organized in juxtaposed radial files and formed an annual tree ring, which adds to the tree rings formed the previous years (Fig. 2).
Detailed analyses of wood formation need repeated sampling (generally at a weekly time step) of the developing tree ring during the growing season (Rossi et al., 2006). One of the main problem of this method is that growth is not homogenous along and around tree stem, so that the number of wood cells produced at a given time can considerably vary according to the position of the sampling (Wodzicki and Zajaczkowski, 1970). Therefore, it is difficult to know whether the wood cell number variations observed between the successive samples are related to growth or within-tree growth variability, making the characterization of the wood cell number variation in the successive phases of tracheid differentiation during the season difficult from a simple description of the raw cell count dataset. Modeling these count data could be a good way to highlight the growth signal by smoothing the variations resulting from within-tree variability.
In this work, we are concerned in finding a suitable methodology to model the number of cells weekly counted in the enlargement and thickening phases of five silver fir (Abies alba Mill.) trees during the season 2007. Thus, let Y be a response variable in R, X be a count explanatory variable in N d , and the conditional mean m(·) = E(Y |X = ·) an unknown count regression function (c.r.f.) to estimate. A wide range of structured models could be considered in modeling count data, for instance, generalized (linear additive) models or generalized (partial linear) models discussed in Pardinas and Sperlich (2010). First, we propose to focus on Generalized Linear and Additive Models (GLMs and GAMs, respectively) for parametric and semiparametric regressions, respectively, of function m, useful in the situation of a regression model for which the error term is not normally distributed, as for binomial response variables or count data (Hastie and Tibshirani, 1990;McCullagh and Nelder, 1989). More precisely, because the observations are repeated on the same trees during the year, we are concerned with mixed effects versions of GLMs and GAMs, denoted as GLMM and GAMM, respectively. Mixed models are recommended to reduce the bias of the estimations in the case of repeated measurements since random effects allow to take into account the correlation between observations (Zuur et al., 2009). Then, we investigate a discrete semiparametric regression for count explanatory variables, which requires to assume that c.r.f can be expressed as m = l × ω with l a parametric function and ω an unknown discrete nonparametric one. The semiparametric estimation is realized in two steps: a first approximation l of l followed by a discrete nonparametric kernel regression of ω = m/ l using a discrete version of Nadaraya-Watson estimator; in this way, the nonparametric estimate plays the role of a correction coefficient of the parametric estimate (Abdous et al., 2012;Senga Kiessé and Rivoire, 2011). The purpose of this semiparametric model is to improve the performance of parametric model l; later, in the applications, we will illustrate this semiparametric approach using (parametric) model l having the worst performance as departure. This approach focuses on the discrete character of variables, and is thus appropriate for estimating count regression function by using some associated kernels that are also discrete (Kokonendji and Senga Kiessé, 2011). Finally, for comparison with other competitive generalized structured models, we apply GAMs for location, scale, and shape (GAMLSS) proposed by Rigby and Stasinopoulos (2005) as semiparametric regression-type models. They were introduced as a way of overcoming some of the limitations associated with GLMs and GAMs. Note that for other works on basic nonparametric or semiparametric regression models, one can refer to Dai and Sperlich (2010), Lombardia and Sperlich (2008), or Alberts and Karunamuni (2003).
The remainder of this article is organized as follows. The mathematical models applied are recalled in Section 2. For semiparametric kernel estimator, some examples of discrete kernels are given. Moreover, some data-driven bandwidth selection procedures well known in continuous case but not available until now for our discrete semiparametric regression estimator are adapted. In particular, a new theoretical expression of optimal bandwidth parameter is given for a type of discrete associated kernels used. The model application on wood formation data and the results are provided in Section 3, bootstrap resampling methods are used for a simulational assessment of the performance of models. Section 4 contains some concluding remarks. Finally, all mathematical details that are related to data-driven bandwidth selection procedures are postponed to the Appendix.

Mathematical Models
In this section, we first present generalized linear and additive models used for a parametric and semiparametric regression, respectively, of function m; then we detail the semiparametric kernel regression methodology that consists by an improvement of the parametric estimate by using a nonparametric correction factor.

Generalized Models
An important statistical development over the last 30 years has been the advance in regression analysis provided by GLMs and GAMs. GAMs are semiparametric extensions of GLMs, which are themselves mathematical extensions of classical linear models (Hastie and Tibshirani, 1990). By applying a mathematical transformation to the response variable according to the real distribution of the errors, GLMs generalize the linear model to nonlinear responses and variables that can have other than a normal distribution, including the binomial, Poisson, or gamma probability distributions. GAMs are GLMs in which the linear predictor depends, in part, on a sum of smooth functions of predictors. The strength of GAMs is their ability to deal with highly nonlinear and nonmonotonic relationships between the response and the set of explanatory variables. At last, in GAMLSS the exponential family distribution assumption for the response variable is relaxed and replaced by a general distribution family, including highly skew and/or kurtotic continuous and discrete distributions (Rigby and Stasinopoulos, 2005). The ability of this tool to handle nonlinear data structures can aid in the development of ecological models that better represent the underlying data, and hence increase our understanding of ecological systems (Guisan et al., 2002).
Concerning GLM with the standard Poisson model as an appropriate choice, it consists here by a linear predictor l(·; ) with parameter = (θ 0 , θ 1 , θ 2 ), based on a combination of realization x ∈ N of the explanatory variable X with a logarithmic link such that we have where e i represents the residuals. The linear and logarithm parts in l(·; θ ) allow to describe the tendency of data. About GAM, we have with s(·) a nonparametric function that can be modeled by splines. Furthermore, the mixed effects versions of GLM and GAM (GLMM and GAMM, respectively) are available for describing relationship between a response variable and covariates in the repeated measurements data that are grouped according to a cluster factor. At last, we consider GAMLSS models defined as follows (Rigby and Stasinopoulos, 2005). Assume θ = (θ 1 , θ 2 , θ 3 , θ 4 ) be a vector of four distribution parameters where each of which can be a function to explanatory variables. For k = 1, 2, 3, 4, let g k be known monotonic link functions relating the distribution parameters to explanatory variables. The semiparametric additive formulation of GAMLSS can be given by where d jk is an unknown function of the explanatory variable. Under some conditions, GAMLSS in (1) can be extended to parametric linear, nonlinear parametric, or nonlinear semiparametric additive model.

Semiparametric Kernel Regression
2.2.1. Estimator. Let us consider the sequence of independent and identically distributed with i the residuals assumed to have zero mean and finite variance.
In the semiparametric approach, the distribution of Y i is assumed to be a modified parametric function given by where l(x; ) is a function relative to the parameter and ω(x) > 0 is a nonparametric function. The discrete semiparametric regression estimator of m in (2) results from a parametric estimation l(x) = l(x; ) of l followed by a nonparametric kernel estimation see Abdous et al. (2012). The bandwidth h = h(n) > 0 is an arbitrary sequence of smoothing parameters, which fulfills lim n→∞ h(n) = 0; and the discrete associated kernel K x,h (·) of random variable K x,h is a probability mass function (p.m.f.) with support S x (containing x) included in N satisfying the following hypotheses: Examples of discrete kernel. We present two examples of discrete kernels for which more details are available in Kokonendji and Senga Kiessé (2011), and references therein. The first discrete kernel satisfies only the assumption (H1) and has its variance such that lim h→0 Var(K x,h ) ∈ V(0) instead of (H2), where V(0) is a neighborhood of 0. This kernel might be particularly useful for smoothing small or moderate samples sizes (refer to Kokonendji and Senga Kiessé, 2011;Zougab et al., 2012Zougab et al., , 2013. The second kernel fulfills (H1)-(H2).
Example 2.2. For (x, a) ∈ N × N and h > 0, the second kernel is a discrete symmetric triangular one with random variable K a;x,h defined on support S a,x = {x, x ± 1, . . . , x ± a} and whose p.m.f. is given by In addition, we propose the following expansions of modal probability and variance of this kernel such that k 2 log(k). These expansions are useful for establishing an expression of optimal bandwidth (see the Appendix).
2.2.3. Bandwidth choices. We adapt two data-driven bandwidth selection methods from continuous regression but not developed until now for discrete regression estimator m n . The first consists by finding an optimal value h cv by minimizing the score function The second consists by finding a theoretical expression of optimal bandwidth parameter h opt obtained by minimizing the asymptotic part of mean integrated squared error (MISE) of m n such that Indeed, the problem with using MISE to bandwidth selection is that we are not able to provide a theoretical expression of optimal bandwidth h opt minimizing MISE in discrete associated kernel estimation. Thus, h opt is an approximate of the true h opt . However, this optimal bandwidth is available only for estimator using discrete associated kernels satisfying (H1)-(H2). Thus, a theoretical expression h opt is not available for binomial kernel that does not fulfill (H2) but only for discrete triangular kernels. Moreover, we will see that h opt cannot be directly used since it requires to know the true function f . Therefore, in the following section, the estimator m n in (3) is just applied with binomial kernel using the optimal h-value h cv . The mathematical details of these two methods are postponed to the Appendix. A Bayesian approach is proposed in Zougab et al. (2014) as an alternative approach to bandwidth selection in the context of nonparametric count regression using binomial kernel; this approach permits also the variance estimation of the model error. An extension of this approach in the context of semiparametric count regression using binomial kernel would provide an improvement of the smoothing quality; this requires a thorough job in a separate article.
Remark 2.1. (i) First, it would be of some interest to compare, for example, MISE( m n,h cv ) and MISE( m n, h opt ). To derive such a result, some information about sup H n |CV 1 − 2CV 2 + (1/n) n i=1 Y 2 i − MISE|, for some sufficiently large region H n , would be necessary. That will be the subject of a forthcoming article.
(ii) Second, a common problem with cross-validation methodology is that the criteria CV is not always consistent depending on discrete kernel K x,h and sample (X i ) i=1,2,...,n . In this situation, for example, the function h → CV(h) is only increasing and does not alternate the phases of decreasing and increasing that allow to find a minimum h-value. It would be interesting to adapt a generalized cross-validation procedure in forthcoming works for our discrete semiparametric regression, for example, see the works of Tong and Yao (1998) for regression estimation based on dependent data.

Applications
In this section, we consider the data of the number of cells weekly counted in the diameter enlargement and wall-thickening phases of five silver fir trees in 2007 presented in Fig. 3.

Data
Small samples of wood called microcores (2 mm diameter, 15-20 mm length) were collected weekly from April to November 2007 at breast height on the stems of five silver fir trees using a specifically designed puncher, the Trephor tool (Vitzani, Belluno, Italy) (Rossi et al., 2006), and following an ascending spiral pattern. For each sample, the number of cells in the enlargement and thickening phases was counted along three radial files (Cuny et al., 2012(Cuny et al., , 2013. The number of cells weekly counted in the wood formation phases of diameter cell enlargement and wall thickening of one silver fir tree are presented in Fig. 3(a) and 3(b). The observations in the two same phases of the five silver fir trees are plotted in Fig. 3(c) and 3(d).

Mixed Effect Models
For wood cell count data in Fig. 3(c) and 3(d), which present five trees i observed on 31 weeks j, we apply the mixed model such that m(·) = E(Y ij |X ij = ·), i = 1, 2, . . . , 5, j = 1, 2, . . . , 31, where the response variable Y i,j is the number of cells measured at time j, the function m relates the response variable to other covariates X i,j varying with each tree and time. Thus, for GLMM as example, the parameter of GLM is replaced by i = (θ 0 + a 0i , θ 1 + a 1i , θ 2 + a 2i ), where θ k , k = 1, 2, 3, are the fixed effects and a ki are random effects related to each individual i. The different types of generalized models mentioned are implemented on the R packages "lme4" (Bates et al., 2011) and "gamm4" (Wood, 2011). We first apply the mixed effects models GLMM and GAMM on data of five trees presented in the graphs (c) and (d) in comparison with discrete semiparametric binomial kernel regression estimator using h = h cv . The parametric estimates provided by GLMM are used as departure for the semiparametric procedure, since we will see that GLMM has the worst performance.
In order to evaluate the performance of the models, we just apply the root mean square error (RMSE), which is a descriptive measure of degree-of-fit defined as RMSE i = { n j =1 (y ij − y ij ) 2 }/n, i = 1, 2, . . . , 5, where y ij is the adjustment of the jth measurement y ij of a tree i and n = 31 the total number of observations. The average RMSE is also presented in Table 1 such that RMSE = (1/5) 5 i=1 RMSE i . For the diameter enlargement phase, the GLMM and GAMM as well as the semiparametric kernel estimator were suitable to exhibit the unimodal tendency of the seasonal dynamics for all trees except the tree i = 5 (Fig. 4). For this last tree, GLMM and GAMM described a unimodal curve while semiparametric regression provided a bimodal curve. In general, the estimated curves provided by GLMM were over-smoothed, while those provided by GAMM and semiparametric binomial regression reflected more data variations. Moreover, the modal value was globally under-estimated and moved from a few days by GLMM; and, this parametric model over-estimated the null values observed at left boundary points. Finally, concerning GAMM and semiparametric regression, they allowed to highlight similar tendencies, except for tree i = 5.
For thickening phase, the conclusions are similar as previously. In particular, let us present the following remarks. In Fig. 5, for individual tree i = 2, the semiparametric binomial regression clearly pointed out some estimations with a bimodal tendency in contrast to GLMM and GAMM. Moreover, GLMM largely over-estimated the values observed at right boundary points. The corresponding results are given in Table 1. Finally, it appeared that the nonparametric correction provided by estimator using binomial kernel allowed to clearly improve the parametric estimates resulting from GLMM and was better than GAMM in term of performance (Table 1). In the following in order to deepen our study by using bootstrap methods, a comparison is realized with GAMLSS, which is another competitive generalized structured model. We apply GLM, GAM, GAMLSS, and m n on re-sampled data of one tree presented in the graphs (a) and (b). Note that for GAMLSS we proceed by fitting smoothing cubic splines to the data and for semiparametric kernel model we omit to present h cv -values.

Bootstrap Methods
These methods are applied for a robust evaluation, which consists by re-sampling the number of wood cells weekly counted in the diameter enlargement or wall-thickening phase of one silver fir tree during 2007 presented in Fig. 3(a) and 3(b). In addition, compared to the previous section, we deepen our study by comparing to GAMLSS. Thus, from the data of this tree we draw N = {25, 50, 150, 200, 250} bootstrap samples on whom we apply the different models studied. In addition with RMSE applied in previous section, the performance of models is also evaluated by using the Akaike information criterion (AIC). For the number N of bootstrap samples, the average RMSE and AIC with degrees of freedom (df), rounded to integers, are presented in Table 2. Looking first at the descriptive measure of degree-of-fit, GAMLSS and semiparametric binomial kernel regression have closed performance; they are both better than GLM and GAM. Then, by taking also into account the number of parameters via the AIC, the smallest values of this measure are provided by GAMLSS (df = 10), followed by GAM (df = 10), GLM (df = 28), and at last semiparametric kernel regression (df = 27). From here, GAMLSS is the most interesting model.

Discussion
Detailed analyses of wood formation need repeated sampling (generally at a weekly time step) of the developing tree ring during the growing season (Rossi et al., 2006). One of the main problem encountered when studying wood formation is that the wood cell number variations related to growth, which interest the biologists, are confused with the wood cell number variations related to the heterogenous growth along and around the stem. So, the objective is to find a model that smooths the data while maintaining a suitable fit. In this study, the application of GL(M)M, GA(M)M, GAMLSS, and semiparametric binomial kernel regression on the data of five trees offered different solutions to cope with this trade-off between degree of smoothing and goodness of fit, so that these different methods should be intended to different purposes. The GL(M)M method, which allows for a high degree of smoothing, may be sufficient when the will is to describe the wood formation general pattern, because it highlights the general shape of cell number variations during the season. But its lack of fitting makes it unsuitable when the will is to assess in detail the dynamics of the wood formation process. For example, one of the central aim of wood formation studies is to understand how climate influences wood formation, and this needs to relate climatic and wood formation data at a fine temporal resolution (Gricar et al., 2011). In this case, the GA(M)M, GAMLSS, and semiparametric kernel procedures could offer great perspectives because of their higher sensibility to the high-frequency variations in the dataset. The incorporation of mixed effects in GA(M)M is also a good point because it allows to accurately account for the nonindependence of weekly cell count data (Zuur et al., 2009), providing a more robust description of wood formation dynamics. Finally, the discrete semiparametric kernel regression and GAMLSS offer interesting alternative to GL(M)M and GA(M)M. Indeed, GAMLSS is known in literature to have a systematic part expanded to allow modeling not only of the mean but other parameters of the distribution of Y as linear and/or nonlinear, parametric and/or additive nonparametric functions of explanatory variables, and/or random effects (Rigby and Stasinopoulos, 2005). Concerning semiparametric kernel regression, the balance wished between degree of smoothing and goodness of fit can be obtained by playing on the value of the bandwidth parameter h > 0 according to the purpose of the study.

A.1. Cross-validation Procedure
An h-value denoted h cv can be selected by the well-known cross-validation procedure adapted to the estimator m n in (3). For this, we express the discrete semiparametric estimator m n of the c.r.f. m as . The optimal bandwidth selection by the crossvalidation method consists by the minimization of the terms depending on h in the criterion is the leave-one-out kernel estimator of m n (X i ; h); thus, we have h cv = arg min CV(h) with CV(h) = CV 1 − 2CV 2 . By calculating the expectation of (A.1), we show that it is an unbiased estimator of the following version of mean integrated squared error (MISE) weighted by p.m.f f given by Indeed, we directly see that the terms are, respectively, unbiased asymptotic estimates of E{ x∈N m 2 n (x)f (x)} and E{ x∈N m 2 (x)f (x)}. Then, we show that E{ x∈N m n (x)m(x)f (x)} is asymptotically approximated by Note that the estimate −i is calculated as by excluding X i . One can refer to Hardle and Marron (1985) for bandwidth selection in nonparametric continuous regression.
In the next part, we establish a new theoretical expression of optimal parameter h opt for estimator m n with discrete associated kernels satisfying (H1 )-(H2 ).

A.2. Minimization of Asymptotic Part of Mean Integrated Squared Error
This approach is required to calculate bias and variance of estimator m n since MISE in (4). Let us assume l 0 (x) = l(x; 0 ) be a fixed parametric start in (2), that is, m = l 0 ω, such that converges to 0 in probability. For x ∈ N, the discrete semiparametric estimator m n in (3) using discrete triangular kernels admits the following bias: The conditional variance Var(Y |X = x) is finite, the function f > 0 is the p.m.f. of the regressor X; ω (1) , f (1) , and ω (2) are respectively finite differences of first and second order (Abdous et al., 2012). Hence, we give the following expression of MISE: where AMISE(h) is the main term in the sum of integrated variance and squared bias of m n . The bandwidth h opt = arg min h>0 AMISE{ m n (x)} comes by solving the following equation d{AMISE(h)}/dh = 0, which is equal to That leads to expression h opt (a; n, f ) = A(a) x∈N Var(Y |X = x)/f (x) A 2 (a) x∈N Var(Y |X = x)/f (x) + nV 2 (a) x∈N W 2 (x) such that h opt → 0 when n → ∞. The following asymptotic relationship can be pointed out: h opt ∼ k 0 n −1 with k 0 = A(a) x∈N Var(Y |X = x)/f (x) V 2 (a) x∈N W 2 (x) .