Sensitivity analysis for high quantiles of ochratoxin A exposure distribution

Using available data from a consumption survey and contamination data on ochratoxin A (OA) in food, a sensitivity analysis (SA) for high quantiles (95th and 99th quantiles) of OA exposure distribution was carried out, obtained by a Monte Carlo simulation in French children. Exposure assessment for food contaminants is important to control the risk of foodborne diseases. Risk assessors are interested in high quantiles of contaminant exposure distributions. As these exposure distributions are generally very asymmetrical, it is difficult to obtain relevant and stable high quantiles in such a context. Determining OA exposure distribution is complex because it is based on the sum of elementary exposure distributions (eight foodstuffs are analysed here), and each one of these is the product of a consumption distribution and a contamination distribution. The SA enables us to quantify the influences of the parameter variability of the consumption and contamination probability density functions (pdf ) which have been fitted to the data, our simulation model inputs, on the 95th and 99th quantiles of the output exposure distribution. After some preliminary trials, we have postulated a quadratic polynomial regression model for the quantiles of OA exposure distribution in view of undertaking this SA. This regression model comprises 32 main factors, their 496 two-factor interactions and their 32 quadratic terms. The 32 factors are the parameters of the fitted pdf: 16 parameters of Gamma distributions relative to the eight consumed foods and 16 parameters of Gamma distributions relative to the eight food OA contaminations. For an optimal parameter estimation of such a large model, we used an experimental design approach depending on a resolution-V fractional factorial design of 6561 experiments. The factor ranges are established by a preliminary study of bootstrap sampling. From the bootstrap samples, the factor ranges are obtained taking into account the correlation between the two parameters of the fitted Gamma pdf. A full exposure distribution is simulated for each of the 6561 experiments. The consumption dependencies are taken into account by the Iman and Conover method. On the basis of this analysis, validated and useful models for each desired quantile are obtained showing a major influence of the parameters of ‘‘Cereals’’ (consumption and contamination) and slightly less so for parameter of ‘‘Pork’’ consumption in the sensitivity of the quantiles.


Introduction
Ochratoxin A (OA) is a mycotoxin produced by fungi of the genera Aspergillus and Penicillium. This mycotoxin may be a contaminant of grain stored in poor conditions, especially in Europe, and through the food chain also contaminates other foodstuffs, especially pork and poultry meat. Several toxicological studies (e.g. Pfohl-Leszkowick, 1999;JECFA, 1995;CCFAC, 1999) have shown this mycotoxin is an acute kidney toxin in humans. For these reasons, this mycotoxin remains under the close surveillance of the «Conseil Supérieur d'Hygiène Publique» in France and thus evaluating human exposure to this mycotoxin is an important public health question.
A consumption survey (ASPCC, 1999) -was conducted from June 1993 to June 1994 on 1161 French individuals (children, women and men) to estimate risk assessment exposure to OA in food. During 7 days, the participants were questioned on their consumption of several types of food. Our sensitivity analysis (SA) was restricted to children (individuals younger than 15) and to eight types of food, given in Table 1, keeping in mind that certain cover several foods, consumed by children. Table 2 gives descriptive statistics of these data for child consumers and non-consumers, and for men and women descriptive statistics are given in Verger (1999) and Gauchi (in press).
From consumption and contamination data, several exposure statistics have been computed by a nonparametric method-the NP -NP method-and a parametric method-the MP -P method (Gauchi, in press). In this paper, we use the MP -P method to evaluate the high quantiles of the OA exposure distribution in the SA. Indeed, the parametric method (the mixed parametric/parametric method, named the MP-P method in the text) seems to be very suitable for estimating relevant and stable high quantiles. This MP-P method, based on a Monte Carlo simulation, depends on the random samplings from fitted pdf (mixed parametric pdf for consumption and parametric pdf for contamination). The 95th and 99th quantiles of the exposure distribution are the model outputs (the responses) of our SA. The inputs are the parameters of the pdf fitted to the data. The SA enables us to quantify the influences of these parameters (and their interactions) on the responses via a quadratic polynomial model postulated for modelling the variation of these responses. For an optimal estimation of the parameters of the polynomial model, we used an experimental design approach depending on a resolution-V fractional factorial design of 6561 experiments. The factors of this design are the fitted pdf parameters. To simulate a meaningful range for each factor, we determine it by a parametric bootstrap approach (Efron and Tibshirani, 1993) carried out on the consumption and contamination samples. In  order to consider the correlation between the pdf parameters, three experimental domains of variation were designed inside a 95% concentration ellipse. This type of SA should be distinguished from marginal analysis: what is the effect of infinitely small changes (perturbations)? The effects in the SA are measured by the parameter vector b of the postulated quadratic polynomial regression model.
In Section 2, we present how we obtain the OA exposure distribution and the methodology of our SA. In Section 3, we give the results, and we conclude in Section 4 by a discussion.

Methods
The contamination data (mg kg À1 OA in food) come from elements independent of the consumption survey: they originate from SCOOP task 3.2.7. (see detailed reference). Table 3 gives descriptive statistics of OA contamination in the eight types of food. For poultry contamination data, the only information available was 31 values between 0 and 0.03, and 62 values between 0.03 and 0.18: the mean of 0.075 (Table 3) is calculated by the weighted arithmetic mean of the two means calculated from the two uniform distributions on [0; 0.03] and [0.03; 0.018]. On the other hand, the standard deviation of the unknown data is estimated by the Bienaymé -Tchebysheff inequality by supposing that the range [0; 0.18] encompasses 100% of the data. This inequality indicates that 75% of the data are contained in the interval [lF2r], which, with an estimation of 0.075 for l, leads here to an estimation of 0.0342 for r. In the following, poultry contamination data are simulated by a Gamma distribution with mean = 0.075 and standard deviation = 0.0342 corresponding to a shape parameter = 4.81 and a scale parameter = 64.12. Histograms of consumption and contamination data are given in Figs. 1, 2 and 3.
Several methods and tools are needed to carry out this SA. First, we present the MP-P method used for estimating an exposure distribution and the quantiles that we are interested in. We refer the reader to previous work (Gauchi, in press) for more detailed information about this method. Second, for the SA, we expose the four-step methodology used to attain our objectives: (i) postulation of a polynomial regression model; (ii) construction of the appropriate experimental design; (iii) determination of the factor ranges; (iv) validation of the model. Determining OA exposure distribution is rather complicated because it is based on the sum of eight elementary exposure distributions (one per type of food analysed), and each one of these is the product of a consumption distribution and a contamination distribution. Let us give a rigorous definition of the exposure in our context. Consider the consumption C j of one food j, divided by the individual weight, i.e. a normalized consumption, and the contamination T j in OA of this food. Then, for p foods consumed per day, we define the normalized global exposure E as the sum of the p normalized products consump-tionÂcontamination: E ¼ P p j¼1 C j T j , expressed in nanograms per kg of body weight per day, i.e. in ngÂbw À1 Âday À1 units.

The MP -P method
The MP -P method, described in detail previously (Gauchi, in press), is essentially parametric in the sense that a mixed pdf is adjusted on each food consumption and a parametric pdf is adjusted on each food contamination. For consumption, apart from cereals, the typical histogram that can be plotted for each foodstuff is shown in Fig. 4, where two zones appear corresponding to the two sub-classes: nonconsumers (sub-class of proportion h, represented by the tall bar on the left) and consumers (sub-class of proportion (1Àh), represented by the asymmetrical part). In this situation, we propose for each consumed foodstuff j to fit a mixed distribution defined by a continuous uniform distribution for the non-consum-ers and a Gamma distribution for the consumers. An example of such a mixed pdf has been given by Vose (1996). The Gamma distribution has been validated by means of the Anderson -Darling statistic (D'Agostino and Stephens, 1986) among several asymmetrical  distributions. In the same way, this statistic has been used to decide Gamma distributions for contamination data.
The Gamma pdf for a continuous random variable scale and threshold parameters, respectively, and C(r) is the usual Euler's integral. The mean and variance of the Gamma distribution are connected to r and k in the following way: The estimated parameters of the corresponding pdf are given in Table 4. The fitted Gamma distributions are presented in the histograms Figs. 2 and 3. An exposure value is calculated as follows: where: -c i,j is a random normalized consumption for the foodstuff j, drawn from the corresponding cumulative density function (cdf ) of the above-mentioned mixed distribution, whose density parameters are given in Table 4; -t i,j is a random contamination for the foodstuff j, drawn from the fitted Gamma cdf, whose density parameters are given in Table 4.
Then, with N = 10,000 outputs of form (2), we obtain an exposure simulation set S. The eight vectors of the N consumption random deviates have been arranged by means of the Iman and Conover (1982) method to be correlated with the aim of respecting the consumption dependencies. The N outputs lead to calculate various usual statistics and notably the 95th and 99th quantiles. Fig. 5 shows the simulated output exposure histogram.

Calculation of quantiles
The ath theoretical exposure quantile is here referred to as Q a and defined by: where the notation #[x i V K] means the number of x i less than or equal to K. For example, if N = 10,000, the estimate of the 0.95th quantile is the 9500th largest value of all the Ê i (S) . If aN, is not an integer, the  , respectively. These quantiles are known as the empirical quantiles.

Methodology of the SA
The SA used is based on two main joint tools: a quadratic polynomial model and a simulation design which is optimal relative to this model. Our aim is to determine 95th and 99th quantile models for clearly expressing the influence of the factors chosen and, secondly, to have at our disposal a simulation tool to easily evaluate the amplitude of a quantile variation when new inputs are given.

Polynomial model
After some preliminary trials, we assumed a full quadratic polynomial regression model for the responses (outputs) studied. This regression model is based on 32 main factors, their 496 two-factor interactions and their 32 quadratic terms. The 32 factors are the pdf parameters: 16 parameters of the Gamma distributions relative to the eight consumed foods (each of these distributions is described by three parameters; the influence of the threshold parameter is not studied because it merely demarcates the lower position of the distribution of the consumers to consider the non-consumers), and 16 parameters of the two parameter Gamma distributions relative to the eight food OA contaminations. The general form of the model is: where the b are unknown parameters, the factors z are the consumption and contamination pdf parameters, and e is an error term for which no particular probability distribution is assumed and we only suppose its expectation is zero. Model (4) is a second degree (quadratic) polynomial in z. In the following, model (4) is successively applied to the 95th and 99th empirical quantiles, Q 0.95 and Q 0.99 respectively, defined by Eq. (3).

Experimental design
In order to estimate optimally the b parameters of model (4), an appropriate simulation design is chosen. As the three-factor interactions and over are assumed negligible, a full factorial design of 3 32 experiments was not necessary, especially as it would have taken an extremely long time. Instead of this unfeasible design, we built a resolution-V fractional factorial design (Box et al., 1978) of 6561 experiments by means of the Factex procedure of the SAS software (SAS/QCR Institute). A resolution-V design allows us to estimate independently the parameters of the main effects of the factors (the z j ), the parameters of the 496 two-factor interactions (the z j z k ), and the parameters of the 32 quadratic effects (the z j 2 ) assuming that threefactor and over interactions are negligible. The factor levels were coded as À1, 0 and +1.

Factor ranges
The z factors, i.e. the parameters of the consumption and contamination pdf, are the inputs of the SA. The aim is to make the inputs vary in a reasonable range but large enough to observe how the response outputs vary. Because we had only one data set for consumption and contamination, no hypothesis of the factor variations could be made, so no supplementary surveys or contamination analyses were available. Thus, to simulate a meaningful range for each factor, we decided to determine it by a parametric bootstrap approach (Efron and Tibshirani, 1993). We considered the correlation between the shape and scale parameter of the pdf. For each consumption and contamination sample (16 in all), 10,000 bootstrap samples were obtained. A parametric bootstrap sample is obtained by sampling n times [n is the size of the data sample; see Table 2 (consumers) and Table 3 for the values of n] in the distributions fitted to the data. Table 4 gives the parameters of those distributions. From the mean and the standard deviation of each bootstrap sample, we calculated the scale and shape parameters from relation (1). As the bootstrap means were biased due to the sparseness of the data (see histograms Figs. 1-3), all the bootstrap scale and shape parameters were re-centred on the fitted param-eter values given in Table 4. As the scatter plot of the 10,000 parameter pairs (scale and shape) showed an elliptical appearance for each food consumption and food contamination, we determined a variation domain by calculating a 95% concentration ellipse, one in each zone. In each ellipse, three squared variation domains were designed for the shape and scale parameters: the first one in the lower part of the ellipse (zone I), the second one in the middle part (zone II) and the third one in the upper part (zone III). Thus, in this approach, three polynomial models were determined for the 95th and 99th quantiles. Fig. 6 gives an example of the three zones for the consumption of cereals. For example, Tables 5, 6 and 7 indicate the three factor levels À1, 0 and +1 for the lower, middle and upper zones, respectively.

Model validation
To check whether the estimated regression model was a valid approximation, we proceeded as follows: 2.2.4.1. Calculation of the adjusted squared multiple correlation coefficient.
where R is the usual multiple correlation coefficient, n is the number of observations (here n = 6561) and p is the number of parameters in the model (in Eq. (4), p = 1+32 + 496 + 32 = 561).

Examination of the residuals.
We considered the DFFITS i statistic defined as: where ŷ i is the prediction for the ith observation and ŷ (i) is the ith prediction but without the ith observation, s (i) is the estimated standard deviation after deleting the ith observation, and h i is the ith diagonal of the projection matrix for the predictor space (see Belsley et al., 1980 for more details). As the distribution of the model residuals shows an approximately normal distribution, a possible threshold is the size-adjusted cut-off recom-mended by Belsley et al. (1980) equal to 2 ffiffiffiffiffiffiffi ffi p=n p . Therefore, a value of DFFITS i above this cut-off indicates an influential residual e i . The DFFITS i was calculated by SAS software (Proc REG).

Test experiments.
A final test was performed to evaluate the model validation: the comparison between the predicted outputs for K, typically 10,000, new input combinations randomly sampled in the three squared experimental domains and the corresponding K simulated outputs. Then, we check if the K residuals are Gaussian and if their variation ranges are reasonable. We calculate the percentages de i : where y i is the observation and ŷ i the value predicted by the model to see the percentages of discrepancy between predicted and observed values for the test experiments.

Results
We fitted the model (4) to the 6561 experiments for the two responses Q 0.95 and Q 0.99 in the three zones as described in Section 2.2.3, after the factors z j had been   appropriately coded (use of orthogonal polynomials and scaling). Table 8 gives the values of the R adj 2 of the four main models in the middle part for the 95th quantile. According to these R adj 2 , the model with only the main effects was chosen in the first step (model 4 in Table 8). Indeed, the model has just 32 terms and its value of R adj 2 is still very high (R adj 2 = 0.97). The results being equivalent for the 99th quantile and the R adj 2 values being slightly smaller (about 3% less), the model with only the 32 main effects was also chosen for the 99th quantile. For both quantiles, the results were similar in the other two zones.
In the second step, with a view to parsimony, for each zone, variables were selected considering their explained sum of squares. The variables with the higher sum of squares were kept in the final model. Table 9 gives the final models with their parameter estimations for the 95th and 99th quantiles and for the three zones. The model for the 95th quantile includes only four terms (the main effects) in the lower and upper zones and five terms in the middle zone. The results are similar for the 99th quantile with a slightly smaller percentage of explained variability for each model. The results show a major influence of the parameters of the distributions on cereals (contamination and consumption). In the middle part, a slight influence of the pork consumption scale parameter appears. We noted that both the orthogonality of the simulation designs and the above-mentioned coding enable us to compare rigorously the influences of all the food distribution parameters. We obtained the marginal effect of each parameter and all of the parameters could be compared with each other because they have the same scale. On the other hand, the constant terms of the models of Table 9 represent the quantile means inside the cuboidal zones: for the 95th quantile they are 28.1, 20.1 and 16.7, and for the Table 9 Polynomial models of the 95th and 99th quantiles in the three zones of the 95% concentration ellipse with z 1 = cereals contamination scale parameter, z 2 = cereals contamination shape parameter, z 3 = cereals consumption shape parameter, z 4 = cereals consumption scale parameter, and z 5 = pork consumption scale parameter   The distributions of the residuals of the different models did not indicate outliers. Only a few DFFITS measures slightly exceeded the threshold as defined in Section 2.2.4.2. Ten thousand new combinations of inputs have been randomly sampled inside the three experimental domains. In the three zones, the distributions of the residuals of each model are approximately Gaussian and have moderate variation ranges. For example, in the middle part for the 95th quantile, the estimated standard deviation of the model residuals is estimated at 2.92 (for the 10,000 test experiments). It is higher than the estimated standard deviation (0.37) of the residuals on the 6561 experiments from which the model has been chosen, but their variation range is reasonable because 95% of these residuals have percentage, as defined in Eq. (5), between F28%. The discrepancy between the predicted and observed values for the test experiments is acceptable. Fig. 7 represents the histogram of the 10,000 residuals for the 95th quantile in the middle zone.

Discussion
On the basis of the observation, well known to risk analysts, that to obtain a good estimation of the high quantiles of exposure is a difficult operation when it is based on data which are both sparse and characterised by very asymmetrical distributions, we undertook a sensitivity analysis of the 95th and 99th quantiles of exposure to OA. We conducted an analysis of their sensitivity to probability density parameters adjusted to consumption and contamination data. That is, we measured the influence of the shapes of the consumption and contamination distributions of the studied foods to highlight those foods that possibly require particular investigation in order to better estimate the high exposure quantiles.
One of the original ideas of this work was to generate a realistic variability of data available by a parametric bootstrap method. This enabled us to construct variation intervals of the density parameters concerned, intervals presumed to be representative of the child population. Another original idea was to take into account the natural correlation of the parameters of the Gamma laws using concentration ellipses. However, we have not explored all the variation domains of the inputs: we studied the sensitivity of the 95th and 99th quantiles in three subdomains of cuboidal symmetry distributed judiciously in the variation domain (lower, middle and upper zone). The advantage of these subdomains is that they make it possible to set up orthogonal experimental designs and to take into account a large variation of the quantiles (see the constant terms of the models in the three zones). On these subdomains, it appears very clearly that a small number of parameters have a great influence on estimating these quantiles compared with the very large number of parameters envisaged at the beginning. These include mainly the parameters of the laws of consumption and contamination of the category cereals. Cereals are known by epidemiologists and toxicologists to play a major role in health problems concerning OA in Europe.
It is difficult to take into account all the volume of the variation domain of the inputs (for example, because of the non-convexity of this domain). The necessary tools are derived from the theory of optimum experimental designs. This could possibly improve our work. However, it would be appropriate beforehand to Fig. 7. Histogram of the 10,000 residuals for the 95th quantile for the 10,000 new observations in the middle part. guarantee the pertinence of our domain of variation (obtained by bootstrap sampling) ideally with data from further surveys.
Finally, it is difficult to compare our results with other results because, to our knowledge, no other research group has conducted a study on contamination by OA. The other estimations of the OA exposure quantiles found elsewhere are very empirical as indicated previously in Gauchi (in press) and the sensitivity analysis presented here is the first SA on OA exposure quantiles.