Prediction of total silicon concentrations in French soils using pedotransferfunctions from mid-infrared spectrum and pedological attributes

16 Silicon (Si) is the second most abundant element of the Earth’s crust, and its terrestrial cycle 17 depends on soil, vegetation, and human activities. The spatial extent of terrestrial Si 18 perturbation is poorly documented since maps of Si concentration in soils are rare. In 19 addition, Si content is rarely measured in non-paddy soil databases. Here we demonstrate that 20 pedotransfer functions based on either pedological attributes (particle size fraction, pH, 21 organic carbon, cation exchange capacity, calcium carbonate and parent material) or mid 22 infrared spectra (MIRS) can be used to accurately predict total Si concentration. In this 23

research, we utilised a unique dataset from the French monitoring network of soil quality 24 (RMQS -Réseau de Mesures de la Qualité des Sols) database. Pedotransfer functions were 25 built using a regression tree model on a subset of the data for which total Si concentration was 26 measured. To compare the relative performance of the models obtained for the two different 27 sources of data, a suite of performance indicators were calculated. Our results showed that 28 PTF based on MIR spectra produces highly accurate and precise estimates of the total Si 29 concentration for French soils. The pedological PTF is less accurate, but still provides a good Indeed, despite being considered as a non-essential element, Si is encountered in most 46 terrestrial plants with concentrations highly variable, ranging from less than 0.2 to more than 47 10 % dry weight (Ma and Takahashi, 2002). Meanwhile, plant Si is a significant pool of the 48 global Si cycle as evidenced from the total annual biogenic Si retention in terrestrial plants,  MIRS were also acquired on RMQS samples (Grinand et al., 2012). 0.5-g aliquots of < 0.2-102 mm ground sample were scanned from 4000 to 400 cm -1 (i.e., 2500-25,000 nm) at 4 cm -1 103 resolution using a Nicolet 6700 Diffusive Reflectance Fourier Transform Spectrophotometer 104 (Thermo Fisher Scientific Instruments, Madison, WI, USA). Then, 32 scans per sample were 105 acquired and averaged. Spectra were recorded as absorbance. 106 MIRS were pre-processed, before statistical modelling to reduce baseline variations, enhance 107 spectral features, reduce the particle-size scattering effect, remove linear or curvilinear trends 108 of each spectrum, or remove additive or multiplicative signal effects (Boysworth and Booksh, 109 2008). The pre-processing routine consisted of an 11 bands window smoothing  Golay filter (Savitzky and Golay, 1964) using the sgolayfilt function from the signal R 111 package (Ligges et al., 2015) followed by a standard normal variate (SNV, Barnes et al., 112 1989) transform. (1) basic pedological attributes including particle size fraction, pH, organic carbon, cation 129 exchange capacity, calcium carbonate and parent material, or 130 (2) pre-processed MIRS data, 131 These PTFs are termed as pedological PTF and MIRS PTF hereafter. The variables of the 132 pedological PTF were chosen from the most common soil analytical variables in soil 133 databases. For that reason, parent material was also included, while soil type was not as it had 134 only a limited influence in the model (data not shown).  Moreover, this type of approach is flexible as it can handle both quantitative and qualitative 145 variables as well as spectral data, which allows having a unique approach for both PTFs, and 146 thus their results can be fairly compared. 147 To optimize the Cubist model, two parameters can be adjusted: the number of model trees as 148 ensembles (committees) and the number of nearest-neighbors to adjust the prediction of the 149 rules (neighbors). To optimize the model parameters, we used the train function in the caret 150 R package. The tuning parameter 'neighbors' was held constant at a value of zero to avoid 151 shortcomings in the interpretability of the rules by local averaging. The optimal numbers of 152 'committees' was found to be 5 for the two PTFs. 153 Calibration and evaluation steps: The modelling approach is summarized in Figure 2. we 154 used a leave p out cross-validation approach combined with a bootstrap step (James et al., 155 2013). The p cross-validation leaves out a p proportion of samples for validation. We used a 156 75%-25% split for calibration and validation respectively, and this procedure was repeated 10 157 times. Cross-validation allows a more robust assessment of the quality of the prediction. The 158 subdivision was performed using the conditioned Latin Hypercube Sampling (cLHS) method 159 (Minasny and McBratney, 2006). This method is a stratified random procedure that provides 160 an efficient way of sampling variables from their multivariate distributions. The bootstrap step 161 involved simulating 100 datasets by random sampling with replacement from 95% of the 162 calibration dataset (formed at the cross-validation step). This whole procedure generated 100 163 Cubist models in the calibration procedure. These outcomes were used to build the 164 distribution of the prediction. The mean prediction could be obtained from the average of the 165 100 bootstrapped models. implemented in clhs package (Roudier, 2011). 173 The CRPS is a distance criterion, which is a positive value and should be close to 0. The 204 prediction is expressed in terms of a probability distribution rather than a single value. The 205 CRPS compares the cumulative probability distribution of the predicted value to the observed 206 value. In our case, we only took into account the uncertainty of the prediction and assumed 207 the uncertainty of the observation is small. The probability distribution of our observation is 208 set to equal to 1 for the observed value and null elsewhere. As a distance, the CRPS can be 209 linked to the mean absolute error used in the deterministic prediction. It uses the information 210 provided by the probabilistic prediction instead of just using the mean of the median value.  Table 2). Considering the whole dataset, the PCAs showed a good overlap 238 between the RMQS sites with and without Si measurement ( Figure 6). Therefore, we can 239 consider the Si dataset to be representative of the whole RMQS. 240 241

242
The validation of the MIRS PTF estimating Si content was excellent with an R² of 0.96. 243 Estimates from this PTF were unbiased, and their average RMSE is 15.31 g kg -1 (Table 3). 244 The average CRPS was very close to the RMSE value. The validation of the pedological PTF 245 was also very good with an R² of 0.87. Estimates from the pedological PTF were slightly 246 biased, with an average RMSE of 26.48 g kg -1 . Finally, the average CRPS were larger 247 indicating higher prediction uncertainty. The results of leave p out cross-validation were also 248 used to compute indicators of the variability of the performance indicators (Table 3)

283
Regarding pedological significance, the MIRS PTF uses mostly combination-overtones bands 284 of quartz ranging from 1800 to 2000 cm -1 (Table 4), to predict total Si concentration, which is 285 expected, as quartz is a mineral composed of Si and oxygen (O) atoms (SiO 2 ) (Figure 8a). 286 This region of the spectrum presents the peak with the most important weight (>80%) around 287 2000 cm -1 followed by two other peaks around 1800 and 1900 cm -1 (> 40%, Figure 8b, Table  288 4). The carbonate concentration also has a role in the MIRS PTF, with carbonates bands 289 ranging from 2400 to 3100 cm -1 , which correspond to CaCO 3 bonds (Table 4). It exhibits 290 three peaks of average weight > 20%, one around 2500 and two around 3000 cm -1 (Figure 8b). 291 As shown in Figure 8a, samples with low Si concentration contain carbonates while samples 292 with high Si concentration do not. This link is due to the absence of Si in carbonates (Table  293 4). Bands related to Si-O bond ranging from 1400 to 400 cm -1 also presents a noticeable 294 weight in the PTF (Figure 8b).
For the pedological PTF, the two most important predictors are the organic carbon and the 296 carbonate concentrations, with an average weight of 78.6% and 73.6% respectively. 297 Carbonates as discussed for the MIRS PTF act as a diluent for Si, which is also the case of 298 organic carbon. No significant organic carbon contribution was observed in the MIRS PTF 299 probably because, in MIRS, "organic carbon cannot be identified with clearly separated peaks 300 but as a whole spectral region with overlapping bands", as stated by Grinand et al. (2012). In 301 the pedological PTF, an important influence of the sand fraction could be expected as a 302 positive correlation between total Si and both fine and coarse sand is observed (Kendall's 303 correlation coefficient: tau = 0.17 and p-value = 3.585 10 -11 ; tau = 0.11 and p-value = 3.108 304 10 -5 , respectively). Indeed, these two variables have an average weight of 37% in the 305 pedological PTF (Figure 9). In addition, the clay fraction also has an important weight in the 306 pedological PTF (60%) with a negative correlation between the total Si concentration and the 307 clay fraction (Kendall's correlation coefficient: tau = -0.46 and p-value < 2.2e-16). 308 As a conclusion, the two PTFs were mainly underlined by the same processes: dilution of the 309 Si concentration by carbonates, organic carbon and possibly the clay fraction to a lesser extent 310 and concentration due to the presence of quartz mainly in the sand fractions. 311 312 We compared pedological PTF predictions of total Si concentration for the non-Si analysed 313 RMQS sites to that predicted by the MIRS PTF ( Figure 10). The relative difference of 314 predictions between the two PTFs is small, 90% of the time, the difference between the 2 315 PTFS is less than 20%. This result highlights the consistency of the two PTFs and confirms 316 that despite less accurate, the pedological PTF gives a reasonable estimation of the soil Si 317 concentration for the whole dataset. The soil observations of this study came from a 318 systematic probability sampling which leads to good spatial coverage, i.e. the sites are 319 uniformly spread over France. This design proves to be efficient in providing accurate estimates of means over the whole area and can be used to generalize the results for the whole 321 area of France (Brus and Saby, 2016). concentrations (320.7 g kg⁻¹). Nevertheless, the Mann and Whitney test shows no significant 331 difference between the two datasets (p-value = 0.5757). For the European territory, the 332 comparison is shown in Figure 11. The two datasets cover the same range of Si concentration 333

3.4-Domain of potential application of the developed PTFs
with an over-representation of the soils with Si concentration ranging from 350 to 400 g kg -1 334 in French compare to other European soils, resulting in contrasted median Si-concentrations 335 of 327.2 g kg -1 and 313.9 g kg⁻¹ respectively. When looking at the Si average, the Mann and 336 Whitney test also shows a small significant difference between the two datasets (0.05 > p-337 value = 0.02273 > 0.01). This result was expected since France is one of the countries 338 exhibiting the largest soil diversity in the world (Minasny et al., 2010). Thus, the established 339 PTFs can be applied at the Europeans scale to predict total soil Si concentrations at a higher 340 spatial density than that provided by the GEMAS study with the exception of some soil types 341 that are not represented in France, such as Chernozems, Kasternozem, Solonetz.  Table 4.