How to estimate complementarity and selection effects from an incomplete sample of species

Declines in global biodiversity have inspired a generation of studies that seek to characterize relationships between biodiversity and ecosystem functioning. The metrics for complementarity and selection effects derived by Loreau and Hector in 2001 remain some of the most influential and widely used statistics for studying these relationships. These metrics quantify the degree to which the effect of biodiversity on a given ecosystem function depends on only a few species that perform well in monoculture and in mixture (the selection effect) or is independent of monoculture performance (the complementarity effect). This distinction may be useful in determining the consequences of the loss of rare versus common or dominant species in natural systems. However, because these metrics require observations of all species in a community in monoculture, applications in natural systems have been limited. Here, we derive a statistical augmentation of the original partition, which can be applied to incomplete random samples of species drawn from a larger pool. This augmentation controls for the bias introduced by using only a subsample of species in monocultures rather than having monocultures of all species. Using simulated and empirical examples, we demonstrate the robustness of these metrics, and provide source code for calculating them. We find that these augmentations provide a reliable estimate of complementarity and selection effects as long as approximately 50% of the species present in mixture are present in monoculture and these species represent a random subset of the mixture. We foresee two primary applications for this method: (a) estimating complementarity and selection effects for experimentally assembled communities where monoculture data are lacking for some species, and (b) extrapolating results from biodiversity experiments to diverse natural systems.


A.I: Selection effects
From the definition of covariance, we know that the expected value of the sample-level covariance is equal to the population-level covariance, such that E[Cov(ΔRY_S, M_S)] = Cov(ΔRY_P, M_P). We can therefore express the relationship between sample-level and population-level selection effects as

$$SE_P = Q\,\mathrm{Cov}(\Delta RY_P, M_P) = Q\,E[\mathrm{Cov}(\Delta RY_S, M_S)] = \frac{Q}{N}\,E[SE_S] \cong \frac{Q}{N}\,SE_S \qquad \text{(S1b)}$$

where E[X] is the expected value of random variable X, and the symbol ≅ indicates that (Q/N) SE_S is an unbiased estimate of SE_P (i.e. estimates are distributed around the population-level value, with some error).
As discussed in the main text, sample-level covariance only provides an unbiased estimate of population-level covariance if it has been corrected for sample size (i.e. scaled by N/(N − 1)). Because the sample-size corrected formula is used by default in most computerised methods, this difference is probably not a major concern for most users. Please see Appendix B for more details.
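The unbiasedness of the scaled sample-level selection effect can be checked numerically. The sketch below is an illustrative Python example, not the authors' R code: the species values, the pool size Q = 6, the sample size N = 3, and the helper names `cov` and `selection_effect` are all hypothetical. It enumerates every possible sample of N species from the pool and compares the average of (Q/N) SE_S with SE_P, assuming that the sample-size-corrected covariance is used at both the sample and the population level.

```python
from itertools import combinations


def cov(x, y, corrected=True):
    """Covariance of two equal-length lists; divisor n - 1 if corrected, else n."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / (n - 1) if corrected else cross / n


def selection_effect(d_ry, m, corrected=True):
    """Selection effect: number of species times cov(delta RY, M)."""
    return len(m) * cov(d_ry, m, corrected)


# Hypothetical pool of Q = 6 species (values invented for illustration).
M_P = [120.0, 80.0, 200.0, 150.0, 60.0, 95.0]    # monoculture yields
dRY_P = [0.05, -0.02, 0.10, 0.01, -0.04, 0.03]   # relative yield deviations
Q, N = len(M_P), 3

SE_P = selection_effect(dRY_P, M_P)

# Scaled estimate (Q/N) * SE_S for every possible sample of N species.
estimates = []
for idx in combinations(range(Q), N):
    m_s = [M_P[i] for i in idx]
    d_s = [dRY_P[i] for i in idx]
    estimates.append(Q / N * selection_effect(d_s, m_s))

mean_estimate = sum(estimates) / len(estimates)
print(f"SE_P = {SE_P:.4f}; mean of (Q/N) * SE_S = {mean_estimate:.4f}")
```

Individual samples can deviate substantially from SE_P; only their average over all possible samples matches it, which is what "unbiased" means here.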

A.II: Complementarity effects
As noted in Eq. (2) in the main text, we know from the definition of covariance that the expected value of the product of two random variables is equal to the product of their expected values plus their covariance, E[XY] = E[X]E[Y] + Cov(X, Y). [...] In other words, because the deviations between $\overline{M}_S$ and $\overline{\Delta RY}_S$ and their corresponding population-level means are correlated, these deviations lead to a systematic bias, proportional to their covariance.
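The source of this bias can be illustrated with a small enumeration. The following Python sketch uses an invented pool of six species and a sample size of three (all values and names are hypothetical, and this is not the authors' code). It computes the sample means of M and ΔRY for every possible sample and checks that the expected product of the two sample means exceeds the product of the population means by exactly the covariance between the sample means.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


# Hypothetical pool of Q = 6 species (values invented for illustration).
M_P = [120.0, 80.0, 200.0, 150.0, 60.0, 95.0]    # monoculture yields
dRY_P = [0.05, -0.02, 0.10, 0.01, -0.04, 0.03]   # relative yield deviations
Q, N = len(M_P), 3

# Sample means of M and delta RY for every possible sample of N species.
mean_M_S, mean_dRY_S = [], []
for idx in combinations(range(Q), N):
    mean_M_S.append(mean([M_P[i] for i in idx]))
    mean_dRY_S.append(mean([dRY_P[i] for i in idx]))

# E[XY], with X and Y the two sample means.
expected_product = mean([a * b for a, b in zip(mean_M_S, mean_dRY_S)])

# Covariance between the two sample means across all samples.
ex, ey = mean(mean_M_S), mean(mean_dRY_S)
cov_of_means = mean([(a - ex) * (b - ey) for a, b in zip(mean_M_S, mean_dRY_S)])

# Bias of the product of sample means relative to the product of population means.
bias = expected_product - mean(M_P) * mean(dRY_P)
print(f"bias = {bias:.6f}; covariance of sample means = {cov_of_means:.6f}")
```

Because each sample mean is itself unbiased, the whole discrepancy in the product comes from the covariance term.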

B.II: Estimation of covariance
One important note about the functions we provide here is that they include three potential methods for calculating covariance. Recall from the introduction that the sample-level variance is a biased estimate of the population-level variance, which is why sample variance is typically calculated as $\mathrm{var}(x) = \sum_i (x_i - \bar{x})^2 / (N - 1)$. A similar bias occurs for sample-based estimates of covariance, such that $\mathrm{cov}(x, y) = \sum_i [(x_i - \bar{x})(y_i - \bar{y})] / (N - 1)$ for sample-level estimates.
Given the default argument uncorrected_cov = FALSE, we use the standard "cov" function from R, which applies the sample-size correction and ensures that SE_P ≅ SE_S.
This approach is probably the correct one to use for most applications, and is the approach that we apply in all analyses presented in this manuscript. However, one drawback to this method is that CE + SE is no longer guaranteed to equal the true deviation in yield ∆Y. In general, this difference is probably of minor importance, since estimates of SE are typically used to determine the direction of the association between monoculture and mixture yields, rather than to precisely calculate the change in yield itself (n.b. this difference is also the reason why CE_S can be a biased estimate of the population-level statistic, while SE_S is not, even though their sum is theoretically equal to ∆Y).

Sample-level BEF partitioning
Clark et al. 2019
Nevertheless, we also include the option uncorrected_cov = TRUE, which applies the non-sample-size-corrected formula $\mathrm{cov}(x, y) = \sum_i [(x_i - \bar{x})(y_i - \bar{y})] / N$. This implementation guarantees that CE + SE = ∆Y for any mixture of species and is potentially useful for testing that the functions have been written correctly. However, it should probably not be applied in most analyses, as it no longer guarantees that SE_P ≅ SE_S (especially for small N).
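The trade-off between the two covariance formulas can be seen in a short numerical example. The sketch below is a Python analogue written for illustration only; it is not the R function provided with the paper. The function name `partition`, the yield values, and the assumption of equal expected relative yields (1/N for each species) are all hypothetical. The argument name `uncorrected_cov` is borrowed from the text, but only its TRUE/FALSE behaviour is mimicked here.

```python
def partition(Y, M, uncorrected_cov=False):
    """Additive partition for N species, assuming equal expected relative yields 1/N.

    Y: observed yield of each species in mixture
    M: monoculture yield of each species
    Returns (CE, SE, delta_Y).
    """
    n = len(M)
    d_ry = [y / m - 1.0 / n for y, m in zip(Y, M)]   # relative yield deviations
    mean_d, mean_m = sum(d_ry) / n, sum(M) / n
    cross = sum((d - mean_d) * (m - mean_m) for d, m in zip(d_ry, M))
    divisor = n if uncorrected_cov else n - 1        # N versus N - 1
    CE = n * mean_d * mean_m
    SE = n * cross / divisor
    delta_Y = sum(Y) - sum(M) / n                    # observed minus expected yield
    return CE, SE, delta_Y


# Hypothetical four-species mixture (values invented for illustration).
Y = [40.0, 20.0, 90.0, 55.0]
M = [120.0, 80.0, 200.0, 150.0]

CE_c, SE_c, dY = partition(Y, M, uncorrected_cov=False)
CE_u, SE_u, _ = partition(Y, M, uncorrected_cov=True)

print(f"corrected:   CE + SE = {CE_c + SE_c:.3f}, delta Y = {dY:.3f}")
print(f"uncorrected: CE + SE = {CE_u + SE_u:.3f}, delta Y = {dY:.3f}")
```

Only the selection effect changes between the two settings: the corrected value is the uncorrected value multiplied by N/(N − 1), so the two converge as N grows.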
Finally, we include a compromise function, uncorrected_cov = "COMP", which applies an augmented correction similar to that in Eq. (3c) in the main text: [...]

For each of these 20,000 simulated iterations of noisy observations of M and Y, we calculated two types of estimates of CE_P and SE_P. First, in order to quantify the effect of observation error, we used the full pool of Q species to calculate the classical complementarity and selection effect metrics (yellow lines in Fig. S1). Second, in order to show how our sample-level approximations of the population-level statistics were influenced by observation error, we estimated CE_P and SE_P based on incomplete samples of N species drawn from the full pool of Q species (dark blue lines in Fig. S1). Thus, these sample-level estimates were influenced by both observation error and sampling error (i.e. inaccuracies due to only partially sampling the full community of species). Lastly, for both types of metrics, [...]

In general, we found that observation error led to high uncertainty, but that this uncertainty could be effectively controlled with realistic numbers of homogeneous replicates (Fig. S1). Though error in the sample-level approximations of CE_P and SE_P was (by necessity) always higher than that estimated from the full pool of Q species, the difference [...]

[...] mown plant material is removed, as usual for extensively managed hay meadows in the region. Plots do not receive any fertilizer. Aboveground biomass is harvested twice per year shortly before mowing in late May and late August. For our case study, we used biomass data from the first harvest, which usually represents peak biomass (sampled in 2006). Biomass was harvested in rectangles of 50 × 20 cm by cutting plant material 3 cm above the soil surface. Four and two randomly distributed samples were taken in each large and small plot, respectively.
Biomass samples were sorted to sown species, weeds and detached dead material, dried at 70°C for at least 48 h and weighed.

D.II: Semi-natural grasslands
The studied semi-natural grasslands were old permanent grasslands, which are managed by mowing two times per year without fertilization, as in the Jena Experiment plots.
One study site [...]

Note that for our comparisons of Jena vs. these semi-natural grasslands, we use two different sets of sample years. We justify this inter-year comparison in three ways. First, [...]

Species abundance in mixture was calculated for each of the six plots at each site, based on sorted species-level biomass measured in the 40 × 40 cm plots nested within the larger 80 × 80 cm plots, and reported in dried g m^-2. In addition to using these values as indicators of mixture biomass for our calculations of selection and complementarity effects, we also used these species-level abundances to calculate exponentiated Shannon diversity in each plot, as $e^H = \exp(-\sum_i p_i \log p_i)$, where p_i is the relative abundance of species i.
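As a concrete illustration of the exponentiated Shannon diversity calculation, a minimal Python sketch follows. The function name `exp_shannon` and the biomass values are hypothetical, and the handling of zero-biomass species (they are simply dropped) is an assumption made here rather than something stated in the text.

```python
import math


def exp_shannon(biomass):
    """Exponentiated Shannon diversity exp(H), with H = -sum(p_i * log(p_i)).

    biomass: species-level biomass values for one plot (e.g. in g m^-2).
    """
    total = sum(biomass)
    # Assumption: species with zero biomass are dropped (p * log(p) -> 0 as p -> 0).
    p = [b / total for b in biomass if b > 0]
    H = -sum(p_i * math.log(p_i) for p_i in p)
    return math.exp(H)


# Hypothetical plots (values invented for illustration).
even_plot = [10.0, 10.0, 10.0, 10.0]     # four equally abundant species
uneven_plot = [97.0, 1.0, 1.0, 1.0]      # one dominant species

print(exp_shannon(even_plot))
print(exp_shannon(uneven_plot))
```

The index equals the number of species when all are equally abundant and falls towards 1 as a single species comes to dominate, which is why it is often read as an effective number of species.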

E.II: Confidence intervals
To estimate the confidence intervals for SE_P and CE_P shown in Fig. 5 in the main text, we proceeded in two steps. First, for each plot in each site, we parameterised a bivariate normal distribution based on the relative yield difference (ΔRY_i) and the mean monoculture