Theoretical derivation of a bias-reduced expression for the extrapolation of the species accumulation curve and the associated estimation of total species richness.

Under-sampling becomes the current situation for an increasing part of biodiversity surveys, as more and more speciose assemblages and increasingly complex taxonomic groups are progressively addressed. Accordingly, (i) extrapolating the Species Accumulation Curve and (ii) estimating the total species richness of partially-sampled species assemblages (or taxonomic-groups)


INTRODUCTION
As biodiversity inventories are continuously expanding, they progressively address less surveyed taxonomical groups which usually give rise to assemblages of numerous species, with often small sized-individuals, more or less hard to detect in the field (such as, for example, assemblages of small-or micro-invertebrates). Thus, more or less incomplete species samplings are deemed to become increasingly frequent [1].
Under-samplings immediately raise two kinds of important questions in practice: -How to properly estimate the total species richness of a partially sampled assemblage, that is the expected number of recorded species that would be recorded if the sampling was ideally complete; -How to properly estimate the expected kinetic of discovery of new species, as sampling would be carried on further, beyond its present size, that is how to estimate adequately the shape (the governing equation) of the extrapolated "Species Accumulation Curve" ('SAC'), beyond the already reached sampling-size [2].
Both issues are of major importance and thus prompted researchers to propose a series of expressions for both the nonparametric estimation of total species richness (review in [3,4]) and the extrapolation of the Species Accumulation Curve (review in [5]).Accordingly, the issue, now, is rather to select among the varied reported types of estimators, since, unfortunately, each of these types of estimators provide a substantially different results in practice [6].Indeed, a considerable amount of work has been devoted to test comparatively these different types of estimators, mainly on an empirical basis [7-13].But, as might be expected, no consensus emerged from these studies.This is because, in fact, each type of estimator may provide a centered, unbiased prediction in a very specific case only: for a particular shape of the Species Accumulation Curve, resulting in turn from a particular shape of the species abundance distribution within the sampled assemblage of species [3,13,14-16].Thus, a specific shape of 'SAC' is associated to each type of estimator of species richness, insuring the accuracy of the resulting estimations.Accordingly, the empirical approaches above hardly help to disclose any information of general value and this is the reason why, given these ambiguities, a common but rather unsatisfactory advice consists in considering all, or at least several types of estimators concurrently, without choosing between them [4].Now, apart from these rather unsuccessful, purely empirical approaches, a few semiempirical studies have more recently offered suggestive insights, trying to opportunely choose among the different types of estimators provided in the literature [17][18][19].These studies yet remain semi-empirical, as they are based on simulations specifically computed for particular kinds of species abundance distributions: 'broken-stick', 'random fraction', 'random assortment ' [17]; 'lognormal', 'log-series ' [18]; 'geometric series' [19].
Although these studies lead to broadly similar trends, in terms of selected types of estimators, they significantly differ in details, due to the different kinds of species abundance distributions involved in each study.Moreover, being based on simulations (computed from particular types of abundance distributions), all these studies lack a definite theoretical basis that could provide added soundness to the analysis and ensure more general applicability to the results.
Accordingly, a more appropriate approach might consist in addressing the issue from a less empirical, more theoretical point of view, aiming at more general applicability for the extrapolation of the 'SAC' and for the estimation of the total species richness of the sampled assemblage.
In this perspective, a theoretical based guide of choice is needed for: (i) Defining which, among the most commonly used types of species richness estimators, is able to provide the less biased estimations and, thus, should be selected according to the degree of unevenness of species abundances and the level of sampling completeness; (ii) Defining accordingly which mathematical expression is to be selected for the associated extrapolation of the 'SAC', in accordance with the selected estimator above.
A preliminary step in this respect, addressing the problem this way, was achieved by considering at first the estimators involving only the numbers f 1 and f 2 of species already recorded once and twice respectively: Restricting to f 1 and f 2 only, yet reveals progressively insufficient, as the level of sampling completeness decreases (or as the degree of unevenness of species abundances increases), as already suggested by the semiempirical studies cited above.
Thus, including the supplementary data provided by the values of the numbers f 3 , f 4 , f 5 ,… of those species recorded 3, 4, 5,… times respectively, is a further step necessary to cope with more incomplete samples.The corresponding issue is addressed in detail hereafter, starting from the same theoretical basis as for the previous developments.

GENERAL MATHEMATICAL RELATIONSHIPS CONSTRAINING THE EXPRESSION OF ANY SPECIES ACCUMULATION CURVE
The 'SAC' reflects the increasing number R(N) of recorded species with growing sampling effort N (for example, the number of collected individuals).Fundamentally, the expression of the 'SAC' is a continuously differentiable function (being understood that, in practice, the 'SAC's may not be so, due to sampling stochasticity; yet, the classical rarefaction procedure restores the theoretical shape and the genuine continuous differentiability of the 'SAC').

Two mathematical relationships of general applicability actually constrain the theoretical expression of any 'SAC':
-A general relationship between (i) the series of the successive derivatives ∂ x R (N) /∂N x of the 'SAC' and (ii) the series of the numbers f x(N) of species recorded x times: with C N, x designing the number of combinations of x objects among N (the derivation of this general relationship is provided at Appendix A.1 & A.2).
-The rule of 'additivity' [21], which stipulates that if the whole set of species, within a sampled assemblage, may be distributed among several mutually exclusive subsets (for example, taxonomic subsets such as orders, or families, or genus), then the 'SAC' for the whole assemblage must be the sum of the 'SAC's relative to each of the constitutive subsets: The classical expressions that may be considered to extrapolate the 'SAC' beyond the size N 0 of an (incomplete) sample (see [5] for a review) are unsatisfactory in these respects: they do not satisfy the rule (2) of additivity and they satisfy the constraining relationship (1) on derivatives, at best, for the first derivative only, as may easily be verified.For example, the Clench model, R(N) = S.N/(b+N), with S standing for the total species richness, complies with the additivity rule only if the parameter 'b' takes the same value for the whole set and for each of the constitutive subsets, which is almost never the case in practice.Also, the Clench model does not satisfy the relationship on derivatives beyond the first order derivative (except very specific cases).

DERIVING THE EXPRESSIONS OF THE EXTRAPOLATED SPECIES ACCUMULATION CURVE AND OF THE ESTIMATED TOTAL SPECIES RICHNESS
Satisfying the rule of additivity suggests to model the extrapolation R(N) of the 'SAC', using a polynomial function in N.More precisely, the monotonic, continuously decelerating increase of R(N) with N, and the ultimate asymptotic approach of the total species richness S, lead to the following general polynomial form: R(N) = S -A.N -1 -B.N -2 -C.N -3 -….
(3) with the coefficients S, A, B, C,… of this polynomial expression in N being dependent on the recorded data issued from the already achieved sampling, that is: the sample size N 0 , the number of recorded species R 0 (= R(N 0 )) and the numbers f 1 , f 2 , f 3 ,… of species already recorded 1, 2, 3,… times.In addition, to comply with the prescribed rule of additivity, the coefficient S, A, B, C,… should depend linearly upon N 0 , R 0 , f 1 , f 2 , f 3 ,… ; as is actually the case, as shown later.Thus, in these conditions, the addition of several 'SAC's having a polynomial form as specified by equation (3) still remains of this same polynomial form (3), as prescribed by the rule of additivity when an assemblage of species may be considered as the union of several, mutually exclusive subsets [21].On the contrary, classical models, do not satisfy the rule of additivity in general, as already underlined above.Of course, this polynomial expression (3) is specifically designed for the extrapolation of 'SAC's, that is beyond the actual sample size N 0 and should not be considered for the intrapolation (when N < N 0 ), especially when N approaches low values.
Satisfying equation (1), up to the x th derivative leads to the following system of x+1 equations, labelled (4): x ] N0 by their corresponding expressions, according to the polynomial definition (3) of R(N), converts the system (4) above in a linear system of x+1 equations, in terms of the x+1 unknown coefficients S, A, B, C, …, .This linear system, labelled (5), is, according to (3): Resolving this system (5) yields the expressions of the coefficients S, A, B, C, …, each of them being, as expected, a linear function of the recorded terms R 0 , f 1 , f 2 , f 3 ,… .The details of the resolution are given in Appendix A.3.
In the system above, the constraining relationship ( 1) is taken in consideration for the x first derivatives ∂ x R (N) /∂N x only.This, however, accounts for the essential since it is the series of the first derivatives that actually play the major role in precisely defining the shape of the (extrapolated) 'SAC'.
In practice, as will be shown later, the less complete is the achieved sampling, the larger is the number of terms in the polynomial expression (3) of R(N) which have to be considered (and accordingly, the larger the number of values of f 1 , f 2 , f 3 ,… which have to be implemented to derive the expression of the extrapolated 'SAC').
Finally, the resolution of the linear system of equations ( 5), considering successively a growing number (x+1) of equations, yields the following expressions for: (i) The extrapolation of the 'SAC': Table 1; (ii) The estimated number of missing species ∆ (= S -R 0 ): Table 2; Table 1.Successive expressions -at increasing orders m = 1 to 5 -of the extrapolated Species Accumulation Curve R(N), respectively associated to the "Jackknife" type estimators ∆ m (with S m = R 0 + ∆ m ) defined in Table 2 Table

Successive expressions -at increasing orders m = 1 to 5 -of the "Jackknife" type estimators of the estimated number ∆ m of missing species and of the estimated total species richness S m
The general solution, derived at order m (that is for the system (5) with m+1 equations), provides the estimation of the number ∆ m of missing species and the corresponding expression R m (N) of the associated extrapolation of the 'SAC'.
The derivation of the general solution is addressed in more details in Appendix A.3.
As suggested above, the less complete is the sampling under consideration, the larger is the required number of terms to be accounted for in the polynomial expression (3) of the extrapolation R(N) ; that is, the larger is the order m to be selected in Tables 1 and 2. In other words, each order m (i.e. each expression R m (N), ∆ m ) has a specific domain of appropriateness (essentially related to the degree of completeness of sampling and, also, to the degree of unevenness of the distribution of species abundances).Now, in practice, the degree of completeness and the degree of abundances unevenness are unknown a priori, but the choice for the appropriate order m is imposed, however, by the required compliance with the rule of continuity.This rule specifies that the continuity of value of R(N) (and of ∆ as well) is to be satisfied at the boundary between the domains of appropriateness of successive orders, m and m+1.That is, R m+1 (N) = R m (N) and ∆ m+1 = ∆ m , at the boundary between orders m and m+1.This, in turn, unambiguously defines the positions of the boundaries delimiting the specific domain of appropriateness of each order m and the corresponding expressions R m (N) and ∆ m to be selected in Tables 1 and 2. Thus, satisfying the continuity at the boundary between order 1 and order 2, requires that, at this boundary, R 1 (N) = R 2 (N), that is: f 1 = f 2 (see Table 1).Similarly, the rule of continuity requires that (i) at the boundary between orders 2 and 3, R 2 (N) = R 3 (N), that is: f 1 = 2f 2 -f 3 ; (ii) at the boundary between orders 3 and 4, R 3 (N) = R 4 (N), that is, f 1 = 3f 2 -3f 3 + f 4 ; etc… (see Table 1).
Further proceeding the same way, the following guide of choice is finally derived, highlighting the relevant order m to be used for the estimation ∆ m of the number of missing species and for the associated expression R m (N) of the extrapolation of the 'SAC': Table 3.

Nota:
Apart from the Jackknife-type series derived above, another commonly referred estimator of species richness is the Chao-type, which, yet, does not comply in general with the additivity rule and does not satisfy the constraining rule (1) beyond the first derivative.These are the reasons why Chao estimator is discarded in general (the same applying to the more recently derived [22] i-Chao estimator (see [21]).Yet, this lack of compliance does not hold true in the very specific (and rather unusual) circumstance when the distribution of species abundances is (sub-) even or, as well, when the sampling comes close to completeness [20,21].

Table 3. Key to select the more appropriate (less biased) choice for (i) the expression of the extrapolated Species Accumulation Curve R(N) and (ii) the estimates of the number of missing species ∆ (and the resulting estimation of the total species richness S). The key is based upon
the recorded data issued from the actually achieved sampling, i.e. the values taken by the series of numbers f x of species respectively recorded x-times (N.B.: to minimize the consequence of sampling stochasticity, the distribution of the recorded values of f x is preferably smoothened, as indicated in section 4 -Practical implementation) In these particular cases, Chao estimator even provides a strictly accurate estimates and may then arguably be preferred.It is thus possible, and even advisable, to replace the first order Jackknife estimator ∆ 1 = f 1 by the Chao estimator /(2f 2 ) and replace the extrapolation R 1 (N) associated to Jackknife-1 by the expression R C (N) associated to Chao estimator, defined by: according to [20].
The two first lines of the guide of choice above (Table 3) are then modified accordingly: Table 4.
The keys provided at Tables 3 or 4 thus allow to select (i) the more appropriate (i.e. less biased) type of estimator of the total species richness of the sampled assemblage and (ii) the associated expression of the extrapolation of the Species Accumulation Curve.

PRACTICAL IMPLEMENTATION OF THE PROCEDURE
Consider an incomplete sampling of an assemblage of species: sample size N 0 , including R 0 recorded species, among which f 1 , f 2 , f 3 ,…, f x , of them are recorded 1, 2, 3, .., xtimes respectively.In practice, to reduce the consequence of unavoidable sampling stochasticity it is generally advisable to smoothen the distribution of values of the f x [3].This is more specifically the case for the f x with x > 2, since the f x > 2 often have comparatively lower values, as such especially sensitive to stochastic dispersion.This may be obtained by classically applying a rarefaction procedure to the recorded data [3,4], thanks to what a smoothened distribution, f 1* , f 2* , f 3* ,…, f x* is derived and substituted to the originally recorded distribution f 1 , f 2 , f 3 ,…, f x .Alternatively, the smoothened distribution f 1* , f 2* , f 3* ,…, f x* may also be derived by a regression applied to the originally recorded distribution f 1 , f 2 , f 3 ,…, f x .
The guide of choice (Tables 3 or 4) and the selected expressions for the extrapolation of the 'SAC' and the associated estimate of the number of missing species (Tables 1 and 2) should thus be considered preferably on the basis of the series of values f 1* , f 2* , f 3* ,…, f x* .
-An illustrative example: A partial survey of mining moths in Burgundy involving 2605 recorded individuals encompasses 168 recorded species with f 1* = 30, f 2* = 19, f 3* = 13, f 4* = 9, f 5* = 6.The estimated number ∆ of missing species and the resulting total species richness S = R 0 + ∆ are computed, for the different types of estimators, according to Table 2.The results are provided at Table 5.The corresponding extrapolations of the 'SAC', computed according to each type of estimator, are plotted at Fig. 1.In turn, the evolution of sampling completeness R(N)/S with growing sampling size N is plotted at Fig. 2, for each type of estimator.The rather strong differences between the results obtained according to the different types of estimators highlight the importance of selecting appropriately the less biased among these different types of estimators.Referring to the guide of choice (Tables 3 & 4), it follows that, here, the order m = 5 is to be selected, since f 1 * > 4f 2 * -6f 3 * + 4f 4 * -f 5 *.Thus, the best extrapolation of the Species Accumulation Curve is given by equation R 5 (N) (see Table 1) and the best estimated number of missing species is

DISCUSSION
As already mentioned, the common practice with the various types of estimators of species richness (and the various solutions for extrapolation of the Species Accumulation Curve ['SAC']) has long been -and still is -to consider this various types of estimators altogether, without choosing among them [4], in spite of their divergent results.
Yet, this obviously unsatisfactory situation has been subsequently challenged by several authors [17-19], testing semi empirically the respective relevance of several types of estimators (Chao, Jackknife of different orders), considering different classical models of species abundance distributions ('broken-stick', 'random fraction', 'random assortment' [17], 'log-normal' and 'log-series' [18], 'geometric series' [19]).One common point, highlighted by all these studies, is that the more appropriate type of estimator is dependent upon both (i) the kind of species abundance distribution in the sampled assemblage of species and (ii) the degree of sampling completeness.This being due to the fact that the shape of the extrapolated 'SAC' -which governs the estimation of the total species richness -is dependent, itself, upon the species abundance distribution and the degree of completeness of sampling.Then, assuming (according to [17][18][19]): (i) The kind of species abundance distribution, among the six cited above (which may possibly be tested a priori on the basis of the already recorded data) and (ii) The degree of sampling completeness (which may be tested a posteriori for each type of estimator), The particular type of estimator to be selected is the one which provides an estimate in accordance with the resulting degree of completeness computed with this estimator [17-19].However, it remains that having to cope and handle with the assumptions above, which require verifications a priori and a posteriori, makes those approaches not so easy to implement in practice.
Indeed, it is in this respect that the resolutely theoretical approach described above, has proven being able to bring substantial further progress.
Instead of considering the recorded data (R 0 and the series of f x ) only as a mean to express the estimation of species richness (as in [17-19]), we highlighted, here, that this recorded data carried also essential information regarding the shape of the extrapolated part of the 'SAC', thanks to the general relationship (1) governing the shape of any 'SAC'.This relationship, linking the series of f x to the shape of the 'SAC' (ruled by the successive derivatives ∂ x R (N) /∂N x of the 'SAC') is, indeed, the key tool that allows to derive an estimation of the shape of the extrapolated part of the 'SAC' from the already recorded data.This supplementary data, extracted from the series of f x , thanks to equation ( 1), thus avoids having to make any assumptions a priori, regarding both the kind of species abundance distribution involved and the degree of sampling completeness (as is necessary according to procedures derived in [17][18][19]).This, indeed, is the very significant contribution of the theoretical approach of the question.

Table 4. Alternative version of the key provided at
Interestingly, this theoretically based, independently derived approach comforts a most important general trend already highlighted by the semi-empirical approaches [17-19]: the relevance of selecting increasing orders of Jackknife series as the degree of sample completeness decreases (and, in particular, restricting the use of Jackknife-1 or Chao estimators to sample completeness approaching exhaustivity, i.e. when f 1 < f 2 or f 1 < 0.6 f 2 , respectively).This convergence, indeed, results from the fact that the classical kinds of species abundance distributions considered in [17-19], although exhibiting significant differences, yet satisfy, for all of them, a common trend towards a negative log-linear dependence between abundances and abundances ranking [23].
Also, defining the boundaries separating the respective domains of use of the different types of estimators on the basis of the recorded values of the f x not only avoids having to make a priori assumptions on the degree of sample completeness and on the type of species abundance distribution (as just mentioned) but also satisfy the desirable continuity of the estimates at the boundaries separating the successive orders m of the estimators ∆ m and R m (N) (Tables 3 and 4).A requirement that semiempirical approaches procedures [17-19] cannot satisfy, as may easily be verified.
At last, it should also be noted that the selection of the appropriate type among available estimators, according to the procedure described above (that is Jackknife types of various orders in the general case and Chao restricted to the vicinity of full completeness), satisfies the prescribed rule of additivity (section 2, equation (2)) [21], a required property also shared by the three semi-empirical approaches [17-19], but neglected by the traditional approach [4].

CONCLUSION
In short, the derivation of the procedure described here aims at taking a maximum advantage from a preliminary theoretical analysis of the process of the species accumulation during the progressive sampling of species (or of any other kind of objects), thereby resulting in a general relationship between the already recorded data (the series of f x ) and the mathematical parameters (the series of successive derivatives) defining the shape of the extrapolated Species Accumulation Curve: equation (1).
In turn, this preliminary theoretical approach allows building a general procedure: -To define and select the more appropriate (less biased) type of estimator of total species richness, only based on the data directly issued from the incomplete sample under consideration (the f x ), without requirement of any prior assumption; -And also to define and select the more relevant (less biased) expression for the extrapolation of the Species Accumulation Curve.
Yet, it should not be forgotten that "less biased" does not signify "unbiased" and it should also be kept in mind that the implemented data (the series of f x ) is submitted to stochastic dispersion, that smoothing by rarefaction or by regression may substantially reduce but does not entirely suppress.
unrecorded Consider an assemblage of species containing an unknown total number 'S' of species.Let R be the number of recorded species in a partial sampling of this assemblage comprising N individuals.Let p i be the probability of occurrence of species 'i' in the sample This probability is assimilated to the relative abundance of species 'i' within this assemblage or to the relative incidence of species 'i' (its proportion of occurrences) within a set of sampled sites.The number ∆ of missed species (unrecorded in the sample) is ∆ = S -R.
The estimated number ∆ of those species that escape recording during sampling of the assemblage is a decreasing function ∆ (N) of the sample of size N, which depends on the particular distribution of species abundances p i : with Σ i as the operation summation extended to the totality of the 'S' species 'i' in the assemblage (either recorded or not).
The expected number f x of species recorded x times in the sample, is then, according to the binomial distribution: We shall now derive the relationship between the successive derivatives of R (N) , the theoretical Species Accumulation Curve ('SAC') and the expected values for the series of 'f x '.
According to equation (A1.2): Then, according to equation (A1) it comes: where ∆' (N) is the first derivative of ∆ (N) with respect to N. Thus: where ∆'' (N) is the second derivative of ∆ (N) with respect to N. Thus: ] which, by the same process, yields: where [∂ x ∆ (N) /∂N x ] is the x th derivative of ∆ (N) with respect to N, at point N. Now, the number of recorded species R (N) is equal to the total species richness S minus the expected number of missed species ∆ (N) .Then it comes: x /x! = R(N 0 ) + Σ x (-1) (x-1) f x(No) (N-N 0 ) x /x! /C No, x