Extrapolation of the Species Accumulation Curve Associated to “Chao” Estimator of the Number of Unrecorded Species: A Mathematically Consistent Derivation

Incomplete samplings are bound to become common practice in many inventories of biodiversity, which invites extrapolating the rate at which newly recorded species would accumulate if sampling were continued further. The derivation presented here allows separate extrapolation of the numbers of species expected to be recorded 0, 1, 2 and more than 2 times, thereby permitting a rational analysis of the process of species accumulation during continuously growing sampling. Finally, the preferred range of applicability of both the "Chao" estimator and the associated extrapolation of the Species Accumulation Curve is discussed, by comparison with the alternative Jackknife-2 estimator.


INTRODUCTION
Incomplete inventories of biodiversity are likely to become increasingly frequent as surveys progressively address taxonomic groups that are more difficult to handle, in particular groups forming species-rich assemblages of tiny individuals such as small invertebrates or microinvertebrates. In addition, more commonly investigated taxonomic groups are also likely to remain more or less incompletely surveyed at the local scale, because sampling efforts at these small scales are often far lower than in larger areas [2,3].
Incomplete samplings raise two questions of high practical relevance: (i) how many "missing" species are left unrecorded by the incomplete sampling and, accordingly, what is the estimated total species richness of the assemblage (that is, the number of species that would be recorded if sampling were ideally complete); (ii) what is the extrapolated shape of the so-called "Species Accumulation Curve" beyond the currently achieved sample size, that is, how the rate of discovery of new species would vary with further increases in sample size. A major practical interest of extrapolating the species accumulation curve is the possibility of predicting quantitatively the additional sampling effort required to obtain any desired increment of sampling completeness. In other words, extrapolation makes it possible to gauge the ratio between the expected gain in newly recorded species and the corresponding additional sampling effort.
Regarding the first issue, the estimation of the number $\Delta$ of missing (unrecorded) species, many non-parametric estimators have been proposed during the last decades (reviewed in [4,5]). All these estimators are based upon the numbers $f_x$ of species recorded $x$ times during the incomplete sampling considered, especially the first two numbers, $f_1$ and $f_2$. Among the most commonly implemented non-parametric estimators are: (i) the "Chao" estimator ($\Delta_{Ch} = f_1^2/(2 f_2)$) and (ii) the series of "Jackknife" estimators of different orders. Jackknife estimators are most commonly implemented at orders 1 and 2 (i.e. $\Delta_{J1} = f_1$ and $\Delta_{J2} = 2 f_1 - f_2$), but higher orders, up to 5, should also be considered when samplings are substantially incomplete [6,7]. Regarding the second issue, the extrapolation of the species accumulation curve, a series of parametric models are classically considered (reviewed in [8]). These models are expected to fit, more or less, the main common feature of species accumulation curves considered as a whole (that is, a species accumulation rate decreasing monotonically with additional sampling effort and finally falling to zero when total species richness is reached). Yet none of these formal models has direct relevance to the process of species accumulation itself during progressive sampling and, accordingly, none of them explicitly satisfies the general mathematical relationship (equation (1)) that systematically constrains any species accumulation curve.
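As a minimal illustration of the estimators quoted above, the following sketch computes the "Chao", Jackknife-1 and Jackknife-2 estimates of the number of unrecorded species from the counts of species recorded once and twice. The function names are hypothetical, and the counts used in the usage note are arbitrary values chosen for illustration, not field data.

```python
def chao_missing(f1: float, f2: float) -> float:
    """Chao estimate of the number of unrecorded species: f1^2 / (2 f2)."""
    if f2 <= 0:
        raise ValueError("the Chao formula used here requires f2 > 0")
    return f1 ** 2 / (2.0 * f2)


def jackknife_missing(f1: float, f2: float, order: int = 1) -> float:
    """Jackknife estimates at order 1 (f1) and order 2 (2 f1 - f2)."""
    if order == 1:
        return float(f1)
    if order == 2:
        return 2.0 * f1 - f2
    raise ValueError("only orders 1 and 2 are sketched here")
```

For instance, with the arbitrary counts f1 = 10 and f2 = 5, `chao_missing(10, 5)` and `jackknife_missing(10, 5, order=2)` can be compared directly to see which estimator predicts more missing species.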
In fact, as might have been expected, it has previously been demonstrated that a specific expression of the extrapolation of the species accumulation curve is associated to each type of non-parametric estimator of the number of unrecorded species [5,7,9]. This is so because both the number of unrecorded species and the shape of the species accumulation curve depend jointly upon the same cause: the particular Distribution of Species Abundances (the so-called "S.A.D.") within the sampled assemblage of species (as already suggested implicitly in [6]). More specifically, this linkage between the type of estimator of the number of unrecorded species and the expression of the extrapolated species accumulation curve is governed precisely by the constraining mathematical relationship mentioned above (equation (1)).

EXTRAPOLATION OF THE SPECIES ACCUMULATION CURVE ASSOCIATED TO THE "CHAO" ESTIMATOR OF THE NUMBER OF UNRECORDED SPECIES
* a new derivation, complying with the mathematical requirements constraining the shape of any Species Accumulation Curve
The successive derivatives, $\partial^x \Delta(N)/\partial N^x$, of the number $\Delta(N)$ of species expected to remain unrecorded after a sampling of size $N$ are respectively related to the numbers, $f_x(N)$, of species recorded $x$ times during this sampling of size $N$:
$$\frac{\partial^x \Delta(N)}{\partial N^x} = (-1)^x\,\frac{f_x(N)}{C_{N,x}} \qquad (1)$$
with $C_{N,x} = N!/[x!\,(N-x)!]$. A detailed proof of this general theorem is given in the Appendix.
Leaving aside the very beginning of sampling (of no practical relevance), the sample size $N$ rapidly exceeds by far the values of $x$ of practical concern, so that, in practice, the preceding equation simplifies as:
$$\frac{\partial^x \Delta(N)}{\partial N^x} = (-1)^x\,\frac{x!}{N^x}\,f_x(N) \qquad (2)$$
In particular,
$$\frac{\partial \Delta(N)}{\partial N} = -\frac{f_1(N)}{N} \qquad (3)$$
$$\frac{\partial^2 \Delta(N)}{\partial N^2} = \frac{2 f_2(N)}{N^2} \qquad (4)$$
These relations have general relevance because their derivation does not require any specific assumption about the particular shape of the distribution of species abundances ("S.A.D.") in the sampled assemblage of species. Accordingly, the general equation (2) and its successive forms (3), (4), … actually constrain any theoretical form of Species Accumulation Curve.
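Let us now focus upon the case of the "Chao" estimator of the number of missing (still unrecorded) species in a sample of size $N$:
$$\Delta(N) = \frac{f_1(N)^2}{2 f_2(N)} \qquad (5)$$
Applying the general relation (2) and its particular consequences (3) and (4) to the definition (5) of the "Chao" estimator, that is, substituting $f_1(N) = -N\,\partial\Delta/\partial N$ and $f_2(N) = (N^2/2)\,\partial^2\Delta/\partial N^2$, yields:
$$\Delta(N)\,\frac{\partial^2 \Delta(N)}{\partial N^2} = \left(\frac{\partial \Delta(N)}{\partial N}\right)^2 \qquad (6)$$
The general solution of this differential equation (6) is:
$$\Delta(N) = k\,\exp(-k' N)$$
with $k$ and $k'$ as constants, independent of $N$.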
For samplings of sizes greater than the size $N_0$ of the actually realised sample, the constants $k$ and $k'$ are fixed by the conditions at $N = N_0$, namely $\Delta(N_0) = \Delta_0$ and, from equation (3), $\partial\Delta/\partial N = -f_1/N_0$ (with $f_1$ and $f_2$ denoting, here and below, the values recorded at $N = N_0$). The expression of the extrapolation specifically associated to the "Chao" type estimator is thus:
$$\Delta(N) = \Delta_0\,\exp[-(f_1/\Delta_0)(N/N_0 - 1)] \qquad (7)$$
or, equivalently, accounting for $\Delta_0 = f_1^2/(2 f_2)$:
$$\Delta(N) = \frac{f_1^2}{2 f_2}\,\exp[-(2 f_2/f_1)(N/N_0 - 1)] \qquad (8)$$
Thus, the extrapolation of the species accumulation curve, $R(N)$ [$= R(N_0) + \Delta(N_0) - \Delta(N)$], takes the following form when associated to the "Chao" type estimator:
$$R(N) = R(N_0) + \frac{f_1^2}{2 f_2}\,\{1 - \exp[-(2 f_2/f_1)(N/N_0 - 1)]\} \qquad (9)$$
* comparison with previous formulations of the extrapolation of the Species Accumulation Curves associated to "Chao" estimator
An extrapolation of the Species Accumulation Curve specifically associated to the "Chao" type estimator was previously proposed by Chao & Chiu [1]. This formulation (their equation (9)), converted into our own notation, is: [equation (10)]. As may easily be verified, this expression is formally different from expression (9) derived above and, as such, does not satisfy, as it should, the relationships (3) and (4) which actually constrain all kinds of Species Accumulation Curve. Indeed, following equation (10): […] and thus the first derivative of $\Delta(N)$ at $N = N_0$ is: […] which formally differs from the required value, $-f_1/N_0$, given by equation (3).
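The compliance of the exponential extrapolation with the constraints (3) and (4) can be checked numerically. The sketch below assumes the form $\Delta(N) = [f_1^2/(2 f_2)]\,\exp[-(2 f_2/f_1)(N/N_0 - 1)]$ obtained by solving the differential equation with the conditions at $N_0$; the function names and the numerical values in the usage note are hypothetical and serve only as an illustration.

```python
import math


def delta_chao(N: float, N0: float, f1: float, f2: float) -> float:
    """Expected number of still-unrecorded species at sample size N >= N0."""
    delta0 = f1 ** 2 / (2.0 * f2)          # Chao estimate at N0
    return delta0 * math.exp(-(2.0 * f2 / f1) * (N / N0 - 1.0))


def richness_chao(N: float, N0: float, R0: float, f1: float, f2: float) -> float:
    """Extrapolated number of recorded species: R(N0) + Delta(N0) - Delta(N)."""
    delta0 = f1 ** 2 / (2.0 * f2)
    return R0 + delta0 - delta_chao(N, N0, f1, f2)


def finite_derivatives(N0: float, f1: float, f2: float, h: float = 1.0):
    """Central finite-difference estimates of Delta' and Delta'' at N0."""
    up = delta_chao(N0 + h, N0, f1, f2)
    mid = delta_chao(N0, N0, f1, f2)
    down = delta_chao(N0 - h, N0, f1, f2)
    return (up - down) / (2.0 * h), (up - 2.0 * mid + down) / h ** 2
```

With arbitrary values such as N0 = 1000, f1 = 10 and f2 = 5, `finite_derivatives(1000.0, 10.0, 5.0)` should return values close to $-f_1/N_0$ and $2 f_2/N_0^2$, as required by (3) and (4).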
Although this non-compliance with mathematical requirements has relatively limited quantitative consequences in practice, it remains unsatisfactory on theoretical grounds. Finally, another expression for the extrapolation of the Species Accumulation Curve associated to the "Chao" estimator was formerly proposed [5,7]; it does not comply with equations (3) and (4) either and, for this reason, should be discarded.

SEPARATE EXTRAPOLATIONS OF THE NUMBERS OF SPECIES EXPECTED TO BE RECORDED ONCE, TWICE & MORE THAN TWICE, ACCORDING TO "CHAO" ESTIMATOR
The number $R(N)$ of recorded species is, of course, nothing other than the sum of the numbers $f_1(N)$, $f_2(N)$, $f_3(N)$, …, $f_x(N)$, … of those species respectively recorded 1, 2, 3, …, $x$ times. Accordingly, the evolution of the number $R(N)$ of recorded species with sample size $N$ has a complex determinism, resulting from the additive contributions of all the $f_x(N)$, each of them having its own pattern of evolution with increasing sample size $N$. Disentangling these respective contributions may thus shed some light on the mechanism underlying the evolution of $R(N)$ with $N$.
For this purpose, it is necessary to consider separately the extrapolations of each of the numbers $f_x(N)$. This is made possible by the general mathematical relationship (1) (and the associated equations (3) and (4)).
Here, I shall consider the separate extrapolations of the numbers $f_1(N)$, $f_2(N)$ and $f_{>2}(N)$ which, together, govern the evolution of $R(N)$ with increasing sample size $N$. In the specific context of the "Chao" estimator and, thus, in accordance with equations (3), (4) and (8):
$$f_1(N) = -N\,\frac{\partial \Delta(N)}{\partial N}, \qquad f_2(N) = \frac{N^2}{2}\,\frac{\partial^2 \Delta(N)}{\partial N^2}$$
that is, with $f_1$ and $f_2$ on the right-hand side denoting the values recorded at $N = N_0$:
$$f_1(N) = f_1\,(N/N_0)\,\exp[-(2 f_2/f_1)(N/N_0 - 1)] \qquad (11)$$
$$f_2(N) = f_2\,(N/N_0)^2\,\exp[-(2 f_2/f_1)(N/N_0 - 1)] \qquad (12)$$
and $f_{>2}(N) = R(N) - f_1(N) - f_2(N)$, with $R(N)$ given by equation (9).
[Figs. 1 to 5: extrapolated numbers of species (vertical axis: number of species) for initial ratios $f_1/f_2$ = 8.3, 5, 2.5, 1.0 and 0.5 respectively.]
Two successive patterns appear: (i) with increasing sampling completeness, starting from a low level (say, $f_1/f_2$ decreasing from 8.3 to 5 and even to 2.5 [Figs. 1, 2, 3]), both $f_1(N)$ and $f_2(N)$ begin to grow, then successively pass through a maximum and finally decrease slowly and asymptotically towards zero, while $f_{>2}(N)$ steadily increases at a rate sufficient to more than compensate for the decreases of $f_1(N)$ and $f_2(N)$, so that $R(N)$ remains monotonically increasing with $N$, as expected; (ii) then, as sample size continues to increase towards higher degrees of completeness (say, from $f_1/f_2$ = 1.0 to 0.5 [Figs. 4, 5]), the maxima of both $f_1(N)$ and $f_2(N)$ are left behind, so that both are already decreasing monotonically towards zero, while $f_{>2}(N)$ steadily increases at a rate sufficient to more than compensate for these decreases.
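The separate extrapolations can be explored numerically. The sketch below assumes that $f_1(N) = -N\,\partial\Delta/\partial N$ and $f_2(N) = (N^2/2)\,\partial^2\Delta/\partial N^2$ are applied to the exponential form of $\Delta(N)$ associated to "Chao", which gives $f_1 (N/N_0)\,e^{-a(N/N_0-1)}$ and $f_2 (N/N_0)^2\,e^{-a(N/N_0-1)}$ with $a = 2 f_2/f_1$. Function names and numerical values are hypothetical; the maxima are simply located on an integer grid of sample sizes.

```python
import math


def _decay(N: float, N0: float, f1: float, f2: float) -> float:
    """Common exponential factor exp[-(2 f2 / f1) (N / N0 - 1)]."""
    return math.exp(-(2.0 * f2 / f1) * (N / N0 - 1.0))


def missing_extrap(N: float, N0: float, f1: float, f2: float) -> float:
    """f0(N) = Delta(N): species still unrecorded at sample size N."""
    return f1 ** 2 / (2.0 * f2) * _decay(N, N0, f1, f2)


def f1_extrap(N: float, N0: float, f1: float, f2: float) -> float:
    """Extrapolated number of species recorded exactly once."""
    return f1 * (N / N0) * _decay(N, N0, f1, f2)


def f2_extrap(N: float, N0: float, f1: float, f2: float) -> float:
    """Extrapolated number of species recorded exactly twice."""
    return f2 * (N / N0) ** 2 * _decay(N, N0, f1, f2)


def argmax_on_grid(func, N_values):
    """Sample size of the grid at which func is largest."""
    return max(N_values, key=func)
```

For example, with the arbitrary values N0 = 1000, f1 = 50 and f2 = 10 (ratio $f_1/f_2$ = 5), calling `argmax_on_grid(lambda N: f1_extrap(N, 1000.0, 50.0, 10.0), range(1, 10001))` locates the sample size at which the number of singletons peaks, and the same call with `f2_extrap` does so for doubletons.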
Incidentally, the following general trends should be noticed: (i) $f_0(N)$ intersects $f_1(N)$ precisely when $f_1(N)$ reaches its maximum and, likewise, $f_1(N)$ intersects $f_2(N)$ precisely when $f_2(N)$ reaches its maximum; (ii) the maximum of $f_2(N)$ is reached at a sample size exactly double the sample size at which $f_1(N)$ reaches its own maximum.
Indeed, these trends are general properties of the extrapolations of all the $f_x(N)$ associated to the "Chao" estimator, as demonstrated below.
From equations (8) and (11), it follows that $f_0(N)$ ($= \Delta(N)$) intersects $f_1(N)$ at $N$ such that:
$$\frac{f_1^2}{2 f_2} = f_1\,\frac{N}{N_0}$$
that is: $N = \tfrac{1}{2} N_0\,f_1/f_2$. From equation (11), the maximum of $f_1(N)$ is reached for $N$ such that $\partial f_1(N)/\partial N = 0$:
$$1 - \frac{2 f_2}{f_1}\,\frac{N}{N_0} = 0$$
which leads to the same value of $N$ as just above: $N = \tfrac{1}{2} N_0\,f_1/f_2$.
From equations (11) and (12), it follows that $f_1(N)$ intersects $f_2(N)$ at $N$ such that:
$$f_1\,\frac{N}{N_0} = f_2\,\left(\frac{N}{N_0}\right)^2$$
that is: $N = N_0\,f_1/f_2$. From equation (12), the maximum of $f_2(N)$ is reached for $N$ such that $\partial f_2(N)/\partial N = 0$:
$$2 - \frac{2 f_2}{f_1}\,\frac{N}{N_0} = 0$$
which, again, leads to $N = N_0\,f_1/f_2$, that is, twice the sample size at which $f_1(N)$ reaches its maximum.
In complement to the mathematical demonstration above, it is interesting to highlight the underlying "physical" process behind this general pattern. Consider a sample of any size $N$ (i.e. $N$ individuals already observed) extracted from an assemblage of species having an ideally even distribution of species abundances, which is the ideal condition for the "Chao" estimator to be relevantly applied, as demonstrated in the next section. Under this specific condition, the next individual collected (making the sample size grow from $N$ to $N+1$) may belong with equal probability to any species (either a species previously unrecorded, or a species already recorded once, or twice, …, or $x$ times, etc.). The probability of drawing a species previously observed $x$ times is then expected to be proportional to the relative abundance of this group of species, reflected by the number, $f_x$, of those species already recorded $x$ times.
Accordingly, the number $f_x$ of species already recorded $x$ times will tend to: (i) increase if the probability of drawing a species already recorded $x-1$ times exceeds the probability of drawing a species already recorded $x$ times, because the probability for $f_x$ to increase by one then exceeds the probability for $f_x$ to decrease by one; (ii) decrease if the probability of drawing a species already recorded $x-1$ times is less than the probability of drawing a species already recorded $x$ times, because the probability for $f_x$ to decrease by one then exceeds the probability for $f_x$ to increase by one.
Therefore, $f_x$ is expected (i) to increase, (ii) to pass through a maximum or (iii) to decrease, depending on $f_{x-1}$ being (i) larger than, (ii) equal to or (iii) less than $f_x$ respectively. This is the fundamental, "mechanical" reason that explains the general trend highlighted above. In other words, this argument unravels the basic process underlying the pattern described and mathematically demonstrated above and graphically exemplified in Figs. 6 and 7.
[…] /(2 $f_2$)), so that: [equation (15)]
The formal correspondence between the preceding expression (15) and the expression of $\Delta(N)$ associated to the "Chao" estimator (equation (7)) confirms that the extrapolation $R(N)$ associated to the "Chao" estimator (equation (9)) corresponds, ideally, to the progressive sampling of a species assemblage with evenly distributed species abundances, as was expected.
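The link between the "Chao" estimator and an even distribution of abundances can be illustrated with exact binomial expectations. The sketch below assumes S species of equal probability 1/S and a sample of N individuals; under this assumption the ratio $f_1^2/(2 f_2)$ equals the expected number of missing species multiplied by $N/(N-1)$, which tends to 1 as N grows. The function name and the numerical values are hypothetical.

```python
from math import comb


def expected_count_even(S: int, N: int, x: int) -> float:
    """Expected number of species recorded exactly x times in a sample of
    N individuals drawn from S equally abundant species (binomial model)."""
    p = 1.0 / S
    return S * comb(N, x) * p ** x * (1.0 - p) ** (N - x)
```

With, say, S = 200 and N = 300, comparing `expected_count_even(200, 300, 1) ** 2 / (2 * expected_count_even(200, 300, 2))` with `expected_count_even(200, 300, 0)` shows the two quantities agreeing to within the factor N/(N-1).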
Now, as no species is comparatively rarer than any other when abundances are evenly distributed, progressive sampling is expected, in such a case, to reach completeness faster than for any other assemblage of the same total species richness but with a less even distribution of species abundances. Indeed, faster achievement of completeness is a characteristic feature of the Species Accumulation Curve associated to the "Chao" estimator, as compared with Species Accumulation Curves associated to any other type of estimator in the same context. This is highlighted by three examples (Figs. 8, 9, 10), in which the extrapolated Species Accumulation Curves respectively associated to the "Chao" estimator and to the "Jackknife-2" estimator are compared (both estimators rely upon the numbers $f_1$ and $f_2$ of species recorded once and twice). In these examples, the pairs of values $f_1$ and $f_2$ are chosen to examine three cases: the "Chao" estimate of the number of unrecorded species ($= f_1^2/(2 f_2)$) being (i) smaller than, (ii) equal to or (iii) larger than the corresponding "Jackknife-2" estimate ($= 2 f_1 - f_2$).
In all three cases, regardless of the sign of the gap between the "Chao" and "Jackknife-2" estimates (negative [Fig. 8], zero [Fig. 9] or positive [Fig. 10]), the Species Accumulation Curve associated to the "Chao" estimator always reaches its asymptote far more rapidly than the Species Accumulation Curve associated to the "Jackknife-2" estimator. This, once more, agrees with the fact that the extrapolation associated to "Chao" clearly refers to the hypothesis of an ideally homogeneous distribution of species abundances in the sampled assemblage.
The extrapolations of the Species Accumulation Curve respectively associated to the "Chao" and "Jackknife-2" estimators are plotted in Fig. 11.
Fig. 11. Extrapolations of the Species Accumulation Curve beyond {$N_0$, $R(N_0)$} respectively associated to the "Chao" and "Jackknife-2" estimators, for a survey of Lepidoptera of Gariwang-san (field data from [11]; $\Delta_0$ Chao = 6, $\Delta_0$ JK-2 = 12). Accordingly, the total species richness is estimated at 111 and 117 species respectively. In fact, in agreement with the procedure of selection of the less biased estimation [5,7], it is the "Jackknife-2" estimator and its associated extrapolation which are to be adopted rather than "Chao".
These extrapolations may serve to predict the sampling effort that would be necessary to reach any given level of sampling completeness. In particular, the sampling efforts predicted to reach a quasi-exhaustive species inventory (say, total species richness minus one, that is, 110 and 116 species respectively) are strikingly different depending on whether the "Chao" or the "Jackknife-2" estimator is selected: the required sampling effort is N = 4600 for the extrapolation associated to "Chao", against N = 21000 for the extrapolation associated to "Jackknife-2". This clearly highlights the importance of selecting the less biased extrapolation [5,7]. Here, between the extrapolations associated to "Chao" and to "Jackknife-2", it is the latter which ought to be adopted, according to the procedure of selection described in [5]. The level of sampling completeness of this inventory of the butterfly fauna at Mount Gariwang-san, 105/117 = 90%, thus appears fairly good.
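The prediction of the sampling effort can be sketched for the extrapolation associated to "Chao" by inverting the exponential form of $\Delta(N)$: setting $\Delta(N)$ equal to a target number of still-missing species gives $N = N_0\,[1 + (f_1/(2 f_2))\,\ln(\Delta_0/\Delta_{target})]$. The function name is hypothetical, and the values of N0, f1 and f2 in the usage note are arbitrary placeholders (they are not the Gariwang-san field data, which are not reproduced in this text). The extrapolation associated to "Jackknife-2" is not covered by this sketch.

```python
import math


def effort_to_reach(target_missing: float, N0: float, f1: float, f2: float) -> float:
    """Sample size at which the Chao-associated extrapolation leaves
    `target_missing` species unrecorded (inverse of the exponential form)."""
    delta0 = f1 ** 2 / (2.0 * f2)           # Chao estimate at N0
    if not 0.0 < target_missing <= delta0:
        raise ValueError("target must lie in (0, Delta_0]")
    return N0 * (1.0 + (f1 / (2.0 * f2)) * math.log(delta0 / target_missing))
```

As a usage example under these assumptions, `effort_to_reach(1.0, N0=1000.0, f1=12.0, f2=12.0)` returns the sample size at which only one species would be expected to remain unrecorded.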

CONCLUSION
Incomplete inventories of biodiversity invite extrapolating the species accumulation process beyond the sample size actually reached, ultimately with the aim of estimating the asymptotic, total species richness of the sampled assemblage of species. In this perspective, many attempts have been made in recent decades to find appropriate expressions for the extrapolation of the Species Accumulation Curve (reviewed in [8]), each of these expressions being supposed to come as close as possible to some hypothetical "characteristic feature" of Species Accumulation Curves. In fact, all these attempts were bound to fail in some respect, being confronted with the severe difficulty of identifying a common and generalizable feature for an entity as polymorphous as the Species Accumulation Curve actually is.
Hence the recent attempt by Chao & Chiu [1] to circumvent the difficulty by limiting the scope and focusing only upon the very specific case in which the species abundance distribution is ideally even, or close to being so. This considerably reduces the polymorphism of the Species Accumulation Curve, which may accordingly be extrapolated more accurately. Yet their extrapolation was not derived in a strictly satisfactory manner, as has been shown above.
In fact, as suggested previously and demonstrated here, a general feature valid for all theoretical forms of Species Accumulation Curves $R(N)$ (i.e. independent of the type of species abundance distribution) does exist. It is derived from equation (1) above, that is:
$$\frac{\partial^x R(N)}{\partial N^x} = (-1)^{x-1}\,\frac{f_x(N)}{C_{N,x}}$$
This equation actually constrains the detailed shape of any kind of Species Accumulation Curve by controlling the series of its derivatives, $\partial^x R(N)/\partial N^x$.
Accordingly, satisfying this general relationship is a prerequisite for any relevant attempt to extrapolate the species accumulation process beyond an actual incomplete sampling. The general relevance of this constraining relationship makes it possible, in turn, to address the extrapolation of the Species Accumulation Curve for any type of species abundance distribution.
Coming back to the specific case of an ideally even distribution of species abundances, dealt with by Chao & Chiu [1], we proposed an alternative expression for the extrapolation which is mathematically consistent (that is, in accordance with equation (1)). This formulation differs formally from the expression proposed by the preceding authors and should therefore be considered more reliable.

A.1 - Derivation of the constraining relationship between $\partial^x R(N)/\partial N^x$ and $f_x(N)$
The shape of the theoretical Species Accumulation Curve is directly dependent upon the particular Species Abundance Distribution (the "S.A.D.") within the sampled assemblage of species. This means that, beyond the general traits shared by all Species Accumulation Curves, each particular species assemblage gives rise to a specific Species Accumulation Curve with its own unique shape when considered in detail. Now, it turns out that, in spite of this diversity of particular shapes, all Species Accumulation Curves are nevertheless constrained by the same mathematical relationship, which rules their successive derivatives (and thereby rules the details of the curve shape, since the successive derivatives together define the local shape of the curve in full detail). Moreover, this general mathematical constraint relates bi-univocally each derivative of order $x$, [$\partial^x R(N)/\partial N^x$], to the number, $f_x(N)$, of species recorded $x$ times in the sample of size $N$ considered. And, as the series of the $f_x(N)$ obviously depends directly upon the particular Distribution of Species Abundances within the sampled assemblage of species, it follows that this mathematical relationship between $\partial^x R(N)/\partial N^x$ and $f_x(N)$ ultimately reflects the indirect but strict dependence of the shape of the Species Accumulation Curve upon the particular Distribution of Species Abundances (the so-called S.A.D.) within the assemblage of species under consideration. In this respect, this constraining relationship is central to the process of species accumulation during progressive sampling, and is therefore at the heart of any reasoned approach to the extrapolation of any kind of Species Accumulation Curve.
This fundamental relationship may be derived as follows.
Let us consider an assemblage of species containing an unknown total number $S$ of species. Let $R$ be the number of recorded species in a partial sampling of this assemblage comprising $N$ individuals. Let $p_i$ be the probability of occurrence of species $i$ in the sample. This probability is assimilated to the relative abundance of species $i$ within this assemblage or to the relative incidence of species $i$ (its proportion of occurrences) within a set of sampled sites. The number $\Delta$ of missed species (unrecorded in the sample) is $\Delta = S - R$.
The estimated number $\Delta$ of those species that escape recording during sampling of the assemblage is a decreasing function $\Delta(N)$ of the sample size $N$, which depends on the particular distribution of species abundances $p_i$:
$$\Delta(N) = \sum_i (1 - p_i)^N \qquad (A1.1)$$
with $\sum_i$ as the summation extended to the totality of the $S$ species $i$ in the assemblage (either recorded or not). The expected number $f_x$ of species recorded $x$ times in the sample is then, according to the binomial distribution:
$$f_x = C_{N,x} \sum_i p_i^x\,(1 - p_i)^{N-x} \qquad (A1.2)$$
We shall now derive the relationship between the successive derivatives of $R(N)$, the theoretical Species Accumulation Curve, and the expected values for the series of $f_x$.
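According to the binomial expression of $f_x$ above, taken for $x = 1$:
$$f_1 = N \sum_i p_i (1-p_i)^{N-1} = N \sum_i [1 - (1-p_i)]\,(1-p_i)^{N-1} = N\,[\Delta(N-1) - \Delta(N)] = -N\,\Delta'(N)$$
where $\Delta'(N)$ is the first derivative of $\Delta(N)$ with respect to $N$ (taken here as the difference $\Delta(N) - \Delta(N-1)$). Thus:
$$f_1 = -N\,\Delta'(N)$$
Similarly:
$$f_2 = C_{N,2} \sum_i p_i^2 (1-p_i)^{N-2} = C_{N,2} \sum_i [1 - (1-p_i)(1+p_i)]\,(1-p_i)^{N-2} = C_{N,2}\,\{[\Delta(N-2) - \Delta(N-1)] - f_1/N\} = C_{N,2}\,[-\Delta'(N-1) + \Delta'(N)]$$
where $\Delta''(N)$ $[= \Delta'(N) - \Delta'(N-1)]$ is the second derivative of $\Delta(N)$ with respect to $N$. Thus:
$$f_2 = C_{N,2}\,\Delta''(N)$$
which, by the same process, yields:
$$f_3 = C_{N,3}\,[-\Delta'(N-2) + \Delta'(N-1) - \Delta''(N)] = C_{N,3}\,[\Delta''(N-1) - \Delta''(N)]$$
where $\Delta'''(N)$ $[= \Delta''(N) - \Delta''(N-1)]$ is the third derivative of $\Delta(N)$ with respect to $N$. Thus:
$$f_3 = -C_{N,3}\,\Delta'''(N)$$
Now, generalising for the number $f_x$ of species recorded $x$ times in the sample:
$$f_x = C_{N,x} \sum_i p_i^x (1-p_i)^{N-x} = C_{N,x} \sum_i (1-p_i)^{N-x}\,\Big[1 - (1-p_i) \sum_j p_i^j\Big]$$
with $\sum_j$ as the summation from $j = 0$ to $j = x-1$. It comes:
$$f_x = C_{N,x}\,\Big\{\sum_i (1-p_i)^{N-x} - \sum_i (1-p_i)^{N-x+1} - \sum_k \sum_i p_i^k (1-p_i)^{N-x+1}\Big\}$$
with $\sum_k$ as the summation from $k = 1$ to $k = x-1$; that is:
$$f_x = C_{N,x}\,\Big\{\Delta(N-x) - \Delta(N-x+1) - \sum_k \frac{f_k^*}{C_{(N-x+1+k),k}}\Big\}$$
where $C_{(N-x+1+k),k} = (N-x+1+k)!/[k!\,(N-x+1)!]$ and $f_k^*$ is the expected number of species recorded $k$ times during a sampling of size $(N-x+1+k)$ (instead of size $N$).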
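The same demonstration that yielded the expression of $f_1$ above applies to the $f_k^*$ (with $k$ up to $x-1$) and gives:
$$f_k^* = (-1)^k\,C_{(N-x+1+k),k}\,\Delta^{(k)}(N-x+1+k)$$
where $\Delta^{(k)}(N-x+1+k)$ is the $k$th derivative of $\Delta(N)$ with respect to $N$, at point $(N-x+1+k)$. Then,
$$f_x = C_{N,x}\,\Big\{-\Delta'(N-x+1) - \sum_k (-1)^k\,\Delta^{(k)}(N-x+1+k)\Big\}$$
which finally yields, the successive terms combining step by step into derivatives of increasing order:
$$f_x = (-1)^x\,C_{N,x}\,\Delta^{(x)}(N)$$
That is:
$$f_x(N) = (-1)^x\,C_{N,x}\,\frac{\partial^x \Delta(N)}{\partial N^x}$$
where $\partial^x \Delta(N)/\partial N^x$ is the $x$th derivative of $\Delta(N)$ with respect to $N$, at point $N$. Conversely:
$$\frac{\partial^x \Delta(N)}{\partial N^x} = (-1)^x\,\frac{f_x(N)}{C_{N,x}} \qquad (A1.9)$$
Note that, in practice, leaving aside the beginning of sampling, $N$ rapidly becomes much greater than $x$, so that the preceding equation simplifies as:
$$\frac{\partial^x \Delta(N)}{\partial N^x} = (-1)^x\,\frac{x!}{N^x}\,f_x(N) \qquad (A1.10)$$
In particular:
$$\frac{\partial \Delta(N)}{\partial N} = -\frac{f_1(N)}{N} \qquad (A1.11)$$
$$\frac{\partial^2 \Delta(N)}{\partial N^2} = \frac{2 f_2(N)}{N^2} \qquad (A1.12)$$
This relation (A1.9) has general relevance since it does not involve any specific assumption about either (i) the particular shape of the distribution of species abundances in the sampled assemblage of species or (ii) the particular shape of the species accumulation rate. Accordingly, this relation constrains any theoretical form of species accumulation curve. As already mentioned, the shape of the species accumulation curve is entirely defined (at any value of sample size $N$) by the series of the successive derivatives [$\partial^x R(N)/\partial N^x$] of the predicted number $R(N)$ of recorded species for a sample of size $N$:
$$\frac{\partial^x R(N)}{\partial N^x} = (-1)^{x-1}\,\frac{f_x(N)}{C_{N,x}} \qquad (A1.13)$$
with $\partial^x R(N)/\partial N^x$ as the $x$th derivative of $R(N)$ with respect to $N$, at point $N$, and $C_{N,x} = N!/[(N-x)!\,x!]$ (since the number of recorded species $R(N)$ is equal to the total species richness $S$ minus the expected number of missed species $\Delta(N)$).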
As above, equation (A1.13) simplifies in practice as:
$$\frac{\partial^x R(N)}{\partial N^x} = (-1)^{x-1}\,\frac{x!}{N^x}\,f_x(N) \qquad (A1.14)$$
Equation (A1.13) makes quantitatively explicit the dependence of the shape of the species accumulation curve (expressed by the series of the successive derivatives [$\partial^x R(N)/\partial N^x$] of $R(N)$) upon the shape of the distribution of species abundances in the sampled assemblage of species.
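The constraining relationship can be verified numerically for an arbitrary, uneven distribution of abundances. The sketch below assumes that the derivatives of $\Delta(N)$ are taken as backward finite differences (so that the first derivative is $\Delta(N) - \Delta(N-1)$); under this assumption the relation $f_x = (-1)^x\,C_{N,x}\,\Delta^{(x)}(N)$ holds exactly for the binomial expectations. The function names and the abundance vector in the usage note are hypothetical.

```python
from math import comb


def expected_missing(p, N: int) -> float:
    """Expected number of unrecorded species: sum_i (1 - p_i)^N."""
    return sum((1.0 - pi) ** N for pi in p)


def expected_fx(p, N: int, x: int) -> float:
    """Expected number of species recorded x times (binomial model)."""
    return comb(N, x) * sum(pi ** x * (1.0 - pi) ** (N - x) for pi in p)


def backward_difference(p, N: int, x: int) -> float:
    """x-th backward finite difference of the expected number of missing species."""
    return sum((-1) ** j * comb(x, j) * expected_missing(p, N - j)
               for j in range(x + 1))


def fx_from_derivative(p, N: int, x: int) -> float:
    """f_x recovered as (-1)^x C(N, x) times the x-th difference of Delta."""
    return (-1) ** x * comb(N, x) * backward_difference(p, N, x)
```

For example, with the arbitrary relative abundances `p = [0.4, 0.3, 0.15, 0.1, 0.05]` and N = 50, `fx_from_derivative(p, 50, x)` can be compared with `expected_fx(p, 50, x)` for x = 1, 2, 3.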

A.2 - An alternative derivation of the relationship between $\partial^x R(N)/\partial N^x$ and $f_x(N)$
Consider a sample of size $N$ ($N$ individuals collected) extracted from an assemblage of $S$ species, and let $G_i$ be the group comprising those species collected $i$ times and $f_i(N)$ their number in $G_i$. The number of collected individuals in group $G_i$ is thus $i\,f_i(N)$, that is, a proportion $i\,f_i(N)/N$ of all individuals collected in the sample. Now, each newly collected individual will belong either to a new species (probability $1 \cdot f_1/N = f_1/N$) or to an already collected species (probability $1 - f_1/N$), according to [12]. In the latter case, the proportion $i\,f_i(N)/N$ of individuals within the group $G_i$ accounts for the probability that the newly collected individual will increase by one the number of species belonging to the group $G_i$ (that is, will generate a transition [$i-1 \to i$] by which the species to which it belongs leaves the group $G_{i-1}$ to join the group $G_i$). Likewise, the probability that the newly collected individual will reduce by one the number of species belonging to the group $G_i$ (that is, will generate a transition [$i \to i+1$] by which the species leaves the group $G_i$ to join the group $G_{i+1}$) is $(i+1)\,f_{i+1}(N)/N$.