Least-biased extrapolation of a partial inventory of butterfly fauna in Manas Range (Royal Manas National Park, Bhutan).

[...] collaboration between both authors. Author TN collected and published the field data. Author JB conducted the extrapolation procedure applied to the crude field data, discussed the results and wrote the manuscript. Both authors read and approved the manuscript.

ABSTRACT
As a rule, most biodiversity inventories at local scales remain [...] ecosystems reaches around 120 species; accordingly, the achieved sampling completeness is estimated at around 76%. Alternative estimations, based on six empirical models of species accumulation curves (namely Clench, Negative Exponential, Exponential, Logarithmic B, Power and Margalef), prove markedly less accurate than the selected least-biased extrapolation, the Clench model being the least inaccurate of them.


INTRODUCTION
Incomplete inventories of biodiversity are likely to become increasingly frequent as surveys progressively address new taxonomic groups that are more difficult to cope with, in particular those giving rise to species assemblages with a high number of species. In addition, more commonly investigated taxonomic groups are also likely to remain more or less incompletely surveyed at the local scale, because sampling efforts are often far less intensive at these small scales than they usually are across wider areas. Accordingly, most currently published inventories are admittedly more or less incomplete [1]. This incompleteness may be partially compensated (in numerical terms only) by estimating the number of "missed" (i.e. unrecorded) species, which leads to an evaluation of the total species richness of the sampled assemblage. Many different nonparametric estimators of the number of "missing" species have been proposed in recent decades (reviewed in [2,3]). As expected, these different types of estimators provide divergent evaluations of the number of unrecorded species, and no consensus has ever been reached regarding which estimator would be more accurate than the others [1]. The commonly accepted suggestion to consider all these divergent estimates, without being able to choose between them [4], remains rather frustrating. This, in turn, probably helps explain why many partial inventories are still not extrapolated numerically to derive a reliable estimate of the total species richness. Yet reliable evaluations of the richness of species assemblages would be highly desirable, at least in relative, if not in absolute, terms. Note that, even in relative terms, a relevant comparison of species richness between two or more assemblages requires that the inventories be compared at the same level of completeness. This is a mandatory condition that neither standardised sampling nor rarefaction to a common sample size can actually secure [5], contrary to what is still too often asserted in the literature. The reason is simply that the level of completeness depends not only on sample size but also, tightly, on the degree of heterogeneity of the species abundance distribution, which usually differs between sampled assemblages.
A rational method for selecting the least-biased estimator among the most commonly referenced ones has recently been developed [6,7], extending the path initiated by Brose et al. [8]. This newly derived procedure avoids the above-mentioned frustration of having to deal with divergent estimates without knowing how to choose the most accurate of them.
Hereafter, this procedure is used to extrapolate an incomplete inventory of the butterfly fauna in the Manas Range (Royal Manas National Park, Bhutan), carried out by Tshering Nidup and coworkers [9]. A reliable estimate of the "true" total species richness of the butterfly fauna within the partially sampled ecosystems of the Royal Manas National Park is thereby expected. Moreover, reliable predictions of the additional sampling effort required to improve the completeness of the inventory are derived. This, in turn, provides a rational basis for deciding whether it is worth continuing the sampling operations, by weighing the additional effort required against the expected benefit in terms of newly recorded species.

MATERIALS AND METHODS
All details concerning the environmental context of the partial inventory, and the list of butterfly species with their respective abundances, are available online with open access [9] and are therefore not repeated here. Accounting for species abundances is of prime interest for the extrapolation of partial samplings, since abundance data provide the numbers $f_1, f_2, f_3, f_4, \dots, f_x, \dots$ of species recorded respectively 1, 2, 3, ..., x times in the realised partial sampling. These numbers are required, in turn, to reliably extrapolate the species accumulation curve, as explained below.

Numerical Extrapolation of Species Accumulation beyond the Achieved Sampling Size
As sampling size increases, the number of recorded species grows monotonically, at first rapidly and then less and less quickly. The so-called 'Species Accumulation Curve' R(N) describes the growth kinetics of the number of recorded species R with increasing sampling size N (N being, typically, the number of individuals observed during sampling). The mathematical expression (and thus the detailed shape) of the Species Accumulation Curve depends upon both the total species richness of the sampled assemblage and the degree of heterogeneity of the species abundance distribution within that assemblage [1]. This would apparently make the extrapolation of the Species Accumulation Curve rather difficult to compute, since both factors are unknown a priori. Yet the numbers $f_1, f_2, f_3, f_4, \dots, f_x, \dots$ of species recorded respectively 1, 2, 3, 4, ..., x times during sampling also depend directly upon the total species richness and the degree of heterogeneity of the species abundances. This explains why the numbers $f_1, f_2, f_3, f_4, \dots$ may serve as an appropriate basis from which to extrapolate the Species Accumulation Curve beyond the actual size of the sample under consideration. In particular, the most commonly used estimators of the number of unrecorded species (i.e. nonparametric estimators such as 'Chao' and the series of 'Jackknife' estimators) are computed from the recorded values of the first numbers $f_x$ [2]. In practice, however, a problem remains: as already mentioned, each of these types of estimators provides a substantially distinct estimate, and none of them is consistently the most appropriate. Accordingly, the traditional practice has become to consider all of them together without making any choice [4], an admittedly frustrating situation.
Yet it has recently been shown that, although none of the available estimators is consistently the most accurate [8], each of them may prove, in turn, to be the least biased, depending on the value taken by $f_1$ as compared to the other $f_x$ ($x > 1$) [6]. Accordingly, in practice, the most appropriate (i.e. the least biased) estimator of the number of unrecorded species may be selected by comparing the value of $f_1$ to the values of the other $f_x$ for $x > 1$ [6,7]. Selecting the least-biased type of estimator in this way provides the best possible estimate of the number ∆ of "missing" species and, in turn, the best estimate of the total species richness $S_t$ of the partially sampled assemblage. In addition, the least-biased expression for the extrapolation of the species accumulation curve R(N) is straightforwardly derived.
In practice, the formulations summarised in Appendix 1 provide (i) the expressions of ∆, $S_t$ and R(N) according to each of the most commonly used types of nonparametric estimators and (ii) the key to select among them the least-biased estimator and, thereby, the least-biased expressions for ∆, $S_t$ and R(N). Also, in order to reduce the influence of drawing stochasticity, which affects the as-recorded values of the $f_x$, it is advisable to regress the as-recorded distribution of the numbers $f_x$ versus x.

RESULTS
The survey conducted by Nidup and coworkers yields $R_0 = 91$ recorded species from $N_0 = 1319$ observations. The recorded values of the numbers $f_x$ at the end of sampling are plotted in Fig. 1 (grey points), together with their values after regression (black points), which are the ones considered for the extrapolation of the species accumulation curve.
The extrapolations associated with six types of nonparametric estimators (Chao and the first five Jackknife estimators, orders 1 to 5) are plotted in Fig. 2. As the (regressed) values of the $f_x$ satisfy the inequality $f_1 > 4f_2 - 6f_3 + 4f_4 - f_5$, it follows that, here, the most accurate extrapolation of the species accumulation curve is the one associated with Jackknife-5 (JK-5; cf. Appendix 1). Fig. 2 and Table 1 highlight the strong differences between the extrapolations, in particular between the selected one, associated with JK-5, and those associated with JK-2, JK-1 and Chao, even though the latter are among the most widely used estimators.
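As a minimal sketch of how such a selection might be coded, the Python fragment below computes the Jackknife estimates of the number of missing species at orders 1 to 5 and steps up the orders for as long as the estimate keeps increasing. Several points are assumptions rather than statements from the text: the large-sample form of the order-k Jackknife (alternating binomial coefficients applied to $f_1, \dots, f_k$), the bias-uncorrected Chao form $f_1^2 / (2 f_2)$, the function names, and the example values of the $f_x$, which are hypothetical and are not the regressed values of Fig. 1. The Chao threshold $f_1 < 0.6 f_2$ and the JK-5 inequality are taken from the text; the authoritative selection key remains the one in Appendix 1.

```python
from math import comb


def jackknife_missing(f, k):
    """Order-k Jackknife estimate of the number of unrecorded species.

    Assumed large-sample form: sum over i = 1..k of (-1)^(i+1) * C(k, i) * f_i,
    where f[i - 1] holds f_i (number of species recorded i times).
    """
    return sum((-1) ** (i + 1) * comb(k, i) * f[i - 1] for i in range(1, k + 1))


def select_estimator(f):
    """Return (name, estimate) of the estimator retained by this sketch.

    Assumption: the key amounts to choosing Chao when f1 < 0.6 * f2 and
    otherwise raising the Jackknife order while the estimate increases.
    """
    f1, f2 = f[0], f[1]
    if f1 < 0.6 * f2:
        return "Chao", f1 ** 2 / (2 * f2)
    k = 1
    while k < 5 and jackknife_missing(f, k + 1) > jackknife_missing(f, k):
        k += 1
    return f"JK-{k}", jackknife_missing(f, k)


# Hypothetical (not the paper's) values of f1..f5, for illustration only.
f_example = [20, 10, 6, 4, 3]
name, missing = select_estimator(f_example)
print(name, missing)
```

With these hypothetical counts, $f_1 = 20$ exceeds $4f_2 - 6f_3 + 4f_4 - f_5 = 17$, so the final step of the loop reproduces the JK-5 inequality quoted above and the sketch retains Jackknife-5.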
The practical importance of selecting the most accurate extrapolation is obvious: for example, the estimated number of missing species differs by a factor ≈ 2, and the sampling size required to reach 90% (resp. 95%) completeness differs by a factor ≈ 3 (resp. ≈ 4), when comparing the extrapolations associated with Jackknife-5 and Chao (Table 1). According to the selected least-biased extrapolation of the species accumulation curve, here associated with Jackknife-5, the number of missing species is estimated at 28 and the total species richness at 91 + 28 = 119 species; accordingly, the completeness reached by the inventory is estimated at close to three quarters (≈ 76%).
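As a worked check of these figures, the completeness follows directly from the recorded and estimated richness. The symbol $C$ for sampling completeness is introduced here only for illustration and is not used elsewhere in the text:

\[
S_t = R_0 + \Delta = 91 + 28 = 119,
\qquad
C = \frac{R_0}{S_t} = \frac{91}{119} \approx 0.76
\]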

Fig. 1. The recorded values of the numbers $f_x$ of species recorded x times (grey discs) and the regressed values of $f_x$ (black discs) intended to reduce the consequences of stochastic dispersion
Although this level of sampling completeness is fair, a more thorough investigation remains desirable, since a quarter of the total number of species has still to be recorded. Most of these are expected to be comparatively rare species and thus of particular potential interest, both scientific and patrimonial. As sampling "performance" (the ratio between the number of newly discovered species and the corresponding additional effort required) decreases severely as the inventory goes on, the additional investment is expected to be heavy. This is why a reliable estimate of the additional sampling investment needed to reach a given improvement of completeness would be so useful for prospective programming.
An accurate extrapolation of the species accumulation curve opportunely answers this need: Fig. 2 shows the expected additional effort required to increase the completeness, from the present 76% level up to any higher values.
Besides, it is also possible to extrapolate the numbers $f_x$ of species that would be recorded x times after any additional sampling effort, by applying equation [A.1] to the selected extrapolation of the species accumulation curve (that is, here, $R_5(N)$). Accordingly, $f_x$ is given here by: [...]
The numbers $f_1$, $f_2$ and $f_3$ have already passed their respective maxima and accordingly decrease steadily as sampling progresses, while $f_4$ and $f_5$ reach their maximum values at sampling sizes N ≈ 1500 and N ≈ 1700, respectively (Fig. 4), and then decrease continuously. As expected, the rate of decrease of the $f_x$ slows down consistently from $f_1$ to $f_5$.
A more thorough theoretical analysis of the regulation process that applies to the series of the f x is given in [10].

DISCUSSION
To extrapolate the species accumulation curve and estimate the number of missing species, I have considered the series of the most commonly implemented types of nonparametric estimators (Chao and the first five Jackknife estimators).
All of them are based on the values taken by the numbers $f_x$ of species recorded x times at the end of sampling. Each type of estimator is, however, formulated differently and thus provides an estimate distinct from the others. Accordingly, a selection procedure is necessary to resolve this hardly acceptable ambiguity. Applying the selection procedure recently developed for this purpose [6] makes it possible to remove the ambiguity and, here, leads to retaining (i) Jackknife-5 as the least-biased estimator of the number of missed (still unrecorded) species and (ii) the expression associated with Jackknife-5 (see Appendix 1) as the least-biased extrapolation of the species accumulation curve. Incidentally, the selected estimator proves, here, to be the one with the highest value (Fig. 2). This is not surprising, since all the nonparametric estimators available in the literature (including the six types considered in the implemented procedure) are regarded as yielding under-estimates of the true number of missing species [1,2]. Accordingly, it is logically expected that the least biased among them should be the one leading to the highest estimate. In fact, this trend is quite general, as demonstrated directly from the inequalities defining the respective ranges of appropriate use of each of the Jackknife estimators (see Appendix 1 for more details).
Apart from the range of nonparametric estimators considered above, a series of purely empirical formulations of the species accumulation curve R(N) might alternatively be considered. These empirical formulations are not associated with any kind of estimator of the number of missing species, but have adjustable parameters that enable them to satisfy the following two compulsory conditions: $R(N_0) = R_0$ and $\partial R(N)/\partial N = f_1/N_0$ at $N = N_0$. Also, a model with only one adjustable parameter may easily be derived from the Margalef index, as $R(N) = a \ln(N) + 1$ (the derivation is based on the postulated independence of the Margalef index from sampling size N, which is implicit in the conception of this index, although this is practically never the case).
As already mentioned, the adjustable parameters a and b are defined, for each model, so as to satisfy both relationships $R(N_0) = R_0$ and $\partial R(N)/\partial N = f_1/N_0$ at $N = N_0$ (see Appendix 2 for the computation of the values taken by a and b in each case). Figs. 5 and 6 and Table 2 provide representations of the extrapolated species accumulation curves at sampling sizes $N > N_0$, for each of the six empirical models and for the least-biased extrapolation associated with the Jackknife-5 estimator.
All six empirical models lead to extrapolations that differ more or less markedly from the least-biased extrapolation associated with Jackknife-5.
First, the Exponential, Logarithmic B, Power and Margalef-index models are all non-asymptotic, and are therefore inappropriate for estimating the number of missing species and the resulting total species richness.
The Clench and Negative Exponential models, on the contrary, are asymptotic expressions, which may accordingly deliver finite estimates of the number of missing species and of the resulting total species richness (Table 2).
As compared to the least-biased estimate of 28 missing species, the estimates provided by the Clench and Negative Exponential models are substantially lower: 21 and 6 missing species, respectively. Therefore, here, the Clench model works better than the Negative Exponential model (the Exponential, Logarithmic B, Power and Margalef models being out of competition, as mentioned above). As regards the comparison between the extrapolation according to the Clench model, on the one hand, and the series of Jackknife estimators, on the other hand, Fig. 7 shows that, here, the Clench model delivers a better prediction than Jackknife-1 but performs less well than the species accumulation curves associated with all the other Jackknife estimators: JK-2, JK-3, JK-4 and, of course, JK-5.
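To illustrate how the two asymptotic models yield a finite number of missing species, the sketch below applies the two fitting conditions $R(N_0) = R_0$ and $\partial R(N)/\partial N = f_1/N_0$ at $N = N_0$. The model forms are assumptions, since Appendix 2 is not reproduced here: the Clench model is taken as $R(N) = aN/(1 + bN)$ and the Negative Exponential model as $R(N) = a(1 - e^{-bN})$. Under these forms $N_0$ cancels out, so only $R_0$ and $f_1$ are needed. The value $f_1 = 17$ used below is a hypothetical illustration, not a figure quoted in the text.

```python
from math import exp


def clench_missing(r0, f1):
    """Missing species under the assumed Clench form R(N) = a*N / (1 + b*N).

    The two conditions give 1 + b*N0 = r0 / f1, hence an asymptote
    a / b = r0**2 / (r0 - f1) and a shortfall of r0 * f1 / (r0 - f1).
    """
    return r0 * f1 / (r0 - f1)


def negexp_missing(r0, f1, tol=1e-12):
    """Missing species under the assumed form R(N) = a * (1 - exp(-b*N)).

    With y = b*N0, the two conditions reduce to y / (exp(y) - 1) = f1 / r0,
    solved here by bisection; the shortfall a - r0 then equals f1 / y.
    """
    target = f1 / r0
    lo, hi = 1e-9, 50.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # y / (exp(y) - 1) decreases with y, so move right while above target.
        if mid / (exp(mid) - 1) > target:
            lo = mid
        else:
            hi = mid
    return f1 / lo


R0 = 91          # recorded species (from the text)
F1_ASSUMED = 17  # hypothetical number of singletons, for illustration only
print(round(clench_missing(R0, F1_ASSUMED), 1))
print(round(negexp_missing(R0, F1_ASSUMED), 1))
```

With this assumed $f_1$, the sketch gives roughly 21 missing species for the Clench form and roughly 6 for the Negative Exponential form, the same order as the estimates discussed above; this agreement depends on the assumed input and should not be read as a reconstruction of the authors' computation.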

CONCLUSION
Incomplete inventories of local biodiversity, which are bound to become the ordinary rule in practice (at least for speciose taxonomic groups and/or for local investigations involving insufficient sampling effort), may nevertheless provide much more information than would be expected from a mere consideration of the raw recorded data. Releasing this additional information requires, however, that species inventories include not only the list of occurring species but also the abundance of each recorded species. Under this condition, extrapolating the Species Accumulation Curve beyond the achieved inventory may easily be implemented, using either nonparametric estimators of the number of missed species or, alternatively, several kinds of empirical models. The literature provides numerous types of nonparametric estimators as well as several kinds of empirical models of the species accumulation function. Reliable extrapolation, however, is conditional on the rational selection, for each inventory, of the least-biased estimator of the number of missing species among the series of estimators available in the literature. Empirical models, for their part, prove hardly appropriate, especially those with non-asymptotic expressions. Among the asymptotic empirical models, the Clench model performs more or less as the average of the non-selected nonparametric estimators (see Fig. 7), while the Negative Exponential model is very strongly negatively biased.
According to the least-biased extrapolation of the species accumulation curve (involving, for this particular inventory, the Jackknife-5 nonparametric estimator), 28 additional species still remain unrecorded by the present inventory. The 91 recorded species thus represent about three quarters of the true species richness (≈ 119 species) of the set of ecosystems investigated within the Manas Range by Tshering Nidup and co-workers.
This invites some supplementary sampling effort, applied at first to the same set of ecosystems already partially inventoried. In this perspective, the least-biased extrapolation of the species accumulation curve provides useful information that may serve to predict the level of additional sampling effort (in terms of sampling size, i.e. number of individual records) that would be necessary to reach a given increment of sampling completeness. As might be expected, the additional sampling effort needed to progress in completeness increases very rapidly; that is, the cost of recording new species becomes progressively but rapidly higher. Beyond this intuitive expectation, it is the merit of a reliable extrapolation, as plotted in Fig. 2, to quantify the rapidly increasing cost required by a continuous improvement of the completeness of the inventory. For example, increasing the completeness from the current 76% level to 90% would require multiplying the currently achieved sampling size by a factor ≈ 3 (≈ 4250 individuals to be recorded, as compared to the 1319 presently recorded). Reaching a desirable 95% level of completeness would imply increasing the present sampling size by a factor ≈ 7 (≈ 9500 individuals to be recorded against 1319).
Finally, this opens the desirable possibility of comparing, on a rational common basis, (i) the expected number of newly recorded species, many if not all of them being of potential scientific and patrimonial interest (as they are expected to be among the rarest species of the sampled assemblage) and (ii) the additional sampling effort and cost that would be required to obtain this expected number of new records.
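As a rough illustration of such a comparison, the sketch below combines the figures quoted above (91 recorded species, 1319 individuals, an estimated total of about 119 species, and approximate sampling sizes of 4250 and 9500 individuals for 90% and 95% completeness) into an average number of additional individuals per newly recorded species. The function name and the cost measure are assumptions introduced here for illustration; they are not part of the original analysis.

```python
R0 = 91        # species recorded so far (from the text)
N0 = 1319      # individuals recorded so far (from the text)
S_TOTAL = 119  # estimated total species richness (from the text)

# Target completeness -> approximate required sampling size (from the text).
TARGETS = {0.90: 4250, 0.95: 9500}


def marginal_cost(completeness, n_required):
    """Return (expected new species, extra individuals, individuals per new species)."""
    new_species = completeness * S_TOTAL - R0
    extra_individuals = n_required - N0
    return new_species, extra_individuals, extra_individuals / new_species


for level, n_required in TARGETS.items():
    new_species, extra, cost = marginal_cost(level, n_required)
    print(f"{level:.0%}: ~{new_species:.0f} new species, "
          f"{extra} extra individuals, ~{cost:.0f} individuals per new species")
```

Under these figures, the average effort per newly recorded species roughly doubles between the 90% and 95% targets, which is one simple way of expressing the rapidly rising cost described above.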
That is, the respective ranges within which each estimator benefits from minimal bias for the predicted number of missing species.
Besides, it is easy to verify that another consequence of these preferred ranges is that the selected estimator always provides the highest estimate, as compared to the other estimators. Interestingly, this mathematical consequence, of general relevance, is in line with the already accepted opinion that all nonparametric estimators provide under-estimates of the true number of missing species [1,2]. Also, this shows that the approach initially proposed by Brose et al. [8], which has regrettably suffered from its somewhat difficult implementation in practice, might advantageously be reconsidered in light of the very simple selection key above, which is far easier to use in practice.

N.B. 2:
In order to reduce the influence of drawing stochasticity on the values of the $f_x$, the as-recorded distribution of the $f_x$ should preferably be smoothed: this may be obtained either by rarefaction processing or by regression of the as-recorded distribution of the $f_x$ versus x.

N.B. 3:
For $f_1$ falling below $0.6 \times f_2$ (that is, when sampling completeness closely approaches exhaustivity), the Chao estimator may be selected: see reference [7].