Is citizen science an open science in the case of biodiversity observations?

Quentin Groom, Lauren Weatherdon, Ilse R. Geijzendorffer


Introduction
Citizen science in biodiversity research covers a wide variety of volunteer activities, from the collection of casual observations through to conducting detailed species monitoring (Wiggins & Crowston 2011). The skills of citizen scientists range from general members of the public with little scientific experience through to expert amateurs and retired professionals. The EuMon project has calculated that volunteers outnumber professionals 18 to 1 in species monitoring in Europe (EuMon 2015). The time that citizen scientists invest voluntarily is generally interpreted as being motivated primarily by a wish to contribute to society, with the implication that society should profit from this effort through openly accessible data. For example, in the European Union's Digital Agenda for Science, citizen science is listed as a subcategory of open science (European Commission 2015), implying that citizen science data are open, permissively licensed and available to all. However, the motivations of citizen scientists are diverse and include a general interest in a specific species or question, involvement in a community with similar interests, recognition for personal achievements, learning new skills and contributing to environmental activism (Bell et al. 2008; Rotman et al. 2012; Tulloch et al. 2013). Given the diversity of their motivations, data sharing could have many potential advantages for citizen scientists, such as ensuring the persistence of their data, safeguarding their scientific legacy and increasing the visibility and impact of their observations through use in others' research. Likewise, citizen scientists can benefit from the openness of others' data for their own projects.
Access to citizen scientists' data is essential not only for science, but also for continual monitoring and environmental impact assessment. For example, continental and global policy instruments, such as the Convention on Biological Diversity, have a pressing need for biodiversity data to fulfil their reporting requirements (Geijzendorffer et al. 2015). Open access to biodiversity data promotes their use, encourages novel applications and supports reproducible science (Arzberger et al. 2004; Tenopir et al. 2011; Thessen & Patterson 2011). This is reflected in the increasing openness of governmental and taxpayer-funded environmental data (e.g. http://www.data.gov/; http://data.gov.uk/; http://data.gov.be/). Likewise, scientists are becoming more open with their data and publications (Piwowar 2011). Some scientific journals, such as the Journal of Applied Ecology and other British Ecological Society journals, mandate that data used within research articles are deposited in public digital repositories. Funding agencies are also actively requesting open access to research data; for example, the European Framework Programme for Research and Innovation, Horizon 2020, is conducting a pilot on open access to research data.
In this paper, we examine the openness of biodiversity observation data in relation to the sources of these data to identify the relative openness of citizen science data.

Materials and methods
The Global Biodiversity Information Facility (GBIF) is a long-standing, influential global resource on biodiversity distribution information with a pan-taxonomic approach. It is funded by governments of participating countries to provide open information exchange on biodiversity. As such, it is a highly suitable data set for this analysis. Observation data from GBIF are only one of a diverse range of data types collected by citizen science projects. However, unlike many other data sets, the data available in GBIF are accessible with comparatively clear licensing (Table 1).
Data set metadata were extracted from GBIF using R (version 3.2.0) on 19 January 2016 using the 'rgbif' package (version 0.9.0) (Chamberlain et al. 2015). Where the legal right to use the data was explicitly mentioned, it was represented by either a short rights statement in the metadata or a link to a longer licence document. The 'rights' statement or the URL of a licence was extracted for all occurrence and survey data sets with one or more observations. A total of 12,458 data sets were extracted, but only 11% of these data sets included an explicit data-usage-rights statement in the data set metadata. The licensing information can be found in three places in the 'rgbif' output: once in the data set metadata and twice in the occurrence record, wherein the licensing is noted in both the 'rights' and 'accessRights' fields. The 'rights' field is a deprecated term that preceded the 'accessRights' term, originating from the Darwin Core standard. Darwin Core is used in GBIF to define fields in the database and as a data exchange format (Wieczorek et al. 2012).
When a rights statement was missing from the data set metadata, the occurrence-level rights information was obtained from the first record of each data set. It is assumed that the rights within a data set are uniform and that the first record is representative of all records in that set. Rights statements for a further 0.25% of data sets were obtained this way.
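The fallback described above can be sketched as follows. This is an illustration in Python, not the R workflow used in the study. The function name `resolve_rights` and the dictionary layout are hypothetical, and the order in which the two occurrence-level fields are checked is an assumption; only the field names 'rights' and 'accessRights' come from the text.

```python
from typing import Optional


def resolve_rights(dataset_meta: dict, occurrences: list) -> Optional[str]:
    """Return a rights statement for a data set, or None if none is found."""
    # Step 1: prefer an explicit statement in the data set metadata.
    statement = (dataset_meta.get("rights") or "").strip()
    if statement:
        return statement

    # Step 2: fall back to the first occurrence record only, on the
    # assumption that it is representative of the whole data set.
    if occurrences:
        first = occurrences[0]
        # Assumed order: the newer 'accessRights' term, then the older 'rights'.
        for field in ("accessRights", "rights"):
            statement = (first.get(field) or "").strip()
            if statement:
                return statement

    # No explicit rights statement anywhere: the data set counts as unlicensed.
    return None
```

Under this sketch, a data set whose metadata and first record both lack a statement is treated as unlicensed, even if later records carry one, which mirrors the uniformity assumption stated above.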
Some licensed data sets use standard licences such as a Creative Commons or an Open Data Licence, while the remainder use bespoke licence statements of various sorts. To simplify the interpretation of the different licences, the intention of the licence holder was interpreted and simplified into seven categories. Each of these categories was then given a data openness score from zero to three, from the least to the most open, respectively. Details of these categories are given in Table 2.
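As an illustration of the scoring, the sketch below assigns a score from zero to three using the score definitions given with Fig. 1. It does not reproduce the seven categories of Table 2. The class `LicenceTerms`, its three flags and the function `openness_score` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class LicenceTerms:
    # Hypothetical flags summarising the intention of the licence holder.
    requires_permission: bool = False       # not usable without express permission
    restricts_commercial_use: bool = False  # commercial use is not permitted
    requires_acknowledgement: bool = False  # attribution is required


def openness_score(terms: LicenceTerms) -> int:
    """Map licence terms to a data openness score (0 = least open, 3 = most open)."""
    if terms.requires_permission:
        return 0  # not usable without express permission of the owner
    if terms.restricts_commercial_use:
        return 1  # no commercial use; may also require acknowledgement
    if terms.requires_acknowledgement:
        return 2  # only acknowledgement is required
    return 3      # completely open


# Example: an attribution-only licence scores two.
attribution_only = LicenceTerms(requires_acknowledgement=True)
score = openness_score(attribution_only)
```

The checks run from the most to the least restrictive condition, so a licence that both restricts commercial use and requires acknowledgement receives the lower score.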
The sources of the data sets were classified into thirteen types based on the data set names and provider names (Table 3). These types were chosen by reviewing the word frequency in the data providers' titles. For example, common words in data provider titles included 'University', 'Museum', 'Institute', 'Research' and 'Herbarium'. If the data provider's name could not be used to assign a type or was ambiguous, the data set description, domain name and organization's website were used to guide attribution. The majority of data sets are described in English; for all others, Google Translate was used to interpret the titles and descriptions. Finally, the names of the data sets were reviewed to ensure they had been classified correctly. For example, some citizen science and scientific society data sets are submitted to GBIF by data centres of various types and can be recognized from their data set name. If these could be identified, they were then reclassified. It is acknowledged that many institutions fall within multiple types (for example, some museums will also be research institutions and vice versa), but each organization was assumed to belong to its self-identified type, and the data set's description was assumed to take precedence over the provider's name. It is also assumed that all data sets consist of one type of data. We used a non-parametric Mann-Whitney U-test to determine the significance of differences between data openness scores, which was considered a rigorous test for these categorical and bounded scores.
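The first pass of this classification, matching words in the provider's title, can be sketched as below. The keyword-to-type mapping is a hypothetical illustration: it uses some of the common words quoted above, but the actual thirteen types are those of Table 3. Titles that match no keyword, or keywords of more than one type, return None to signal that the description, domain name or website must be consulted by hand.

```python
import re
from typing import Optional

# Hypothetical mapping from title keywords to provider types.
KEYWORD_TYPES = {
    "university": "educational institution",
    "museum": "museum",
    "institute": "research institute",
    "research": "research institute",
}


def classify_provider(title: str) -> Optional[str]:
    """Return a provider type from its title, or None if unmatched or ambiguous."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    matched_types = {ptype for keyword, ptype in KEYWORD_TYPES.items() if keyword in words}
    if len(matched_types) == 1:
        return matched_types.pop()
    # Zero matches or conflicting matches: leave for manual review.
    return None
```

For instance, `classify_provider("Natural History Museum")` yields "museum", whereas a title such as "University Museum of Zoology" matches two types and is left for manual review.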

Results
Our assessment showed that citizen science data sets comprise 10% of data sets on GBIF, but account for 60% of all observations. The largest data set by far is from eBird (Cornell Lab of Ornithology 2015), which is a citizen science data set that contains over 200 million observations. When comparing the data openness scores of GBIF data sets with the data source types (Fig. 1), the citizen science projects ranked low on the openness of their data, although the vast majority do not include a licence statement (mean data openness score 1.67, n = 33), whereas institutions such as museums, educational institutions and research institutes ranked higher (mean data openness scores 2.13, n = 335; 2.04, n = 324 and 2.01, n = 219, respectively). Commercial organizations ranked the most open (mean data openness score 2.82, n = 11). Of course, some data from citizen science projects were entered into GBIF using other organizations as intermediaries and are therefore hidden from our view. This is particularly true of data centres, which also score poorly (1.80, n = 368) and of societies (2.10, n = 39). A comparison of the scores of the three data set types together ('citizen science', 'societies' and 'data centres') demonstrated lower scores for these predominantly volunteer-provided data sets (mean 1.83, n = 440), than for all other data sets together (mean 2.10, n = 1048) (Mann-Whitney U-test, W = 194680, P < 0.01).
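To make the test statistic concrete, the sketch below computes the Mann-Whitney U statistic for one group of scores against another in plain Python, using midranks for ties, which matter here because the scores take only four values. It is an illustration only: the function name `mann_whitney_u` is introduced here, the example scores are invented, and the P value is not computed.

```python
def mann_whitney_u(x, y):
    """Return the U statistic of sample x against sample y, using midranks for ties."""
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, "x") for v in x] + [(v, "y") for v in y], key=lambda p: p[0])
    rank_sum_x = 0.0
    i, n = 0, len(pooled)
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j; each receives the average rank.
        midrank = (i + 1 + j) / 2
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == "x")
        i = j
    n_x = len(x)
    # Subtract the smallest possible rank sum for a sample of size n_x.
    return rank_sum_x - n_x * (n_x + 1) / 2


# Invented openness scores for two small groups of data sets.
volunteer_scores = [0, 1, 1, 2]
other_scores = [2, 2, 3, 3]
u = mann_whitney_u(volunteer_scores, other_scores)
```

U counts the pairs in which a score from the first group exceeds a score from the second, with ties counted as one half, so a value well below half the number of pairs indicates that the first group tends to score lower.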

Discussion
The results confirm the important contribution of citizen scientists to biodiversity research. However, contrary to expectations, biodiversity data sets on GBIF derived from citizen science projects were often associated with more restrictive licences than other data types, and frequently restricted the data use by commercial organizations. Data centres, which distribute citizen science data as an intermediary, also receive low scores. Even though these data are collected voluntarily, the circumstances under which these data are managed and distributed seem to result in more restrictive data sharing. Scientific societies scored better on their accessibility. As these societies often have a largely voluntary membership, this raises the question of why their openness differs from that of citizen science projects. This category does, however, contribute fewer observations, forming only 2% of those contributed by citizen science and data centres. Surprisingly, commercial organizations scored highest on open access to biodiversity data. However, as the provisioning of biodiversity data is not the core business of these organizations, they form only a small fraction of providers and observations.

Heterogeneous licensing
Within the GBIF data sets, there is a wide variety of licences. For example, 26% of licensed data sets restrict commercial usage, presumably to avoid undermining potential revenue sources for the data provider. However, the providers may be unaware that this stipulation also prevents not-for-profit research that they may assume is permitted (Hagedorn et al. 2011). This limitation is also true for the 88% of GBIF data sets that lack licensing information. Although it is perhaps assumed by some users that no licence information implies that the data can be used openly, this is not the case (Groom et al. 2015). Academic users can probably risk using these data, but the potential risk is much higher for commercial users. Policymakers should be aware that this makes it difficult to outsource the reporting of biodiversity targets that require these data. The heterogeneity of licensing is yet another obstacle that users need to resolve. GBIF acknowledges these problems of data licensing and is transitioning to a simpler obligatory system that offers only three licensing options (Table 1) (Desmet & Aelterman 2013; GBIF 2015).

Data use barriers
In addition to licensing issues, there are also an unknown number of organizations and individuals who hold data but do not share these openly. For instance, commercial companies may not want to share data that might be used against them (e.g. urban development projects can be delayed or blocked by the presence of protected species). Furthermore, we are aware from personal experience that many publicly available citizen science data sets are obfuscated by reducing their spatial or temporal resolution. For example, volunteers provide observations with a precise grid reference and date, but the data providers only supply summary data to GBIF, combining all observations for a year and grid cell into one record. In personal communications, several European GBIF nodes acknowledged that much of their country's data were obfuscated before being provided to GBIF. An example is the National Biodiversity Network in the UK, which provides most of the UK data available in GBIF, most of which comes from volunteer observers. At least 50% of these observations have their coordinates obfuscated at the request of the data providers (NBN Trust 2015).
The fact that some biodiversity observation data are either restricted, obfuscated or inaccessible may have several probable causes. For instance, data holders may use a conservation-based argument, such as protecting locations of species vulnerable to persecution or exploitation. Data holders may also withhold data at the request of a landowner or because they did not receive legal permission to access the area where the observations were made. Among professional scientists, funding shortages and a lack of institutional support for data openness are important reasons for not sharing data (Tenopir et al. 2011; Fecher, Friesike & Hebing 2015).
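The kind of obfuscation described above, collapsing precise observations into one record per year and grid cell, can be illustrated with a short sketch. The function name `coarsen`, the tuple layout, the use of metric easting and northing coordinates and the 10 km cell size are all assumptions made for illustration; they do not describe the procedure of any particular data provider.

```python
from datetime import date


def coarsen(observations, cell_size_m=10_000):
    """Collapse precise observations into one record per species, grid cell and year."""
    summary = set()
    for species, easting, northing, observed_on in observations:
        # Snap the coordinates to the south-west corner of the enclosing cell.
        cell = (easting // cell_size_m * cell_size_m,
                northing // cell_size_m * cell_size_m)
        # Keep only the year; the exact date is discarded.
        summary.add((species, cell, observed_on.year))
    return sorted(summary)


precise = [
    ("Bellis perennis", 451234, 206789, date(2014, 5, 1)),
    ("Bellis perennis", 458000, 201000, date(2014, 8, 9)),
    ("Bellis perennis", 461000, 201000, date(2014, 8, 9)),
]
summary_records = coarsen(precise)
```

In this example, the first two observations fall in the same cell and year and become a single record, so both their exact locations and their dates can no longer be recovered from the shared data.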

The mandate for sharing
These reasons might also inhibit citizen scientists' data sharing, but there are additional reasons. For example, the mandate for decisions on sharing citizen science data is held not by the citizens themselves, but by intermediary organizations such as data distribution centres or citizen science organizations. Multiple reasons can cause these organizations to be unwilling to share data. For instance, data can be used as leverage to fund their activities, or to obtain acknowledgement of their contributions, particularly by being included as authors on publications. With the difficulty of finding sufficient funding, these are understandable reasons for withholding data, even though they considerably reduce the value of the data and can act contrary to the missions of these organizations. Indeed, commodification is a serious area of conflict between amateurs, their managing organizations and data aggregators (Ellis & Waterton 2005). Funding agencies should recognize that sustainability is fundamental to the reliable provisioning of good quality biological observations for long-term monitoring. For instance, some scientific societies have an enviable record of longevity. The Botanical Society of Britain and Ireland and the Audubon Society, among other examples, were established in 1856 and 1896, respectively. The value of these sustainable models should not be underestimated.

Moving forward
There are some inspiring examples of advancements that can be applied. One crucial step forward is for organizations managing citizen science data to implement explicit data management policies with standard licences to prevent misconceptions regarding data sharing. A good example is Wikipedia, which has clear policies and no restrictions on commercial usage, but requires acknowledgement and 'share alike'. In the field of biodiversity observations, iNaturalist.org allows users to select from a range of Creative Commons licences, including releasing their data into the public domain.
Volunteers tend to have a particularly local conservation-based focus, whereas professionals may concentrate on international issues (Turnhout & Boonman-Berson 2011). This can lead to a lack of understanding between professional organizations and amateur societies, which have different goals and perspectives (Ellis & Waterton 2005). For example, volunteers are perhaps less likely to be motivated by citation in academic journals, but welcome acknowledgements that are visible to their local peer group. Some of the most productive volunteer observers collect biodiversity data for their own projects, such as producing regional floras or breeding bird atlases. Sharing their data becomes an interesting option for these citizen scientists if they can be assured that their own projects are not affected. Data users should support the activities of citizen scientists and societies through acknowledgement of their contributions in a way that matters for the citizen scientists.
This study illustrates that although citizen scientists contribute important biodiversity data to GBIF, the openness of their data ranks low among the different data providers studied. Several methods to stimulate data sharing are feasible. The assumption that voluntary data collection leads to data sharing does not do justice to those who collect data, nor does it acknowledge the contributions of these data to long-term monitoring of biodiversity trends. To improve data openness, citizen scientists should be encouraged in ways that correspond with their motivations. A first step would be for data distribution centres to delineate clearer licensing approaches and thereby to enable citizen scientists to select appropriate levels of data accessibility.
Wieczorek, J., Bloom, D., Guralnick, R., Blum, S., Doring, M., Giovanni, R., Robertson, T. & Vieglais, D. (2012)

Fig. 1. The average data openness score of data sets on GBIF separated by the organization type of the data set provider. Only data sets with an explicit expression of data usage rights have been included. A data set with a score of zero is not usable without express permission of the owner; a data set with a score of one does not permit commercial use, requires acknowledgement and may have other restrictions; a score of two only requires acknowledgement; a score of three is given to data sets which are completely open. Error bars show the 95% confidence interval using a t distribution.