Improving data identification and tagging for more effective decision making in agriculture

Pascal Neveu, Romain David, Clement Jonquet

High-throughput phenotyping (phenomics), the plant selection process that aims to identify the best-adapted genotypes, is a good illustration of the data challenges faced by the agricultural research community. In plant sciences, for example, phenomics platforms produce huge, complex datasets (images, spectra, human readings, soil analyses) at different scales (from molecule to plant population) in heavily instrumented installations (field, greenhouse).
Phenomics datasets must be accessible to the scientific communities (geneticists, bioinformaticians, ecophysiologists, agronomists, statisticians, sociologists, etc.) who have intensive data integration needs to support the selection process. This case study is further detailed in Section 3.
2 Structuring the data

In agriculture, observation and management systems, developed and used in many settings, produce a large volume of heterogeneous data, which are difficult to aggregate since they focus on specific issues. There are various data sources in agriculture, and using them together requires a wide range of knowledge and skills. For instance, agricultural data sources can relate to agricultural production, farm practices, transformation, distribution and so on.

In recent years, other important sources of data have appeared: not only connected objects in agriculture (Tzounis et al., 2017), such as weather stations, insect traps, soil moisture sensors and water meters connected to irrigation systems, but also various sensors installed on animals to evaluate their condition (health measures, temperature, movement), milking robots (quantity and quality of milk) or feeding automata. Agro-equipment is increasingly enriched with sensors, for precision farming (e.g., providing the plant exactly what it needs) and predictive maintenance. Satellite images are another example: the Sentinel constellation delivers free images at a very high temporal frequency (every 5 days), which opens up new research and business opportunities. Agricultural production traceability requirements are now supported, in part, by automated reading systems, with radio frequency identification (RFID) and NFC chips, or by the manual input of agricultural interventions from smartphones with direct transmission to application software. The challenge is to automate data acquisition so that it has virtually no cost and is not an additional burden for farmers or scientists (Wolfert et al., 2017). Finally, high-throughput phenotyping methods, essential for shortening the production cycle of new seeds, are also sources of massive data (e.g., phenotype-monitoring platforms produce thousands of images per day) to link with genotypic data (Halewood et al., 2018).
Organized and structured access to primary agricultural data is a sine qua non condition for building efficient decision support systems that help achieve biodiversity conservation and sustainable development. Organizing, managing and storing such varied data require new approaches. Proper data structuring organizes data to suit a specific purpose so that they can be accessed and worked with in appropriate ways. The better the data structure, the better we will be able to combine them with other data and learn from them.

Identification
An identifier is a sort of name that identifies a specific object (digital or not) within a set of objects. In an ideal world, an identifier would be unique for each object (a bijection); in practice this is rarely the case. In most cases a resource (object) can have several (not all unambiguous) identifiers depending on the context. An identifier is unambiguous if it makes it possible to identify an individual in a specific context in a safe way (McMurry et al., 2017). An unambiguous identifier that cannot refer to two different objects is called a GUID (globally unique identifier) or UUID (universally unique identifier): irrespective of the database or source, all disciplines taken together, no other object will be designated identically; ISBNs for books are an example. For software objects, GUIDs are typically randomly generated 128-bit codes. There are several specifications for identifiers, for example, UUID, LSID, ARK, DOI, URI, RFID and XRI. The relevance of these different mechanisms depends on the context and, of course, on the characteristics of the objects to identify. Data identification also depends on the range of use of the resource. If the resource is referenced only within a limited range or system, it can be assigned a local identifier.
But if it is to move to another system (e.g., for the purposes of expert measures such as soil chemistry or water quality) or to be reused and aggregated with data of different provenances or contexts, a 'reliable' global and long-term identification mechanism is necessary.
Long-term structuring of data requires reliably identifying all the concepts, objects and properties described in the information systems. A persistent identifier is an identifier that is permanently assigned to an object (ideally usable for several decades). For example, once an ISBN is assigned to a particular book, that number is always associated with that book and no other book will ever receive the same number. Identifiers must therefore be persistent and must not change.
The problem is that, over decades, many changes can occur not only within databases but also in the institutions or organizations in charge of the data. It is thus necessary to preserve and recover dependencies between these elements, in time and in location. Persistent identifiers play a key role in adopting Open Science (Dappert et al., 2017). The reliability of this identification depends on some essential qualities described, for instance, in W3C Recommendations (https://www.w3.org/TR/cooluris/) and must assure persistent security, traceability and reusability of data. The key to rich integration is a commitment to deploy and reuse globally unique, shared identifiers and to implement services that link those identifiers (Page, 2008). For instance, persistent GUIDs are usually generated as groups of dash-separated hexadecimal characters, for example, 120a-e29f-a861-12f5-5a52. Their three main qualities are: 1) they are generated in a non-centralized way, 2) the random generation of two identical identifiers is extremely improbable and 3) they are completely opaque and not sensitive to changes of authorities or authority names. These automatically generated GUIDs can be used as a basis for the construction of other identifiers, for example, by adding a prefix (URI, URL, domain name, authority name). A GUID, when integrated in a URI, can be dereferenceable, as explained hereafter.
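As a sketch of this construction, the snippet below mints a random 128-bit GUID (UUID version 4) and prefixes it with an authority base to form a URI; the example.org authority is illustrative only. The opaque GUID part is what makes the identifier insensitive to renames of the issuing organization.

```python
import uuid

def mint_uri(base: str) -> str:
    """Mint a URI by prefixing a randomly generated 128-bit GUID
    (UUID v4) with an authority base. The GUID part is opaque: it
    carries no name or authority information, so it survives
    renames of the organization that issued it."""
    guid = uuid.uuid4()  # non-centralized; collision odds are negligible
    return f"{base}/{guid}"

# The base below is a hypothetical authority, not a real one:
uri = mint_uri("http://example.org/id")
print(uri)
```

Because the generation is non-centralized, two installations can mint identifiers concurrently without coordinating, which matters for field devices with intermittent connectivity.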
The uniform resource identifier (URI) is defined by the RFC 3986 standard, which specifies: 'A URI is a compact sequence of characters that identifies an abstract or physical resource. This specification defines the generic URI syntax and a process for resolving URI references that might be in relative form, along with guidelines and security considerations for the use of URIs on the Internet. The URI syntax defines a grammar that is a superset of all valid URIs, allowing an implementation to parse the common components of a URI reference without knowing the scheme-specific requirements of every possible identifier. This specification does not define a generative grammar for URIs; that task is performed by the individual specifications of each URI scheme'. All Web hyperlinks (URLs) are expressed as URIs.
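For illustration, the generic components of this grammar (scheme, authority, path, query, fragment) can be inspected with Python's standard library; the URI below is a made-up example, not a real identifier:

```python
from urllib.parse import urlparse

# Parse a URI into the generic components defined by RFC 3986.
parts = urlparse("https://phenotyping.example/m3p/arch/2017/c17000915?format=ttl#meta")

print(parts.scheme)    # scheme:    https
print(parts.netloc)    # authority: phenotyping.example
print(parts.path)      # path:      /m3p/arch/2017/c17000915
print(parts.query)     # query:     format=ttl
print(parts.fragment)  # fragment:  meta
```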
Dereferencing is also an important aspect: a URI is said to be dereferenceable if it is possible to obtain the digital content describing the referenced resource (e.g., via a URL). The Digital Object Identifier (DOI), initially used in bibliographic databases, allows the identification of digital resources, such as reports, scientific articles or any other type of digital object.
The purpose of the DOI is to associate metadata describing the object, for example, in bibliography, to produce more reliable, unambiguous and longer-lasting citations. DOIs are issued by registration agencies such as those of the DataCite consortium. A DOI is a special case of Handle ID with the following format: doi:10.<Naming_Authority>/<Registry_Number>. It contains a link to the metadata (restrictions of use or copyright and naming authority, among others), described by a data model common to all DOIs, the indecs Data Dictionary, and an address or physical location for the digital object (usually a URL) that the DOI resolver will use to redirect. For instance, prefixing a DOI with https://doi.org/ allows one to dereference the identifier into a landing page storing or describing the identified object. DOI provides a good frame for persistent identification of agricultural datasets.
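This prefixing step is simple enough to show directly; the small helper below turns a DOI name into a dereferenceable URL (the DOI shown is hypothetical, used only for illustration):

```python
def doi_to_url(doi: str) -> str:
    """Turn a DOI name into a dereferenceable URL by prefixing the
    https://doi.org/ resolver, which redirects to the landing page.
    Accepts the name with or without the 'doi:' scheme prefix."""
    return "https://doi.org/" + doi.removeprefix("doi:")

# Hypothetical DOI, for illustration only:
print(doi_to_url("doi:10.1234/example-dataset"))
# https://doi.org/10.1234/example-dataset
```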
ARK is a persistent identifier scheme based on the URI standard. ARK is designed to ensure long-term identification of a resource, scalability and independence. An ARK contains a portion impervious to changes and a flexible portion, which designates a form of the object or a mode of access to it. An ARK URL is subdivided into two parts: the first, optional, gives the addressing authority NMA (Name Mapping Authority), while the second is the ARK itself, fixed and proper, which includes a NAAN (Name Assigning Authority Number) and the name given to the object. The goal of XRI is to provide a universal format for abstract, structured identifiers that are independent of domains, locations and transport applications, so that they can be shared across a large number of domains, directories and protocols.
Identifying samples and real objects with a persistent identifier is possible with several standardized methods that can be linked with the previous persistent identifiers. A barcode, for example, is a machine-readable visual representation that encodes information about the object carrying it; it can have one or two dimensions and represents a numerical identifier. For instance, the Universal Product Code, used in retail worldwide, is a GS1-approved international standard (ISO/IEC 15420).
The identification of real objects has been increasing since the appearance of the internet of things (IoT). With RFID chips, naming solutions and middleware, the IoT is composed of many complementary elements, each having its own specificities.
For real objects, RFID relies on radio tags that can be pasted on or embedded in objects or products and even implanted into living organisms (animals, the human body). This identification method can be used to identify objects, much like a barcode (as an electronic label), or people. Such identifiers are used in schemas or standard vocabularies and ontologies (cf. Section 2.2) to provide information (properties, relations) about the object (e.g., responsible organization, type of object, definition, labels). As digital objects evolve, the persistent identification method must support different versions of an object. Versioning then becomes an important aspect when building identifiers, for example, around predefined periods and important update releases (curation of data, collection campaigns). In addition, some versioning processes must trace all the transformations made on the data for history management.
Services such as B2HANDLE can support this.
Today, the URI system is a standard used in a large variety of domains: genetics, chemistry, the IoT, life sciences and so on. As an identifier, a URI must have some properties: non-ambiguity, uniqueness, persistence, stability and resolvability.
• Non-ambiguity: one URI refers to only one resource.
• Uniqueness: only one URI for one resource.
• Persistence: once a resource is given a URI, one should not replace or delete that URI.
• Stability: a URI has to remain valid as long as possible (at least 20 years) and should not be reassigned to another resource. This definition is close to persistence; stability is persistence over a long time.
• Resolvability: a URI should be usable through an internet browser to find information about the resource, or the resource itself (it is then said to be dereferenceable).
When these principles are not respected, one may encounter several issues. Non-ambiguity and uniqueness are usually not a problem, as everyone understands their importance.
However, stability and persistence are much more difficult to achieve: a typical case is when part of the URI changes. For example, the domain name www.phenome-fppn.fr later becomes phenotyping.fr. The uniqueness of phenotyping.fr/m3p/arch/2017/c17000915 is then not guaranteed; there could be two different resources identified with the same URI, one minted under phenotyping.fr and another under phenome-fppn.fr.
In summary, a few rules must be followed to create good URIs in agriculture: i) use minimal information and avoid anything that may change, ii) use persistent URLs, iii) provide multiple output formats (content negotiation) and link them together, and iv) reuse existing external identifiers where possible.

Semantic interoperability

Semantic interoperability enables data integration and fosters new scientific discoveries by exploiting various data acquired from different perspectives (e.g., agricultural and context data).
For instance, a scientist experimentally measures the sensitivity of a plant to a disease (the agronomy vision), whereas a farmer concretely observes the leaves of the plant turning brown (the agriculture vision). Both are phenotype, or trait, information, but they come from two different worlds that must be better connected. This will be possible only by lifting the data into knowledge that is meaningful for humans, yet exploitable by machines.
A researcher studying a certain plant trait (e.g., resistance to a disease) is interested in the gene that controls this phenotype, the expression of this trait in different crop varieties observed in different environments and, of course, its effect on the crop yield or for associated needs such as the use of pesticides. The information we need to answer such questions is available in multiple datasets expressed using various ontologies (crop ontology, plant ontology, trait ontology, etc.) and at various levels (e.g., population, individual, organ); the issue is finding that information and combining it in a meaningful way for researchers, breeders and ultimately farmers, consumers or any stakeholders of the value chain.
Ontology engineering is a sub-domain of knowledge engineering that deals with knowledge representation and reasoning. An ontology has been described as an 'explicit specification of a conceptualization' (Gruber, 1993); it 'defines the terms to describe and represent an area of knowledge'. Ontologies are composed of concepts, relations and instances. For example, if you want to define a car, you could say: 'a car is a transportation object, with four wheels, and one needs a licence to drive it. My blue Ford Mustang is a car'. 'Car' is a concept, 'is a' is a relation and 'My blue Ford Mustang' is an instance.
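The car example can be written down as subject-predicate-object triples, the basic unit of Semantic Web languages. The sketch below uses plain Python tuples rather than an RDF library so it stays self-contained; the names are taken from the example above, and the tiny query shows how a 'sort of' (subClassOf) hierarchy lets an instance inherit its parent classes.

```python
# The car example as subject-predicate-object triples.
triples = {
    ("Car", "subClassOf", "TransportationObject"),
    ("Car", "numberOfWheels", "4"),
    ("Car", "requires", "DrivingLicence"),
    ("MyBlueFordMustang", "type", "Car"),
}

def classes_of(instance):
    """Return the direct and inherited classes of an instance by
    following 'type' then walking up 'subClassOf' links."""
    direct = {o for s, p, o in triples if s == instance and p == "type"}
    inherited = set(direct)
    frontier = set(direct)
    while frontier:
        frontier = {o for s, p, o in triples
                    if p == "subClassOf" and s in frontier}
        inherited |= frontier
    return inherited

print(sorted(classes_of("MyBlueFordMustang")))
# ['Car', 'TransportationObject']
```

This inheritance walk is exactly what an OWL reasoner does, at scale, when it infers that a corncob observation is also a plant-organ observation.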
The Semantic Web is the area in which ontologies are used to structure data into formal knowledge.

The 5-star deployment scheme for Linked Open Data (LOD) summarizes this progression:
☆ Data available on the Web, in whatever format, under an open licence
☆☆ Available as machine-readable structured data
☆☆☆ As above, but in a non-proprietary format
☆☆☆☆ All of the above, using URIs to identify things
☆☆☆☆☆ All of the above and links to other LOD
The purpose of the Web of data is not to create another Web, since it is based on its current architecture (the URI system and the HTTP protocol), but to create an extension. RDF is to structured data what HTML is to documents, an interoperability framework that ensures consistency in the handling and processing of these data by machines.

Ontologies and semantic tagging in agriculture
In recent years, we have seen an explosion in the number of semantic resources (thesauri, vocabularies, ontologies) in agriculture. AgroPortal offers a robust and reliable service to the community that provides ontology hosting, search, versioning, visualization, comment and recommendation; enables semantic annotation; stores and exploits ontology alignments; and enables interoperation with the Semantic Web.
One important use of ontologies is annotating and indexing text data. Indeed, ontologies allow representing data with clear semantics that can be leveraged by computing algorithms to search, query or reason on the data. One way of using ontologies is by creating semantic annotations or semantic tags. An annotation is a link from an ontology term to a data element, indicating that the data element (e.g., article, experiment, observation, medical record) refers to the term. When doing ontology-based indexing, we use these annotations to 'bring together' the data elements from these resources. However, explicitly annotating data is still not a common practice, for several reasons:
• Annotation often needs to be done either manually by expert curators or directly by the authors of the data.
• The number of available ontologies is large, their formats vary, and ontologies change often and frequently overlap.
• Users do not always know the structure of an ontology's content or how to use the ontology to do the annotation themselves.
• Annotation is often a boring additional task without immediate reward for the author.
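Despite these obstacles, the core mechanism of annotation is simple. Below is a minimal dictionary-based sketch in Python; the labels, URIs and example sentence are invented for illustration, and real services (e.g., the AgroPortal Annotator) work against full hosted ontologies rather than a hand-written mapping.

```python
# A minimal ontology-based annotator: scan text for known ontology
# labels and emit (label, URI) annotations. The label-to-URI mapping
# below is entirely hypothetical.
ONTOLOGY = {
    "drought resistance": "http://example.org/trait/0001",
    "leaf": "http://example.org/plant-organ/0042",
    "maize": "http://example.org/taxon/4577",
}

def annotate(text: str):
    """Return (label, URI) pairs for every ontology label found
    in the text (case-insensitive substring match)."""
    text_lower = text.lower()
    return [(label, uri) for label, uri in ONTOLOGY.items()
            if label in text_lower]

tags = annotate("Drought resistance was scored on each maize leaf.")
print(tags)
```

Even this naive matcher illustrates the reward of annotation: once the observation carries the trait and taxon URIs, it can be retrieved together with any other data element tagged with the same terms.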

Identification in PHIS
Tracking all objects involved in a phenotyping experiment (e.g., plants, pots, sensors) and representing the relationships between them are essential in a high-throughput context where thousands of plots, plants or sensors are involved. This requires a proper strategy that individually identifies each specific object, as well as semantic properties for creating relationships between such objects.
For instance, the replacement of a sensor at a given position (e.g., meteorological sensor or soil tensiometer) is not obvious in the outputs of an environmental database. In greenhouse experiments, a plant can be replaced by another plant at the same position and vector (e.g., pot, cart) during an experiment, potentially generating confusion. All objects therefore need to be identified in order to keep the necessary information associated to them (e.g., positions over time, successive calibration for sensors, origin for plants).
In the following, we illustrate PHIS's identification system. PHIS object identification is based on URIs. This ensures traceability in space and time, whereas a typical identification by numbers (e.g., 'plant 736') refers to different plants in different experiments and installations. URIs are generated automatically for each object via the user interface and encoded as QR codes, creating a set of connected objects that can be accessed, along with all their properties, from any terminal (e.g., mobile device, barcode reader).
What are the things to identify? Ideally, we want to identify everything, but we have very different resources: do we identify them the same way? Are URIs the best option to identify every resource? Those are questions one should ask before designing a URI scheme. For example, measures collected by a sensor can be gathered in a dataset and require only one URI for the dataset, or even be aggregated in a database; the measures per day are then identified with a primary key or an incremental ID.

How to make non-ambiguous URIs
PHIS's non-ambiguous identifiers use an incremental number (e.g., the number of plants), prefixed with a letter that helps humans manipulate the URIs and the real objects.
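As an illustration only (the exact PHIS minting rules are not detailed here), the Python sketch below mints identifiers in the shape of the earlier example phenotyping.fr/m3p/arch/2017/c17000915: a stable base, the installation and year, then a human-friendly type letter plus a zero-padded incremental counter.

```python
# Hypothetical sketch of a PHIS-like URI minting scheme; the real
# PHIS pattern may differ. One counter is kept per object-type letter.
class UriMinter:
    def __init__(self, base: str, installation: str, year: int):
        self.prefix = f"{base}/{installation}/{year}"
        self.year = year
        self.counters = {}  # object-type letter -> last issued number

    def mint(self, type_letter: str) -> str:
        """Issue the next URI for the given object type, e.g. 'c'."""
        n = self.counters.get(type_letter, 0) + 1
        self.counters[type_letter] = n
        return f"{self.prefix}/{type_letter}{self.year % 100}{n:06d}"

minter = UriMinter("http://phenotyping.fr", "m3p/arch", 2017)
print(minter.mint("c"))  # http://phenotyping.fr/m3p/arch/2017/c17000001
print(minter.mint("c"))  # http://phenotyping.fr/m3p/arch/2017/c17000002
```

Embedding the year and installation in the path keeps the counter local to one campaign, so the same small numbers can safely recur across experiments without ambiguity.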
In PHIS, the semantic implementation is realized by a set of standardized ontologies written in OWL2. Based on these ontologies, the first step is to organize objects and concepts in a specialization ('sort of') hierarchy. For instance, a corncob is a sort of plant organ, that is, corncob is subClassOf plantOrgan. The description of this object (metadata) is formalized as properties.
These properties can be values (dataProperty) or objects (objectProperty). Semantic links between objects, between events and between traits used in PHIS are realized through the annotation ontology and some specific application ontologies, such as the Ontology of Experimental Events (OEEV). In order to integrate data, the relations between objects need to be represented adequately. Events are classified, for example, as Trouble (e.g., an irrigation trouble) or Incident (a pot falls down, a leaf is blocked in an imaging cabin, lodging of a plot, human error, etc.). As described in the associated semantic graph, an event can be associated with objects (e.g., plant, plot, sensor) and with the user who annotated the event, and the occurrence data can be tracked along with every relevant detail.
The use of ontologies allows the complexity of phenotyping data to be handled in order to link a large number of different data sources. The data integration process can be done automatically:
• Concept mapping is one approach to data integration from different sources. Ontologies help with concept mapping: for instance, 'field' is equivalent to 'cultivated land'.
• The data-linking approach is based on the use of common standardized RDF properties in several data sources. It allows common individuals to be identified in different sources. For instance, GPS coordinate values and the plant species name make it possible to find common plots across different datasets.
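The data-linking idea amounts to a join on shared properties. In the Python sketch below, the two record sets and all their values are invented for illustration, and matching on identical GPS tuples stands in for the fuzzier spatial matching needed in practice.

```python
# Two plot datasets from different (hypothetical) sources.
source_a = [
    {"plot": "A-12", "gps": (43.61, 3.87), "species": "Zea mays"},
    {"plot": "A-13", "gps": (43.62, 3.88), "species": "Triticum aestivum"},
]
source_b = [
    {"id": "urn:b:901", "gps": (43.61, 3.87), "species": "Zea mays"},
    {"id": "urn:b:902", "gps": (48.85, 2.35), "species": "Zea mays"},
]

def link(a_records, b_records):
    """Pair records whose GPS coordinates and species both match,
    i.e. join the two sources on their shared properties."""
    index = {(r["gps"], r["species"]): r for r in b_records}
    return [(a["plot"], index[(a["gps"], a["species"])]["id"])
            for a in a_records if (a["gps"], a["species"]) in index]

print(link(source_a, source_b))  # [('A-12', 'urn:b:901')]
```

When the sources publish RDF with standardized properties, the same join is expressed declaratively as a SPARQL query instead of hand-written indexing code.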
An ontology-driven approach to data management allows greenhouse and field data to be handled within the same system, thanks to a precise formalization of agricultural objects. This approach eases the data integration process. In other words, by connecting greenhouse and field experiments, the decision-making process is strongly improved.
There are other uses as well. This approach based on ontology-driven information systems can facilitate decision making for many agricultural applications, such as agroecological system design, precision agriculture and breeding. For instance, in agroecology we formalized bioaggressors, their lifecycles and their impacts. All these applications require interdisciplinary work and intensive data integration. The formalization of concepts, the links between concepts and tagging are fundamental and constitute a crucial step. This generation of information systems encourages the production of FAIR data that can be used across disciplines.

Conclusion and future trends
As we have seen earlier, structuring data in order to make them reusable relies on their long-term identification (beyond the decade) and on the reuse of ontologies and interdisciplinary standards. In practice, organizational evolutions and staff turnover have important effects on long-term data management. In too many cases, data are produced and designed for 'immediate consumption'. Reusing ontologies is the way we must choose, but efficient tools for improving reuse are needed. Data come from various devices, simulations, observations or crowdsourcing, and too often data repeatability/reproducibility is poorly known or impossible. Structuring data will be a significant advance. For projects, institutions and companies, Data Management Plans (DMPs) are a sine qua non condition for the evaluation of produced data in agriculture. DMPs will allow the development and improvement of methods for the identification of agricultural objects and the associated data semantics. An interesting example is the world of software, where many developers actively share their production. Data papers improve the process of data sharing and data indexing.14 A citation mechanism is designed to reward the efforts of people and institutes that collect and manage data.
But recognizing data sharing is still in its infancy, and the generalization of persistent identifiers, data papers and the Web of data could help change things. Part of the answer is also in the availability of integrative data tools for visualization, analysis, prediction and decision support.
14 https://freshwaterblog.net/2012/06/29/what-does-a-data-paper-look-like/

Access to a new generation of tools can motivate agricultural communities and support agriculture in facing these challenges. Data curation needs to be developed and to go further than 'cleaning' imperfections. The curation of data, from the Latin curare, which means 'to take care', is essential before any process of analysis or decision. It consists of improving the capacity of the data to describe a system in an unambiguous and explicit way. It is essential to prepare a dataset for a large set of analysis methods, giving the opportunity to aggregate different datasets of different provenances, structures and semantics.
To meet the agricultural challenges, well-structured and well-described data are essential, but how do we use them better? Ideally, we would have powerful tools to automatically select and integrate huge datasets from various sources (agriculture, environment, social, health, etc.). A more realistic first stage is to use semi-automatic tools, in order to produce the most complete knowledge that constitutes the decision support material.
The structuring of data must be accompanied by, and allow, the construction of different kinds of decision support tools. The main goal is to promote the adoption of increasingly 'smart' decision support tools in the agricultural domain. These systems will use not only more data but also better data: updated, cheaper to produce, more standardized and more effective for decision making.
Produced data should meet the FAIR principles; if widely adopted, the connections they enable will result in improved access to information, opportunities for collaboration, reduced administrative overhead and, ultimately, increased trust in studies and research (Meadows et al.).