Conference paper, Year: 2013

Units of measure identification in unstructured scientific documents in microbial risk in food

Abstract

OBJECTIVE(S) A preliminary step in microbial risk assessment in food is to gather and capitalize experimental data. Data capitalization is a crucial issue in an overall decision support system whose purpose is to predict microbial behavior [1]. Within the French ANR project MAP'OPT (Equilibrium Gas Composition in Modified Atmosphere Packaging and Food Quality), the predictive modeling platform Sym'Previus (www.symprevius.org) should offer a global approach for establishing a scientifically sound method of choosing an appropriate modified atmosphere and the associated packaging solution. Our work is part of this overall system and aims to extract experimental data semi-automatically from unstructured scientific documents. These documents combine natural language with domain-specific terminology, which makes the data extremely time-consuming and tedious to extract from free text, and therefore to gather and capitalize. Our work relies on the MAP'OPT-Onto ontology [4], which was built as an extension of the ontology used in Sym'Previus by adding concepts about food packaging, quantity concepts, and concepts for managing units of measure. Experimental data are often expressed with concepts (e.g. packaging, permeability) or with a numerical value, often followed by its unit of measure (e.g. 258 amol m-1 s-1 Pa-1). This paper deals with unit recognition, which is known to be a scientific challenge.

METHOD(S) Automatically extracting quantitative data is a painstaking process because units are written in different ways across documents. The same unit can appear in several forms: amol m-1 s-1 Pa-1 may also be written amol.m-1 .s-1 .Pa-1 or amol/m/s/Pa. We focus on extracting and identifying these variants, treated as synonyms, in order to iteratively enrich an ontology, which represents a predefined vocabulary used to annotate, capitalize and query experimental data extracted from texts [2].
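
To make the variant problem concrete, here is a minimal sketch of how the three spellings quoted above could be mapped to a single canonical form. It is an illustration only, not the method of the paper: the function name `normalize_unit` and the choice of the space-separated form as canonical are assumptions, and the slash handling only covers plain symbols without their own exponents.

```python
import re


def normalize_unit(text: str) -> str:
    """Map simple spelling variants of a compound unit to one form.

    Hypothetical helper for illustration; the canonical form chosen
    here is the space-separated notation, e.g. 'amol m-1 s-1 Pa-1'.
    """
    s = text.strip()
    # Slash notation: each factor after a "/" gets the exponent -1.
    # Limitation: factors that already carry an exponent are not handled.
    if "/" in s:
        head, *tail = [part.strip() for part in s.split("/")]
        s = " ".join([head] + [f"{part}-1" for part in tail])
    # Dots (with optional surrounding spaces) used as product
    # separators become single spaces.
    s = re.sub(r"\s*[.·]\s*", " ", s)
    # Collapse any remaining runs of whitespace.
    return re.sub(r"\s+", " ", s).strip()


variants = ["amol m-1 s-1 Pa-1", "amol.m-1 .s-1 .Pa-1", "amol/m/s/Pa"]
for v in variants:
    print(normalize_unit(v))  # each line prints: amol m-1 s-1 Pa-1
```

Rule-based normalization like this only covers variants anticipated in advance, which is one reason an approximate string comparison against the ontology is useful for the remaining cases.
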
Our work addresses unit extraction and identification from texts, with the goal of enriching an ontology, through a two-step approach. First, we use text-mining methods and supervised learning to predict the relevant parts of the text in which synonyms of units or new units occur. Second, we extract the specific strings representing units from the text segments found in the first step. The extracted candidates are compared with the units already present in the ontology using a new edit measure based on the Damerau-Levenshtein distance [3].

RESULTS We ran experiments on 115 scientific documents (around 35,000 sentences) on food packaging. Units are recognized against a list of 211 units already defined in MAP'OPT-Onto. Our learning algorithms predict that almost 5,000 sentences contain units, and this prediction is correct in 95.5% of cases. In the second step, we successfully extracted 38 terms, either synonyms or new units, from the sentences selected in the first step. We can therefore propose an enrichment of 18% (38 terms against 211 existing units) of the pre-existing MAP'OPT-Onto.
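
The comparison of extracted candidates with known units can be sketched as follows. The abstract does not specify the authors' new edit measure, so this example substitutes the standard restricted Damerau-Levenshtein distance (optimal string alignment), which counts insertions, deletions, substitutions and transpositions of adjacent characters. The function names and the small list of known units are hypothetical and serve only as an illustration.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def closest_unit(candidate: str, known_units: list[str]) -> str:
    """Return the known unit with the smallest distance to the candidate."""
    return min(known_units, key=lambda unit: osa_distance(candidate, unit))


# Illustrative list only; the real ontology holds 211 units.
known = ["amol m-1 s-1 Pa-1", "kg m-3"]
print(closest_unit("amol m-1 s1 Pa-1", known))  # amol m-1 s-1 Pa-1
```

In practice a distance threshold would be needed to decide whether a candidate is a synonym of an existing unit or a new unit; the abstract does not state how that decision is made, so none is assumed here.
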
Main file
berrahou_2013.pdf (36.38 KB)
Origin: files produced by the author(s)

Dates and versions

hal-01123269, version 1 (19-11-2019)

Identifiers

  • HAL Id: hal-01123269, version 1
  • PRODINRA: 279671

Cite

Soumia Lilia Berrahou, Patrice Buche, Juliette Dibie-Barthelemy, Mathieu Roche. Units of measure identification in unstructured scientific documents in microbial risk in food. 8th International Conference on Predictive Modelling in Food, Sep 2013, Paris, France. pp. 254-255. ⟨hal-01123269⟩