Theses

Simplification automatique de textes techniques et spécialisés

Abstract : Automatic text simplification is a subdomain of natural language processing (NLP). It aims to process texts that are difficult to read for a given audience in order to make them more accessible. Our goal is to automatically simplify medical texts. We present our whole work on that question, which ranges from data collection and analysis to automatic simplification experiments.

We begin with the process of collecting a comparable corpus of biomedical texts. The corpus is made of document pairs that deal with the same subject: one document is written for a specialist audience and the other for non-specialists. The corpus contains three types of texts: drug information, medical literature reviews and encyclopedia articles. Once the documents are collected, we annotate a subset of the corpus and analyze the linguistic transformations that occur during simplification.

From the comparable corpus, we build a method to extract a parallel corpus, that is, a corpus of sentence pairs in which the two sentences have the same meaning but differ in their degree of difficulty. This type of corpus is the basic material for automatic simplification methods. Our parallel sentence extraction method has two steps: (1) prefiltering the pairs that are candidates for alignment using syntactic heuristics and (2) using a binary classifier to identify the sentence pairs that have the same meaning. We evaluate various classifiers as well as the impact of data imbalance on the results. To promote the parallel corpus, we also create a corpus of sentence pairs annotated according to their degree of semantic similarity, with scores ranging from 0 (no similarity) to 5 (same meaning). Both corpora are available for research.

Finally, we present a series of experiments on the automatic simplification of French biomedical texts. We use a neural method that comes from machine translation. We use several resources: the parallel medical corpus that we built, a parallel general-language corpus that we automatically translated from English to French, and a lexicon that matches medical terms with more accessible terms or paraphrases. We describe the experimental protocol and evaluate the results in two ways, quantitatively and qualitatively. The results are similar to the state of the art in general-language simplification and show that the resulting simplifications can be exploited as part of a computer-aided simplification task.
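
The two-step parallel sentence extraction described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not say which syntactic heuristics or which classifiers the thesis uses, so the length-ratio and shared-word prefilter, the Jaccard-overlap threshold standing in for a trained binary classifier, and all function names and parameter values are hypothetical.

```python
import re


def tokens(sentence):
    """Lowercase word tokens (a deliberately simple, hypothetical tokenizer)."""
    return re.findall(r"\w+", sentence.lower())


def prefilter(specialist, simple, min_shared=2, max_len_ratio=3.0):
    """Step 1 (assumed heuristic): keep only candidate pairs of comparable
    length that share at least `min_shared` distinct words."""
    a, b = tokens(specialist), tokens(simple)
    if not a or not b:
        return False
    ratio = max(len(a), len(b)) / min(len(a), len(b))
    return ratio <= max_len_ratio and len(set(a) & set(b)) >= min_shared


def jaccard(specialist, simple):
    """Word-overlap similarity between the two sentences, in [0, 1]."""
    a, b = set(tokens(specialist)), set(tokens(simple))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_parallel(specialist, simple, threshold=0.3):
    """Step 2 (stand-in): a fixed threshold on word overlap plays the role of
    the binary classifier; a real system would train one on annotated pairs."""
    return jaccard(specialist, simple) >= threshold


def extract_parallel(specialist_sents, simple_sents):
    """Return the (specialist, simple) pairs that pass both steps."""
    return [
        (s, t)
        for s in specialist_sents
        for t in simple_sents
        if prefilter(s, t) and is_parallel(s, t)
    ]
```

Because most candidate pairs in a comparable corpus are not parallel, a prefilter of this kind also reduces the class imbalance that the classifier has to cope with, which is one plausible reason for running it first.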

https://hal.archives-ouvertes.fr/tel-03343769
Contributor : Abes Star
Submitted on : Thursday, October 7, 2021 - 2:29:10 PM
Last modification on : Tuesday, October 19, 2021 - 11:31:25 PM

File

2021LILUH007.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-03343769, version 2

Citation

Rémi Cardon. Simplification automatique de textes techniques et spécialisés. Informatique et langage [cs.CL]. Université de Lille, 2021. Français. ⟨NNT : 2021LILUH007⟩. ⟨tel-03343769v2⟩
