Towards a Better Semantic Matching for Indexation Improvement of Error-Prone (Semi-)Structured XML Documents

Arnaud Renard 1 Sylvie Calabretto 1 Béatrice Rumpler 1
1 DRIM - Distribution, Recherche d'Information et Mobilité
LIRIS - Laboratoire d'InfoRmatique en Image et Systèmes d'information
Abstract : Documents containing errors in their textual content (which we will call noisy documents) are difficult for Information Retrieval (IR) systems to handle. The same holds for the (semi-)structured IR systems this paper deals with, and the problem is even greater when those systems rely on semantics. To exploit semantics, they need an additional external semantic resource related to the document collection. Ranking is then made possible by comparing concepts using similarity measures. These measures assume that the concepts related to the words have been identified without ambiguity. This assumption cannot be made in the presence of noisy documents, where words are potentially misspelled, resulting in a word with a different meaning or at least in a non-word. Semantic-aware (semi-)structured IR systems rely on basic concept identification but do not account for spelling uncertainties. As this can degrade system results, we suggest a way to detect and correct misspelled terms that can be used in the document pre-processing of IR systems. First results on small datasets seem promising.
Document type :
Book sections

https://hal.archives-ouvertes.fr/hal-01354866
Contributor : Équipe Gestionnaire Des Publications Si Liris
Submitted on : Friday, August 19, 2016 - 5:46:50 PM
Last modification on : Friday, January 11, 2019 - 4:35:34 PM

Citation

Arnaud Renard, Sylvie Calabretto, Béatrice Rumpler. Towards a Better Semantic Matching for Indexation Improvement of Error-Prone (Semi-)Structured XML Documents. Joaquim Filipe, José Cordeiro. Lecture Notes in Business Information Processing (LNBIP), Springer-Verlag, pp.286-298, 2011, ⟨10.1007/978-3-642-22810-0_21⟩. ⟨hal-01354866⟩
