LIA at INEX 2010 Book Track

Abstract : In this paper we describe our participation in the INEX 2010 Book Track and present our contributions. Digitized books are now a common source of information on the Web; however, OCR sometimes introduces errors that can penalize Information Retrieval. We propose a method for correcting hyphenations in the books and we analyse its impact on the Best Books for Reference task. The observed improvement is around 1%. This year we also experimented with different query expansion techniques. The first one consists of selecting informative words from a Wikipedia page related to the topic. The second one uses a dependency parser to enrich the query with the detected phrases using a Markov Random Field model. We show that there is a significant improvement over the state-of-the-art when using a large weighted list of Wikipedia words, while hyphenation correction has an impact on their distribution over the book corpus.
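
The abstract does not detail how hyphenations are corrected. As a rough illustration of the general idea only, and not the authors' method, the sketch below rejoins words split across a line break when the merged form appears in a vocabulary. The `VOCAB` set, the `HYPHEN_BREAK` pattern and the `dehyphenate` function are all hypothetical names introduced for this example.

```python
import re

# Hypothetical vocabulary (assumption): a real system would rely on a large
# lexicon or on term counts gathered from the book corpus itself.
VOCAB = {"information", "retrieval", "digitized", "books"}

# A word fragment, a hyphen, a line break, then the continuation fragment.
HYPHEN_BREAK = re.compile(r"(\w+)-\s*\n\s*(\w+)")


def dehyphenate(text, vocab=VOCAB):
    """Rejoin words split by end-of-line hyphenation.

    If the merged form is a known word, the hyphen is dropped.
    Otherwise the hyphen is kept (it may be a genuine compound)
    and only the line break is removed.
    """
    def join(match):
        left, right = match.group(1), match.group(2)
        merged = left + right
        if merged.lower() in vocab:
            return merged
        return left + "-" + right

    return HYPHEN_BREAK.sub(join, text)


if __name__ == "__main__":
    print(dehyphenate("infor-\nmation retrieval"))
    print(dehyphenate("a well-\nknown result"))
```

The vocabulary check is what separates a line-break hyphen ("infor-mation") from a genuine compound ("well-known"); without it, every split word would be merged blindly.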
Document type :
Journal articles

https://hal.archives-ouvertes.fr/hal-01314937
Contributor : Bibliothèque Universitaire Déposants Hal-Avignon
Submitted on : Thursday, May 12, 2016 - 1:51:09 PM
Last modification on : Saturday, March 23, 2019 - 1:22:19 AM

Citation

Romain Deveaud, Florian Boudin, Patrice Bellot. LIA at INEX 2010 Book Track. Lecture Notes in Computer Science, Springer, 2011, ⟨10.1007/978-3-642-23577-1_10⟩. ⟨hal-01314937⟩
