
LIA at INEX 2010 Book Track

Abstract : In this paper we describe our participation and present our contributions in the INEX 2010 Book Track. Digitized books are now a common source of information on the Web; however, OCR sometimes introduces errors that can penalize Information Retrieval. We propose a method for correcting hyphenations in the books and we analyse its impact on the Best Books for Reference task. The observed improvement is around 1%. This year we also experimented with different query expansion techniques. The first one consists of selecting informative words from a Wikipedia page related to the topic. The second one uses a dependency parser to enrich the query with the detected phrases using a Markov Random Field model. We show that there is a significant improvement over the state of the art when using a large weighted list of Wikipedia words, while hyphenation correction has an impact on their distribution over the book corpus.
Document type : Journal articles
Contributor : Bibliothèque Universitaire Déposants Hal-Avignon
Submitted on : Thursday, May 12, 2016 - 1:51:09 PM
Last modification on : Tuesday, January 14, 2020 - 4:16:28 PM

Romain Deveaud, Florian Boudin, Patrice Bellot. LIA at INEX 2010 Book Track. Lecture Notes in Computer Science, Springer, 2011, ⟨10.1007/978-3-642-23577-1_10⟩. ⟨hal-01314937⟩


