Book sections

Leveraging image, text and cross-media similarities for diversity-focused multimedia retrieval

Abstract: This chapter summarizes the cross-modal information retrieval techniques that Xerox Research Centre implemented during three years of participation in the ImageCLEF Photo tasks. The main challenge remained constant: how to optimally couple visual and textual similarities when they capture information at different semantic levels and when one of the media (the textual one) gives, most of the time, much better retrieval performance. Some core components proved very effective throughout those years: the visual similarity metrics based on the Fisher Vector representation of images, and the cross-media similarity principle based on relevance models. Other components were introduced to address additional issues: we tried different query- and document-enrichment methods that exploit auxiliary resources such as Flickr or open-source thesauri, or that apply statistical ‘semantic smoothing’. We also implemented clustering mechanisms to promote diversity in the top results and to provide faster access to relevant information. This chapter describes, analyses and assesses each of these components, namely: the monomodal similarity measures, the different cross-media similarities, the query and document enrichment, and finally the mechanisms that ensure diversity in the results proposed to the user. To conclude, we discuss the numerous lessons we have learnt over the years while trying to solve this very challenging task.

Cited literature: 25 references
Contributor: Julien Ah-Pine
Submitted on: Monday, April 10, 2017 - 12:22:52 PM
Last modification on: Tuesday, February 12, 2019 - 10:30:06 AM
Document(s) archived on: Tuesday, July 11, 2017 - 12:39:50 PM


Files produced by the author(s)


  • HAL Id: hal-01504565, version 1


Julien Ah-Pine, Stephane Clinchant, Gabriela Csurka, Florent Perronnin, Jean-Michel Renders. Leveraging image, text and cross-media similarities for diversity-focused multimedia retrieval. ImageCLEF - Experimental Evaluation in Visual Information Retrieval, Springer, pp. 315-342, 2010, 978-3-642-15181-1. ⟨hal-01504565⟩


