Finding beans in burgers: Deep semantic-visual embedding with localization

Abstract: Several works have proposed to learn two-path neural networks that map images and texts, respectively, into a single shared Euclidean space in which geometry captures useful semantic relationships. Such a multi-modal embedding can be trained and used for various tasks, notably image captioning. In the present work, we introduce a new architecture of this type, with a visual path that leverages recent space-aware pooling mechanisms. Combined with a textual path that is jointly trained from scratch, our semantic-visual embedding offers a versatile model. Once trained under the supervision of captioned images, it yields new state-of-the-art performance on cross-modal retrieval. It also allows the localization of new concepts from the embedding space in any input image, delivering state-of-the-art results on the visual grounding of phrases.
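
To make the two-path design described in the abstract concrete, here is a minimal PyTorch sketch of such a semantic-visual embedding: a convolutional visual path whose feature map is projected per location into the shared space before being spatially pooled, a textual path (word embeddings plus a GRU) trained from scratch, and a bidirectional hard-negative triplet ranking loss over in-batch pairs. The backbone choice, the plain max-pooling (a stand-in for the paper's space-aware pooling), the dimensions, and all names are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class VisualPath(nn.Module):
    """Image branch: CNN feature map, projected per location, then spatially pooled."""
    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = models.resnet18(weights=None)  # illustrative stand-in for the paper's backbone
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # keep the spatial map
        self.proj = nn.Conv2d(512, embed_dim, kernel_size=1)  # per-location projection into the shared space

    def forward(self, images):
        fmap = self.proj(self.features(images))     # (B, D, H, W): one shared-space embedding per location
        pooled = fmap.flatten(2).max(dim=2).values  # plain spatial max-pooling as a pooling stand-in
        return F.normalize(pooled, dim=1)           # unit-norm image embedding

class TextPath(nn.Module):
    """Caption branch: learned word embeddings + GRU, trained from scratch."""
    def __init__(self, vocab_size=10000, embed_dim=512, word_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.rnn = nn.GRU(word_dim, embed_dim, batch_first=True)

    def forward(self, tokens):                      # tokens: (B, T) integer word ids
        _, h = self.rnn(self.embed(tokens))         # final hidden state summarizes the caption
        return F.normalize(h.squeeze(0), dim=1)     # unit-norm caption embedding

def ranking_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional triplet loss using the hardest in-batch negatives."""
    scores = img_emb @ txt_emb.t()                  # cosine similarities (inputs are unit-norm)
    pos = scores.diag().unsqueeze(1)                # scores of matching image-caption pairs
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_c = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)      # wrong caption for an image
    cost_i = (margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0)  # wrong image for a caption
    return cost_c.max(dim=1).values.mean() + cost_i.max(dim=0).values.mean()

# Toy forward/backward pass on random data.
imgs = torch.randn(4, 3, 224, 224)
caps = torch.randint(0, 10000, (4, 12))
loss = ranking_loss(VisualPath()(imgs), TextPath()(caps))
loss.backward()
```

Because the 1x1 projection is applied before pooling, every spatial location of `fmap` already lives in the shared embedding space; comparing those local embeddings against a phrase embedding is one way such a model can localize a concept inside an image, in the spirit of the grounding results reported above.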

https://hal.archives-ouvertes.fr/hal-02171857
Contributor: Martin Engilberge
Submitted on: Wednesday, July 3, 2019 - 11:25:21 AM
Last modification on: Wednesday, July 10, 2019 - 1:35:55 AM

File

findingbeansinburger.pdf (produced by the author(s))

Citation

Martin Engilberge, Louis Chevallier, Patrick Pérez, Matthieu Cord. Finding beans in burgers: Deep semantic-visual embedding with localization. CVPR 2018 - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018, Salt Lake City, United States. pp.3984-3993, ⟨10.1109/CVPR.2018.00419⟩. ⟨hal-02171857⟩
