Multi-modal representation learning towards visual reasoning

Hedi Ben-Younes 1
1 MLIA - Machine Learning and Information Access
LIP6 - Laboratoire d'Informatique de Paris 6
Abstract: In this thesis, we are interested in Visual Question Answering (VQA), which consists in building models that answer any natural language question about any image. Because of its nature and complexity, VQA is often considered a proxy for visual reasoning. Classically, VQA architectures are designed as trainable systems that are provided with images, questions about them, and their answers. To tackle this problem, typical approaches involve modern Deep Learning (DL) techniques. In the first part, we focus on developing multi-modal fusion strategies to model the interactions between image and question representations. More specifically, we explore bilinear fusion models and exploit concepts from tensor analysis to provide tractable and expressive factorizations of the parameters. These fusion mechanisms are studied under the widely used visual attention framework: the answer to the question is produced by focusing only on the relevant image regions. In the last part, we move away from the attention mechanism and build a more advanced scene understanding architecture in which we consider objects and their spatial and semantic relations. All models are thoroughly evaluated on standard datasets, and the results are competitive with the literature.
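The idea of replacing a full bilinear interaction with a tractable factorization, as described in the abstract, can be illustrated with a minimal sketch. The dimensions, names, and rank below are illustrative assumptions, not the exact model from the thesis:

```python
import numpy as np

# Hedged sketch of low-rank bilinear fusion for VQA-style inputs.
# A full bilinear map between a question vector (d_q) and an image
# vector (d_v) producing a d_z output needs a d_q x d_v x d_z tensor;
# the factorization below replaces it with three small matrices.

rng = np.random.default_rng(0)

d_q, d_v, d_z, rank = 16, 32, 8, 4  # question, image, output dims; rank

U = rng.normal(size=(d_q, rank))  # projects the question embedding
V = rng.normal(size=(d_v, rank))  # projects the image embedding
P = rng.normal(size=(rank, d_z))  # maps the fused vector to the output

def bilinear_fusion(q, v):
    """Rank-constrained bilinear interaction:
    z_k = sum_r P[r, k] * (q @ U)[r] * (v @ V)[r]."""
    return ((q @ U) * (v @ V)) @ P

q = rng.normal(size=d_q)  # e.g. an RNN question embedding (assumed)
v = rng.normal(size=d_v)  # e.g. a CNN image feature (assumed)
z = bilinear_fusion(q, v)
print(z.shape)  # (8,)
```

The element-wise product of the two projections captures multiplicative interactions between modalities while keeping the parameter count linear in the rank, which is the tractability argument the abstract alludes to.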
Contributor: Hedi Ben-Younes
Submitted on: Monday, July 29, 2019 - 10:43:20 AM
Last modification on: Wednesday, July 31, 2019 - 1:30:01 AM
  • HAL Id: tel-02196626, version 1


Hedi Ben-Younes. Multi-modal representation learning towards visual reasoning. Artificial Intelligence [cs.AI]. EDITE de Paris, 2019. English. ⟨tel-02196626⟩


