VisQA: X-raying Vision and Language Reasoning in Transformers

Visual Question Answering (VQA) systems aim to answer open-ended textual questions about input images. They are a testbed for learning high-level reasoning, with a primary use in HCI, for instance assistance for the visually impaired. Recent research has shown that state-of-the-art models tend to produce answers by exploiting biases and shortcuts in the training data, sometimes without even looking at the input image, instead of performing the required reasoning steps. We present VisQA, a visual analytics tool that explores this question of reasoning vs. bias exploitation. It exposes the key element of state-of-the-art neural models: attention maps in transformers. Our working hypothesis is that the reasoning steps leading to model predictions are observable from attention distributions, which are particularly useful for visualization. The design process of VisQA was motivated by well-known bias examples from the fields of deep learning and vision-language reasoning, and the tool was evaluated in two ways. First, as the result of a collaboration between three fields (machine learning, vision and language reasoning, and data analytics), the work led to a better understanding of bias exploitation by neural models for VQA, which eventually influenced their design and training through the proposal of a method for transferring reasoning patterns from an oracle model. Second, we report on the design of VisQA and on a goal-oriented evaluation targeting the analysis of a model's decision process by multiple experts, providing evidence that VisQA makes the inner workings of models accessible to users.

Keywords

Transformers, Visual Question Answering, Visual analytics, XAI

Domains

Artificial Intelligence [cs.AI], Human-Computer Interaction [cs.HC]

Main file

_VIS_21__VisQA_Camera_ready.pdf (4.42 MB)

Origin: Files produced by the author(s)

Contributor: Théo Jaunet

https://hal.science/hal-03293079

Submitted on: Tuesday, July 20, 2021, 17:03:28

Last modified on: Wednesday, March 27, 2024, 09:16:03

Long-term archiving on: Thursday, October 21, 2021, 19:00:12

Dates and versions

hal-03293079, version 1 (20-07-2021)

Identifiers

HAL Id: hal-03293079, version 1
DOI : 10.1109/TVCG.2021.3114683

Cite

Theo Jaunet, Corentin Kervadec, Romain Vuillemot, Grigory Antipov, Moez Baccouche, et al. VisQA: X-raying Vision and Language Reasoning in Transformers. IEEE Transactions on Visualization and Computer Graphics, 2021, ⟨10.1109/TVCG.2021.3114683⟩. ⟨hal-03293079⟩
