Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?

Recent advances in vision-and-language modeling have seen the development of Transformer architectures that achieve remarkable performance on multimodal reasoning tasks. Yet the exact capabilities of these black-box models are still poorly understood. While much previous work has focused on their ability to learn meaning at the word level, their ability to track syntactic dependencies between words has received less attention. We take a first step toward closing this gap by creating a new multimodal task that evaluates understanding of predicate-noun dependencies in a controlled setup. We evaluate a range of state-of-the-art models and find that their performance on the task varies considerably, with some models performing relatively well and others at chance level. Our analyses of this variability indicate that the quality (and not only the sheer quantity) of pretraining data is essential. Moreover, the best-performing models leverage fine-grained multimodal pretraining objectives alongside the standard image-text matching objectives. This study highlights that targeted and controlled evaluations are a crucial step toward precise and rigorous testing of the multimodal knowledge of vision-and-language models.

Domains

Computation and Language [cs.CL]; Computer Vision and Pattern Recognition [cs.CV]

Main file

2327_Paper-2.pdf (700.77 KB)

Origin: Files produced by the author(s)

Contributor: Mitja Nikolaus

https://hal.science/hal-03823490

Submitted on: Friday, October 21, 2022, 00:23:51

Last modified on: Friday, March 22, 2024, 18:24:04

Dates and versions

hal-03823490, version 1 (21-10-2022)

Identifiers

HAL Id: hal-03823490, version 1

Cite

Mitja Nikolaus, Emmanuelle Salin, Stéphane Ayache, Abdellah Fourtassi, Benoît Favre. Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies? Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, Abu Dhabi, United Arab Emirates. ⟨hal-03823490⟩

Collections

UNIV-TLN CNRS UNIV-AMU GENCI LPL-AIX ILCB LIS-LAB ANR INCIAM
