Emerging Properties in Self-Supervised Vision Transformers

Mathilde Caron; Hugo Touvron; Ishan Misra; Hervé Jegou; Julien Mairal; Piotr Bojanowski; Armand Joulin

Conference paper, Year: 2021

Mathilde Caron

Role: Author
PersonId: 1046708

Facebook AI Research [Paris]

Apprentissage de modèles à partir de données massives (Learning models from massive data)

Hugo Touvron

Role: Author

Facebook AI Research [Paris]

Sorbonne Université

Ishan Misra

Role: Author

Facebook AI Research [Paris]

Hervé Jegou

Role: Author

Facebook AI Research [Paris]

Julien Mairal

Role: Author
PersonId: 1034832
ORCID: 0000-0001-6991-2110
IdRef: 152125256

Apprentissage de modèles à partir de données massives (Learning models from massive data)

Piotr Bojanowski

Role: Author

Facebook AI Research [Paris]

Armand Joulin

Role: Author

Facebook AI Research [Paris]

Abstract

In this paper, we ask whether self-supervised learning gives Vision Transformers (ViTs) new properties that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of the momentum encoder, multi-crop training, and the use of small patches with ViTs. We implement our findings in a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
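As a rough illustration of two ingredients the abstract names, the sketch below shows a momentum (exponential moving average) update of a teacher from a student, and a self-distillation loss computed as the cross-entropy between their output distributions. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the temperatures, and the momentum value are hypothetical choices made for this example and do not come from this record, and details such as multi-crop training are left out.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy of the student distribution against the teacher's.

    The temperatures are illustrative assumptions; no labels are used.
    """
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


def momentum_update(teacher_params, student_params, momentum=0.996):
    """EMA update: teacher <- momentum * teacher + (1 - momentum) * student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


# Usage: the loss is small when the student agrees with the teacher.
agree = distillation_loss([2.0, 0.0], [2.0, 0.0])
disagree = distillation_loss([0.0, 2.0], [2.0, 0.0])
new_teacher = momentum_update([1.0, 1.0], [0.0, 2.0], momentum=0.9)
```

In this setup only the student would receive gradients from the loss, while the teacher follows the student through the moving average, which is what makes it a momentum encoder.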

Domains

Computer Science [cs]

Main file

main.pdf (29.3 MB)

Origin: Files produced by the author(s)

https://hal.science/hal-03323359

Submitted on: Tuesday, August 24, 2021, 13:32:44

Last modified on: Thursday, April 4, 2024, 21:33:09

Long-term archiving on: Friday, November 26, 2021, 09:22:35

Dates and versions

hal-03323359, version 1 (24-08-2021)

Identifiers

HAL Id: hal-03323359, version 1

Cite

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jegou, Julien Mairal, et al. Emerging Properties in Self-Supervised Vision Transformers. ICCV 2021 - International Conference on Computer Vision, Oct 2021, Virtual, France. pp.1-21. ⟨hal-03323359⟩

Collections

UNIV-RENNES1 UGA CNRS INRIA IRISA LJK LJK_GI INRIA2 LJK-GI-THOTH UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES SORBONNE-UNIVERSITE MIAI ANR UR1-MATH-NUM

767 Views

100 Downloads
