When adversarial attacks become interpretable counterfactual explanations - HAL open archive
Preprint / Working Paper, Year: 2022

When adversarial attacks become interpretable counterfactual explanations

Mathieu Serrurier
Franck Mamalet
Thomas Fel
Louis Béthune
Thibaut Boissin

Abstract

We argue that, when learning a 1-Lipschitz neural network with the dual loss of an optimal transportation problem, the gradient of the model is both the direction of the transportation plan and the direction to the closest adversarial attack. Traveling along the gradient to the decision boundary is then no longer an adversarial attack but becomes a counterfactual explanation, explicitly transporting from one class to the other. Through extensive experiments on XAI metrics, we find that the simple saliency map method, applied to such networks, becomes a reliable explanation and outperforms state-of-the-art explanation approaches on unconstrained models. The proposed networks were already known to be certifiably robust, and we prove that they are also explainable with a fast and simple method.
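As an illustration of the mechanism described in the abstract, the sketch below (not taken from the paper) shows how the input gradient of such a network can serve both as a saliency map and as a direction to walk toward the decision boundary, yielding a counterfactual example. It is written in plain PyTorch and assumes a hypothetical pretrained 1-Lipschitz binary classifier f that returns a single logit per example; the step size and iteration budget are arbitrary illustrative choices, not values from the paper.

import torch

def saliency_map(f, x):
    # Gradient of the scalar output f(x) with respect to the input x.
    # On a 1-Lipschitz network trained with an HKR-style loss, this gradient
    # is the quantity the abstract describes as the transport direction.
    x = x.clone().detach().requires_grad_(True)
    f(x).sum().backward()
    return x.grad.detach()

def counterfactual(f, x, step=0.05, max_steps=200):
    # Walk along the unit-norm input gradient until the sign of f flips,
    # i.e. until the decision boundary is crossed.
    # step and max_steps are arbitrary illustrative values.
    x_cf = x.clone().detach()
    sign0 = torch.sign(f(x_cf)).item()
    for _ in range(max_steps):
        if torch.sign(f(x_cf)).item() != sign0:
            break  # boundary crossed: x_cf now lies in the other class
        g = saliency_map(f, x_cf)
        g = g / (g.norm() + 1e-12)          # unit-norm direction
        x_cf = x_cf - sign0 * step * g      # move toward the other class
    return x_cf

For any 1-Lipschitz classifier, |f(x)| lower-bounds the distance to the decision boundary, which is why the same gradient walk that would act as an adversarial attack on an unconstrained model here traces an interpretable displacement from one class to the other, consistent with the certified-robustness property mentioned in the abstract.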
Main file: hkr_explainability_Arxiv.pdf (28.44 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03693355 , version 1 (10-06-2022)
hal-03693355 , version 2 (20-06-2023)
hal-03693355 , version 3 (02-02-2024)

Identifiers

hal-03693355

Cite

Mathieu Serrurier, Franck Mamalet, Thomas Fel, Louis Béthune, Thibaut Boissin. When adversarial attacks become interpretable counterfactual explanations. 2022. ⟨hal-03693355v1⟩
146 views
64 downloads

