Unjustified Classification Regions and Counterfactual Explanations In Machine Learning

Post-hoc interpretability approaches are powerful tools for generating explanations of predictions made by a trained black-box model, but they have been shown to be vulnerable to issues caused by the classifier's lack of robustness. In particular, this paper focuses on the notion of explanation justification, defined as connectedness to ground-truth data, in the context of counterfactuals. In this work, we explore the extent of the risk of generating unjustified explanations. We propose an empirical study to assess the vulnerability of classifiers and show that the chosen learning algorithm heavily impacts the vulnerability of the model. Additionally, we show that state-of-the-art post-hoc counterfactual approaches can minimize the impact of this risk by generating less local explanations. Source code is available at: https://github.com/thibaultlaugel/truce

Domains

Computer Science [cs] Artificial Intelligence [cs.AI] Machine Learning [cs.LG]

Contributor: Christophe Marsala

https://hal.sorbonne-universite.fr/hal-02275348

Submitted on: Friday, 30 August 2019, 16:16:48

Last modified on: Thursday, 4 January 2024, 22:26:03

Dates and versions

hal-02275348, version 1 (30-08-2019)

Identifiers

HAL Id: hal-02275348, version 1
DOI: 10.1007/978-3-030-46147-8_3

Cite

Thibault Laugel, Marie-Jeanne Lesot, Christophe Marsala, Xavier Renard, Marcin Detyniecki. Unjustified Classification Regions and Counterfactual Explanations In Machine Learning. Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019, Sep 2019, Würzburg, Germany. pp.37-54, ⟨10.1007/978-3-030-46147-8_3⟩. ⟨hal-02275348⟩

Collections

CNRS LIP6 SORBONNE-UNIVERSITE SU-SCIENCES
