Benchmarking Explainable AI Methods for Vision Transformer-based Diabetic Retinopathy Analysis

Cascone, Lucia; Campiglia, Pietro; Nappi, Michele; Narducci, Fabio; Simone, Benedetto
2026

Abstract

Vision Transformers have achieved notable results in diabetic retinopathy (DR) classification, yet their limited interpretability remains a critical barrier to clinical adoption. Existing explainability studies often lack systematic comparisons across XAI paradigms and rely primarily on qualitative assessment. This work addresses this gap by presenting a comprehensive benchmark of six post-hoc explainability methods (Grad-CAM, Grad-CAM++, Score-CAM, Raw Attention, Attention Rollout, and AGCAM) applied to a fixed state-of-the-art Vision Transformer (Dino2-DR) using expert-annotated fundus images. Methods are evaluated along complementary dimensions: lesion localization accuracy, intrinsic reliability (complexity, faithfulness, and robustness), and computational efficiency. Results indicate that no single method simultaneously optimizes all criteria. Grad-CAM achieves the strongest detection balance, Raw Attention provides superior spatial delineation, and AGCAM combines broad lesion coverage with the highest causal alignment to model predictions. Localization accuracy and intrinsic reliability emerge as partially independent quality dimensions, with gradient-based approaches favoring precision and hybrid methods emphasizing contextual coverage. Runtime analysis further highlights substantial efficiency differences across methods, identifying Grad-CAM and AGCAM as the most practical candidates for interactive deployment, while Score-CAM and Attention Rollout incur prohibitive overhead. These findings underscore the necessity of multi-dimensional explainability evaluation and provide actionable guidance for task-aware selection of XAI methods in Vision Transformer-based ophthalmic imaging.

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4944275