Mel Spectrogram-Based CNN Framework for Explainable Audio Deepfake Detection

Castiglione A.; Pero C.
2025

Abstract

Audio deepfakes are a growing concern for media credibility, particularly on social platforms. This study explores audio deepfake detection using Convolutional Neural Networks (CNNs) applied to Mel spectrograms, which serve as visual representations of audio signals. Six CNN architectures (VGG16, VGG19, ResNet50, DenseNet121, MobileNetV2, and EfficientNetB0) were evaluated on the FakeAVCelebV2 dataset in terms of precision, recall, F1-score, and accuracy. To provide insight into model decisions, Grad-CAM, an Explainable Artificial Intelligence (XAI) technique, was used to highlight the spectrogram regions most relevant for distinguishing real from fake audio. Performance was also tested with added Gaussian and white noise to assess robustness. The results confirm that CNN-based Mel spectrogram analysis is an effective method for audio deepfake detection and underline the importance of interpretability for trustworthy media detection systems.
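The evaluation described in the abstract (scoring real-versus-fake predictions, then repeating the test on noisy audio) can be illustrated with a minimal sketch. It is not the authors' code. The function names `add_gaussian_noise` and `binary_metrics`, the use of a target signal-to-noise ratio in dB, and the convention that label 1 means "fake" are all assumptions made for illustration; the abstract does not state the noise levels or label encoding used.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, seed=0):
    """Return a copy of `signal` with zero-mean Gaussian noise added.

    Assumption: the noise level is set through a target SNR in dB.
    The paper's actual noise parameters are not given in the abstract.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1-score, and accuracy for binary labels.

    Assumption: `positive` (1 by default) marks the "fake" class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

In a robustness test along these lines, each waveform would be passed through `add_gaussian_noise` before its Mel spectrogram is computed, and `binary_metrics` would then be compared between the clean and noisy runs.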
ISBN: 9783031877834
ISBN: 9783031877841

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4910236