Mel Spectrogram-Based CNN Framework for Explainable Audio Deepfake Detection
Castiglione A.;Pero C.
2025
Abstract
Audio deepfakes are a growing concern for media credibility, particularly on social platforms. This study explores an approach to detecting audio deepfakes using Convolutional Neural Networks (CNNs) applied to Mel spectrograms, which serve as visual representations of audio signals. Six CNN architectures (VGG16, VGG19, ResNet50, DenseNet121, MobileNetV2, and EfficientNetB0) were evaluated on the FakeAVCelebV2 dataset using metrics such as precision, recall, F1-score, and accuracy. To provide better insight into model decisions, Grad-CAM, an Explainable Artificial Intelligence (XAI) technique, was employed to highlight the regions of the spectrogram most relevant for distinguishing between real and fake audio. The study also tested model performance with added Gaussian and white noise to assess robustness. The results confirm that CNN-based Mel spectrogram analysis is an effective method for audio deepfake detection, and they underline the importance of interpretability for trustworthy media detection systems.
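The noise-robustness test mentioned in the abstract can be illustrated with a minimal sketch of additive white Gaussian noise at a chosen signal-to-noise ratio (SNR). The abstract does not state which noise levels or tooling were used, so the function names, the 10 dB SNR, and the 440 Hz test tone below are illustrative assumptions, not the authors' setup. The sketch uses only the Python standard library so that it runs as-is.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, seed=None):
    """Return a copy of `signal` with zero-mean white Gaussian noise at `snr_db` dB SNR."""
    rng = random.Random(seed)
    # Average power of the clean signal.
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR(dB) = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR/10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    # Independent samples per time step make the noise white as well as Gaussian.
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz (assumed values).
sample_rate = 16000
clean = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
noisy = add_gaussian_noise(clean, snr_db=10, seed=0)
```

In a robustness check of this kind, the noisy waveform would be converted to a Mel spectrogram and passed to the trained classifier, and the metrics compared with those on clean audio. Fixing the seed keeps the perturbation reproducible across runs.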