RSDiX: Lightweight and Data-Efficient VLMs for Remote Sensing through Self-Distillation

Terlizzi, Andrea; Bardozzo, Francesco; Tagliaferri, Roberto
2025

Abstract

Remote sensing (RS) imagery plays a pivotal role in a wide range of applications, and recent deep learning models integrate nuanced linguistic information to enhance semantic understanding. This work introduces RSDiX-CLIP, a fine-tuned CLIP model that addresses intra-class similarity in RS image datasets and improves data efficiency through the OTTER self-distillation framework. Additionally, we propose RSDiX-CLIPCap, a variant of the CLIPCap framework that incorporates a pre-trained RSDiX-CLIP. Our models outperform state-of-the-art methods on zero-shot RS image classification across several model scales and attain competitive RS image captioning results, while being smaller and more data-efficient than existing methods. We also explore the impact of mixed distillation strategies and alternative contrastive learning frameworks, introducing RSDiX-CLIP-S-BERT, which employs a text-only model from the Sentence-BERT family as the teacher, and RSDiX-SigLIP, which builds on the SigLIP contrastive learning framework. We present a novel RS captioning dataset, S2LCD, consisting of 1533 Sentinel-2 images paired with 7665 diverse, detailed, wide-vocabulary captions. Finally, we challenge traditional N-gram-based captioning metrics such as the BLEU score, providing statistical evidence that semantic scores like Sentence-BERT similarity are more effective. These advancements aim to improve the data efficiency of deep learning models for RS image-text tasks, offering promising avenues for further exploration in the field. Code and data are available at https://github.com/NeuRoNeLab/RSDiX-CLIP.
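To make the metric comparison concrete, the following is a minimal Python sketch of how an N-gram metric like BLEU and a semantic score like Sentence-BERT similarity can diverge on a paraphrased caption. The sentence-transformers checkpoint (all-MiniLM-L6-v2), the example captions, and the max-over-references aggregation are illustrative assumptions, not the paper's exact evaluation setup.

    # Contrast BLEU (exact N-gram overlap) with Sentence-BERT cosine similarity
    # on a caption that paraphrases its references. Requires nltk and
    # sentence-transformers; the checkpoint below is an illustrative choice.
    from nltk.translate.bleu_score import sentence_bleu
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    references = [
        "a harbor with several boats docked along the pier",
        "many small boats are moored in a marina",
    ]
    candidate = "numerous vessels anchored at a seaside dock"

    # BLEU counts exact N-gram matches, so a correct paraphrase with little
    # word overlap scores near zero.
    bleu = sentence_bleu([r.split() for r in references], candidate.split())

    # Sentence-BERT embeds whole sentences and compares their meaning; here we
    # take the best cosine similarity over the available reference captions.
    emb_refs = model.encode(references, convert_to_tensor=True)
    emb_cand = model.encode(candidate, convert_to_tensor=True)
    sbert_sim = util.cos_sim(emb_cand, emb_refs).max().item()

    print(f"BLEU: {bleu:.3f}  |  S-BERT similarity: {sbert_sim:.3f}")

On paraphrases like this, BLEU collapses toward zero while the embedding-based score stays high, which is the behavior the abstract's statistical comparison is concerned with.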
Files in this record:
No files are associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4935416
Citations
  • Scopus 0