Improving Rock Detection Accuracy by Using Vision-Language Models With Textual–Visual Prompt Alignment

Carletti, Vincenzo;Greco, Antonio;Saggese, Alessia;Spingola, Camilla;Vento, Bruno
2026

Abstract

The reliable detection of unconventional obstacles, such as rocks, is essential for ensuring safe and continuous railway operations. However, this task is challenging due to the visual heterogeneity of rocks and the scarcity of annotated railway data. Existing vision-based systems often rely on conventional classifiers, whose generalization ability is limited in this setting. In this work, we introduce Textual-Visual Alignment for Rock Detection (TVA-Rock), a parameter-efficient textual-visual alignment framework that leverages the representational strength of vision-language models (VLMs) for rock detection within a two-stage pipeline comprising track segmentation and patch-wise classification. TVA-Rock employs a sequential textual-visual prompt alignment strategy, in which textual prompts are first optimized to encode task-relevant semantics and are subsequently used to guide visual prompt tuning, yielding rock-aware visual representations without modifying backbone weights. This sequential design stabilizes multimodal optimization, mitigates overfitting in data-scarce regimes, and enables efficient inference by discarding the textual branch after training. Extensive experiments on realistic onboard-camera samples show that TVA-Rock outperforms existing approaches, which are also more complex, achieving state-of-the-art performance in identifying small and visually irregular rocks. Ablation studies further validate the incremental contributions of textual prompting and visual prompting, as well as the generalization of the approach across different VLM backbones. These results demonstrate the potential of the proposed method as a robust and efficient strategy for the detection of rocks on railway tracks.
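The abstract notes that the textual branch can be discarded after training and that classification is done patch by patch. The following is a minimal sketch of one way such an inference step could look, not the authors' implementation. It assumes that the class text embeddings were cached once after training and that each track patch has already been embedded by a visual encoder. The names `classify_patches` and `CLASS_EMBEDDINGS`, the cosine-similarity decision rule, and the toy 3-dimensional vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_patches(patch_embeddings, class_embeddings):
    """Assign each patch the class whose cached embedding is most similar."""
    labels = []
    for emb in patch_embeddings:
        scores = {name: cosine(emb, ref) for name, ref in class_embeddings.items()}
        labels.append(max(scores, key=scores.get))
    return labels


# Hypothetical class embeddings, assumed to be computed once after training
# so that no text encoder is needed at inference time (toy values).
CLASS_EMBEDDINGS = {
    "rock": [0.9, 0.1, 0.0],
    "background": [0.0, 0.2, 0.9],
}

# Hypothetical embeddings of three patches cropped from the segmented track.
patches = [
    [0.8, 0.2, 0.1],
    [0.1, 0.1, 0.7],
    [0.05, 0.3, 0.95],
]

labels = classify_patches(patches, CLASS_EMBEDDINGS)
rock_found = "rock" in labels  # flag the frame if any patch is classified as rock
print(labels, rock_found)
```

In this sketch the only per-frame cost is embedding the patches and comparing them with a small set of fixed vectors, which illustrates why dropping the textual branch can make inference cheaper.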

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4951255