UniSa - IRIS Institutional Research Information System

Parvelous: pedestrian attribute recognition using a vision encoder for multi-task learning on unbalanced sample distributions

Greco, Antonio; Ricciardi, Andrea Vincenzo; Vento, Bruno; Vitale, Antonio

2026

Abstract

Pedestrian attribute recognition (PAR) is a critical task for real-time video surveillance and person re-identification in the wild. While modern vision–language models pre-trained on billions of image–text pairs have recently achieved outstanding accuracy, their substantial latency and high memory requirements make them impractical for real-world deployments. To overcome these limitations, we present Parvelous, an efficient and versatile multi-task framework built on an optimized vision encoder pre-trained for image–text matching and specifically adapted to the PAR domain. Through targeted architectural refinements and a selective layer-wise fine-tuning strategy, our framework ensures both efficiency and strong task specialization. Modular task-specific branches equipped with channel-wise attention are tailored in depth to address the varying complexity of binary and multi-class attribute recognition tasks. Additionally, a multi-task loss based on asymmetric loss functions mitigates the severe class imbalance inherent in standard PAR datasets, fostering robust learning across diverse attributes. Extensive evaluations on the public MIVIA PAR KD benchmark demonstrate that Parvelous achieves an accuracy of 0.957, setting a new state of the art, while delivering up to an 80-fold inference speedup compared to competing vision–language models. This combination of accuracy and computational efficiency positions Parvelous as a practical and deployable solution for real-world PAR applications.
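To make the idea of an asymmetric multi-task loss more concrete, the following is a minimal sketch and not the authors' implementation. It assumes one common asymmetric-loss formulation for binary attributes: separate focusing exponents for positive and negative samples plus a probability margin that discards very easy negatives. The function names, the default hyperparameters (`gamma_pos`, `gamma_neg`, `margin`), the task names, and the uniform task weights are all hypothetical choices made for illustration; the abstract does not specify them, and multi-class tasks are left out for brevity.

```python
import math


def asymmetric_binary_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Asymmetric loss for a single binary prediction (illustrative sketch).

    p is the predicted probability of the positive class, y is 0 or 1.
    All default hyperparameters are assumptions, not values from the paper.
    """
    if y == 1:
        # Positive sample: focal-style weighting with exponent gamma_pos.
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    # Negative sample: shift the probability down by the margin so that
    # very easy negatives (p <= margin) contribute nothing to the loss.
    p_shift = max(p - margin, 0.0)
    return -(p_shift ** gamma_neg) * math.log(max(1.0 - p_shift, eps))


def multi_task_loss(probs, targets, task_weights=None):
    """Weighted sum of the mean asymmetric loss of each task.

    probs and targets map a task name to equally long, non-empty lists of
    predicted probabilities and 0/1 labels. Missing weights default to 1.0.
    """
    total = 0.0
    for task, p_list in probs.items():
        y_list = targets[task]
        task_loss = sum(
            asymmetric_binary_loss(p, y) for p, y in zip(p_list, y_list)
        ) / len(p_list)
        weight = 1.0 if task_weights is None else task_weights.get(task, 1.0)
        total += weight * task_loss
    return total


if __name__ == "__main__":
    # Hypothetical binary attributes with few positives, as in imbalanced data.
    probs = {"hat": [0.9, 0.2, 0.1, 0.03], "bag": [0.6, 0.4, 0.05, 0.02]}
    targets = {"hat": [1, 0, 0, 0], "bag": [1, 0, 0, 0]}
    print(multi_task_loss(probs, targets))
```

With `gamma_neg` larger than `gamma_pos`, the many easy negative samples are down-weighted much more strongly than the rare positives, which is the mechanism such losses rely on to counter class imbalance. Setting both exponents and the margin to zero reduces each term to plain binary cross-entropy.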

Year

2026

Appears in types:

1.1.1 Journal article with DOI

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4942321
