Parvelous: pedestrian attribute recognition using a vision encoder for multi-task learning on unbalanced sample distributions
Greco, Antonio;Ricciardi, Andrea Vincenzo;Vitale, Antonio
2026
Abstract
Pedestrian attribute recognition (PAR) is a critical task for real-time video surveillance and person re-identification in the wild. While modern vision–language models pre-trained on billions of image–text pairs have recently achieved outstanding accuracy, their substantial latency and high memory requirements make them impractical for real-world deployments. To overcome these limitations, we present Parvelous, an efficient and versatile multi-task framework built on an optimized vision encoder pre-trained for image–text matching and specifically adapted to the PAR domain. Through targeted architectural refinements and a selective layer-wise fine-tuning strategy, our framework ensures both efficiency and strong task specialization. Modular task-specific branches equipped with channel-wise attention are tailored in depth to address the varying complexity of binary and multi-class attribute recognition tasks. Additionally, a multi-task loss based on asymmetric loss functions mitigates the severe class imbalance inherent in standard PAR datasets, fostering robust learning across diverse attributes. Extensive evaluations on the public MIVIA PAR KD benchmark demonstrate that Parvelous achieves an accuracy of 0.957, setting a new state of the art, while delivering up to an 80-fold inference speedup over competing vision–language models. This combination of accuracy and computational efficiency positions Parvelous as a practical and deployable solution for real-world PAR applications.
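To make the idea of an asymmetric multi-task loss concrete, the following is a minimal sketch for binary attributes only. The abstract does not give the exact formulation, so everything here is an assumption: the loss follows the commonly used asymmetric form (separate focusing exponents for positives and negatives plus a probability margin on negatives), the function names `asymmetric_binary_loss` and `multi_task_loss` are hypothetical, the default hyperparameters are illustrative, and the per-task averaging is one plausible way to combine tasks rather than the authors' method.

```python
import math


def asymmetric_binary_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Asymmetric loss for one binary prediction (illustrative formulation).

    p: predicted probability of the positive class, in [0, 1].
    y: ground-truth label, 0 or 1.
    gamma_pos / gamma_neg: focusing exponents (assumed values, not from the paper).
    margin: probability shift applied to negatives (assumed value).
    """
    if y == 1:
        # Positive sample: focal-style term, down-weighted as p approaches 1.
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    # Negative sample: shift the probability so very easy negatives contribute
    # exactly zero, then down-weight the remaining ones with gamma_neg.
    p_shifted = max(p - margin, 0.0)
    return -(p_shifted ** gamma_neg) * math.log(max(1.0 - p_shifted, eps))


def multi_task_loss(predictions, labels, **loss_kwargs):
    """Average the per-task mean loss over all tasks (hypothetical combination rule).

    predictions: dict mapping task name -> list of predicted probabilities.
    labels: dict mapping task name -> list of 0/1 labels, same lengths.
    """
    task_losses = []
    for task, probs in predictions.items():
        losses = [
            asymmetric_binary_loss(p, y, **loss_kwargs)
            for p, y in zip(probs, labels[task])
        ]
        task_losses.append(sum(losses) / len(losses))
    return sum(task_losses) / len(task_losses)


if __name__ == "__main__":
    # Toy batch with two hypothetical binary attributes.
    preds = {"bag": [0.9, 0.2, 0.1], "hat": [0.6, 0.03, 0.4]}
    truth = {"bag": [1, 0, 0], "hat": [1, 0, 0]}
    print(multi_task_loss(preds, truth))
```

With a larger exponent on negatives than on positives, the many easy negative samples of a rare attribute contribute very little to the total, so the gradient is dominated by the scarce positives and the hard negatives. This is the general mechanism by which asymmetric losses counter class imbalance.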


