
Imitation Learning for Autonomous Vehicle Driving: How Does the Representation Matter?

Greco A.;Rundo L.;Saggese A.;Vento M.;Vicinanza A.
2022

Abstract

Autonomous vehicle driving is gaining ground and receiving increasing attention from the academic and industrial communities. Despite this considerable effort, a systematic and fair analysis of input representations, based on a careful experimental evaluation within the same framework, is still lacking. To this aim, this work proposes the first comprehensive, comparative analysis of the most common inputs that can be processed by a conditional imitation learning (CIL) approach. More specifically, we considered combinations of raw and processed data, namely RGB images, depth (D) images and semantic segmentation (S), as inputs to the well-established Conditional Imitation Learning with ResNet and Speed prediction (CILRS) architecture. We performed a benchmark analysis, supported by statistical tests, on the CARLA simulator to compare the considered configurations. The results showed that RGB outperformed the other monomodal inputs in terms of success rate on the most popular benchmark, NoCrash. However, RGB did not generalize well when tested under different weather conditions; overall, the best multimodal configuration was the combination of RGB images and semantic segmentation (i.e., RGBS), especially in regular and dense traffic scenarios. This confirms that an appropriate fusion of multimodal sensors is an effective approach in autonomous vehicle driving.
ISBN: 978-3-031-06426-5
ISBN: 978-3-031-06427-2
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4804771
Warning: the displayed data have not been validated by the university.

Citations
  • PMC: not available
  • Scopus: 0
  • Web of Science: not available