Robust speech command recognition in challenging industrial environments
Bini S.;Carletti V.;Saggese A.;Vento M.
2024-01-01
Abstract
Speech is among the main forms of communication between humans and robots in industrial settings, being the most natural way for a human worker to issue commands. However, pervasive and loud environmental noise poses significant challenges to the adoption of Speech-Command Recognition systems onboard manufacturing robots: such systems are expected to run in real time on hardware with limited computational capabilities while remaining robust and accurate in these complex environments. In this paper, we propose an innovative system based on an End-to-End architecture with a Conformer backbone. The system is specifically designed to achieve high accuracy in noisy industrial environments with a minimal computational burden, so that it meets stringent real-time requirements when running on computing devices embedded in robots. To increase the generalization capability of the system, the training procedure is driven by a Curriculum Learning strategy combined with dynamic data augmentation techniques that progressively increase the complexity of the input samples by raising the noise level during the training phase. We have conducted extensive experiments to assess the effectiveness of the system, using a dataset of more than 50,000 samples, of which about 2,000 were acquired during the daily operations of a Stellantis factory in Italy. The results confirm that the proposed approach is suitable for adoption in a real industrial environment: on both English and Italian commands it achieves an accuracy higher than 90%, while maintaining a compact model size (the network is 1.81 MB) and running in real time on an industrial embedded device (41 ms on an NVIDIA Xavier NX).
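The abstract does not say how the noise level is scheduled or mixed in, so the following is only a minimal sketch of the general idea of curriculum-driven noise augmentation, not the authors' procedure. It assumes a linear schedule that lowers the target signal-to-noise ratio (SNR) from epoch to epoch, and it mixes a noise clip into a speech clip at that SNR. The function names, the 30 dB to 0 dB range, and the linear schedule are all hypothetical choices made for illustration.

```python
import math


def curriculum_snr_db(epoch, total_epochs, start_snr_db=30.0, end_snr_db=0.0):
    """Target SNR for an epoch (assumed linear schedule, hypothetical range).

    Early epochs get a high SNR (easy, nearly clean samples); later epochs
    get a lower SNR (harder, noisier samples).
    """
    if total_epochs <= 1:
        return end_snr_db
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start_snr_db + progress * (end_snr_db - start_snr_db)


def mean_power(samples):
    """Average power of a sequence of audio samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_speech == 0.0 or p_noise == 0.0:
        # Nothing meaningful to scale against; return the speech unchanged.
        return list(speech)
    # Solve p_speech / (gain^2 * p_noise) = 10^(snr_db / 10) for gain.
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def augment_for_epoch(speech, noise, epoch, total_epochs):
    """Dynamic augmentation: the noise level depends on the current epoch."""
    return mix_at_snr(speech, noise, curriculum_snr_db(epoch, total_epochs))
```

Because the mixing happens at sample-loading time rather than once offline, the same utterance is seen at a different noise level in each epoch, which is what makes the augmentation dynamic in this sketch.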