A multi-task network for speaker and command recognition in industrial environments
Bini, Stefano; Percannella, Gennaro; Saggese, Alessia; Vento, Mario
2023-01-01
Abstract
In industrial environments, it is crucial to establish a strong collaboration between humans and robots to enhance productivity. However, the nature of the work demands that workers have the authority to provide specific instructions to the robots. The scientific community has extensively investigated these dual requirements, aiming to develop advanced systems capable of recognizing voice commands and implementing speaker authentication. Nevertheless, in the industrial context, these tasks should be executed simultaneously on low-cost and low-power embedded devices that can be mounted on board the robotic platform. To overcome this challenge, we propose a multi-task network for Speech-Command Recognition and Speaker Identification. Additionally, we employ the GradNorm adaptive algorithm to address the issue of task imbalance. To evaluate the proposed system, we introduce a new dataset, MIVIA-ISC, consisting of 20,857 samples uttered by 562 speakers for 31 distinct commands. Our approach significantly reduces the network size by 47% and its execution time by 48% compared to the commonly used methodology, which employs one network for each task. Furthermore, our approach demonstrates a significant improvement in the accuracy of the Speaker Identification task, achieving an 11% increase compared to the corresponding single-task network. Importantly, this enhancement is achieved without compromising the accuracy of the Speech-Command Recognition task, which experiences only a minimal 3% decrease in performance.

File | Size | Format |
---|---|---|
1-s2.0-S0167865523002945-main.pdf (not available). Type: Publisher's version (published version with the publisher's layout). License: Publisher's copyright | 774.37 kB | Adobe PDF |
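
The abstract names GradNorm as the mechanism used to balance the two tasks but gives no implementation details. As a rough illustration only, the sketch below shows one update of the per-task loss weights in the style of the general GradNorm algorithm. It is not the authors' code: the function name `gradnorm_step`, the values of `alpha` and `lr`, and all numbers in the usage example are assumptions chosen for illustration, and the gradient norms are passed in as plain numbers rather than computed from a real network.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def gradnorm_step(weights, grad_norms, losses, initial_losses,
                  alpha=1.5, lr=0.025):
    """One GradNorm-style update of the task loss weights (illustrative).

    weights:        current loss weights w_i, one per task
    grad_norms:     norm of the gradient of each unweighted task loss
                    with respect to the shared layer (assumed given)
    losses:         current task losses L_i(t)
    initial_losses: task losses at the first training step L_i(0)
    alpha, lr:      hypothetical hyperparameter values
    """
    n = len(weights)
    # Weighted gradient norms G_i = w_i * ||grad L_i|| and their mean.
    weighted = [w * g for w, g in zip(weights, grad_norms)]
    mean_norm = sum(weighted) / n
    # Relative inverse training rates: a task that has improved less
    # than average gets a rate above 1.
    ratios = [l / l0 for l, l0 in zip(losses, initial_losses)]
    mean_ratio = sum(ratios) / n
    rates = [r / mean_ratio for r in ratios]
    # Target gradient norms, treated as constants during the update.
    targets = [mean_norm * r ** alpha for r in rates]
    # Gradient of |w_i * g_i - target_i| with respect to w_i is
    # sign(w_i * g_i - target_i) * g_i; take one descent step.
    updated = [w - lr * sign(G - t) * g
               for w, G, t, g in zip(weights, weighted, targets, grad_norms)]
    # Renormalize so that the weights sum to the number of tasks.
    total = sum(updated)
    return [n * w / total for w in updated]


# Made-up values for two tasks: index 0 = command recognition,
# index 1 = speaker identification (which is learning more slowly here).
new_weights = gradnorm_step(
    weights=[1.0, 1.0],
    grad_norms=[2.0, 0.5],
    losses=[0.5, 1.8],
    initial_losses=[2.0, 2.0],
)
```

In this toy setting the slower task (index 1) ends up with the larger weight, which is the balancing behaviour GradNorm is designed to produce. In an actual training loop the gradient norms would be recomputed from the shared layers at every step.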
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.