In the last years, a big interest of both the scientific community and the market has been devoted to the design of audio surveillance systems, able to analyse the audio stream and to identify events of interest; this is particularly true in security applications, in which the audio analytics can be profitably used as an alternative to video analytics systems, but also combined with them. Within this context, in this paper we propose a novel recurrent convolutional neural network architecture, named DENet; it is based on a new layer that we call denoising-enhancement (DE) layer, which performs denoising and enhancement of the original signal by applying an attention map on the components of the band-filtered signal. Differently from state-of-the-art methodologies, DENet takes as input the lossless raw waveform and is able to automatically learn the evolution of the frequencies-of-interest over time, by combining the proposed layer with a bidirectional gated recurrent unit. Using the feedbacks coming from classifications related to consecutive frames (i.e. that belong to the same event), the proposed method is able to drastically reduce the misclassifications. We carried out experiments on the MIVIA Audio Events and MIVIA Road Events public datasets, confirming the effectiveness of our approach with respect to other state-of-the-art methodologies.
DENet: a deep architecture for audio surveillance applications
Antonio Greco;Antonio Roberto;Alessia Saggese
;Mario Vento
2021-01-01
Abstract
In the last years, a big interest of both the scientific community and the market has been devoted to the design of audio surveillance systems, able to analyse the audio stream and to identify events of interest; this is particularly true in security applications, in which the audio analytics can be profitably used as an alternative to video analytics systems, but also combined with them. Within this context, in this paper we propose a novel recurrent convolutional neural network architecture, named DENet; it is based on a new layer that we call denoising-enhancement (DE) layer, which performs denoising and enhancement of the original signal by applying an attention map on the components of the band-filtered signal. Differently from state-of-the-art methodologies, DENet takes as input the lossless raw waveform and is able to automatically learn the evolution of the frequencies-of-interest over time, by combining the proposed layer with a bidirectional gated recurrent unit. Using the feedbacks coming from classifications related to consecutive frames (i.e. that belong to the same event), the proposed method is able to drastically reduce the misclassifications. We carried out experiments on the MIVIA Audio Events and MIVIA Road Events public datasets, confirming the effectiveness of our approach with respect to other state-of-the-art methodologies.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.