DENet: a deep architecture for audio surveillance applications

Antonio Greco; Antonio Roberto; Alessia Saggese; Mario Vento
2021

Abstract

In recent years, both the scientific community and the market have shown great interest in audio surveillance systems that analyse the audio stream and identify events of interest. This is particularly true in security applications, where audio analytics can be used profitably as an alternative to video analytics systems or in combination with them. In this paper we propose a novel recurrent convolutional neural network architecture named DENet. It is based on a new layer, which we call the denoising-enhancement (DE) layer, that denoises and enhances the original signal by applying an attention map to the components of the band-filtered signal. Unlike state-of-the-art methodologies, DENet takes the lossless raw waveform as input and automatically learns the evolution of the frequencies of interest over time by combining the proposed layer with a bidirectional gated recurrent unit. By exploiting the feedback from the classifications of consecutive frames (i.e., frames that belong to the same event), the proposed method drastically reduces misclassifications. Experiments on the public MIVIA Audio Events and MIVIA Road Events datasets confirm the effectiveness of our approach with respect to other state-of-the-art methodologies.
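
The idea of weighting the components of a band-filtered signal with an attention map can be illustrated with a small sketch. This is not the authors' implementation: in DENet the filters and the attention map are learned, whereas here the band split is a fixed DFT mask and the attention scores are chosen by hand. The function names (`split_bands`, `enhance`), the number of bands, and the test signal are all assumptions made for illustration, and the bidirectional gated recurrent unit is not modelled at all.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT; returns the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def split_bands(frame, num_bands):
    """Split a frame into time-domain components, one per frequency band.

    The bands are disjoint and cover the whole spectrum, so the
    components sum back to the original frame.
    """
    n = len(frame)
    spectrum = dft(frame)
    half = n // 2
    bands = []
    for b in range(num_bands):
        lo = b * (half + 1) // num_bands
        hi = (b + 1) * (half + 1) // num_bands
        masked = [0j] * n
        for k in range(n):
            f = min(k, n - k)  # treat bins k and n - k as the same frequency
            if lo <= f < hi:
                masked[k] = spectrum[k]
        bands.append(idft(masked))
    return bands


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def enhance(frame, scores):
    """Weight each band component by its attention weight and sum them.

    `scores` stands in for what a trained network would produce;
    here the caller supplies them by hand (an assumption).
    """
    bands = split_bands(frame, len(scores))
    weights = softmax(scores)
    out = [sum(w * band[t] for w, band in zip(weights, bands))
           for t in range(len(frame))]
    return out, weights


# Hypothetical frame: a tone of interest (bin 3) plus an interfering tone (bin 12).
n = 32
frame = [math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.sin(2 * math.pi * 12 * t / n)
         for t in range(n)]

# Hand-picked scores that emphasise the lowest of four bands.
enhanced, weights = enhance(frame, [4.0, 0.0, 0.0, 0.0])
```

With these scores most of the attention weight falls on the band that contains the tone of interest, so the interfering tone is attenuated in `enhanced`. A learned layer would instead produce the scores from the signal itself.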
Use this identifier to cite or link to this document: http://hdl.handle.net/11386/4756763