SoReNet: a novel deep network for audio surveillance applications

Greco, Antonio; Saggese, Alessia; Vento, Mario; Vigilante, Vincenzo
2019

Abstract

In the era of third-generation surveillance systems, solutions that automatically detect abnormal events are increasingly useful. Interest in audio analysis has therefore grown in recent years, owing to the many situations in which a microphone and an audio surveillance system can assist the human operator in charge of monitoring. In this paper, we propose a method for automatically analyzing the audio stream for surveillance purposes; it detects abnormal events such as screams, gunshots, and breaking glass. Instead of processing the raw audio signal directly, we represent the stream as an image, namely its spectrogram, a time-frequency representation of the audio stream. In this way, we formulate audio analysis as an image classification problem. We then propose a Convolutional Neural Network with two main properties: inspired by the VGG network, it uses very small kernels in the convolutional layers, and it adopts a pyramidal structure in the fully connected layers. These choices give the network good generalization capabilities even with a relatively small dataset. The performance, computed on a standard dataset already used for benchmarking in the field of audio surveillance, confirms the effectiveness of the proposed approach.
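
The spectrogram representation mentioned in the abstract can be illustrated with a minimal sketch. The record gives no implementation details, so everything below is an assumption made for illustration: the function name `log_spectrogram`, the Hann window, the frame length of 256 samples, the hop of 128 samples, the 8 kHz sample rate, and the log-power scaling are not taken from the paper. The sketch only shows how a one-dimensional audio signal could become a two-dimensional time-frequency array that an image classifier might consume.

```python
import numpy as np


def log_spectrogram(signal, frame_len=256, hop=128, eps=1e-10):
    """Return a log-power spectrogram of shape (freq_bins, frames).

    All parameter defaults are illustrative assumptions, not values
    reported by the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")

    # Taper each frame to reduce spectral leakage.
    window = np.hanning(frame_len)

    # Slice the signal into overlapping frames.
    n_frames = 1 + (signal.size - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )

    # One real FFT per frame gives frame_len // 2 + 1 frequency bins.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Convert to decibels and put frequency on the vertical axis.
    return 10.0 * np.log10(power + eps).T


# Hypothetical input: one second of a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)
image = log_spectrogram(tone)
```

Each column of `image` is the spectrum of one short frame, so a pure tone appears as a horizontal line at its frequency bin, while transient events would appear as localized patterns in time and frequency.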
978-1-7281-4569-3

Use this identifier to cite or link to this document: http://hdl.handle.net/11386/4731743