Emotion recognition at a distance: The robustness of machine learning based on hand-crafted facial features vs deep learning models

Bisogni C.; Cimmino L.; De Marsico M.; Narducci F.
2023

Abstract

Emotion estimation from facial expression analysis is now a widely explored computer vision task. In turn, the classification of expressions relies on relevant facial features and their dynamics. Despite the promising accuracy achieved in controlled and favorable conditions, the processing of faces acquired at a distance, which entails low-quality images, still suffers from a significant performance decrease. In particular, most approaches and the related computational models become extremely unstable when only a very small number of useful pixels is available, as is typical in these conditions. Therefore, their behavior should be investigated more carefully. On the other hand, real-time emotion recognition at a distance may play a critical role in smart video surveillance, especially when monitoring particular kinds of events, e.g., political meetings, to try to prevent adverse actions. This work compares two approaches to facial expression recognition at a distance: 1) a deep learning architecture based on state-of-the-art (SOTA) proposals, which exploits the whole image to autonomously learn the relevant embeddings; 2) a machine learning approach that relies on hand-crafted features, namely the facial landmarks preliminarily extracted using the popular MediaPipe framework. Instead of using either the complete sequence of frames or only the final still image of the expression, as current SOTA approaches do, the two proposed methods are designed to use rich temporal information to identify three different stages of emotion. Expressions are time-split accordingly into four phases to better exploit their time-dependent dynamics. Experiments were conducted on the popular Extended Cohn-Kanade dataset (CK+), chosen for its wide use in the related literature and because it includes videos of facial expressions and not only still images.
The results show that the machine learning approach based on hand-crafted features is better suited to classifying the initial phases of the expression and loses almost no accuracy when images are acquired at a distance (a decay of only 0.08%). In contrast, deep learning not only has difficulty classifying the initial phases of the expressions but also suffers a substantial performance decay on images at a distance (52.68% accuracy decay).
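
As an illustration only, the time-splitting of an expression sequence into four phases might be sketched as follows. The abstract does not state how the phase boundaries are chosen, so this sketch assumes contiguous segments of near-equal length; `split_into_phases` is a hypothetical helper and not the authors' code.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_phases(frames: Sequence[T], n_phases: int = 4) -> List[List[T]]:
    """Split an ordered frame sequence into n_phases contiguous chunks.

    Assumption (not from the paper): phases are near-equal in length,
    so chunk sizes differ by at most one frame.
    """
    if n_phases <= 0:
        raise ValueError("n_phases must be positive")
    n = len(frames)
    # Integer boundaries: phase i covers frames[bounds[i]:bounds[i + 1]].
    bounds = [(i * n) // n_phases for i in range(n_phases + 1)]
    return [list(frames[bounds[i]:bounds[i + 1]]) for i in range(n_phases)]


# Usage: a 10-frame clip, with frames represented here by their indices.
phases = split_into_phases(list(range(10)))
for index, phase in enumerate(phases, start=1):
    print(f"phase {index}: frames {phase}")
```

Each phase could then be passed to a per-phase classifier, with the frames replaced by landmark vectors or image crops depending on the approach.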

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4845152