heal.abstract |
Visual emotion recognition constitutes a major subject in the interdisciplinary field of Computer
Vision, concerned with identifying human emotion at a categorical (discrete) and/or dimensional
(continuous) level, as depicted in still images or video sequences. A review of the related
literature reveals that the majority of past efforts in visual emotion recognition have been
limited to the analysis of facial expressions, while some studies have either incorporated
information related to body pose or have attempted to perform emotion recognition solely on
the basis of body movements and gestures.
While some of these approaches perform well in controlled environments, they fail in
real-world scenarios, where unpredictable social settings can render one or more of the
aforementioned sources of affective information inaccessible. However, evidence from
psychological studies suggests that visual context, in addition to facial expression and
body pose, provides important information for the perception of people’s emotions.
In this work, we aim to reinforce the concept of context-based visual emotion recognition.
To this end, we conduct extensive experiments on two newly assembled and challenging
databases, namely the EMOTions In Context (EMOTIC) database and the Body Language Dataset
(BoLD), tackling both the image-based and video-based versions of the problem. More specifically, we:
• Extend already successful baseline architectures by incorporating multiple input streams
that encode bodily, facial, contextual, and scene-related features, thus enhancing our
models’ understanding of visual context and emotion in general.
• Directly infuse scene classification scores and attributes as additional features in the
emotion recognition process, functioning in a complementary manner with respect to all
other sources of affective information. To the best of our knowledge, our approach is
the first to do so.
• Exploit the categorical emotion label dependencies that reside within the datasets
through the use of Graph Convolutional Networks (GCNs) and the addition of a
metric-learning-inspired loss based on GloVe word embeddings.
• Achieve competitive results on EMOTIC and significant improvements over state-of-the-art techniques on BoLD.
A large portion of our contributions was submitted to the 16th IEEE International Conference
on Automatic Face and Gesture Recognition (FG), authored by Ioannis Pikoulis,
Panagiotis Paraskevas Filntisis and Petros Maragos. |
en |