Visual scene-aware hybrid and multi-modal feature aggregation for facial expression recognition

Min Kyu Lee, Dae Ha Kim, Byung Cheol Song

Research output: Contribution to journal › Article › peer-review


Abstract

Facial expression recognition (FER) technology has made considerable progress with the rapid development of deep learning. However, conventional FER techniques are mainly designed and trained on videos artificially acquired in controlled environments, so they may not operate robustly on videos captured in the wild, which suffer from varying illumination and head poses. To solve this problem and improve the ultimate performance of FER, this paper proposes a new architecture that extends a state-of-the-art FER scheme and adds a multi-modal neural network that effectively fuses image and landmark information. To this end, we propose three methods. First, to maximize the performance of the recurrent neural network (RNN) in the previous scheme, we propose a frame substitution module that replaces the latent features of less important frames with those of important frames based on inter-frame correlation. Second, we propose a method for extracting facial landmark features that is likewise based on inter-frame correlation. Third, we propose a new multi-modal fusion method that fuses video and facial landmark information at the feature level; attention derived from the characteristics of each modality is applied to that modality's features to achieve the fusion. Experimental results show that the proposed method provides remarkable performance, with 51.4% accuracy on the in-the-wild AFEW dataset, 98.5% on the CK+ dataset, and 81.9% on the MMI dataset, outperforming state-of-the-art networks.
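
To make the frame-substitution and attention-based fusion ideas from the abstract concrete, the following is a minimal PyTorch-style sketch. The module names (FrameSubstitution, AttentionFusion), the keep_ratio parameter, the cosine-similarity importance measure, and the sigmoid attention heads are illustrative assumptions, not the authors' published implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameSubstitution(nn.Module):
    # Hypothetical sketch: frames whose latent features correlate weakly with the
    # clip-level mean feature are treated as "less important" and replaced by the
    # feature of the best-correlated ("most important") frame.
    def __init__(self, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio  # fraction of frames kept unchanged (assumption)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_frames, feat_dim) latent features from a CNN backbone
        b, t, d = feats.shape
        mean_feat = feats.mean(dim=1, keepdim=True)               # (b, 1, d)
        corr = F.cosine_similarity(feats, mean_feat, dim=-1)      # (b, t) importance proxy
        num_keep = max(1, int(t * self.keep_ratio))
        top_idx = corr.topk(num_keep, dim=1).indices              # important frames
        best_idx = corr.argmax(dim=1)                             # single most important frame
        out = feats.clone()
        for i in range(b):
            mask = torch.ones(t, dtype=torch.bool, device=feats.device)
            mask[top_idx[i]] = False                              # frames to replace
            out[i, mask] = feats[i, best_idx[i]]                  # substitute their features
        return out


class AttentionFusion(nn.Module):
    # Hypothetical feature-level fusion: each modality's feature vector is re-weighted
    # by an attention score computed from that modality itself, then concatenated.
    def __init__(self, video_dim: int, landmark_dim: int, num_classes: int = 7):
        super().__init__()
        self.video_attn = nn.Sequential(nn.Linear(video_dim, 1), nn.Sigmoid())
        self.lmk_attn = nn.Sequential(nn.Linear(landmark_dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(video_dim + landmark_dim, num_classes)

    def forward(self, video_feat: torch.Tensor, lmk_feat: torch.Tensor) -> torch.Tensor:
        # video_feat: (batch, video_dim), lmk_feat: (batch, landmark_dim)
        v = video_feat * self.video_attn(video_feat)   # modality-specific attention
        l = lmk_feat * self.lmk_attn(lmk_feat)
        return self.classifier(torch.cat([v, l], dim=-1))


if __name__ == "__main__":
    # Toy shapes: 2 clips, 16 frames, 512-d video features, 136-d landmark features.
    frames = torch.randn(2, 16, 512)
    substituted = FrameSubstitution(keep_ratio=0.5)(frames)
    logits = AttentionFusion(video_dim=512, landmark_dim=136)(
        substituted.mean(dim=1), torch.randn(2, 136)
    )
    print(logits.shape)  # torch.Size([2, 7]) for 7 expression classes
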

Original language: English
Article number: 5184
Pages (from-to): 1-24
Number of pages: 24
Journal: Sensors
Volume: 20
Issue number: 18
DOIs
State: Published - 2 Sep 2020

Bibliographical note

Publisher Copyright:
© 2020 by the authors. Licensee MDPI, Basel, Switzerland.

Keywords

  • Convolutional neural networks
  • Facial expression recognition
  • Multi-modal fusion
