Multi-Task Conformer with Multi-Feature Combination for Speech Emotion Recognition

Jiyoung Seo, Bowon Lee

Research output: Contribution to journal › Article › peer-review

11 Scopus citations

Abstract

Along with automatic speech recognition, many researchers have been actively studying speech emotion recognition, since emotion information is as crucial as the textual information for effective interactions. Emotion can be divided into categorical emotion and dimensional emotion. Although categorical emotion is widely used, dimensional emotion, typically represented as arousal and valence, can provide more detailed information on the emotional states. Therefore, in this paper, we propose a Conformer-based model for arousal and valence recognition. Our model uses Conformer as an encoder, a fully connected layer as a decoder, and statistical pooling layers as a connector. In addition, we adopted multi-task learning and multi-feature combination, which showed a remarkable performance for speech emotion recognition and time-series analysis, respectively. The proposed model achieves a state-of-the-art recognition accuracy of 70.0 ± 1.5% for arousal in terms of unweighted accuracy on the IEMOCAP dataset.
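The encoder–connector–decoder pipeline described in the abstract can be sketched in a few lines. The sketch below illustrates only the connector and decoder stages: statistical pooling (mean and standard deviation over time) collapses a sequence of encoder frames into a fixed-length vector, which is then fed to two separate fully connected heads for arousal and valence (the multi-task aspect). The random "encoder output", the layer sizes, and the number of emotion classes are illustrative assumptions, not the paper's actual configuration.

```python
# Hedged sketch: statistical pooling connector + two task-specific
# fully connected heads, assuming a Conformer encoder upstream.
import numpy as np

def statistical_pooling(frames: np.ndarray) -> np.ndarray:
    """Collapse a (T, D) sequence of encoder frames into a (2*D,) vector
    by concatenating the per-dimension mean and standard deviation."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])

rng = np.random.default_rng(0)
T, D, n_classes = 50, 8, 4            # frames, encoder dim, classes (assumed)
frames = rng.normal(size=(T, D))      # stand-in for Conformer encoder output

pooled = statistical_pooling(frames)  # shape (2 * D,) = (16,)

# Two task-specific fully connected heads (multi-task learning):
W_arousal = rng.normal(size=(n_classes, 2 * D))
W_valence = rng.normal(size=(n_classes, 2 * D))
arousal_logits = W_arousal @ pooled   # shape (4,)
valence_logits = W_valence @ pooled   # shape (4,)
```

In a real implementation both heads would share the encoder and be trained jointly, with the losses of the two tasks summed or weighted; the pooling step is what lets a variable-length utterance map to fixed-size classifier inputs.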

Original language: English
Article number: 1428
Journal: Symmetry
Volume: 14
Issue number: 7
DOIs
State: Published - Jul 2022

Bibliographical note

Publisher Copyright:
© 2022 by the authors.

Keywords

  • arousal
  • speech emotion recognition
  • spoken language understanding
  • valence
