Abstract
Along with automatic speech recognition, many researchers have been actively studying speech emotion recognition, since emotion information is as crucial as textual information for effective interactions. Emotion can be divided into categorical emotion and dimensional emotion. Although categorical emotion is widely used, dimensional emotion, typically represented as arousal and valence, can provide more detailed information on emotional states. Therefore, in this paper, we propose a Conformer-based model for arousal and valence recognition. Our model uses a Conformer as an encoder, a fully connected layer as a decoder, and statistical pooling layers as a connector. In addition, we adopted multi-task learning and multi-feature combination, which have shown remarkable performance for speech emotion recognition and time-series analysis, respectively. The proposed model achieves a state-of-the-art recognition accuracy of 70.0 ± 1.5% for arousal in terms of unweighted accuracy on the IEMOCAP dataset.
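The connector and decoder stages described above can be illustrated with a minimal sketch. It assumes that statistical pooling means concatenating the per-dimension mean and standard deviation over time, and that multi-task learning takes the form of separate arousal and valence heads on a shared pooled vector; the abstract does not specify either detail. The random frames stand in for Conformer encoder outputs, and the names `statistical_pooling`, `make_head`, and `linear`, the feature size, and the three-class output are all hypothetical.

```python
import math
import random


def statistical_pooling(frames):
    """Map a variable-length list of frame vectors to one fixed-size vector.

    Assumption: pooling = per-dimension mean followed by per-dimension
    (population) standard deviation, computed over the time axis.
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds


def make_head(rng, in_dim, n_classes):
    """Create randomly initialised weights and biases for one linear head."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(n_classes)]
    bias = [0.0] * n_classes
    return weights, bias


def linear(x, weights, bias):
    """Fully connected layer: one logit per output class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


rng = random.Random(0)
FEATURE_DIM = 4   # hypothetical encoder output size
N_CLASSES = 3     # hypothetical number of arousal/valence classes

# Stand-in for encoder outputs: 50 frames of FEATURE_DIM features each.
encoded = [[rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)] for _ in range(50)]

pooled = statistical_pooling(encoded)           # length 2 * FEATURE_DIM
arousal_head = make_head(rng, len(pooled), N_CLASSES)
valence_head = make_head(rng, len(pooled), N_CLASSES)

# Both heads read the same pooled vector (the assumed multi-task setup).
arousal_logits = linear(pooled, *arousal_head)
valence_logits = linear(pooled, *valence_head)
```

Because the pooled vector has a fixed length regardless of how many frames the utterance contains, a plain fully connected layer can serve as the decoder for utterances of any duration.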
Original language | English |
---|---|
Article number | 1428 |
Journal | Symmetry |
Volume | 14 |
Issue number | 7 |
DOIs | |
State | Published - Jul 2022 |
Bibliographical note
Publisher Copyright: © 2022 by the authors.
Keywords
- arousal
- speech emotion recognition
- spoken language understanding
- valence