Multi-cultural speech emotion recognition using language and speaker cues. (May 2023)
- Record Type:
- Journal Article
- Title:
- Multi-cultural speech emotion recognition using language and speaker cues. (May 2023)
- Main Title:
- Multi-cultural speech emotion recognition using language and speaker cues
- Authors:
- Pandey, Sandeep Kumar
Shekhawat, Hanumant Singh
Prasanna, S.R.M.
- Abstract:
- Speech Emotion Recognition (SER) has been an active area of research to make Human–Computer Interaction (HCI) smoother and more natural. However, because the emotions expressed in an utterance depend on factors such as culture and speaker, the robustness of SER systems in a multi-cultural setting remains a topic of discussion among researchers. Both the universality and the cultural specificity of emotions are debated in the literature. Thus we propose two methods, one incorporating cultural specificity and another demonstrating the universal nature of emotions across cultures. In this work, we propose a novel method to build a multi-cultural SER system by incorporating impactful factors such as speaker and language as markers of cultural distinctiveness. We develop a language model and a speaker model to obtain language and speaker embeddings, and propose a multi-modal fusion architecture to fuse this information with emotional cues. Moreover, a triplet-loss-based multi-cultural SER system is proposed, which tries to normalize speaker and cultural variabilities and focuses on learning emotions irrespective of culture. Experiments conducted on a collection of emotion datasets in five languages show the robustness of the proposed technique in predicting emotions in a leave-one-language-out setting. The design of the triplet-loss-based system allows a new language and speaker to be incorporated without retraining the whole system.
Highlights: We propose a novel multi-modal SER system using audio modalities such as emotion, language, and speaker for the multi-cultural scenario. Building upon the motivation that emotions are universal across cultures, a triplet-loss-based metric learning approach is also investigated to normalize variabilities related to culture and speaker personality. Embeddings extracted from a metric-learning-based model can be combined with a simple DNN classifier to predict labels in the unseen-language scenario. Experimental evaluation on emotion datasets belonging to five different languages shows that the proposed techniques are robust enough to tackle the challenge of cross-cultural variabilities.
- Is Part Of:
- Biomedical signal processing and control. Volume 83(2023)
- Journal:
- Biomedical signal processing and control
- Issue:
- Volume 83(2023)
- Issue Display:
- Volume 83, Issue 2023 (2023)
- Year:
- 2023
- Volume:
- 83
- Issue:
- 2023
- Issue Sort Value:
- 2023-0083-2023-0000
- Page Start:
- Page End:
- Publication Date:
- 2023-05
- Subjects:
- Tensor factorized neural network -- Speech emotion recognition -- Multi-cultural -- Multi-modal -- Language model -- Speaker model -- Metric learning -- Triplet loss
Signal processing -- Periodicals
Biomedical engineering -- Periodicals
Signal Processing, Computer-Assisted -- Periodicals
Image Processing, Computer-Assisted -- Periodicals
Biomedical Engineering -- Periodicals
610.28
- Journal URLs:
- http://www.sciencedirect.com/science/journal/17468094 ↗
http://www.elsevier.com/journals ↗
http://www.sciencedirect.com/science?_ob=PublicationURL&_tockey=%23TOC%2329675%232006%23999989998%23626449%23FLA%23&_cdi=29675&_pubType=J&_auth=y&_acct=C000045259&_version=1&_urlVersion=0&_userid=836873&md5=664b5cf9a57fc91971a17faf20c32ec1 ↗
- DOI:
- 10.1016/j.bspc.2023.104679 ↗
- Languages:
- English
- ISSNs:
- 1746-8094
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms) ↗
- Physical Locations:
- British Library DSC - 2087.880400
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 26143.xml