Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning. (March 2022)
- Record Type:
- Journal Article
- Title:
- Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning. (March 2022)
- Main Title:
- Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning
- Authors:
- Ding, Shaojin
Zhao, Guanlong
Gutierrez-Osuna, Ricardo
- Abstract:
- Foreign accent conversion (FAC) aims to create a new voice that has the voice identity of a given second-language (L2) speaker but with a native (L1) accent. Previous FAC approaches usually require training a separate model for each L2 speaker and, more importantly, generally require considerable speech data from each L2 speaker for training. To address these limitations, we propose Accentron, an approach that can generate accent-converted speech for arbitrary L2 speakers unseen during training. In the proposed approach, we first train a speaker-independent acoustic model on L1 corpora to extract bottleneck features that represent the linguistic content of utterances. Then, we develop a speaker encoder and an accent encoder to generate embedding vectors for the desired voice identity (L2 speaker's) and accent (L1 accent), respectively. Lastly, we use a sequence-to-sequence model to transform bottleneck features to Mel-spectrograms, conditioned on the L2 speaker embedding and the L1 accent embedding. We conducted experiments on the L2-ARCTIC corpus under two testing conditions: the standard FAC setting where test L2 speakers were seen during training, and a zero-shot FAC setting where test L2 speakers were unseen during training. Accentron achieves over 27% relative improvement in accentedness ratings compared to two state-of-the-art FAC systems in the standard FAC setting. More importantly, our results show that Accentron generalizes to the zero-shot FAC setting with no performance loss. Therefore, in practical use scenarios (e.g., computer-assisted pronunciation training software), Accentron can effectively avoid the need to adapt or retrain the model, which significantly reduces computations and the users' waiting time.
- Is Part Of:
- Computer speech & language. Volume 72(2022)
- Journal:
- Computer speech & language
- Issue:
- Volume 72(2022)
- Issue Display:
- Volume 72, Issue 2022 (2022)
- Year:
- 2022
- Volume:
- 72
- Issue:
- 2022
- Issue Sort Value:
- 2022-0072-2022-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-03
- Subjects:
- Foreign accent conversion -- Speech synthesis -- Computer-assisted pronunciation training
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2021.101302
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 20051.xml
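
The abstract describes a sequence-to-sequence model that maps bottleneck features to Mel-spectrograms while conditioned on an L2 speaker embedding and an L1 accent embedding. The record does not say how that conditioning is implemented. The sketch below shows only one common option, frame-wise concatenation, and should be read as an assumption rather than the authors' method. The function name `condition_frames`, the toy dimensions, and all numeric values are hypothetical.

```python
from typing import List, Sequence


def condition_frames(
    bottleneck: Sequence[Sequence[float]],
    speaker_emb: Sequence[float],
    accent_emb: Sequence[float],
) -> List[List[float]]:
    """Append the same speaker and accent embeddings to every frame.

    Assumption: conditioning by concatenation. The source abstract only
    says the model is "conditioned on" the two embeddings.
    """
    if not bottleneck:
        return []
    width = len(bottleneck[0])
    conditioned: List[List[float]] = []
    for frame in bottleneck:
        if len(frame) != width:
            raise ValueError("all bottleneck frames must have the same width")
        # Each output frame is [linguistic content | voice identity | accent].
        conditioned.append(list(frame) + list(speaker_emb) + list(accent_emb))
    return conditioned


# Toy, made-up inputs: 2 frames of 3-dim bottleneck features,
# a 2-dim speaker embedding, and a 1-dim accent embedding.
frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
speaker = [1.0, 0.0]
accent = [0.5]

decoder_input = condition_frames(frames, speaker, accent)
print(len(decoder_input), len(decoder_input[0]))
```

In this sketch a zero-shot speaker would only require computing a new `speaker` vector; nothing else in the function changes, which mirrors the abstract's point that no adaptation or retraining is needed.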