An investigation of neural uncertainty estimation for target speaker extraction equipped RNN transducer. (May 2022)
- Record Type:
- Journal Article
- Title:
- An investigation of neural uncertainty estimation for target speaker extraction equipped RNN transducer. (May 2022)
- Main Title:
- An investigation of neural uncertainty estimation for target speaker extraction equipped RNN transducer
- Authors:
- Shi, Jiatong
Zhang, Chunlei
Weng, Chao
Watanabe, Shinji
Yu, Meng
Yu, Dong
- Abstract:
- Abstract: Target-speaker speech recognition aims to recognize the speech of an enrolled speaker from an environment with background noise and interfering speakers. This study presents a joint framework that combines time-domain target speaker extraction and recurrent neural network transducer (RNN-T) for speech recognition. To alleviate the adverse effects of residual noise and artifacts that the target speaker extraction module introduces into the speech recognition back-end, we explore training the target speaker extraction and RNN-T jointly. We find that a multi-stage training strategy, which pre-trains and fine-tunes each module before joint training, is crucial in stabilizing the training process. In addition, we propose a novel neural uncertainty estimation that leverages useful information from the target speaker extraction module (i.e., speaker identity uncertainty and speech enhancement uncertainty) to further improve the back-end speech recognizer. Compared to a recognizer with a target speech extraction front-end, our experiments show that joint training and the neural uncertainty module reduce the character error rate (CER) by 7% and 17% relative on multi-talker simulation data, respectively. The multi-condition experiments indicate that our method can reduce CER by 9% relative in the noisy condition without losing performance in the clean condition. We also observe consistent improvements in further evaluation on real-world data based on vehicular speech.
Highlights: A framework utilizing target speaker extraction and RNN-T for noise-robust ASR. Neural uncertainty estimation module to compensate for front-end distortions. A multi-stage training strategy to optimize target speaker extraction and RNN-T. Comprehensive studies based on synthesized and realistic test data.
- Is Part Of:
- Computer speech & language. Volume 73(2022)
- Journal:
- Computer speech & language
- Issue:
- Volume 73(2022)
- Issue Display:
- Volume 73, Issue 2022 (2022)
- Year:
- 2022
- Volume:
- 73
- Issue:
- 2022
- Issue Sort Value:
- 2022-0073-2022-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-05
- Subjects:
- Target-speaker speech recognition -- Target-speaker speech extraction -- Uncertainty estimation
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2021.101327
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 20459.xml