Joint speaker diarization and speech recognition based on region proposal networks. (March 2022)

Record Type:: Journal Article
Title:: Joint speaker diarization and speech recognition based on region proposal networks. (March 2022)
Main Title:: Joint speaker diarization and speech recognition based on region proposal networks
Authors:: Huang, Zili
Delcroix, Marc
Garcia, Leibny Paola
Watanabe, Shinji
Raj, Desh
Khudanpur, Sanjeev
Abstract:: Speaker diarization, the process of partitioning an input audio stream into homogeneous segments according to the speaker identity, is an important task for speech processing. The standard clustering-based diarization pipeline (1) segments the whole utterance into small chunks, (2) extracts speaker embedding for each chunk, and (3) groups the chunks into clusters, where each cluster represents one speaker. It has two major disadvantages: first, it contains several individually optimized modules in the pipeline, and second, it cannot handle overlapping speech. To address these issues, we proposed region proposal network-based speaker diarization (RPNSD) (Huang et al., 2020). In this paper, we perform a detailed study of the RPNSD system, and make two important contributions. First, we report its diarization performance on additional datasets and empirically investigate the impact of different system settings. Second, we integrate an automatic speech recognition (ASR) component into the RPNSD system and propose a new framework called RPN-JOINT that simultaneously performs diarization and ASR. Our experiments reveal that (1) the RPNSD system can consistently achieve diarization results that are competitive with state-of-the-art methods, and (2) the RPN-JOINT system offers several advantages over the conventional cascade of diarization and ASR systems. Highlights: Describes how to apply Region Proposal Network (RPN) to the speaker diarization task. Shows how a …
Is Part Of:: Computer speech & language. Volume 72(2022)
Journal:: Computer speech & language
Issue:: Volume 72(2022)
Issue Display:: Volume 72, Issue 2022 (2022)
Year:: 2022
Volume:: 72
Issue:: 2022
Issue Sort Value:: 2022-0072-2022-0000
Page Start:
Page End:
Publication Date:: 2022-03
Subjects:: Speaker diarization -- Region proposal network -- Faster R-CNN -- Multi-speaker speech recognition
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
Journal URLs:: http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
DOI:: 10.1016/j.csl.2021.101316
Languages:: English
ISSNs:: 0885-2308
Deposit Type:: Legal deposit
View Content:: Available online (eLD content is only available in our Reading Rooms)
Physical Locations:: British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
Ingest File:: 20051.xml