Overlapped Speech Detection and speaker counting using distant microphone arrays. (March 2022)
- Record Type:
- Journal Article
- Title:
- Overlapped Speech Detection and speaker counting using distant microphone arrays. (March 2022)
- Main Title:
- Overlapped Speech Detection and speaker counting using distant microphone arrays
- Authors:
- Cornell, Samuele
Omologo, Maurizio
Squartini, Stefano
Vincent, Emmanuel - Abstract:
- Abstract: We study the problem of detecting and counting simultaneous, overlapping speakers in a multichannel, distant-microphone scenario. Focusing on a supervised learning approach, we treat Voice Activity Detection (VAD), Overlapped Speech Detection (OSD), joint VAD and OSD (VAD+OSD) and speaker counting in a unified way, as instances of a general Overlapped Speech Detection and Counting (OSDC) multi-class supervised learning problem. We consider a Temporal Convolutional Network (TCN) and a Transformer based architecture for this task, and compare them with previously proposed state-of-the art methods based on Recurrent Neural Networks (RNN) or hybrid Convolutional-Recurrent Neural Networks (CRNN). In addition, we propose ways of exploiting multichannel input by means of early or late fusion of single-channel features with spatial features extracted from one or more microphone pairs. We conduct an extensive experimental evaluation on the AMI and CHiME-6 datasets and on a purposely made multichannel synthetic dataset. We show that the Transformer-based architecture performs best among all architectures and that neural network based spatial localization features outperform signal-based spatial features and significantly improve performance compared to single-channel features only. Finally, we find that training with a speaker counting objective improves OSD compared to training with a VAD+OSD objective. Highlights: We study the problem of detecting multiple speakers in aAbstract: We study the problem of detecting and counting simultaneous, overlapping speakers in a multichannel, distant-microphone scenario. Focusing on a supervised learning approach, we treat Voice Activity Detection (VAD), Overlapped Speech Detection (OSD), joint VAD and OSD (VAD+OSD) and speaker counting in a unified way, as instances of a general Overlapped Speech Detection and Counting (OSDC) multi-class supervised learning problem. We consider a Temporal Convolutional Network (TCN) and a Transformer based architecture for this task, and compare them with previously proposed state-of-the art methods based on Recurrent Neural Networks (RNN) or hybrid Convolutional-Recurrent Neural Networks (CRNN). In addition, we propose ways of exploiting multichannel input by means of early or late fusion of single-channel features with spatial features extracted from one or more microphone pairs. We conduct an extensive experimental evaluation on the AMI and CHiME-6 datasets and on a purposely made multichannel synthetic dataset. We show that the Transformer-based architecture performs best among all architectures and that neural network based spatial localization features outperform signal-based spatial features and significantly improve performance compared to single-channel features only. Finally, we find that training with a speaker counting objective improves OSD compared to training with a VAD+OSD objective. Highlights: We study the problem of detecting multiple speakers in a multichannel scenario. We consider two novel architectures for this task based on TCN and Transformer. We propose ways of improving detection by using spatial features. The Transformer-based architecture perform best with lower computational footprint. … (more)
- Is Part Of:
- Computer speech & language. Volume 72(2022)
- Journal:
- Computer speech & language
- Issue:
- Volume 72(2022)
- Issue Display:
- Volume 72, Issue 2022 (2022)
- Year:
- 2022
- Volume:
- 72
- Issue:
- 2022
- Issue Sort Value:
- 2022-0072-2022-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-03
- Subjects:
- Voice activity detection -- Overlapped Speech Detection -- Speaker counting -- Distant microphones -- Spatial features
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454 - Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/ ↗
http://www.elsevier.com/journals ↗ - DOI:
- 10.1016/j.csl.2021.101306 ↗
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms) ↗
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store - Ingest File:
- 20100.xml