Localizing speakers in multiple rooms by using Deep Neural Networks. (May 2018)
- Record Type:
- Journal Article
- Title:
- Localizing speakers in multiple rooms by using Deep Neural Networks. (May 2018)
- Main Title:
- Localizing speakers in multiple rooms by using Deep Neural Networks
- Authors:
- Vesperini, Fabio
Vecchiotti, Paolo
Principi, Emanuele
Squartini, Stefano
Piazza, Francesco
- Abstract:
- Highlights: MLP and CNN architectures for multi-room speaker localization are investigated. Localization is performed by using the microphone signals coming from all rooms. An in-depth study on the effect of the temporal context is conducted. A reduced dependence on the microphone locations inside the room is observed. The CNN approach with temporal context outperforms state-of-the-art algorithms on the DIRHA dataset.
Abstract: In the field of human speech capturing systems, source localization algorithms play a fundamental role. In this paper a Speaker Localization algorithm (SLOC) based on Deep Neural Networks (DNN) is evaluated and compared with state-of-the-art approaches. The speaker position in the room under analysis is directly determined by the DNN, making the proposed algorithm fully data-driven. Two different neural network architectures are investigated: the Multi Layer Perceptron (MLP) and Convolutional Neural Networks (CNN). GCC-PHAT (Generalized Cross Correlation-PHAse Transform) Patterns, computed from the audio signals captured by the microphones, are used as input features for the DNN. In particular, a multi-room case study is dealt with, where the acoustic scene of each room is influenced by sounds emitted in the other rooms. The algorithm is tested by means of the home recorded DIRHA dataset, characterized by multiple wall and ceiling microphone signals for each room. In detail, the focus is on the speaker localization task in two distinct neighboring rooms. As a term of comparison, two algorithms proposed in the literature for the addressed applicative context are evaluated: the Crosspower Spectrum Phase Speaker Localization (CSP-SLOC) and the Steered Response Power using the Phase Transform speaker localization (SRP-SLOC). Besides providing an extensive analysis of the proposed method, the article shows how the DNN-based algorithm significantly outperforms the state-of-the-art approaches evaluated on the DIRHA dataset, providing an average localization error, expressed in terms of Root Mean Square Error (RMSE), equal to 324 mm and 367 mm, respectively, for the Simulated and the Real subsets. …
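The abstract names GCC-PHAT patterns computed from microphone signals as the network input. As a rough illustration only, the following sketch computes a textbook GCC-PHAT cross-correlation for a single microphone pair and reads off the delay at its peak. It is not the authors' feature extraction pipeline: the function names `dft` and `gcc_phat`, the 32-sample frame, the `eps` regularizer, and the synthetic noise signals are all assumptions made for this example, and a naive DFT is used only to keep the code dependency-free.

```python
import cmath
import math
import random


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (hypothetical helper, stdlib only)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(sig, ref, eps=1e-12):
    """Return (lags, cc): GCC-PHAT of two equal-length frames.

    A positive lag at the peak means `sig` is delayed with respect to `ref`.
    """
    n = len(sig)
    size = 2 * n  # zero-pad so the circular correlation does not wrap around
    S = dft(list(sig) + [0.0] * n)
    R = dft(list(ref) + [0.0] * n)
    cross = [s * r.conjugate() for s, r in zip(S, R)]
    # PHAT weighting: discard the magnitude, keep only the phase
    phat = [c / (abs(c) + eps) for c in cross]
    cc = [v.real for v in dft(phat, inverse=True)]
    # Reorder so that lags run from -(n - 1) to +(n - 1)
    lags = list(range(-(n - 1), n))
    cc = cc[size - (n - 1):] + cc[:n]
    return lags, cc


# Synthetic example (assumed data): the second channel is the first delayed by 3 samples.
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(29)]
ref = noise + [0.0, 0.0, 0.0]
sig = [0.0, 0.0, 0.0] + noise

lags, cc = gcc_phat(sig, ref)
delay = lags[cc.index(max(cc))]
print(delay)  # prints 3
```

In a multi-microphone setting one such correlation vector would be computed per microphone pair and frame; how those vectors are arranged into the "patterns" fed to the MLP or CNN is described in the article itself and is not reproduced here.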
- Is Part Of:
- Computer speech & language. Volume 49(2018)
- Journal:
- Computer speech & language
- Issue:
- Volume 49(2018)
- Issue Display:
- Volume 49, Issue 2018 (2018)
- Year:
- 2018
- Volume:
- 49
- Issue:
- 2018
- Issue Sort Value:
- 2018-0049-2018-0000
- Page Start:
- 83
- Page End:
- 106
- Publication Date:
- 2018-05
- Subjects:
- Acoustic source localization -- Speaker localization -- GCC-PHAT -- Deep Neural Networks -- Convolutional Neural Networks -- Computational Audio Processing
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2017.12.002
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 5619.xml