Residual convolutional neural network with attentive feature pooling for end-to-end language identification from short-duration speech. (November 2019)
- Record Type:
- Journal Article
- Title:
- Residual convolutional neural network with attentive feature pooling for end-to-end language identification from short-duration speech. (November 2019)
- Main Title:
- Residual convolutional neural network with attentive feature pooling for end-to-end language identification from short-duration speech
- Authors:
- Monteiro, João
Alam, Jahangir
Falk, Tiago H.
- Abstract:
- Abstract : A ResNet-50 with self-attention effectively learns language-dependent representations. Convolution over the time dimension along with data-driven learned features pooling yield robust representations of languages. The output posterior of a trained convolutional neural network can be directly used as score for verification tasks outperforming trained backends. The combination of cross-entropy and triplet loss minimization yields discriminative representations.
Abstract: The problem of language identification from speech is tackled in this work. Residual convolutional neural networks are employed to this end, aiming at exploiting the ability of such architectures to take into account large contextual segments of input data. Moreover, learnable attention mechanisms are introduced on top of the convolutional stack for data-driven feature pooling across time, enabling the computation of fixed-dimension representations given varying-length speech segments as input. Training is performed using a combination of language identification and metric learning via triplet loss minimization, aimed at enforcing class separability within the embeddings space. Evaluation is performed across different conditions, such as multi-class classification, short-duration test utterances, and confusing languages, for the closed-set case, while open-set performance is evaluated with the introduction of unseen languages. At test time, end-to-end scoring along with cosine similarity and PLDA are employed, outperforming state-of-the-art benchmark methods, such as i-vectors by improving the average cost by 30% to 40% depending on the evaluation condition.
- Is Part Of:
- Computer speech & language. Volume 58(2019)
- Journal:
- Computer speech & language
- Issue:
- Volume 58(2019)
- Issue Display:
- Volume 58, Issue 2019 (2019)
- Year:
- 2019
- Volume:
- 58
- Issue:
- 2019
- Issue Sort Value:
- 2019-0058-2019-0000
- Page Start:
- 364
- Page End:
- 376
- Publication Date:
- 2019-11
- Subjects:
- Language recognition -- Language modeling -- Residual convolutional neural networks -- Attentive feature pooling
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2019.05.006
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 11148.xml