Optimization of the area under the ROC curve using neural network supervectors for text-dependent speaker verification. (September 2020)
- Record Type:
- Journal Article
- Title:
- Optimization of the area under the ROC curve using neural network supervectors for text-dependent speaker verification. (September 2020)
- Main Title:
- Optimization of the area under the ROC curve using neural network supervectors for text-dependent speaker verification
- Authors:
- Mingote, Victoria
Miguel, Antonio
Ortega, Alfonso
Lleida, Eduardo
- Abstract:
- Highlights: The use of different alignment techniques inside deep neural networks to keep the temporal structure and to encode the relevant information in a differentiable supervector. The use of Convolutional Neural Networks to increase the temporal context of the representation for the training process of the system. The use of a cost function close to the objective of the verification task. A set of experiments on a small database with several phrases for text-dependent, phrase-based speaker verification. Study of the system performance using the different proposed alignment techniques in combination with two neural network alternatives, and of the proposed novel cost function in contrast with the traditional triplet loss.
Abstract: This paper explores two techniques to improve the performance of text-dependent speaker verification systems based on deep neural networks. Firstly, we propose a general alignment mechanism to keep the temporal structure of each phrase and obtain a supervector with the speaker and phrase information, since both are relevant for text-dependent verification. As we show, it is possible to use different alignment techniques to replace the global average pooling, providing significant gains in performance. Moreover, we also present a novel back-end approach to train a neural network for detection tasks by optimizing the Area Under the Curve (AUC) as an alternative to the usual triplet loss function, so the system is end-to-end, with a cost function close to our desired measure of performance. As we can see in the experimental section, this approach improves the system performance, since our triplet neural network based on an approximation of the AUC (aAUC) learns how to discriminate between pairs of examples from the same identity and pairs of different identities. The different alignment techniques to produce supervectors, in addition to the new back-end approach, were tested on the RSR2015-Part I and RSR2015-Part II databases for text-dependent speaker verification, providing competitive results compared to similar-size networks using the global average pooling to extract supervectors and using a simple back-end or triplet loss training. …
- Is Part Of:
- Computer speech & language. Volume 63(2020)
- Journal:
- Computer speech & language
- Issue:
- Volume 63(2020)
- Issue Display:
- Volume 63, Issue 2020 (2020)
- Year:
- 2020
- Volume:
- 63
- Issue:
- 2020
- Issue Sort Value:
- 2020-0063-2020-0000
- Page Start:
- Page End:
- Publication Date:
- 2020-09
- Subjects:
- Text dependent speaker verification -- Supervectors -- Alignment -- Triplet neural network -- AUC
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2020.101078
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 13576.xml
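
Note on the abstract: the record says the network is trained on an approximation of the AUC (aAUC) that compares pairs of same-identity and different-identity examples. As a rough illustration only, and not the authors' implementation, the sketch below shows one common way to make AUC differentiable: the step function that counts correctly ranked (target, non-target) score pairs is replaced by a sigmoid. The function names `exact_auc` and `approx_auc`, the `sharpness` parameter, and the example scores are all hypothetical and do not come from the paper.

```python
import math


def exact_auc(target_scores, nontarget_scores):
    """Fraction of (target, non-target) pairs ranked correctly; ties count as half."""
    total = 0.0
    for t in target_scores:
        for n in nontarget_scores:
            if t > n:
                total += 1.0
            elif t == n:
                total += 0.5
    return total / (len(target_scores) * len(nontarget_scores))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def approx_auc(target_scores, nontarget_scores, sharpness=10.0):
    """Smooth AUC surrogate: the pairwise step function is replaced by a sigmoid.

    `sharpness` is an assumed hyperparameter; larger values bring the
    surrogate closer to the exact (non-differentiable) AUC.
    """
    total = sum(
        sigmoid(sharpness * (t - n))
        for t in target_scores
        for n in nontarget_scores
    )
    return total / (len(target_scores) * len(nontarget_scores))


# Hypothetical verification scores, for illustration only.
targets = [0.9, 0.8, 0.4]
nontargets = [0.5, 0.3, 0.1]
print(exact_auc(targets, nontargets))
print(approx_auc(targets, nontargets))
```

In a training setting, the negative of such a surrogate could serve as a loss, since it rewards target scores that exceed non-target scores by a margin; whether the paper uses exactly this form is not stated in the record.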