Deep generative variational autoencoding for replay spoof detection in automatic speaker verification. (September 2020)
- Record Type:
- Journal Article
- Title:
- Deep generative variational autoencoding for replay spoof detection in automatic speaker verification. (September 2020)
- Main Title:
- Deep generative variational autoencoding for replay spoof detection in automatic speaker verification
- Authors:
- Chettri, Bhusan
Kinnunen, Tomi
Benetos, Emmanouil
- Abstract:
- Highlights: Variational autoencoder (VAE) for replay spoofing attack detection as an alternative backend to GMMs. A systematic comparison of three alternative class-conditioned VAE variants. Experimental evaluation on two standard replay spoofing benchmarks, ASVspoof 2017 v2.0 and ASVspoof 2019 PA. VAE residual (absolute difference of input and its reconstruction) as a data-driven feature representation for replay attack detection.
Abstract: Automatic speaker verification (ASV) systems are highly vulnerable to presentation attacks, also called spoofing attacks. Replay is among the simplest attacks to mount, yet it is difficult to detect reliably. The generalization failure of spoofing countermeasures (CMs) has driven the community to study various alternative deep learning CMs. The majority of them are supervised approaches that learn a human-spoof discriminator. In this paper, we advocate a different, deep generative approach that leverages powerful unsupervised manifold learning in classification. The potential benefits include the possibility to sample new data and to obtain insights into the latent features of genuine and spoofed speech. To this end, we propose to use variational autoencoders (VAEs) as an alternative backend for replay attack detection, via three alternative models that differ in their class-conditioning. The first one, similar to the use of Gaussian mixture models (GMMs) in spoof detection, is to train two VAEs independently, one for each class. The second one is to train a single conditional model (C-VAE) by injecting a one-hot class label vector into the encoder and decoder networks. Our final proposal integrates an auxiliary classifier to guide the learning of the latent space. Our experimental results using constant-Q cepstral coefficient (CQCC) features on the ASVspoof 2017 and 2019 physical access subtask datasets indicate that the C-VAE offers substantial improvement in comparison to training two separate VAEs for each class. On the 2019 dataset, the C-VAE outperforms the VAE and the baseline GMM by an absolute 9-10% in both equal error rate (EER) and tandem detection cost function (t-DCF) metrics. Finally, we propose VAE residuals, the absolute difference of the original input and the reconstruction, as features for spoofing detection.
The proposed frontend approach augmented with a convolutional neural network classifier demonstrated substantial improvement over the VAE backend use case.
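The VAE residual mentioned in the abstract is defined there as the absolute difference between an input and its reconstruction. The following is a minimal sketch of that one step only, not the authors' implementation: `vae_residual` and `toy_reconstruct` are hypothetical names, the 100 x 30 random matrix merely stands in for CQCC features, and the moving-average function is an assumed placeholder for a trained VAE's encode-decode pass.

```python
import numpy as np


def vae_residual(features, reconstruct):
    """Return |x - x_hat| for a (frames, coefficients) feature matrix.

    `reconstruct` is any callable mapping the input to a reconstruction
    of the same shape; in practice this would be a trained VAE.
    """
    x = np.asarray(features, dtype=float)
    x_hat = np.asarray(reconstruct(x), dtype=float)
    if x_hat.shape != x.shape:
        raise ValueError("reconstruction must have the same shape as the input")
    return np.abs(x - x_hat)


def toy_reconstruct(x):
    """Assumed stand-in for a VAE: 3-frame moving average along time."""
    padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0


# Random data standing in for CQCC features (100 frames, 30 coefficients).
rng = np.random.default_rng(0)
cqcc = rng.normal(size=(100, 30))

residual = vae_residual(cqcc, toy_reconstruct)
print(residual.shape)
```

The residual has the same shape as the input, so it could be passed to a downstream classifier in place of the original features, which is the frontend use the abstract describes.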
- Is Part Of:
- Computer speech & language. Volume 63(2020)
- Journal:
- Computer speech & language
- Issue:
- Volume 63(2020)
- Issue Display:
- Volume 63, Issue 2020 (2020)
- Year:
- 2020
- Volume:
- 63
- Issue:
- 2020
- Issue Sort Value:
- 2020-0063-2020-0000
- Page Start:
- Page End:
- Publication Date:
- 2020-09
- Subjects:
- Anti-spoofing -- Presentation attack detection -- Replay attack -- Countermeasures -- Deep generative models
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2020.101092
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 13581.xml