On significance of constant-Q transform for pop noise detection. (January 2023)
- Record Type:
- Journal Article
- Title:
- On significance of constant-Q transform for pop noise detection. (January 2023)
- Main Title:
- On significance of constant-Q transform for pop noise detection
- Authors:
- Khoria, Kuldeep
Patil, Ankur T.
Patil, Hemant A.
- Abstract:
- Liveness detection has emerged as an important research issue for many biometrics, such as face, iris, hand geometry, etc. and significant research efforts are reported in the literature. However, less emphasis is given to liveness detection for voice biometrics or Automatic Speaker Verification (ASV). Voice Liveness Detection (VLD) can be a potential technique to detect spoofing attacks in ASV system.
Presence of pop noise in the speech signal of live speaker provides the discriminative acoustic cue to distinguish between genuine vs. spoofed speech in the framework of VLD. Pop noise comes out as a burst at the lips, which is captured by the ASV system (since the speaker and microphone are close enough), indicating the liveness of the speaker and provides the basis of VLD. In this paper, we present the Constant-Q Transform (CQT) -based approach over the traditional Short-Time Fourier Transform (STFT) -based algorithm (baseline). With respect to Heisenberg's uncertainty principle in signal processing framework, the CQT has variable spectro-temporal resolution, in particular, better frequency resolution for low frequency region and better temporal resolution for high frequency region, which can be effectively utilized to identify the low frequency characteristics of pop noise. We have also compared proposed algorithm with cepstral features, namely, Linear Frequency Cepstral Coefficients (LFCC) and Constant-Q Cepstral Coefficients. The experiments are performed on recently released POp noise COrpus (POCO) dataset with various statistical, discriminative, and deep learning-based classifiers, namely, Gaussian Mixture Model (GMM), Support Vector Machine (SVM), Convolutional Neural Networks (CNN), Light-CNN (LCNN), and Residual Network (ResNet), respectively. The significant improvement in performance, in particular, an absolute improvement of 14.23% and 10.95% in terms of percentage classification accuracy on development and evaluation set, respectively, is obtained for the proposed CQT-based algorithm along with SVM classifier, over the STFT-SVM (baseline) system. Similar trend of the performance improvement is observed for the GMM, CNN, LCNN, and ResNet classifiers for the proposed CQT-based algorithm vs. traditional STFT-based algorithm. 
The analysis is further extended by simulating the replay mechanism (in the standard framework of ASVSpoof-2019 PA challenge dataset) on the subset of POCO dataset in order to observe the effect of room acoustics onto the performance of the VLD system. By embedding the moderate simulated replay mechanism in POCO dataset, we obtained the percentage classification accuracy of 97.82% on evaluation set.
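Note on the resolution claim in the abstract: the trade-off between frequency and temporal resolution in a constant-Q analysis can be illustrated with a small sketch. This is not the authors' implementation. The function name `cqt_bins` and all parameter values (minimum frequency, bins per octave, number of octaves, sampling rate) are assumptions chosen only for illustration. The sketch uses the standard constant-Q definitions: geometrically spaced centre frequencies, a fixed ratio Q of centre frequency to bandwidth, and a window length inversely proportional to centre frequency.

```python
import math


def cqt_bins(f_min=32.7, bins_per_octave=12, n_octaves=7, fs=16000):
    """Return (q, rows), where rows holds one tuple per constant-Q bin:
    (centre frequency in Hz, bandwidth in Hz, window length in ms).

    All default values are illustrative assumptions, not taken from the paper.
    """
    # Constant quality factor: centre frequency divided by bandwidth.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    rows = []
    for k in range(bins_per_octave * n_octaves):
        f_k = f_min * 2.0 ** (k / bins_per_octave)  # geometric spacing
        bandwidth = f_k / q                         # narrow at low frequency
        n_k = math.ceil(q * fs / f_k)               # long window at low frequency
        rows.append((f_k, bandwidth, 1000.0 * n_k / fs))
    return q, rows


if __name__ == "__main__":
    q, rows = cqt_bins()
    for f_k, bandwidth, window_ms in (rows[0], rows[-1]):
        print(f"f={f_k:8.1f} Hz  bandwidth={bandwidth:7.2f} Hz  window={window_ms:7.1f} ms")
```

Under these assumptions the lowest bin has the narrowest bandwidth and the longest window, and the highest bin the reverse. An STFT, by contrast, uses one window length and therefore one bandwidth for every bin.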
- Is Part Of:
- Computer speech & language. Volume 77(2023)
- Journal:
- Computer speech & language
- Issue:
- Volume 77(2023)
- Issue Display:
- Volume 77, Issue 2023 (2023)
- Year:
- 2023
- Volume:
- 77
- Issue:
- 2023
- Issue Sort Value:
- 2023-0077-2023-0000
- Page Start:
- Page End:
- Publication Date:
- 2023-01
- Subjects:
- Voice liveness detection -- Pop noise -- Constant-Q transform -- STFT -- POCO
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2022.101421
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 23332.xml