Feature learning for efficient ASR-free keyword spotting in low-resource languages. (January 2022)
- Record Type:
- Journal Article
- Title:
- Feature learning for efficient ASR-free keyword spotting in low-resource languages. (January 2022)
- Main Title:
- Feature learning for efficient ASR-free keyword spotting in low-resource languages
- Authors:
- van der Westhuizen, Ewald
Kamper, Herman
Menon, Raghav
Quinn, John
Niesler, Thomas
- Abstract:
- We consider feature learning for a computationally efficient method of keyword spotting that can be applied in severely under-resourced settings. The objective is to support humanitarian relief programmes by the United Nations (UN) in parts of Africa in which almost no language resources are available. To allow a keyword spotting system to be rapidly developed in such a language, we rely on a small and easily-compiled set of isolated keywords. Using the isolated keywords as templates, we apply dynamic time warping (DTW) to a much larger corpus of in-domain but untranscribed speech. The resulting DTW alignment scores are used to train a convolutional neural network (CNN) which is orders of magnitude more computationally efficient than DTW and therefore suitable for real-time application. We optimise this ASR-free neural network keyword spotting procedure by identifying acoustic features that provide robust performance in this almost zero-resource setting. First, we consider the benefits of incorporating information from well-resourced but unrelated languages by incorporating a multilingual bottleneck feature (BNF) extractor. Next, we consider using features extracted from an autoencoder (AE) trained on in-domain but untranscribed data. Finally, we consider features obtained from a correspondence autoencoder (CAE) which is initialised with the AE and subsequently fine-tuned on the small set of in-domain labelled data. Experiments in South African English and Luganda, a low-resource language, demonstrate that, on their own, both the BNF and CAE features can achieve a 5% relative performance improvement over baseline MFCCs. However, by using BNFs as input to the CAE, even better performance is achieved, resulting in a more than 27% relative improvement over MFCCs in ROC area-under-the-curve (AUC) and more than twice as many top-10 retrievals. We also show that, using these features, the CNN-DTW keyword spotter performs almost as well as the DTW keyword spotter while comfortably outperforming a baseline CNN trained only on the keyword templates. We conclude that a CNN-DTW keyword spotter using BNF-derived CAE features represents a computationally efficient approach with very competitive performance that is suited to rapid deployment in a severely under-resourced scenario.
Highlights: Keyword spotting is used for word recognition in severely under-resourced languages. Keyword spotting systems support United Nations humanitarian relief programmes. Isolated keyword templates are used with dynamic time warping for keyword spotting. Convolutional networks are trained to be more efficient keyword spotters than DTW. A correspondence autoencoder with bottleneck features yields superior features.
- Is Part Of:
- Computer speech & language. Volume 71(2022)
- Journal:
- Computer speech & language
- Issue:
- Volume 71(2022)
- Issue Display:
- Volume 71, Issue 2022 (2022)
- Year:
- 2022
- Volume:
- 71
- Issue:
- 2022
- Issue Sort Value:
- 2022-0071-2022-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-01
- Subjects:
- Keyword spotting -- Representation learning -- Low-resource languages -- Dynamic time warping -- Convolutional neural networks
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2021.101275
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 19368.xml