Unsupervised language identification based on Latent Dirichlet Allocation. (September 2016)
- Record Type:
- Journal Article
- Title:
- Unsupervised language identification based on Latent Dirichlet Allocation. (September 2016)
- Main Title:
- Unsupervised language identification based on Latent Dirichlet Allocation
- Authors:
- Zhang, Wei
Clark, Robert A.J.
Wang, Yongyuan
Li, Wen
- Abstract:
- Graphical abstract: We propose an unsupervised language identification approach based on Latent Dirichlet Allocation (LDA-LI), whose generative model is shown on the left, which takes raw n-gram counts as features without any smoothing, pruning or interpolation. In the experiments on the ECI/MCI benchmark, shown on the right, LDA-LI achieves precision, recall and F-scores comparable to state-of-the-art supervised language identification techniques (langID.py, Guess_language, etc.).
Highlights: An unsupervised language identification approach based on Latent Dirichlet Allocation with high precision, recall and F-scores. Raw n-gram counts are used as features without any smoothing, pruning or interpolation. Purifies the main language from an unknown number of other languages with high precision. Finds the measure most closely related to the minimum number of topics.
Abstract: To automatically build, from scratch, the language processing component for a speech synthesis system in a new language, a purified text corpus is needed in which any words and phrases from other languages are clearly identified or excluded. When using found data, and where there is no inherent linguistic knowledge of the language or languages contained in the data, identifying the pure data is a difficult problem. We propose an unsupervised language identification approach based on Latent Dirichlet Allocation in which we take raw n-gram counts as features without any smoothing, pruning or interpolation. The Latent Dirichlet Allocation topic model is reformulated for the language identification task, and Collapsed Gibbs Sampling is used to train an unsupervised language identification model. To find the number of languages present, we compared four kinds of measure, and also the Hierarchical Dirichlet Process, on several configurations of the ECI/MCI benchmark. Experiments on the ECI/MCI data and a Wikipedia-based Swahili corpus show that this LDA method, without any annotation, achieves precision, recall and F-scores comparable to state-of-the-art supervised language identification techniques.
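Illustrative sketch: the method summarised in the abstract (raw n-gram counts fed to an LDA model trained by collapsed Gibbs sampling, with each topic standing for a candidate language) can be sketched as follows. This is not the authors' implementation. The function names `char_ngrams` and `gibbs_lda`, the trigram order, the hyperparameters `alpha` and `beta`, the iteration count and the sample sentences are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def char_ngrams(text, n=3):
    """Raw character n-gram counts: no smoothing, pruning or interpolation."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def gibbs_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA over documents given as token lists.

    Returns one list of per-topic token counts for each document.
    Hyperparameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    vocab_size = len({tok for doc in docs for tok in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Random initial topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for tok in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][tok] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, tok in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][tok] -= 1
                topic_total[z] -= 1
                # Conditional distribution over topics (unnormalised).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][tok] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][tok] += 1
                topic_total[z] += 1
    return doc_topic


# Hypothetical input: each sentence becomes a bag of its raw trigrams.
sentences = ["the cat sat on the mat", "der Hund liegt auf dem Sofa"]
docs = [list(char_ngrams(s).elements()) for s in sentences]
topic_counts = gibbs_lda(docs, num_topics=2)
```

Under these assumptions, the topic holding most of a document's n-grams would be read as that document's language. Two short sentences are far too little data for a reliable separation, so the sketch shows only the mechanics.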
- Is Part Of:
- Computer speech & language. Volume 39(2016)
- Journal:
- Computer speech & language
- Issue:
- Volume 39(2016)
- Issue Display:
- Volume 39, Issue 2016 (2016)
- Year:
- 2016
- Volume:
- 39
- Issue:
- 2016
- Issue Sort Value:
- 2016-0039-2016-0000
- Page Start:
- 47
- Page End:
- 66
- Publication Date:
- 2016-09
- Subjects:
- Language filtering -- Language purifying -- Language identification
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2016.02.001
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 2467.xml