Unsupervised language identification based on Latent Dirichlet Allocation. (September 2016)

Record Type:: Journal Article
Title:: Unsupervised language identification based on Latent Dirichlet Allocation. (September 2016)
Main Title:: Unsupervised language identification based on Latent Dirichlet Allocation
Authors:: Zhang, Wei
Clark, Robert A.J.
Wang, Yongyuan
Li, Wen
Abstract:: Graphical abstract: Left, the generative model: we propose an unsupervised language identification approach based on Latent Dirichlet Allocation (LDA-LI) that takes raw n-gram counts as features, without any smoothing, pruning or interpolation. Right, experiments on the ECI/MCI benchmark: LDA-LI achieves precision, recall and F-scores comparable to state-of-the-art supervised language identification techniques (langID.py, Guess_language, etc.).
Highlights: An unsupervised language identification approach based on Latent Dirichlet Allocation with high precision, recall and F-scores. Raw n-gram counts are used as features, without any smoothing, pruning or interpolation. Purifies the main language from an unknown number of other languages with high precision. Identifies the measure most closely related to the minimum number of topics.
Abstract: To automatically build, from scratch, the language processing component of a speech synthesis system for a new language, a purified text corpus is needed in which any words and phrases from other languages are clearly identified or excluded. When using found data with no inherent linguistic knowledge of the language or languages it contains, identifying the pure data is a difficult problem. We propose an unsupervised language identification approach based on Latent Dirichlet Allocation that takes raw n-gram counts as features, without any smoothing, pruning or interpolation. The Latent Dirichlet …
Is Part Of:: Computer speech & language. Volume 39(2016)
Journal:: Computer speech & language
Issue:: Volume 39(2016)
Issue Display:: Volume 39, Issue 2016 (2016)
Year:: 2016
Volume:: 39
Issue:: 2016
Issue Sort Value:: 2016-0039-2016-0000
Page Start:: 47
Page End:: 66
Publication Date:: 2016-09
Subjects:: Language filtering -- Language purifying -- Language identification
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
Journal URLs:: http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
DOI:: 10.1016/j.csl.2016.02.001
Languages:: English
ISSNs:: 0885-2308
Deposit Type:: Legal deposit
View Content:: Available online (eLD content is only available in our Reading Rooms)
Physical Locations:: British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
Ingest File:: 2467.xml