Topic modeling of Chinese language beyond a bag-of-words. (November 2016)
- Record Type:
- Journal Article
- Title:
- Topic modeling of Chinese language beyond a bag-of-words. (November 2016)
- Main Title:
- Topic modeling of Chinese language beyond a bag-of-words
- Authors:
- Qin, Zengchang
Cong, Yonghui
Wan, Tao
- Abstract:
- Highlights: Topic modeling on Chinese words and characters is investigated experimentally. A model, CWTM, is proposed by embedding the character–word relation into topic modeling. An asymmetric Dirichlet prior over the topic–word distribution leads to better results. The CWTM model obtains better classification performance and is more robust. Abstract: The topic model is one of the best-known hierarchical Bayesian models for language modeling and document analysis. It has achieved great success in text classification, in which a text is represented as a bag of its words, disregarding grammar and even word order; this is referred to as the bag-of-words assumption. In this paper, we investigate topic modeling of the Chinese language, which has a different morphology from alphabetic Western languages such as English. Chinese characters, rather than Chinese words, are the basic structural units in Chinese. Previous empirical studies show that the character-based topic model performs better than the word-based topic model. In this research, we propose the character–word topic model (CWTM) to consider the character–word relation in topic modeling. Two types of experiments are designed to test the performance of the newly proposed model: topic extraction and text classification. Through empirical studies, we demonstrate the superiority of the newly proposed model compared to both word-based and character-based topic models.
- Is Part Of:
- Computer speech & language. Volume 40(2016)
- Journal:
- Computer speech & language
- Issue:
- Volume 40(2016)
- Issue Display:
- Volume 40, Issue 2016 (2016)
- Year:
- 2016
- Volume:
- 40
- Issue:
- 2016
- Issue Sort Value:
- 2016-0040-2016-0000
- Page Start:
- 60
- Page End:
- 78
- Publication Date:
- 2016-11
- Subjects:
- Topic models -- Chinese language modeling -- Text classification -- Language model -- Character–word topic model -- Latent Dirichlet allocation
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2016.03.004
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 7920.xml