Tackling topic general words in topic modeling. (June 2017)
- Record Type:
- Journal Article
- Title:
- Tackling topic general words in topic modeling. (June 2017)
- Main Title:
- Tackling topic general words in topic modeling
- Authors:
- Xu, Yueshen
Yin, Yuyu
Yin, Jianwei
- Abstract:
- Topic models are a prevailing tool for exploring latent topics in documents, and for helping to complete many NLP tasks. To obtain good topics for a corpus, a preprocessing step is often needed to remove common stop words and identify topic general words (TGW) from the corpus. Such words can seriously harm the topic formation because they create spurious co-occurrence of unrelated words. Also, they are likely to occupy top positions of multiple topics, lead to many unrelated words being grouped under a topic, and consequently result in inscrutable and similar topics. In an application, one typically manually identifies and removes a list of TGWs in the corpus. This is a time-consuming process and very hard to do by a layman user. In this paper, we aim to solve this problem automatically. The proposed approaches can be based on the current corpus alone or multiple corpora. In the latter case, a novel continuous learning method is proposed that learns from past results of multiple domain corpora to help identify TGWs in the current domain. We conduct experiments on two real-world datasets, and the experimental results show that the proposed approaches achieve superior results.
Highlights: Study the problem of topic general words in topic modeling. Propose a metric, the generality score, to measure the generality of a word. Propose a new topic model, generality-sensitive LDA, to exploit generality scores in modeling. Propose a continuous learning approach that can use multiple domains to find topic general words.
- Is Part Of:
- Engineering applications of artificial intelligence. Volume 62(2017:Feb.)
- Journal:
- Engineering applications of artificial intelligence
- Issue:
- Volume 62(2017:Feb.)
- Issue Display:
- Volume 62 (2017)
- Year:
- 2017
- Volume:
- 62
- Issue Sort Value:
- 2017-0062-0000-0000
- Page Start:
- 124
- Page End:
- 133
- Publication Date:
- 2017-06
- Subjects:
- Topic Modeling -- Topic General Word -- Pólya Urn -- Lifelong Learning -- Topic Coherence
Engineering -- Data processing -- Periodicals
Artificial intelligence -- Periodicals
Expert systems (Computer science) -- Periodicals
Ingénierie -- Informatique -- Périodiques
Intelligence artificielle -- Périodiques
Systèmes experts (Informatique) -- Périodiques
Artificial intelligence
Engineering -- Data processing
Expert systems (Computer science)
Periodicals
620.00285
- Journal URLs:
- http://www.sciencedirect.com/science/journal/09521976
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.engappai.2017.04.009
- Languages:
- English
- ISSNs:
- 0952-1976
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3755.704500
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 2498.xml