A neural topic model with word vectors and entity vectors for short texts. Issue 2 (March 2021)
- Record Type:
- Journal Article
- Title:
- A neural topic model with word vectors and entity vectors for short texts. Issue 2 (March 2021)
- Main Title:
- A neural topic model with word vectors and entity vectors for short texts
- Authors:
- Zhao, Xiaowei
Wang, Deqing
Zhao, Zhengyang
Liu, Wei
Lu, Chenwei
Zhuang, Fuzhen
- Abstract:
- Traditional topic models are widely used for semantic discovery from long texts. However, they usually fail to mine high-quality topics from short texts (e.g. tweets) due to the sparsity of features and the lack of word co-occurrence patterns. In this paper, we propose a Variational Auto-Encoder Topic Model (VAETM for short) by combining word vector representation and entity vector representation to address the above limitations. Specifically, we first learn embedding representations of each word and each entity by employing a large-scale external corpus and a large, manually edited knowledge graph, respectively. Then we integrate the embedding representations into the variational auto-encoder framework and propose an unsupervised model named VAETM to infer the latent representation of topic distributions. To further boost VAETM, we propose an improved supervised VAETM (SVAETM for short) by considering label information in the training set to supervise the inference of the latent representation of topic distributions and the generation of topics. Last, we propose KL-divergence-based inference algorithms to infer the approximate posterior distribution for our two models. Extensive experiments on three common short text datasets demonstrate that our proposed VAETM and SVAETM outperform various kinds of state-of-the-art models in terms of perplexity, NPMI, and accuracy.
Highlights: We first propose an unsupervised topic model named VAETM for short texts. We then consider label information in the dataset to boost VAETM. A KL-divergence-based algorithm is used to infer the approximate posterior distribution. Extensive experiments demonstrate that our models outperform state-of-the-art baselines.
- Is Part Of:
- Information processing & management. Volume 58:Issue 2(2021)
- Journal:
- Information processing & management
- Issue:
- Volume 58:Issue 2(2021)
- Issue Display:
- Volume 58, Issue 2 (2021)
- Year:
- 2021
- Volume:
- 58
- Issue:
- 2
- Issue Sort Value:
- 2021-0058-0002-0000
- Page Start:
- Page End:
- Publication Date:
- 2021-03
- Subjects:
- Topic model -- Short text -- Variational auto-encoder -- Word embedding -- Entity embedding
Information storage and retrieval systems -- Periodicals
Information science -- Periodicals
Systèmes d'information -- Périodiques
Sciences de l'information -- Périodiques
Information science
Information storage and retrieval systems
Periodicals
658.4038
- Journal URLs:
- http://www.sciencedirect.com/science/journal/03064573
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.ipm.2020.102455
- Languages:
- English
- ISSNs:
- 0306-4573
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4493.893000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 15543.xml