Robustness, replicability and scalability in topic modelling. Issue 1 (February 2022)
- Record Type:
- Journal Article
- Title:
- Robustness, replicability and scalability in topic modelling. Issue 1 (February 2022)
- Main Title:
- Robustness, replicability and scalability in topic modelling
- Authors:
- Ballester, Omar
Penner, Orion
- Abstract:
- Highlights: We identify three key properties a topic model must exhibit for effective application in the social sciences: statistical robustness, descriptive power across all dimensions and reflection of reality. We propose a simple approach for estimating the statistical robustness of topic models that is based on pairwise similarity scores between documents. Applying that approach we find that the neural network-based Doc2Vec is more stable than the other topic models tested: Latent Dirichlet Allocation and Non-negative Matrix Factorisation. We further propose a principal component analysis based approach for assessing the descriptive power of topic models. In applying that approach we find that Doc2Vec performs the best, but LDA does also perform well. We provide grounds for the application of neural embeddings approaches in the social sciences and also how traditional visualisation techniques can be applied directly to dense vector representations.
Abstract: Approaches for estimating the similarity between individual publications are an area of long-standing interest in the scientometrics and informetrics communities. Traditional techniques have generally relied on references and other metadata, while text mining approaches based on title and abstract text have appeared more frequently in recent years. In principle, topic models have great potential in this domain. But, in practice, they are often difficult to employ successfully, and are notoriously inconsistent as latent space dimension grows. In this manuscript we identify the three properties all usable topic models should have: robustness, descriptive power and reflection of reality. We develop a novel method for evaluating the robustness of topic models and suggest a metric to assess and benchmark descriptive power as number of topics scale. Employing that procedure, we find that the neural-network-based paragraph embedding approach seems capable of providing statistically robust estimates of the document–document similarities, even for topic spaces far larger than what is usually considered prudent for the most common topic model approaches.
- Is Part Of:
- Journal of informetrics. Volume 16:Issue 1(2022)
- Journal:
- Journal of informetrics
- Issue:
- Volume 16:Issue 1(2022)
- Issue Display:
- Volume 16, Issue 1 (2022)
- Year:
- 2022
- Volume:
- 16
- Issue:
- 1
- Issue Sort Value:
- 2022-0016-0001-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-02
- Subjects:
- Scientometrics -- Topic modelling -- Stability -- Robustness -- Similarity -- Informetrics
Library statistics -- Periodicals
Information science -- Statistical methods -- Periodicals
Bibliometrics -- Periodicals
Bibliothèques -- Statistiques -- Périodiques
Sciences de l'information -- Méthodes statistiques -- Périodiques
Bibliométrie -- Périodiques
020.727
- Journal URLs:
- http://www.journals.elsevier.com/journal-of-informetrics/
http://rave.ohiolink.edu/ejournals/issn/17511577/
http://www.sciencedirect.com/science/journal/17511577
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.joi.2021.101224
- Languages:
- English
- ISSNs:
- 1751-1577
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 5006.830000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 22282.xml