Unsupervised modeling anomaly detection in discussion forums posts using global vectors for text representation. (4th September 2020)
- Record Type:
- Journal Article
- Title:
- Unsupervised modeling anomaly detection in discussion forums posts using global vectors for text representation. (4th September 2020)
- Main Title:
- Unsupervised modeling anomaly detection in discussion forums posts using global vectors for text representation
- Authors:
- Cichosz, Paweł
- Abstract:
- Anomaly detection can be seen as an unsupervised learning task in which a predictive model created on historical data is used to detect outlying instances in new data. This work addresses a possibly promising but relatively uncommon application of anomaly detection to text data. Two English-language and one Polish-language Internet discussion forums devoted to psychoactive substances obtained from home-grown plants, such as hashish or marijuana, serve as text sources that are both realistic and possibly interesting on their own, due to potential associations with drug-related crime. The utility of two different vector text representations is examined: the simple bag of words representation and a more refined Global Vectors (GloVe) representation, which is an example of the increasingly popular word embedding approach. They are both combined with two unsupervised anomaly detection methods, one based on one-class support vector machines (SVM) and the other based on dissimilarity to k-medoids clusters. The GloVe representation is found to be definitely more useful for anomaly detection, permitting better detection quality and ameliorating the curse of dimensionality issues with text clustering. The cluster dissimilarity approach combined with this representation outperforms one-class SVM with respect to detection quality and appears to be a more promising approach to anomaly detection in text data.
- Is Part Of:
- Natural language engineering. Volume 26:Part 5(2020)
- Journal:
- Natural language engineering
- Issue:
- Volume 26:Part 5(2020)
- Issue Display:
- Volume 26, Issue 5, Part 5 (2020)
- Year:
- 2020
- Volume:
- 26
- Issue:
- 5
- Part:
- 5
- Issue Sort Value:
- 2020-0026-0005-0005
- Page Start:
- 551
- Page End:
- 578
- Publication Date:
- 2020-09-04
- Subjects:
- Text classification -- Text clustering -- Anomaly detection -- Word embeddings
- Natural language processing (Computer science) -- Periodicals
- Software engineering -- Periodicals
- 006.35
- Journal URLs:
- http://journals.cambridge.org/action/displayJournal?jid=NLE
- DOI:
- 10.1017/S1351324920000066
- Languages:
- English
- ISSNs:
- 1351-3249
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library HMNTS - ELD Digital store
- Ingest File:
- 14987.xml