Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling. Issue 4 (July 2020)

Record Type:: Journal Article
Title:: Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling. Issue 4 (July 2020)
Main Title:: Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling
Authors:: Cunha, Washington
Canuto, Sérgio
Viegas, Felipe
Salles, Thiago
Gomes, Christian
Mangaravite, Vitor
Resende, Elaine
Rosa, Thierson
Gonçalves, Marcos André
Rocha, Leonardo
Abstract:: Highlights: We propose and orchestrate new pre-processing steps for text classification pipelines. We explore meta-feature representations, sparsification and selective sampling. We provide thorough evaluations of the trade-offs between costs and effectiveness. Our final representations are more effective than word embeddings (up to 46%). Our processes induce large reductions in computational costs and memory consumption. Abstract: Text classification pipelines are sequences of tasks that must be performed to classify documents into a set of predefined categories. The pre-processing phase (before training) of these pipelines involves different ways of transforming and manipulating the documents for the next (learning) phase. In this paper, we introduce three new steps into the pre-processing phase of text classification pipelines to improve effectiveness while reducing the associated costs. The first step, distance-based Meta-Feature (MF) generation, aims at reducing the dimensionality of the original term-document matrix while producing a potentially more informative space that explicitly exploits discriminative labeled information. The second step is sparsification, aimed at making the MF representation less dense to reduce training costs and noise. The third step is selective sampling (SS), aimed at removing lines (documents) of the matrix obtained in the previous step by carefully selecting the "best" documents for the learning phase. Our experiments show that the …
Is Part Of:: Information processing & management. Volume 57:Issue 4(2020:Jul.)
Journal:: Information processing & management
Issue:: Volume 57:Issue 4(2020:Jul.)
Issue Display:: Volume 57, Issue 4 (2020)
Year:: 2020
Volume:: 57
Issue:: 4
Issue Sort Value:: 2020-0057-0004-0000
Publication Date:: 2020-07
Subjects:: Text classification pipelines -- Pre-processing -- Meta-features -- Selective sampling -- Sparsification -- Experimental evaluation
Information storage and retrieval systems -- Periodicals
Information science -- Periodicals
Systèmes d'information -- Périodiques
Sciences de l'information -- Périodiques
Information science
Information storage and retrieval systems
Periodicals
658.4038
Journal URLs:: http://www.sciencedirect.com/science/journal/03064573
http://www.elsevier.com/journals
DOI:: 10.1016/j.ipm.2020.102263
Languages:: English
ISSNs:: 0306-4573
Deposit Type:: Legal deposit
View Content:: Available online (eLD content is only available in our Reading Rooms)
Physical Locations:: British Library DSC - 4493.893000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
Ingest File:: 20537.xml