Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling. Issue 4 (July 2020)
- Record Type:
- Journal Article
- Title:
- Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling. Issue 4 (July 2020)
- Main Title:
- Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling
- Authors:
- Cunha, Washington
Canuto, Sérgio
Viegas, Felipe
Salles, Thiago
Gomes, Christian
Mangaravite, Vitor
Resende, Elaine
Rosa, Thierson
Gonçalves, Marcos André
Rocha, Leonardo
- Abstract:
- Highlights: We propose and orchestrate new pre-processing steps for text classification pipelines. We explore meta-feature representations, sparsification and selective sampling. We provide thorough evaluations of the trade-offs between costs and effectiveness. Our final representations are more effective than word embeddings (up to 46%). Our processes induce large reductions in computational costs and memory consumption.
Abstract: Text classification pipelines are a sequence of tasks needed to be performed to classify documents into a set of predefined categories. The pre-processing phase (before training) of these pipelines involves different ways of transforming and manipulating the documents for the next (learning) phase. In this paper, we introduce three new steps into the pre-processing phase of text classification pipelines to improve effectiveness while reducing the associated costs. The distance-based Meta-Features (MFs) generation step aims at reducing the dimensionality of the original term-document matrix while producing a potentially more informative space that explicitly exploits discriminative labeled information. The second step is a sparsification one aimed at making the MF representation less dense to reduce training costs and noise. The third step is a selective sampling (SS) aimed at removing lines (documents) of the matrix obtained in the previous step, by carefully selecting the "best" documents for the learning phase. Our experiments show that the proposed extended pre-processing pipeline can achieve significant gains in effectiveness when compared to the original TF-IDF (up to 52%) and embedding-based representations (up to 46%), at a much lower cost (up to 9.7x faster in some datasets). Other main contributions of our work include a thorough and rigorous evaluation of the trade-offs between cost and effectiveness associated with the introduction of these new steps into the pipeline as well as a comprehensive comparative experimental evaluation of many alternatives in terms of representations, approaches, etc.
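Illustrative note: the three pre-processing steps named in the abstract can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' method: the abstract does not say how the meta-features are computed, how sparsification is done, or how documents are selected. Here, cosine similarity to class centroids stands in for the distance-based meta-features, a fixed threshold stands in for sparsification, and a smallest-margin rule stands in for selective sampling. All function names, the threshold and the keep fraction are hypothetical.

```python
import numpy as np

def centroid_meta_features(X, y, n_classes):
    """Assumed MF step: map each document (row of X) to its cosine
    similarity with every class centroid, giving a docs x classes matrix."""
    centroids = np.vstack([X[y == c].mean(axis=0) for c in range(n_classes)])
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    Cn = centroids / np.maximum(
        np.linalg.norm(centroids, axis=1, keepdims=True), 1e-12)
    return Xn @ Cn.T

def sparsify(M, threshold):
    """Assumed sparsification step: zero out entries below a threshold."""
    out = M.copy()
    out[out < threshold] = 0.0
    return out

def select_by_margin(M, keep_fraction):
    """Assumed selective-sampling step: keep the documents whose two
    largest meta-feature values are closest (smallest margin).
    Requires at least two columns (classes)."""
    top2 = np.sort(M, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]
    n_keep = max(1, int(round(keep_fraction * M.shape[0])))
    return np.sort(np.argsort(margin, kind="stable")[:n_keep])

# Toy term-document matrix (6 documents x 4 terms) with two classes.
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 2, 1],
              [0, 0, 1, 2],
              [0, 1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

threshold = 0.3        # hypothetical value
mf = centroid_meta_features(X, y, n_classes=2)   # 6 x 2 instead of 6 x 4
sp = sparsify(mf, threshold)
kept = select_by_margin(sp, keep_fraction=0.5)   # row indices to train on
```

In this sketch the centroids are computed from the same documents they are then compared against, which a real evaluation would avoid (for example by computing meta-features fold by fold); the simplification keeps the example short.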
- Is Part Of:
- Information processing & management. Volume 57:Issue 4(2020:Jul.)
- Journal:
- Information processing & management
- Issue:
- Volume 57:Issue 4(2020:Jul.)
- Issue Display:
- Volume 57, Issue 4 (2020)
- Year:
- 2020
- Volume:
- 57
- Issue:
- 4
- Issue Sort Value:
- 2020-0057-0004-0000
- Page Start:
- Page End:
- Publication Date:
- 2020-07
- Subjects:
- Text classification pipelines -- Pre-processing -- Meta-features -- Selective sampling -- Sparsification -- Experimental evaluation
Information storage and retrieval systems -- Periodicals
Information science -- Periodicals
Systèmes d'information -- Périodiques
Sciences de l'information -- Périodiques
Information science
Information storage and retrieval systems
Periodicals
658.4038
- Journal URLs:
- http://www.sciencedirect.com/science/journal/03064573
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.ipm.2020.102263
- Languages:
- English
- ISSNs:
- 0306-4573
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4493.893000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 20537.xml