A Two-Stage Machine learning approach for temporally-robust text classification. (September 2017)
- Record Type:
- Journal Article
- Title:
- A Two-Stage Machine learning approach for temporally-robust text classification. (September 2017)
- Main Title:
- A Two-Stage Machine learning approach for temporally-robust text classification
- Authors:
- Salles, Thiago
Rocha, Leonardo
Mourão, Fernando
Gonçalves, Marcos
Viegas, Felipe
Meira, Wagner
- Abstract:
- Highlights: Proposal of an automatic procedure to learn the Temporal Weighting Functions (TWFs) from the training data, based on a cheap strategy to learn the distribution that better suits each domain. Proposal of three lazy strategies to incorporate TWFs into traditional ADC algorithms. Further evaluation of the proposed strategies using three real-world textual datasets with distinct temporal characteristics.
Abstract: One of the most relevant research topics in Information Retrieval is Automatic Document Classification (ADC). Several ADC algorithms have been proposed in the literature. However, the majority of these algorithms assume that the underlying data distribution does not change over time. Previous work has demonstrated evidence of the negative impact of three main temporal effects in representative textual datasets, reflected by variations observed over time in the class distribution, in the pairwise class similarities and in the relationships between terms and classes [1]. In order to minimize the impact of temporal effects in ADC algorithms, we have previously introduced the notion of a temporal weighting function (TWF), which reflects the varying nature of textual datasets. We have also proposed a procedure to derive the TWF's expression and parameters. However, the derivation of the TWF requires running explicit and complex statistical tests, which are very cumbersome or cannot even be run in several cases. In this article, we propose a machine learning methodology to automatically learn the TWF without the need to perform any statistical tests. We also propose new strategies to incorporate the TWF into ADC algorithms, which we call temporally-aware classifiers. Experiments showed that the fully-automated temporally-aware classifiers achieved significant gains (up to 17%) when compared to their non-temporal counterparts, even outperforming some state-of-the-art algorithms (e.g., SVM) in most cases, with large reductions in execution time.
- Is Part Of:
- Information systems. Volume 69(2017)
- Journal:
- Information systems
- Issue:
- Volume 69(2017)
- Issue Display:
- Volume 69, Issue 2017 (2017)
- Year:
- 2017
- Volume:
- 69
- Issue:
- 2017
- Issue Sort Value:
- 2017-0069-2017-0000
- Page Start:
- 40
- Page End:
- 58
- Publication Date:
- 2017-09
- Subjects:
- Automatic document classification -- Temporal weighting function -- Fully-Automated machine learning process
Database management -- Periodicals
Electronic data processing -- Periodicals
Bases de données -- Gestion -- Périodiques
Informatique -- Périodiques
Database management
Electronic data processing
Periodicals
005.7
- Journal URLs:
- http://www.sciencedirect.com/science/journal/03064379
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.is.2017.04.004
- Languages:
- English
- ISSNs:
- 0306-4379
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4496.367300
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 4622.xml