Improving random forests by neighborhood projection for effective text classification. (September 2018)
- Record Type:
- Journal Article
- Title:
- Improving random forests by neighborhood projection for effective text classification. (September 2018)
- Main Title:
- Improving random forests by neighborhood projection for effective text classification
- Authors:
- Salles, Thiago
Gonçalves, Marcos
Rodrigues, Victor
Rocha, Leonardo
- Abstract:
- Highlights: We propose a lazy version of the random forest classifier based on nearest neighbors. Our goal is to reduce overfitting due to very complex trees generated in noisy scenarios. We run a very extensive set of experiments covering hundreds of results in two domains. Our method was the best performer in almost all cases. Our method is also more scalable than other lazy solutions.
Abstract: In this article, we propose a lazy version of the traditional random forest (RF) classifier (called LazyNN_RF), specially designed for high-dimensional noisy classification tasks. The LazyNN_RF "localized" training projection is composed of the examples that most closely resemble the examples to be classified, obtained through nearest neighborhood training set projection. Such projection filters out irrelevant data, ultimately avoiding some of the drawbacks of traditional random forests, such as overfitting due to very complex trees, especially in high-dimensional noisy datasets. In sum, our main contributions are: (i) the proposal and implementation of a novel lazy learner based on the random forest classifier and nearest neighborhood projection of the training set that excels in automatic text classification tasks, as well as (ii) a thorough and detailed experimental analysis that sheds light on the behavior, effectiveness and feasibility of our solution. By means of an extensive experimental evaluation, performed considering two text classification domains and a large set of baseline algorithms, we show that our approach is highly effective and feasible, being a strong candidate for consideration for solving automatic text classification tasks when compared to state-of-the-art classifiers.
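Illustrative note: the abstract describes the idea only at a high level, so the following is a minimal sketch of "nearest neighborhood training set projection" followed by a locally trained forest, not the authors' LazyNN_RF implementation. Everything concrete here is an assumption made for illustration: the cosine similarity, the sparse dict representation of documents, the parameter names (`k`, `n_trees`, `n_terms`), and the use of depth-one trees (stumps) as a stand-in for full random forest trees.

```python
import math
import random
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def project_neighborhood(train, query, k):
    """Keep only the k training pairs (vector, label) most similar to the query."""
    ranked = sorted(train, key=lambda ex: cosine(ex[0], query), reverse=True)
    return ranked[:k]

def majority(labels, default=None):
    """Most frequent label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def fit_stump(sample, rng, n_terms):
    """Fit a depth-one tree that splits on presence of one randomly drawn term."""
    overall = majority([y for _, y in sample])
    vocab = sorted({t for vec, _ in sample for t in vec})
    best = None
    for term in rng.sample(vocab, min(n_terms, len(vocab))):
        present = [y for vec, y in sample if term in vec]
        absent = [y for vec, y in sample if term not in vec]
        left, right = majority(present, overall), majority(absent, overall)
        correct = present.count(left) + absent.count(right)
        if best is None or correct > best[0]:
            best = (correct, term, left, right)
    if best is None:  # no terms at all: fall back to the majority label
        return lambda vec: overall
    _, term, left, right = best
    return lambda vec: left if term in vec else right

def lazy_forest_predict(train, query, k=5, n_trees=25, n_terms=3, seed=0):
    """Lazy step: build a small forest per query, using only its neighborhood."""
    rng = random.Random(seed)
    local = project_neighborhood(train, query, k)
    votes = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(local) for _ in local]  # bagging on the projection
        votes.append(fit_stump(bootstrap, rng, n_terms)(query))
    return majority(votes)

# Toy data, invented purely for illustration.
train = [
    ({"goal": 1, "match": 1, "team": 1}, "sports"),
    ({"goal": 1, "team": 1}, "sports"),
    ({"match": 1, "coach": 1}, "sports"),
    ({"vote": 1, "senate": 1}, "politics"),
    ({"vote": 1, "law": 1}, "politics"),
    ({"senate": 1, "law": 1, "team": 1}, "politics"),
]
query = {"goal": 1, "match": 1}
print(lazy_forest_predict(train, query, k=3))
```

In this sketch the model is built at classification time, once per query, which is what makes the learner "lazy"; training examples outside the query's neighborhood never reach the trees.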
- Is Part Of:
- Information systems. Volume 77(2018)
- Journal:
- Information systems
- Issue:
- Volume 77(2018)
- Issue Display:
- Volume 77, Issue 2018 (2018)
- Year:
- 2018
- Volume:
- 77
- Issue:
- 2018
- Issue Sort Value:
- 2018-0077-2018-0000
- Page Start:
- 1
- Page End:
- 21
- Publication Date:
- 2018-09
- Subjects:
- Classification -- Random forests -- Lazy learning -- Nearest neighbors
00-01 -- 99-00
Database management -- Periodicals
Electronic data processing -- Periodicals
Bases de données -- Gestion -- Périodiques
Informatique -- Périodiques
Database management
Electronic data processing
Periodicals
005.7
- Journal URLs:
- http://www.sciencedirect.com/science/journal/03064379
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.is.2018.05.006
- Languages:
- English
- ISSNs:
- 0306-4379
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4496.367300
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 7005.xml