Comparative study of term-weighting schemes for environmental big data using machine learning. (November 2022)
- Record Type:
- Journal Article
- Title:
- Comparative study of term-weighting schemes for environmental big data using machine learning. (November 2022)
- Main Title:
- Comparative study of term-weighting schemes for environmental big data using machine learning
- Authors:
- Kim, JungJin
Kim, Han-Ul
Adamowski, Jan
Hatami, Shadi
Jeong, Hanseok
- Abstract:
- Widely-used term-weighting schemes and machine learning (ML) classifiers with default parameter settings were assessed for their performance when applied to environmental big data analysis. Five term-weighting schemes [term frequency (TF), TF–inverse document frequency (TF-IDF), Best Match 25 (BM25), TF–inverse gravity moment (TF-IGM), and TF–IDF–inverse class frequency (TF-IDF-ICF)] and five different ML classifiers [support vector machine (SVM), Naive Bayes (NB), logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost)] were tested. The optimal text-classification scheme and classifier were TF-IDF-ICF and LR, respectively. Based on evaluation criteria, their combination resulted in the best performance of all scheme and classifier combinations for the full environmental data analysis. Category classification performance differed according to the environmental section (climate, air, water, or waste/garbage), with the best performance being achieved for climate, and the poorest for water. This demonstrated the importance of selecting term-weighting schemes and ML classifiers in human-generated environmental big data analysis.
Highlights: Term-weighting schemes and ML classifiers were tested for big data analysis. Korean digital news of air, climate, water, and waste sections were used. TF-IDF-ICF with LR was an optimal classification approach for the entire section. Performance rating differed for the sections: climate the highest, water the lowest. The importance of the text-classification method in big data analysis was exhibited.
- Is Part Of:
- Environmental modelling & software. Volume 157(2022)
- Journal:
- Environmental modelling & software
- Issue:
- Volume 157(2022)
- Issue Display:
- Volume 157, Issue 2022 (2022)
- Year:
- 2022
- Volume:
- 157
- Issue:
- 2022
- Issue Sort Value:
- 2022-0157-2022-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-11
- Subjects:
- Text classification -- Environmental digital news -- Term-weighting schemes -- Feature selection
Environmental monitoring -- Computer programs -- Periodicals
Ecology -- Computer simulation -- Periodicals
Digital computer simulation -- Periodicals
Computer software -- Periodicals
Environmental Monitoring -- Periodicals
Computer Simulation -- Periodicals
Environnement -- Surveillance -- Logiciels -- Périodiques
Écologie -- Simulation, Méthodes de -- Périodiques
Simulation par ordinateur -- Périodiques
Logiciels -- Périodiques
Computer software
Digital computer simulation
Ecology -- Computer simulation
Environmental monitoring -- Computer programs
Periodicals
Electronic journals
363.70015118
- Journal URLs:
- http://www.sciencedirect.com/science/journal/13648152
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.envsoft.2022.105536
- Languages:
- English
- ISSNs:
- 1364-8152
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3791.522800
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 23968.xml
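
Illustrative note on the abstract: the record names TF-IDF-ICF as the best-performing term-weighting scheme but does not give its formula. The sketch below assumes one commonly cited formulation, weight = tf × (1 + log(N/df)) × (1 + log(C/cf)), where N is the number of documents, df the number of documents containing the term, C the number of classes, and cf the number of classes in which the term occurs. The function name `tf_idf_icf`, the toy corpus, and the choice of natural logarithm are all assumptions made for illustration; this is not necessarily the exact variant the authors used, and the classifier step (for example logistic regression) is omitted.

```python
import math
from collections import Counter


def tf_idf_icf(docs, labels):
    """Return one {term: weight} dict per tokenized document.

    Assumed formulation (not taken from the catalogued article):
        weight = tf * (1 + ln(N / df)) * (1 + ln(C / cf))
    """
    n_docs = len(docs)
    n_classes = len(set(labels))

    # df: number of documents containing each term.
    df = Counter()
    # class_terms: the set of terms seen in each class.
    class_terms = {}
    for tokens, label in zip(docs, labels):
        unique_terms = set(tokens)
        df.update(unique_terms)
        class_terms.setdefault(label, set()).update(unique_terms)

    # cf: number of classes in which each term occurs.
    cf = Counter()
    for terms in class_terms.values():
        cf.update(terms)

    weighted = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term counts within this document
        weighted.append({
            term: count
            * (1 + math.log(n_docs / df[term]))
            * (1 + math.log(n_classes / cf[term]))
            for term, count in tf.items()
        })
    return weighted


# Hypothetical toy corpus: one short "article" per environmental section.
docs = [
    ["climate", "warming", "policy"],
    ["air", "dust", "policy"],
    ["water", "river", "policy"],
    ["waste", "recycling", "policy"],
]
labels = ["climate", "air", "water", "waste"]

weights = tf_idf_icf(docs, labels)
```

In this toy corpus, "policy" appears in every document and every class, so both logarithmic factors vanish and its weight stays at its raw count, while a section-specific term such as "climate" is boosted by both the document-level and the class-level factor.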