Improving classifier training efficiency for automatic cyberbullying detection with Feature Density. Issue 5 (September 2021)
- Record Type:
- Journal Article
- Title:
- Improving classifier training efficiency for automatic cyberbullying detection with Feature Density. Issue 5 (September 2021)
- Main Title:
- Improving classifier training efficiency for automatic cyberbullying detection with Feature Density
- Authors:
- Eronen, Juuso
Ptaszynski, Michal
Masui, Fumito
Smywiński-Pohl, Aleksander
Leliwa, Gniewosz
Wroczynski, Michal
- Abstract:
- We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods in order to estimate dataset complexity, which in turn is used to comparatively estimate the potential performance of machine learning (ML) classifiers prior to any training. We hypothesize that estimating dataset complexity allows for a reduction in the number of required experiment iterations. This way we can optimize the resource-intensive training of ML models, which is becoming a serious issue due to the increases in available dataset sizes and the ever-rising popularity of models based on Deep Neural Networks (DNN). The constantly increasing need for more powerful computational resources is also affecting the environment due to the alarmingly growing amount of CO2 emissions caused by training large-scale ML models. The research was conducted on multiple datasets, including popular datasets, such as the Yelp business review dataset used for training typical sentiment analysis models, as well as more recent datasets trying to tackle the problem of cyberbullying, which, being a serious social problem, is also a much more sophisticated problem from the point of view of linguistic representation. We use cyberbullying datasets collected for multiple languages, namely English, Japanese and Polish. The difference in linguistic complexity of the datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
Highlights: Feature Density can be used to reduce the number of required experiment iterations. In general, Feature Density seems to have a negative correlation with classifier performance. Dependency structures could have potential as features in Neural Networks. Dataset complexity cannot be measured with Feature Density alone. Linguistic preprocessing can improve classifier performance.
- Is Part Of:
- Information processing & management. Volume 58:Issue 5(2021)
- Journal:
- Information processing & management
- Issue:
- Volume 58:Issue 5(2021)
- Issue Display:
- Volume 58, Issue 5 (2021)
- Year:
- 2021
- Volume:
- 58
- Issue:
- 5
- Issue Sort Value:
- 2021-0058-0005-0000
- Page Start:
- Page End:
- Publication Date:
- 2021-09
- Subjects:
- Feature density -- Dataset complexity -- Linguistics -- Cyberbullying -- Document classification -- Preprocessing
Information storage and retrieval systems -- Periodicals
Information science -- Periodicals
Systèmes d'information -- Périodiques
Sciences de l'information -- Périodiques
Information science
Information storage and retrieval systems
Periodicals
658.4038
- Journal URLs:
- http://www.sciencedirect.com/science/journal/03064573
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.ipm.2021.102616
- Languages:
- English
- ISSNs:
- 0306-4573
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4493.893000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 17578.xml