Machine learning methods for imbalanced data set for prediction of faecal contamination in beach waters. (1st September 2021)
- Record Type:
- Journal Article
- Title:
- Machine learning methods for imbalanced data set for prediction of faecal contamination in beach waters. (1st September 2021)
- Main Title:
- Machine learning methods for imbalanced data set for prediction of faecal contamination in beach waters
- Authors:
- Bourel, Mathias
Segura, Angel M.
Crisci, Carolina
López, Guzmán
Sampognaro, Lia
Vidal, Victoria
Kruk, Carla
Piccini, Claudia
Perera, Gonzalo
- Abstract:
- Highlights: Predicting water contamination by statistical models. Evaluation of several machine learning techniques and metrics to model imbalanced data. Imbalanced data sets require modified machine learning algorithms and evaluation metrics. Combining modeling strategies is necessary to anticipate water contamination.
Abstract: Predicting water contamination by statistical models is a useful tool to manage health risk in recreational beaches. Extreme contamination events, i.e. those exceeding normative limits, are generally rare with respect to bathing conditions, and thus the data are said to be imbalanced. Modeling and predicting those rare events present unique challenges. Here we introduce and evaluate several machine learning techniques and metrics to model imbalanced data and evaluate model performance. We do so by using a) simulated data sets and b) a real database with records of faecal coliform abundance monitored for 10 years in 21 recreational beaches in Uruguay (N ≈ 19000) using in situ and meteorological variables. We discuss advantages and disadvantages of the methods and provide a simple guide to fitting models for a general audience. We also provide R code to reproduce model fitting and testing. We found that most machine learning techniques are sensitive to imbalance and require specific data pre-treatment (e.g. upsampling) to improve performance. Accuracy (i.e. correctly classified cases over total cases) is not adequate to evaluate model performance on imbalanced data sets. Instead, true positive rates (TPR) and false positive rates (FPR) are recommended. Among the 52 candidate algorithms tested, the stratified random forest performed best, improving TPR by 50% with respect to the baseline (0.4), and outperformed the baseline in the evaluated metrics. Support vector machines combined with upsampling or the synthetic minority oversampling technique (SMOTE) performed well, similar to AdaBoost with SMOTE. These results suggest that combining modeling strategies is necessary to improve our capacity to anticipate water contamination and avoid health risk.
- Is Part Of:
- Water research. Volume 202(2021)
- Journal:
- Water research
- Issue:
- Volume 202(2021)
- Issue Display:
- Volume 202, Issue 2021 (2021)
- Year:
- 2021
- Volume:
- 202
- Issue:
- 2021
- Issue Sort Value:
- 2021-0202-2021-0000
- Page Start:
- Page End:
- Publication Date:
- 2021-09-01
- Subjects:
- Machine learning -- Faecal coliform -- Recreational waters -- Prediction
Water -- Pollution -- Research -- Periodicals
363.7394
- Journal URLs:
- http://catalog.hathitrust.org/api/volumes/oclc/1769499.html
http://www.sciencedirect.com/science/journal/00431354
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.watres.2021.117450
- Languages:
- English
- ISSNs:
- 0043-1354
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 9273.400000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 18487.xml
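
The abstract's point that accuracy is inadequate on imbalanced data, and that upsampling is one possible pre-treatment, can be illustrated with a minimal sketch. It assumes binary labels where 1 marks a contamination event; the function names `rates` and `upsample_minority` and the 5% event rate are hypothetical, chosen only for illustration. This is not the authors' R code, and the upsampling shown is plain random duplication rather than SMOTE.

```python
import random


def rates(y_true, y_pred):
    """Return (accuracy, TPR, FPR) for binary labels, where 1 is the rare event."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # share of real events detected
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # share of clean samples flagged
    return accuracy, tpr, fpr


def upsample_minority(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        # Nothing to duplicate; return unchanged copies.
        return list(rows), list(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(labels))) + extra
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Hypothetical data: 50 contamination events among 1000 samples (5%).
y_true = [1] * 50 + [0] * 950

# A trivial model that never predicts contamination looks accurate
# but detects none of the events.
y_pred = [0] * 1000
accuracy, tpr, fpr = rates(y_true, y_pred)

# Balance the classes before training by duplicating minority rows.
rows = list(range(1000))
balanced_rows, balanced_labels = upsample_minority(rows, y_true)
```

In this sketch the always-negative model scores high accuracy while its TPR is zero, which is the failure mode that TPR and FPR expose. Upsampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.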