Advancing methodologies for applying machine learning and evaluating spatiotemporal models of fine particulate matter (PM2.5) using satellite data over large regions. (15th October 2020)
- Record Type:
- Journal Article
- Title:
- Advancing methodologies for applying machine learning and evaluating spatiotemporal models of fine particulate matter (PM2.5) using satellite data over large regions. (15th October 2020)
- Main Title:
- Advancing methodologies for applying machine learning and evaluating spatiotemporal models of fine particulate matter (PM2.5) using satellite data over large regions
- Authors:
- Just, Allan C.
Arfer, Kodi B.
Rush, Johnathan
Dorman, Michael
Shtein, Alexandra
Lyapustin, Alexei
Kloog, Itai
- Abstract:
- Reconstructing the distribution of fine particulate matter (PM2.5) in space and time, even far from ground monitoring sites, is an important exposure science contribution to epidemiologic analyses of PM2.5 health impacts. Flexible statistical methods for prediction have demonstrated the integration of satellite observations with other predictors, yet these algorithms are susceptible to overfitting the spatiotemporal structure of the training datasets. We present a new approach for predicting PM2.5 using machine-learning methods and evaluating prediction models for the goal of making predictions where they were not previously available. We apply extreme gradient boosting (XGBoost) modeling to predict daily PM2.5 on a 1 × 1 km² resolution for a 13 state region in the Northeastern USA for the years 2000–2015 using satellite-derived aerosol optical depth and implement a recursive feature selection to develop a parsimonious model. We demonstrate excellent predictions of withheld observations but also contrast an RMSE of 3.11 μg/m³ in our spatial cross-validation withholding nearby sites versus an overfit RMSE of 2.10 μg/m³ using a more conventional random ten-fold splitting of the dataset. As the field of exposure science moves forward with the use of advanced machine-learning approaches for spatiotemporal modeling of air pollutants, our results show the importance of addressing data leakage in training, overfitting to spatiotemporal structure, and the impact of the predominance of ground monitoring sites in dense urban sub-networks on model evaluation. The strengths of our resultant modeling approach for exposure in epidemiologic studies of PM2.5 include improved efficiency, parsimony, and interpretability with robust validation while still accommodating complex spatiotemporal relationships.
Highlights: Flexible machine-learning models can estimate fine particulate PM2.5 concentrations. Models require spatial cross-validation or else are assessed overly optimistically. Gradient boosting with a small number of predictors creates excellent predictions. New daily 1 km model for health studies in Northeastern USA 2000–2015.
- Is Part Of:
- Atmospheric environment. Volume 239(2020)
- Journal:
- Atmospheric environment
- Issue:
- Volume 239(2020)
- Issue Display:
- Volume 239, Issue 2020 (2020)
- Year:
- 2020
- Volume:
- 239
- Issue:
- 2020
- Issue Sort Value:
- 2020-0239-2020-0000
- Page Start:
- Page End:
- Publication Date:
- 2020-10-15
- Subjects:
- Air pollution -- PM2.5 -- Spatial cross-validation -- Aerosol optical depth -- MAIAC
Air -- Pollution -- Periodicals
Air -- Pollution -- Meteorological aspects -- Periodicals
551.51
- Journal URLs:
- http://www.sciencedirect.com/web-editions/journal/13522310
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.atmosenv.2020.117649
- Languages:
- English
- ISSNs:
- 1352-2310
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 1767.120000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 26844.xml
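
Note on the abstract's evaluation contrast: the abstract compares a spatial cross-validation that withholds nearby sites with a conventional random ten-fold split. The sketch below is only an illustration of why the two can differ, not the authors' procedure. It uses synthetic monitoring sites, and the function names (`random_folds`, `spatial_block_folds`, `leaked_sites`), the grid-block grouping, and the block size are all hypothetical choices made for this example. It shows that a random split places observations from the same site in both training and test folds, whereas grouping by spatial block keeps each site, and its near neighbours, together.

```python
import random
from collections import defaultdict


def random_folds(n_obs, k, seed=0):
    """Assign each observation to one of k folds at random (conventional K-fold)."""
    rng = random.Random(seed)
    idx = list(range(n_obs))
    rng.shuffle(idx)
    folds = [0] * n_obs
    for pos, i in enumerate(idx):
        folds[i] = pos % k
    return folds


def spatial_block_folds(coords, k, block_size):
    """Assign folds by square grid block so nearby sites are withheld together.

    coords: one (x, y) pair per observation; block_size: block edge length in
    the same units. This grid grouping is an assumption for illustration only.
    """
    def block(x, y):
        return (int(x // block_size), int(y // block_size))

    blocks = sorted({block(x, y) for x, y in coords})
    block_fold = {b: i % k for i, b in enumerate(blocks)}
    return [block_fold[block(x, y)] for x, y in coords]


def leaked_sites(site_ids, folds):
    """Return the set of sites whose observations fall in more than one fold."""
    seen = defaultdict(set)
    for site, fold in zip(site_ids, folds):
        seen[site].add(fold)
    return {site for site, fs in seen.items() if len(fs) > 1}


# Synthetic example: 20 sites on a 100 x 100 plane, 30 daily observations each.
rng = random.Random(42)
sites = {s: (rng.uniform(0, 100), rng.uniform(0, 100)) for s in range(20)}
site_ids = [s for s in sites for _ in range(30)]
coords = [sites[s] for s in site_ids]

rand = random_folds(len(site_ids), k=10)
spat = spatial_block_folds(coords, k=10, block_size=25.0)

print("sites split across folds, random:", len(leaked_sites(site_ids, rand)))
print("sites split across folds, spatial blocks:", len(leaked_sites(site_ids, spat)))
```

With the block grouping, every observation from a site shares one fold by construction, so a model evaluated on a held-out fold has seen no data from that site or from other sites in the same block. The random split gives no such guarantee, which is one way an evaluation can look better than predictions at genuinely new locations would be.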