Use of data imputation tools to reconstruct incomplete air quality datasets: A case-study in Temuco, Chile. (1st March 2019)
- Record Type:
- Journal Article
- Title:
- Use of data imputation tools to reconstruct incomplete air quality datasets: A case-study in Temuco, Chile. (1st March 2019)
- Main Title:
- Use of data imputation tools to reconstruct incomplete air quality datasets: A case-study in Temuco, Chile
- Authors:
- Quinteros, María Elisa
Lu, Siyao
Blazquez, Carola
Cárdenas-R, Juan Pablo
Ossa, Ximena
Delgado-Saborit, Juana-María
Harrison, Roy M.
Ruiz-Rudolph, Pablo
- Abstract:
- Missing data from air quality datasets is a common problem, but is much more severe in small cities or localities. This poses a great challenge for environmental epidemiology as high exposures to pollutants worldwide occur in these settings and gaps in datasets hinder health studies that could later inform local and international policies.
Here, we propose the use of imputation methods as a tool to reconstruct air quality datasets and have applied this approach to an air quality dataset in Temuco, a mid-size city in Chile, as a case-study. We attempted to reconstruct the database comparing five approaches: mean imputation, conditional mean imputation, K-Nearest Neighbor imputation, multiple imputation and Bayesian Principal Component Analysis imputation. As a base for the imputation methods, linear regression models were fitted for PM2.5 against other air quality and meteorological variables. Methods were challenged against validation sets where data was removed artificially. Imputation methods were able to reconstruct the dataset with good performance in terms of completeness, errors, and bias, even when challenged against the validation sets. The performance improved when including covariates from a second monitoring station in Temuco. K-Nearest Neighbor imputation showed slightly better performance than multiple imputation for error (25% vs. 27%) and bias (2.1% vs. 3.9%), but presented lower completeness (70% vs. 100%). In summary, our results show that the imputation methods can be a useful tool in reconstructing air quality datasets in a real-life situation.
Highlights: Reconstruction of an air quality dataset with a high rate of data loss (>20%) was attempted using imputation methods. Regression models were successful in predicting PM2.5, with predictors consistent with a residential wood-burning source. Using imputation methods, particularly multiple imputation, the dataset was successfully reconstructed to a certain extent. These methods seem promising to study health problems in cities whose datasets may be fragmented.
- Is Part Of:
- Atmospheric environment. Volume 200(2019)
- Journal:
- Atmospheric environment
- Issue:
- Volume 200(2019)
- Issue Display:
- Volume 200, Issue 2019 (2019)
- Year:
- 2019
- Volume:
- 200
- Issue:
- 2019
- Issue Sort Value:
- 2019-0200-2019-0000
- Page Start:
- 40
- Page End:
- 49
- Publication Date:
- 2019-03-01
- Subjects:
- Wood-burning -- Air pollution -- Missing data -- Multiple imputation -- Environmental epidemiology -- Single imputation
Air -- Pollution -- Periodicals
Air -- Pollution -- Meteorological aspects -- Periodicals
551.51
- Journal URLs:
- http://www.sciencedirect.com/web-editions/journal/13522310
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.atmosenv.2018.11.053
- Languages:
- English
- ISSNs:
- 1352-2310
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 1767.120000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 9441.xml
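
The validation idea described in the abstract (remove known values artificially, impute them, then compare against the truth) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' code or data: the synthetic PM2.5 series, the two covariates (`temperature`, `wind`), the 20% masking rate, the hand-written `knn_impute` helper, and the definitions of error and bias as mean absolute and mean signed difference relative to the mean observed value. It covers only two of the five approaches named in the abstract and does not attempt to reproduce the reported figures.

```python
import numpy as np


def mean_impute(y):
    """Replace NaNs in y with the mean of the observed values."""
    filled = y.copy()
    filled[np.isnan(y)] = np.nanmean(y)
    return filled


def knn_impute(y, X, k=5):
    """Fill NaNs in y with the average y of the k nearest observed rows.

    Distance is Euclidean in the standardized covariates X (hypothetical helper).
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    observed = ~np.isnan(y)
    filled = y.copy()
    for i in np.flatnonzero(~observed):
        dist = np.linalg.norm(Z[observed] - Z[i], axis=1)
        nearest = np.argsort(dist)[:k]
        filled[i] = y[observed][nearest].mean()
    return filled


def relative_error_and_bias(truth, estimate):
    """Return (mean absolute error, mean signed error) as % of the mean truth.

    These definitions are assumed; the abstract does not define its metrics.
    """
    diff = estimate - truth
    scale = truth.mean()
    return 100 * np.abs(diff).mean() / scale, 100 * diff.mean() / scale


# Synthetic stand-in for a monitoring dataset (not real Temuco data).
rng = np.random.default_rng(0)
n = 500
temperature = rng.normal(10, 5, n)
wind = rng.gamma(2.0, 1.5, n)
pm25 = np.clip(60 - 2.0 * temperature - 3.0 * wind + rng.normal(0, 5, n), 1, None)
X = np.column_stack([temperature, wind])

# Validation set: hide 20% of the known PM2.5 values.
mask = rng.random(n) < 0.2
y_missing = pm25.copy()
y_missing[mask] = np.nan

# Impute, then score each method only on the artificially removed values.
results = {}
for name, filled in [("mean", mean_impute(y_missing)),
                     ("knn", knn_impute(y_missing, X))]:
    results[name] = relative_error_and_bias(pm25[mask], filled[mask])
    print(f"{name}: error={results[name][0]:.1f}%  bias={results[name][1]:.1f}%")
```

Scoring only the masked positions keeps the comparison honest, since those are the only points where both the imputed value and the true value are known.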