521Performance of doubly-robust, machine learning effect estimators in realistic epidemiologic data settings and practical recommendations. (2nd September 2021)
- Record Type:
- Journal Article
- Title:
- 521Performance of doubly-robust, machine learning effect estimators in realistic epidemiologic data settings and practical recommendations. (2nd September 2021)
- Main Title:
- 521Performance of doubly-robust, machine learning effect estimators in realistic epidemiologic data settings and practical recommendations
- Authors:
- Huang, Jonathan
Meng, Xiang
- Abstract:
- Background: Flexible, data-adaptive algorithms (machine learning; ML) for nuisance parameter estimation in epidemiologic causal inference have promising asymptotic properties for complex, high-dimensional data. However, recently proposed applications (e.g. targeted maximum likelihood estimation; TMLE) may produce biased parameter and standard error estimates in common real-world cohort settings. The relative performance of these novel estimators over simpler approaches in such settings is unclear. Methods: We apply double-crossfit TMLE, augmented inverse probability weighting (AIPW), and standard IPW to simple simulations (5 covariates) and "real-world" data using covariate-structure-preserving ("plasmode") simulations of 1,178 subjects and 331 covariates from a longitudinal birth cohort. We evaluate various data generating and estimation scenarios including: under- and over- (e.g. excess orthogonal covariates) identification, poor data support, near-instruments, and mis-specified biological interactions. We also track representative computation times. Results: We replicate optimal performance of cross-fit, doubly robust estimators in simple data generating processes. However, in nearly every real-world-based scenario, estimators fit with parametric learners outperform those that include non-parametric learners in terms of mean bias and confidence interval coverage. Even when correctly specified, estimators fit with non-parametric algorithms (xgboost, random forest) performed poorly (e.g. 24% bias, 57% coverage vs. 10% bias, 79% coverage for parametric fit), at times underperforming simple IPW. Conclusions: In typical epidemiologic data sets, double-crossfit estimators fit with simple smooth, parametric learners may be the optimal solution, taking 2-5 times less computation time than flexible non-parametric models, while having equal or better performance. No approaches are optimal, and estimators should be compared on simulations close to the source data. Key messages: In epidemiologic studies, use of flexible non-parametric algorithms for effect estimation should be strongly justified (i.e. high-dimensional covariates) and performed with care. Parametric learners may be a safer option with few drawbacks.
- Is Part Of:
- International journal of epidemiology. Volume 50(2021)Supplement 1
- Journal:
- International journal of epidemiology
- Issue:
- Volume 50(2021)Supplement 1
- Issue Display:
- Volume 50, Issue 1 (2021)
- Year:
- 2021
- Volume:
- 50
- Issue:
- 1
- Issue Sort Value:
- 2021-0050-0001-0000
- Page Start:
- Page End:
- Publication Date:
- 2021-09-02
- Subjects:
- Epidemiology -- Periodicals
614.4
- Journal URLs:
- http://ije.oxfordjournals.org/
http://ukcatalogue.oup.com/
- DOI:
- 10.1093/ije/dyab168.293
- Languages:
- English
- ISSNs:
- 0300-5771
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4542.244000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 18612.xml