Comparison of variable selection methods for clinical predictive modeling. (August 2018)
- Record Type:
- Journal Article
- Title:
- Comparison of variable selection methods for clinical predictive modeling. (August 2018)
- Main Title:
- Comparison of variable selection methods for clinical predictive modeling
- Authors:
- Sanchez-Pinto, L. Nelson
Venable, Laura Ruth
Fahrenbach, John
Churpek, Matthew M.
- Abstract:
- Highlights: Modern, machine learning-based modeling techniques are increasingly applied to clinical problems, including variable selection methods for predictive modeling using Electronic Health Record data. Prior studies have shown that modern modeling techniques are "data hungry". The performance of classic and modern variable selection methods appears to be associated with the size of the clinical dataset and the event-per-variable rate. In our study, we showed that classic regression-based variable selection methods perform better in smaller datasets, while modern tree-based methods do better in larger datasets.
Abstract:
Objective: Modern machine learning-based modeling methods are increasingly applied to clinical problems. One such application is in variable selection methods for predictive modeling. However, there is limited research comparing the performance of classic and modern methods for variable selection in clinical datasets.
Materials and Methods: We analyzed the performance of eight different variable selection methods: four regression-based methods (stepwise backward selection using p-value and AIC, Least Absolute Shrinkage and Selection Operator, and Elastic Net) and four tree-based methods (Variable Selection Using Random Forest, Regularized Random Forests, Boruta, and Gradient Boosted Feature Selection). We used two clinical datasets of different sizes, a multicenter adult clinical deterioration cohort and a single-center pediatric acute kidney injury cohort. Method evaluation included measures of parsimony, variable importance, and discrimination.
Results: In the large, multicenter dataset, the modern tree-based Variable Selection Using Random Forest and the Gradient Boosted Feature Selection methods achieved the best parsimony. In the smaller, single-center dataset, the classic regression-based stepwise backward selection using p-value and AIC methods achieved the best parsimony. In both datasets, variable selection tended to decrease the accuracy of the random forest models and increase the accuracy of logistic regression models.
Conclusions: The performance of classic regression-based and modern tree-based variable selection methods is associated with the size of the clinical dataset used. Classic regression-based variable selection methods seem to achieve better parsimony in clinical prediction problems in smaller datasets while modern tree-based methods perform better in larger datasets.
- Is Part Of:
- International journal of medical informatics. Volume 116(2018)
- Journal:
- International journal of medical informatics
- Issue:
- Volume 116(2018)
- Issue Display:
- Volume 116, Issue 2018 (2018)
- Year:
- 2018
- Volume:
- 116
- Issue:
- 2018
- Issue Sort Value:
- 2018-0116-2018-0000
- Page Start:
- 10
- Page End:
- 17
- Publication Date:
- 2018-08
- Subjects:
- Models -- Statistical -- Regression analysis -- Machine learning -- Data interpretation -- Statistical -- Electronic health records -- Variable selection
Medical informatics -- Periodicals
Information science -- Periodicals
Computers -- Periodicals
Medical technology -- Periodicals
Medical Informatics -- Periodicals
Technology, Medical -- Periodicals
Computers
Information science
Medical informatics
Medical technology
Electronic journals
Periodicals
610.285
- Journal URLs:
- http://www.sciencedirect.com/science/journal/13865056
http://www.clinicalkey.com/dura/browse/journalIssue/13865056
http://www.clinicalkey.com.au/dura/browse/journalIssue/13865056
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.ijmedinf.2018.05.006
- Languages:
- English
- ISSNs:
- 1386-5056
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4542.345250
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 9461.xml