COMPARISON OF STATISTICAL AND MACHINE LEARNING APPROACHES IN A SURVEY FROM THE EPIDEMIOLOGY OF NON-COMMUNICABLE DISEASES. (April 2021)
- Record Type:
- Journal Article
- Title:
- COMPARISON OF STATISTICAL AND MACHINE LEARNING APPROACHES IN A SURVEY FROM THE EPIDEMIOLOGY OF NON-COMMUNICABLE DISEASES. (April 2021)
- Main Title:
- COMPARISON OF STATISTICAL AND MACHINE LEARNING APPROACHES IN A SURVEY FROM THE EPIDEMIOLOGY OF NON-COMMUNICABLE DISEASES
- Authors:
- Kutsenko, Vladimir
Pereverdieva, Ksenia
Balanova, Yulia
Imaeva, Asiia
Shalnova, Svetlana
Yarovaya, Elena
- Abstract:
- Objective: To determine whether forgoing basic machine learning approaches causes a significant loss of predictive power in an epidemiological study. Design and method: The cross-sectional study «Epidemiology of Cardiovascular Diseases in the Russian Federation (ESSE-RF)» was performed in 13 regions in 2012–2013. ESSE-RF included 21768 randomly selected participants aged 25–64, with a response rate > 80%. Standard epidemiology methods and criteria were used. The current sub-study included 13912 participants. We chose lasso regression (L1Regression) and random forest (RF) as basic machine learning methods and logistic regression (LogRegression) as a basic statistical method. We compared these algorithms on predicting arterial hypertension (AH) from 18 clinical, demographic, and social risk factors taken from international AH guidelines. We fitted the models on a training sample and assessed their quality on a holdout sample using the AUC metric. We used 1000 random train-holdout splits to reduce the influence of any single random split. We studied the RF feature impacts with model-agnostic techniques: partial dependence plots (PDPs), feature importances, and feature interaction measures. Statistical analysis was performed using R 3.6.1. Results: LogRegression had a higher AUC (82.11 ± 0.42) than L1Regression (81.90 ± 0.40) and RF (81.60 ± 0.43). Nine of the top ten important variables in LogRegression were also among the top ten important variables of RF, and vice versa. However, RF feature importance was more compatible with the risk guidelines than LogRegression feature importance (fig. 1A). In particular, LDL and HDL levels were insignificant in LogRegression but significant in RF. The PDPs of RF were strictly monotonic and close to linear (fig. 1B). No feature except age had an interaction measure greater than 35%. Using the information from the preceding analysis, we constructed new interpretable features, which improved the AUC of basic LogRegression by 0.21. Conclusions: Data from the ESSE-RF study are homogeneous. Associations between features and AH are approximately linear and non-interactive. It is therefore appropriate to rely on basic interpretable statistical algorithms. However, machine learning methods can provide additional information that improves understanding of the influence of risk factors.
- Is Part Of:
- Journal of hypertension. Volume 39(2021)e-Supplement 1
- Journal:
- Journal of hypertension
- Issue:
- Volume 39(2021)e-Supplement 1
- Issue Display:
- Volume 39, Issue 1 (2021)
- Year:
- 2021
- Volume:
- 39
- Issue:
- 1
- Issue Sort Value:
- 2021-0039-0001-0000
- Page Start:
- Page End:
- Publication Date:
- 2021-04
- Subjects:
- Hypertension -- Periodicals
616.132005
- Journal URLs:
- http://firstsearch.oclc.org
http://journals.lww.com/jhypertension/pages/default.aspx
http://ovidsp.ovid.com/ovidweb.cgi?T=JS&NEWS=n&CSC=Y&PAGE=toc&D=yrovft&AN=00004872-000000000-00000
http://www.jhypertension.com/
http://journals.lww.com/pages/default.aspx
- DOI:
- 10.1097/01.hjh.0000745448.89364.7a
- Languages:
- English
- ISSNs:
- 1473-5598
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 5004.510000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 19887.xml
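
The evaluation protocol summarized in the abstract (fit on a training sample, score AUC on a holdout sample, repeat over many random splits, report mean ± spread) can be illustrated with a minimal sketch. This is not the authors' code: the study used R 3.6.1, whereas this sketch uses standard-library Python, synthetic data in place of ESSE-RF, a hand-written gradient-descent logistic regression, and far fewer splits than 1000. All function names, parameters, and data below are hypothetical and chosen for illustration only.

```python
import math
import random
import statistics


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Unpenalized logistic regression by batch gradient descent (illustrative only)."""
    w = [0.0] * (len(X[0]) + 1)  # last entry is the intercept
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[-1] + sum(wj * xj for wj, xj in zip(w, xi))
            err = sigmoid(z) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w


def predict(w, X):
    """Linear score; it is monotone in the predicted probability, so AUC is unchanged."""
    return [w[-1] + sum(wj * xj for wj, xj in zip(w, xi)) for xi in X]


def repeated_holdout_auc(X, y, n_splits=20, holdout_frac=0.3, seed=0):
    """Mean and standard deviation of holdout AUC over random train-holdout splits."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    aucs = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        cut = int(len(idx) * (1 - holdout_frac))
        train, hold = idx[:cut], idx[cut:]
        w = fit_logistic([X[i] for i in train], [y[i] for i in train])
        aucs.append(auc(predict(w, [X[i] for i in hold]), [y[i] for i in hold]))
    return statistics.mean(aucs), statistics.stdev(aucs)


def make_synthetic(n=300, seed=1):
    """Synthetic stand-in data: two features with a linear, non-interactive effect."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        X.append([x1, x2])
        y.append(1 if rng.random() < sigmoid(1.5 * x1 - 1.0 * x2) else 0)
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic()
    mean_auc, sd_auc = repeated_holdout_auc(X, y)
    print(f"holdout AUC: {mean_auc:.3f} +/- {sd_auc:.3f}")
```

Averaging over many splits gives a spread alongside the mean, which is what makes small differences between models (such as those reported in the abstract) comparable at all. Swapping `fit_logistic` for another learner while keeping `repeated_holdout_auc` unchanged would compare models under the same splits.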