A machine learning-based framework to identify type 2 diabetes through electronic health records. (January 2017)
- Record Type:
- Journal Article
- Title:
- A machine learning-based framework to identify type 2 diabetes through electronic health records. (January 2017)
- Main Title:
- A machine learning-based framework to identify type 2 diabetes through electronic health records
- Authors:
- Zheng, Tao
Xie, Wei
Xu, Liling
He, Xiaoying
Zhang, Ya
You, Mingrong
Yang, Gong
Chen, You
- Abstract:
- Highlights: A machine learning-based framework to identify type 2 diabetes subjects. The framework achieved high identification performances (∼0.98 in average AUC). The framework focused on reducing missing rate to identify more type 2 diabetes subjects.
Abstract:
Objective: To discover diverse genotype-phenotype associations affiliated with Type 2 Diabetes Mellitus (T2DM) via genome-wide association study (GWAS) and phenome-wide association study (PheWAS), more cases (T2DM subjects) and controls (subjects without T2DM) are required to be identified (e.g., via Electronic Health Records (EHR)). However, existing expert-based identification algorithms often suffer from a low recall rate and could miss a large number of valuable samples under conservative filtering standards. The goal of this work is to develop a semi-automated framework based on machine learning as a pilot study to liberalize filtering criteria to improve recall rate while keeping a low false positive rate.
Materials and methods: We propose a data-informed framework for identifying subjects with and without T2DM from EHR via feature engineering and machine learning. We evaluate and contrast the identification performance of widely-used machine learning models within our framework, including k-Nearest-Neighbors, Naïve Bayes, Decision Tree, Random Forest, Support Vector Machine and Logistic Regression. Our framework was conducted on 300 patient samples (161 cases, 60 controls and 79 unconfirmed subjects), randomly selected from a diabetes-related cohort of 23,281 retrieved from a regional distributed EHR repository ranging from 2012 to 2014.
Results: We apply top-performing machine learning algorithms on the engineered features. We benchmark and contrast the accuracy, precision, AUC, sensitivity and specificity of classification models against the state-of-the-art expert algorithm for identification of T2DM subjects. Our results indicate that the framework achieved high identification performances (∼0.98 in average AUC), which are much higher than the state-of-the-art algorithm (0.71 in AUC).
Discussion: Expert algorithm-based identification of T2DM subjects from EHR is often hampered by the high missing rates due to their conservative selection criteria. Our framework leverages machine learning and feature engineering to loosen such selection criteria to achieve a high identification rate of cases and controls.
Conclusions: Our proposed framework demonstrates a more accurate and efficient approach for identifying subjects with and without T2DM from EHR.
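The evaluation measures named in the abstract (accuracy, precision, sensitivity, specificity and AUC) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `binary_metrics` and `auc`, the 0.5 decision threshold, and the toy labels and scores are all invented here, and none of the numbers come from the study.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, sensitivity (recall) and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases correctly identified
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls correctly identified
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls flagged as cases
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases that were missed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the chance that a random case outscores a random control (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, hypothetical and for illustration only (1 = T2DM case, 0 = control).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.95]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = binary_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

Sensitivity here is the recall rate the abstract aims to improve, and 1 minus specificity is the false positive rate it aims to keep low.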
- Is Part Of:
- International journal of medical informatics. Volume 97(2017)
- Journal:
- International journal of medical informatics
- Issue:
- Volume 97(2017)
- Issue Display:
- Volume 97, Issue 2017 (2017)
- Year:
- 2017
- Volume:
- 97
- Issue:
- 2017
- Issue Sort Value:
- 2017-0097-2017-0000
- Page Start:
- 120
- Page End:
- 127
- Publication Date:
- 2017-01
- Subjects:
- Electronic health records -- Type 2 diabetes -- Data mining -- Feature engineering -- Machine learning
Medical informatics -- Periodicals
Information science -- Periodicals
Computers -- Periodicals
Medical technology -- Periodicals
Medical Informatics -- Periodicals
Technology, Medical -- Periodicals
Computers
Information science
Medical informatics
Medical technology
Electronic journals
Periodicals
610.285
- Journal URLs:
- http://www.sciencedirect.com/science/journal/13865056
http://www.clinicalkey.com/dura/browse/journalIssue/13865056
http://www.clinicalkey.com.au/dura/browse/journalIssue/13865056
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.ijmedinf.2016.09.014
- Languages:
- English
- ISSNs:
- 1386-5056
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4542.345250
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 1499.xml