A machine learning method for selection of genetic variants to increase prediction accuracy of type 2 diabetes mellitus using sequencing data. (4th April 2020)
- Record Type:
- Journal Article
- Title:
- A machine learning method for selection of genetic variants to increase prediction accuracy of type 2 diabetes mellitus using sequencing data. (4th April 2020)
- Main Title:
- A machine learning method for selection of genetic variants to increase prediction accuracy of type 2 diabetes mellitus using sequencing data
- Authors:
- Jung, Luann C.
Wang, Haiyan
Li, Xukun
Wu, Cen
- Abstract:
- Type 2 diabetes mellitus (T2DM) affects millions of people through its life‐altering complications. Worldwide, 3.4 million people die of diabetes annually. Studying the effect of genetic polymorphism on T2DM has been plagued by the available sample size. A 2016 Nature Reviews article summarized that the accuracy of predicting future type 2 diabetes from genetic polymorphism is very low at the population level. Innumerable associations between genes, environmental factors, and type 2 diabetes remain to be discovered. This research presents a method to identify subtle effects of genetic variants using whole genome sequencing data and improve prediction accuracy of T2DM at the population level. To achieve this, a new feature selection procedure and a classifier are proposed. The method involves (a) first applying sparse principal component analysis to genotype data to obtain orthogonal features; (b) building a new classifier using single nucleotide polymorphism (SNP)‐specific regularization parameters to reduce the false positive rate of feature selection; (c) verifying feature relevance through penalized logistic regression. After application to a dataset containing 625 597 SNPs and 23 environmental variables from each of 3326 humans, the method identified 271 genetic variants with subtle effects on T2DM prediction. These variants led to greatly improved prediction accuracy for new patients at the population level. The proposed method also has the advantage of computational efficiency, over 15 times faster than random forest and extreme gradient boosting (XGBoost) classifiers, and thus provides a promising tool for large‐scale genome‐wide association studies.
- Is Part Of:
- Statistical analysis and data mining. Volume 13:Number 3(2020)
- Journal:
- Statistical analysis and data mining
- Issue:
- Volume 13:Number 3(2020)
- Issue Display:
- Volume 13, Issue 3 (2020)
- Year:
- 2020
- Volume:
- 13
- Issue:
- 3
- Issue Sort Value:
- 2020-0013-0003-0000
- Page Start:
- 261
- Page End:
- 281
- Publication Date:
- 2020-04-04
- Subjects:
- Cornish‐Fisher expansion -- feature selection -- nearest shrunken centroid -- sparse PCA
Data mining -- Statistical methods -- Periodicals
006.312
- Journal URLs:
- http://www3.interscience.wiley.com/journal/112701062/home
http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/sam.11456
- Languages:
- English
- ISSNs:
- 1932-1864
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 8447.424100
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 13171.xml
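
Illustrative note on the abstract: step (c), verifying feature relevance through penalized logistic regression, can be sketched roughly as below. This is a minimal sketch and not the authors' implementation. It assumes an L1 (lasso) penalty with one shared penalty value, whereas the abstract does not name the penalty and its step (b) uses SNP-specific regularization parameters. The simulated genotypes, the effect size of 1.5, and all function names (`simulate_genotypes`, `fit_l1_logistic`, `soft_threshold`) are hypothetical and chosen only for illustration.

```python
import math
import random


def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simulate_genotypes(n_samples=200, n_snps=20, causal=(0, 1, 2), seed=0):
    """Hypothetical toy data: centred genotype codes (-1, 0, 1) and a binary
    outcome that depends only on the SNP indices listed in `causal`."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        row = [float(rng.choice((-1, 0, 1))) for _ in range(n_snps)]
        risk = sigmoid(sum(1.5 * row[j] for j in causal))
        X.append(row)
        y.append(1.0 if rng.random() < risk else 0.0)
    return X, y


def fit_l1_logistic(X, y, penalty=0.05, step=0.5, n_iter=200):
    """L1-penalized logistic regression fitted by proximal gradient descent.
    The intercept is left unpenalized; a single shared penalty is assumed."""
    n, p = len(X), len(X[0])
    intercept, weights = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad_b, grad_w = 0.0, [0.0] * p
        for row, label in zip(X, y):
            z = intercept + sum(w * x for w, x in zip(weights, row))
            residual = sigmoid(z) - label
            grad_b += residual
            for j, x in enumerate(row):
                grad_w[j] += residual * x
        # Gradient step on the average log-loss, then shrink each weight.
        intercept -= step * grad_b / n
        weights = [soft_threshold(w - step * g / n, step * penalty)
                   for w, g in zip(weights, grad_w)]
    return intercept, weights


if __name__ == "__main__":
    X, y = simulate_genotypes()
    _, weights = fit_l1_logistic(X, y)
    selected = [j for j, w in enumerate(weights) if w != 0.0]
    print("SNP indices with non-zero coefficients:", selected)
```

Features whose coefficients are shrunk exactly to zero are treated as irrelevant in this sketch; a larger `penalty` retains fewer SNPs. At the scale described in the abstract (625 597 SNPs), a pure-Python loop like this would be far too slow and an optimized solver would be needed.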