A forest-based feature screening approach for large-scale genome data with complex structures. Issue 1 (December 2015)
- Record Type:
- Journal Article
- Title:
- A forest-based feature screening approach for large-scale genome data with complex structures. Issue 1 (December 2015)
- Main Title:
- A forest-based feature screening approach for large-scale genome data with complex structures
- Authors:
- Wang, Gang
Fu, Guifang
Corcoran, Christopher
- Abstract:
- Background: Genome-wide association studies (GWAS) interrogate large-scale whole-genome data to characterize the complex genetic architecture of biomedical traits. When the number of SNPs increases dramatically to half a million but the sample size is still limited to thousands, the traditional p-value based statistical approaches suffer from unprecedented limitations. Feature screening has proved to be an effective and powerful approach to handling ultrahigh dimensional data statistically, yet it has not received much attention in GWAS. Feature screening reduces the feature space from millions to hundreds by removing non-informative noise. However, the univariate measures used to rank features are mainly based on individual effects, without considering the mutual interactions with other features. In this article, we explore the performance of a random forest (RF) based feature screening procedure to emphasize the SNPs that have complex effects on a continuous phenotype. Results: Both simulation and real data analysis are conducted to examine the power of the forest-based feature screening. We compare it with five other popular feature screening approaches via simulation and conclude that RF can serve as a decent feature screening tool to accommodate complex genetic effects such as nonlinear, interactive, correlative, and joint effects. Unlike the traditional p-value based Manhattan plot, we use the Permutation Variable Importance Measure (PVIM) to display the relative significance and believe that it will provide as much useful information as the traditional plot. Conclusion: Most complex traits are found to be regulated by epistatic and polygenic variants. The forest-based feature screening is shown to be an efficient, easily implemented, and accurate approach to coping with whole genome data with complex structures. Our explorations should add to the growing body of work extending feature screening to better serve the demands of contemporary genome data.
- Is Part Of:
- BMC genetics. Volume 16:Issue 1(2015)
- Journal:
- BMC genetics
- Issue:
- Volume 16:Issue 1(2015)
- Issue Display:
- Volume 16, Issue 1 (2015)
- Year:
- 2015
- Volume:
- 16
- Issue:
- 1
- Issue Sort Value:
- 2015-0016-0001-0000
- Page Start:
- 1
- Page End:
- 11
- Publication Date:
- 2015-12
- Subjects:
- Feature screening -- GWAS -- Epistasis -- Random forest -- Large-scale modeling
Genetics -- Periodicals
576.505
- Journal URLs:
- http://www.biomedcentral.com/bmcgenet/
http://www.pubmedcentral.nih.gov/tocrender.fcgi?journal=31
http://link.springer.com/
- DOI:
- 10.1186/s12863-015-0294-9
- Languages:
- English
- ISSNs:
- 1471-2156
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 9988.xml