Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data. (10th April 2015)
- Record Type:
- Journal Article
- Title:
- Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data. (10th April 2015)
- Main Title:
- Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data
- Authors:
- Sill, Martin
Saadati, Maral
Benner, Axel
- Abstract:
- Motivation: Principal component analysis (PCA) is a basic tool often used in bioinformatics for visualization and dimension reduction. However, it is known that PCA may not consistently estimate the true direction of maximal variability in high-dimensional, low sample size settings, which are typical for molecular data. Assuming that the underlying signal is sparse, i.e. that only a fraction of features contribute to a principal component (PC), this estimation consistency can be retained. Most existing sparse PCA methods use L1-penalization, i.e. the lasso, to perform feature selection. But, the lasso is known to lack variable selection consistency in high dimensions and therefore a subsequent interpretation of selected features can give misleading results. Results: We present S4VDPCA, a sparse PCA method that incorporates a subsampling approach, namely stability selection. S4VDPCA can consistently select the truly relevant variables contributing to a sparse PC while also consistently estimate the direction of maximal variability. The performance of the S4VDPCA is assessed in a simulation study and compared to other PCA approaches, as well as to a hypothetical oracle PCA that 'knows' the truly relevant features in advance and thus finds optimal, unbiased sparse PCs. S4VDPCA is computationally efficient and performs best in simulations regarding parameter estimation consistency and feature selection consistency. Furthermore, S4VDPCA is applied to a publicly available gene expression data set of medulloblastoma brain tumors. Features contributing to the first two estimated sparse PCs represent genes significantly over-represented in pathways typically deregulated between molecular subgroups of medulloblastoma. Availability and implementation: Software is available at https://github.com/mwsill/s4vdpca . Contact: m.sill@dkfz.de Supplementary information: Supplementary data are available at Bioinformatics online.
- Is Part Of:
- Bioinformatics. Volume 31:Number 16(2015)
- Journal:
- Bioinformatics
- Issue:
- Volume 31:Number 16(2015)
- Issue Display:
- Volume 31, Issue 16 (2015)
- Year:
- 2015
- Volume:
- 31
- Issue:
- 16
- Issue Sort Value:
- 2015-0031-0016-0000
- Page Start:
- 2683
- Page End:
- 2690
- Publication Date:
- 2015-04-10
- Subjects:
- Bioinformatics -- Periodicals
Genomics -- Data processing -- Periodicals
Computational biology -- Periodicals
572.80285
- Journal URLs:
- http://bioinformatics.oxfordjournals.org
http://firstsearch.oclc.org
http://ukcatalogue.oup.com/
- DOI:
- 10.1093/bioinformatics/btv197
- Languages:
- English
- ISSNs:
- 1367-4803
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 2072.348000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 12388.xml