Robust biomarker screening from gene expression data by stable machine learning-recursive feature elimination methods. (October 2022)
- Record Type:
- Journal Article
- Title:
- Robust biomarker screening from gene expression data by stable machine learning-recursive feature elimination methods. (October 2022)
- Main Title:
- Robust biomarker screening from gene expression data by stable machine learning-recursive feature elimination methods
- Authors:
- Li, Lingyu
Ching, Wai-Ki
Liu, Zhi-Ping
- Abstract:
- Recently, identifying robust biomarkers or signatures from gene expression profiling data has attracted much attention in computational biomedicine. The successful discovery of biomarkers for complex diseases such as spontaneous preterm birth (SPTB) and high-grade serous ovarian cancer (HGSOC) would help reduce the risk of preterm birth and ovarian cancer among women through early detection and intervention.
In this paper, we propose a stable machine learning-recursive feature elimination (StabML-RFE for short) strategy for screening robust biomarkers from high-throughput gene expression data. We employ eight popular machine learning methods, namely AdaBoost (AB), Decision Tree (DT), Gradient Boosted Decision Trees (GBDT), Naive Bayes (NB), Neural Network (NNET), Random Forest (RF), Support Vector Machine (SVM) and XGBoost (XGB), to train on all feature genes of the training data, apply recursive feature elimination (RFE) to remove the least important features sequentially, and obtain eight gene subsets with feature importance rankings. Then we select the top-ranking features in each ranked subset as the optimal feature subset. We establish a stability metric aggregated with classification performance on test data to assess the robustness of the eight different feature selection techniques. Finally, StabML-RFE chooses the high-frequency features in the subsets of the combination with maximum stability value as robust biomarkers.
In particular, we verify the screened biomarkers not only via internal validation, functional enrichment analysis and literature check, but also via external validation on two real-world SPTB and HGSOC datasets respectively. The proposed StabML-RFE biomarker discovery pipeline can serve as a model for identifying diagnostic biomarkers for other complex diseases from omics data. The source code and data can be found at https://github.com/zpliulab/StabML-RFE.
Highlights:
A stable feature selection method (StabML-RFE) is proposed to screen robust biomarkers.
StabML-RFE employs several popular ML-RFE methods and integrates them into an aggregation-like framework.
StabML-RFE ensembles multiple optimal feature subsets by aggregating AUC values and stability indices.
The robustness of screened biomarker genes is measured by the Stability metric based on Hamming distance.
…
- Is Part Of:
- Computational biology and chemistry. Volume 100(2022)
- Journal:
- Computational biology and chemistry
- Issue:
- Volume 100(2022)
- Issue Display:
- Volume 100, Issue 2022 (2022)
- Year:
- 2022
- Volume:
- 100
- Issue:
- 2022
- Issue Sort Value:
- 2022-0100-2022-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-10
- Subjects:
- Robust biomarker discovery -- Machine learning -- Recursive feature elimination -- Stable feature selection -- Spontaneous preterm birth -- High-grade serous ovarian cancer
Chemistry -- Data processing -- Periodicals
Biology -- Data processing -- Periodicals
Biochemistry -- Data processing
Biology -- Data processing
Molecular biology -- Data processing
Periodicals
Electronic journals
542.85
- Journal URLs:
- http://www.sciencedirect.com/science/journal/14769271
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.compbiolchem.2022.107747
- Languages:
- English
- ISSNs:
- 1476-9271
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3390.576700
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 23288.xml
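
The final selection step described in the abstract (score combinations of feature subsets by a Hamming-distance stability value, then keep the features that recur in the most stable combination) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names, the gene labels, the toy subsets, and the normalization `1 - d_H / m` are not taken from the paper, whose exact metric also aggregates AUC values, which this sketch omits. The RFE ranking step itself is also left out; the subsets are simply given as input.

```python
from collections import Counter
from itertools import combinations


def hamming_stability(subsets, all_features):
    """Mean pairwise similarity 1 - d_H / m between binary selection masks.

    Assumed normalization for illustration; not necessarily the paper's formula.
    """
    m = len(all_features)
    masks = [[f in s for f in all_features] for s in subsets]
    pairs = list(combinations(masks, 2))
    if not pairs:
        raise ValueError("need at least two subsets")
    total = 0.0
    for a, b in pairs:
        distance = sum(x != y for x, y in zip(a, b))  # Hamming distance
        total += 1.0 - distance / m
    return total / len(pairs)


def best_combination(subsets_by_method, all_features, size):
    """Return the tuple of method names (of the given size) with maximal stability."""
    return max(
        combinations(sorted(subsets_by_method), size),
        key=lambda names: hamming_stability(
            [subsets_by_method[n] for n in names], all_features
        ),
    )


def frequent_features(subsets, min_count):
    """Features that appear in at least min_count of the given subsets."""
    counts = Counter(f for s in subsets for f in set(s))
    return sorted(f for f, c in counts.items() if c >= min_count)


# Hypothetical gene labels and per-method selections (toy data, not from the paper).
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
selected = {
    "RF": {"g1", "g2", "g3"},
    "SVM": {"g1", "g2", "g4"},
    "NB": {"g4", "g5", "g6"},
}

winners = best_combination(selected, genes, size=2)
biomarkers = frequent_features([selected[n] for n in winners], min_count=2)
print(winners, biomarkers)
```

In this toy setup the combination size and the frequency threshold are fixed by hand; a fuller version would presumably search over combination sizes and weight each candidate by its test-set classification performance, as the abstract indicates.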