A comparative study on feature selection for a risk prediction model for colorectal cancer. (August 2019)
- Record Type:
- Journal Article
- Title:
- A comparative study on feature selection for a risk prediction model for colorectal cancer. (August 2019)
- Main Title:
- A comparative study on feature selection for a risk prediction model for colorectal cancer
- Authors:
- Cueto-López, Nahúm
García-Ordás, Maria Teresa
Dávila-Batista, Verónica
Moreno, Víctor
Aragonés, Nuria
Alaiz-Rodríguez, Rocío
- Abstract:
- Highlights: Feature selection techniques may be unstable when dealing with colorectal cancer (CRC) data. Robustness and performance of feature selection for CRC should be studied together. Feature selection algorithms improve the colorectal risk prediction model performance measured in terms of AUC. Potential risk (or protective) factors can be identified with machine learning techniques. Graphical approaches allow both stability and similarities to be analyzed at a glance.
Abstract: Background and objective: Risk prediction models aim at identifying people at higher risk of developing a target disease. Feature selection is particularly important to improve the prediction model performance, avoiding overfitting, and to identify the leading cancer risk (and protective) factors. Assessing the stability of feature selection/ranking algorithms becomes an important issue when the aim is to analyze the features with more prediction power. Methods: This work is focused on colorectal cancer, assessing several feature ranking algorithms in terms of performance for a set of risk prediction models (Neural Networks, Support Vector Machines (SVM), Logistic Regression, k-Nearest Neighbors and Boosted Trees). Additionally, their robustness is evaluated following a conventional approach with scalar stability metrics and a visual approach proposed in this work to study both the similarity among feature ranking techniques and their individual stability. A comparative analysis is carried out between the most relevant features found in this study and features provided by the experts according to the state-of-the-art knowledge. Results: The two best performance results in terms of Area Under the ROC Curve (AUC) are achieved with an SVM classifier using the top-41 features selected by the SVM wrapper approach (AUC=0.693) and Logistic Regression with the top-40 features selected by the Pearson correlation coefficient (AUC=0.689). Experiments showed that performing feature selection contributes to classification performance with a 3.9% and 1.9% improvement in AUC for the SVM and Logistic Regression classifier, respectively, with respect to the results using the full feature set. The visual approach proposed in this work shows that the Neural Network-based wrapper ranking is the most unstable while the Random Forest is the most stable.
Conclusions: This study demonstrates that stability and model performance should be studied jointly: Random Forest turned out to be the most stable algorithm but was outperformed by others in terms of model performance, while the SVM wrapper and the Pearson correlation coefficient are moderately stable while achieving good model performance.
- Is Part Of:
- Computer methods and programs in biomedicine. Volume 177(2019)
- Journal:
- Computer methods and programs in biomedicine
- Issue:
- Volume 177(2019)
- Issue Display:
- Volume 177, Issue 2019 (2019)
- Year:
- 2019
- Volume:
- 177
- Issue:
- 2019
- Issue Sort Value:
- 2019-0177-2019-0000
- Page Start:
- 219
- Page End:
- 229
- Publication Date:
- 2019-08
- Subjects:
- Colorectal cancer -- Risk prediction model -- Feature selection -- Stability
Medicine -- Computer programs -- Periodicals
Biology -- Computer programs -- Periodicals
Computers -- Periodicals
Medicine -- Periodicals
Médecine -- Logiciels -- Périodiques
Biologie -- Logiciels -- Périodiques
Biology -- Computer programs
Medicine -- Computer programs
Periodicals
Electronic journals
610.28
- Journal URLs:
- http://www.sciencedirect.com/science/journal/01692607
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.cmpb.2019.06.001
- Languages:
- English
- ISSNs:
- 0169-2607
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.095000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 11049.xml
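- Illustrative note:
- The abstract mentions ranking features with the Pearson correlation coefficient and assessing ranking robustness with scalar stability metrics. The following is a minimal sketch of that general idea, not the authors' implementation: it ranks features by absolute Pearson correlation with the label and estimates stability as the mean pairwise Jaccard similarity of the top-k feature sets over bootstrap resamples. The choice of the Jaccard index, the bootstrap scheme, and all function names (`pearson`, `rank_features`, `jaccard`, `top_k_stability`) are assumptions made for illustration; the record does not state which stability metrics the paper uses.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def rank_features(X, y):
    """Return feature indices sorted by decreasing |Pearson correlation| with y."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def jaccard(a, b):
    """Jaccard index of two non-empty collections of feature indices."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def top_k_stability(X, y, k, n_resamples=20, seed=0):
    """Mean pairwise Jaccard similarity of top-k sets over bootstrap resamples.

    Assumed stability measure for illustration; requires k >= 1 and n_resamples >= 2.
    """
    rng = random.Random(seed)
    n = len(X)
    subsets = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        subsets.append(rank_features(Xb, yb)[:k])
    pairs = [(i, j) for i in range(n_resamples) for j in range(i + 1, n_resamples)]
    return sum(jaccard(subsets[i], subsets[j]) for i, j in pairs) / len(pairs)


if __name__ == "__main__":
    # Synthetic data (hypothetical): feature 0 tracks the label, features 1-4 are noise.
    rng = random.Random(42)
    y = [rng.randint(0, 1) for _ in range(100)]
    X = [[label + rng.gauss(0, 0.5)] + [rng.gauss(0, 1) for _ in range(4)] for label in y]
    print("ranking:", rank_features(X, y))
    print("top-2 stability:", top_k_stability(X, y, k=2))
```

A value near 1 means the same features are selected across resamples; a value near 0 means the selection changes substantially with the sample, which is the kind of instability the highlights warn about.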