A generic model-free feature screening procedure for ultra-high dimensional data with categorical response. (February 2023)
- Record Type:
- Journal Article
- Title:
- A generic model-free feature screening procedure for ultra-high dimensional data with categorical response. (February 2023)
- Main Title:
- A generic model-free feature screening procedure for ultra-high dimensional data with categorical response
- Authors:
- Cheng, Xuewei
Wang, Hong
- Abstract:
- Highlights: New R package, called MFSIS, for feature screening based on model-free procedures in ultra-high dimensional data. The novel methodology that can efficiently identify active features is introduced to wrestle with ultra-high dimensional categorical data. The proposed procedure is a versatile and robust feature screening method with a data-driven threshold selection via knockoff features.
Background and objective: Identifying active features from ultra-high dimensional data is one of the primary and vital tasks in statistical learning and biological discovery.
Methods: In this paper, we develop a generic concordance index screening (CI-SIS) procedure to wrestle with ultra-high dimensional data with categorical response. The proposed procedure is model-free and nonparametric, based on the concordance index measure. It enjoys both sure screening and ranking consistency properties under some relatively weak assumptions. We investigate the flexibility of this procedure by considering some commonly encountered challenging settings in biomedical studies, such as category-adaptive data and extremely unbalanced response distributions. A data-driven threshold selection procedure via knockoff features is also presented.
Results: On the real lung dataset, our method achieves a lower prediction error, with a mean error of 0.107 with linear discriminant analysis (LDA) and 0.117 with random forest (RF), respectively. In addition, we obtain an accuracy improvement of 3% with LDA and 5% with RF compared to the runner-up method. On the more challenging real SRBCT (small round blue cell tumours) dataset, CI-SIS brings about an amazing performance improvement, which is at least 8% higher than all other competing methods.
Conclusion: Experimental results show that the proposed method can efficiently identify genes that are associated with certain types of diseases. Therefore, the surviving features selected by our procedure (those remaining after irrelevant features are filtered out) can help doctors make precision diagnoses and refined treatments of patients.
- Is Part Of:
- Computer methods and programs in biomedicine. Volume 229(2023)
- Journal:
- Computer methods and programs in biomedicine
- Issue:
- Volume 229(2023)
- Issue Display:
- Volume 229, Issue 2023 (2023)
- Year:
- 2023
- Volume:
- 229
- Issue:
- 2023
- Issue Sort Value:
- 2023-0229-2023-0000
- Page Start:
- Page End:
- Publication Date:
- 2023-02
- Subjects:
- Concordance index -- Data-driven threshold -- High-dimensional categorical data -- Extremely unbalanced responses -- GAN-knockoff
Medicine -- Computer programs -- Periodicals
Biology -- Computer programs -- Periodicals
Computers -- Periodicals
Medicine -- Periodicals
Médecine -- Logiciels -- Périodiques
Biologie -- Logiciels -- Périodiques
Biology -- Computer programs
Medicine -- Computer programs
Periodicals
Electronic journals
610.28
- Journal URLs:
- http://www.sciencedirect.com/science/journal/01692607
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.cmpb.2022.107269
- Languages:
- English
- ISSNs:
- 0169-2607
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.095000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 25645.xml