Assessing reproducibility and veracity across machine learning techniques in biomedicine: A case study using TCGA data. (September 2020)
- Record Type:
- Journal Article
- Title:
- Assessing reproducibility and veracity across machine learning techniques in biomedicine: A case study using TCGA data. (September 2020)
- Main Title:
- Assessing reproducibility and veracity across machine learning techniques in biomedicine: A case study using TCGA data
- Authors:
- Kim, Ahyoung Amy
Rachid Zaim, Samir
Subbian, Vignesh
- Abstract:
- Background: Many studies that aim to identify gene biomarkers using statistical methods and translate them into FDA-approved drugs have faced challenges due to lack of clinical validity and methodological reproducibility. Since genomic data analysis relies heavily on these statistical learning tools more than before, it is vital to address the limitations of these computational techniques. Methods: Our study demonstrates these methodological gaps among most common statistical learning techniques used in gene expression analysis. To assess the classification ability and reproducibility of statistical learning tools for gene biomarker detection, six state-of-the-art machine learning models were trained on four different cancer data retrieved from The Cancer Genome Atlas (TCGA). Standard performance metrics including specificity, sensitivity, precision, and F1 score were evaluated to investigate the classification ability. For analysis of reproducibility, the identifiability of gene classifiers was examined by quantifying the consistency of the chosen classifier genes. Results: Among the six state-of-the-art machine learning methods, the random forest had the best classification ability overall. Very few genes were selected by multiple methods, which suggests poor identifiability and reproducibility of statistical learning methods for gene expression data. Our results demonstrated the challenges of reproducing discoveries from gene expression analysis due to the inherent differences that exist in statistical machine learning methods. Conclusion: Since statistical machine learning models can have large variations in high-dimensional settings such as analysis of gene expression data, transparent analysis procedures including data preprocessing, model parameterization, and evaluation and choice of interpretable models are required for clinical validity and utility.
- Is Part Of:
- International journal of medical informatics. Volume 141(2020)
- Journal:
- International journal of medical informatics
- Issue:
- Volume 141(2020)
- Issue Display:
- Volume 141, Issue 2020 (2020)
- Year:
- 2020
- Volume:
- 141
- Issue:
- 2020
- Issue Sort Value:
- 2020-0141-2020-0000
- Page Start:
- Page End:
- Publication Date:
- 2020-09
- Subjects:
- Reproducibility -- Classification -- Neoplasm -- Machine learning -- TCGA
Medical informatics -- Periodicals
Information science -- Periodicals
Computers -- Periodicals
Medical technology -- Periodicals
Medical Informatics -- Periodicals
Technology, Medical -- Periodicals
Computers
Information science
Medical informatics
Medical technology
Electronic journals
Periodicals
610.285
- Journal URLs:
- http://www.sciencedirect.com/science/journal/13865056
http://www.clinicalkey.com/dura/browse/journalIssue/13865056
http://www.clinicalkey.com.au/dura/browse/journalIssue/13865056
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.ijmedinf.2020.104148
- Languages:
- English
- ISSNs:
- 1386-5056
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4542.345250
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 14789.xml