Optimisation of cancer classification by machine learning generates an enriched list of candidate drug targets and biomarkers. (25th February 2020)
- Record Type:
- Journal Article
- Title:
- Optimisation of cancer classification by machine learning generates an enriched list of candidate drug targets and biomarkers. (25th February 2020)
- Main Title:
- Optimisation of cancer classification by machine learning generates an enriched list of candidate drug targets and biomarkers
- Authors:
- Ramroach, Sterling
Joshi, Ajay
John, Melford
- Abstract:
- A novel list of potential biomarkers was generated from RNA-seq expression data and used to optimise cancer classification.
The Cancer Genome Atlas has provided expression values of 18 015 genes for different cancer types. Studies on the classification of cancers by machine learning algorithms have used different data and methods, which makes it difficult to compare their performance. It is unclear which algorithm performs best and whether maximum levels of accuracy have been obtained. In this study, we aimed to optimise the diagnosis of cancer by comparing the performance of five algorithms using the same data, and by identifying the smallest possible number of differentiator genes. The accuracies of the five algorithms in classifying cancer type and primary site were determined using gene expression datasets of 5629 and 9144 samples, respectively. When trained with training sets ranging from 16 718 to 40 genes, Random Forest (RF), Gradient Boosting Machine (GBM), and Neural Network (NN) consistently achieved 100% or near 100% accuracy in the classification of both cancer type and primary site. Reduction of training sets to the 40 highest-ranked genes resulted in 78-fold and 45-fold faster processing times for RF and GBM, respectively. The olfactory receptor family, keratin associated proteins, and defensin beta family were among the highest-ranked genes. The ensemble and NN algorithms were the most accurate at distinguishing between cancer types and primary sites, whereas KNN was the fastest. Training sets can be reduced to the 40 highest-ranked differentiator genes without any significant loss of accuracy, and these genes include potential drug targets and biomarkers.
- Is Part Of:
- Molecular omics. Volume 16:Number 2(2020)
- Journal:
- Molecular omics
- Issue:
- Volume 16:Number 2(2020)
- Issue Display:
- Volume 16, Issue 2 (2020)
- Year:
- 2020
- Volume:
- 16
- Issue:
- 2
- Issue Sort Value:
- 2020-0016-0002-0000
- Page Start:
- 113
- Page End:
- 125
- Publication Date:
- 2020-02-25
- Subjects:
- Molecular biology -- Periodicals
Biochemistry -- Periodicals
Biological systems -- Periodicals
Molecular Biology
Computational Biology
Biochemistry
Biological systems
Molecular biology
Periodicals
Electronic journals
Periodicals
Fulltext
Internet Resources
Periodicals
- Journal URLs:
- http://www.rsc.org/journals-books-databases/about-journals/molecular-omics/
http://pubs.rsc.org/en/journals/journalissues/mo#!recentarticles&adv
http://www.rsc.org/
- DOI:
- 10.1039/c9mo00198k
- Languages:
- English
- ISSNs:
- 2515-4184
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 9838.212612
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 13856.xml