Comparison of merging strategies for building machine learning models on multiple independent gene expression data sets. (13th September 2021)
- Record Type:
- Journal Article
- Title:
- Comparison of merging strategies for building machine learning models on multiple independent gene expression data sets. (13th September 2021)
- Main Title:
- Comparison of merging strategies for building machine learning models on multiple independent gene expression data sets
- Authors:
- Krepel, Jessica
Kircher, Magdalena
Kohls, Moritz
Jung, Klaus
- Abstract:
- High‐dimensional gene expression data are regularly studied for their ability to separate different groups of samples by means of machine learning (ML) models. Meanwhile, a large number of such data are publicly available. Several approaches for meta‐analysis on independent sets of gene expression data have been proposed, mainly focusing on the step of feature selection, a typical step in fitting a ML model. Here, we compare different strategies of merging the information of such independent data sets to train a classifier model. Specifically, we compare the strategy of merging data sets directly (strategy A), and the strategy of merging the classification results (strategy B). We use simulations with pure artificial data as well as evaluations based on independent gene expression data from lung fibrosis studies to compare the two merging approaches. In the simulations, the number of studies, the strength of batch effects, and the separability are varied. The comparison incorporates five standard ML techniques typically used for high‐dimensional data, namely discriminant analysis, support vector machines, least absolute shrinkage and selection operator, random forest, and artificial neural networks. Using cross‐study validations, we found that direct data merging yields higher accuracies when having training data of three or four studies, and merging of classification results performed better when having only two training studies. In the evaluation with the lung fibrosis data, both strategies showed a similar performance.
- Is Part Of:
- Statistical analysis and data mining. Volume 15:Number 1(2022)
- Journal:
- Statistical analysis and data mining
- Issue:
- Volume 15:Number 1(2022)
- Issue Display:
- Volume 15, Issue 1 (2022)
- Year:
- 2022
- Volume:
- 15
- Issue:
- 1
- Issue Sort Value:
- 2022-0015-0001-0000
- Page Start:
- 112
- Page End:
- 124
- Publication Date:
- 2021-09-13
- Subjects:
- artificial neural networks -- data fusion -- discriminant analysis -- gene expression data -- high‐dimensional data -- LASSO -- random forest -- support vector machines
Data mining -- Statistical methods -- Periodicals
006.312
- Journal URLs:
- http://www3.interscience.wiley.com/journal/112701062/home
http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/sam.11549
- Languages:
- English
- ISSNs:
- 1932-1864
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 8447.424100
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 20328.xml