Speeding up the discovery of combinations of differentially expressed genes for disease prediction and classification. (March 2019)
- Record Type:
- Journal Article
- Title:
- Speeding up the discovery of combinations of differentially expressed genes for disease prediction and classification. (March 2019)
- Main Title:
- Speeding up the discovery of combinations of differentially expressed genes for disease prediction and classification
- Authors:
- Khamesipour, Alireza
Kagaris, Dimitri
- Abstract:
- Background and objective: Finding combinations (i.e., pairs, or more generally, q-tuples with q ≥ 2) of genes whose behavior as a group differs significantly between two classes has received a lot of attention in the quest for the discovery of simple, accurate, and easily interpretable decision rules for disease classification and prediction. For example, the Top Scoring Pair (TSP) method seeks to find pairs of genes so that the probability of the reversal of the relative ranking of the expression levels of the genes in the two classes is maximized. The computational cost of finding a q-tuple of genes that scores highest under a given metric is O(G^q), where G is the total number of genes. This cost is often problematic or prohibitive in practice (even for q = 2), as the number of genes G is often in the order of tens of thousands. Methods: In this paper, we show that this computational cost can be significantly reduced by excluding from consideration genes whose behavior is almost identical in the two classes and therefore their inclusion in any q-tuple is rather non-informative. Our criterion for the exclusion of genes is supported by a statistically robust metric, the Area Under the Curve (AUC) of the corresponding Receiver Operating Characteristic (ROC) curve. By filtering out genes whose AUC value is below a user-chosen threshold, as determined by a procedure that we describe in the paper, dramatic reductions in the run times are obtained while maintaining the same classification accuracy. Results: We have experimentally verified the gains of this approach on several case studies involving ovarian, colon, leukemia, breast and prostate cancers, and diffuse large B-cell lymphoma. Conclusions: The proposed method is not only faster (for example, we observed an average 78.65% reduction over the run time of TSP) while maintaining the same classification accuracy, but it can even result in better classification accuracy due to its inherent ability to avoid the so-called "pivot" (non-informative) genes that may intrude in q-tuples chosen otherwise.
- Is Part Of:
- Computer methods and programs in biomedicine. Volume 170(2019)
- Journal:
- Computer methods and programs in biomedicine
- Issue:
- Volume 170(2019)
- Issue Display:
- Volume 170, Issue 2019 (2019)
- Year:
- 2019
- Volume:
- 170
- Issue:
- 2019
- Issue Sort Value:
- 2019-0170-2019-0000
- Page Start:
- 69
- Page End:
- 80
- Publication Date:
- 2019-03
- Subjects:
- Microarray data analysis -- Gene expression -- Top Scoring Pair (TSP) -- Computational cost -- Cancer prediction
Medicine -- Computer programs -- Periodicals
Biology -- Computer programs -- Periodicals
Computers -- Periodicals
Medicine -- Periodicals
Médecine -- Logiciels -- Périodiques
Biologie -- Logiciels -- Périodiques
Biology -- Computer programs
Medicine -- Computer programs
Periodicals
Electronic journals
610.28
- Journal URLs:
- http://www.sciencedirect.com/science/journal/01692607
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.cmpb.2019.01.004
- Languages:
- English
- ISSNs:
- 0169-2607
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.095000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 9484.xml
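
The filtering idea summarized in the abstract (drop genes whose AUC is low, then search for the top-scoring pair among the remaining genes) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy data, the threshold of 0.8, and the choice to treat AUC as direction-free via max(AUC, 1 - AUC) are all assumptions made for illustration, and the paper's procedure for choosing the threshold is not reproduced here.

```python
from itertools import combinations

def gene_auc(values, labels):
    """AUC of one gene as a class-1 vs class-0 separator (ties count 1/2)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def filter_genes(expr, labels, threshold):
    """Keep indices of genes whose direction-free AUC reaches the threshold.

    Assumption: a gene with AUC near 0 separates the classes as well as one
    with AUC near 1, so max(AUC, 1 - AUC) is compared against the threshold.
    """
    kept = []
    for g, values in enumerate(expr):
        auc = gene_auc(values, labels)
        if max(auc, 1.0 - auc) >= threshold:
            kept.append(g)
    return kept

def tsp_score(expr, labels, i, j):
    """|P(X_i < X_j | class 0) - P(X_i < X_j | class 1)| from sample frequencies."""
    freq = {}
    for c in (0, 1):
        idx = [s for s, y in enumerate(labels) if y == c]
        freq[c] = sum(expr[i][s] < expr[j][s] for s in idx) / len(idx)
    return abs(freq[0] - freq[1])

def top_scoring_pair(expr, labels, genes):
    """Exhaustive pair search restricted to the given gene indices."""
    return max(combinations(genes, 2), key=lambda p: tsp_score(expr, labels, *p))

# Hypothetical toy data: rows are genes, columns are samples.
labels = [0, 0, 0, 1, 1, 1]
expr = [
    [1, 2, 3, 7, 8, 9],  # gene 0: higher in class 1
    [9, 8, 7, 3, 2, 1],  # gene 1: higher in class 0
    [5, 5, 5, 5, 5, 5],  # gene 2: identical in both classes
    [4, 6, 4, 6, 4, 6],  # gene 3: weakly informative
]

kept = filter_genes(expr, labels, threshold=0.8)
best = top_scoring_pair(expr, labels, kept)
print(kept, best)
```

In this toy example the filter keeps only genes 0 and 1, so the pair search examines one pair instead of six. With G genes the unfiltered search examines G(G-1)/2 pairs, which is why shrinking G before the search pays off. Note also that the constant gene 2, paired with gene 0, reaches the same score as the pair (0, 1) when no filter is applied; this is the kind of non-informative "pivot" gene the abstract says the filter avoids.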