MISSEL: a method to identify a large number of small species-specific genomic subsequences and its application to viruses classification. Issue 1 (December 2016)
- Record Type:
- Journal Article
- Title:
- MISSEL: a method to identify a large number of small species-specific genomic subsequences and its application to viruses classification. Issue 1 (December 2016)
- Main Title:
- MISSEL: a method to identify a large number of small species-specific genomic subsequences and its application to viruses classification
- Authors:
- Fiscon, Giulia
Weitschek, Emanuel
Cella, Eleonora
Lo Presti, Alessandra
Giovanetti, Marta
Babakir-Mina, Muhammed
Ciotti, Marco
Ciccozzi, Massimo
Pierangeli, Alessandra
Bertolazzi, Paola
Felici, Giovanni
- Abstract:
- Background: Continuous improvements in next generation sequencing technologies led to ever-increasing collections of genomic sequences, which have not been easily characterized by biologists, and whose analysis requires huge computational effort. The classification of species emerged as one of the main applications of DNA analysis and has been addressed with several approaches, e.g., multiple alignments-, phylogenetic trees-, statistical- and character-based methods.
Results: We propose a supervised method based on a genetic algorithm to identify small genomic subsequences that discriminate among different species. The method identifies multiple subsequences of bounded length with the same information power in a given genomic region. The algorithm has been successfully evaluated through its integration into a rule-based classification framework and applied to three different biological data sets: Influenza, Polyoma, and Rhino virus sequences.
Conclusions: We discover a large number of small subsequences that can be used to identify each virus type with high accuracy and low computational time, and moreover help to characterize different genomic regions. Bounding their length to 20, our method found 1164 characterizing subsequences for all the Influenza virus subtypes, 194 for all the Polyoma viruses, and 11 for Rhino viruses. The abundance of small separating subsequences extracted for each genomic region may be an important support for quick and robust virus identification. Finally, useful biological information can be derived by the relative location and abundance of such subsequences along the different regions.
- Is Part Of:
- Biodata mining. Volume 9:Issue 1(2016)
- Journal:
- Biodata mining
- Issue:
- Volume 9:Issue 1(2016)
- Issue Display:
- Volume 9, Issue 1 (2016)
- Year:
- 2016
- Volume:
- 9
- Issue:
- 1
- Issue Sort Value:
- 2016-0009-0001-0000
- Page Start:
- 1
- Page End:
- 24
- Publication Date:
- 2016-12
- Subjects:
- Classification of genomic sequences -- Genetic algorithms -- Supervised learning -- Extraction of multiple classification models
Bioinformatics -- Periodicals
Computational biology -- Periodicals
Data mining -- Periodicals
570.285
- Journal URLs:
- http://www.biodatamining.org/
http://link.springer.com/
- DOI:
- 10.1186/s13040-016-0116-2
- Languages:
- English
- ISSNs:
- 1756-0381
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 9976.xml