VariantSpark: population scale clustering of genotype information. (December 2015)
- Record Type:
- Journal Article
- Title:
- VariantSpark: population scale clustering of genotype information. (December 2015)
- Main Title:
- VariantSpark: population scale clustering of genotype information
- Authors:
- O'Brien, Aidan
Saunders, Neil
Guo, Yi
Buske, Fabian
Scott, Rodney
Bauer, Denis
- Abstract:
- Background: Genomic information is increasingly used in medical practice, giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. The widely used Hadoop MapReduce architecture and associated machine learning library, Mahout, provide the means for tackling computationally challenging tasks. However, many genomic analyses do not fit the MapReduce paradigm. We therefore utilise the recently developed Spark engine, along with its associated machine learning library, MLlib, which offers more flexibility in the parallelisation of population-scale bioinformatics tasks. The resulting tool, VariantSpark, provides an interface from MLlib to the standard variant format (VCF), offers seamless genome-wide sampling of variants and provides a pipeline for visualising results. Results: To demonstrate the capabilities of VariantSpark, we clustered more than 3,000 individuals with 80 million variants each to determine the population structure in the dataset. VariantSpark is 80% faster than the Spark-based genome clustering approach, ADAM, the comparable implementation using Hadoop/Mahout, as well as Admixture, a commonly used tool for determining individual ancestries. It is over 90% faster than traditional implementations using R and Python. Conclusion: The benefits of speed, resource consumption and scalability enable VariantSpark to open up the usage of advanced, efficient machine learning algorithms to genomic data.
- Is Part Of:
- BMC genomics. Volume 16:Number 1(2015)
- Journal:
- BMC genomics
- Issue:
- Volume 16:Number 1(2015)
- Issue Display:
- Volume 16, Issue 1 (2015)
- Year:
- 2015
- Volume:
- 16
- Issue:
- 1
- Issue Sort Value:
- 2015-0016-0001-0000
- Page Start:
- 1
- Page End:
- 9
- Publication Date:
- 2015-12
- Subjects:
- Genotype clustering -- SPARK -- BigData -- 1000 Genomes Project -- Personal Genome Project -- Population structure
Genomes -- Periodicals
Gene mapping -- Periodicals
Genomics -- Periodicals
Base Sequence -- Periodicals
Chromosome Mapping -- Periodicals
Genetic Techniques -- Periodicals
Sequence Analysis, DNA -- Periodicals
572.8605
- Journal URLs:
- http://www.biomedcentral.com/bmcgenomics/
http://www.pubmedcentral.nih.gov/tocrender.fcgi?journal=32
http://link.springer.com/
- DOI:
- 10.1186/s12864-015-2269-7
- Languages:
- English
- ISSNs:
- 1471-2164
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 9851.xml