Catwalk: identifying closely related sequences in large microbial sequence databases. Issue 6 (30th June 2022)
- Record Type:
- Journal Article
- Title:
- Catwalk: identifying closely related sequences in large microbial sequence databases. Issue 6 (30th June 2022)
- Main Title:
- Catwalk: identifying closely related sequences in large microbial sequence databases
- Authors:
- Volk, Denis
Yang-Turner, Fan
Didelot, Xavier
Crook, Derrick W.
Wyllie, David
- Abstract:
- There is a need to identify microbial sequences that may form part of transmission chains, or that may represent importations across national boundaries, amidst large numbers of SARS-CoV-2 and other bacterial or viral sequences. Reference-based compression is a sequence analysis technique that allows both compact storage of sequence data and comparisons between sequences. Published implementations of the approach are being challenged by the large sample collections now being generated. Our aim was to develop fast software for detecting highly similar sequences in large collections of microbial genomes, including millions of SARS-CoV-2 genomes. To do so, we developed Catwalk, a tool that bypasses bottlenecks in the generation, comparison and in-memory storage of microbial genomes generated by reference mapping. It is a compiled solution, coded in Nim to increase performance. It can be accessed via command line, REST API or web server interfaces. We tested Catwalk using both SARS-CoV-2 and Mycobacterium tuberculosis genomes generated by prospective public-health sequencing programmes. Pairwise sequence comparisons, using clinically relevant similarity cut-offs, took about 0.39 and 0.66 μs, respectively; in 1 s, between 1 and 2 million sequences can be searched. Catwalk operates about 1700 times faster than, and uses about 8 % of the RAM of, a Python reference-based compression and comparison tool in current use for outbreak detection. Catwalk can rapidly identify close relatives of a SARS-CoV-2 or M. tuberculosis genome amidst millions of samples.
- Is Part Of:
- Microbial genomics. Volume 8:Issue 6(2022)
- Journal:
- Microbial genomics
- Issue:
- Volume 8:Issue 6(2022)
- Issue Display:
- Volume 8, Issue 6 (2022)
- Year:
- 2022
- Volume:
- 8
- Issue:
- 6
- Issue Sort Value:
- 2022-0008-0006-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-06-30
- Subjects:
- bacterial genomics -- microbial relatedness -- outbreak detection
Microbial genomics -- Periodicals
572.8629
- Journal URLs:
- https://www.microbiologyresearch.org/content/journal/mgen
- DOI:
- 10.1099/mgen.0.000850
- Languages:
- English
- ISSNs:
- 2057-5858
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library HMNTS - ELD Digital store
- Ingest File:
- 22252.xml
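
The abstract describes reference-based compression as a way to store sequences compactly and to compare them. The following Python sketch illustrates the general idea only; it is not Catwalk's Nim implementation. The function names, the treatment of `N` as an uncertain base that is skipped, and the early exit at a cut-off are all assumptions made for illustration.

```python
from typing import Dict, Optional


def compress(reference: str, sample: str) -> Dict[int, str]:
    """Keep only the positions where a reference-mapped sample differs
    from the reference (hypothetical helper, not Catwalk's API)."""
    if len(reference) != len(sample):
        raise ValueError("sample must be mapped to the full reference length")
    return {
        pos: base
        for pos, (ref_base, base) in enumerate(zip(reference, sample))
        if base != ref_base
    }


def snp_distance(
    a: Dict[int, str], b: Dict[int, str], cutoff: int
) -> Optional[int]:
    """Count differing positions between two compressed samples.

    Only positions stored for at least one sample can differ, so the
    reference itself is never scanned. Positions where either sample
    holds 'N' are skipped (an assumption). Returns None as soon as the
    count exceeds the cut-off.
    """
    distance = 0
    for pos in a.keys() | b.keys():
        base_a = a.get(pos)  # None means "same as reference"
        base_b = b.get(pos)
        if base_a == "N" or base_b == "N":
            continue
        if base_a != base_b:
            distance += 1
            if distance > cutoff:
                return None
    return distance


reference = "ACGTACGTAC"
c1 = compress(reference, "ACGTTCGTAC")  # one variant at position 4
c2 = compress(reference, "ACGTTCGAAC")  # variants at positions 4 and 7
c3 = compress(reference, "TCGAACGTCC")  # variants at positions 0, 3 and 8

print(snp_distance(c1, c2, cutoff=5))
print(snp_distance(c1, c3, cutoff=2))
```

Because each sample is stored as a small set of differences, both memory use and comparison time scale with the number of variant positions rather than with genome length, which is the property the abstract relies on for searching large collections.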