NSAMD: A new approach to discover structured contiguous substrings in sequence datasets using Next-Symbol-Array. (October 2016)
- Record Type:
- Journal Article
- Title:
- NSAMD: A new approach to discover structured contiguous substrings in sequence datasets using Next-Symbol-Array. (October 2016)
- Main Title:
- NSAMD: A new approach to discover structured contiguous substrings in sequence datasets using Next-Symbol-Array
- Authors:
- Pari, Abdolvahed
Baraani, Ahmad
Parseh, Saeed
- Abstract:
- Highlights: We present NSAMD, a solution for extracting unknown structured motifs. A new data structure for indexing the dataset is presented. NSAMD uses about 99% less memory than Flame (the competing solution). NSAMD is about 51% faster than Flame in extracting structured motifs, but it is slower than Flame in finding simple motifs.
Abstract: In many sequence data mining applications, the goal is to find frequent substrings. Some of these applications, such as extracting motifs from protein and DNA sequences, look for frequently occurring approximate contiguous substrings called simple motifs. By approximate we mean that some mismatches are allowed when substrings are tested for similarity, which helps to discover unknown patterns. Structured motifs in DNA sequences are frequent structured contiguous substrings that contain two or more simple motifs. Existing work on finding simple motifs suffers from problems such as low scalability, high execution time, no guarantee of finding all patterns, and low flexibility in adapting to other applications. Flame is the only algorithm that can find all unknown structured patterns in a dataset, and it has solved most of these problems, but its scalability for very large sequences is still weak. In this research, a new approach named Next-Symbol-Array based Motif Discovery (NSAMD) is presented to improve scalability in extracting all unknown simple and structured patterns. To reach this goal, a new data structure called Next-Symbol-Array is presented. This data structure changes how NSAMD searches for patterns compared with Flame and helps it find structured motifs faster. The proposed algorithm is as accurate as Flame and extracts all existing patterns in the dataset. Performance comparisons show that NSAMD outperforms Flame in extracting structured motifs in both execution time (51% faster) and memory usage (more than 99% lower). The proposed algorithm is slower in extracting simple motifs, but the considerable improvement in memory usage (more than 99%) makes NSAMD more scalable than Flame. This advantage of NSAMD is very important in biological applications, which involve very large sequences.
- Is Part Of:
- Computational biology and chemistry. Volume 64(2016)
- Journal:
- Computational biology and chemistry
- Issue:
- Volume 64(2016)
- Issue Display:
- Volume 64, Issue 2016 (2016)
- Year:
- 2016
- Volume:
- 64
- Issue:
- 2016
- Issue Sort Value:
- 2016-0064-2016-0000
- Page Start:
- 384
- Page End:
- 395
- Publication Date:
- 2016-10
- Subjects:
- Data mining -- Motif -- String -- Substring -- Support
Chemistry -- Data processing -- Periodicals
Biology -- Data processing -- Periodicals
Biochemistry -- Data processing
Biology -- Data processing
Molecular biology -- Data processing
Periodicals
Electronic journals
542.85
- Journal URLs:
- http://www.sciencedirect.com/science/journal/14769271
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.compbiolchem.2016.09.001
- Languages:
- English
- ISSNs:
- 1476-9271
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3390.576700
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 840.xml