SpliceVec: Distributed feature representations for splice junction prediction. (June 2018)
- Record Type:
- Journal Article
- Title:
- SpliceVec: Distributed feature representations for splice junction prediction. (June 2018)
- Main Title:
- SpliceVec: Distributed feature representations for splice junction prediction
- Authors:
- Dutta, Aparajita
Dubey, Tushar
Singh, Kusum Kumari
Anand, Ashish
- Abstract:
- Highlights: A self-learned feature representation (SpliceVec) for splice sites is proposed. SpliceVec yields better accuracy than state-of-the-art model. The proposed model (SpliceVec-MLP) is simple and computationally efficient. SpliceVec is robust with reduced dataset and imbalanced classes. SpliceVec is invariant to both canonical and non-canonical splice junctions.
Abstract: Identification of intron boundaries, called splice junctions, is an important part of delineating gene structure and functions. This also provides valuable insights into the role of alternative splicing in increasing functional diversity of genes. Identification of splice junctions through RNA-seq is by mapping short reads to the reference genome which is prone to errors due to random sequence matches. This encourages identification of splicing junctions through computational methods based on machine learning. Existing models are dependent on feature extraction and selection for capturing splicing signals lying in the vicinity of splice junctions. But such manually extracted features are not exhaustive. We introduce distributed feature representation, SpliceVec, to avoid explicit and biased feature extraction generally adopted for such tasks. SpliceVec is based on two widely used distributed representation models in natural language processing. Learned feature representation in form of SpliceVec is fed to multilayer perceptron for splice junction classification task. An intrinsic evaluation of SpliceVec indicates that it is able to group true and false sites distinctly. Our study on optimal context to be considered for feature extraction indicates inclusion of entire intronic sequence to be better than flanking upstream and downstream region around splice junctions. Further, SpliceVec is invariant to canonical and non-canonical splice junction detection. The proposed model is consistent in its performance even with reduced dataset and class-imbalanced dataset. SpliceVec is computationally efficient and can be trained with user-defined data as well. …
- Is Part Of:
- Computational biology and chemistry. Volume 74(2018)
- Journal:
- Computational biology and chemistry
- Issue:
- Volume 74(2018)
- Issue Display:
- Volume 74, Issue 2018 (2018)
- Year:
- 2018
- Volume:
- 74
- Issue:
- 2018
- Issue Sort Value:
- 2018-0074-2018-0000
- Page Start:
- 434
- Page End:
- 441
- Publication Date:
- 2018-06
- Subjects:
- AS: alternative splicing -- CNN: convolutional neural network -- t-SNE: t-distributed stochastic neighbor embedding -- MLP: multilayer perceptron -- nt: nucleotides -- bp: base pairs -- NLP: natural language processing -- CBOW: continuous bag of words -- DBOW: distributed bag of words -- DM: distributed memory -- ReLU: rectified linear units -- LSVM: linear support vector machines
Splice junction -- Feature representation -- Alternative splicing -- Distributed representation
Chemistry -- Data processing -- Periodicals
Biology -- Data processing -- Periodicals
Biochemistry -- Data processing
Biology -- Data processing
Molecular biology -- Data processing
Periodicals
Electronic journals
542.85
- Journal URLs:
- http://www.sciencedirect.com/science/journal/14769271
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.compbiolchem.2018.03.009
- Languages:
- English
- ISSNs:
- 1476-9271
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3390.576700
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 13023.xml
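
Note on the Abstract field: the abstract says SpliceVec builds on distributed representation models from natural language processing, and the subject keywords list CBOW, DBOW and DM, which suggests word2vec- and doc2vec-style embeddings. Such models expect text split into "words", so a nucleotide sequence has to be tokenized first. The sketch below shows one common way to do that, using overlapping k-mers. It is an illustration only: the record does not state how the authors tokenized sequences, and the function name `kmer_tokens`, the parameters `k` and `stride`, and the example sequence are all hypothetical.

```python
from typing import List


def kmer_tokens(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a nucleotide sequence into k-mers that act as "words".

    Assumption: k-mer tokenization with a fixed stride; this is not
    taken from the catalogued article.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive integers")
    seq = sequence.upper()
    # Slide a window of width k along the sequence, moving `stride` bases each step.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Hypothetical sequence fragment, used only to show the call.
example_tokens = kmer_tokens("GTAAGTCAG", k=3)
```

The resulting token lists could then be passed to an embedding model, and the learned vectors to a multilayer perceptron classifier, as the abstract outlines. A sequence shorter than `k` yields an empty list rather than an error.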