SPRoBERTa: protein embedding learning with local fragment modeling. Issue 6 (22nd September 2022)
- Record Type:
- Journal Article
- Title:
- SPRoBERTa: protein embedding learning with local fragment modeling. Issue 6 (22nd September 2022)
- Main Title:
- SPRoBERTa: protein embedding learning with local fragment modeling
- Authors:
- Wu, Lijun
Yin, Chengcan
Zhu, Jinhua
Wu, Zhen
He, Liang
Xia, Yingce
Xie, Shufang
Qin, Tao
Liu, Tie-Yan
- Abstract:
- A good understanding of protein function and structure in computational biology helps in understanding human beings. Because only a limited number of proteins are annotated structurally and functionally, the scientific community has embraced self-supervised pre-training on large amounts of unlabeled protein sequences for protein embedding learning. However, a protein is usually represented by individual amino acids with a limited vocabulary size (e.g. 20 amino acid types), without considering the strong local semantics present in protein sequences. In this work, we propose a novel pre-training approach, SPRoBERTa. We first present an unsupervised protein tokenizer to learn protein representations with local fragment patterns. Then, a novel framework for deep pre-training models is introduced to learn protein embeddings. After pre-training, our method can be easily fine-tuned for different protein tasks, including amino acid-level prediction tasks (e.g. secondary structure prediction), amino acid pair-level prediction tasks (e.g. contact prediction) and protein-level prediction tasks (remote homology prediction, protein function prediction). Experiments show that our approach achieves significant improvements in all tasks and outperforms previous methods. We also provide detailed ablation studies and analysis of our protein tokenizer and training framework.
- Is Part Of:
- Briefings in bioinformatics. Volume 23:Issue 6(2022)
- Journal:
- Briefings in bioinformatics
- Issue:
- Volume 23:Issue 6(2022)
- Issue Display:
- Volume 23, Issue 6 (2022)
- Year:
- 2022
- Volume:
- 23
- Issue:
- 6
- Issue Sort Value:
- 2022-0023-0006-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-09-22
- Subjects:
- local fragment representation -- protein tokenizer -- protein pre-training
Genetics -- Data processing -- Periodicals
Molecular biology -- Data processing -- Periodicals
Genomes -- Data processing -- Periodicals
572.80285
- Journal URLs:
- http://bib.oxfordjournals.org
http://www.oxfordjournals.org/content?genre=journal&issn=1477-4054
http://ukcatalogue.oup.com/
http://firstsearch.oclc.org
- DOI:
- 10.1093/bib/bbac401
- Languages:
- English
- ISSNs:
- 1467-5463
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 2283.958363
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 24767.xml