Convolutional ProteinUnetLM competitive with long short‐term memory‐based protein secondary structure predictors. Issue 5 (5th December 2022)
- Record Type:
- Journal Article
- Title:
- Convolutional ProteinUnetLM competitive with long short‐term memory‐based protein secondary structure predictors. Issue 5 (5th December 2022)
- Main Title:
- Convolutional ProteinUnetLM competitive with long short‐term memory‐based protein secondary structure predictors
- Authors:
- Kotowski, Krzysztof
Fabian, Piotr
Roterman, Irena
Stapor, Katarzyna
- Abstract:
- The protein secondary structure (SS) prediction plays an important role in the characterization of general protein structure and function. In recent years, a new generation of algorithms for SS prediction based on embeddings from protein language models (pLMs) is emerging. These algorithms reach state‐of‐the‐art accuracy without the need for time‐consuming multiple sequence alignment (MSA) calculations. Long short‐term memory (LSTM)‐based SPOT‐1D‐LM and NetSurfP‐3.0 are the latest examples of such predictors. We present the ProteinUnetLM model using a convolutional Attention U‐Net architecture that provides prediction quality and inference times at least as good as the best LSTM‐based models for 8‐class SS prediction (SS8). Additionally, we address the issue of the heavily imbalanced nature of the SS8 problem by extending the loss function with the Matthews correlation coefficient, and by proper assessment using previously introduced adjusted geometric mean (AGM) metric. ProteinUnetLM achieved better AGM and sequence overlap score than LSTM‐based predictors, especially for the rare structures 3₁₀‐helix (G), beta‐bridge (B), and high curvature loop (S). It is also competitive on challenging datasets without homologs, free‐modeling targets, and chameleon sequences. Moreover, ProteinUnetLM outperformed its previous MSA‐based version ProteinUnet2, and provided better AGM than AlphaFold2 for 1/3 of proteins from the CASP14 dataset, proving its potential for making a significant step forward in the domain. To facilitate the usage of our solution by protein scientists, we provide an easy‐to‐use web interface under https://biolib.com/SUT/ProteinUnetLM/.
- Is Part Of:
- Proteins. Volume 91:Issue 5(2023)
- Journal:
- Proteins
- Issue:
- Volume 91:Issue 5(2023)
- Issue Display:
- Volume 91, Issue 5 (2023)
- Year:
- 2023
- Volume:
- 91
- Issue:
- 5
- Issue Sort Value:
- 2023-0091-0005-0000
- Page Start:
- 608
- Page End:
- 618
- Publication Date:
- 2022-12-05
- Subjects:
- attention -- deep learning -- language models -- long short‐term memory -- secondary structure -- U‐Net
Proteins -- Periodicals
572.6
- Journal URLs:
- http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/prot.26452
- Languages:
- English
- ISSNs:
- 0887-3585
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 6936.164000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 26773.xml