Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets. Issue 5 (7th May 2021)
- Record Type:
- Journal Article
- Title:
- Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets. Issue 5 (7th May 2021)
- Main Title:
- Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets
- Authors:
- Dallago, Christian
Schütze, Konstantin
Heinzinger, Michael
Olenyi, Tobias
Littmann, Maria
Lu, Amy X.
Yang, Kevin K.
Min, Seonwoo
Yoon, Sungroh
Morton, James T.
Rost, Burkhard
- Abstract:
- Models from machine learning (ML) or artificial intelligence (AI) increasingly assist in guiding experimental design and decision making in molecular biology and medicine. Recently, Language Models (LMs) have been adapted from Natural Language Processing (NLP) to encode the implicit language written in protein sequences. Protein LMs show enormous potential in generating descriptive representations (embeddings) for proteins from just their sequences, in a fraction of the time with respect to previous approaches, yet with comparable or improved predictive ability. Researchers have trained a variety of protein LMs that are likely to illuminate different angles of the protein language. By leveraging the bio_embeddings pipeline and modules, simple and reproducible workflows can be laid out to generate protein embeddings and rich visualizations. Embeddings can then be leveraged as input features through machine learning libraries to develop methods predicting particular aspects of protein function and structure. Beyond the workflows included here, embeddings have been leveraged as proxies to traditional homology‐based inference and even to align similar protein sequences. A wealth of possibilities remain for researchers to harness through the tools provided in the following protocols. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC. The following protocols are included in this manuscript: Basic Protocol 1: Generic use of the bio_embeddings pipeline to plot protein sequences and annotations; Basic Protocol 2: Generate embeddings from protein sequences using the bio_embeddings pipeline; Basic Protocol 3: Overlay sequence annotations onto a protein space visualization; Basic Protocol 4: Train a machine learning classifier on protein embeddings; Alternate Protocol 1: Generate 3D instead of 2D visualizations; Alternate Protocol 2: Visualize protein solubility instead of protein subcellular localization; Support Protocol: Join embedding generation and sequence space visualization in a pipeline
- Is Part Of:
- Current protocols. Volume 1:Issue 5(2021)
- Journal:
- Current protocols
- Issue:
- Volume 1:Issue 5(2021)
- Issue Display:
- Volume 1, Issue 5 (2021)
- Year:
- 2021
- Volume:
- 1
- Issue:
- 5
- Issue Sort Value:
- 2021-0001-0005-0000
- Page Start:
- n/a
- Page End:
- n/a
- Publication Date:
- 2021-05-07
- Subjects:
- deep learning embeddings -- machine learning -- protein annotation pipeline -- protein representations -- protein visualization
Life sciences -- Laboratory manuals -- Periodicals
Biology -- Laboratory manuals -- Periodicals
Life sciences -- Technique -- Periodicals
Biology -- Technique -- Periodicals
570.028
- Journal URLs:
- https://currentprotocols.onlinelibrary.wiley.com/journal/26911299
http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/cpz1.113
- Languages:
- English
- ISSNs:
- 2691-1299
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 18233.xml