Deep protein representations enable recombinant protein expression prediction. (December 2021)
- Record Type:
- Journal Article
- Title:
- Deep protein representations enable recombinant protein expression prediction. (December 2021)
- Main Title:
- Deep protein representations enable recombinant protein expression prediction
- Authors:
- Martiny, Hannah-Marie
Armenteros, Jose Juan Almagro
Johansen, Alexander Rosenberg
Salomon, Jesper
Nielsen, Henrik
- Abstract:
- A crucial process in the production of industrial enzymes is recombinant gene expression, which aims to induce enzyme overexpression of the genes in a host microbe. Current approaches for securing overexpression rely on molecular tools such as adjusting the recombinant expression vector, adjusting cultivation conditions, or performing codon optimizations. However, such strategies are time-consuming, and an alternative strategy would be to select genes for better compatibility with the recombinant host. Several methods for predicting soluble expression are available; however, they are all optimized for the expression host Escherichia coli and do not consider the possibility of an expressed protein not being soluble. We show that these tools are not suited for predicting expression potential in the industrially important host Bacillus subtilis. Instead, we build a B. subtilis-specific machine learning model for expressibility prediction. Given millions of unlabelled proteins and a small labeled dataset, we can successfully train such a predictive model. The unlabeled proteins provide a performance boost relative to using amino acid frequencies of the labeled proteins as input. On average, we obtain a modest performance of 0.64 area-under-the-curve (AUC) and 0.2 Matthews correlation coefficient (MCC). However, we find that this is sufficient for the prioritization of expression candidates for high-throughput studies. Moreover, the predicted class probabilities are correlated with expression levels. A number of features related to protein expression, including base frequencies and solubility, are captured by the model.
Highlights: A language model can create protein sequence representations useful for predicting recombinant expression in B. subtilis. Existing E. coli solubility predictors used as indicators of successful expression are not suited for other organisms. Universal protein embeddings do not have a species bias and can be used for other protein classification tasks.
- Is Part Of:
- Computational biology and chemistry. Volume 95(2021)
- Journal:
- Computational biology and chemistry
- Issue:
- Volume 95(2021)
- Issue Display:
- Volume 95, Issue 2021 (2021)
- Year:
- 2021
- Volume:
- 95
- Issue:
- 2021
- Issue Sort Value:
- 2021-0095-2021-0000
- Page Start:
- Page End:
- Publication Date:
- 2021-12
- Subjects:
- Chemistry -- Data processing -- Periodicals
Biology -- Data processing -- Periodicals
Biochemistry -- Data processing
Biology -- Data processing
Molecular biology -- Data processing
Periodicals
Electronic journals
542.85
- Journal URLs:
- http://www.sciencedirect.com/science/journal/14769271
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.compbiolchem.2021.107596
- Languages:
- English
- ISSNs:
- 1476-9271
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3390.576700
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 25255.xml