Protein stability prediction by fine-tuning a protein language model on a mega-scale dataset

Simon K. S. Chu; Kush Narang; Justin B. Siegel

doi:10.1371/journal.pcbi.1012248

Abstract

Protein stability plays a crucial role in a variety of applications, such as food processing, therapeutics, and the identification of pathogenic mutations. Engineering campaigns commonly seek to improve protein stability, and there is a strong interest in streamlining these processes to enable rapid optimization of highly stabilized proteins with fewer iterations. In this work, we explore utilizing a mega-scale dataset to develop a protein language model optimized for stability prediction. ESM_therm is trained on the folding stability of 528k natural and de novo sequences derived from 461 protein domains and can accommodate deletions, insertions, and multiple-point mutations. We show that a protein language model can be fine-tuned to predict folding stability. ESM_therm performs reasonably on small protein domains and generalizes to sequences distal from the training set. Lastly, we discuss our model’s limitations compared to other state-of-the-art methods in generalizing to larger protein scaffolds. Our results highlight the need for large-scale stability measurements on a diverse dataset that mirrors the distribution of sequence lengths commonly observed in nature.

Author summary

Research in Professor Justin Siegel’s lab focuses on discovering and engineering enzyme catalysis. His work follows a design-build-test cycle, integrating computational protein modeling with wet-lab experiments. Key areas of his research include de novo enzyme design, enzyme therapeutics for celiac disease, and applications in food and renewable energy. Additionally, his lab has developed the Design2Data program, a multi-year, multi-campus effort to curate a high-quality dataset of enzymatic activity and stability for beta-glucosidase.

Under the supervision of Professor Justin Siegel, I am engaged in molecular modeling and machine learning in protein engineering. I have a background in molecular dynamics simulations for protein and cell membrane permeability estimation and in using the Rosetta molecular modeling suite for protein structure modeling and enzyme-substrate interaction. Current research topics include the prediction of mutational effects on protein functions and protein language models for functional prediction and protein design.

Citation: Chu SKS, Narang K, Siegel JB (2024) Protein stability prediction by fine-tuning a protein language model on a mega-scale dataset. PLoS Comput Biol 20(7): e1012248. https://doi.org/10.1371/journal.pcbi.1012248

Editor: Piero Fariselli, Universita degli Studi di Torino, ITALY

Received: December 11, 2023; Accepted: June 13, 2024; Published: July 22, 2024

Copyright: © 2024 Chu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The code and materials are maintained on https://github.com/SimonKitSangChu/EsmTherm. Mutant-level predictions from our model and benchmark evaluation of state-of-the-art are available under Supplementary Materials.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Protein stability is one of the foundations of protein engineering to design resilient proteins for industrial processes and therapeutic manufacturing [1–3]. Beyond protein engineering, destabilizing mutations are associated with pathogenicity, and stability predictors can help identify pathogenic mutations across the human proteome [4–7]. Molecular modeling methods, including Rosetta [8, 9], FoldX [10], and molecular dynamics simulations [11], have been shown to predict the impact of mutation on protein stability. More recently, the use of machine learning models grounded in biophysical features and evolutionary statistics [12–20] has offered an alternative approach to stability and function prediction without the need for computationally intensive molecular modeling simulations. Fueled by the latest advances in deep learning, convolutional neural networks (CNNs) [21] and graph neural networks (GNNs) [22] are now being adopted to predict mutational impacts on stability by operating directly on the input protein structure [23, 24]. For example, RaSP is a CNN-based model trained on top of Rosetta [25], while ELASPIC-2, another stability predictor, operates on both a sequence embedding from ESM and a structural embedding from a GNN [26–28].

Despite these advancements, the lack of a consistent and universal dataset remains an obstacle. While merging smaller datasets into a more comprehensive collection, such as ProTherm [29], ProtaBank [30] and ThermoMutDB [31], is a feasible approach, combined datasets often consist of closely related but distinct quantities accompanied by additional discrepancies in experimental conditions. While deep mutagenesis scanning (DMS) offers profound insights, these studies typically focus on a single protein target, limiting the broader applicability of the derived data and models subsequently trained on these datasets. In light of these challenges, Tsuboyama et al. introduced a mega-scale thermostability dataset, encompassing 776k short protein sequences derived from 479 small protein domains, all consistently evaluated using the same assay [32].

Utilizing this dataset, we fine-tuned a protein language model (pLM), named ESM_therm, from ESM-2 [33] to act as an end-to-end stability predictor. We observe that ESM_therm performs comparably with state-of-the-art models and generalizes to small protein sequences distal to those of the training set. We also demonstrate that training on an ensemble of protein domains, instead of mutagenesis studies of a single domain, improves the performance of the fine-tuned protein language model for folding stability prediction. Lastly, we discuss the limitations of ESM_therm and compare it to other state-of-the-art methods in the ability to generalize to longer protein sequences.

Results

Evaluating model generalizability on test-set-only domains

Protein stability prediction can be assessed on different scales of generalizability. Although machine learning algorithms are often trained and tested on different sets of non-overlapping samples, the definition of overlap is ambiguous in protein sequences. For example, assigning two point mutants from the same WW domain, one to the training set and another to the test set, can assess the generalizability of the model to sequences sharing the same protein domain. However, it fails to evaluate the generalizability of the model to a domain different from those in the training set, such as an SH2 domain. To benchmark our model on both scales, our test set sequences consist of two parts. The first part is formed by protein domains also found in the training set, whereas the second part consists of protein domains found exclusively in the test set, denoted as test-set-only domains. We assess the model performance by Spearman’s R, and its capability to generalize to these test-set-only sequences by the highest sequence identity to any domain in the training set. Given that domains are classified according to the wildtype definitions by Tsuboyama et al. [32], it is possible for domains exclusive to the test set to still share considerable sequence identity with those in the training set. This setup allows for an assessment of generalizability across varying degrees of sequence identity. The dataset-splitting scheme is illustrated in Fig 1 and further detailed in Methods and Materials.

Fig 1. Dataset splitting scheme.

Protein domains are first identified by their wildtype sequences and split into train-validation-test (green) and test-set-only partitions (cyan). Mutants are then randomly assigned to either training, validation and test sets or test set only according to their respective wildtype.

https://doi.org/10.1371/journal.pcbi.1012248.g001

ESM_therm generalizes reasonably well to 47 test-set-only protein domains, as illustrated in Fig 2. The Spearman’s R evaluated on individual domains ranges from 0.2 to 0.9, except for the uncharacterized bacterial protein yahO (PDB code: 2MA4) [34]. Among all test-set-only domains, the SH3-subunit of chicken alpha spectrin (PDB code: 6SCW) [35] has the highest sequence identity of 95.8% and scores a corresponding Spearman’s R of 0.88. Moving to test-set-only domains with lower sequence identity, our model scores worse on the Homo sapiens J-domain protein HSJ1a (PDB code: 2LGW) [36] at 59% identity but still retains a Spearman’s R of 0.52.

Fig 2. Spearman’s R on test-set-only protein domains.

Natural protein domains are labeled in blue and de novo domains are in orange. The x-axis is the highest sequence identity from the evaluated protein domain to those in the training set. In the case where no sequence alignment was found, 0% is assigned. The y-axis is the Spearman’s R evaluated on all sequences from the corresponding domain. We highlighted some of the test-set-only AlphaFold2 models in cyan, and when possible, overlay them with the training-set protein domains of the highest sequence identity in green.

https://doi.org/10.1371/journal.pcbi.1012248.g002

In the 13 cases where no alignment with the training set sequences passes e-value < 10⁻³, ESM_therm is capable of generalizing to both natural and de novo proteins. No training sequence can be aligned to the Escherichia coli DNA-binding arginine repressor (PDB code: 1AOY) [37], and yet its Spearman’s R is 0.69. For de novo designs, we highlight two protein domains from the Baker Lab. The αββα domain (HEEH_KT_rd6_0790) is a mini-protein from high-throughput computational design with Rosetta [38], whereas the trRosetta-hallucinated structure (r11_233_TrROS_Hall) was sampled with iterative sequence refinement to improve the confidence in the prediction of the residue-residue distance map [39]. Spearman’s R on these domains is 0.44 and 0.72, respectively.
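The per-domain evaluation described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function names and the `(domain, predicted, measured)` record layout are assumptions made for the example, and ties are handled with average ranks.

```python
from collections import defaultdict


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's R is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_per_domain(records):
    """records: iterable of (domain, predicted_dG, measured_dG) tuples (hypothetical layout)."""
    grouped = defaultdict(lambda: ([], []))
    for domain, predicted, measured in records:
        grouped[domain][0].append(predicted)
        grouped[domain][1].append(measured)
    return {d: spearman(p, m) for d, (p, m) in grouped.items()}
```

Scoring each domain separately, rather than pooling all sequences, keeps a large domain from dominating the metric and lets each score be plotted against that domain's sequence identity to the training set.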

Improving stability prediction by learning all domains collectively

Prior to the work by Tsuboyama et al. [32], DMS was often restricted to a single protein of interest. In the case where the target of interest is not thoroughly mapped, site-saturated mutagenesis studies from a homologous sequence(s) might provide insights into selecting the best mutation for the specific function of interest. However, direct cross-comparison between proteins is often complicated by the difference in measured quantities and experimental conditions between functional assays. This inconsistency makes it difficult to highlight the benefits of learning from multiple target proteins collectively in a systematic manner.

The mega-scale dataset addresses this difficulty by measuring folding stability across multiple protein domains in a uniform experimental condition, and it helps us compare two paradigms, i.e. transfer learning from homologous sequences and learning from all domains collectively. To contrast these approaches, we assess the generalizability of the model fine-tuned on these paradigms on test-set-only protein domains.

Extrapolating to test-set-only domains clearly benefits from learning all domains collectively. Collective training improves Spearman’s R by 0.16 on average (p-value = 6x10^-3), as illustrated in Fig 3. CdnL protein (PDB code: 2LQK) [40] cannot be aligned with any training sequence and instead was matched with its closest structural alignment (PDB code: 2BTT) [41] with Foldseek [42]. Collective training increased CdnL’s Spearman’s R from -0.25 to 0.65. Similarly, the amino-terminal domain of phage 434 repressor (PDB code: 1R69) [43] was matched by structural alignment to a redesigned protein G (PDB code: 1EM7) [44] with a TM-score of 0.23, and gained 0.74 in Spearman’s R, from -0.22 to 0.52. Among the domains with sequence alignment to the training set, the WW domain from APBB3 (PDB code: 2YSC) shares 47% identity with its training-set partner (PDB code: 1WR7) and yet still benefits from multi-domain training with an improvement of 0.32. In contrast, the uncharacterized yahO protein remains a difficult target. Compared to training on its closest training-set domain (PDB code: 1IGV), learning on multiple domains only improves the correlation from -0.35 to -0.14. Overall, these results highlight the benefits of a protein stability dataset on a diverse collection of protein domains for generalization to previously understudied targets.

Fig 3. Comparison between transfer learning from the closest protein domain in training set and training on all domains collectively.

(A) Schematic of the comparison. In the case of transfer learning, we match the test-set-only protein domain in cyan with the closest domain found in the training set in green. (B) Spearman’s R in the test-set-only protein domains (x-axis) by learning from all domains collectively and (y-axis) by learning from the closest training-set domain alone. Sample(s) located under the diagonal line indicate better performance by learning collectively. The closest training-set domains were identified primarily by sequence alignment using MMseqs2, then by structure alignment using Foldseek, or discarded when no match was found in either case. The color bar indicates the highest sequence identity to any training-set domain, and 0% was assigned when no sequence alignment was found. Statistical significance is assessed with Wilcoxon’s rank sum test (p-value = 6 x 10^-3).

https://doi.org/10.1371/journal.pcbi.1012248.g003
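The significance test named in the Fig 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the two sets of per-domain Spearman’s R values are treated as unpaired samples, the p-value uses the normal approximation, and no tie correction is applied. The function name `rank_sum_test` is hypothetical, and the original analysis may have used a library routine with different settings.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test via the normal approximation.

    Assumption: no tie correction is applied to the variance.
    Returns (z, p_value).
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda t: t[0])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1  # mean 1-based rank of the tied run
        i = j + 1

    n1, n2 = len(x), len(y)
    # Sum of the ranks that belong to the first sample.
    w = sum(r for r, (_, label) in zip(ranks, combined) if label == 0)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Here `x` would hold the per-domain correlations from one training paradigm and `y` those from the other; a negative `z` means the first sample tends to rank lower.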

Although the improvement brought by collective training highlights the benefits of a consistent large-scale dataset on folding stability, it is still unclear whether the improvement originates from the shared knowledge on folding stability across multiple domains or the sheer number of samples. The discrepancy in dataset size is significant as an individual domain only constitutes up to 7k sequences, less than 2% of the training set on the collection of protein domains.

In addition to extrapolating to test-set-only protein domains, we conducted a similar comparison of how training on a collection of protein domains affects interpolation on previously observed protein domains. Overall, performance on sequences from training-set domains is marginally improved by learning from a multi-domain dataset. As illustrated in S2 Fig, learning from an ensemble of protein domains weakly outperforms models trained on the same domain alone, by an average of 0.03 (p-value = 2x10^-2). However, the margin is slim: for 72% of the domains, Spearman’s R changes by no more than 0.1.

Comparison with existing models on larger proteins

Although natural proteins often span between 200 and 400 residues [45], ESM_therm is fine-tuned on sequences no longer than 72 residues in length. To explore its performance under this limitation, we benchmarked our model on seven stability-related datasets on larger proteins and compared our results with state-of-the-art methods covering different methodologies in Tables 1 and 2. These include Rosetta Cartesian ΔΔG for molecular modeling, MUPro for a support vector machine (SVM) on traditional sequence features, RaSP for a structure-based CNN, ELASPIC-2, which employs a machine-learning model based on both structure and sequence embeddings, and unsupervised prediction from ESM-2 [46].

Table 1. Comparison of Spearman’s R across methods on individual DMS datasets.

All evaluation is restricted to point mutations, except our pLM on the mega-scale dataset. We also report unsupervised prediction from pretrained ESM-2 to contrast with supervised approaches. While the mega-scale dataset from Tsuboyama et al. covers multiple protein domains [32], all other datasets studied only one target protein.

https://doi.org/10.1371/journal.pcbi.1012248.t001

Table 2. Overview of benchmarked DMS datasets on protein stability.

https://doi.org/10.1371/journal.pcbi.1012248.t002

We observe comparable performance in predicting the thermostability of test-set-only protein domains across all models except MUPro. Our pLM achieves a Spearman’s R of 0.65, compared to 0.64 from RaSP and ELASPIC-2, and 0.61 from Rosetta molecular modeling. MUPro finishes last, scoring 0.31. Drawing an interesting parallel between datasets, Huang et al. reported direct melting temperature measurements of beta-glucosidase active-site mutants (PDB code: 2JIE) manually selected based on biophysical knowledge [47, 48], while Romero et al. leveraged a log-enrichment value to gauge the stability of a similar beta-glucosidase (PDB code: 1GNX) in a site-saturated fashion [49]. The former closely resembles a smaller-scale study guided by domain knowledge, in contrast to the latter dataset, which leverages a parallelized assay. Despite an identical alpha-beta barrel scaffold and catalytic mechanism, and a shared sequence identity of 48%, most models achieve Spearman’s R above 0.4 on the Bgl3 dataset, and no method correlates with the BglB dataset. This highlights the potential impact of sampling and assay through a comparative setting.

Trained specifically on small protein domains, ESM_therm does not generalize to other datasets on larger protein sequences. In a collection of direct [50] and indirect [51–53] stability measurements, state-of-the-art methods outperform our pLM convincingly. Cartesian ddG in Rosetta achieves generalizability through molecular modeling with a correlation between 0.33 and 0.48. RaSP, in turn, is built on top of Cartesian ddG and dramatically speeds up the protocol with marginal setbacks in correlation. Overall, ELASPIC-2 ranks highest with a Spearman’s R of 0.42–0.58, while our pLM correlates with none of these datasets.

Another intriguing observation is the performance of unsupervised predictions from pLM. While ESM-2 is less capable of predicting stability changes within the mega-scale dataset, it excels in datasets where indirect stability measurements correlate with function. These include log2-enrichment value which characterizes how catalytic activity reacts to heat shock in Bgl3 dataset and the intracellular abundance of the protein in the acetyltransferase dataset. Conversely, ESM-2 has a comparably weaker performance for proteolysis folding stability in the mega-scale dataset and chemical stability in the Lipase EstA dataset, where the assays measure stability directly. We also highlight the impact of fine-tuning by benchmarking the unsupervised prediction from 35M-parameter ESM-2 against our fine-tuned ESM_therm of the same model size. Supervised prediction improves the correlation from 0.36 to 0.65.

Discussion

Although our model generalizes reasonably well to new small protein domains in the mega-scale thermostability dataset, it is substantially weaker on larger proteins. Studies have established a strong correlation between the parallelized assay and direct measurement of thermostability [54]. However, we cannot rule out that our language model is biased towards dataset-specific details, including experimental conditions and the sampling distribution of protein sequences. One hypothesis is that our pLM is biased toward shorter sequences, while geometric learning does not suffer from the same pitfall and already performs better in unsupervised prediction [55]. The protein domains on which we trained are limited to 40 to 72 amino acids in length, a stark contrast to the 177- to 501-residue-long sequences in our additional DMS benchmark. This might suggest that fine-tuned pLM stability predictors would benefit from a large-scale folding stability dataset on longer sequences.

While most methods can rank ΔΔG between mutants successfully, predicting ΔG is still challenging. Our predictions often suffer from an offset and/or scale differently when compared to the experimental ΔG of the test-set-only domains (S3 Fig) and other methods might share the same problem. For example, Rosetta Cartesian ΔΔG follows a different energy unit (Rosetta Energy Unit), and it might not be suitable to be compared directly to kcal mol^-1. However, the misalignment can be easily resolved by a simple linear regression between model prediction and experiment. Upon recalibration per protein domain, the root mean square error from our model improved from 1.34 to 0.83 and R² from -0.85 to 0.45, averaged across all test-set-only domains. For instance, our model scores a negative R² on DNA-binding arginine repressor before recalibration and improves to 0.47 after rescaling, while Spearman’s R remains the same at 0.69 regardless of any monotonic transformation (Fig 4).

Fig 4. Impact of recalibration.

(Left) Miscalibration between prediction and true value on the stability of DNA-binding arginine repressor. (Right) Recovered agreement between prediction and stability measurement through linear rescaling on the same set of data.

https://doi.org/10.1371/journal.pcbi.1012248.g004
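The per-domain recalibration discussed above amounts to an ordinary least-squares fit of experimental ΔG against predicted ΔG. The sketch below is a minimal illustration with hypothetical function names and toy numbers, not the authors' code. Like Fig 4, it fits and evaluates the line on the same set of data, so it relies on experimental labels being available for the domain.

```python
def fit_line(predicted, measured):
    """Least-squares slope and intercept mapping predictions onto measurements."""
    n = len(predicted)
    mp, mm = sum(predicted) / n, sum(measured) / n
    sxx = sum((p - mp) ** 2 for p in predicted)
    sxy = sum((p - mp) * (m - mm) for p, m in zip(predicted, measured))
    slope = sxy / sxx
    return slope, mm - slope * mp


def rmse(predicted, measured):
    n = len(predicted)
    return (sum((p - m) ** 2 for p, m in zip(predicted, measured)) / n) ** 0.5


def r_squared(predicted, measured):
    """Coefficient of determination; negative when worse than predicting the mean."""
    mm = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for p, m in zip(predicted, measured))
    ss_tot = sum((m - mm) ** 2 for m in measured)
    return 1 - ss_res / ss_tot


def recalibrate(predicted, measured):
    """Rescale predictions with the fitted line (one domain at a time)."""
    slope, intercept = fit_line(predicted, measured)
    return [slope * p + intercept for p in predicted]


# Toy example: predictions are perfectly ranked but offset and mis-scaled.
predicted = [0.0, 1.0, 2.0, 3.0]
measured = [1.0, 3.0, 5.0, 7.0]
rescaled = recalibrate(predicted, measured)
```

In the toy example, R² is negative before rescaling even though the ranking is perfect, which mirrors how a rank-based metric such as Spearman’s R can stay high while R² and RMSE are poor.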

Conclusion

In this work, we demonstrate that folding stability prediction is possible using a protein language model. Enabled by large-scale protein stability measurements, we fine-tuned ESM-2 on the absolute folding energy of small protein domains. This approach generalizes successfully to protein domains distal from the training set, showing the potential of transfer learning to reduce experimental burden. Furthermore, our result highlights the benefits of training collectively on all protein sequences instead of mutagenesis study on a single wildtype. Although its performance on larger protein scaffolds is lagging behind state-of-the-art, a folding stability dataset of larger proteins might be vital to improving the generalizability of ESM_therm.

Methods and materials

Protein language model and fine-tuning protocol

ESM-2 is a transformer pLM pre-trained with a masked-language-model (MLM) objective on UniRef50. We fine-tuned the model on a whole-sequence regression task with a classification head on the starting token. All parameters were trainable during fine-tuning, and we used a local batch size of 128 and a global batch size of 2048. We trained the model on an A100 GPU at half precision with a patience of 500 steps. We report all test-set-only evaluations on the checkpoint with the best performance on the validation set.
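Two pieces of this protocol, the batch-size relationship and patience-based checkpoint selection, can be sketched without any deep-learning dependency. The names below (`micro_batches_per_update`, `EarlyStopper`) are hypothetical, and two points are assumptions because the text does not specify them: that validation performance is tracked as a loss where lower is better, and that the factor between local and global batch size is realized by gradient accumulation, multiple devices, or both.

```python
def micro_batches_per_update(global_batch_size, local_batch_size):
    """Number of local batches that make up one global batch.

    Assumption: how this factor is split between devices and gradient
    accumulation steps is not stated in the text.
    """
    if global_batch_size % local_batch_size != 0:
        raise ValueError("global batch size must be a multiple of the local batch size")
    return global_batch_size // local_batch_size


class EarlyStopper:
    """Track the best validation result and stop after `patience_steps` without improvement."""

    def __init__(self, patience_steps=500):
        self.patience_steps = patience_steps
        self.best_loss = float("inf")
        self.best_step = None

    def update(self, step, val_loss):
        """Record a validation result; return True when training should stop."""
        if val_loss < self.best_loss:
            # New best: this is the checkpoint that would be kept for evaluation.
            self.best_loss = val_loss
            self.best_step = step
            return False
        return step - self.best_step >= self.patience_steps
```

With the values quoted above, `micro_batches_per_update(2048, 128)` gives 16 local batches per optimizer update.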

We performed hyperparameter selection on model size (8, 35, 150, and 650 million parameters) (S1 Table), and selected the 35-million-parameter model to balance prediction performance and compute speed. In addition, we performed an ablation study on pretraining. The pretrained model clearly outperforms the randomly initialized one (S1 Fig).

Dataset construction

Tsuboyama et al. reported 1.8M folding stability measurements derived from 542 protein domains by cDNA display proteolysis [32]. We aggregated measurement(s) with an identical protein sequence, regardless of their DNA sequence(s), into a single entry. In cases where the same protein sequence was encoded by more than one unique DNA sequence, we evaluated the standard deviation of ΔG and log K₅₀. We removed measurements when the standard deviation of ΔG was greater than 2 kcal mol^-1 or that of log K₅₀ was greater than 0.5, and we kept only domains with at least 100 measurements by protein sequence. This reduced the number of entries from 851,552 protein sequences under their original criteria (K50_dG_Dataset1_Dataset2.csv) to 527,785 protein sequences across 258 natural and 203 de novo protein domains.
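The filtering steps above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the row keys (`domain`, `protein_seq`, `dG`, `log_k50`) are hypothetical, and the sketch assumes that replicate measurements are averaged, that the population standard deviation is used (so a single measurement has zero spread), and that grouping is done per domain and protein sequence. None of these three choices is stated in the text.

```python
from collections import defaultdict
from statistics import mean, pstdev


def aggregate_and_filter(rows, max_sd_dg=2.0, max_sd_logk50=0.5, min_per_domain=100):
    """rows: iterable of dicts with keys domain, protein_seq, dG, log_k50 (hypothetical)."""
    # 1. Group measurements that share a protein sequence, ignoring the DNA sequence.
    by_sequence = defaultdict(list)
    for row in rows:
        by_sequence[(row["domain"], row["protein_seq"])].append(row)

    # 2. Drop inconsistent groups; average the rest (averaging is an assumption).
    entries = []
    for (domain, seq), group in by_sequence.items():
        dgs = [r["dG"] for r in group]
        log_k50s = [r["log_k50"] for r in group]
        if pstdev(dgs) > max_sd_dg or pstdev(log_k50s) > max_sd_logk50:
            continue
        entries.append({"domain": domain, "protein_seq": seq,
                        "dG": mean(dgs), "log_k50": mean(log_k50s)})

    # 3. Keep only domains that retain enough distinct protein sequences.
    counts = defaultdict(int)
    for entry in entries:
        counts[entry["domain"]] += 1
    return [e for e in entries if counts[e["domain"]] >= min_per_domain]
```

The thresholds default to the values quoted above (2 kcal mol^-1, 0.5, and 100) and are exposed as parameters so the effect of each cut can be inspected separately.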

Because the dataset is hierarchical, comprising multiple domains that each hold a collection of mutants, model generalizability has two layers. The first is the ability of the model to generalize to mutants of training protein domains, and the second is its ability to generalize to test-set-only domains. To evaluate the model on both training and test-set-only domains, we split our dataset into train, validation, and test sets by domains, as illustrated in Fig 1. 10% of all domains, defined by wildtype by the authors, are randomly drawn, and all of their mutants are assigned to the test set. For the remaining domains, mutants are randomly assigned to train-validation-test sets in an 80-10-10 ratio.
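The two-level split can be sketched as below. This is a minimal illustration under assumptions: the function name, the dictionary layout, the rounding of the 10% hold-out, and the seed are all hypothetical. For clarity, the sketch keeps the held-out domains under a separate `test_set_only` key, whereas the text describes them as part of the test set.

```python
import random


def split_by_domain(mutants_by_domain, held_out_fraction=0.10,
                    ratios=(0.8, 0.1, 0.1), seed=0):
    """mutants_by_domain: dict mapping a wildtype domain name to its mutant sequences."""
    rng = random.Random(seed)
    domains = sorted(mutants_by_domain)
    rng.shuffle(domains)

    # Level 1: hold out whole domains so none of their mutants is seen in training.
    n_held_out = round(held_out_fraction * len(domains))
    held_out = set(domains[:n_held_out])

    split = {"train": [], "validation": [], "test": [], "test_set_only": []}
    for domain in domains:
        mutants = list(mutants_by_domain[domain])
        if domain in held_out:
            split["test_set_only"].extend(mutants)
            continue
        # Level 2: split the mutants of each remaining domain 80-10-10.
        rng.shuffle(mutants)
        n_train = int(ratios[0] * len(mutants))
        n_val = int(ratios[1] * len(mutants))
        split["train"].extend(mutants[:n_train])
        split["validation"].extend(mutants[n_train:n_train + n_val])
        split["test"].extend(mutants[n_train + n_val:])
    return split, held_out
```

Holding out at the domain level first is what makes the second generalization layer measurable: every mutant of a held-out wildtype is unseen during training.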

Sequence and structural alignment

We implemented sequence clustering and alignment through MMseqs2 [56]. For clustering, we clustered the domain wild-type sequences using a strategy similar to that used in constructing the Uniclust database. We dropped prefiltering for all-to-all pairwise alignment. For Foldseek, we searched for the structural identity based on AlphaFold structures from Tsuboyama et al. [32]. Unless otherwise specified, we used the default parameters in both MMseqs2 and Foldseek. The implementation details of alignment can be found in src/esmtherm/alignment in the GitHub repository.

Matching test-set-only domains

We fine-tuned ESM-2 (esm2_t12_35M_UR50D) on each of the 416 protein domains in the training set as our independently learned models. We first matched each test-set-only domain to its closest partner in the training set by the highest sequence identity using MMseqs2. In the case where no sequence alignment was identified, we matched the test-set-only domain by the highest structural identity using Foldseek. In the case where neither was identified, the test-set-only domain was not compared. Pairwise comparisons of interpolation and extrapolation were performed with Wilcoxon’s rank sum test.
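The matching cascade described above (sequence first, structure as a fallback, otherwise skip) can be sketched as a small lookup. This is a minimal illustration: the function name and the hit-table layout are hypothetical, and the tables stand in for parsed MMseqs2 and Foldseek results rather than calling either tool.

```python
def match_closest_domain(domain, sequence_hits, structure_hits):
    """Return (partner, source, score) for a test-set-only domain, or None.

    sequence_hits / structure_hits: dicts mapping a test-set-only domain to a
    list of (training_domain, score) pairs, where a higher score means closer.
    """
    for source, hits in (("sequence", sequence_hits), ("structure", structure_hits)):
        candidates = hits.get(domain, [])
        if candidates:
            partner, score = max(candidates, key=lambda hit: hit[1])
            return partner, source, score
    # No sequence or structural match: the domain is left out of the comparison.
    return None
```

Checking sequence hits before structural hits means a structural match is only consulted when no sequence alignment exists, which follows the order given in the text.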

Benchmark protein dataset selection

Given the intensive computing resources required to benchmark Rosetta, we limited ourselves to six DMS datasets on direct and indirect stability measurements from ProteinGym [57, 58], and another independent mutational dataset (BglB) from Huang et al. [47] to cover a range of assays. Nutschel et al. reported the thermostability (ΔT₅₀) of Bacillus subtilis Lipase A [50], whereas Dandage et al. reported chemical stability on Gentamicin 3-N-acetyltransferase [51]. Contrary to direct stability measurements, PTEN and Methyltransferase datasets correlate with stability through enhancement or depreciation of intracellular abundance as an indirect indicator [52, 53]. The pair of BglB and Bgl3 datasets was chosen for a comparative study on the impact of sampling and measurement assays. Bgl3 from Romero et al. and BglB datasets [47, 49] share homologous beta-glucosidase sequences but differ in log enrichment value and melting temperature (T_m) as indirect and direct thermostability measurements.

Supporting information

S1 Fig. Ablation of pretraining measured in Spearman’s R.

Each sample is a collection of mutants from a test-set-only domain. The x-axis is Spearman’s R of a test-set-only domain with pretraining. The y-axis is that from randomly initialized model. The color bar on the right represents the closest sequence identity in the train and validation set domains. The statistical assessment was performed using Wilcoxon’s rank sum test.

https://doi.org/10.1371/journal.pcbi.1012248.s001

(TIF)

S2 Fig. Comparison between learning from the same protein domain only and training on all domains collectively.

(A) Schematic of the comparison. (B) Spearman’s R on test mutants whose protein domains are also present in the training set. The x-axis represents learning from all domains collectively and the y-axis is learning from the same protein domain alone. Domain(s) located under the diagonal line indicate better performance when learning collectively. Statistical significance is performed with Wilcoxon’s rank sum test.

https://doi.org/10.1371/journal.pcbi.1012248.s002

(TIF)

S3 Fig. Offset in ΔG prediction on wildtype sequences.

The x-axis is the ΔG prediction from ESM_therm and the y-axis is the experimental ΔG label. The Spearman’s R across all wildtypes in test-set-only protein domains is 0.39.

https://doi.org/10.1371/journal.pcbi.1012248.s003

(TIF)

S1 Table. Performance evaluation on different model sizes on test set.

Metrics are evaluated on each individual domain, and then aggregated into mean and standard deviation over all domains. All models have similar performance metrics with esm2_t12_35M_UR50D except esm2_t6_8M_UR50D on Spearman’s R (p-value < 5x10^-2).

https://doi.org/10.1371/journal.pcbi.1012248.s004

(PDF)

S1 Spreadsheet. Performance evaluation per protein domain.

https://doi.org/10.1371/journal.pcbi.1012248.s005

(CSV)

S2 Spreadsheet. Model prediction per protein sequence.

https://doi.org/10.1371/journal.pcbi.1012248.s006

(CSV)

References

1. Lv Y., Zheng S., Goldenzweig A., Liu F., Gao Y., Yang X., Kandale A., McGeary R.P., Williams S.J., Kobe B., Schembri M.A., Landsberg M.J., Wu B., Brück T.B., Sieber V., Bodén M., Rao Z., Fleishman S.J., Schenk G., Guddat L.W. Enhancing the Thermal and Kinetic Stability of Ketol-Acid Reductoisomerase, a Central Catalyst of a Cell-Free Enzyme Cascade for the Manufacture of Platform Chemicals. Applied Biosciences.
2. Rennison A., Winther J.R., Varrone C. Rational Protein Engineering to Increase the Activity and Stability of IsPETase Using the PROSS Algorithm. Polymers, 13. pmid:34833182
3. Hutchinson M., Ruffolo J.A., Haskins N., Iannotti M., Vozza G., Pham T., Mehzabeen N., Shandilya H., Rickert K., Croasdale-Wood R., Damschroder M., Fu Y., Dippel A., Gray J.J., Kaplan G. (2023). Enhancement of antibody thermostability and affinity by computational design in the absence of antigen. bioRxiv.
4. Gerasimavicius L., Liu X., Marsh J.A. Identification of pathogenic missense mutations using protein stability predictors. Scientific Reports. 2020; 10. pmid:32958805
5. Cheng J., Novati G., Pan J., Bycroft C., Žemgulytė A., Applebaum T., et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science. 2023; 381. pmid:37733863
6. Stein A., Fowler D.M., Hartmann-Petersen R., Lindorff-Larsen K. Biophysical and Mechanistic Models for Disease-Causing Protein Variants. Trends in biochemical sciences, 2019; 44 7, 575–588. pmid:30712981
7. Yue P., Li Z., Moult J. Loss of protein structure stability as a major causative factor in monogenic disease. Journal of molecular biology, 2005; 353 2, 459–73. pmid:16169011
8. Kellogg EH, Leaver-Fay A, Baker D. Role of conformational sampling in computing mutation-induced changes in protein structure and stability. Proteins: Structure, Function and Bioinformatics. 2011;79(3):830–838. pmid:21287615
9. Park H, Bradley P, Greisen P, Liu Y, Mulligan VK, Kim DE, et al. Simultaneous Optimization of Biomolecular Energy Functions on Features from Small Molecules and Macromolecules. Journal of Chemical Theory and Computation. 2016;12(12):6201–6212. pmid:27766851
10. Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. The FoldX web server: An online force field. Nucleic Acids Research. 2005;33(SUPPL. 2):382–388. pmid:15980494
11. Wilson CJ, Chang M, Karttunen M, Choy WY. Keap1 cancer mutants: A large-scale molecular dynamics study of protein stability. International Journal of Molecular Sciences. 2021;22(10). pmid:34065616
12. Dehouck Y, Kwasigroch JM, Gilis D, Rooman M. PoPMuSiC 2.1: A web server for the estimation of protein stability changes upon mutation and sequence optimality. BMC Bioinformatics. 2011;12. pmid:21569468
13. Cao H, Wang J, He L, Qi Y, Zhang JZ. DeepDDG: Predicting the Stability Change of Protein Point Mutations Using Neural Networks. Journal of Chemical Information and Modeling. 2019;59(4):1508–1514. pmid:30759982
14. Witvliet DK, Strokach A, Giraldo-Forero AF, Teyra J, Colak R, Kim PM. ELASPIC web-server: Proteome-wide structure-based prediction of mutation effects on protein stability and binding affinity. Bioinformatics. 2016;32(10):1589–1591. pmid:26801957
15. Worth CL, Preissner R, Blundell TL. SDM—A server for predicting effects of mutations on protein stability and malfunction. Nucleic Acids Research. 2011;39(SUPPL. 2). pmid:21593128
16. Masso M, Vaisman II. AUTO-MUTE 2.0: A portable framework with enhanced capabilities for predicting protein functional consequences upon mutation. Advances in Bioinformatics. 2014;2014. pmid:25197272
17. Strokach A., Corbi-Verge C., Teyra J., Kim P.M. Predicting the Effect of Mutations on Protein Folding and Protein-Protein Interactions. Methods in molecular biology, 2018; 1851, 1–17.
18. Strokach A., Corbi-Verge C., Kim P.M. Predicting changes in protein stability caused by mutation using sequence-and structure-based methods in a CAGI5 blind challenge. Human Mutation, 40, 1414–1423. pmid:31243847
19. Cheng J., Randall A., Baldi P. Prediction of protein stability changes for single-site mutations using support vector machines. Proteins: Structure, Function, and Bioinformatics, 2006; 62(4), 1125–1132. pmid:16372356
20. Huang L., Gromiha M.M., Ho S. iPTREE-STAB: interpretable decision tree based method for predicting protein stability changes upon mutations. Bioinformatics, 23 10, 1292–3. pmid:17379687
21. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-Based Learning Applied to Document Recognition; 1998.
22. Kipf TN, Welling M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv. 2016.
23. Wang S, Tang H, Shan P, Wu Z, Zuo L. ProS-GNN: Predicting effects of mutations on protein stability using graph neural networks. Computational Biology and Chemistry. 2023;107. pmid:37643501
24. Chu SKS, Siegel J. Predicting single-point mutational effect on protein stability; 2021.
25. Blaabjerg LM, Kassem MM, Good LL, Jonsson N, Cagiada M, Johansson KE, et al. Rapid protein stability prediction using deep learning representations. eLife. 2023;12. pmid:37184062
26. Strokach A, Lu TY, Kim PM. ELASPIC2 (EL2): Combining Contextualized Language Models and Graph Neural Networks to Predict Effects of Mutations. Journal of Molecular Biology. 2021;433(11). pmid:33450251
27. Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv. 2019;118(15):e2016239118.
28. Strokach A, Becerra D, Corbi-Verge C, Perez-Riba A, Kim PM. Fast and Flexible Protein Design Using Deep Graph Neural Networks. Cell Systems. 2020;11(4):402–411. pmid:32971019
29. Gromiha MM, An J, Kono H, Oobatake M, Uedaira H, Prabakaran P, et al. ProTherm, version 2.0: thermodynamic database for proteins and mutants; 2000. 1. Available from: http://www.rtc.riken.go.jp/protherm.html.
30. Wang CY, Chang PM, Ary ML, Allen BD, Chica RA, Mayo SL, et al. ProtaBank: A repository for protein design and engineering data. Protein Science. 2018;27(6):1113–1124. pmid:29575358
31. Xavier J.S., Nguyen T., Karmarkar M., Portelli S., Rezende P.M., Velloso J.P., et al. ThermoMutDB: a thermodynamic database for missense mutations. Nucleic Acids Research. 2020; 49, D475–D479.
32. Tsuboyama K, Dauparas J, Chen J, Laine E, Mohseni Behbahani Y, Weinstein JJ, et al. Mega-scale experimental analysis of protein folding stability in biology and design. Nature. 2023;620(7973):434–444. pmid:37468638
33. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379:1123–1130. pmid:36927031
34. Eletsky A, Michalska K, Houliston S, Zhang Q, Daily MD, Xu X, et al. Structural and Functional Characterization of DUF1471 Domains of Salmonella Proteins SrfN, YdgH/SssB, and YahO. PLoS ONE. 2014;9:e101787. pmid:25010333
35. Grohe K, Patel S, Hebrank C, Medina S, Klein A, Rovó P, et al. Protein Motional Details Revealed by Complementary Structural Biology Techniques. Structure. 2020;28(9):1024–1034. pmid:32579946
36. Gao XC, Zhou CJ, Zhou ZR, Wu M, Cao CY, Hu HY. The C-terminal helices of heat shock protein 70 are essential for J-domain binding and ATPase activation. Journal of Biological Chemistry. 2012;287(8):6044–6052. pmid:22219199
37. Sunnerhagen M, Nilges M, Otting G, Carey J. Solution structure of the DNA-binding domain and model for the complex of multifunctional hexameric arginine repressor with DNA; 1997. Available from: http://www.nature.com/nsmb.
38. Chevalier A., Silva DA., Rocklin G. et al. Massively parallel de novo protein design for targeted therapeutics. Nature, 2017; 550, 74–79. pmid:28953867
39. Anishchenko I, Chidyausiku TM, Ovchinnikov S, Pellock SJ, Baker D. De novo protein design by deep network hallucination. Nature. 2020; p. 547–552.
40. Gallego-García A, Mirassou Y, Elías-Arnanz M, Padmanabhan S, Jiménez MA. NMR structure note: N-terminal domain of Thermus thermophilus CdnL. Journal of Biomolecular NMR. 2012;53(4):355–363. pmid:22782235
41. Musi V, Birdsall B, Fernandez-Ballester G, Guerrini R, Salvatori S, Serrano L, et al. New approaches to high-throughput structure characterization of SH3 complexes: The example of Myosin-3 and Myosin-5 SH3 domains from S. cerevisiae. Protein Science. 2006;15:795–807. pmid:16600966
42. van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, et al. Fast and accurate protein structure search with Foldseek. Nature Biotechnology. 2023. pmid:37156916
43. Mondragón A, Subbiah S, Almo SC, Drottar M, Harrison SC. Structure of the Amino-terminal Domain of Phage 434 Repressor at 2.0 Å Resolution. Journal of Molecular Biology. 1989;205:189–200.
44. Strop P., Marinescu A., Mayo S.L. Structure of a protein G helix variant suggests the importance of helix propensity and helix dipole interactions in protein design. Protein Science, 2000; 9. pmid:10933505
45. Nevers Y, Glover NM, Dessimoz C, Lecompte O. Protein length distribution is remarkably uniform across the tree of life. Genome Biology. 2023;24(1). pmid:37291671
46. Meier, J., Rao, R., Verkuil, R., Liu, J., Sercu, T. and Rives, A., 2021. Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in Neural Information Processing Systems, 34 (2021): 29287-29303.
47. Huang P, Chu SKS, Frizzo HN, Connolly MP, Caster RW, Siegel JB. Evaluating Protein Engineering Thermostability Prediction Tools Using an Independently Generated Dataset. ACS Omega. 2020;5(12):6487–6493. pmid:32258884
48. Isorna P, Polaina J, Latorre-García L, Cañada FJ, González B, Sanz-Aparicio J. Crystal Structures of Paenibacillus polymyxa β-Glucosidase B Complexes Reveal the Molecular Basis of Substrate Specificity and Give New Insights into the Catalytic Machinery of Family I Glycosidases. Journal of Molecular Biology. 2007;371(5):1204–1218. pmid:17585934
49. Romero PA, Tran TM, Abate AR. Dissecting enzyme function with microfluidic-based deep mutational scanning. Proceedings of the National Academy of Sciences. 2015;112(23):7159–7164. pmid:26040002
50. Nutschel C, Fulton A, Zimmermann O, Schwaneberg U, Jaeger KE, Gohlke H. Systematically Scrutinizing the Impact of Substitution Sites on Thermostability and Detergent Tolerance for Bacillus subtilis Lipase A. Journal of Chemical Information and Modeling. 2020;60:1568–1584. pmid:31905288
51. Dandage R, Pandey R, Jayaraj G, Rai M, Berger D, Chakraborty K. Differential strengths of molecular determinants guide environment specific mutational fates. PLOS Genetics. 2018;14:e1007419. pmid:29813059
52. Matreyek KA, Stephany JJ, Ahler E, Fowler DM. Integrating thousands of PTEN variant activity and abundance measurements reveals variant subgroups and new dominant negatives in cancers. Genome Medicine. 2021;13:165. pmid:34649609
53. Matreyek KA, Starita LM, Stephany JJ, Martin B, Chiasson MA, Gray VE, et al. Multiplex assessment of protein variant abundance by massively parallel sequencing. Nature Genetics. 2018;50(6):874–882. pmid:29785012
54. Rocklin GJ, Chidyausiku TM, Goreshnik I, Ford A, Houliston S, Lemak A, et al. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science. 2017;357(6347):168–175. pmid:28706065
55. Paul, S., Kollasch, A., Notin, P., Marks, D. Combining Structure and Sequence for Superior Fitness Prediction. NeurIPS 2023 Generative AI and Biology (GenBio) Workshop.
56. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology. 2017;35(11):1026–1028. pmid:29035372
57. Notin P, Dias M, Frazer J, Hurtado JM, Gomez AN, Marks D, et al. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. In: International Conference on Machine Learning. PMLR; 2022. p. 16990–17017.
58. Notin P., Kollasch A.W., Ritter D., van Niekerk L., Paul S., Spinner H., et al. ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction. bioRxiv, 2023. pmid:38106144

