Large-scale design and refinement of stable proteins using sequence-only models

doi:10.1371/journal.pone.0265020

Fig 1.

Overview of design and refinement.

Proteins are designed either by an expert using Rosetta or dTERMen software, or by a neural network model that translates secondary-structure sequences into primary sequences. The designs are then refined to maximize stability through an iterative procedure. At each step, a second neural network model predicts the stability of every possible single-site substitution. The mutants with the highest predicted stability are saved and used as seeds for the next round of optimization.

Fig 2.

Evaluator model and performance.

(A) Architecture of the Evaluator Model (EM). (1) Input: one-hot encoding of the protein’s primary sequence. (2) Three convolutional layers; the first flattens the one-hot encoding to a single dimension, and successive filters span longer windows of sequence. (3) Three dense layers yield trypsin and chymotrypsin stability scores (4). The final stability score (5) is the minimum of the two. (6) A separate dense layer from the final convolution layer yields a one-hot encoding of the protein’s secondary structure. (B) Success of EM predictions on a library of new designs. We used the EM to predict the stability of 45,840 new protein sequences that the model had not seen before (described later as “Corpus B”); the distribution of predictions is shown in pink. The blue curve shows the fraction of these designs that were empirically stable (stability score >1.0) as a function of the model’s a priori stability predictions (dotted black line: stability threshold for predicted stability). 281 outliers (predicted stability score <-1.0 or >3.0) are excluded for clarity. (C) Predicted versus observed stability scores for the same data, with outliers included.
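
The input and output ends of panel A can be illustrated with a minimal sketch. It covers only the one-hot encoding (1), the minimum of the two protease scores (5), and the >1.0 stability threshold from panel B; the convolutional and dense layers are not reproduced. The 20-letter alphabet, its ordering, and the function names are assumptions made for illustration, not details taken from the paper.

```python
# Assumed alphabet: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(seq):
    """Encode a primary sequence as a list of 20-element 0/1 rows, one row per residue."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1
        rows.append(row)
    return rows


def final_stability_score(trypsin_score, chymotrypsin_score):
    """Final score is the minimum of the two protease-specific scores (step 5)."""
    return min(trypsin_score, chymotrypsin_score)


def is_stable(score, threshold=1.0):
    """A design counts as stable when its stability score exceeds 1.0."""
    return score > threshold
```

Taking the minimum means a design is only scored as stable if it resists both proteases; a high trypsin score cannot compensate for a low chymotrypsin score.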

Table 1.

Evaluator model performance on natural proteins.

Fig 3.

Refinement and its effects.

(A) Beam search refinement. Refinement begins with a protein’s amino acid sequence (left, green). All possible single-site substitutions are generated (bold red characters in middle sequences) and sorted according to the EM’s prediction of their stability (middle). The design with the highest predicted stability (middle, green) is reserved as the product of refinement at this stage. The k single-site substitutions with the highest predicted stability (middle, green and yellow; k = 2 in this illustration, though we used k = 50 to stabilize proteins) are then used as new bases. The process is repeated for each of the k new bases, and the single-site substitutions of all k new bases are combined in a new sorted list (right). In this fashion, we predicted the best mutants carrying one to five amino acid substitutions for each of the base designs. (B) Effect of guided and random substitutions on expert-designed proteins. Guided substitutions (orange) raised the mean stability score from 0.23 in the base population (green) to 1.27 after five amino acid changes, whereas random substitutions (blue) lowered it to -0.06. Because the stability score is logarithmic, the increase in stability is more than ten-fold after five guided substitutions. Annotated black bars indicate means, notches indicate bootstrapped 95% confidence intervals around the medians, boxes indicate upper and lower quartiles, and whiskers indicate 1.5 times the inter-quartile range.
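
The beam search in panel A can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `predict_stability` is a hypothetical stand-in for the EM, the function names are invented here, and the filter that keeps only candidates differing from the base at exactly as many positions as the current round is an assumption added so that round n yields n substitutions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def single_site_substitutions(seq):
    """Every sequence that differs from seq at exactly one position."""
    return [
        seq[:i] + aa + seq[i + 1:]
        for i in range(len(seq))
        for aa in AMINO_ACIDS
        if aa != seq[i]
    ]


def beam_refine(base, predict_stability, k=50, rounds=5):
    """Return [(best_sequence, predicted_score), ...], one entry per round.

    predict_stability is a hypothetical callable standing in for the EM.
    """
    results = []
    beam = [base]
    for n in range(1, rounds + 1):
        candidates = {}
        for parent in beam:
            for mutant in single_site_substitutions(parent):
                # Assumption: keep only mutants exactly n substitutions from the base,
                # which excludes reversions and repeated changes at one site.
                if hamming(mutant, base) == n and mutant not in candidates:
                    candidates[mutant] = predict_stability(mutant)
        if not candidates:
            break
        # Combine the mutants of all parents in one list, highest prediction first.
        ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
        results.append(ranked[0])                 # reserved product of this round
        beam = [seq for seq, _ in ranked[:k]]     # k new bases for the next round
    return results
```

With a toy scorer such as `lambda s: float(s.count("W"))`, calling `beam_refine("AAAA", scorer, k=2, rounds=3)` returns one best mutant per round, each with one more substitution than the last. A wider beam (the caption reports k = 50) keeps more near-best parents alive, so a substitution that is only moderately good on its own can still be combined with later ones.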

Fig 4.

Generator model and its performance.

(A) Architecture of the Generator Model (GM), adapted from [45] for use with protein secondary and primary sequences. (B) Density plot of experimental stability scores for training designs, designs from the GM, and scrambles of the GM designs. (C) Density plot of trypsin EC₅₀ values. (D) Density plot of chymotrypsin EC₅₀ values.

Fig 5.

Refinement of GM designs, overall and as a function of novelty.

(A) Effect of guided and random substitutions on designs created by the GM. The base stability score was much higher for this population of designs than for the expert-designed proteins tested, with a mean of 0.67; EM-guided refinement further increased it to 1.67. As with the expert-designed proteins, this demonstrates a ten-fold increase in stability. Random substitutions again had a deleterious effect, dropping mean stability to 0.29. (B) Stability of GM designs, and guided and random substitutions within those designs, as novelty increases. We consider designs to be more novel when BLAST percent identity with the most-similar design in the training corpus is lower.
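
The fold-change statements in the captions of Fig 3B and Fig 5A follow from the logarithmic scale. The sketch below assumes the stability score is a base-10 logarithm, which is consistent with a one-unit increase being described as ten-fold; the function name is illustrative.

```python
def fold_change(score_before, score_after, base=10.0):
    """Fold change in stability implied by a difference in log-scale stability scores.

    Assumes the score is a logarithm with the given base (base 10 by assumption).
    """
    return base ** (score_after - score_before)


# Mean stability scores quoted in the captions.
expert_guided = fold_change(0.23, 1.27)   # expert designs, guided: slightly above ten-fold
expert_random = fold_change(0.23, -0.06)  # expert designs, random: below one, i.e. a loss
gm_guided = fold_change(0.67, 1.67)       # GM designs, guided: ten-fold
gm_random = fold_change(0.67, 0.29)       # GM designs, random: below one, i.e. a loss
```

Under this assumption, the guided gain of 1.04 score units for expert designs is a little more than ten-fold, and the gain of 1.00 units for GM designs is ten-fold, matching the captions.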

Fig 6.

Differential effects on stability between guided and random single-site substitutions.

For each original amino acid (indexed on the y-axis) and each replacement amino acid (indexed on the x-axis), the mean effect on stability when that substitution was guided by the EM is computed, as is the mean effect on stability when that substitution was applied randomly. The difference between these two effects is plotted for each from-to pair that was represented in the data; redder circles indicate that guided substitutions were more beneficial for stability, bluer circles indicate that random substitutions were more beneficial. Circles with heavy black outlines showed a significant difference (two-sample unpaired two-tailed t-test, p < 0.05 uncorrected) between guided and random effects. Bar graphs indicate mean differences in stability score (guided substitutions minus random substitutions) averaged across all replacement amino acids for each original amino acid (left) and averaged across all original amino acids for each replacement amino acid (bottom).
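
The per-pair quantity plotted in this figure can be computed along these lines. This is a sketch under stated assumptions: each record is taken to be a tuple of (original amino acid, replacement amino acid, change in stability score), a pair is reported only when it occurs in both the guided and the random set, and the function names and toy values are invented for illustration. The t-test and the marginal bar graphs are not reproduced.

```python
from collections import defaultdict
from statistics import mean


def mean_effects(records):
    """Mean change in stability score for each (original, replacement) pair."""
    grouped = defaultdict(list)
    for original, replacement, delta in records:
        grouped[(original, replacement)].append(delta)
    return {pair: mean(deltas) for pair, deltas in grouped.items()}


def guided_minus_random(guided_records, random_records):
    """Difference of mean effects (guided minus random) for pairs present in both sets."""
    guided = mean_effects(guided_records)
    random_ = mean_effects(random_records)
    return {pair: guided[pair] - random_[pair] for pair in guided.keys() & random_.keys()}


# Toy values, not data from the paper.
guided_toy = [("A", "W", 1.0), ("A", "W", 0.5), ("K", "E", 0.25)]
random_toy = [("A", "W", -0.25), ("G", "P", -1.0)]
differences = guided_minus_random(guided_toy, random_toy)
```

A positive difference corresponds to a redder circle (guided substitutions more beneficial), a negative one to a bluer circle.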

Fig 7.

Laboratory analyses of GM proteins.

(A) Results of targeted analyses of twelve GM proteins. All twelve proteins had less than 60% identity with respect to the entire set of training proteins, as calculated by BLAST. Reported topology was predicted by PSIPRED [47] and Rosetta (in that order, when predictions differ). (B) “Life cycle” of one refined protein, nmt_0994_guided_02. The design began with a requested secondary structure fed into the GM. The GM produced a primary sequence (nmt_0994) stochastically translated from that secondary structure; however, the Reverse GM correctly predicted that two of the requested helices were actually merged into one in the generated protein’s structure. EM-guided refinement then changed two residues to tryptophan, which raised the empirical stability score from -0.18 to 1.88. Green characters highlight differences from original sequences. (C) Crystal structure for nmt_0994_guided_02 (dark grey), showing that it also has the three helices predicted by the Reverse GM for its pre-refinement progenitor. It is shown aligned to the structure predicted by AlphaFold2 (cyan). The prediction and the crystal structure have a Cα RMSD of 3.4 Å.

Table 2.

Crystallographic data collection and refinement statistics.
