Fig 1.
(A) Pipeline for directed evolution with machine learning using our proposed fitness prediction model and optimizer. New designs can start by running BADASS to select sequences with multiple mutations and high zero-shot scores from evolutionary models such as ESM2 for an initial screening. (B) Architecture of the semi-supervised Seq2Fitness model. (C) The optimizer approach, showing the initial transient phase and two cooling and heating cycles. Sequences are iteratively sampled from an updated probability distribution, with the sampling temperature reduced (cooled) as the average fitness score rises. After the set point is breached, cooling continues for a patience number of iterations, after which the temperature is increased (heated). The fitness score then decreases until it reaches the set point, at which point cooling resumes. The optimization process thus consists of multiple cooling and heating phases.
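The cooling/heating cycle in panel (C) can be sketched in a few lines. This is an illustrative simplification, not the paper's implementation: the function name, the multiplicative cooling/heating factors, and all default values are assumptions, and `score_fn(T)` stands in for sampling and scoring a batch of sequences at temperature `T`.

```python
def badass_temperature_schedule(score_fn, n_iters=140,
                                t_init=1.0, cool=0.95, heat=1.05,
                                set_hi=1.0, set_lo=0.0, patience=5):
    """Hypothetical sketch of the cooling/heating cycle described in Fig 1C.

    score_fn(T) stands in for sampling a batch at temperature T and
    returning its mean fitness score; all names and defaults are assumptions.
    Returns a list of (temperature, mean_score, phase) triples per iteration.
    """
    T, phase, wait = t_init, "cool", 0
    history = []
    for _ in range(n_iters):
        mean_score = score_fn(T)
        history.append((T, mean_score, phase))
        if phase == "cool":
            T *= cool                    # lower the sampling temperature
            if mean_score > set_hi:      # upper set point breached
                wait += 1
                if wait >= patience:     # keep cooling for `patience` iterations, then heat
                    phase, wait = "heat", 0
        else:
            T *= heat                    # heating: scores fall toward the lower set point
            if mean_score < set_lo:
                phase = "cool"           # lower set point reached: resume cooling
    return history
```

With a score function that increases as temperature drops, the schedule alternates between cooling and heating phases as described in the caption.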
Table 1.
Performance comparison of Seq2Fitness and alternative models.
Models were evaluated across different train/test splits, with performance measured as Spearman correlation for regression tasks and adjusted AUC for the classification task (NucB). Scores represent averages across the AAV, GB1, NucB, and AMY_BACSU datasets.
Fig 2.
Fitness score statistics for BADASS optimization of alpha-amylase (AMY_BACSU).
BADASS was run for 140 iterations with a batch size of 1,000. The plot shows the fitness score averaged over sampled sequences per iteration, with the shaded area representing scores within one standard deviation σ of the average, where σ is the standard deviation of batch scores. Horizontal lines denote the two set points that govern the transitions between cooling and heating phases (Fig 1C). Vertical lines mark iterations where phase transitions occur as a moving average of fitness scores crosses these thresholds. The optimization was performed using the unsupervised ESM2 model and the semi-supervised Seq2Fitness model. Fitness scores were standardized as described in the methods. Runs included either exactly 6 mutations per variant (A, C) or an even mix of 2 to 6 mutations (B, D).
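The phase transitions marked by the vertical lines can be detected mechanically: a moving average of per-iteration mean fitness crosses one of the two set points. The helper below is an illustrative sketch, not the paper's code; the function name, window size, and crossing convention are all assumptions.

```python
import statistics

def phase_transitions(scores, set_hi, set_lo, window=10):
    """Illustrative helper (not from the paper): mark iterations where a
    moving average of per-iteration mean fitness crosses either set point.

    scores: list of per-iteration mean fitness scores.
    Returns (iteration_index, label) pairs for each crossing.
    """
    transitions = []
    for i in range(window, len(scores) + 1):
        ma = statistics.fmean(scores[i - window:i])
        prev = statistics.fmean(scores[i - window - 1:i - 1]) if i > window else ma
        if prev <= set_hi < ma:          # moving average rose through the upper set point
            transitions.append((i - 1, "crossed upper set point"))
        elif prev >= set_lo > ma:        # moving average fell through the lower set point
            transitions.append((i - 1, "crossed lower set point"))
    return transitions
```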
Table 2.
Performance comparison between BADASS, EvoProtGrad, and GGS (using EvoProtGrad on the smoothed Seq2Fitness model) with the ESM2 and Seq2Fitness models. All approaches are given comparable GPU compute time for sampling; GGS requires an additional round to evaluate sequences with the original Seq2Fitness model. Metrics include the percentage of sequences better than wild type in the top 10,000 sequences found (or fewer when a method cannot find enough), the best, best 100th, and best 1,000th sequence scores, and the number of unique mutations and unique mutated sites present in the top 10,000 sequences. The number of mutations per sequence is k. As benchmarks, the reference alpha-amylase sequence has an ESM2 score of 0.0 and a Seq2Fitness score of 0.8. BADASS was run for 200 iterations with a batch size of 520 sequences. Missing entries for EvoProtGrad (using T=0.1 for ESM2) are due to the generation of a limited number of unique sequences (on the order of hundreds), as the sampler becomes overly concentrated on a small number of mutations.
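The summary metrics in this table are straightforward to compute from a pool of scored sequences. The sketch below is illustrative only, under assumed inputs (a dict mapping sequence to fitness score and a known wild-type score); the function and key names are not from the paper.

```python
def summarize_top_sequences(scores, wt_score, k=10_000):
    """Illustrative computation of Table 2-style summary metrics.

    scores: dict mapping sequence -> fitness score (assumed input format).
    wt_score: fitness score of the wild-type reference sequence.
    Considers at most the top-k sequences by score.
    """
    top = sorted(scores.values(), reverse=True)[:k]
    return {
        "pct_better_than_wt": 100.0 * sum(s > wt_score for s in top) / len(top),
        "best": top[0],
        "best_100th": top[99] if len(top) >= 100 else None,
        "best_1000th": top[999] if len(top) >= 1000 else None,
    }
```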
Table 3.
Performance comparison between BADASS, EvoProtGrad, and GGS using ESM2 and Seq2Fitness models. As a benchmark, the reference NucB sequence has an ESM2 score of 0.0 and a Seq2Fitness score of -0.677. BADASS was run for 200 iterations with a batch size of 520 sequences.
Fig 3.
Order parameters versus temperature for the amylase tasks:
mean score and variance of scores versus temperature after the initial transient, i.e., at steady-state oscillatory BADASS behavior. Markers come from BADASS runs, and lines are fits using Eqs 5–7 in S1 Text. These were obtained from cooling-then-heating runs of our algorithm on the amylase task: on the left using the ESM2 mutant marginal score, and on the right using the machine learning model that predicts fitness for stain removal and dp3 function. We ran the algorithm for 250 iterations, scoring 500 sequences in each iteration, and show all data for iterations beyond 100 to avoid the initial transient. The peak of the variance at intermediate temperatures is striking. Running the algorithm with an even blend of numbers of mutations changes the variance behavior, and those runs were not fit to our equations. The mean and variance traces here are reminiscent of the magnetization and susceptibility in Ising models.
Fig 4.
Order parameters versus temperature for the NucB tasks:
analogous to Fig 3. The left panels correspond to ESM2 mutant marginal scores, and the right panels to the Seq2Fitness model score: the logit of the probability that nuclease activity is higher than that of the reference sequence.