An integrative approach to protein sequence design through multiobjective optimization

doi:10.1371/journal.pcbi.1011953

Fig 1.

Evolutionary multiobjective optimization provides a suitable framework for multistate design.

A. In this work, we examine how machine learning models such as pMPNN, AlphaFold2/AF2Rank, and ESM-1v may be integrated directly into protein sequence design through a multiobjective optimization method known as Non-dominated Sorting Genetic Algorithm II (NSGA-II). Left: first, new design candidates are proposed through a mutation operator; here, this operator is composed of ESM-1v, which is used to rank residue positions, and ProteinMPNN (pMPNN), which is used to redesign the least nativelike positions. Middle: the design candidates are then scored using objective functions derived from AlphaFold2 and pMPNN confidence metrics. Right: lastly, the scored candidates are sorted into successive Pareto fronts (here numbered F1 to F5), and the candidates from the best fronts are selected by NSGA-II for the next round of design. See Methods and S1 Fig for additional details. B. To demonstrate the effectiveness of this framework, we perform an in-depth analysis of the two-state sequence design problem for RfaH, a small foldswitching protein whose C-terminal domain can interconvert between an all-α RfaHα state and an all-β RfaHβ state. We then examine the ability of the proposed framework to tackle higher-dimensional design problems, by redesigning the multi-specific binding interface of PapD (three states) and the various binding modes of calmodulin (CaM; 14 states). For RfaH, the designable positions are highlighted in green; note that the N-terminal domain is not shown in the cartoon representation of the RfaHβ state. For PapD, the binding partner is shown in orange, and PapD is shown in green. For CaM, the N-terminal half of the protein (residues 1–74) is shown in yellow, the C-terminal half of the protein (residues 75–148) is shown in blue, and the binding partners, when present, are shown in orange. For each structure, its PDB ID is listed in parentheses. The CaM binding partner name abbreviations are explained in Table 1.
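The sorting step in the right-hand part of panel A can be illustrated with a minimal sketch of non-dominated sorting. This is not the authors' implementation: the function names are hypothetical, every objective is assumed to be minimized, and the crowding-distance ranking that NSGA-II applies within a front is omitted.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly
    better in at least one (assumption: all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_sort(scores):
    """Split candidates into successive Pareto fronts.

    scores: list of objective tuples, one per design candidate.
    Returns a list of fronts (F1 first); each front is a sorted list
    of indices into `scores`.
    """
    remaining = set(range(len(scores)))
    fronts = []
    while remaining:
        # A candidate is in the current front if nothing left dominates it.
        front = [
            i for i in remaining
            if not any(dominates(scores[j], scores[i]) for j in remaining if j != i)
        ]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts


# Hypothetical two-objective scores for five candidates.
example_scores = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]
example_fronts = non_dominated_sort(example_scores)
```

In this toy example the first three candidates are mutually non-dominated and form F1, while the last two fall into later fronts; a selection step would then fill the next population from F1 onward.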


Fig 2.

Integration of multiple models through evolutionary multiobjective optimization improves RfaH sequence design outcomes.

Each column represents one genetic algorithm setup, indicated by the legend above each column that describes how, at each iteration, the subset of designable positions is selected, how the mutations are proposed, and which objective functions are used to score the designs; here, RND stands for random, pMPNN stands for ProteinMPNN, and ESM refers to ESM-1v. Each row represents a different quality metric: “nat. seq. recov.” refers to the population average fractional identity to the WT sequence, “ESM” refers to the ESM-1v log likelihood score, averaged over both the population and sequence positions, and “HV” refers to the hypervolume, computed either in the pMPNN-SD objective space (HV[pMPNN]) or in the AF2Rank composite score objective space (HV[AF2Rank]); see Methods for more details. Each panel represents the progression of a genetic algorithm simulation over 50 iterations and at four different mutation rates (μ). Within each panel, the horizontal lines represent the quality metric values calculated for the WT RfaH sequence (solid gray), the population average of the pMPNN-AD design sequences (dotted gray), the population average of the RfaHα pMPNN-SD design sequences (dashed blue), and the population average of the RfaHβ pMPNN-SD design sequences (dashed yellow). Simulation results at additional mutation rates are shown in S2 Fig.
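For a two-state problem such as RfaH, the hypervolume metric can be pictured as the area jointly dominated by a population of designs relative to a fixed reference point. The sketch below is illustrative only: the function name, the reference point, and the convention that both objectives are minimized are assumptions, and the exact definition used in the paper is given in its Methods.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by the reference point `ref`.

    Assumptions: exactly two objectives, both minimized; points that do
    not improve on `ref` in both objectives contribute nothing.
    """
    # Sort by the first objective so the area can be swept left to right.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_y = ref[1]
    for x, y in pts:
        if y < best_y:
            # Add the strip that this point newly dominates.
            area += (ref[0] - x) * (best_y - y)
            best_y = y
    return area


# Hypothetical population: three non-dominated designs and one dominated one.
example_hv = hypervolume_2d([(1, 3), (2, 2), (3, 1), (3, 3)], ref=(4, 4))
```

A larger hypervolume means the population pushes further toward the ideal corner of the objective space, which is why it can serve as a single progress curve per simulation.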


Fig 3.

Integration of multiple models into RfaH sequence design leads to reduction of sequence bias and variance.

A. From left to right: distribution of pMPNN-AD (gray), pMPNN-SD (blue for the RfaHα state and yellow for the RfaHβ state), GA[ESM,pMPNN,pMPNN;μ = 0.3] (green; later abbreviated as GA[pMPNN]), GA[ESM,pMPNN,AF2Rank;μ = 0.3] (purple; later abbreviated as GA[AF2Rank]), and the WT (black) sequences in a two-dimensional embedding generated with the Laplacian eigenmaps algorithm [65], the pMPNN-SD log likelihood objective space, and the AF2Rank composite score objective space. The fourth panel on the right shows the empirical cumulative distribution functions (eCDF) for the per-position sequence entropy (base e) of these populations of designed sequences. All GA sequences in this figure refer to the final iteration sequence populations. B. Logo plots for sequences generated using pMPNN-AD (top), pMPNN-SD for the RfaHα state (middle), and pMPNN-SD for the RfaHβ state (bottom). The residue positions are organized into three blocks, depending on whether the sequence profiles from the RfaHα (left), RfaHβ (middle), or both states (right) dominate the pMPNN-AD sequence profiles. For the RfaHα and RfaHβ pMPNN-SD design sequence logo plots, the residue positions are shaded according to their relative solvent accessibility in the corresponding state; blue shading indicates a relative solvent accessibility > 50%, while yellow shading indicates < 20%. The WT residue type at each position is indicated on the secondary x-axis. C. Logo plots for sequences generated using pMPNN-AD (top), GA[pMPNN] (middle), and GA[AF2Rank] (bottom). The residue positions are organized into four blocks: “bias”, “native recovery”, “sequence entropy”, and “no major difference”; see the main text for more details on this classification.
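The per-position sequence entropy (base e) and its empirical cumulative distribution function in panel A can be sketched as follows. The function names are hypothetical, and the sketch assumes a population of equal-length sequences with entropy computed from raw residue frequencies at each position; any weighting or pseudocounts used by the authors are not reproduced here.

```python
import math
from collections import Counter


def per_position_entropy(seqs):
    """Shannon entropy (natural log) of the residue distribution at each
    position of a population of equal-length sequences."""
    entropies = []
    for pos in range(len(seqs[0])):
        counts = Counter(s[pos] for s in seqs)
        total = sum(counts.values())
        h = 0.0
        for c in counts.values():
            p = c / total
            h -= p * math.log(p)
        entropies.append(h)
    return entropies


def ecdf(values):
    """Empirical CDF as a list of (value, fraction of values <= value)."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


# Hypothetical toy population: position 0 is conserved, position 1 is not.
example_entropy = per_position_entropy(["AC", "AD"])
example_ecdf = ecdf(example_entropy)
```

A curve shifted toward low entropy indicates a population that has converged on few residue types per position, while a curve shifted right indicates more diverse designs.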


Fig 4.

Genetic algorithm RfaH designs exhibit greater sequence similarity to NusG-like foldswitching proteins than to non-foldswitching proteins.

The four panels show the distribution of sequence similarity measures for the pMPNN-AD (gray), GA[pMPNN] (green), and GA[AF2Rank] (purple) sequences to NusG-like foldswitching and non-foldswitching sequences. Each panel corresponds to a different mutation rate for the GA simulations, indicated in the panel title. The dashed black lines represent the y = x lines. See Methods for more details on the similarity measure and the NusG-like sequence database.


Fig 5.

Genetic algorithms can be applied to higher-dimensional design problems.

The GA[ESM,pMPNN,pMPNN] and GA[ESM,pMPNN,AF2Rank] setups are applied to two additional model systems: PapD (3 states) and CaM (14 states). A and B show the benchmark results for PapD, and C and D show the benchmark results for CaM. As in Fig 2, A and C show simulation progression as measured by four quality metrics (from left to right): native sequence recovery, ESM-1v log likelihood scores, pMPNN-SD log likelihood score hypervolume, and the AF2Rank composite score hypervolume. For each quality metric, the panel with green curves represents the GA[ESM,pMPNN,pMPNN] setups, the panel with purple curves represents the GA[ESM,pMPNN,AF2Rank] setups, and the horizontal lines within each panel represent the WT values (solid gray) and the pMPNN-AD population averages (dotted gray). Note that the ESM-1v scores are computed over the PapD and CaM sequences only, without the binding partner sequences. For CaM, “approx. HV” indicates that the hypervolume metrics are computed using an approximate algorithm (see Methods). As in Fig 3A, B and D show distributions of the redesigned sequences in the sequence and objective spaces and their sequence entropies; as in Fig 3, the GA[ESM,pMPNN,pMPNN;μ = 0.3] and GA[ESM,pMPNN,AF2Rank;μ = 0.3] setups are abbreviated as GA[pMPNN] and GA[AF2Rank], respectively. Because of the higher dimensionality of the objective spaces, the middle two panels in B and D show the first two principal components (PC) from a principal component analysis of the pMPNN-SD log likelihood objective space and the AF2Rank composite score objective space.
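The projection of a high-dimensional objective space onto its first two principal components, as used in panels B and D, can be sketched with a plain singular value decomposition. The function name and the random input below are hypothetical, and the sketch assumes mean-centering without per-objective scaling; the preprocessing actually used is described in the paper's Methods.

```python
import numpy as np


def first_two_pcs(scores):
    """Project an (n_designs, n_objectives) score matrix onto its first
    two principal components.

    Assumption: columns are mean-centered but not rescaled.
    """
    X = np.asarray(scores, dtype=float)
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T


# Hypothetical scores: 20 designs scored against 14 states (as for CaM).
rng = np.random.default_rng(0)
example_scores = rng.normal(size=(20, 14))
example_pcs = first_two_pcs(example_scores)
```

Each row of the result gives the two-dimensional coordinates of one design, which is what a scatter plot of PC1 against PC2 would display.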


Table 1.

Structural models of CaM.
