Fig 1.
Shotgun alanine scanning as an effective method for generating SAR information linking sequence to foldability.
A. Sequence representation of alanine scanning libraries (lib1: grey; lib2: blue; lib3: pink; lib4: yellow; no change: dark grey) and alanine scanning results shown as heatmaps of ES for both YSD and BiFC. B. Structure of HCP (PDB code 5JI4) with the regions selected for library construction indicated in different colors. C. BiFC reagent. Half of SYFP2 (light yellow) and half of Turquoise2 (light blue) are fused to the N- and C-termini of an unfolded peptide (rainbow colors). When the peptide is folded, the two non-fluorescent fragments assemble into a green fluorescent complex (green).
Fig 2.
Sequences and enrichment scores of nine disulfide-rich peptides.
Generation of large datasets from YSD combined with alanine scanning libraries for eight HCP scaffolds and the EETI-II scaffold. For each scaffold, both the measured and predicted ES scores were normalized such that the highest score within each scaffold is 1.
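The per-scaffold normalization described in this caption can be sketched as follows. This is a minimal illustration that assumes normalization by division by each scaffold's maximum score; the caption only states that the highest score becomes 1, so other schemes (for example min-max scaling) would also satisfy it. The function name and the example values are hypothetical placeholders, not data from the study.

```python
from typing import Dict, List


def normalize_per_scaffold(scores: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Scale each scaffold's scores so that its highest score equals 1.

    Assumption: normalization is a simple division by the per-scaffold maximum.
    """
    normalized = {}
    for scaffold, values in scores.items():
        top = max(values)
        if top <= 0:
            raise ValueError(f"Scaffold {scaffold!r} has no positive score to normalize by")
        normalized[scaffold] = [v / top for v in values]
    return normalized


# Placeholder numbers for illustration only, not measured or predicted ES values.
example = {"scaffold_A": [0.5, 2.0, 1.0], "scaffold_B": [4.0, 1.0]}
result = normalize_per_scaffold(example)
print(result)
```

Normalizing measured and predicted scores separately within each scaffold puts both on a 0 to 1 scale, so they can be compared across scaffolds.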
Fig 3.
Sequence similarity among nine scaffolds in the training data.
The figure shows the pairwise sequence similarity among the nine scaffolds, calculated using the BLOSUM62 matrix. The similarity ranges from 16% to 60%, reflecting a highly diverse sequence distribution within the training data and the divergence of the training sequences from the validation scaffold sequences. Cross-validation on such divergent scaffolds therefore tests whether the trained model can generalize to sequences far from the training data distribution.
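One way to set up cross-validation across scaffolds is sketched below. It assumes a leave-one-scaffold-out scheme, in which each scaffold is held out in turn for validation while the remaining scaffolds are used for training; the caption does not specify the exact splitting scheme, so this is an illustration rather than the authors' procedure. The scaffold identifiers are hypothetical placeholders.

```python
from typing import Iterator, List, Tuple


def leave_one_scaffold_out(scaffolds: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training scaffolds, held-out scaffold) pairs, one per scaffold.

    Assumption: the cross-validation holds out one whole scaffold at a time.
    """
    for held_out in scaffolds:
        train = [s for s in scaffolds if s != held_out]
        yield train, held_out


# Hypothetical identifiers standing in for the nine scaffolds.
scaffold_ids = [f"scaffold_{i}" for i in range(1, 10)]
folds = list(leave_one_scaffold_out(scaffold_ids))

for train, held_out in folds:
    print(f"validate on {held_out}, train on {len(train)} scaffolds")
```

Holding out entire scaffolds, rather than random residues, keeps every validation sequence out of the training set, which is what makes low pairwise similarity a meaningful test of generalization.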
Table 1.
Adjusted F1 score of the cross-validation results.
Fig 4.
The trained model generalizes the prediction of ES scores to three distinct natural peptides.
A. Comparison of measured and predicted ES scores for three natural DCPs that were not included in model training and were prepared as independent samples, demonstrating the model’s predictive capability. B. Sequence similarity between the scaffolds used in training and the three new natural peptides (1BH4, 2KNP and 3Q8J). The new peptides are notably diverse from the training sequences (less than 30% average sequence similarity), indicating that the trained model generalizes well when predicting ES scores for unrelated sequences.
Table 2.
Accuracy of non-touchable residue prediction: measured ES scores vs. predicted ES scores.
Table 3.
Accuracy of non-touchable residue prediction on three new scaffolds from natural peptides.
Fig 5.
Validation of machine-learning-predicted non-touchable residues for folding using yeast surface display.
The libraries were constructed based on an HCP scaffold and a DCP scaffold, with PDB codes 5JI4 (blue curve) and 3Q8J (green curve), respectively. Solid lines: positive libraries with randomization on amenable residues; dashed lines: negative libraries with randomization on “non-touchable” residues.
Fig 6.
Re-evaluation of DCP libraries reveals design insights for improved hit rates.
A. ES score distribution across various DCP scaffolds, with the originally randomized regions marked in green. B. Correlation between the “unmatched scores” and the hit rates for seven previously reported DCP scaffolds; a higher degree of mismatch corresponds to lower hit rates, emphasizing the importance of model-guided design.
Fig 7.
Validation of newly constructed 5JI4 phage library based on the predicted ES scores.
A. Visualization of 5JI4’s ES scores, where positions scoring below the median are set to zero. The randomized regions are colored light green. B. 5JI4 structure with randomized regions highlighted in green. C. Summary of the number of hits obtained from the 5JI4 phage library across ten panning campaigns.
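The median thresholding used for the visualization in panel A can be sketched as follows. This is a minimal illustration, assuming that positions scoring exactly at the median are kept, since the caption only says that positions below the median are zeroed. The function name and the example scores are hypothetical placeholders, not values from the study.

```python
from statistics import median
from typing import List


def zero_below_median(scores: List[float]) -> List[float]:
    """Set every per-position score that falls below the median to zero.

    Assumption: scores equal to the median are retained unchanged.
    """
    cutoff = median(scores)
    return [s if s >= cutoff else 0.0 for s in scores]


# Placeholder per-position scores for illustration only.
example_scores = [0.1, 0.9, 0.4, 0.7, 0.2]
masked = zero_below_median(example_scores)
print(masked)
```

Zeroing the lower half of the positions leaves only the higher-scoring residues visible, which makes it easier to see how the randomized regions line up with the score profile.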
Fig 8.
Confirmation of hits from the 5JI4 libraries.
A. An example of in vitro re-folding of peptide 5ji4-RSP-1. TICs from LC/MS runs before and after re-folding are shown on the left, with the m/z of the peaks on the right. After two days of re-folding, both peaks 2 and 3 contain the expected m/z (indicated by green triangles) for folded peptides with three disulfide bonds. B. An example SPR sensorgram for measuring peptide binding affinity. Kinetic fitting is shown on the left and steady-state affinity on the right. Experimental data and fitting curves are shown in black and red, respectively.
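The expected m/z for a folded peptide with three disulfide bonds follows from the fact that each disulfide bond removes two hydrogen atoms from the fully reduced peptide. The sketch below shows that arithmetic using monoisotopic masses. The reduced peptide mass is a hypothetical placeholder, not the mass of 5ji4-RSP-1, and the function names are illustrative only.

```python
HYDROGEN_MASS = 1.007825  # monoisotopic mass of a hydrogen atom, in Da
PROTON_MASS = 1.007276    # mass of a proton, in Da


def folded_mass(reduced_mass: float, n_disulfides: int = 3) -> float:
    """Neutral mass after forming n disulfide bonds (each bond loses 2 H)."""
    return reduced_mass - 2 * n_disulfides * HYDROGEN_MASS


def mz(neutral_mass: float, charge: int) -> float:
    """m/z of the [M + zH]^z+ ion for a given positive charge state."""
    return (neutral_mass + charge * PROTON_MASS) / charge


# Hypothetical reduced (all-thiol) peptide mass, for illustration only.
reduced = 4000.0
folded = folded_mass(reduced)

print(f"mass shift on folding: {reduced - folded:.5f} Da")
for z in (3, 4, 5):
    print(f"z={z}: reduced m/z {mz(reduced, z):.4f}, folded m/z {mz(folded, z):.4f}")
```

Forming three disulfide bonds lowers the neutral mass by about 6.05 Da, so at charge state z the folded species appears roughly 6.05/z lower in m/z than the reduced peptide.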
Table 4.
Affinity summary for selected positive hits.
Sixteen clones were selected for synthesis and in vitro folding: seven had spot ELISA signals in the range 0.25–4.0 and signal/noise (s/n) ratios of 2–40, and nine were selected based on NGS ranking. The Kd values of the synthetic peptides were measured by SPR with steady-state or kinetic fitting.
Fig 9.
Functional and selective inhibitors against HtrA1 generated from 5JI4 libraries.
A. Sequences of the HtrA1 primary hit and affinity-maturation hits, with spot ELISA signals, SPR-measured Kd, and IC50 measured by enzymatic assay. B. Inhibition of HtrA1 enzyme activity by HCPs derived from the 5JI4 libraries.
Fig 10.
Comparison of sequence and structure diversity between our method and ProteinMPNN.
A. Sequence logo plot of ten unique sequences generated by our model based on the EETI-II scaffold. B. Predicted 3D structures of the ten sequences from our model, displaying some conformational variability. C. Sequence logo plot of ten unique sequences generated by ProteinMPNN based on the EETI-II scaffold. D. Predicted 3D structures of the ten sequences from ProteinMPNN, showing limited conformational variability.