Computational design of novel Cas9 PAM-interacting domains using evolution-based modelling and structural quality assessment

doi:10.1371/journal.pcbi.1011621

Fig 1.

Experimental validation of chimeras between SpyCas9 and the PAM-interacting domain (PID) of other natural variants.

a: Genetic circuit used to test Cas9 PID variants. dCas9 is guided to silence an operon that consists of an mCherry reporter and the SacB counter-selection marker. The gene construct was designed to enable the easy exchange of PIDs (see Methods). b: Serial dilution and spotting of E. coli MG1655 carrying the wild-type SpydCas9 or dCas9 without a functional PID in the absence (left) or presence (right) of sucrose. c: The activity of dCas9 chimeras is reported as the normalized repression of the mCherry fluorescence. Chimeras were tested against targets with a PAM recognized by the PID they carry [37]. This score is normalized such that 1 corresponds to the activity of the WT and 0 to the negative control. The percentage of identity to the SpyCas9 PID is reported. PIDs that did not recognize the TGG PAM, and were therefore tested with another PAM, are flagged with a #.
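The normalization described in panel c can be sketched as a linear rescaling between the two controls. This is a minimal illustration only: the function name, its arguments and the linear form are assumptions, and the Methods may define the score differently (for example on log-transformed fluorescence).

```python
def normalized_repression(fluo, fluo_wt, fluo_neg):
    """Rescale a fluorescence reading so WT maps to 1 and the negative control to 0.

    Assumed linear form; all three arguments are hypothetical raw mCherry readings.
    """
    if fluo_neg == fluo_wt:
        raise ValueError("WT and negative control readings must differ")
    # Strong repression (low fluorescence, close to WT) gives a score near 1.
    return (fluo_neg - fluo) / (fluo_neg - fluo_wt)
```

With this form, a variant that represses mCherry as strongly as the WT scores 1, and one that behaves like the negative control scores 0.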

Fig 2.

Schematic representation of a standard Restricted Boltzmann Machine.

The RBM is a probabilistic graphical model with two layers: the visible layer carries protein sequences x and the hidden layer encodes the latent vector h. The graph is undirected, allowing one (i) to sample from the visible layer to the hidden (representation) layer using the conditional distribution p(h|x), which depends on x, on the weight matrix W and on the potentials U on the hidden units, and (ii) to sample from the hidden layer to the visible layer using the conditional distribution p(x|h), which depends on h, on the weight matrix W and on the potentials g representing the prevalence of the amino acids at each position.
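The two conditional sampling steps can be sketched as follows. This is a toy illustration under stated assumptions, not the model used in the paper: hidden units are taken to be binary (Bernoulli) with a linear potential U, the visible layer is categorical over q amino-acid states, and the sizes, random parameters and function names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
L, q, M = 6, 20, 4                      # toy sizes: positions, amino-acid states, hidden units
W = rng.normal(0.0, 0.1, size=(L, q, M))  # weights coupling visible and hidden units
g = rng.normal(0.0, 0.1, size=(L, q))     # visible potentials (amino-acid prevalence)
U = np.zeros(M)                           # hidden potentials (assumed linear)

def sample_h_given_x(x, rng):
    """Sample binary hidden units from p(h|x) for an integer-encoded sequence x."""
    # Input to hidden unit mu: sum_i W[i, x_i, mu] + U[mu]
    inputs = W[np.arange(L), x, :].sum(axis=0) + U
    p_on = 1.0 / (1.0 + np.exp(-inputs))
    return (rng.random(M) < p_on).astype(int)

def sample_x_given_h(h, rng):
    """Sample a sequence from p(x|h); positions are independent given h."""
    logits = g + W @ h                              # shape (L, q)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    return np.array([rng.choice(q, p=p[i]) for i in range(L)])

x0 = rng.integers(0, q, size=L)   # random starting sequence
h0 = sample_h_given_x(x0, rng)    # step (i): visible -> hidden
x1 = sample_x_given_h(h0, rng)    # step (ii): hidden -> visible
```

Alternating the two steps gives one round of Gibbs sampling, which is possible because the graph has no connections within a layer.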

Fig 3.

Semi-supervision helps the RBM learn useful representations of Cas9 PID.

a: Our Semi-Supervised Learning RBM with a one-layer classifier. This classifier takes as input the representation of the sequence in the hidden layer, and outputs the predicted PAM. b: Area under the ROC curve for the prediction of the PAM on the validation set, as a function of the classifier strength γ. The curve is smoothed by averaging over 20 consecutive values. The shaded area shows the standard deviation over these 20 consecutive values of γ.
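The smoothing in panel b can be sketched as a sliding window over the per-γ values. This is a minimal sketch, assuming a plain moving average with a population standard deviation; the function name and the handling of the curve edges are assumptions, not taken from the paper.

```python
import numpy as np

def rolling_mean_std(values, window=20):
    """Return the mean and standard deviation over each run of `window` consecutive values."""
    values = np.asarray(values, dtype=float)
    if len(values) < window:
        raise ValueError("need at least `window` values")
    # Each row of `windows` holds one run of consecutive values.
    windows = np.lib.stride_tricks.sliding_window_view(values, window)
    return windows.mean(axis=1), windows.std(axis=1)
```

The mean gives the smoothed curve and the standard deviation gives the half-width of the shaded band around it.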

Fig 4.

Constrained Langevin Dynamics as a sampling method.

a: The Constrained Langevin Dynamics in the representation space consists of two steps. The first step is a round of sampling to evaluate the gradient of the main criterion and the gradient of the control criterion through the expectation formula (see Methods). The second is a random step along the direction defined by these two gradients (the orthogonal projection of the main direction vector with respect to the control direction). Brownian noise is added to create randomness and diversity in the samples. b: Typical random walks obtained starting from the WT SpyCas9 PID and progressively targeting an RBM energy between -0.267 and -0.257 and a Hamming distance to the WT from 50 to 55 amino acids. The black lines represent the time-dependent target intervals along the random walk.
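The second step in panel a can be sketched as a projected gradient move plus Gaussian noise. This is a hedged illustration: the function and parameter names are hypothetical, the two gradients are passed in directly rather than estimated by sampling, and the sign convention (moving along the projected direction) and noise scaling are assumptions.

```python
import numpy as np

def constrained_langevin_step(h, grad_main, grad_control, step, noise_scale, rng):
    """One move in representation space along the main direction, projected
    orthogonally to the control direction, with added Gaussian noise."""
    norm2 = grad_control @ grad_control
    if norm2 > 0:
        # Remove the component of the main direction that lies along the control gradient.
        direction = grad_main - (grad_main @ grad_control) / norm2 * grad_control
    else:
        direction = grad_main
    noise = noise_scale * np.sqrt(step) * rng.standard_normal(h.shape)
    return h + step * direction + noise
```

With the noise switched off, the displacement is orthogonal to the control gradient, so to first order the control criterion is left unchanged while the main criterion moves.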

Fig 5.

Generative capacities of the SSL-RBM.

a: We tested the generative capacities of our models by generating 120 sequences with Constrained Langevin Dynamics using the trained SSL-RBM. We then used FoldX to compute the energy (displayed as ΔΔG, the change in stability with respect to the wild-type) of the protein-DNA complex for the generated sequences and evaluate their quality. b: Distributions of FoldX energies of sequences generated with increasing values of γ. Distributions are drawn in gray using Gaussian kernel density estimation; quantiles are also displayed for each distribution. Overall, these quantiles and distributions show that SSL-RBMs trained with intermediate values of γ tend to generate sequences with better (lower) FoldX energies.

Fig 6.

Experimental evaluation of generated sequences and relationship with Hamming distance, RBM and FoldX scores.

a and b: The activity of the dCas9 chimeras carrying generated PIDs was measured using the mCherry fluorescence assay. Arbitrary thresholds were used to classify proteins as very functional (green, nr > 0.8), functional (yellow, 0.5 < nr < 0.8), marginally functional (orange, 0.2 < nr < 0.5) or not functional (red, nr < 0.2). Bar plot a represents the sequences in the first batch of tested sequences, and bar plot b represents the sequences in the second batch. Error bars are plotted using 2σ, where σ is the standard deviation between independent measurements (see Methods). (#: the large standard deviation obtained for dgfx_60 from experimental measurement was removed from the figure for clarity.) c: Scatter plot of generated protein sequences as a function of the RBM energy and the Hamming distance of the sequence to that of the WT. Gray dots are untested sequences generated through Constrained Langevin Dynamics. We found very active sequences with up to 50 differences to the WT and to any known natural sequence. d: Scatter plot of generated protein sequences as a function of the RBM energy and FoldX ΔΔG.
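The activity classes in panels a and b can be written as a simple threshold function on the normalized repression nr. This is only a sketch: the function name is hypothetical, and since the caption uses strict inequalities on both sides, assigning a value that falls exactly on a threshold to the lower class is an assumption.

```python
def classify_activity(nr):
    """Map a normalized repression value to the activity class used for coloring."""
    if nr > 0.8:
        return "very functional"        # green
    if nr > 0.5:
        return "functional"             # yellow
    if nr > 0.2:
        return "marginally functional"  # orange
    return "not functional"             # red
```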

Fig 7.

Correlation between SSL-RBM energy and experimental activity.

The correlation between the experimental activity and the SSL-RBM score is plotted as a function of the classifier strength (γ) (red curve). The curve is smoothed by averaging over 20 consecutive values. The shaded area shows the standard deviation over these 20 consecutive values of γ.

Fig 8.

RBM energy and structure scores predict sequence activity.

a: Sequences ordered by their score according to each of the metrics; the height and color of the bars correspond to the activity measured experimentally (color code identical to that of Fig 6). The RBM score displayed the best Spearman correlation, with AlphaFold2 also showing very good performance in detecting functional samples (AUROC for detecting samples with activity over 0.5). b: Correlation between the different scores. c: Three components emerge from a sparse PCA: a structure component (combination of AlphaFold2, FoldX and classifier scores), the RBM energy and the classifier score. d: Experimental results displayed along the two principal components inferred by the sparse PCA. The Cas9 PID variants are colored according to their activity, as in Fig 6.
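The AUROC mentioned in panel a can be computed directly from its pairwise definition: the probability that a functional sample receives a higher score than a non-functional one. The sketch below is a generic implementation, not the authors' code; the function name is hypothetical, and for scores where lower is better (such as an energy) the caller is assumed to pass the negated score.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative sample")
    diff = pos[:, None] - neg[None, :]   # all positive-minus-negative score differences
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
```

Here the labels would be obtained by thresholding the measured activity at 0.5, as stated in the caption.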

Fig 9.

Strengths and Weaknesses of Scoring Methods.

Each panel represents the precision in detecting inactive variants using a different score: RBM energy (a), AlphaFold2 RMSD (b), and FoldX’s ΔΔG (c). These metrics are plotted against the mutation’s location within the protein sequence and structure. DNA regions are highlighted in blue. Green signifies regions where the deleterious impact of mutations is accurately predicted. Conversely, red indicates regions where predictions miss the mark. Areas with insufficient data (fewer than five variants with a mutation at this position) are marked in gray. Notably, the RBM energy offers precise predictions at the DNA-binding interface, in contrast to AlphaFold2 RMSD and ΔΔG, which tend to be less accurate at these sites.

Fig 10.

Architecture of the SSL-RBM classifier.

During training, the classifier takes as input the hidden layer after sampling (equivalent to a noisy representation of the input sequence), while during prediction it uses the hidden representation. The hidden representation is then batch normalized and fed into a fully connected layer. The network attempts to predict whether the PID recognizes each nucleotide at each position in the PAM.
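The forward pass of such a classifier head can be sketched as follows. This is an illustrative sketch under several assumptions: batch normalization is applied with batch statistics and without learned scale and shift, each (position, nucleotide) output is treated as an independent sigmoid probability, and the function name, parameter names and shapes are hypothetical.

```python
import numpy as np

def classifier_forward(H, W_fc, b_fc, eps=1e-5):
    """Batch-normalize hidden representations H (batch, n_hidden), apply one fully
    connected layer, and return probabilities of shape (batch, n_positions, 4)."""
    mean = H.mean(axis=0)
    var = H.var(axis=0)
    H_norm = (H - mean) / np.sqrt(var + eps)   # batch normalization (no learned affine)
    logits = H_norm @ W_fc + b_fc              # shape (batch, n_positions * 4)
    probs = 1.0 / (1.0 + np.exp(-logits))      # one probability per nucleotide and position
    return probs.reshape(H.shape[0], -1, 4)
```

For a three-nucleotide PAM, `W_fc` would have shape (n_hidden, 12) and the output would hold one probability for each of A, C, G and T at each position.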
