Improved CRISPR/Cas9 off-target prediction with DNABERT and epigenetic features

doi:10.1371/journal.pone.0335863

Table 1.

Overview of datasets used for training and evaluation.

Fig 1.

Overview of the DNABERT Fine-Tuning Process and the DNABERT-Epi Model Architecture.

(A) The two-stage fine-tuning process for DNABERT. A model pre-trained on a masked language model (MLM) task is first fine-tuned on a mismatch position prediction task, followed by a second fine-tuning stage on the off-target effect prediction task to produce a binary output (1 for active, 0 for inactive). (B) The input sequence processing pipeline. The sgRNA and target DNA sequences are first tokenized into 3-mers. These are then formatted with special tokens [CLS] and [SEP] before being converted into numerical input IDs for the model. (C) The architecture of the proposed DNABERT-Epi model. The model takes two inputs: the tokenized sequence, which is processed by DNABERT to produce a CLS embedding, and the epigenetic features, which are processed by a separate MLP. A gating mechanism, derived from the CLS embedding, modulates the epigenetic embedding. Finally, the CLS embedding and the gated epigenetic embedding are concatenated and passed to a final output layer to predict the off-target probability.
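Panels (B) and (C) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. It assumes overlapping 3-mers with stride 1, a `[CLS] sgRNA [SEP] target [SEP]` layout, a hypothetical vocabulary ordering, and a sigmoid gate applied elementwise to the epigenetic embedding. The caption fixes none of these details, and all function names and weights are made up for the example.

```python
import math
from itertools import product

# Hypothetical special-token list; the real DNABERT vocabulary file may order tokens differently.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers (stride 1, an assumption)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def format_pair(sgrna, target, k=3):
    """Assumed layout: [CLS] sgRNA k-mers [SEP] target k-mers [SEP]."""
    return ["[CLS]"] + kmers(sgrna, k) + ["[SEP]"] + kmers(target, k) + ["[SEP]"]

def build_vocab(k=3):
    """Illustrative vocabulary: special tokens first, then all 4**k k-mers."""
    tokens = SPECIAL_TOKENS + ["".join(p) for p in product("ACGT", repeat=k)]
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_input_ids(tokens, vocab):
    """Map tokens to integer IDs, falling back to [UNK] for unseen tokens."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(cls_emb, epi_emb, w_gate, b_gate, w_out, b_out):
    """Gate the epigenetic embedding with a signal derived from the CLS embedding,
    concatenate both, and map the result to an off-target probability."""
    # One gate value in (0, 1) per epigenetic dimension, computed from CLS only.
    gate = [sigmoid(sum(w * c for w, c in zip(row, cls_emb)) + b)
            for row, b in zip(w_gate, b_gate)]
    gated_epi = [g * e for g, e in zip(gate, epi_emb)]
    fused = list(cls_emb) + gated_epi          # concatenation
    logit = sum(w * f for w, f in zip(w_out, fused)) + b_out
    return sigmoid(logit)

vocab = build_vocab()
tokens = format_pair("ACGTA", "ACGTT")         # toy sequences, not real sgRNAs
input_ids = to_input_ids(tokens, vocab)
```

In this sketch the gate depends only on the sequence embedding, so the sequence context decides how much of the epigenetic signal reaches the output layer.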

Table 2.

Hyperparameters of the baseline models.

Fig 2.

Performance comparison of all models on the Lazzarotto et al. (2020) GUIDE-seq dataset.

Boxplots show the distribution of (A) F1-score, (B) MCC, (C) ROC-AUC, and (D) PR-AUC scores from the 14-fold cross-validation experiments. The central line in each box indicates the median, the box represents the interquartile range (IQR), and the whiskers extend to 1.5 times the IQR. Dots beyond the whiskers are outliers. Statistical significance between model pairs was determined using the two-sided Wilcoxon signed-rank test with Benjamini-Hochberg correction. Significance levels are denoted as follows: ns: p > 0.05, *: p ≤ 0.05, **: p ≤ 0.01, ***: p ≤ 0.001, ****: p ≤ 0.0001.
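The multiple-testing step and the star notation in this caption can be illustrated as follows. This sketch implements only the Benjamini-Hochberg adjustment and the label mapping. It assumes the raw p-values have already been obtained from a paired test (for example `scipy.stats.wilcoxon` on per-fold scores). The function names are hypothetical and the numbers in the usage lines are made up for illustration, not taken from the paper.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def significance_label(p):
    """Map a p-value to the notation used in the caption."""
    for threshold, label in [(0.0001, "****"), (0.001, "***"), (0.01, "**"), (0.05, "*")]:
        if p <= threshold:
            return label
    return "ns"

# Illustrative raw p-values for four hypothetical model pairs.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
labels = [significance_label(p) for p in adjusted]
```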

Fig 3.

Comparison of PR-AUC performance on training and independent test datasets.

This grouped bar chart summarizes model performance using the area under the precision-recall curve (PR-AUC). It shows results for six datasets, allowing a direct comparison of the models’ generalization capabilities. The Lazzarotto et al. (2020) GUIDE-seq dataset, which was analyzed in detail in Fig 2, is excluded from this summary view. Each bar represents the mean PR-AUC score for a given model on a specific dataset, calculated from the results of the cross-validation experiments. Error bars indicate the standard deviation of the PR-AUC scores across the folds. Higher bars indicate better predictive performance. The x-axis labels are abbreviated for space and correspond, in order, to Lazzarotto et al. (2020) CHANGE-seq, Schmid-Burgk et al. (2020) TTISS, Listgarten et al. (2018) GUIDE-seq, Chen et al. (2017) GUIDE-seq, Tsai et al. (2015) GUIDE-seq (U2OS), and Tsai et al. (2015) GUIDE-seq (HEK293).
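The bar heights and error bars described here amount to a per-fold PR-AUC followed by a mean and a standard deviation. The sketch below makes two assumptions the caption does not state: PR-AUC is computed as average precision (a step-wise estimate, with tied scores not handled specially), and the standard deviation is the sample standard deviation. The toy folds are invented for the example.

```python
from statistics import mean, stdev

def average_precision(y_true, scores):
    """Average precision: mean of the precision values at each positive, ranked by score."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive samples")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank   # precision at this recall step
    return total / n_pos

def summarize_folds(fold_results):
    """fold_results: list of (y_true, scores) pairs, one per cross-validation fold."""
    per_fold = [average_precision(y, s) for y, s in fold_results]
    return mean(per_fold), stdev(per_fold)

# Two toy folds (labels, predicted probabilities); not data from the paper.
folds = [
    ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
    ([1, 0, 0, 1], [0.9, 0.2, 0.1, 0.8]),
]
bar_height, error_bar = summarize_folds(folds)
```

PR-AUC is a common choice for off-target data because active sites are rare, and a precision-based metric is not inflated by the large number of true negatives.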

Table 3.

Performance comparison of DNABERT with and without pre-training.

Table 4.

Performance comparison of DNABERT with and without epigenetic features.

Fig 4.

Contribution of individual models to the ensemble performance on the Lazzarotto et al. (2020) GUIDE-seq dataset.

The contribution of each model to the final ensemble was assessed using a leave-one-out approach. Each bar represents the decrease in the ensemble’s PR-AUC score when that specific model is excluded from the soft-voting process. A longer bar indicates a greater positive contribution to the ensemble’s predictive power. Models are ranked by their contribution.
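The leave-one-out procedure can be sketched as follows. The example assumes that soft voting is an unweighted mean of the member models' predicted probabilities and that PR-AUC is computed as average precision. Neither detail is confirmed by the caption. Model names and scores are toy values.

```python
def average_precision(y_true, scores):
    """Average precision, used here as a stand-in for PR-AUC."""
    n_pos = sum(y_true)
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos

def soft_vote(predictions):
    """Unweighted mean of per-model probabilities (an assumed voting rule)."""
    models = list(predictions.values())
    n_samples = len(models[0])
    return [sum(m[i] for m in models) / len(models) for i in range(n_samples)]

def leave_one_out_contributions(y_true, predictions):
    """Drop in ensemble PR-AUC when each model is excluded, ranked from largest to smallest."""
    full = average_precision(y_true, soft_vote(predictions))
    drops = {}
    for name in predictions:
        rest = {k: v for k, v in predictions.items() if k != name}
        drops[name] = full - average_precision(y_true, soft_vote(rest))
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)

# Toy labels and per-model probabilities; names are placeholders.
y = [1, 0, 1, 0]
preds = {
    "model_A": [0.9, 0.1, 0.8, 0.2],
    "model_B": [0.6, 0.4, 0.7, 0.3],
    "model_C": [0.2, 0.9, 0.3, 0.8],
}
ranking = leave_one_out_contributions(y, preds)
```

A positive value means the ensemble gets worse without that model. A value near zero means the remaining models already cover its signal.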

Fig 5.

SHAP analysis of epigenetic feature contributions in the DNABERT-Epi model.

The analysis was performed on the Lazzarotto et al. (2020) GUIDE-seq dataset. (A) The global importance of each epigenetic mark, measured by the mean absolute SHAP value across all features and samples. Error bars represent the standard deviation. (B) A SHAP summary plot from a representative cross-validation fold, illustrating the impact of the top 30 most important feature bins. Each point is a single sample, with its color indicating the feature’s value (red for high, blue for low) and its x-position showing the impact on the model’s output. (C) The positional importance of each epigenetic mark. The plot shows the mean absolute SHAP value for each 10 bp bin across a ± 500 bp window centered on the cleavage site. The shaded area represents the 95% confidence interval.
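The aggregations behind panels (A) and (C) can be illustrated with a small sketch that starts from an already computed SHAP matrix. It assumes the feature columns are laid out mark by mark, with a fixed number of 10 bp bins per mark (100 bins for a ±500 bp window). The mark names, the layout, and the toy values are hypothetical.

```python
def mark_importance(shap_values, marks, n_bins):
    """Panel (A): mean absolute SHAP value per mark, over all of its bins and all samples."""
    importance = {}
    for m, mark in enumerate(marks):
        start = m * n_bins
        vals = [abs(v) for row in shap_values for v in row[start:start + n_bins]]
        importance[mark] = sum(vals) / len(vals)
    return importance

def positional_importance(shap_values, marks, n_bins):
    """Panel (C): mean absolute SHAP value per bin, computed separately for each mark."""
    n_samples = len(shap_values)
    return {
        mark: [sum(abs(row[m * n_bins + b]) for row in shap_values) / n_samples
               for b in range(n_bins)]
        for m, mark in enumerate(marks)
    }

def bin_start_offset(bin_index, bin_size=10, half_window=500):
    """Offset (bp) of a bin's left edge relative to the cleavage site."""
    return -half_window + bin_index * bin_size

# Toy SHAP matrix: 2 samples, 2 placeholder marks, 2 bins each.
marks = ["mark_A", "mark_B"]
shap_matrix = [
    [1.0, -1.0, 2.0, -2.0],
    [3.0, -3.0, 0.0, 0.0],
]
global_imp = mark_importance(shap_matrix, marks, n_bins=2)
per_bin = positional_importance(shap_matrix, marks, n_bins=2)
```

Taking absolute values before averaging keeps positive and negative contributions from cancelling, so the result measures magnitude of influence rather than direction.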

Fig 6.

Integrated Gradients attribution analysis of sequence features.

(A) and (B) show representative attribution heatmaps for individual sgRNAs from the Lazzarotto et al. (2020) GUIDE-seq and Listgarten et al. (2018) GUIDE-seq datasets, respectively. The columns correspond to the 3-mer token positions in the target DNA. Color intensity reflects the attribution score, with darker red indicating a stronger positive contribution to the prediction of an off-target event. The sgRNA sequence is shown above each map. (C) UMAP visualization of the attribution vectors from all analyzed sgRNAs. Each point represents one sgRNA. Points are colored based on the location of their maximum attribution score: red for the PAM-distal hotspot (positions 4–6), blue for the PAM-proximal hotspot (positions 14–17), and grey for all others.
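For readers unfamiliar with Integrated Gradients, the sketch below applies the method to a toy logistic model rather than to DNABERT, using a midpoint Riemann sum along the straight path from an all-zero baseline. It also shows the hotspot colouring rule from panel (C), assuming positions are 1-based token indices. The model, weights, and function names are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, bias):
    """Toy differentiable model standing in for the fine-tuned network."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)

def model_grad(x, w, bias):
    """Analytic gradient of the toy model with respect to its input."""
    p = model(x, w, bias)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, bias, steps=200):
    """IG_i = (x_i - x'_i) * average gradient along the path from baseline x' to x."""
    total = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps                      # midpoint rule
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        grad = model_grad(point, w, bias)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b0) * t / steps for xi, b0, t in zip(x, baseline, total)]

def hotspot_label(attributions):
    """Colour rule from panel (C), assuming 1-based positions."""
    pos = max(range(len(attributions)), key=lambda i: attributions[i]) + 1
    if 4 <= pos <= 6:
        return "PAM-distal"
    if 14 <= pos <= 17:
        return "PAM-proximal"
    return "other"

x, baseline = [1.0, 0.5, -1.0], [0.0, 0.0, 0.0]
w, bias = [0.8, -0.4, 0.3], 0.1
attr = integrated_gradients(x, baseline, w, bias)
# Completeness property: attributions sum to (approximately) f(x) - f(baseline).
gap = sum(attr) - (model(x, w, bias) - model(baseline, w, bias))
```

The completeness check is a useful sanity test for any Integrated Gradients implementation: a large `gap` usually means too few integration steps.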
