Abstract
CRISPR/Cas9 is a powerful genome editing tool, but its clinical application is hindered by off-target effects. Accurate computational prediction of these unintended edits is crucial for ensuring the safety and efficacy of therapeutic applications. While various deep learning models have been developed, most are trained only on task-specific data, failing to leverage the vast knowledge embedded in entire genomes. To address this limitation, we introduce a novel approach that integrates DNABERT, a deep learning model pre-trained on the human genome, with epigenetic features (H3K4me3, H3K27ac, and ATAC-seq). We conducted a comprehensive benchmark of our model, DNABERT-Epi, against five state-of-the-art methods across seven distinct off-target datasets. Our results demonstrate that the pre-trained DNABERT-based models achieve competitive or even superior performance. Rigorous ablation studies quantitatively confirmed that both genomic pre-training and the integration of epigenetic features are critical factors that significantly enhance predictive accuracy. Furthermore, by applying advanced interpretability techniques (SHAP and Integrated Gradients), we identified the specific epigenetic marks and sequence-level patterns that influence the model’s predictions, offering insights into its decision-making process. This study is the first to establish the significant potential of a pre-trained DNA foundation model for CRISPR/Cas9 off-target prediction. Our findings underscore that leveraging both large-scale genomic knowledge and multi-modal data is a key strategy for advancing the development of safer genome editing tools.
Citation: Kimata K, Satou K (2025) Improved CRISPR/Cas9 off-target prediction with DNABERT and epigenetic features. PLoS One 20(11): e0335863. https://doi.org/10.1371/journal.pone.0335863
Editor: Hodaka Fujii, Hirosaki University Graduate School of Medicine, JAPAN
Received: June 30, 2025; Accepted: October 16, 2025; Published: November 12, 2025
Copyright: © 2025 Kimata, Satou. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting information files. All source code used in this study is available from GitHub: https://github.com/kimatakai/CRISPR_DNABERT.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
The CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats)/Cas9 system has revolutionized the field of biology, providing an unprecedentedly simple and efficient tool for genome editing [1]. Originally discovered as an adaptive immune system in bacteria, its adaptation for targeted DNA cleavage has opened up vast possibilities in genetic engineering, functional genomics, and the development of novel gene therapies for a wide range of human diseases [2–6]. The system’s specificity is primarily guided by a 20-nucleotide sequence in the single-guide RNA (sgRNA), which directs the Cas9 nuclease to a complementary target DNA sequence adjacent to a protospacer adjacent motif (PAM) [7–9].
Despite its power and precision, the therapeutic application of CRISPR/Cas9 is hampered by the risk of off-target effects, where the Cas9 nuclease cleaves unintended genomic sites that are similar to the intended target sequence [7–9]. Such unintended edits can lead to deleterious consequences, including the disruption of essential genes or the activation of oncogenes, posing a significant safety concern for clinical applications [10–12]. Therefore, the ability to accurately predict potential off-target sites in silico is of paramount importance for designing safe and effective sgRNAs.
In response to this challenge, a multitude of computational methods have been developed, evolving from early scoring algorithms to more sophisticated deep learning models that have demonstrated superior predictive performance [13–15]. In recent years, models based on the Transformer architecture, a cornerstone of natural language processing, have been successfully applied to off-target prediction, with models such as CRISPR-BERT and CrisprBERT showing promising results [16,17]. Concurrently, similar deep learning approaches have been effectively utilized in related bioinformatics classification tasks, such as predicting protein modification sites and identifying regulatory elements [18–22]. However, many existing deep learning models for off-target prediction are trained exclusively on task-specific datasets. This approach overlooks the rich, contextual information embedded within the entire genome. Moreover, while accumulating evidence suggests that epigenetic factors, such as chromatin accessibility, influence Cas9 activity [23,24], the integration of these features into predictive models remains an area of active development.
This study introduces a novel approach to address these limitations, centered on four key contributions. First, we present the first application of a pre-trained DNA foundation model, DNABERT, to the CRISPR/Cas9 off-target prediction task [25]. Unlike models trained from scratch on limited data, DNABERT has been pre-trained on the entire human genome, allowing it to learn the fundamental “language” of DNA. Our ablation studies quantitatively demonstrate that this genomic pre-training is indispensable for achieving high performance. Second, we propose DNABERT-Epi, a multi-modal model that integrates sequence data with epigenetic features, and we rigorously validate that this integration provides a statistically significant improvement in predictive accuracy. Third, recognizing the challenge of comparing models developed under different conditions, we provide a fair and comprehensive benchmark by re-implementing five state-of-the-art models and evaluating them alongside our own under a unified, stringent cross-validation framework [26,27]. Fourth, we move beyond simple performance metrics by employing advanced interpretability techniques to provide novel insights into the biological mechanisms learned by the models.
The source code used in this study is available at https://github.com/kimatakai/CRISPR_DNABERT.
2. Materials and methods
2.1. Datasets
2.1.1. Overview of utilized datasets.
In this study, we utilized one in vitro and six in cellula CRISPR/Cas9 off-target datasets to comprehensively evaluate our proposed models (Table 1). The in vitro dataset, derived from CHANGE-seq, was used for the initial training of all models from scratch [23]. For in cellula evaluation, we employed a multi-stage approach. First, we used two large-scale datasets, Lazzarotto et al. GUIDE-seq and Schmid-Burgk et al. TTISS, for training via transfer learning from the CHANGE-seq-trained models [28]. The Lazzarotto et al. GUIDE-seq dataset was particularly important as it was used to train and evaluate our multi-modal DNABERT-Epi model, which incorporates epigenetic features. To rigorously assess the generalization performance of our models, we used the remaining four in cellula datasets from Chen et al., Listgarten et al., and Tsai et al. exclusively as independent test sets [9,29,30].
2.1.2. Data acquisition and preprocessing.
To ensure a fair and reproducible comparison, we utilized datasets curated by Yaish et al. [27]. Specifically, the datasets from Lazzarotto et al., Chen et al., Listgarten et al., and Tsai et al. were obtained directly from their repository. For the Tsai et al. (2015) GUIDE-seq dataset, we separated the original combined dataset into U2OS and HEK293 cell-specific subsets for more precise evaluation. The Schmid-Burgk et al. (2020) TTISS dataset was generated by processing the raw sequence read data from PRJNA602092, following a pipeline identical to that used by Yaish et al. The Lazzarotto et al. (2020) GUIDE-seq dataset was expanded to include 20 additional sgRNAs newly curated by Yaish et al., resulting in a total of 78 sgRNAs for our 14-fold cross-validation.
All datasets exhibited a significant class imbalance between active (positive) and inactive (negative) off-target sites (Table 1). To mitigate potential model bias during training, we performed random downsampling on the negative class of the training data, reducing its size to 20% of the original. To ensure reproducibility across all models, this downsampling was performed once using a fixed random seed. This strategy to address severe class imbalance is a common approach in various bioinformatics classification tasks [18,19]. The test datasets remained unaltered throughout the process to allow for an unbiased evaluation of model performance.
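The following minimal Python sketch illustrates this downsampling step; the DataFrame layout, column name, and function name are illustrative rather than the exact code in our repository.

```python
import numpy as np
import pandas as pd

def downsample_negatives(train_df, label_col="label", keep_frac=0.2, seed=42):
    """Randomly retain a fixed fraction of inactive (negative) training sites.

    Illustrative sketch: positives are kept in full, negatives are reduced to
    `keep_frac` of their original number using a fixed random seed, and the
    test data are left untouched.
    """
    rng = np.random.default_rng(seed)
    pos = train_df[train_df[label_col] == 1]
    neg = train_df[train_df[label_col] == 0]
    keep = rng.choice(len(neg), size=int(len(neg) * keep_frac), replace=False)
    balanced = pd.concat([pos, neg.iloc[keep]])
    return balanced.sample(frac=1.0, random_state=seed).reset_index(drop=True)
```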
2.2. Epigenetic feature processing
The selection of epigenetic features for our DNABERT-Epi model was guided by the findings of Lazzarotto et al. (2020), the source study for our primary in cellula dataset. Their research demonstrated that off-target sites identified by GUIDE-seq are significantly enriched in regions characterized by open chromatin (ATAC-seq), active promoters (H3K4me3), and enhancers (H3K27ac) [23]. In contrast, no significant enrichment was observed for repressive histone marks such as H3K27me3 and H3K9me3. Based on this direct evidence, we focused on integrating these three activating marks to enhance the predictive power of our model. The raw epigenetic data were obtained from the Gene Expression Omnibus (GSE149363).
The processing pipeline for each of the three epigenetic features was as follows. First, for each potential off-target site, we extracted the signal values within a 1000 bp window, centered on the cleavage site (±500 bp). To handle potential outliers within this window, signal values exceeding the range of Q1 - 1.5 * IQR or Q3 + 1.5 * IQR were capped at these respective boundary values. Subsequently, a Z-score transformation was applied to the signal values across the entire dataset for normalization. Finally, the normalized signal within the 1000 bp window was divided into 100 bins of 10 bp each, and the average signal was calculated for each bin, resulting in a 100-dimensional feature vector for each epigenetic mark. These three vectors were then concatenated to form a final 300-dimensional feature vector, which was used as the epigenetic input for the DNABERT-Epi model.
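A simplified sketch of this pipeline for a single epigenetic mark is shown below, assuming the per-base signal for each ±500 bp window has already been extracted into a NumPy array; details such as how the dataset-wide Z-score is computed may differ from our implementation.

```python
import numpy as np

def cap_outliers(window):
    """Cap signal values at Q1 - 1.5*IQR and Q3 + 1.5*IQR within one window."""
    q1, q3 = np.percentile(window, [25, 75])
    iqr = q3 - q1
    return np.clip(window, q1 - 1.5 * iqr, q3 + 1.5 * iqr)

def binned_features(windows, bin_size=10):
    """Convert an (n_sites, 1000) signal matrix for one mark into
    (n_sites, 100) binned features: per-window outlier capping,
    dataset-wide Z-score, then averaging over 10 bp bins."""
    capped = np.stack([cap_outliers(w) for w in windows])
    z = (capped - capped.mean()) / (capped.std() + 1e-8)
    n, length = z.shape
    return z.reshape(n, length // bin_size, bin_size).mean(axis=2)

# The three marks are then concatenated into a 300-dimensional vector per site, e.g.:
# epi_features = np.concatenate([atac_bins, h3k4me3_bins, h3k27ac_bins], axis=1)
```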
2.3. Model architecture
2.3.1. DNABERT fine-tuning.
DNABERT is a BERT-based model pre-trained on a large corpus of DNA sequences, enabling it to learn the fundamental patterns of the DNA language [25]. In this study, we utilized the 3-mer DNABERT model, which was pre-trained on a masked language model (MLM) task. To adapt this model for off-target prediction, we implemented a two-stage fine-tuning process (Fig 1A).
(A) The two-stage fine-tuning process for DNABERT. A model pre-trained on a masked language model (MLM) task is first fine-tuned on a mismatch position prediction task, followed by a second fine-tuning stage on the off-target effect prediction task to produce a binary output (1 for active, 0 for inactive). (B) The input sequence processing pipeline. The sgRNA and target DNA sequences are first tokenized into 3-mers. These are then formatted with special tokens [CLS] and [SEP] before being converted into numerical input IDs for the model. (C) The architecture of the proposed DNABERT-Epi model. The model takes two inputs: the tokenized sequence, which is processed by DNABERT to produce a CLS embedding, and the epigenetic features, which are processed by a separate MLP. A gating mechanism, derived from the CLS embedding, modulates the epigenetic embedding. Finally, the CLS embedding and the gated epigenetic embedding are concatenated and passed to a final output layer to predict the off-target probability.
The first stage involved fine-tuning the model on a mismatch position prediction task. This task was designed to explicitly teach the model the pairing relationship between sgRNA and target DNA sequences. For this stage, we used a batch size of 8, a learning rate of 2e-5, and trained for five epochs. In the second stage, the model was further fine-tuned for the primary binary classification task of predicting off-target effects. For this off-target prediction task, we used a batch size of 256, a learning rate of 2e-5, and trained for five epochs.
Before fine-tuning, DNABERT’s vocabulary was expanded to include 3-mer tokens containing the bulge character (‘-’) to handle insertions and deletions. As illustrated in Fig 1B, input sequences were formatted by concatenating the 3-mer tokens of the sgRNA and target DNA, separated by special tokens: [CLS] sgRNA 3-mer tokens [SEP] DNA 3-mer tokens [SEP].
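The sketch below illustrates this input construction; the conversion of tokens to numerical input IDs is handled by the bulge-extended DNABERT tokenizer and is omitted here.

```python
def to_3mers(seq, k=3):
    """Split a sequence (possibly containing the bulge character '-') into
    overlapping k-mer tokens, DNABERT-style."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_input(sgrna, target_dna):
    """Format one sgRNA/target pair as
    [CLS] sgRNA 3-mers [SEP] DNA 3-mers [SEP] (illustrative sketch)."""
    return " ".join(["[CLS]", *to_3mers(sgrna), "[SEP]", *to_3mers(target_dna), "[SEP]"])

# Example with placeholder sequences (20-nt protospacer, 23-nt target with PAM):
print(build_input("ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACCTAGG"))
```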
2.3.2. DNABERT-Epi for multimodal integration.
To investigate the impact of epigenetic context on off-target activity, we propose DNABERT-Epi, a multi-modal model that integrates sequence information with epigenetic features (Fig 1C). This model uses the fine-tuned DNABERT as its sequence-processing backbone.
The architecture of DNABERT-Epi processes two distinct inputs simultaneously. The tokenized sequence is fed into the DNABERT component to generate a high-level sequence representation, from which we extract the final embedding of the [CLS] token. Concurrently, the 300-dimensional epigenetic feature vector is passed through a multi-layer perceptron (MLP) to produce an epigenetic embedding. To control the influence of the epigenetic information based on the sequence context, we employ a gating mechanism. A gate vector is generated from the CLS embedding, which then modulates the epigenetic embedding through element-wise multiplication. Finally, the original CLS embedding and the gated epigenetic embedding are concatenated and fed into a linear layer with a softmax activation function to compute the final probability of an off-target event.
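A minimal PyTorch sketch of this fusion head is given below; `bert` stands for the fine-tuned DNABERT backbone (a Hugging Face BertModel), and the hidden sizes and MLP depth are illustrative assumptions rather than the tuned hyperparameters used in the study.

```python
import torch
import torch.nn as nn

class DNABERTEpiHead(nn.Module):
    """Minimal sketch of the multi-modal fusion in DNABERT-Epi."""

    def __init__(self, bert, hidden=768, epi_dim=300, epi_hidden=128):
        super().__init__()
        self.bert = bert
        self.epi_mlp = nn.Sequential(
            nn.Linear(epi_dim, epi_hidden), nn.ReLU(),
            nn.Linear(epi_hidden, epi_hidden), nn.ReLU(),
        )
        # Gate vector derived from the sequence ([CLS]) representation.
        self.gate = nn.Sequential(nn.Linear(hidden, epi_hidden), nn.Sigmoid())
        self.classifier = nn.Linear(hidden + epi_hidden, 2)

    def forward(self, input_ids, attention_mask, epi_features):
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state[:, 0]
        epi = self.epi_mlp(epi_features)          # epigenetic embedding
        gated = self.gate(cls) * epi              # element-wise gating
        logits = self.classifier(torch.cat([cls, gated], dim=-1))
        return torch.softmax(logits, dim=-1)      # off-target probability
```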
2.3.3. Baseline models.
To evaluate the performance of our proposed models, we compared them against five state-of-the-art deep learning-based models for CRISPR off-target prediction: GRU-Embed, CRISPR-BERT, CRISPR-HW, CRISPR-DIPOFF, and CrisprBERT [16,17,27,31,32]. To ensure a fair and direct comparison under identical experimental conditions, we re-implemented all baseline models in PyTorch (version 2.5.1), based on the descriptions in their respective original papers and publicly available source code. As the original implementations of CRISPR-DIPOFF and CrisprBERT did not support inputs containing bulges, we modified their data processing modules accordingly. The hyperparameters used for each baseline model are detailed in Table 2.
2.3.4. Ensemble model.
To further enhance predictive performance and robustness, we also constructed an ensemble model. We employed a soft voting strategy, where the final prediction is determined by averaging the probability scores output by each individual model. For datasets without epigenetic information, the ensemble combined the five baseline models and DNABERT. For the Lazzarotto et al. (2020) GUIDE-seq dataset, the ensemble included all seven models (the five baselines, DNABERT, and DNABERT-Epi). This approach is designed to leverage the complementary strengths of diverse model architectures to achieve a more stable and accurate prediction [33].
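The soft-voting rule itself is straightforward; a minimal sketch, with each entry of `per_model_probs` standing in for one model's predicted class probabilities, is shown below.

```python
import numpy as np

def soft_vote(per_model_probs):
    """Average the class-probability arrays produced by the individual models.

    `per_model_probs` is a list of (n_samples, 2) arrays, one per model;
    the ensemble prediction is their element-wise mean (soft voting).
    """
    return np.mean(np.stack(per_model_probs, axis=0), axis=0)

# e.g. ensemble_probs = soft_vote([probs_by_model[name] for name in ensemble_members])
```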
2.4. Implementation details
All models were implemented using Python 3.10. The deep learning frameworks used were PyTorch (version 2.5.1) and Transformers (version 4.48.3). All training and evaluation experiments were conducted on a workstation equipped with an NVIDIA GeForce RTX 4090 GPU (24 GB of VRAM), running on a Linux operating system with CUDA (version 12.4). The training times for each model on the main datasets are provided in the Supplementary Materials (S4 Table).
2.5. Experimental setup and evaluation
2.5.1. Cross-validation strategy.
To robustly evaluate the predictive performance of the models, we employed a comprehensive, two-tiered validation strategy.
First, for the datasets used in training (Lazzarotto et al. (2020) CHANGE-seq, Lazzarotto et al. (2020) GUIDE-seq, and Schmid-Burgk et al. (2020) TTISS), we implemented a rigorous sgRNA-based cross-validation scheme. This approach ensures that all off-target sites associated with a particular sgRNA are entirely contained within a single fold, preventing information leakage between the training and test sets and thereby providing a more realistic estimate of a model’s ability to generalize to new sgRNAs [27]. The number of folds was set to 10 or 14, depending on the dataset, to align with established evaluation protocols.
Second, to assess the true generalization capability of the models on completely unseen data, we performed validation using four independent datasets that were entirely excluded from any training or hyperparameter tuning processes. This comprehensive approach, which combines cross-validation with independent testing, is essential for mitigating the risk of overfitting and ensuring that a model can generalize to new data from different cell types or experimental methods. The importance of such a dual-validation strategy has been emphasized as a critical practice for developing reliable and practical deep learning models in computational biology [18,20].
Furthermore, to account for the stochasticity inherent in deep learning models (e.g., random weight initialization), the entire cross-validation process was repeated five times with different random seeds. The final reported performance metrics represent the aggregated results from these five independent runs.
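The sgRNA-based splitting described above can be expressed with scikit-learn's GroupKFold, using the sgRNA identity as the grouping variable; the snippet below is an equivalent minimal sketch with placeholder data, not our exact fold-assignment code.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Placeholder data: in practice X holds the encoded sgRNA/target pairs,
# y the activity labels, and sgrna_ids the sgRNA each site belongs to.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)
sgrna_ids = np.random.randint(0, 78, size=1000)

gkf = GroupKFold(n_splits=14)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=sgrna_ids)):
    # Every site of a given sgRNA lands in exactly one fold, so no sgRNA is
    # shared between the training and test splits.
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # ... train on the training split, evaluate on the test split ...
```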
2.5.2. Evaluation metrics.
Given the extreme class imbalance inherent in the CRISPR off-target datasets used in this study, we selected four primary evaluation metrics that are well-suited for such scenarios: F1-score, Matthews Correlation Coefficient (MCC), the area under the receiver operating characteristic curve (ROC-AUC), and the area under the precision-recall curve (PR-AUC).
The F1-score, as the harmonic mean of precision and recall, provides a balanced measure of a model’s performance when the positive class is rare. The MCC considers all four entries of the confusion matrix (true positives, true negatives, false positives, and false negatives) and is widely regarded as one of the most robust metrics for imbalanced classification. While ROC-AUC assesses the overall discriminative ability of a model across all thresholds, PR-AUC is often more informative in settings with a large skew in the class distribution, as it evaluates the trade-off between precision and recall. By utilizing this suite of metrics, we ensure a comprehensive and reliable assessment of model performance.
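A minimal sketch of how these four metrics can be computed per fold with scikit-learn is shown below; PR-AUC is approximated here by average precision, and the 0.5 decision threshold for F1 and MCC is an illustrative choice.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

def evaluate_fold(y_true, y_prob, threshold=0.5):
    """Compute F1, MCC, ROC-AUC, and PR-AUC for one cross-validation fold.

    `y_prob` holds the predicted probability of the active (positive) class.
    """
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "F1": f1_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "ROC-AUC": roc_auc_score(y_true, y_prob),
        "PR-AUC": average_precision_score(y_true, y_prob),
    }
```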
2.5.3. Statistical analysis.
To rigorously compare the performance between different models, we employed statistical tests appropriate for the paired nature of our cross-validation results (i.e., all models were evaluated on the same data folds). A two-sided Wilcoxon signed-rank test was used to determine if there were statistically significant differences between the performance distributions of any two models.
Furthermore, to address the issue of multiple comparisons arising from performing tests across numerous model pairs and metrics, we controlled the false discovery rate (FDR) using the Benjamini-Hochberg (BH) procedure. All reported p-values are adjusted p-values from this procedure, ensuring the statistical robustness of our conclusions. A significance level of p < 0.05 was used throughout the study.
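For reference, a minimal sketch of this testing procedure using SciPy and statsmodels is given below; `fold_scores` is an illustrative dictionary mapping each model name to its per-fold metric values.

```python
from itertools import combinations
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

def compare_models(fold_scores, alpha=0.05):
    """Pairwise two-sided Wilcoxon signed-rank tests on matched folds,
    followed by Benjamini-Hochberg FDR correction of the p-values."""
    pairs = list(combinations(fold_scores.keys(), 2))
    raw_p = [wilcoxon(fold_scores[a], fold_scores[b],
                      alternative="two-sided").pvalue for a, b in pairs]
    reject, p_adj, _, _ = multipletests(raw_p, alpha=alpha, method="fdr_bh")
    return {pair: (p, sig) for pair, p, sig in zip(pairs, p_adj, reject)}
```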
2.6. Model interpretability analysis
To gain insights into the decision-making processes of our models beyond predictive accuracy, we employed two distinct interpretability methods. We used SHAP to quantify the contribution of epigenetic features and Integrated Gradients to attribute the prediction to specific sequence tokens.
2.6.1. SHAP analysis for epigenetic features.
Shapley Additive Explanations (SHAP) is a game theory-based approach used to explain the output of any machine learning model by assigning an importance value to each feature for a particular prediction [34]. To analyze the influence of epigenetic features on the predictions of our DNABERT-Epi model, we utilized the DeepExplainer algorithm, which is an efficient variant of SHAP designed for deep learning models [35].
For the analysis, we first extracted a balanced subset of data, consisting of all active off-target sites and an equal number of randomly sampled inactive sites from the Lazzarotto et al. GUIDE-seq dataset. We then calculated the SHAP values for the 300-dimensional epigenetic input features for each sample in this subset. To derive global insights, these SHAP values were aggregated in two ways: (1) the mean absolute SHAP values were calculated for each of the three epigenetic marks (ATAC-seq, H3K4me3, and H3K27ac) to determine their overall importance, and (2) the mean SHAP values for each of the 100 genomic bins were computed to visualize the positional importance of epigenetic signals relative to the cleavage site.
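The aggregation itself is a simple operation over the SHAP value matrix; the sketch below assumes an (n_samples, 300) array of SHAP values for the active class, with the three marks concatenated in blocks of 100 bins (the block order shown is an assumption for illustration).

```python
import numpy as np

def aggregate_shap(shap_values):
    """Aggregate SHAP values for the 300-dimensional epigenetic input.

    Assumed layout (illustrative): bins 0-99 = ATAC-seq, 100-199 = H3K4me3,
    200-299 = H3K27ac. Returns the global importance of each mark and its
    per-bin positional profile (mean absolute SHAP values).
    """
    marks = {"ATAC-seq": slice(0, 100),
             "H3K4me3": slice(100, 200),
             "H3K27ac": slice(200, 300)}
    global_importance = {m: np.abs(shap_values[:, s]).mean()
                         for m, s in marks.items()}
    positional_profile = {m: np.abs(shap_values[:, s]).mean(axis=0)
                          for m, s in marks.items()}
    return global_importance, positional_profile
```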
2.6.2. Integrated gradients for sequence features.
Integrated Gradients (IG) is a feature attribution method that calculates the importance of each input feature by accumulating the gradients along the path from a baseline input to the actual input [36]. We applied IG to the DNABERT model to identify which nucleotide tokens were most influential in predicting active off-target sites.
The embedding of the [PAD] token was used as the baseline for this analysis. For each active off-target site, we calculated the attribution score for every 3-mer token in the input sequence. These scores were then visualized as heatmaps to reveal attribution patterns for individual sgRNAs. To statistically validate the significance of recurrently high-attribution regions (hotspots), we conducted a randomization test with 1,000 iterations. In each iteration, we compared the mean attribution in these regions to randomly selected regions of the same size to calculate an empirical p-value. Finally, to explore overarching patterns across all sgRNAs, the attribution vectors were embedded into a two-dimensional space using Uniform Manifold Approximation and Projection (UMAP) for visualization and clustering [37].
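The randomization test can be summarized by the following sketch, in which `attributions` is a one-dimensional array of per-position attribution scores for one data subset and `hotspot_idx` the positions of a candidate hotspot (both placeholders).

```python
import numpy as np

def hotspot_empirical_pvalue(attributions, hotspot_idx, n_iter=1000, seed=0):
    """Empirical p-value for a high-attribution hotspot.

    Compares the mean attribution within the hotspot against the means of
    randomly placed regions of the same size over `n_iter` iterations,
    with the usual +1 correction on the empirical p-value.
    """
    rng = np.random.default_rng(seed)
    observed = attributions[hotspot_idx].mean()
    size = len(hotspot_idx)
    null = np.array([
        attributions[rng.choice(len(attributions), size=size, replace=False)].mean()
        for _ in range(n_iter)
    ])
    return (np.sum(null >= observed) + 1) / (n_iter + 1)
```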
3. Results
3.1. Benchmarking performance on diverse datasets
To establish the effectiveness of our proposed methods, we conducted a comprehensive benchmark against five state-of-the-art models across seven distinct datasets. First, we focused on the Lazzarotto et al. GUIDE-seq dataset to evaluate our multi-modal DNABERT-Epi model. As shown in Fig 2, DNABERT-Epi demonstrated superior performance, particularly in metrics sensitive to class imbalance. It achieved a significantly higher F1-score and MCC compared to the sequence-only DNABERT model (p < 0.05), and its PR-AUC of 0.550 was the highest among all single models, outperforming DNABERT (0.539) and CrisprBERT (0.511). While several models, including DNABERT-Epi, achieved high ROC-AUC scores, PR-AUC is the more informative metric in this context given the severe class imbalance. The ensemble model consistently achieved the best performance across all metrics, highlighting the benefits of integrating diverse model predictions. Detailed performance metrics and the complete results of statistical tests for this dataset are provided in S1 Table and S2 Table, respectively.
Boxplots show the distribution of (A) F1-score, (B) MCC, (C) ROC-AUC, and (D) PR-AUC scores from the 14-fold cross-validation experiments. The central line in each box indicates the median, the box represents the interquartile range (IQR), and the whiskers extend to 1.5 times the IQR. Dots beyond the whiskers are outliers. Statistical significance between model pairs was determined using the two-sided Wilcoxon signed-rank test with Benjamini-Hochberg correction. Significance levels are denoted as follows: ns: p > 0.05, *: p ≤ 0.05, **: p ≤ 0.01, ***: p ≤ 0.001, ****: p ≤ 0.0001.
Next, to assess the generalization performance on the remaining datasets, we evaluated the sequence-based models on the in vitro CHANGE-seq data and the five other in cellula datasets. The overall trend, summarized by the PR-AUC in Fig 3, indicates that DNABERT is a consistently strong performer across these diverse conditions. For instance, on the Schmid-Burgk 2020 TTISS and Tsai 2015 GUIDE-seq (HEK293) datasets, DNABERT achieved the highest PR-AUC among all single models. However, the performance varied depending on the dataset; CrisprBERT, for example, showed competitive performance on the Chen 2017 and Tsai 2015 U2OS GUIDE-seq datasets. This highlights that no single model architecture is universally optimal for all conditions. Importantly, our proposed DNABERT model maintained robust and competitive performance across this wide range of datasets derived from different experimental methods and cell types, confirming its high generalization capability. The comprehensive performance results for all models across all metrics and datasets are available in the Supplementary Materials, which include detailed performance plots (S1 File), full numerical results (S1 Table), statistical test p-values (S2 Table), and confusion matrices organized by mismatch count (S3 Table).
This grouped bar chart provides a summary of model performance using the Precision-Recall Area Under the Curve (PR-AUC). The figure displays results for six datasets, facilitating a direct comparison of the models’ generalization capabilities. The Lazzarotto et al. (2020) GUIDE-seq dataset, which was analyzed in detail in Fig 2, is excluded from this summary view. Each bar represents the mean PR-AUC score for a given model on a specific dataset, calculated from the results of the cross-validation experiments. Error bars indicate the standard deviation of the PR-AUC scores across the folds. Higher bars indicate superior predictive performance. The labels on the x-axis are abbreviated for space and correspond to Lazzarotto et al. (2020) CHANGE-seq, Schmid-Burgk et al. (2020) TTISS, Listgarten et al. (2018) GUIDE-seq, Chen et al. (2017) GUIDE-seq, Tsai et al. (2015) GUIDE-seq (U2OS), and Tsai et al. (2015) GUIDE-seq (HEK293), respectively.
3.2. Ablation studies confirm key contributions
To dissect the factors contributing to our model’s performance, we conducted two key ablation studies. These studies were designed to quantitatively assess the impact of DNABERT’s pre-training and the integration of epigenetic features.
First, to evaluate the effectiveness of leveraging a pre-trained foundation model, we compared the performance of our fine-tuned DNABERT model against an identical model architecture trained from scratch (i.e., with randomly initialized weights). The results, summarized in Table 3, demonstrate the critical importance of pre-training. The model initialized with pre-trained weights substantially outperformed the from-scratch model across all evaluation metrics (e.g., + 0.1653 in PR-AUC; p < 0.001). While the from-scratch model showed evidence of some learning, its performance was markedly inferior to the pre-trained model. This suggests that the genomic knowledge encoded during pre-training is essential for the model to effectively learn the complex and imbalanced off-target prediction task, a conclusion supported by the less effective training loss reduction observed for the from-scratch model (S1 Fig).
Second, we conducted an ablation study to determine whether the inclusion of epigenetic information provides a tangible benefit for off-target prediction in a cellular context. We compared the performance of the sequence-only DNABERT model with our multi-modal DNABERT-Epi model on the Lazzarotto et al. (2020) GUIDE-seq dataset. As shown in Table 4, the integration of epigenetic features led to statistically significant improvements in three of the four key metrics: F1-score, MCC, and PR-AUC (p < 0.001). No statistically significant difference was observed in ROC-AUC (p = 0.127). These results confirm that incorporating epigenetic context enhances the model’s ability to distinguish between active and inactive off-target sites in cellula, particularly in ways captured by precision-recall-based metrics.
3.3. Ensemble model benefits from diverse architectures
Our benchmark results consistently showed that the ensemble model outperformed any single model across all datasets (Fig 2, Fig 3). To understand the dynamics within the ensemble, we analyzed the contribution of each constituent model using a leave-one-out strategy. This analysis reveals how the diversity of model architectures contributes to the overall superior performance of the ensemble, a phenomenon that has been noted in previous studies [33].
Fig 4 illustrates the contribution of each model to the ensemble’s performance on the Lazzarotto et al. (2020) GUIDE-seq dataset, as measured by the drop in PR-AUC when a model is excluded. As expected, high-performing individual models such as CrisprBERT and our proposed DNABERT-Epi were the largest contributors. Their exclusion led to the most significant drop in the ensemble’s performance, confirming their central role. Notably, the analysis also revealed that models with moderate individual performance, such as GRU-Embed and CRISPR-HW, still made substantial positive contributions. This suggests that these models capture unique predictive patterns that are complementary to those learned by the top-performing models. Conversely, the exclusion of CRISPR-BERT slightly improved the ensemble’s score, indicating that its predictions were, on average, detrimental in this specific combination. This analysis underscores the principle that the strength of an ensemble lies not just in combining the best-performing models, but in leveraging the diverse perspectives of multiple architectures. The detailed results of this analysis for all datasets are provided in the S2 File.
The contribution of each model to the final ensemble was assessed using a leave-one-out approach. Each bar represents the decrease in the ensemble’s PR-AUC score when that specific model is excluded from the soft-voting process. A longer bar indicates a greater positive contribution to the ensemble’s predictive power. Models are ranked by their contribution.
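For clarity, the leave-one-out contribution analysis can be sketched as follows, where `model_probs` maps each constituent model to its predicted active-class probabilities on a common test set and PR-AUC is approximated by average precision (a simplified illustration, not the exact analysis code).

```python
import numpy as np
from sklearn.metrics import average_precision_score

def loo_contributions(model_probs, y_true):
    """Drop in ensemble PR-AUC when each model is excluded from soft voting."""
    full = average_precision_score(y_true,
                                   np.mean(list(model_probs.values()), axis=0))
    contributions = {}
    for name in model_probs:
        rest = [p for m, p in model_probs.items() if m != name]
        contributions[name] = full - average_precision_score(
            y_true, np.mean(rest, axis=0))
    return contributions
```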
3.4. Model interpretation unveils predictive mechanisms
To move beyond predictive accuracy and understand the biological patterns learned by our models, we employed two interpretability techniques. We used SHAP to analyze the role of epigenetic features in DNABERT-Epi and Integrated Gradients to identify critical nucleotide positions in the sequence-only DNABERT model.
3.4.1. SHAP reveals the critical role of H3K27ac.
To elucidate how DNABERT-Epi utilizes epigenetic information, we calculated SHAP values for each input feature. This approach is similar to previous studies in RNA modification prediction, where SHAP has been used to link computational predictions to biologically meaningful motifs [20]. The analysis of global feature importance revealed a clear hierarchy among the three epigenetic marks. As shown in Fig 5A, H3K27ac, a mark associated with active enhancers, had a substantially higher mean absolute SHAP value than H3K4me3 (active promoters) and ATAC-seq (open chromatin), indicating that H3K27ac is the most influential feature in the model. This is further supported by the SHAP summary plot (Fig 5B and S3 File), where the top 30 most impactful features consist exclusively of H3K27ac-related bins. The plot also shows a consistent trend where higher signal values for these features (red points) positively impact the model’s prediction of off-target activity.
The analysis was performed on the Lazzarotto et al. (2020) GUIDE-seq dataset. (A) The global importance of each epigenetic mark, measured by the mean absolute SHAP value across all features and samples. Error bars represent the standard deviation. (B) A SHAP summary plot from a representative cross-validation fold, illustrating the impact of the top 30 most important feature bins. Each point is a single sample, with its color indicating the feature’s value (red for high, blue for low) and its x-position showing the impact on the model’s output. (C) The positional importance of each epigenetic mark. The plot shows the mean absolute SHAP value for each 10 bp bin across a ± 500 bp window centered on the cleavage site. The shaded area represents the 95% confidence interval.
When we examined the positional importance of these features (Fig 5C), H3K27ac consistently showed the highest mean SHAP values across the entire ±500 bp window around the cleavage site. Importantly, the contribution of H3K27ac was not uniform, with prominent peaks observed around ±200 bp and near the ±500 bp boundaries of the window. These results suggest that the model has learned to associate off-target events not only with the general presence of an active enhancer landscape but also with specific spatial patterns of H3K27ac enrichment relative to the cleavage site.
3.4.2. Integrated gradients identify key nucleotide positions and sgRNA clusters.
To understand which parts of the input sequence were most critical for prediction, we used Integrated Gradients to calculate attribution scores for each 3-mer token. The resulting heatmaps for individual sgRNAs consistently revealed two distinct high-attribution regions, or “hotspots,” within the 20-nt guide-target duplex (Fig 6A, 6B and S4 File). The first hotspot was located at the PAM-distal end (positions 4–6), while the second was at the PAM-proximal end (positions 14–17), partially overlapping with the canonical seed region. A randomization test confirmed that the attribution scores within these two hotspots were, in most cases (15 of 18 data subsets), statistically significantly higher than in the rest of the sequence (all p-values in S5 Table).
(A) and (B) show representative attribution heatmaps for individual sgRNAs from the Lazzarotto et al. (2020) GUIDE-seq and Listgarten et al. (2018) GUIDE-seq datasets, respectively. The columns correspond to the 3-mer token positions in the target DNA. Color intensity reflects the attribution score, with darker red indicating a stronger positive contribution to the prediction of an off-target event. The sgRNA sequence is shown above each map. (C) UMAP visualization of the attribution vectors from all analyzed sgRNAs. Each point represents one sgRNA. Points are colored based on the location of their maximum attribution score: red for the PAM-distal hotspot (positions 4–6), blue for the PAM-proximal hotspot (positions 14–17), and grey for all others.
To investigate whether these attribution patterns were consistent across all sgRNAs, we visualized the high-dimensional attribution vectors using UMAP (Fig 6C). The plot reveals that the sgRNAs form three distinct clusters: one group where the PAM-distal hotspot has the maximum attribution (red), another where the PAM-proximal hotspot is dominant (blue), and a third with no clear hotspot pattern (grey). This clustering suggests that the model does not apply a single, uniform pattern of sequence recognition across all sgRNAs. Instead, the model appears to have learned to weigh nucleotide importance in a context-dependent manner. This finding may indicate that the sequence determinants of off-target activity are more nuanced than a simple binary seed/non-seed distinction, and that the model is capturing elements of this complexity.
4. Discussion
This study presents a comprehensive evaluation of a pre-trained DNA foundation model, DNABERT, for CRISPR/Cas9 off-target prediction. We have demonstrated that by leveraging large-scale genomic pre-training and integrating epigenetic features, our proposed models, DNABERT and DNABERT-Epi, achieve state-of-the-art performance across a wide range of datasets. Our contributions are fourfold: (1) we are the first to apply a pre-trained DNA foundation model to this task; (2) we have quantitatively demonstrated the significant contribution of both pre-training and epigenetic features through rigorous ablation studies; (3) we have provided a fair and extensive benchmark by re-implementing and evaluating multiple models under identical conditions; and (4) we have offered novel insights into the models’ decision-making processes through advanced interpretability analyses.
Our interpretability analyses provided valuable insights into the predictive mechanisms of the models. The SHAP analysis of DNABERT-Epi revealed that H3K27ac, an active enhancer mark, emerged as the most influential feature among those examined for predicting off-target activity in a cellular context. This aligns with existing biological knowledge that CRISPR/Cas9 activity is often higher in accessible, transcriptionally active chromatin regions [23,38,39]. Furthermore, our analysis of positional importance suggested that the model learned specific spatial patterns of H3K27ac enrichment, rather than just its overall presence.
Regarding sequence specificity, the canonical “seed” region adjacent to the PAM is widely considered the most critical determinant for target recognition [7,40–47]. We therefore initially hypothesized that our interpretability analysis would primarily and uniformly highlight this region. However, the Integrated Gradients analysis of DNABERT revealed a more nuanced picture. Rather than a singular focus on the entire seed region, the model identified two distinct attribution hotspots: a PAM-proximal (positions 14–17) and a PAM-distal (positions 4–6) region. These computationally identified hotspots may reflect critical stages in R-loop formation and conformational activation, although this remains a hypothesis requiring further experimental validation [42,48–50]. For instance, the high importance placed on the PAM-proximal hotspot is consistent with recent structural studies that describe a conformational checkpoint mechanism, where mismatches in this region can prevent the activation of the HNH nuclease domain [48–50]. This consistency suggests, but does not prove, that the model may be capturing sequence features relevant to this checkpoint. In parallel, the PAM-distal hotspot could represent the model’s focus on an earlier stage, such as the initiation and stable propagation of the R-loop. Interestingly, the UMAP visualization of attribution vectors further suggested that sgRNAs may cluster into three groups: those emphasizing the PAM-proximal hotspot, those emphasizing the PAM-distal hotspot, and those without a clear hotspot preference. This preliminary observation may point to a potential classification of sgRNAs into context-dependent categories, although further experimental evidence will be required to establish such a framework. This convergence of our model’s learned patterns with biophysical mechanisms may provide a data-driven hypothesis for the nuanced rules governing Cas9 target engagement, but further experimental validation will be necessary to confirm this link.
While DNABERT and DNABERT-Epi demonstrated robust performance, they did not universally outperform all other models across metrics and datasets. For instance, CrisprBERT, a model that does not use large-scale pre-training but combines a BERT-based encoder with a BiLSTM, showed competitive or superior performance on certain datasets. This suggests that while pre-training provides a powerful, generalizable foundation, task-specific architectural innovations are also crucial. A promising future direction would be to combine the strengths of both approaches, for example, by integrating a BiLSTM layer into the fine-tuned DNABERT architecture to better capture sequential dependencies specific to off-target recognition.

Our study has several limitations that represent important areas for future research. First, our analysis was focused exclusively on the Streptococcus pyogenes Cas9 (SpCas9) nuclease. The applicability of our models to other CRISPR systems, such as Cas12a or high-fidelity Cas9 variants, remains unverified and will require retraining and evaluation on variant-specific datasets [51–55]. Second, the validation of our multi-modal DNABERT-Epi model was limited to a single GUIDE-seq dataset due to the lack of publicly available, matched epigenetic data for other off-target datasets. The full potential and generalizability of this approach can only be realized as more comprehensive, multi-modal datasets become available. A promising future direction to overcome this data scarcity would be to leverage large-scale, sequence-based predictive models, such as DeepSEA, to generate in silico chromatin profiles for these datasets. Integrating these high-quality predicted features could significantly expand the applicability of our multi-modal approach [56]. Third, the high performance of DNABERT-based models comes at a significant computational cost, requiring substantially more training time than other baseline models (S4 Table). Applying model compression techniques such as distillation or pruning could be a valuable next step to create more efficient yet powerful models for broader use [57,58]. Finally, like all models benchmarked in this study, our approach struggled to accurately predict active off-target sites with a high number of mismatches (5–6), frequently misclassifying them as inactive (S3 Table). This “high-mismatch problem” is a key challenge for the field and may require novel architectural solutions or specialized sampling strategies to address effectively.
In conclusion, this work establishes the significant potential of pre-trained foundation models for advancing CRISPR/Cas9 off-target prediction. We have demonstrated that combining the vast genomic knowledge learned during pre-training with context-specific epigenetic information leads to a more accurate and robust prediction of off-target events. The interpretability analyses not only increase transparency but also generate new and testable hypotheses about the underlying biology. Future efforts should focus on expanding these models to other CRISPR variants, improving computational efficiency, and developing new strategies to tackle the persistent challenge of high-mismatch prediction, ultimately contributing to the development of safer and more effective genome editing therapies. A promising direction is to extend our model to predict not just the presence of off-target effects, but also their strength. This approach, which involves a multi-level classification task, has been successfully applied in other domains, such as the identification of enhancers, where a two-stage framework was used to classify both the presence and strength of enhancers [21].
Supporting information
S1 File. Performance comparison across all datasets.
https://doi.org/10.1371/journal.pone.0335863.s001
(PDF)
S2 File. Ensemble contribution (PR-AUC) across all datasets.
https://doi.org/10.1371/journal.pone.0335863.s002
(PDF)
S4 File. Integrated Gradients heatmaps for all sgRNAs.
https://doi.org/10.1371/journal.pone.0335863.s004
(PDF)
S1 Table. Comprehensive performance comparison table across all datasets.
https://doi.org/10.1371/journal.pone.0335863.s005
(XLSX)
S2 Table. P-values from statistical comparisons across all datasets.
https://doi.org/10.1371/journal.pone.0335863.s006
(XLSX)
S3 Table. Confusion matrices by mismatch count for all datasets.
https://doi.org/10.1371/journal.pone.0335863.s007
(XLSX)
S4 Table. Training time comparison for each model.
https://doi.org/10.1371/journal.pone.0335863.s008
(XLSX)
S5 Table. P-values from the randomization test for Integrated Gradients analysis.
https://doi.org/10.1371/journal.pone.0335863.s009
(XLSX)
S1 Fig. Training loss curves comparing the pre-trained and from-scratch DNABERT models.
https://doi.org/10.1371/journal.pone.0335863.s010
(TIFF)
Acknowledgments
In this research, the super-computing resource was provided by Human Genome Center, the Institute of Medical Science, the University of Tokyo. In addition, computations were partially performed on the NIG supercomputer at ROIS National Institute of Genetics.
References
- 1. Jinek M, Chylinski K, Fonfara I, Hauer M, Doudna JA, Charpentier E. A programmable dual-RNA–guided DNA endonuclease in adaptive bacterial immunity. Science. 2012;337(6096):816–21.
- 2. Doudna JA, Charpentier E. Genome editing. The new frontier of genome engineering with CRISPR-Cas9. Science. 2014;346(6213):1258096. pmid:25430774
- 3. Sharma G, Sharma AR, Bhattacharya M, Lee S-S, Chakraborty C. CRISPR-Cas9: a preclinical and clinical perspective for the treatment of human diseases. Mol Ther. 2021;29(2):571–86. pmid:33238136
- 4. Hsu PD, Lander ES, Zhang F. Development and applications of CRISPR-Cas9 for genome engineering. Cell. 2014;157(6):1262–78. pmid:24906146
- 5. Deveau H, Garneau JE, Moineau S. CRISPR/Cas system and its role in phage-bacteria interactions. Annu Rev Microbiol. 2010;64:475–93. pmid:20528693
- 6. Horvath P, Barrangou R. CRISPR/Cas, the immune system of bacteria and archaea. Science. 2010;327(5962):167–70. pmid:20056882
- 7. Zhang X-H, Tee LY, Wang X-G, Huang Q-S, Yang S-H. Off-target effects in CRISPR/Cas9-mediated genome engineering. Mol Ther Nucleic Acids. 2015;4(11):e264. pmid:26575098
- 8. Mali P, Aach J, Stranges PB, Esvelt KM, Moosburner M, Kosuri S, et al. CAS9 transcriptional activators for target specificity screening and paired nickases for cooperative genome engineering. Nat Biotechnol. 2013;31(9):833–8. pmid:23907171
- 9. Chen JS, Dagdas YS, Kleinstiver BP, Welch MM, Sousa AA, Harrington LB, et al. Enhanced proofreading governs CRISPR-Cas9 targeting accuracy. Nature. 2017;550(7676):407–10. pmid:28931002
- 10. Lin J, Wong K-C. Off-target predictions in CRISPR-Cas9 gene editing using deep learning. Bioinformatics. 2018;34(17):i656–63. pmid:30423072
- 11. Manghwar H, Li B, Ding X, Hussain A, Lindsey K, Zhang X, et al. CRISPR/Cas systems in genome editing: methodologies and tools for sgRNA design, off-target evaluation, and strategies to mitigate off-target effects. Adv Sci (Weinh). 2020;7(6):1902312. pmid:32195078
- 12. Peng R, Lin G, Li J. Potential pitfalls of CRISPR/Cas9-mediated genome editing. FEBS J. 2016;283(7):1218–31. pmid:26535798
- 13. Konstantakos V, Nentidis A, Krithara A, Paliouras G. CRISPR-Cas9 gRNA efficiency prediction: an overview of predictive tools and the role of deep learning. Nucleic Acids Res. 2022;50(7):3616–37. pmid:35349718
- 14. Sherkatghanad Z, Abdar M, Charlier J, Makarenkov V. Using traditional machine learning and deep learning methods for on- and off-target prediction in CRISPR/Cas9: a review. Brief Bioinform. 2023;24(3):bbad131. pmid:37080758
- 15. Zhang G, Luo Y, Dai X, Dai Z. Benchmarking deep learning methods for predicting CRISPR/Cas9 sgRNA on- and off-target activities. Brief Bioinform. 2023;24(6):bbad333. pmid:37775147
- 16. Luo Y, Chen Y, Xie H, Zhu W, Zhang G. Interpretable CRISPR/Cas9 off-target activities with mismatches and indels prediction using BERT. Comput Biol Med. 2024;169:107932. pmid:38199209
- 17. Sari O, Liu Z, Pan Y, Shao X. Predicting CRISPR-Cas9 off-target effects in human primary cells using bidirectional LSTM with BERT embedding. Bioinform Adv. 2024;5(1):vbae184. pmid:39758829
- 18. Khan S, AlQahtani SA, Noor S, Ahmad N. PSSM-Sumo: deep learning based intelligent model for prediction of sumoylation sites using discriminative features. BMC Bioinform. 2024;25(1):284. pmid:39215231
- 19. Khan S, Noor S, Javed T, Naseem A, Aslam F, AlQahtani SA, et al. XGBoost-enhanced ensemble model using discriminative hybrid features for the prediction of sumoylation sites. BioData Min. 2025;18(1):12. pmid:39901279
- 20. Noor S, Naseem A, Awan HH, Aslam W, Khan S, AlQahtani SA, et al. Deep-m5U: a deep learning-based approach for RNA 5-methyluridine modification prediction using optimized feature integration. BMC Bioinform. 2024;25(1):360. pmid:39563239
- 21. Yao L, Xie P, Guan J, Chung C-R, Huang Y, Pang Y, et al. CapsEnhancer: an effective computational framework for identifying enhancers based on chaos game representation and capsule network. J Chem Inf Model. 2024;64(14):5725–36. pmid:38946113
- 22. Xie P, Guan J, He X, Zhao Z, Guo Y, Sun Z, et al. CAP-m7G: a capsule network-based framework for specific RNA N7-methylguanosine site identification using image encoding and reconstruction layers. Comput Struct Biotechnol J. 2025;27:804–12. pmid:40109445
- 23. Lazzarotto CR, Malinin NL, Li Y, Zhang R, Yang Y, Lee G, et al. CHANGE-seq reveals genetic and epigenetic effects on CRISPR-Cas9 genome-wide activity. Nat Biotechnol. 2020;38(11):1317–27. pmid:32541958
- 24. Bergman S, Tuller T. Strong association between genomic 3D structure and CRISPR cleavage efficiency. PLoS Comput Biol. 2024;20(6):e1012214. pmid:38848440
- 25. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics. 2021;37(15):2112–20. pmid:33538820
- 26. Yaish O, Asif M, Orenstein Y. A systematic evaluation of data processing and problem formulation of CRISPR off-target site prediction. Brief Bioinform. 2022;23(5):bbac157. pmid:35595297
- 27. Yaish O, Orenstein Y. Generating, modeling and evaluating a large-scale set of CRISPR/Cas9 off-target sites with bulges. Nucleic Acids Res. 2024;52(12):6777–90. pmid:38813823
- 28. Schmid-Burgk JL, Gao L, Li D, Gardner Z, Strecker J, Lash B, et al. Highly parallel profiling of Cas9 variant specificity. Mol Cell. 2020;78(4):794–800.e8. pmid:32187529
- 29. Listgarten J, Weinstein M, Kleinstiver BP, Sousa AA, Joung JK, Crawford J, et al. Prediction of off-target activities for the end-to-end design of CRISPR guide RNAs. Nat Biomed Eng. 2018;2(1):38–47. pmid:29998038
- 30. Tsai SQ, Zheng Z, Nguyen NT, Liebers M, Topkar VV, Thapar V, et al. GUIDE-seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases. Nat Biotechnol. 2015;33(2):187–97. pmid:25513782
- 31. Yang Y, Li J, Zou Q, Ruan Y, Feng H. Prediction of CRISPR-Cas9 off-target activities with mismatches and indels based on hybrid neural network. Comput Struct Biotechnol J. 2023;21:5039–48. pmid:37867973
- 32. Toufikuzzaman M, Hassan Samee MA, Sohel Rahman M. CRISPR-DIPOFF: an interpretable deep learning approach for CRISPR Cas-9 off-target prediction. Brief Bioinform. 2024;25(2):bbad530. pmid:38388680
- 33. Zhang S, Li X, Lin Q, Wong K-C. Synergizing CRISPR/Cas9 off-target predictions for ensemble insights and practical applications. Bioinformatics. 2019;35(7):1108–15. pmid:30169558
- 34. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. In: Luxburg UV, Guyon I, Bengio S, Wallach H, Fergus R. Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17); 2017 Dec 4-9; Long Beach, California, USA. Red Hook (NY): Curran Associates Inc.; 2017. pp. 4768–77.
- 35. Shrikumar A, Greenside P, Kundaje A. Learning important features through propagating activation differences. In: Precup D, Teh YW. Proceedings of the 34th International Conference on Machine Learning (ICML’17); 2017 Aug 6-11; Sydney, Australia. Sydney, NSW, Australia: JMLR.org.; 2017. pp. 3145–53.
- 36. Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks. In: Precup D, Teh YW. Proceedings of the 34th International Conference on Machine Learning (ICML’17); 2017 Aug 6-11; Sydney, Australia. Sydney, NSW, Australia: JMLR.org.; 2017. pp. 3319–28.
- 37. McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. arXiv. 2018.
- 38. Daer RM, Cutts JP, Brafman DA, Haynes KA. The impact of chromatin dynamics on cas9-mediated genome editing in human cells. ACS Synth Biol. 2017;6(3):428–38. pmid:27783893
- 39. Chung C-H, Allen AG, Sullivan NT, Atkins A, Nonnemacher MR, Wigdahl B, et al. Computational analysis concerning the impact of DNA accessibility on CRISPR-Cas9 cleavage efficiency. Mol Ther. 2020;28(1):19–28. pmid:31672284
- 40. Bravo JPK, Liu M-S, Hibshman GN, Dangerfield TL, Jung K, McCool RS, et al. Structural basis for mismatch surveillance by CRISPR-Cas9. Nature. 2022;603(7900):343–7. pmid:35236982
- 41. Fu R, He W, Dou J, Villarreal OD, Bedford E, Wang H, et al. Systematic decomposition of sequence determinants governing CRISPR/Cas9 specificity. Nat Commun. 2022;13(1):474. pmid:35078987
- 42. Pacesa M, Lin C-H, Cléry A, Saha A, Arantes PR, Bargsten K, et al. Structural basis for Cas9 off-target activity. Cell. 2022;185(22):4067–4081.e21. pmid:36306733
- 43. Semenova E, Jore MM, Datsenko KA, Semenova A, Westra ER, Wanner B, et al. Interference by clustered regularly interspaced short palindromic repeat (CRISPR) RNA is governed by a seed sequence. Proc Natl Acad Sci U S A. 2011;108(25):10098–103. pmid:21646539
- 44. Jiang F, Doudna JA. CRISPR-Cas9 structures and mechanisms. Annu Rev Biophys. 2017;46:505–29. pmid:28375731
- 45. Yu T, Liu T, Wang Y, Zhao X, Zhang W. Effect of Cas9 protein on the seed-target base pair of the sgRNA/DNA hybrid duplex. J Phys Chem B. 2023;127(22):4989–97. pmid:37243666
- 46. Feng Y, Liu S, Chen R, Xie A. Target binding and residence: a new determinant of DNA double-strand break repair pathway choice in CRISPR/Cas9 genome editing. J Zhejiang Univ Sci B. 2021;22(1):73–86. pmid:33448189
- 47. Hsu PD, Scott DA, Weinstein JA, Ran FA, Konermann S, Agarwala V, et al. DNA targeting specificity of RNA-guided Cas9 nucleases. Nat Biotechnol. 2013;31(9):827–32. pmid:23873081
- 48. Dagdas YS, Chen JS, Sternberg SH, Doudna JA, Yildiz A. A conformational checkpoint between DNA binding and cleavage by CRISPR-Cas9. Sci Adv. 2017;3(8):eaao0027. pmid:28808686
- 49. Nishimasu H, Ran FA, Hsu PD, Konermann S, Shehata SI, Dohmae N, et al. Crystal structure of Cas9 in complex with guide RNA and target DNA. Cell. 2014;156(5):935–49. pmid:24529477
- 50. Palermo G, Miao Y, Walker RC, Jinek M, McCammon JA. CRISPR-Cas9 conformational activation as elucidated from enhanced molecular simulations. Proc Natl Acad Sci U S A. 2017;114(28):7260–5. pmid:28652374
- 51. Zetsche B, Gootenberg JS, Abudayyeh OO, Slaymaker IM, Makarova KS, Essletzbichler P, et al. Cpf1 is a single RNA-guided endonuclease of a class 2 CRISPR-Cas system. Cell. 2015;163(3):759–71. pmid:26422227
- 52. Kleinstiver BP, Pattanayak V, Prew MS, Tsai SQ, Nguyen NT, Zheng Z, et al. High-fidelity CRISPR-Cas9 nucleases with no detectable genome-wide off-target effects. Nature. 2016;529(7587):490–5. pmid:26735016
- 53. Slaymaker IM, Gao L, Zetsche B, Scott DA, Yan WX, Zhang F. Rationally engineered Cas9 nucleases with improved specificity. Science. 2016;351(6268):84–8. pmid:26628643
- 54. Ikeda A, Fujii W, Sugiura K, Naito K. High-fidelity endonuclease variant HypaCas9 facilitates accurate allele-specific gene modification in mouse zygotes. Commun Biol. 2019;2:371. pmid:31633062
- 55. Wang G, Liu X, Wang A, Wen J, Kim P, Song Q, et al. CRISPRoffT: comprehensive database of CRISPR/Cas off-targets. Nucleic Acids Res. 2025;53(D1):D914–24. pmid:39526384
- 56. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods. 2015;12(10):931–4. pmid:26301843
- 57. O’Neill J, Dutta S, Assem H. Self-distilled Pruning of Deep Neural Networks. In: Amini M-R, Canu S, Fischer A, Guns T, Novak PK, Tsoumakas G. Machine Learning and Knowledge Discovery in Databases: European Conference (ECML PKDD’22); 2022 Sep 19; Grenoble, France. Berlin, Heidelberg, Germany: Springer; 2022. pp. 655–70.
- 58. Liu W, Zhou P, Zhao Z, Wang Z, Deng H, Ju Q. FastBERT: a self-distilling BERT with adaptive inference time. arXiv. 2020.