
Fig 1.

An overview of the proposed framework leveraging data similarity analysis with genome editing transfer learning.

(A) Three distance measures (cosine, Euclidean, and Manhattan) are used to identify the most suitable source dataset for a given target dataset (a smaller bootstrapped dataset) from among three benchmark candidates: CD33, CIRCLE, and SITE (the complete, large datasets); (B) The framework then transfers the learned model knowledge from the selected source dataset to the target dataset, enhancing predictive accuracy.
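
The source-selection step in panel (A) can be sketched as follows. This is a minimal illustration, not the authors' code: the feature vectors are random stand-ins, and the exact normalization used to turn the three distances into one similarity score (here, per-measure max normalization followed by averaging) is an assumption.

```python
import numpy as np

def distances(u, v):
    """Return (cosine, Euclidean, Manhattan) distances between vectors u and v."""
    cos = 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    euc = np.linalg.norm(u - v)
    man = np.abs(u - v).sum()
    return cos, euc, man

def select_source(target_vec, source_vecs):
    """Pick the source with the highest 1 - normalized average distance."""
    names = list(source_vecs)
    d = np.array([distances(target_vec, source_vecs[n]) for n in names])
    d_norm = d / d.max(axis=0)               # scale each measure to (0, 1]
    similarity = 1.0 - d_norm.mean(axis=1)   # 1 - normalized average distance
    return names[int(np.argmax(similarity))], dict(zip(names, similarity))

# Random stand-ins for dataset-level feature vectors
rng = np.random.default_rng(0)
sources = {"CD33": rng.random(32), "CIRCLE": rng.random(32), "SITE": rng.random(32)}
best, sims = select_source(rng.random(32), sources)
```

The chosen `best` source then seeds the transfer step in panel (B).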

Table 1.

Seven CRISPR-Cas9 benchmark off-target datasets used in our study.

Six include gRNA-target pairs with mismatches only, and one (CIRCLE, denoted with an asterisk) includes gRNA-target pairs with both mismatches and indels. Minority-class samples correspond to active off-target sites (active off-targets), and majority-class samples correspond to inactive off-target sites.

Fig 2.

A schematic view of the encoding of an sgRNA-DNA sequence pair, as employed in the study of Lin et al. [37].

A seven-bit encoding example is illustrated, where the _ symbol indicates the position of DNA or RNA bulges. Each sgRNA-DNA sequence pair is encoded as a fixed-length matrix with seven rows, comprising a five-bit character channel (A, G, C, T, _) and a two-bit direction channel. The five-bit channel encodes the nucleotides at the on- and off-target sites, while the direction channel identifies the locations of mismatches and indels. L denotes the sequence length (L=23 in our study).
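
The encoding described above can be sketched in a few lines. The character channel (the OR of the two one-hot codes) follows the caption; the specific direction-bit convention below (which bit fires for a given mismatch ordering) is an assumption, as the caption does not pin it down, and the example sequences are illustrative.

```python
import numpy as np

ALPHABET = "AGCT_"  # '_' marks an RNA or DNA bulge position

def encode_pair(sgrna: str, dna: str) -> np.ndarray:
    """Encode an aligned sgRNA-DNA pair as a 7 x L matrix."""
    assert len(sgrna) == len(dna)
    m = np.zeros((7, len(sgrna)), dtype=np.int8)
    for j, (r, d) in enumerate(zip(sgrna, dna)):
        i, k = ALPHABET.index(r), ALPHABET.index(d)
        m[i, j] = 1                      # character channel: OR of the two
        m[k, j] = 1                      # one-hot codes (rows 0-4)
        if i != k:                       # direction channel (rows 5-6) flags
            m[5 if i < k else 6, j] = 1  # mismatches and indels
    return m

# Illustrative pair of length L = 23 with one RNA bulge ('_' vs 'C')
mat = encode_pair("GAGT_CTAAGTCTGTTTACAAGG",
                  "GAGTCCTAAGTCTGTTTACAAGG")
```

Matched positions light a single character bit per column; the one bulge column lights two character bits plus one direction bit.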

Fig 3.

Representation of transfer learning for FNNs, CNNs, and RNNs.

Minor variations exist among the CNN and FNN architectures used in our experiments, depending on the number of layers included. The architecture presented is, however, consistent across all RNNs evaluated in our study.
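
The layer-transfer idea can be illustrated with a toy two-layer network in plain numpy: weights pretrained on the source are copied into the target model, the transferred hidden layer is frozen, and only the output layer is fine-tuned on the small target set. All shapes, data, and hyperparameters below are illustrative, not the study's.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))  # clipped for stability

# "Source" model: one hidden layer, pretrained elsewhere (random stand-in here)
W1_src = rng.normal(size=(7 * 23, 16))   # input: flattened 7 x 23 encoding
W2_src = rng.normal(size=(16, 1))

# Transfer: copy both layers, then fine-tune only W2 on target data
W1, W2 = W1_src.copy(), W2_src.copy()    # W1 stays frozen below

X = rng.normal(size=(250, 7 * 23))       # bootstrapped target set, n = 250
y = rng.integers(0, 2, size=(250, 1)).astype(float)

lr = 0.1
for _ in range(200):
    h = np.tanh(X @ W1)                  # frozen transferred representation
    p = sigmoid(h @ W2)
    W2 -= lr * h.T @ (p - y) / len(X)    # gradient step on output layer only
```

Freezing the transferred layer is what lets a 250-sample target set be fitted without overwriting the representation learned on the large source.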

Table 2.

Minority- and majority-class distribution, and class imbalance ratio, for the bootstrapped target datasets (sample size 250) used in our experiments.
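
The class imbalance ratio reported here can be computed as the majority-to-minority count ratio. The helper below is illustrative, and the 10/240 split is a made-up example, not a value from the table.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Majority-to-minority class ratio; labels are 0 (inactive) / 1 (active)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# e.g. 10 active and 240 inactive off-target sites in a 250-sample bootstrap
labels = [1] * 10 + [0] * 240
ratio = imbalance_ratio(labels)   # -> 24.0
```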

Table 3.

Average Estimated Similarities (1 - Normalized Average Distances) between the three source datasets (CD33, CIRCLE, and SITE) and the seven bootstrapped target datasets (CD33_BS, CIRCLE_BS, SITE_BS, Tasi_GUIDE_BS, Listgarten_GUIDE_BS, Kleinstiver_GUIDE_BS, and Hmg_BS) calculated using cosine, Euclidean, and Manhattan distances.

Similarity values corresponding to the most suitable source-target dataset pairs are highlighted in bold.

Fig 4.

Bar plot representation of Average Estimated Similarities (1 - Normalized Average Distances).

Similarities between the three source datasets (CD33, CIRCLE, and SITE) and the seven bootstrapped target datasets (CD33_BS, CIRCLE_BS, SITE_BS, Tasi_GUIDE_BS, Listgarten_GUIDE_BS, Kleinstiver_GUIDE_BS, and Hmg_BS) were assessed using the cosine, Euclidean, and Manhattan distances.

Fig 5.

ROC curves for model evaluation.

ROC curves for models trained on: (A) CD33 dataset, (B) CIRCLE dataset, and (C) SITE dataset, used as sources, and evaluated on their respective bootstrapped targets. The AUC ROC values for each model are displayed in descending order within each figure.
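
The AUC-ROC metric underlying these curves can be computed directly from score pairs as the probability that a randomly chosen active off-target outscores a randomly chosen inactive one (ties count half). The sketch below is illustrative, not the study's evaluation code.

```python
def auc_roc(y_true, scores):
    """Rank-based AUC: P(score of a random positive > score of a random negative)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect ranking of actives above inactives yields AUC 1.0
assert auc_roc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]) == 1.0
```

This rank formulation is why AUC-ROC is a natural metric for the heavily imbalanced off-target datasets: it is insensitive to the classification threshold and to class prevalence.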

Fig 6.

ROC curves for model evaluation.

ROC curves for models trained on the CD33 dataset, used as source, and six bootstrapped datasets: CIRCLE_BS, SITE_BS, Tasi_GUIDE_BS, Listgarten_GUIDE_BS, Kleinstiver_GUIDE_BS, Hmg_BS, used as targets. The AUC ROC values for each model are displayed in descending order within the figure.

Table 4.

Performance metrics for each considered classification model obtained using the CD33 dataset for training (i.e. as source).

Target datasets exhibiting the highest similarity to the CD33 dataset are marked with an asterisk. The results of the top-performing models are highlighted in bold.

Fig 7.

ROC curves for model evaluation.

ROC curves for models trained on the CIRCLE dataset, used as source, and six bootstrapped datasets: CD33_BS, SITE_BS, Tasi_GUIDE_BS, Listgarten_GUIDE_BS, Kleinstiver_GUIDE_BS, Hmg_BS, used as targets. The AUC ROC values for each model are displayed in descending order within the figure.

Table 5.

Performance metrics for each considered classification model obtained using the CIRCLE dataset for training (i.e. as source).

Target datasets exhibiting the highest similarity to the CIRCLE dataset are marked with an asterisk. The results of the top-performing models are highlighted in bold.

Fig 8.

ROC curves for model evaluation.

ROC curves for models trained on the SITE dataset, used as source, and six bootstrapped datasets: CD33_BS, CIRCLE_BS, Tasi_GUIDE_BS, Listgarten_GUIDE_BS, Kleinstiver_GUIDE_BS, Hmg_BS, used as targets. The AUC ROC values for each model are displayed in descending order within the figure.

Table 6.

Performance metrics for each considered classification model obtained using the SITE dataset for training (i.e. as source).

Target datasets exhibiting the highest similarity to the SITE dataset are marked with an asterisk. The results of the top-performing models are highlighted in bold.
