Table 1.
Performance of different models applying different coding schemes on two datasets.
Table 2.
(A) Models M1 to M3 are derivatives of the original model, each omitting one of three critical components. (B) Models M4 to M8 are designed with increasing levels of complexity, each composed of different complex modules. All of these models were evaluated on the K562 and D9 datasets.
Fig 1.
Comparison of the CRISPR-MCA model with six existing off-target prediction models on mismatches-only datasets.
A,B Performance of the seven models on the fusion datasets D1 (Hek293t) and D2 (K562), where A is PR_AUC and B is ROC_AUC. C,D Performance of the seven models on dataset D4. E,F Performance of the seven models on dataset D6.
Fig 2.
Comparison of the CRISPR-MCA model with six existing off-target prediction models on datasets that include both indels and mismatches.
A,B Performance of the four models that support indels and mismatches on dataset D8. C,D Performance of the four models on dataset D9.
Table 3.
The level of class imbalance within the dataset after applying the ESB rebalancing strategy.
Fig 3.
A Feature heatmaps of the new samples generated for the severely imbalanced D5 dataset, showing the positions and types of nucleotides substituted in the generated positive gRNA-target DNA samples. B Validation of positive off-target sites in dataset D5 predicted by the rebalanced CRISPR-MCA. The outermost layer is the real sample, the middle layer is the prediction after applying the ESB rebalancing strategy, and the innermost layer is the original prediction. Incorrect and correct predictions are indicated in red and green, respectively.
Table 4.
Assessment of six class rebalancing strategies with CRISPR-MCA on two severely imbalanced datasets.
The table presents average outcomes obtained through 5-fold cross-validation.
Table 5.
Comparison of predictions from six existing models after implementing the ESB strategy.
Table 6.
Details of the two types of datasets utilized in the experiments and analyses.
Table 7.
Analysis of the degree of imbalance between positive and negative samples in the datasets, where Total is the number of all samples in the dataset, IR is the Imbalance Ratio, CVIR is the Coefficient of Variation of the Imbalance Ratio, and IE is the Information Entropy.
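As an illustration of two of the measures named in Table 7, the following is a minimal sketch. The caption does not give formulas, so the definitions here are assumptions: IR is taken as the majority-to-minority sample count ratio, and IE as the Shannon entropy (in bits) of the positive/negative class distribution. CVIR is left out because the caption does not say over which groups the variation is computed. The function names are hypothetical.

```python
import math


def imbalance_ratio(n_pos: int, n_neg: int) -> float:
    """Assumed definition of IR: majority count divided by minority count."""
    if min(n_pos, n_neg) <= 0:
        raise ValueError("both classes need at least one sample")
    return max(n_pos, n_neg) / min(n_pos, n_neg)


def class_entropy(n_pos: int, n_neg: int) -> float:
    """Assumed definition of IE: Shannon entropy (bits) of the class distribution."""
    total = n_pos + n_neg
    if total <= 0:
        raise ValueError("dataset is empty")
    entropy = 0.0
    for count in (n_pos, n_neg):
        if count > 0:  # a class with no samples contributes nothing
            p = count / total
            entropy -= p * math.log2(p)
    return entropy
```

Under these assumed definitions, a perfectly balanced dataset has IR = 1 and IE = 1 bit, and IE falls towards 0 as the imbalance grows.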
Fig 4.
The heatmaps illustrate the types of base mismatches occurring at various positions across the four datasets.
Fig 5.
The frequency of mismatches at varying positions within gRNA-target DNA sequences across multiple datasets.
Fig 6.
A Example encoding of a gRNA-target DNA sequence pair. The first five channels are base channels, which convert the bases in the sequence into one-hot vectors: adenine is encoded as [1, 0, 0, 0, 0], guanine as [0, 1, 0, 0, 0], cytosine as [0, 0, 1, 0, 0], thymine as [0, 0, 0, 1, 0], and indels, indicated by underscores (_), as [0, 0, 0, 0, 1]. The last two channels are direction channels, which mark the direction of the bases at positions where a mismatch occurs. B Architecture of CRISPR-MCA. The input is a 24×7 matrix derived from the encoded gRNA-target DNA sequence. This matrix is processed by a multiscale convolutional layer that extracts sequence features. The output of the Multi-CNN layers undergoes positional encoding before entering the Multi-Head Self-Attention layer for further sequence analysis. The result is then merged with earlier inputs and passed through three dense layers with 256, 128, and 2 neurons, respectively. The final layer uses a softmax activation function to yield the binary classification.
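The seven-channel encoding described in panel A can be sketched as follows. The channel order (A, G, C, T, _) comes from the caption. Two details are assumptions, since the caption does not spell them out: the base channels of the gRNA and the target DNA are combined with a logical OR, and the two direction channels record which of the two mismatched bases comes first in the channel order. The function name `encode_pair` is hypothetical.

```python
import numpy as np

BASES = "AGCT_"  # channel order given in the caption; "_" marks an indel


def encode_pair(grna: str, dna: str) -> np.ndarray:
    """Encode an aligned gRNA/target-DNA pair as a (length, 7) 0/1 matrix."""
    if len(grna) != len(dna):
        raise ValueError("sequences must be aligned to the same length")
    matrix = np.zeros((len(grna), 7), dtype=np.int8)
    for pos, (g, d) in enumerate(zip(grna.upper(), dna.upper())):
        g_idx, d_idx = BASES.index(g), BASES.index(d)
        # Base channels: assumed OR of the two one-hot vectors.
        matrix[pos, g_idx] = 1
        matrix[pos, d_idx] = 1
        if g_idx != d_idx:
            # Direction channels: assumed rule, not taken from the source.
            # Channel 5 is set if the gRNA base precedes the DNA base in
            # BASES, channel 6 otherwise; both stay 0 for a match.
            matrix[pos, 5 if g_idx < d_idx else 6] = 1
    return matrix
```

For a pair of aligned 24-character sequences, this yields the 24×7 input matrix mentioned in panel B. A matched position has a single 1 in the base channels and zeros in the direction channels, while a mismatched position has two 1s in the base channels and one in the direction channels.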
Fig 7.
The Efficiency and Specificity-Based class rebalancing strategy comprises two phases: initial mutation screening of gRNAs and subsequent specificity assessment of high-efficiency mutants.
In the first phase, each nucleotide within a gRNA is substituted with each of the three alternative nucleotides, and the targeting efficiencies of the mutated gRNAs are evaluated. Mutants that are more efficient than the original gRNA are selected for further analysis. The second phase identifies potential off-target sites across the genome using Cas-OFFinder and then calculates the specificity of each mutant gRNA’s interaction with the target DNA sequence. Mutants whose specificity exceeds that of the original sequence are adopted as enhanced positive samples for subsequent training.
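The two-phase selection described above can be sketched as follows. This is an illustrative outline, not the authors' implementation: the scoring functions `efficiency` and `specificity` are hypothetical callables supplied by the caller (in the described workflow, the specificity score would depend on off-target sites found with Cas-OFFinder, which is not invoked here), and the sketch assumes the caller passes only the part of the gRNA that should be mutated.

```python
from typing import Callable, List

NUCLEOTIDES = "ACGT"


def single_substitution_mutants(grna: str) -> List[str]:
    """Replace each position with each of the three alternative nucleotides."""
    mutants = []
    for pos, base in enumerate(grna):
        for alt in NUCLEOTIDES:
            if alt != base:
                mutants.append(grna[:pos] + alt + grna[pos + 1:])
    return mutants


def esb_select(
    grna: str,
    efficiency: Callable[[str], float],   # hypothetical efficiency scorer
    specificity: Callable[[str], float],  # hypothetical specificity scorer
) -> List[str]:
    """Return mutants that beat the original gRNA on both scores."""
    # Phase 1: keep mutants with higher targeting efficiency than the original.
    base_efficiency = efficiency(grna)
    efficient = [
        m for m in single_substitution_mutants(grna)
        if efficiency(m) > base_efficiency
    ]
    # Phase 2: of those, keep mutants with higher specificity than the original.
    base_specificity = specificity(grna)
    return [m for m in efficient if specificity(m) > base_specificity]
```

A sequence of length n produces 3n single-substitution mutants, so a 20-nt gRNA gives 60 candidates before filtering.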
Table 8.
Details of the selected comparison models, specifically their support for ‘Mismatches’ and ‘Indels’ within datasets.
Dataset types that a model can predict are labeled ‘Supported’, and those that it cannot predict are labeled ‘Unsupported’.