Fig 1.
Schematic representation of the study.
Fig 2.
Machine Learning Model Architectures and Workflow for Genome Classification.
(A) k-means clustering algorithm workflow. (B) Schematic architecture of the random forest classification model. (C) Architecture of the 1-D convolutional neural network for whole-genome sequence classification. (D) Layer-wise structural summary of the 1-D CNN model configuration.
Fig 3.
K-mer–Based Feature Analysis and Clustering of Delta Variant Genomes Pre- and Post-Omicron Emergence.
(A) Comprehensive set of all possible 3-mer sequence combinations used for feature extraction. (B) Global frequency distribution of k-mers across the entire sequence dataset. (C) Correlation heatmap depicting interrelationships among k-mer frequencies. (D) GC content distribution in Delta variant genomes prior to Omicron emergence. (E) GC content distribution in Delta variant genomes following Omicron emergence. (F) Two-dimensional t-SNE projection illustrating clustering patterns of DNA sequences. (G) Three-dimensional t-SNE visualization demonstrating class separation in sequence data. (H) k-means clustering of Delta variant sequences before Omicron emergence. (I) Cluster structure revealed by k-means partitioning of pre-Omicron Delta genomes. (J) Random forest–derived feature importance scores highlighting key predictive k-mers.
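The 3-mer vocabulary in panel A and the GC content shown in panels D and E can be illustrated with a minimal Python sketch. The function names (`all_kmers`, `kmer_frequencies`, `gc_content`) are hypothetical, and skipping windows that contain ambiguous bases such as N is an assumption made here, not a description of the study's actual pipeline.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous bases are counted


def all_kmers(k=3):
    """Enumerate every possible k-mer over A, C, G, T (4**k strings)."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def kmer_frequencies(seq, k=3):
    """Relative frequency of each k-mer using overlapping windows.

    Windows containing any character outside ACGT are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocab = all_kmers(k)
    total = sum(counts[m] for m in vocab)
    return {m: (counts[m] / total if total else 0.0) for m in vocab}


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in ALPHABET)
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0
```

For k = 3 the vocabulary has 4^3 = 64 entries, so each genome maps to a fixed-length 64-dimensional frequency vector regardless of sequence length.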
Fig 4.
Model Interpretation, Motif Discovery, and Classification Performance Analysis.
(A) Comparative k-mer importance profiles derived from SHAP attribution and random forest metrics. (B) Genome-wide positional SHAP attribution landscape showing nucleotide-level contributions to model predictions. (C) Most prevalent sequence motif in pre-Omicron Delta genomes with frequency counts. (D) Most prevalent sequence motif in post-Omicron Delta genomes with frequency counts. (E) Dataset-wide discriminative motif patterns learned by the 1-D CNN model. (F) Receiver operating characteristic curve of the random forest model illustrating threshold-dependent classification performance. (G) Receiver operating characteristic curve of the 1-D CNN model depicting sensitivity–specificity trade-off.
Table 1.
Comparison of the random forest model and the 1-D CNN model in classifying the genome sequences.