Genomic evolution of SARS-CoV-2 delta variants pre- and post-omicron emergence using alignment-free machine learning models

  • Sathish Sankar,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Supervision, Validation, Writing – original draft, Writing – review & editing

    Affiliation Department of Microbiology, Center for Infectious Diseases, Saveetha Dental College and Hospitals, Saveetha Institute of Medical and Technical Sciences, Saveetha University, Chennai, Tamil Nadu, India

  • Kaushika Anandharaman ,

    Contributed equally to this work with: Kaushika Anandharaman, Pradeesh Selvam

    Roles Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft

    Affiliation Department of Artificial Intelligence and Machine Learning, Saveetha Engineering College (Affiliated to Anna University), Chennai, India

  • Pradeesh Selvam ,

    Contributed equally to this work with: Kaushika Anandharaman, Pradeesh Selvam

    Roles Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Visualization, Writing – original draft

    Affiliation Department of Artificial Intelligence and Machine Learning, Saveetha Engineering College (Affiliated to Anna University), Chennai, India

  • Aswini Jayaraman,

    Roles Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Supervision, Validation, Visualization, Writing – original draft

    Affiliation Department of Artificial Intelligence and Machine Learning, Saveetha Engineering College (Affiliated to Anna University), Chennai, India

  • Deepak Jayakumar,

    Roles Conceptualization, Formal analysis, Project administration, Resources, Supervision, Validation, Writing – original draft

    Affiliation State Public Health Laboratory, Directorate of Public Health and Preventive Medicine, DMS Campus, Teynampet, Chennai, Tamil Nadu, India

  • Pachamuthu Balakrishnan,

    Roles Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Validation, Writing – original draft

    Affiliation Centre for Global Health Research, Meenakshi Academy of Higher Education and Research (MAHER), Chennai, India

  • Marie Larsson ,

    Roles Conceptualization, Formal analysis, Funding acquisition, Project administration, Resources, Software, Supervision, Visualization, Writing – original draft, Writing – review & editing

    marie.larsson@liu.se (ML); shankarem@cutn.ac.in (EMS)

    Affiliation Division of Molecular Medicine and Virology, Department of Biomedical and Clinical Sciences, Linköping University, Linköping, Sweden

  • Vijayakumar Velu,

    Roles Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Visualization, Writing – review & editing

    Affiliation Department of Pathology and Laboratory Medicine, Emory University School of Medicine, Division of Microbiology and Immunology, Emory National Primate Research Center, Emory Vaccine Center, Atlanta, Georgia, United States of America,

  • Sivadoss Raju,

    Roles Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Visualization, Writing – review & editing

    Affiliation Centre for Global Health Research, Meenakshi Academy of Higher Education and Research (MAHER), Chennai, India

  • Esaki M. Shankar

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Validation, Writing – original draft, Writing – review & editing

    marie.larsson@liu.se (ML); shankarem@cutn.ac.in (EMS)

    Affiliation Infection and Inflammation, Department of Biotechnology, Central University of Tamil Nadu, Thiruvarur, Tamil Nadu, India

Abstract

The SARS-CoV-2 Delta variant (B.1.617.2), initially classified as a variant of concern due to its enhanced transmissibility and vaccine-escape mutations, underwent further genomic changes following the emergence of the Omicron variant (B.1.1.529). This study investigates the genomic differences in Delta variant spike gene sequences collected before and after the emergence of Omicron. A total of 190 sequences were analyzed using an alignment-free approach incorporating k-mer-based feature extraction and machine learning models, including convolutional neural networks (CNN), K-means clustering, and random forest classification. The random forest model achieved 93% accuracy, with significant F1 scores, effectively distinguishing the two Delta variant groups. Comparative analysis revealed 157 persistent mutations and four vanished mutations in the post-Omicron group. Cluster analysis showed notable shifts, indicating stable yet evolving genomic patterns over time. The study demonstrates the advantage of alignment-free methods in detecting subtle sequence variations that alignment-based approaches may overlook. These findings enhance our understanding of SARS-CoV-2 evolution and provide a framework for identifying key genomic signatures relevant to public health. The methodology and insights gained offer potential applications in variant surveillance, vaccine design, and viral evolutionary studies, supporting preparedness for future SARS-CoV-2 variant emergence.

1 Introduction

WHO estimates suggest that the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has caused over 777 million cases and more than 7 million deaths globally [1]. Active, extensive surveillance of COVID-19 has been discontinued in view of the prevailing epidemiological situation, although sentinel surveillance has been integrated with the surveillance of other respiratory viruses [1]. Community-level surveillance now focuses primarily on wastewater systems as part of the public health surveillance strategy [2]. Monitoring of variants, together with environmental surveillance, is urgently warranted for future pandemic preparedness and to understand viral behavior. The Delta variant (B.1.617.2), first detected in India, dominated the pandemic from mid-March 2021 until it was replaced by Omicron (B.1.1.529). First detected in Botswana, Hong Kong, and South Africa, the Omicron variant rose rapidly in frequency from December 2021 to March 2022 and persisted thereafter [3].

Genomic comparison of the Delta and Omicron variants indicated significant mutational changes in the spike (S) protein, with subsequent changes in its stability and binding to the ACE2 host receptor protein [4]. The Delta variant mainly influenced immune responses to the antigenic regions of the receptor-binding domain (RBD) of the spike protein, whereas the Omicron variant, with 30 different mutations in the spike protein, was associated with high transmissibility and vaccine-induced immune evasion [5,6]. However, the dynamics of transmission and infectivity of the Delta variant before and after the emergence of Omicron were distinct; the former was highly infectious with high rates of transmissibility, and the latter was relatively poor in these attributes. The underlying genomic signatures and the virological and clinical significance of this difference remain largely unclear [4,7,8].

Genome sequences are traditionally analyzed using alignment-based tools to detect mutations, measure the level of conservation and evolutionary distances, and assess the function of genes and proteins [9]. Viral genomes are comparatively small, yet minimal changes can result in major functional changes in the protein [10], which are often overlooked by alignment-based tools. These tools have further limitations: they do not account for domain rearrangements of proteins, have low sequence-alignment accuracy, especially among closely related genomes, and are poorly suited to large-scale whole-genome sequence data because they are memory- and time-intensive [11]. In contrast, alignment-free sequence comparison methods, which are based on the length and informational features of the sequences, such as base composition and quantitative distribution, have several advantages [12]. They are suitable for large-scale whole-genome sequence data, cheaper, faster, and unaffected by genome recombination events. The use of k-mer-based machine learning methods to compare SARS-CoV-2 sequences and analyze variation for epidemic surveillance has been reported previously [13–15]. In this study, we utilized the Delta variant whole-genome sequence data obtained as part of the national sentinel surveillance program. We used k-mers and t-SNE clustering and applied a random forest classifier and a convolutional neural network to compare sequences of Delta variants before and after the emergence of Omicron.

2 Materials and methods

2.1 Study setting and design

This observational study was conducted at the State Public Health Laboratory (Directorate of Public Health, Government of Tamil Nadu, India), which is a nodal centre for the state SARS-CoV-2 variant analysis. Since 01/03/2021, whole-genome sequencing (WGS) of SARS-CoV-2 has been conducted prospectively as part of the national and state-level genomic surveillance programs. Samples were randomly selected from the initial real-time RT-PCR–positive cases with cycle threshold (Ct) values below 25. Sequencing was performed directly from the original clinical samples using the Ion Torrent platform (Thermo Fisher Scientific, Waltham, MA, USA), following the manufacturer’s protocol. Base calling, adapter trimming, and quality control were carried out using the integrated pipeline in Ion Reporter™ Software. The processed reads were aligned to the reference genome, Wuhan-Hu-1 (GenBank accession number MN908947.3), using the IRMA assembler plugin within the Ion Reporter™ Software (Thermo Fisher Scientific, Carlsbad, CA, USA). The sequences were submitted to GenBank, and accession numbers were obtained.

For this study, the sequences of Delta variants identified between 01/03/2021 and 31/12/2021 and those identified between 01/01/2022 and 31/12/2023 were retrospectively collected from GenBank and considered as Delta variants before and after Omicron emergence, respectively. The sequences used in our study are listed in the S1 File. All patient information was completely anonymized before it was accessed. The study was approved by the Institutional Ethics Committee of the Madras Medical College (EC No. 03092021).

2.2 Dataset description

A total of 190 genome sequences were selected, and complete S protein gene sequences were analyzed. The dataset was split into two classes: sequences collected before the emergence of the Omicron variant, labeled “before Omicron”, and sequences collected after its emergence, labeled “after Omicron”. All DNA sequence data were prepared in FASTA file format for further analysis. The average length of the sequences was 3813 nucleotides. From the datasets, information regarding the strain variant, the length of the sequence, and the class label was retrieved. A total of 164 sequences were labeled “before Omicron” and 26 “after Omicron”. Because of this imbalance between the two classes, a weighted cross-entropy loss function and random sampling were used.
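As an illustration of how class weighting can offset the 164-versus-26 imbalance, the following minimal sketch computes inverse-frequency class weights and a weighted binary cross-entropy. The weighting scheme (n_samples / (n_classes × class_count)) and the function names are assumptions made for illustration; the exact weights used in the study are not stated.

```python
import math

def inverse_frequency_weights(class_counts):
    """Return weights n_samples / (n_classes * count) for each class (assumed scheme)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

def weighted_binary_cross_entropy(y_true, y_prob, weights, eps=1e-12):
    """Mean binary cross-entropy with a per-class weight applied to each sample."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += weights[y] * loss
    return total / len(y_true)

# Class counts reported in the text: 0 = "before Omicron", 1 = "after Omicron".
weights = inverse_frequency_weights({0: 164, 1: 26})
```

Under this scheme, an error on the minority “after Omicron” class costs roughly six times as much as an error on the majority class.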

The workflow, shown in Fig 1, comprises three color-coded processing stages. In the “data input and pre-processing” stage, raw sequence data are cleaned, standardized, and encoded into numerical representations suitable for computational analysis. In the “feature extraction and exploratory data analysis” stage, k-mer frequencies are calculated alongside various statistical analyses, with pathways for both dimensionality reduction and correlation heatmaps to identify significant patterns. The machine learning stage runs two classifiers in parallel, a convolutional neural network (CNN) for sequence classification and a random forest, and concludes with performance metrics, model interpretability assessments, and a comprehensive analysis [16]. The pipeline thus integrates sequence preprocessing, feature engineering, and machine learning to handle high-dimensional sequence data while preserving biological interpretability.

2.3 Dataset preprocessing

The sequences were parsed using the SeqIO.parse function from Biopython (v1.84) in a Python 3 environment to read the raw FASTA files, converting the genomic data into iterable sequence records for subsequent encoding. Following sequence parsing, the data were structured into a Pandas (v2.2.2) DataFrame, where sequence data and their corresponding temporal labels (Pre-Omicron vs. Post-Omicron) were organized as discrete features, enabling efficient vectorized operations for k-mer counting and one-hot encoding. Initially, the sequences were brought to the same length using sequence trimming and sequence padding. With the former, the ends of sequences longer than the shortest sequence were removed. With the latter, sequences were padded with ‘N’ characters to a fixed length of 3813 nucleotides, corresponding to the largest sequence length observed. Several pre-processing techniques were applied to transform DNA sequences into numerical formats suitable for DNA classification. This step was crucial, as machine learning and deep learning models require input data in numerical form.
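The trimming and padding step can be sketched as follows. This is a minimal illustration, not the study's code: the function name is hypothetical, and upper-casing the input and padding on the right-hand side are assumptions.

```python
def pad_or_trim(seq, length, pad_char="N"):
    """Trim the end of a sequence to `length`, or right-pad it with `pad_char`."""
    seq = seq.upper()  # assumption: normalize case before encoding
    if len(seq) >= length:
        return seq[:length]
    return seq + pad_char * (length - len(seq))

TARGET_LENGTH = 3813  # fixed length reported in the text

# Short toy sequence padded to 12 characters for readability.
example = pad_or_trim("atgtttgtt", 12)
```

In practice the same function would be applied to every parsed record with `length=TARGET_LENGTH`, so that all inputs share one shape.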

Since DNA sequences are composed of categorical nucleotide data, they were first encoded numerically for effective classification. In this study, two encoding approaches, one-hot encoding and k-mer encoding, were utilized. Individual nucleotides can be encoded in two ways: label encoding assigns a unique numerical value to each nucleotide in the DNA sequence (A – 0, C – 0.25, G – 0.5, T – 0.75), whereas one-hot encoding represents each nucleotide with binary digits (A – 0001, C – 0010, G – 0100, T – 1000). In either case, the DNA sequence is converted into a sequence of numbers.

A “k-mer” refers to a sequence of “k” consecutive nucleotides (e.g., for k = 3, the k-mers of “AGCT” are “AGC” and “GCT”). The entire sequence was fragmented into k-mers, and every k-mer was encoded into numerical values using the encoders mentioned above. The number of unique k-mers depends on the value of k and the number of unique letters in the sequence. To determine the optimal k-mer length, we conducted a sensitivity analysis across a range of k values (k = 2–30). k = 3 achieved the peak classification accuracy of 92.98% with the minimum computational time (1.57 seconds). Although k = 4 and k = 5 yielded identical accuracy, k = 3 was selected as the most efficient parameter because it provides a lower-dimensional feature space that avoids the sparsity issues often associated with larger k-mer sizes. Furthermore, increasing k beyond 5 led to a noticeable decline in model performance and increased training latency, suggesting that 3-mer frequencies (representative of codon-level information) provide the most robust signal for differentiating the datasets. The k value was therefore set to 3; since DNA sequences consist of 4 characters, the number of unique k-mers was 4^3 = 64.

In the ensemble classification approach, each classifier casts a vote, and the label receiving the highest number of votes is assigned to the sample, as illustrated in Figs 2A and 2B.
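The k-mer and one-hot encodings described in this section can be sketched in a few lines. The function names are hypothetical, and two choices are assumptions rather than details from the study: k-mers containing the padding character ‘N’ are skipped, and ‘N’ is one-hot encoded as all zeros.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"
# One-hot codes as listed in the text (A - 0001, C - 0010, G - 0100, T - 1000).
ONE_HOT = {"A": (0, 0, 0, 1), "C": (0, 0, 1, 0),
           "G": (0, 1, 0, 0), "T": (1, 0, 0, 0)}

def kmers(seq, k=3):
    """Return all overlapping k-mers of a sequence (sliding window, step 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmer_frequencies(seq, k=3):
    """Normalized frequency of each of the 4**k possible k-mers."""
    vocab = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Assumption: k-mers containing non-ACGT characters (e.g. 'N') are ignored.
    counts = Counter(km for km in kmers(seq, k) if set(km) <= set(ALPHABET))
    total = sum(counts.values()) or 1
    return {km: counts[km] / total for km in vocab}

def one_hot(seq):
    """Encode a sequence as one 4-bit tuple per nucleotide; 'N' maps to zeros."""
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in seq]
```

For k = 3 the feature vector always has 64 entries regardless of sequence length, which is what makes the representation alignment-free.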

Fig 2. Machine Learning Model Architectures and Workflow for Genome Classification.

(A) k-means clustering algorithm workflow (B) Schematic architecture of the random forest classification model (C) Architecture of the 1-D convolutional neural network for whole-genome sequence classification. (D) Layer-wise structural summary of the 1-D CNN model configuration.

https://doi.org/10.1371/journal.pone.0345259.g002

2.4 Exploratory data analysis

The preprocessed data were further analyzed for underlying patterns using various tools. The k-mer frequency in each DNA sequence was calculated using the CountVectorizer class from the Scikit-learn library, and these frequencies were used as features in the machine learning models. Pairwise correlations between k-mer frequencies were plotted as a heatmap to analyze the relationships between k-mers.
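The correlation analysis can be illustrated with a dependency-free sketch that computes pairwise Pearson correlations between k-mer frequency columns. Pearson correlation is an assumption here (the text does not name the coefficient), and the frequency values are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def correlation_matrix(columns):
    """Pairwise Pearson correlations between named feature columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical k-mer frequencies for four sequences (illustrative values only).
toy = {"AAA": [0.10, 0.20, 0.30, 0.40],
       "AAC": [0.40, 0.30, 0.20, 0.10],
       "ACG": [0.25, 0.25, 0.25, 0.25]}
corr = correlation_matrix(toy)
```

Here “AAA” and “AAC” are perfectly anti-correlated, which would appear as the strongest negative cell in a heatmap.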

2.5 T-distributed stochastic neighbor embedding

t-SNE was applied for dimensionality reduction of high-dimensional genomic feature vectors to visualize sample clustering in two- and three-dimensional space. The dimensionality reduction to two components compressed the data’s inherent structure, bringing the clusters closer together while maintaining their distinctiveness. Both clusters exhibit remarkably similar internal cohesion, with Class 0 showing a dispersion of 11.41 and Class 1 showing 11.95, indicating consistent within-class variability. The variance is almost equally distributed between the two components, with Component 1 accounting for 51.79% and Component 2 for 48.21% of the total point distribution. This balanced contribution from both dimensions indicated that the 2D representation effectively captured the underlying structure of the data, with neither component dominating the visualization, making this a reliable low-dimensional representation for further analytical interpretation.
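Summary statistics of this kind can be computed directly from the embedding coordinates. The sketch below is illustrative only: the coordinates are invented, and “dispersion” is assumed to mean the mean Euclidean distance of points to their class centroid, a definition the text does not give.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def dispersion(points):
    """Mean Euclidean distance of points to their centroid (assumed definition)."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

# Hypothetical 2D embedding coordinates for two classes.
class0 = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
class1 = [(4.0, 5.0), (6.0, 5.0), (4.0, 7.0), (6.0, 7.0)]
centroid_distance = math.dist(centroid(class0), centroid(class1))
```

Because t-SNE distances depend on perplexity and random initialization, such values are best compared only within a single embedding run.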

2.6 Machine learning models

Random forest and 1D CNN models were used to analyze genomic sequences. The random forest classifier generated predictions using multiple decision trees and majority voting. The 1D CNN model processed full-length sequences through convolutional layers to learn discriminative patterns. Both models were trained and evaluated on designated datasets.

2.7 K-means clustering

K-means clustering, an unsupervised machine learning method, was implemented to detect shifts of clusters over time and to identify genomic sequences with specific mutation patterns. It groups sequences into K clusters based on similarity: the algorithm iteratively assigns samples to the nearest centroid and updates the cluster centers until stable groupings are obtained [17,18].
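The assign-and-update loop can be written compactly. This is a minimal sketch of Lloyd's algorithm under stated assumptions: the initial centroids are supplied by the caller (the study's initialization and value of K are not given), and the toy points are invented.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: assign to nearest centroid, update, repeat until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point.
        labels = [min(range(len(centroids)),
                      key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(x) / len(members)
                                           for x in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # centroids no longer move: groupings are stable
        centroids = new_centroids
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

In the study's setting each point would be a 64-dimensional 3-mer frequency vector rather than a 2D coordinate.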

2.8 Random forest classifier

For classification, a random forest model was implemented as an ensemble learning approach. The algorithm constructs multiple decision trees using bootstrap-sampled subsets of the training data and random subsets of features at each split. Each tree independently predicts a class label for a given input sample. Final classification was determined using majority voting across all trees. This procedure was applied to assign labels to previously unlabeled samples.
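Two ingredients of this procedure, bootstrap sampling and majority voting, can be sketched without any machine learning library. The function names and the example votes are hypothetical; tree construction itself is omitted.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as done for each tree."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed, for reproducibility of the sketch only
sample = bootstrap_sample(list(range(10)), rng)

# Hypothetical votes from five trees for one sequence.
votes = ["before Omicron", "after Omicron", "before Omicron",
         "before Omicron", "after Omicron"]
label = majority_vote(votes)
```

With three of five votes, the sequence is assigned to “before Omicron”; an odd number of trees avoids ties in a two-class problem.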

In this study, the random forest model was trained on 80% of the dataset and tested on the remaining 20%. The model functions as an ensemble of N independent decision trees, each built from a random subset of the data and features, which ensures diversity among the trees. By iteratively testing values of N (number of estimators) ranging from 10 to 5,000, we determined that N = 1000 provided optimal performance, ensuring a sufficient diversity of trees for robust majority voting without incurring unnecessary computational or memory overhead. Each branch within these trees represents a binary split based on the normalized frequency of a specific 3-mer pattern. By traversing these branches, the model partitions the genomic data into increasingly homogeneous subsets, eventually reaching a leaf node that assigns the sequence to either the before or after Omicron emergence group. Each decision tree in the forest splits nodes on the feature and threshold that best separate the data, typically using criteria such as Gini impurity or entropy in classification tasks and mean squared error (MSE) in regression tasks. Gini impurity and entropy both measure the quality of a split by evaluating how “pure” a node is, that is, how mixed the classes within it are. A lower value for either metric indicates a more homogeneous (pure) node.

2.9 Gini impurity calculation

Gini impurity, the probability that a sample is misclassified when it is labeled at random according to the distribution of classes within a node, was calculated using the formula:

$$G = 1 - \sum_{i=1}^{C} p_i^{2}$$

where C is the number of classes, and p_i is the proportion of samples that belong to class i at the node.

2.10 Entropy calculation

To quantify the amount of uncertainty in a node, entropy was calculated using the formula:

$$H = -\sum_{i=1}^{C} p_i \log_2 p_i$$

where H is the entropy, C is the number of classes, and p_i is the probability of class i in the node. Entropy is 0 for a pure node and reaches its maximum when all classes are equally probable, meaning the node is most uncertain.
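Both split criteria are simple functions of the class counts at a node. The sketch below uses the standard definitions, G = 1 − Σ p_i² and H = −Σ p_i log2 p_i; the base-2 logarithm is an assumption, and the function names are hypothetical.

```python
import math

def class_proportions(counts):
    """Proportion of each non-empty class at a node."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]

def gini_impurity(counts):
    """G = 1 - sum_i p_i^2 for the class counts at a node."""
    return 1.0 - sum(p * p for p in class_proportions(counts))

def entropy(counts):
    """H = -sum_i p_i * log2(p_i); empty classes contribute zero."""
    return -sum(p * math.log2(p) for p in class_proportions(counts))

# Root node holding the full dataset described in the text (164 vs 26).
root_gini = gini_impurity([164, 26])
root_entropy = entropy([164, 26])
```

For two classes, Gini impurity peaks at 0.5 and entropy at 1.0 for an even split, so the imbalanced root node starts well below both maxima.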

2.11 1-D convolutional neural network

In addition to using a random forest to identify critical k-mers that distinguish genome sequence classes, we implemented a 1-D CNN to operate directly on individual nucleotides rather than k-mers. Fig 2C represents the architecture of a 1-D CNN model. Neural networks are made of layers and nodes, where each layer is designed to perform a specific task(s). The CNN consisted of convolutional layers where each layer performed the convolution operation on its input data. A convolution layer computes its output by convolving the input with its filter weights, adding a bias β, and passing each result through an activation function. In 1D convolution, for N number of kernels or filters, a stride of one, and a dilation rate of one, the ith hidden unit hi can be expressed as:

Here, f [.] corresponds to the activation function, βi is the bias, and w and x are the weights and data points, respectively. In other words, a 1D CNN slides a filter along one dimension, computing the dot product of the weight and input data, and creates the output feature map of the convolution layer. The model included the input layer, convolutional layers, a flatten layer, a dense layer, and an output layer. The network had a total of 687,731 trainable parameters. The model summary is shown in Fig 2D.
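The sliding dot product can be illustrated for a single filter. This sketch computes h[i] = f(β + Σ_j w[j] · x[i + j]) with a stride of one and a dilation rate of one; the ReLU activation, the absence of padding, the zero-based indexing, and all numeric values are assumptions made for illustration.

```python
def relu(z):
    """Rectified linear unit, used here as an assumed activation function."""
    return max(0.0, z)

def conv1d(x, weights, bias, activation=relu):
    """Single-filter 1D convolution, stride 1, dilation 1, no padding.

    h[i] = activation(bias + sum_j weights[j] * x[i + j])
    """
    k = len(weights)
    return [activation(bias + sum(w * x[i + j] for j, w in enumerate(weights)))
            for i in range(len(x) - k + 1)]

# Hypothetical input signal and filter weights.
signal = [1.0, 2.0, 3.0, 4.0, 5.0]
feature_map = conv1d(signal, weights=[-1.0, 0.0, 1.0], bias=0.5)
```

An input of length 5 and a filter of length 3 yield a feature map of length 3; as in most deep learning frameworks, the filter is not flipped, so the operation is strictly a cross-correlation.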

In this study, TensorFlow (v2.x) was utilized to construct the Deep Learning pipeline. The model architecture begins with a Reshape layer to organize the input genome sequences into a three-dimensional format suitable for convolution. Subsequently, three 1D convolutional layers, each followed by max-pooling, extract hierarchical features from the sequence while reducing dimensionality. A Flatten layer then converts the resulting feature maps into a one-dimensional vector. This vector is fed into two Dense layers for high-level feature combination, interspersed with a Dropout layer to mitigate overfitting. Finally, a single-node output layer provides the classification outcome. The combination of convolution, pooling, and dense layers yields an end-to-end architecture optimized for genomic data. The model was compiled using the Adam optimizer, which adaptively updates network weights by computing individual learning rates for each parameter based on the first and second moments of the gradients. This approach helps the network converge more quickly and reliably compared to standard stochastic gradient descent. A binary cross-entropy loss function was employed to measure the discrepancy between predicted probabilities and actual labels, making it well-suited for binary classification tasks.

The training dynamics of the 1D-CNN revealed a non-monotonic learning trajectory characteristic of deep representation learning in high-dimensional genomic data. The model was trained for a maximum of 100 epochs, with an initial learning rate of 1e-3 and a scheduler configured to decay the rate to a minimum of 1e-6. The model underwent an initial phase of high variance (Epochs 1–15), in which the validation loss fluctuated substantially (peaking at 0.71). This volatility suggests that the model was exploring the optimization landscape and escaping sharp local minima rather than converging prematurely on suboptimal solutions. Following this phase, the model entered a stabilization phase in which the validation loss decreased and plateaued, consistent with a shift from memorizing noise to learning robust, generalizable genomic motifs. The ReduceLROnPlateau callback monitored validation performance and reduced the learning rate when progress plateaued, allowing the model to fine-tune its parameters and avoid suboptimal convergence. Although the local minima observed in early epochs lacked stability, Epoch 33 was identified via early stopping as the optimal convergence point. At this epoch, the model reached a training loss of 0.598 while maintaining a stable validation loss, so that the classification relied on learned structural features rather than stochastic fluctuations.

2.12 Metrics

2.12.1 Accuracy.

Accuracy measures the proportion of correctly classified instances among all predictions, providing an overall gauge of a model’s performance. Accuracy is calculated using the following formula:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

True positives (TP) refer to instances in which the model correctly identifies the positive class, while true negatives (TN) indicate cases where the model accurately predicts the negative class. False positives (FP) occur when the model incorrectly classifies a negative instance as positive, and false negatives (FN) arise when a positive instance is mistakenly categorized as negative. Together, these values provide a comprehensive picture of classification performance.

2.12.2 F1 score.

The F1 score is the harmonic mean of precision and recall, offering a balance between these two complementary metrics. It provides an informative single value that captures both how precise the model’s positive predictions are and how well the model detects all actual positives [19,20]. As such, it is particularly useful in applications where one aims to balance false positives and false negatives, especially in the presence of class imbalance. The F1 score is defined as:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Precision and recall are two key metrics used to evaluate a classifier’s performance on a positive class. Precision measures the proportion of correctly identified positives among all instances labeled as positive, reflecting how reliable the model’s positive predictions are. Formally, it is given by

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall, on the other hand, captures the proportion of actual positives that the model successfully detects. It is calculated as

$$\text{Recall} = \frac{TP}{TP + FN}$$

Together, precision and recall provide complementary insights into a model’s accuracy in identifying positive instances.
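The four metrics in this section follow directly from the confusion-matrix counts. The sketch below uses hypothetical counts, not the study's results, chosen to show how accuracy and F1 can diverge when the positive class is rare.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical confusion-matrix counts (not the study's results).
metrics = classification_metrics(tp=8, tn=80, fp=2, fn=10)
```

With these counts the accuracy is 0.88 while the F1 score is only about 0.57, because more than half of the actual positives are missed.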

2.12.3 SHapley Additive exPlanations.

SHAP (SHapley Additive exPlanations) is a method grounded in cooperative game theory that provides individualized explanations for model predictions. It assigns each feature a “Shapley value” indicating the contribution of that feature to the prediction relative to a baseline [21]. For a model f and a set of features F, the SHAP value φ_i for feature i is computed as a weighted average of the difference in model predictions with and without the feature, over all possible feature subsets S ⊆ F \ {i}. Formally:

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f(S \cup \{i\}) - f(S) \right]$$

where |F| is the total number of features, and f(S) denotes the model’s prediction when only the features in S are present.

SHAP values are based on a rigorous game-theoretic foundation, and uphold desirable properties such as local accuracy, consistency, and missingness. SHAP was used to generate both global and local explanations: globally, by aggregating contributions across many samples to understand how features influence overall model behavior, and locally, by explaining a single instance’s prediction in terms of its feature contributions.
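For a handful of features, Shapley values can be computed exactly by enumerating every subset, which makes the definition concrete. The sketch below is a brute-force illustration with a hypothetical additive “model”; it is not how SHAP was run in the study, and its cost grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values by enumerating every subset S of the other features.

    `value(S)` must return the model output when only features in S are present.
    """
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|F| - |S| - 1)! / |F|! for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi

# Hypothetical additive "model" over three 3-mers (illustrative only).
contrib = {"AAA": 0.5, "ACG": -0.2, "TTT": 0.1}
phi = shapley_values(list(contrib), lambda s: sum(contrib[f] for f in s))
```

For an additive model each Shapley value equals the feature's own contribution, and the values sum to the difference between the full prediction and the empty-set baseline, which illustrates the local accuracy property.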

2.13 Receiver operating characteristic

The ROC curve was used to evaluate the performance of binary classification models by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various decision thresholds. The TPR represents the proportion of actual positives correctly identified, while the FPR measures how many negative instances are incorrectly classified as positive. By adjusting the classification threshold, from classifying almost everything as positive to classifying almost everything as negative, the ROC curve was plotted to reveal the trade-off between detecting more true positives and avoiding false positives. A model whose curve hugs the top-left corner of the plot (high TPR, low FPR) is considered superior, as it consistently distinguishes between classes across a range of thresholds. The Area Under the Curve (AUC) condenses this information into a single metric, with higher values (closer to 1.0) signifying better overall discrimination.
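The threshold sweep and the area computation can be sketched as follows. The labels and scores are invented for illustration, the trapezoidal rule is one common way to compute the AUC, and the functions assume that both classes are present in the labels.

```python
def roc_curve_points(y_true, scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold over the scores."""
    pos = sum(y_true)            # number of actual positives (must be > 0)
    neg = len(y_true) - pos      # number of actual negatives (must be > 0)
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
roc = roc_curve_points(y_true, scores)
area = auc(roc)
```

The curve starts at (0, 0), where nothing is classified as positive, and ends at (1, 1), where everything is; one negative scored above one positive lowers the area from 1.0 to 0.75 in this example.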

3 Results

SARS-CoV-2 Delta variant spike gene sequences identified during the 2021 and 2022 outbreaks were investigated for genomic differences between strains identified before and after the emergence of Omicron. The sequences were analyzed using statistical methods and machine learning models, and the models were fitted to the dataset to maximize accuracy. The primary goal of the study was to identify differences between the DNA sequences of the two groups. The machine learning models were built to classify the sequences and then to retrieve the features that contribute most to the classification. A total of 157 point mutations were present in both groups, suggesting that stable changes in the genomic sequence had accumulated over time. The comparison of point mutations identified four mutations that were present in the “before Omicron” sequences but absent from the “after Omicron” sequences.

Fig 3A shows the list of k-mers used in this study. The frequency of k-mers was used as a feature in machine learning models. The frequency distribution of every k-mer across the dataset is shown in Fig 3B. The correlation heatmap of the k-mers is shown in Fig 3C, where blue denotes a negative correlation and red a positive correlation. GC content denotes the proportion of guanine (G) and cytosine (C) nucleotides within a DNA or RNA sequence and serves as a critical parameter in genomic analysis. The GC content of the sequences was analyzed, and the GC percentage was plotted for “before Omicron” (Fig 3D) and “after Omicron” (Fig 3E). Both plots show that the mean GC content remains consistent between the before Omicron and after Omicron sequences, suggesting no significant shift in overall GC composition. This implies that any observed differences between the two classes are likely driven by other genomic features or sequence variations rather than by changes in GC content alone.

Fig 3. K-mer–Based Feature Analysis and Clustering of Delta Variant Genomes Pre- and Post-Omicron Emergence.

(A) Comprehensive set of all possible 3-mer sequence combinations used for feature extraction. (B) Global frequency distribution of k-mers across the entire sequence dataset. (C) Correlation heatmap depicting interrelationships among k-mer frequencies. (D) GC content distribution in Delta variant genomes prior to Omicron emergence. (E) GC content distribution in Delta variant genomes following Omicron emergence. (F) Two-dimensional t-SNE projection illustrating clustering patterns of DNA sequences. (G) Three-dimensional t-SNE visualization demonstrating class separation in sequence data. (H) K-means clustering of Delta variant sequences before Omicron emergence. (I) Cluster structure revealed by K-means partitioning of pre-Omicron Delta genomes. (J) Random forest–derived feature importance scores highlighting key predictive k-mers.

https://doi.org/10.1371/journal.pone.0345259.g003

The 2D t-SNE visualization (Fig 3F) revealed a moderate separation between the two classes, with a centroid distance of 7.49 units, considerably smaller than that observed in the 3D representation. In 3D space (Fig 3G), the data showed distinct clustering patterns with meaningful separation between classes. The distance matrix showed a separation of 256.72 units between cluster centroids, indicating well-differentiated groupings despite some internal variation. Both clusters displayed relatively high dispersion values (350.43 for Class 0 and 369.13 for Class 1). The variance was distributed fairly evenly across the three components, with Component 3 capturing the highest proportion (40.45%), followed by Component 2 (34.94%) and Component 1 (24.61%). This balanced distribution indicated that the three-dimensional representation captured the relationships in the data across all dimensions rather than being dominated by a single component.

The smaller separation in 2D illustrates a compression effect: the additional dimension in 3D space allowed a more expansive distribution of data points, potentially revealing subtleties in genetic relationships that become condensed in 2D.
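The separation and dispersion figures quoted above can be derived from embedded coordinates roughly as follows. The text does not define dispersion, so this sketch assumes it is the mean Euclidean distance from each point to its class centroid. The coordinates are hypothetical and are not values from the study.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def dispersion(points):
    """Mean Euclidean distance from each point to its class centroid (assumed definition)."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


# Hypothetical embedded coordinates, not values from the study.
class0 = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
class1 = [(10.0, 1.0, 0.0), (12.0, 1.0, 0.0), (11.0, 0.0, 0.0), (11.0, 2.0, 0.0)]

separation = math.dist(centroid(class0), centroid(class1))
print(separation, dispersion(class0), dispersion(class1))
```

Because t-SNE coordinates depend on perplexity and initialization, such distances are most informative when compared within a single embedding.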

The K-means cluster analysis indicated largely stable evolution with some significant changes over time (Fig 3H and 3I): some mutations remained within the same cluster, whereas others were assigned to different clusters in the “after Omicron” Delta variant sequences. Clusters for the before-Omicron Delta variants are shown in Fig 3H, and clusters for the after-Omicron Delta variants are shown in Fig 3I.

We identified 157 ‘persistent’ mutations present in both the before-Omicron and after-Omicron groups of Delta variants, and four ‘vanished’ mutations (C2011G, C643T, T776A, C179T) that were absent from the after-Omicron group. Cluster shifts attributable to genomic feature differences were identified both between and within clusters (Fig 3H and Fig 3I); the counts for each cluster pair were (4,4): 29; (2,2): 33; (4,0): 14; (3,3): 32; (1,0): 3; (1,4): 12; (1,3): 3; and (0,1): 38.
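The persistent/vanished split and the cluster-pair counts can be expressed with set operations and a counter, as sketched below. This assumes that each pair is read as (cluster before, cluster after), which the text does not state explicitly. Only the four vanished mutation labels come from the text; `mut_a`, `mut_b`, and the cluster assignments are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical mutation sets; only the four vanished labels come from the text.
before = {"C2011G", "C643T", "T776A", "C179T", "mut_a", "mut_b"}
after = {"mut_a", "mut_b"}

persistent = before & after  # observed in both temporal groups
vanished = before - after    # observed before Omicron only

# Hypothetical cluster labels for the same items in each K-means run.
cluster_before = {"mut_a": 4, "mut_b": 1}
cluster_after = {"mut_a": 4, "mut_b": 0}

# Count (cluster before, cluster after) pairs; equal labels mean no shift.
shifts = Counter((cluster_before[m], cluster_after[m]) for m in sorted(persistent))
print(sorted(vanished), dict(shifts))
```

K-means labels are arbitrary across independent runs, so a comparison of this kind is only meaningful when the cluster identities in the two runs have been matched.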

The random forest classifier was trained on 80% of the dataset with 1000 trees (estimators), achieving 93% accuracy and F1 scores of 96% and 75% for the two classes. The high accuracy indicates robust overall performance, while the lower F1 score for one class is consistent with the class imbalance in the dataset. Furthermore, the default random forest feature importance calculation (typically based on Gini impurity) was used to quantify the contribution of each k-mer to the classification, and these importances were plotted to visualize the most influential features (Fig 3J).
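How a high overall accuracy can coexist with a lower F1 score for one class under class imbalance can be seen from a small worked example. The confusion matrix below is hypothetical and is not taken from the study; the helper names are illustrative.

```python
def per_class_f1(matrix):
    """matrix[i][j] = number of samples with true class i predicted as class j."""
    scores = []
    for c in range(len(matrix)):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                    # true class c, predicted otherwise
        fp = sum(row[c] for row in matrix) - tp     # predicted c, true class otherwise
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical imbalanced test set: 33 majority-class and 5 minority-class samples.
cm = [[32, 1],
      [2, 3]]
f1 = per_class_f1(cm)
print(round(accuracy(cm), 3), [round(s, 3) for s in f1])
```

With few minority-class samples, each misclassification changes that class's F1 score substantially while barely affecting overall accuracy.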

Feature importance was also calculated using SHAP (SHapley Additive exPlanations) values. A positive SHAP value indicates that the feature increases the prediction (e.g., pushes the model towards the positive class in classification), whereas a negative SHAP value indicates that the feature decreases the prediction (e.g., pushes the model away from the positive class). As shown in Fig 4A, the k-mers “CAT”, “ACA”, and “CTA” had the highest SHAP values, whereas “CAT”, “CGC”, and “ACA” had the highest importance according to the random forest classifier.
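The sign convention described above follows from the definition of Shapley values, which can be computed exactly for a small number of features. The sketch below is not the SHAP library: it replaces features outside a coalition with a single baseline value, a simplification of the background averaging that SHAP performs, and the scoring function and inputs are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition take their baseline value (simplifying assumption).
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(f):
    """Hypothetical score over three k-mer frequencies (CAT, ACA, CTA)."""
    return 2.0 * f[0] - 1.0 * f[1] + 0.5 * f[0] * f[2]


x = [0.04, 0.03, 0.02]         # hypothetical sample
baseline = [0.02, 0.02, 0.02]  # hypothetical reference point
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for the sample and the prediction for the baseline, so a positive value raises the prediction relative to the reference and a negative value lowers it.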

Fig 4. Model Interpretation, Motif Discovery, and Classification Performance Analysis.

(A) Comparative k-mer importance profiles derived from SHAP attribution and random forest metrics. (B) Genome-wide positional SHAP attribution landscape showing nucleotide-level contributions to model predictions. (C) Most prevalent sequence motif in pre-Omicron Delta genomes with frequency counts. (D) Most prevalent sequence motif in post-Omicron Delta genomes with frequency counts. (E) Dataset-wide discriminative motif patterns learned by the 1-D CNN model. (F) Receiver operating characteristic curve of the random forest model illustrating threshold-dependent classification performance. (G) Receiver operating characteristic curve of the 1-D CNN model depicting sensitivity–specificity trade-off.

https://doi.org/10.1371/journal.pone.0345259.g004

The custom 1-D CNN model took an input of shape (3813, 1) and consisted of multiple 1-D convolutional layers followed by dense layers to classify the sequences. The 1-D CNN model produced an accuracy of 77% and F1 scores of 86% and 36%, as shown in Table 1. This approach provided a more granular view of the sequence, as convolutional filters captured local motifs and patterns in an end-to-end manner without explicit feature engineering. Comparing the random forest’s k-mer-based feature importance with the 1-D CNN’s learned representations made it possible to evaluate differences in predictive performance and interpretability. Because the 1-D CNN model took the entire sequence as input without converting it into k-mers, SHAP values were calculated for each position of the sequence. The positional SHAP values are shown in Fig 4B.
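An input of shape (3813, 1) implies one numeric value per nucleotide position. The text does not state the encoding, so the sketch below assumes a simple integer code (A=1, C=2, G=3, T=4) with 0 for padding or ambiguous bases; the mapping and the function name are hypothetical.

```python
SEQ_LEN = 3813  # input length reported for the 1-D CNN
CODES = {"A": 1, "C": 2, "G": 3, "T": 4}  # assumed mapping; 0 = padding or ambiguous base


def encode(seq: str, length: int = SEQ_LEN) -> list[list[int]]:
    """Integer-encode a sequence, then truncate or zero-pad it to shape (length, 1)."""
    codes = [CODES.get(base, 0) for base in seq.upper()[:length]]
    codes += [0] * (length - len(codes))
    return [[c] for c in codes]


encoded = encode("ATGTTTGTTTTTCTTGTT")  # hypothetical fragment, not study data
print(len(encoded), len(encoded[0]))
```

Keeping one row per position is what allows attributions such as positional SHAP values to be mapped back to individual nucleotides.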

Table 1. Comparison of the random forest model and the 1-D CNN model in classifying the genome sequences.

https://doi.org/10.1371/journal.pone.0345259.t001

The most commonly occurring pattern in the “before Omicron” Delta variant sequences, along with its count, is shown in Fig 4C, and the most commonly occurring pattern in the “after Omicron” sequences, along with its count, is shown in Fig 4D. The important patterns in the DNA sequences were recorded based on the 1-D CNN model; these patterns distinguish the two classes. Fig 4E shows the important patterns in the entire dataset. Receiver operating characteristic (ROC) curves of the random forest classifier and the 1-D CNN model are shown in Figs 4F and 4G, respectively. The ROC curve is a graphical representation used to evaluate the performance of a binary classification model. It plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings, showing the trade-off between sensitivity (TPR) and specificity (1 – FPR) across thresholds. A curve closer to the top-left corner indicates better performance.
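The construction of an ROC curve described above can be sketched in a few lines. The labels and scores below are hypothetical, and the area under the curve is computed with the trapezoidal rule; the sketch assumes that both classes are present.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = positive class) and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
pts = roc_points(y_true, scores)
print(pts, auc(pts))
```

Lowering the threshold moves along the curve from the origin towards (1, 1); a classifier that ranks every positive above every negative reaches the top-left corner.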

4 Discussion

The diversity of RNA viruses and their ability to mutate constantly during evolution make them difficult to understand fully and complicate preparedness for future pandemics [22]. The availability of next-generation sequencing data has provided new opportunities and achievements in the study of viral evolution and ecology, as well as challenges in handling the huge volume of raw sequence data [23]. Viral sequences are regularly monitored for genomic and proteomic data analysis, such as mutational and structural changes, using available bioinformatic tools. Such analyses are useful for studying phenotypic features, drug discovery, epidemiological investigations, identification of new variants, and their clinical implications [21,24,25]. However, alignment-based methods do not provide sufficient information due to their inherent limitations [26].

Previous studies on SARS-CoV-2 spike gene sequence classification have used a k-mer-based approach to train models that identify a specific variant [27]. Because k-mer extraction does not depend on sequence homology, it is considered more effective for classifying both distant and closely related strains and for large sequencing datasets, and it offers high computational speed, memory efficiency, and biological functionality. In our study, the trained model had high accuracy because effective features of the sequences were extracted. The model was trained to capture the temporal trend among viral sequences of a specific variant that emerged at different periods.

Alignment-free sequence analyses are used to study the evolution of microorganisms and cis-regulatory molecules, and to analyze meta information such as next-generation sequencing data [28,29]. Viral sequences have been analyzed using an ML approach through mapping of the sequence in a feature space, followed by data processing using ML techniques. These methods have been shown to be highly valuable in classifying different viruses, including dengue viruses, HIV-1, influenza A, hepatitis B, and hepatitis C [30–32].

GC content variations can reveal structural and functional regions within genomes, such as promoter sites or regions prone to mutation, aiding in both annotation and comparative genomics studies [33]. The SHAP values provide a way to explain the contribution of each feature to a specific prediction, making models more transparent and understandable, especially in complex models like deep learning [34,35]. The 1D CNN approach provided a more granular view of the sequence, as convolutional filters captured local motifs and patterns in an end-to-end manner without explicit feature engineering. By comparing the random forest’s k–mer–based feature importance with the 1-D CNN’s learned representations, evaluation of differences in predictive performance and interpretability was possible. Elevated GC content is typically associated with enhanced stability of the nucleic acid duplex, owing to the stronger triple hydrogen bonding between G and C bases, in contrast to the double hydrogen bonds formed between adenine (A) and thymine (T). This property can influence the melting temperature of DNA and the efficiency of PCR, making it a crucial factor in experimental design [36].

All sequences included in this study were confirmed to belong exclusively to the B.1.617.2 lineage based on their original annotations and verification prior to analysis, with no AY.* sublineages present. Thus, the “before Omicron” and “after Omicron” groups represent samples from different time periods within the same lineage, rather than different Delta sublineages. The classification differences detected by the k-mer-based models therefore reflect subtle genomic changes over time within B.1.617.2, rather than shifts in lineage assignment.

Following the identification of specific mutations present in the “before Omicron” class and those that disappeared in the “after Omicron” class, we trained a random forest classifier with 1000 trees (estimators). The model achieved a high overall accuracy of 93%, reflecting its strong predictive capability in distinguishing between the two temporal variant classes. The F1 score for the majority class was 96%, indicating excellent precision and recall, while the minority class yielded an F1 score of 75%, which is still acceptable given the class imbalance. These results underscore the robustness of the classifier in capturing subtle mutational patterns associated with variant transitions. Additionally, we extracted the importance score of each k-mer feature using the built-in feature ranking method of the random forest algorithm [12], enabling interpretability and highlighting the most influential k-mers contributing to classification performance. This provides valuable insight into evolutionary shifts in the viral genome, particularly during the emergence of Omicron and its impact on the mutational landscape.

In this study, the sequences were analyzed using statistical methods and machine-learning models. The models were trained and fitted to the dataset to maximize accuracy. The machine learning models were built to classify the sequences and then retrieve the features that contribute significantly to the classification. The temporal trend analysis identified 157 persistent point mutations shared by both groups of data, suggesting that the genomic sequence remained largely stable over time. The comparison of point mutations displayed four significant mutations that vanished in the “after Omicron” sequence data. The K-means cluster analysis indicated stable evolution with some significant changes over time, with some mutations remaining in the same cluster and others assigned to different clusters in the “after Omicron” Delta variant sequences. The k-mer-based supervised classification approach presented in this paper offers several advantages over commonly used software tools for virus subtype classification. Our evaluations on multiple manually curated datasets demonstrate that k-mer classification enables rapid and accurate SARS-CoV-2 subtyping, outperforming many current state-of-the-art methods.

This study has several limitations that should be considered when interpreting the findings. The sample size was modest (n = 190), with a marked class imbalance between pre-Omicron and post-Omicron groups, which may influence classifier generalizability despite the use of weighting strategies. In this study, alignment-free approaches were selected to explore complementary analytical capabilities, particularly for pattern recognition and machine-learning–driven classification, and scalability advantages become relevant primarily in larger datasets. Only spike gene sequences were analyzed rather than complete genomes; therefore, conclusions reflect spike-specific evolutionary patterns and may not represent genome-wide dynamics. The study design was retrospective and observational; thus, associations identified between temporal groups and sequence features should not be interpreted as causal evolutionary drivers. Model evaluation was performed on a single dataset without external validation, which limits inference regarding performance on independent or future datasets. Because k-mer frequencies summarize sequence content, convergent compositional similarity could theoretically yield classification signals unrelated to true evolutionary relatedness. Taken together, these considerations indicate that while the alignment-free machine learning framework used here is effective for detecting discriminative sequence patterns, it does not replace traditional alignment-based genomic analyses. Instead, each approach provides distinct analytical advantages, and integrative strategies combining positional mutation analysis with compositional feature learning may offer the most comprehensive framework for future viral evolutionary studies [37].

5 Conclusions

This study offers valuable insights into the temporal evolution of SARS-CoV-2 Delta variants in the context of the emergence of Omicron. Identification of persistent and vanished mutations among the delta variants, possibly influenced by the emergence of Omicron variants, contributes to the understanding of viral genomic stability and adaptability. The integration of alignment-free methods and machine learning models (e.g., K-means and random forest) provides an effective approach for uncovering subtle yet meaningful genomic changes that may not be apparent through traditional sequence comparisons. The ability to classify variants and identify influential k-mers enhances molecular surveillance capabilities, potentially aiding in early detection of functionally significant mutations. These findings may support public health efforts in tracking the evolution of variants, inform adjustments to vaccine strategies, and guide future studies in viral pathogenesis and evolutionary modelling.

Ethics approval and consent to participate

The protocols involving human participants were approved by the Institutional Ethics Committee of the Madras Medical College (EC No. 03092021).

Supporting information

S1 File. GenBank accession numbers of the sequences analyzed.

https://doi.org/10.1371/journal.pone.0345259.s001

(XLSX)

Acknowledgments

The authors thank all the national and international members of the Infectious Diseases Society of India (IDSI), [https://idsi.org.in/], Chennai, for extending insightful discussions as well as technical and logistic support.

References

  1. Gu X, Watson C, Agrawal U, Whitaker H, Elson WH, Anand S, et al. Postpandemic Sentinel Surveillance of Respiratory Diseases in the Context of the World Health Organization Mosaic Framework: Protocol for a Development and Evaluation Study Involving the English Primary Care Network 2023-2024. JMIR Public Health Surveill. 2024;10:e52047. pmid:38569175
  2. McClary-Gutierrez JS, Mattioli MC, Marcenac P, Silverman AI, Boehm AB, Bibby K, et al. SARS-CoV-2 Wastewater Surveillance for Public Health Action. Emerg Infect Dis. 2021;27(9):1–8. pmid:34424162
  3. Chrysostomou AC, Vrancken B, Haralambous C, Alexandrou M, Aristokleous A, Christodoulou C, et al. Genomic Epidemiology of the SARS-CoV-2 Epidemic in Cyprus from November 2020 to October 2021: The Passage of Waves of Alpha and Delta Variants of Concern. Viruses. 2022;15(1):108. pmid:36680148
  4. Chavda VP, Bezbaruah R, Deka K, Nongrang L, Kalita T. The Delta and Omicron Variants of SARS-CoV-2: What We Know So Far. Vaccines (Basel). 2022;10(11):1926. pmid:36423021
  5. Dubey A, Kumar M, Tufail A. Inhibiting viral entry of bat-derived coronavirus HKU5-CoV-2: Targeting spike protein S1 subunit with FDA-approved antivirals-A structural dynamics and energetics study. Bioorg Chem. 2025;164:108910. pmid:40865231
  6. Mallavarpu Ambrose J, Priya Veeraraghavan V, Kullappan M, Chellapandiyan P, Krishna Mohan S, Manivel VA. Comparison of Immunological Profiles of SARS-CoV-2 Variants in the COVID-19 Pandemic Trends: An Immunoinformatics Approach. Antibiotics (Basel). 2021;10(5):535. pmid:34066389
  7. Selvavinayagam ST, Yong YK, Joseph N, Hemashree K, Tan HY, Zhang Y. Low SARS-CoV-2 viral load among vaccinated individuals infected with Delta B.1.617.2 and Omicron BA.1.1.529 but not with Omicron BA.1.1 and BA.2 variants. Front Public Health. 2022;10.
  8. Selvavinayagam ST, Suvaithenamudhan S, Yong YK, Hemashree K, Rajeshkumar M, Kumaresan A. Genomic surveillance of omicron B.1.1.529 SARS-CoV-2 and its variants between December 2021 and March 2023 in Tamil Nadu, India - a state-wide prospective longitudinal study. J Med Virol. 2024.
  9. Saada B, Zhang T, Siga E, Zhang J, Magalhães Muniz MM. Whole-Genome Alignment: Methods, Challenges, and Future Directions. Appl Sci. 2024;14:4837.
  10. Chatzou M, Magis C, Chang J-M, Kemena C, Bussotti G, Erb I, et al. Multiple sequence alignment modeling: methods and applications. Brief Bioinform. 2016;17(6):1009–23. pmid:26615024
  11. Lebatteux D, Remita AM, Diallo AB. Toward an Alignment-Free Method for Feature Extraction and Accurate Classification of Viral Sequences. J Comput Biol. 2019;26(6):519–35. pmid:31050550
  12. Hemalatha M. A hybrid random forest deep learning classifier empowered edge cloud architecture for COVID-19 and pneumonia detection. Expert Syst Appl. 2022;210:118227. pmid:35880010
  13. Moeckel C, Mareboina M, Konnaris MA, Chan CSY, Mouratidis I, Montgomery A, et al. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J. 2024;23:2289–303. pmid:38840832
  14. Chen S, He C, Li Y, Li Z, Melançon CE. A computational toolset for rapid identification of SARS-CoV-2, other viruses and microorganisms from sequencing data. Brief Bioinform. 2021;22(2):924–35. pmid:33003197
  15. Zhang Y, Wen J, Li X, Li G. Exploration of hosts and transmission traits for SARS-CoV-2 based on the k-mer natural vector. Infect Genet Evol. 2021;93:104933. pmid:34023511
  16. Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, et al. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data. 2021;8(1):53. pmid:33816053
  17. Yaqoob A, Verma NK, Mir MA, Tejani GG, Eisa NHB, Mamoun Hussien Osman H, et al. SGA-Driven feature selection and random forest classification for enhanced breast cancer diagnosis: A comparative study. Sci Rep. 2025;15(1):10944. pmid:40159513
  18. Vaz JM, Balaji S. Convolutional neural networks (CNNs): concepts and applications in pharmacogenomics. Mol Divers. 2021;25(3):1569–84. pmid:34031788
  19. Yu YW, Yorukoglu D, Berger B. Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification. Res Comput Mol Biol. 2014;8394:385–99. pmid:28825060
  20. Orozco-Arias S, Candamil-Cortés MS, Jaimes PA, Piña JS, Tabares-Soto R, Guyot R, et al. K-mer-based machine learning method to classify LTR-retrotransposons in plant genomes. PeerJ. 2021;9:e11456. pmid:34055489
  21. Bifarin OO. Interpretable machine learning with tree-based shapley additive explanations: Application to metabolomics datasets for binary classification. PLoS One. 2023;18(5):e0284315. pmid:37141218
  22. Garcia-Blanco MA, Ooi EE, Sessions OM. RNA viruses, pandemics and anticipatory preparedness. Viruses. 2022;14:2176.
  23. Radford AD, Chapman D, Dixon L, Chantrey J, Darby AC, Hall N. Application of next-generation sequencing technologies in virology. J Gen Virol. 2012;93(Pt 9):1853–68. pmid:22647373
  24. Meher PK, Sahu TK, Rao AR. Identification of species based on DNA barcode using k-mer feature vector and Random forest classifier. Gene. 2016;592(2):316–24. pmid:27393648
  25. Yang Z, Li H, Jia Y, Zheng Y, Meng H, Bao T, et al. Intrinsic laws of k-mer spectra of genome sequences and evolution mechanism of genomes. BMC Evol Biol. 2020;20(1):157. pmid:33228538
  26. Wu DC, Yao J, Ho KS, Lambowitz AM, Wilke CO. Limitations of alignment-free tools in total RNA-seq quantification. BMC Genomics. 2018;19(1):510. pmid:29969991
  27. Zhang Y, Wen J, Li X, Li G. Exploration of hosts and transmission traits for SARS-CoV-2 based on the k-mer natural vector. Infect Genet Evol. 2021;93:104933. pmid:34023511
  28. Ren J, Bai X, Lu YY, Tang K, Wang Y, Reinert G, et al. Alignment-Free Sequence Analysis and Applications. Annu Rev Biomed Data Sci. 2018;1:93–114. pmid:31828235
  29. Chang G, Wang H, Zhang T. A novel alignment-free method for whole genome analysis: Application to HIV-1 subtyping and HEV genotyping. Information Sciences. 2014;279:776–84.
  30. Lebatteux D, Remita AM, Diallo AB. Toward an Alignment-Free Method for Feature Extraction and Accurate Classification of Viral Sequences. J Comput Biol. 2019;26(6):519–35. pmid:31050550
  31. Remita MA, Halioui A, Malick Diouara AA, Daigle B, Kiani G, Diallo AB. A machine learning approach for viral genome classification. BMC Bioinformatics. 2017;18(1):208. pmid:28399797
  32. Solis-Reyes S, Avino M, Poon A, Kari L. An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes. PLoS One. 2018;13(11):e0206409. pmid:30427878
  33. Qiu Y, Kang YM, Korfmann C, Pouyet F, Eckford A, Palazzo AF. The GC-content at the 5’ ends of human protein-coding genes is undergoing mutational decay. Genome Biol. 2024;25(1):219. pmid:39138526
  34. Ponce-Bobadilla AV, Schmitt V, Maier CS, Mensing S, Stodtmann S. Practical guide to SHAP analysis: Explaining supervised machine learning model predictions in drug development. Clin Transl Sci. 2024;17(11):e70056. pmid:39463176
  35. Câmara GBM, Coutinho MGF, Silva LMD da, Gadelha WV do N, Torquato MF, Barbosa R de M, et al. Convolutional Neural Network Applied to SARS-CoV-2 Sequence Classification. Sensors (Basel). 2022;22(15):5730. pmid:35957287
  36. Nakano S, Sugimoto N. Roles of the amino group of purine bases in the thermodynamic stability of DNA base pairing. Molecules. 2014;19(8):11613–27. pmid:25100254
  37. Moneshwaran S, Macrin D, Kanagathara N. An unprecedented global challenge, emerging trends and innovations in the fight against COVID-19: A comprehensive review. Int J Biol Macromol. 2024;267(Pt 1):131324. pmid:38574936