RASCL: Rapid Assessment of Selection in CLades through molecular sequence analysis

An important unmet need revealed by the COVID-19 pandemic is the near-real-time identification of potentially fitness-altering mutations within rapidly growing SARS-CoV-2 lineages. Although powerful molecular sequence analysis methods are available to detect and characterize patterns of natural selection within modestly sized gene-sequence datasets, the computational complexity of these methods and their sensitivity to sequencing errors render them effectively inapplicable in large-scale genomic surveillance contexts. Motivated by the need to analyze new lineage evolution in near-real time using large numbers of genomes, we developed the Rapid Assessment of Selection within CLades (RASCL) pipeline. RASCL applies state of the art phylogenetic comparative methods to evaluate selective processes acting at individual codon sites and across whole genes. RASCL is scalable and produces automatically updated regular lineage-specific selection analysis reports: even for lineages that include tens or hundreds of thousands of sampled genome sequences. Key to this performance is (i) generation of automatically subsampled high quality datasets of gene/ORF sequences drawn from a selected “query” viral lineage; (ii) contextualization of these query sequences in codon alignments that include high-quality “background” sequences representative of global SARS-CoV-2 diversity; and (iii) the extensive parallelization of a suite of computationally intensive selection analysis tests. Within hours of being deployed to analyze a novel rapidly growing lineage of interest, RASCL will begin yielding JavaScript Object Notation (JSON)-formatted reports that can be either imported into third-party analysis software or explored in standard web-browsers using the premade RASCL interactive data visualization dashboard. By enabling the rapid detection of genome sites evolving under different selective regimes, RASCL is well-suited for near-real-time monitoring of the population-level selective processes that will likely underlie the emergence of future variants of concern in measurably evolving pathogens with extensive genomic surveillance.

Introduction Rapid characterization and assessment of clade-specific mutations that are found in persistent or rapidly expanding SARS-CoV-2 lineages have become an important component of efforts to monitor and manage the COVID19 pandemic. Identifying the most relevant mutations, e.g., those likely to impact transmission or immune escape, is a priority when new lineages are discovered, as these are key to assessing a lineage's potential threat-level. If observed mutations have not been previously characterized, computational and laboratory-based analytical approaches to inferring whether the mutations provide transmission or immune escape advantages are generally too slow to inform early public health responses.
Epidemiologically relevant mutations are likely subject to natural selection because they provide a fitness advantage [1]. Such mutations can be identified by detecting the subtle patterns of nucleotide variation within gene-sequence datasets that are indicative of selective processes. There are a multitude of powerful computational techniques that, given sufficiently informative sequence data, can identify individual codons within genes that are evolving under a range of different selective regimes [2]. Hundreds of papers and preprints have used some of the methods implemented in HyPhy [3] and Datamonkey [4] for SARS-CoV-2 selection analyses e.g., [5][6][7][8], but on small, commonly hand-curated, datasets. This is because codon-based selection analyses do not scale well to more than a few hundred sequences when using the generic out-of-the-box versions of these analyses, and because noisy sequencing data (errors in assembled consensus genomes) can drive false positives.
We developed RASCL to standardize and accelerate comparative selection detection analyses of SARS-CoV-2 variants of interest (VOI) or variants of concern (VOC). The tool has been used to study selective forces which, at least in-part, drove the emergence of the Alpha, Beta, Gamma, and Omicron VOCs [9][10][11][12]. More broadly, a tool like RASCL enables near-real-time monitoring of emergent lineages, which in turn can be used both to detect potentially adaptive mutations before they rise to high frequencies, and to help establish relationships between individual mutations and key viral characteristics including pathogenicity, transmissibility, immune evasiveness and drug resistance [13][14][15][16][17]. Through routine analysis of patterns of ongoing selection within individual major lineages, we can reveal the variants or circulating sub-lineages that carry potentially concerning fitness-enhancing mutations, and which would therefore most likely drive future viral transmission [18].

RASCL application overview
The "query" set of whole genome sequences is compared against a diverse set of "background" sequences, chosen to represent globally circulating SARS-CoV-2 sequences (throughout the pandemic), and the query data set is the set of sequences which are the target of selection analyses (e.g., BA.5 clade sequences). Our background dataset is available at https://github.com/ veg/RASCL/tree/main/data/ReferenceSetViPR and was assembled from Virus Pathogen Database and Analysis Resource (ViPR, viprbrc.org) [19], a curated database of publicly available viral pathogen sequences, assemblies, and genome annotations (S1 File). The inclusion of the background dataset also provides an "outgroup" clade, enabling the study of selection on branches basal to the clade of interest (COI), and to discover regions under selection pressure which are unique to the COI. The application uses several open-source tools, as well as selection analysis modules from the HyPhy software package and assembles the results from the analysis into JSON files, which can then be visualized with our full-featured ObservableHQ [20] notebook.

Map and compress
Specifically, RASCL takes as input (i) a "query" dataset comprising a single FASTA file containing unaligned SARS-CoV-2 full or partial genomes belonging to a clade of interest (e.g. all sequences from the PANGO [17,18] lineage B.1.617.2) and (ii) a generic "background" dataset that might comprise, for example, a set of sequences that are representative of global SARS-CoV-2 genomic diversity, e.g. those assembled from ViPR. It is not necessary to remove sequences in the query dataset that are duplicated in the background dataset-the pipeline will do this automatically.
The choice of query and background datasets is analysis-specific. For example, if another clade of interest is provided as a background, it is possible to directly identify the sites that are evolving differentially between the two clades. Other sensible choices of query sequences might be sequences from a specific country/region, or sequences sampled during a particular time-period. Note that the analysis does not require the two sets to be reciprocally monophyletic, but in many applications, this will be the case. Following the automated mapping of whole genome datasets into individual coding sequences (based on the NCBI reference annotation), the gene datasets (each containing a set of query and background sequences) are processed in parallel.

Prepare for selection analysis
Using complete linkage distance clustering implemented in the TN93 package (tn93-cluster tool, https://github.com/veg/tn93), RASCL subsamples from available sequences while maintaining overall genomic diversity; the clustering threshold distance is chosen automatically to include no more than a user-specified number of genomes "D" (e.g., 300). In the Results section below we discuss our recent analyses of several SARS-CoV-2 clades and while other available subsampling methods for SARS-CoV-2 genomes exist which rely on spatiotemporal distributions [21], our method relies on increasing sequence diversity to enhance evolutionary signal for downstream selection analyses, while reducing computational complexity. Core method implementations in HyPhy can handle up to 25,000 subsampled sequences in a reasonable time, but the computational cost increases rapidly, and for faster turnaround 1000 sequences are the recommended setting. Following dataset compression, RASCL creates a combined (query and background) alignment with only the sequences that are divergent enough to be useful for subsequent selection analyses. Inference of a maximum likelihood phylogenetic tree with RAxML-NG, [22], or IQ-TREE, [23] is performed on the merged dataset and the query and background branches of this tree are labeled as Query or Background; internal branches of the tree are labeled using maximum parsimony.

Selection analyses
Selection analyses are performed with state-of-the-art molecular evolution [24] models implemented in HyPhy. To partially mitigate the potentially confounding influences of within-host evolution [25,26], where mutations occurring within an individual have not been filtered by selection at the broader population-level, and sequencing errors, selection analyses are performed only on the internal branches of phylogenetic trees, where at least one or more rounds of virus transmission are captured [27]. The following individual selection tests are applied to each gene-level alignment of merged query and background sequences.
• Branch-site Unrestricted Statistical Test for Episodic Diversification with Synonymous rate variation (BUSTED[S]): this method tests for gene-wide selection which is either pervasive (occurring throughout the evolutionary tree) or episodic (occurring only on some lineages). [28].
• Single-Likelihood Ancestor Counting (SLAC), uses a combination of maximum-likelihood and counting approaches to infer pervasive selection through nonsynonymous (dN) and synonymous (dS) substitution rates on a per-site basis for a given coding alignment and corresponding phylogeny. We use the results from SLAC to create substitution mapping of genomic sites and selection analysis results across methods [29].
• Coevolution detection using Bayesian Graphical Models (BGM): this method identifies groups of sites that might be co-evolving using the joint distribution of substitutions [30].
• Fixed Effects Likelihood (FEL): this method locates codon sites within a gene with evidence of pervasive positive diversifying or negative selection by inferring nonsynonymous (dN) and synonymous (dS) substitution rates on a per-site basis for a given coding alignment and corresponding phylogeny [29].
• Mixed Effects Model of Evolution (MEME): a more sensitive analysis as compared to FEL, this method locates codon sites with evidence of episodic positive diversifying selection, [31].
• Relaxed Selection (RELAX): compares gene-wide selection pressure and looks for evidence that the strength of selection has been relaxed (or intensified) between the query clade and background sequences [32].
• Contrast-FEL: comparison of site-by-site selection pressure between query and background sequences to detect evidence indicative of different selective regimes [33].
• A FUBAR [34] Approach to Directional Selection (FADE): this method identifies aminoacid sites with evidence of directional selection [35]. FUBAR refers to our previously published method Fast, Unconstrained Bayesian AppRoximation for Inferring Selection.
• FitMultiModel (FMM): this method identifies genes with complex multiple instantaneous substitutions that occur within a codon, a rare but potent source of evolutionary signal. [36].

Visualization and downstream post-hoc analyses
Results are combined using a Python script "generate-report.py" into two machine-readable JSON files ("summary.json" and "annotation.json") that represent detailed analysis results for gene segments and individual sites, respectively. JSON is an open standard text-based file format which is also human-readable. It is well-suited for representing structured data and is commonly used for transmitting data in web applications. These JSON files can either be used as input for other software, or visualized within a standard web-browser via a feature-rich interactive RASCL dashboard hosted on ObservableHQ (Fig 2). For the web application implementation of RASCL, alignments, trees and analysis results are stored and made web-accessible via the Galaxy platform. Results are visualized with interactive notebooks hosted on ObservableHQ (Fig 2) that include an alignment viewer, a visualization of individual codons/ amino acid states at user-selected sites mapped onto the tips of phylogenetic trees, and detailed tabulated information on analysis results for individual genes and codon-sites.

Results and discussion
RASCL uses molecular sequence data from genotypically distinct viral lineages to identify distinguishing features and evolution within lineages. A query set of sequences is compared against a globally diverse set of background sequences. The background data set typically contains globally circulating viral sequences, and the query data set is the set of sequences of particular interest to the user. Below, we describe our analyses of several variants of SARS-CoV-2 whole genome sequences, but RASCL is applicable to any measurably evolving pathogen with sufficient surveillance data.

An overview of molecular surveillance of important SARS-CoV-2 viral clades
As a concrete example of the utility of RASCL consider our analysis of 112,017 BA.1 (WHO Omicron, all available BA.1 sequences as of January 2, 2022) sequences. RASCL selected a For each gene we then extract the representative gene diversity using genetic distance clustering. (2) is where we prepare our gene alignments for selection analysis. We accomplish this by merging alignments from the background and query datasets into a "combined" dataset. From this, we infer a phylogenetic tree and annotate it based on query and background sequences.

PLOS ONE
compression ratio of 99.17%. This level of compression is representative of recent analyses of a few VOIs/VOCs (see Table 1) with RASCL. Importantly, there is evidence of diversifying positive selection acting on the BA.1 sequences, on 42 (p�0.0001) individual sites (there are 4312 sites that are polymorphic in the amino-acid space among clade sequences; selection also includes the basal branch of the clade) along the internal branches of the clade (S1 Table). There is evidence of diversifying positive selection acting on the BA.1 sequences, on 359 individual sites along all branches of the clade (S2 Table), with 40 sites with an LRT p-value of � 0.0001. When comparing the strength of selection on BA.1 to background sequences along internal tree branches, 21 individual sites along the internal branches of the clade (S3 Table) showed statistically significant differences. Over the entire tree, 31 sites demonstrate evidence of directional selection (S4 Table). Along the internal branches of the BA.1 clade, 81 pairs of sites showed evidence of coevolution (S5 Table). Along the internal branches of the BA.1 clade, 47 sites showed evidence of negative selection (S6 Table). Along the internal branches of the BA.1 clade, 15 (out of 21 segments considered 71.4%) genes/ORFs showed evidence of episodic diversifying selection (BUSTED, qvalue � 0.1, S7 Table).

A closer examination of the SARS-CoV-2 BA.5 clade
We investigated the nature and extent of selective forces acting on the viral genes in BA.5 clade (all available BA.5 sequences as of August 9, 2022) sequences by performing a series of comparative phylogenetic analyses on a median of 258 BA.5 sequences and a median of 113 sequences from available sequences chosen to represent genomic diversity in SARS-CoV-2. We compiled our background dataset from the globally subsampled Nextstrain [39] (https:// nextstrain.org/ncov/gisaid/global/all-time) build (last accessed June 12, 2022, genomes were sampled from the beginning of the SARS-CoV-2 pandemic) (S1 File). We observe that: • There is evidence of diversifying positive selection acting on the BA.5 sequences, on 94 individual sites (there are 2737 sites that are polymorphic in the amino-acid space among clade sequences; selection also includes the basal branch of the clade) along the internal branches of the clade (S8 Table).
• There is evidence of diversifying positive selection acting on the BA.5 sequences, on 133 individual sites along all branches of the clade (S9 Table).
• When comparing the strength of selection on BA.5 to background sequences along internal tree branches, 12 individual sites along the internal branches of the clade (S10 Table) showed statistically significant differences.

Table 1. Examples of various recent RASCL analyses on variants of interest and variants of concern for SARS-CoV-2.
Full results, and more recent updates, are available through our interactive notebooks (see Methods section for additional details). The lower compression ratios for Omicron reflects the higher genomic variability of this lineage, which gave rise to many sublineages soon after emergence, combined with a high-volume of viral genome sequencing.

PLOS ONE
• Over the entire tree 11 sites show evidence of directional selection (S11 Table).
• Along the internal branches of the BA.5 clade, 37 pairs of sites showed evidence of coevolution (S12 Table).
• Along the internal branches of the BA.5 clade, 14 sites showed evidence of negative selection (S13 Table).
We consider our results from gene-wide estimates of adaptation where we observed that 3 structural and 11 non-structural proteins yield statistically significant results (Table 3). Within the set of structural proteins, we find Spike (S), Membrane glycoprotein (M), and the nucleocapsid phosphoprotein (N), these genes have been implicated in complex biological functions, including as a highly conserved target, M, [40] and have been the focus of studies on viral infection [41], pathology [42] and vaccination and therapeutic intervention [43]. Interestingly, within the set of non-structural proteins we find ORF3a, ORF6, ORF7a, and ORF8, which have been implicated in novel biological mechanisms in the SARS-CoV-2 virus including the induction of autophagy and role as a viral ion channel, ORF3a, [44,45] disruption of nucleocytoplasmic transport, ORF6, [46] inhibition of host interferon response, ORF7a ORF8 [47,48]. We also find  Table 1 are discussed, including: S/339, S/371, S/440, S/764, S/1162. We highlight these sites due to the level of statistical signal associated with them (we find � 5 branches selected in the exploratory MEME analysis), and provide selection profiles for each of these sites below.
Assessing evolutionary pressures on the SARS-CoV-2 μ (B.1.621) clade. We identify genomic sites in B.1.621 (μ) [49] clade sequences that may be subject to selective forces and could be prioritized for further studies but have not yet reached high frequencies. We present an analysis of the B.1.621 variant, performed on individual genes/protein products, using a median of 101 μ sequences subsampled from all available sequences in GISAID [50] to represent the genomic diversity in this clade (all sequences as of September 7, 2021). A similarly subsampled global SARS-CoV-2 background dataset from publicly available sequences via the ViPR database is used as background and provided in our Github repository, (linked to in our Methods section). Interactive results for this analysis can be explored in our ObservableHQ notebook (https://observablehq.com/@aglucaci/rascl-mu) and our Virological (https:// virological.org/) post (https://virological.org/t/assessing-evolutionary-pressures-on-the-sarscov-2-mu-b-1-621-clade/760). Our analysis identified 67 (S14 Table) individual codon sites (among 1643 sites that are polymorphic in the amino-acid space) that showed evidence of episodic diversifying selection along internal branches of this clade using the MEME method at q � 0.20 false discovery rate (FDR). A total of 5 sites (S15 Table) were found to be subject to directional selection using the FADE method.
We identify high-priority sites in SARS-CoV-2 μ (B.1.621) sequences (Table 4), with a "Rank" for each site based on a point system described below. We identify and rank sites (+1 for each category) according to the following protocol: • Inferred to be under positive selective pressure.
• Contain mutations that are not predictable based on the evolution of Sarbecovirus sequences [51].
• Contain mutations that occur in a large fraction of unique haplotypes. This was shown to be predictive of near-term growth in a separate analysis from our global SARS-CoV-2 analysis [17].  Genomic position (in SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon in gene: the location of the codon in the corresponding gene/ORF. p-value: the p-value for the likelihood ratio test that a non-zero fraction of branches have omega > 1 (i.e., episodic diversifying selection at this branch). This is not corrected for multiple testing; the MEME test is generally conservative on real data. qvalue: multiple-test corrected q-value (Benjamini-Hochberg). We assign a "Rank" to each site based on a point system described above. Properties: which, if any, of the five compositive biochemical properties [52] are conserved or changed at this site. https://doi.org/10.1371/journal.pone.0275623.t004

PLOS ONE
Additionally, where available, we provide interpretation of our identified sites in terms of known functional significance, temporal growth trends and location on the 3D structure of the protein. Briefly, each site is given a point for each of the follow requirements that are met: found to be positively selected, the site occurs outside of the clade defining site set, the site has any unexpected mutations as described in our Sarbecoviruses evolutionary analysis notebook (https://observablehq.com/@spond/sars-cov-2-pvo), the site has a mutation present above the minimum threshold in the SARS-CoV-2 Global Haplotype analysis (https://observablehq. com/@spond/sc2-haplotypes), for this we remove any mutation present in background clade, and remove gaps.
Evidence of natural selection history operating on SARS-CoV-2 genomes. For the set of high-priority SARS-CoV-2 genomic sites (taken from Table 4), sites inferred from B.1.621 sequences, we observe when and how selection (positively or negatively) operated on them, through a series of 3-month overlapping intervals going back to the beginning of the pandemic (Fig 3). The earliest intervals end in February 2020 and the latest-in September 2021. In selected sites we observe the temporal trends of high-priority sites in Spike (Fig 4) and RdRp (Fig 5) in B.1.621 sequences. We also describe the spatial location of sites inferred to be under positive selective pressure in the Spike gene (Fig 6) from B.1.621 (μ) sequences on the structure of the protein (from Table 4).  Table 4). If a site was found to be positively (red) or negatively (blue) selected during a specific time, a bubble will be drawn at a corresponding point on the plot. The area of the bubble is scaled as -log10 p, where p is the p-value of the FEL likelihood ratio test. Larger bubbles correspond to smaller p-values; p-values are not directly comparable between different time windows and different genes due to differences in sample sizes and other factors. The x-axis shows the endpoint of the time-window, e.g., March 30th, 2021, will correspond to the analysis performed with the data from January 1, 2021, to March 30, 2021. Figures like this can be generated with the "Evidence of natural selection history operating on SARS-CoV-2 genomes" ObservableHQ notebook (https://observablehq.com/@spond/sars-cov-2-selected-sites). https://doi.org/10.1371/journal.pone.0275623.g003

PLOS ONE
Potential biological and clinical significance of mutations. The ongoing monitoring of emergent VOIs and VOCs can detect adaptive mutations before they rise to high frequency and help establish their relationship to key clinical parameters including pathogenicity and transmissibility. Additionally, continued evolution within a particular clade may form the foundation for a subclade with further functional sites of interest. Based on current information for the Spike gene from Stanford Coronavirus Antiviral Resistance Database (CoVDB, https://covdb.stanford.edu/) [53] we include several annotations with clinical relevance. From the SARS-CoV-2 B.1.621 (μ) Spike gene, we identify the following sites of interest from Table 4 due to their interaction with epitope binding in monoclonal antibodies (mAbs): 144, 145, 146, 147, 148, 417, and 501.

Conclusions
A need that has reoccurred throughout the course of the COVID-19 pandemic is to rapidly identify molecular changes as they arise within the SARS-CoV-2 genome and to interpret how these changes impact the fitness and host-interaction of the virus. Additionally, this information is crucial to provide public health officials with the most up-to-date information when making public health decisions. To gather this information, computational and laboratorybased analytical approaches have been used to test and validate hypotheses about the observed genotype and phenotypic implications [54]. These current approaches require both significant effort and time to complete, therefore the results may be gained too slowly to inform early public health responses. Computational methods that detect natural selection can be leveraged to identify sites of interest within viral clades. Due to the massive amount of sequence availability of SARS-CoV-2 genomes, many such methods are rendered uncapable of providing results in a timely manner, bottlenecked by the increased computational complexity associated with large-scale analysis. We address this limitation with RASCL, an agile method that can be used to rapidly characterize and assess natural selection at sites and across proteins within viral genomes. Now, SARS-CoV-2 VOI/VOCs can be screened for signals of selection in a standardized manner and at an accelerated rate, while providing easily interpretable, near-realtime results. The novelty of RASCL lies in its design; it is highly modular and easily adaptable to rapidly analyze any molecular evolving pathogen. Regarding the modularity component, there are phases of the pipeline, described in Fig 1, each of which can be parallelized across either a highperformance computing environment or a personal computer. We take the intermediate and terminal files created by our methods throughout the analysis and combine the pertinent output files together into the commonly used, standardized JSON format. To make interpretation and visualization of the results easy for the user, we created a customizable RASCL dashboard page using ObservableHQ which runs in the browser of any internet browser. The results page is dynamic and interactive, allowing the user to inspect the results for different signals of selection with ease. The modularity of RASCL makes it highly scalable, yielding near-real-time results at any stage of an outbreak. At the beginning stages of pathogen emergence when very few sequences exist, analyses run quickly, and as the outbreak persists and the number of sequences increases, subsampling can be increased, limiting the computational bottleneck.
RASCL is available in two forms, as a standalone pipeline that uses Snakemake, as well as a web application integrated as a workflow in the Galaxy framework. By implementing the method in these two ways, users at any level of bioinformatics expertise benefit, making RASCL highly accessible. RASCL has been designed such that with minimal modifications to the background and reference genomes, genes under analysis, and default thresholding settings any other evolving pathogens can be rapidly scrutinized to immediately inform public health measures. RASCL can be modified to screen for signals of selection in gene sequences from the current Monkeypox outbreak. Differences in virulence have been reported between Monkeypox isolates from two different geographic regions [55,56], thus rapidly identifying the evolution within viral genes is highly relevant to public health efforts.
To date RASCL has been used to characterize the role of natural selection in the emergence of the Beta, Gamma, Omicron [17] and BA.4/BA.5 [57] VOC lineages, as well as to identify patterns of convergent evolution in the Alpha, Beta and Gamma lineages [11]. Whole genome sequences from any viral clade of interest (i.e., emerging pathogens), can be separated into a query and background sequence dataset representing the global diversity of viral sequences serves as the input to the tool, where the selective forces associated within and between the two sets of sequences are identified. RASCL has also been used to monitor the evolution of several lineages (see Table 1) and will be applied to future SARS-CoV-2 sublineages as they emerge. Therefore, whenever future genomic surveillance efforts reveal new potentially problematic SARS-CoV-2 lineages, we will use RASCL to analyze these too.
Among the limitations of the current study is that at this moment the RASCL application does not take recombination within genes into account; recombination between genes is handled by performing gene-by-gene analyses. While recombination plays a generally recognized role in the evolution of coronaviruses between species, only a limited amount of recombination is observed in the globally circulating viral population of SARS-CoV-2 [58]. In addition, the types and modalities of selection analyses employed in the RASCL application (described in the Methods section) have robust statistical inference that are only biased when significant recombination changes the topology of the inferred phylogenetic relationships. Future versions of RASCL will include an optional configuration to detect genetic recombination using state-of-the-art methods [59][60][61] and will include an updated interactive notebook to visualize and interpret these kinds of complex evolutionary signals. By taking recombination into account, we look forward to increasing the role that the RASCL application can play in the global monitoring and surveillance of evolution in SARS-CoV-2 and other important pathogens.
Our focus in this study is on the rapid, near-real-time monitoring and analysis of emerging pathogens, where the RASCL software application provides interpretable results for molecular surveillance of continued natural evolution. While an area of active research and both proliferative and heated debate, the question of SARS-CoV-2 origins [62,63] is out of scope for our study. Indeed, while our analysis of VOIs/VOCs in SARS-CoV-2, has uncovered important and emerging regions of interest in key viral proteins, RASCL has broader applicability to global health threats with known natural origins and existing animal reservoirs. However, further investigation in SARS-CoV-2 origins is critical for understanding the true biological and epidemiological context and complex evolutionary history [64] of global pathogens.
Supporting information S1 File. GISAID accession ID's for the analyses reported in Table 1. We also report the GISAID accession ID's for our Nextstrain background dataset. (ZIP) S1 Table. List of sites found to be under diversifying positive selection by MEME (p�0.05) along internal branches in BA.1, as well as biochemical properties that are important at this site (via the PRIME method). Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ ORF. % Of branches with omega>1: the fraction of tree branches (internal branches BA.1 clade) that have evidence of diversifying positive selection at this site (100%-pervasive selection, <100% -episodic selection). p-value: the p-value for the likelihood ratio test that a non-zero fraction of branches have omega > 1 (i.e., episodic diversifying selection at this branch). This is not corrected for multiple testing; the MEME test is generally conservative on real data. q-value: multiple-test corrected q-value (Benjamini-Hochberg). Properties: which, if any, of the five compositive biochemical properties from [52] are conserved or changed at this site. (CSV) S2 Table. List of sites found to be under diversifying positive selection by MEME (p�0.05) along all branches in BA.1, as well as biochemical properties that are important at this site (via the PRIME method). Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. % Of branches with omega>1: the fraction of tree branches (internal branches BA.1 clade) that have evidence of diversifying positive selection at this site (100%-pervasive selection, <100% -episodic selection). p-value: the p-value for the likelihood ratio test that a non-zero fraction of branches have omega > 1 (i.e., episodic diversifying selection at this branch). This is not corrected for multiple testing; the MEME test is generally conservative on real data. qvalue: multiple-test corrected q-value (Benjamini-Hochberg). Properties: which, if any, of the five compositive biochemical properties from [52] are conserved or changed at this site.  Segment corresponds to the gene or ORF does under analysis. Omega1 refers to the first omega rate class, p1 refers to proportion of sites which fit this rate class. Omega2 refers to the first omega rate class, p2 refers to proportion of sites which fit this rate class. Omega3 refers to the first omega rate class which captures the episodic diversifying features, p3 refers to proportion of sites which fit this rate class. P-value, the p-value for the likelihood ratio test. Q-value refers to the multiple-test corrected q-value (Benjamini-Hochberg). (CSV) S8 Table. List of sites found to be under diversifying positive selection by MEME (p�0.05) along internal branches in BA.5, as well as biochemical properties that are important at this site (via the PRIME method). Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ ORF. % Of branches with omega>1: the fraction of tree branches (internal branches BA.1 clade) that have evidence of diversifying positive selection at this site (100%-pervasive selection, <100% -episodic selection). p-value: the p-value for the likelihood ratio test that a nonzero fraction of branches have omega > 1 (i.e., episodic diversifying selection at this branch). This is not corrected for multiple testing; the MEME test is generally conservative on real data. q-value: multiple-test corrected q-value (Benjamini-Hochberg). Properties: which, if any, of the five compositive biochemical properties from [52] are conserved or changed at this site. (CSV) S9 Table. List of sites found to be under diversifying positive selection by MEME (p�0.05) along all branches in BA.5, as well as biochemical properties that are important at this site (via the PRIME method). Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. % Of branches with omega>1: the fraction of tree branches (internal branches BA.1 clade) that have evidence of diversifying positive selection at this site (100%-pervasive selection, <100% -episodic selection). p-value: the p-value for the likelihood ratio test that a non-zero fraction of branches have omega > 1 (i.e., episodic diversifying selection at this branch). This is not corrected for multiple testing; the MEME test is generally conservative on real data. qvalue: multiple-test corrected q-value (Benjamini-Hochberg). Properties: which, if any, of the five compositive biochemical properties from [52] are conserved or changed at this site. (CSV) S10 Table. List of sites found to be selected differentially along internal branches between BA.5 and background sequences (FDR � 0.2) using the Contrast-FEL method. Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. Ratio of omega (BA.1: background): the ratio of site-level omega estimates for the two sets of branches. If this ratio is > 1, then selection on BA.1 is stronger. These values are highly imprecise and should be viewed as qualitative measures. p-value: the p-value for the likelihood ratio test that omega ratios between the internal branches of the two clades are different. This is not corrected for multiple testing. qvalue: multiple-test corrected q-value (Benjamini-Hochberg). (CSV) S11 Table. List of sites found to be evolving under directional selection in the entire tree, using a FUBAR-like implementation of the DEPS method. The BA.5 tree was rooted on the genome reference sequence for this analysis. Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. Target amino-acid: which amino-acids have statistical support (Bayes Factor � 100) for accelerated evolution towards them. (CSV) S12 Table. Pairs of sites (BA.5) found to have epistatic (co-evolving) substitution patterns by BGM method. Gene/ORF: which gene or ORF does this site belong to. Codon 1/2 (in gene/ ORF): the location of the two interacting codons in the corresponding Gene/ORF. Posterior probability of non-independence: estimated posterior probability that substitutions which occur on the interior branches of the BA.1 clade are not independent. (CSV) S13 Table. List of sites found to be under pervasive negative selection by FEL (p�0.05) along internal branches in BA.5. Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ ORF. Synonymous rate: Site estimate for the synonymous substitution rate (alpha). These values are highly imprecise and should be viewed as qualitative measures. Non-synonymous rate: Site estimate for the non-synonymous substitution rate (beta). These values are highly imprecise and should be viewed as qualitative measures. p-value: the p-value for the likelihood ratio test. "q-value": multiple-test corrected q-value (Benjamini-Hochberg). (CSV) S14 Table. List of sites found to be under diversifying positive selection by MEME (p�0.05) along internal branches in B.1.621, as well as biochemical properties that are important at this site (via the PRIME method). Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. % of branches with omega > 1: the fraction of tree branches (internal branches BA.1 clade) that have evidence of diversifying positive selection at this site (100%pervasive selection, <100% -episodic selection). p-value: the p-value for the likelihood ratio test that a non-zero fraction of branches have omega > 1 (i.e., episodic diversifying selection at this branch). This is not corrected for multiple testing; the MEME test is generally conservative on real data. q-value: multiple-test corrected q-value (Benjamini-Hochberg). Properties: which, if any, of the five compositive biochemical properties from [52] are conserved or changed at this site. (CSV) S15 Table. List of sites found to be evolving under directional selection in the entire tree, using a FUBAR-like implementation of the DEPS method. The B.1.621 tree was rooted on the genome reference sequence for this analysis. Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. Target amino-acid: which amino-acids have statistical support (Bayes Factor � 100) for accelerated evolution towards them. (CSV)