Genetic grouping of SARS-CoV-2 coronavirus sequences using informative subtype markers for pandemic spread visualization

We propose an efficient framework for genetic subtyping of SARS-CoV-2, the novel coronavirus that causes the COVID-19 pandemic. Efficient viral subtyping enables visualization and modeling of the geographic distribution and temporal dynamics of disease spread. Subtyping thereby advances the development of effective containment strategies and, potentially, therapeutic and vaccine strategies. However, identifying viral subtypes in real-time is challenging: SARS-CoV-2 is a novel virus, and the pandemic is rapidly expanding. Viral subtypes may be difficult to detect due to rapid evolution; founder effects are more significant than selection pressure; and the clustering threshold for subtyping is not standardized. We propose to identify mutational signatures of available SARS-CoV-2 sequences using a population-based approach: an entropy measure followed by frequency analysis. These signatures, Informative Subtype Markers (ISMs), define a compact set of nucleotide sites that characterize the most variable (and thus most informative) positions in the viral genomes sequenced from different individuals. Through ISM compression, we find that certain distant nucleotide variants covary, including non-coding and ORF1ab sites covarying with the D614G spike protein mutation which has become increasingly prevalent as the pandemic has spread. ISMs are also useful for downstream analyses, such as spatiotemporal visualization of viral dynamics. By analyzing sequence data available in the GISAID database, we validate the utility of ISM-based subtyping by comparing spatiotemporal analyses using ISMs to epidemiological studies of viral transmission in Asia, Europe, and the United States. In addition, we show the relationship of ISMs to phylogenetic reconstructions of SARS-CoV-2 evolution, and therefore, ISMs can play an important complementary role to phylogenetic tree-based analysis, such as is done in the Nextstrain project. The developed pipeline dynamically generates ISMs for newly added SARS-CoV-2 sequences and updates the visualization of pandemic spatiotemporal dynamics, and is available on Github at https://github.com/EESI/ISM (Jupyter notebook), https://github.com/EESI/ncov_ism (command line tool) and via an interactive website at https://covid19-ism.coe.drexel.edu/.


Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the novel coronavirus responsible for the 39 COVID-19 pandemic, was first reported in Wuhan, China in December 2019. [2,3]. In a matter of weeks, 40 SARS-CoV-2 infections had been detected in nearly every country, and as of July 2020, reported cases 41 continue to rapidly increase across multiple continents. Powered by advances in rapid genetic sequencing, expands, there is a need for an efficient methodology to quantitatively characterize groups of variation in the 75 SARS-CoV-2 virus genome by defining genetic subtypes of the virus. Exemplary potential applications of 76 quantitative subtyping include the following: 1) Characterizing potentially emerging variants of the virus in 77 different regions, which may ultimately express different phenotypes. 2) Monitoring variation in the viral 78 genome that may be important for vaccine development, for example due to emerging structural differences 79 in proteins encoded by different strains. 3) Designing future testing methodology to contain disease 80 transmission across countries and regions, for example developing specific tests that can characterize whether 81 a COVID-19 patient developed symptoms due to importation or likely domestic community transmission. 4

) 82
Identifying viral subtypes that may correlate with different clinical outcomes in different regions and patient 83 subpopulations. 84 Phylogenetic trees obtained through sequence alignment may be utilized to map viral outbreaks 85 geographically and trace transmission chains [22,23] and have been applied to SARS-CoV-2 by, e.g., the 86 Nextstrain group as discussed above. At an early stage in the pandemic, however, phylogenetic trees may be 87 unreliable predictors of evolutionary relationships between viral strains circulating worldwide, because of 88 insufficient information regarding the molecular clock assumption, practical limits on data collection, and 89 sampling bias [24]. Accordingly, subtyping based on phylogenetic models may also be unreliable and change, 90 as the assumptions underlying the models change given more sequencing and continued variation in the viral 91 genome. The ISM approach described in this paper relies instead on compact measures of sequence similarity 92 that will remain conserved even as more genome sequence data is added over time. ISMs thus provide a 93 robust subtype definition, which can help track the virus as the pandemic progresses. Therefore, it may be 94 more efficient to focus on co-occurring patterns of only the sites of the more frequently occurring variation 95 within the viral genome to identify subtypes, rather than utilizing whole genome sequence data to cluster 96 viral genomes, which may contain additional confounding variation. 97 Moreover, the nomenclature of clades imply that the categorization of viruses in subtypes is static rather 98 than dynamic. SARS-CoV-2 is a novel virus in humans that is rapidly evolving, which makes it harder to 99 establish a stable nomenclature for genetic typing [10]. The Nextstrain project has sought to address these 100 challenges by providing their own clade definitions based on whether there are a certain number of mutations 101 at nucleotide positions in the sequence (at least two) and naming clades based on their estimated time of 102 emergence [25]. This shows that a conventional genetic subtype relying on whole genome phylogenetic trees 103 will be complicated by the changes in viral genome, especially early on in a pandemic before those changes 104 are clearly governed by selection pressure. Lineages will likely disappear and reemerge within different 105 geographical regions and over the course of time [10]. In addition, viral evolutionary analysis, such as by the 106

4/44
Nextstrain group, relies on making assumptions solely on molecular evolution (the degree of sequence 107 similarity and branching points in defining genetic clades and other levels of organization) and not on 108 transmission models. 109 In this paper, we propose a methodology to complement phylogenetics-based transmission and evolution 110 models of SARS-CoV-2 that can consistently and rapidly identify subtypes without requiring an initial tree 111 reconstruction step-and, thereby, avoid the need to make assumptions about the molecular evolution clock 112 and clustering thresholds. To generate highly informative molecular signatures indicative of a subtype or 113 emerging lineage, we look to methods that have been successfully employed in the microbiome field to resolve 114 species/subspecies from 16S ribosomal RNA (16S rRNA) gene [26]. The 16S rRNA gene is a highly 115 conserved sequence and therefore can be used for phylogenetic analysis in microbial communities [27][28][29][30][31]. 116 One way to differentiate between closely related microbial taxa is to identify nucleotide positions in 16S 117 rRNA data ("oligotypes") that represent information-rich variation [32]. This approach has also been used in 118 the reverse direction to find conserved sites as a way to assemble viral phylogenies [33]. Dawy et al. proposed 119 to use Shannon's mutual information to identify multiple important loci for Gene mapping and marker 120 clustering [34]. Shannon Entropy [35] has been applied in multiple sequence alignment data to quantify the 121 sequence variation at different positions [32,36]. Given a position of interest, entropy can be used to measure 122 the amount of "randomness" at that position, as determined by whether sequences may have different bases 123 at a specific position. For instance, if there is an A at a given position across all aligned sequences, the 124 entropy will be 0, i.e., there is no "randomness" at that position. On the other hand, if at a given position 125 there is a G in 50% of the sequences and a T in the other 50%, the entropy will be 1 (i.e., essentially 126 "random"), and thus a relatively high entropy. Based on this property, oligotyping [32] utilizes variable sites 127 revealed by the entropy analysis to identify highly refined taxonomic units. 1 128 Accordingly, we present herein a method to define a genetic signature, called an "informative subtype 129 marker" or ISM, for the viral genome that can be 1) utilized to define SARS-CoV-2 subtypes that can be 130 quantified to characterize the geographic and temporal spread of the virus, and 2) efficiently implemented for 131 identifying strains to potentially analyze for phenotypic differences. The method compresses the full viral 132 genome to generate a small number of nucleotides that are highly informative of the way in which the viral 133 genome dynamically changes. We draw on the aforementioned oligotyping approach developed for 16S rRNA 134 data [32] and build on its implementation of entropy and grouping patterns to address the particular 135 challenges of viral genomes. On top of oligotyping, we add error correction to account for ambiguities in 136 1 It is important to note that while we use the term "random" in the foregoing, in the biological context, a position may have a different base in different sequences due to selection pressure resulting in strains with different phenotypes, rather than purely random variation.

5/44
reported sequence data and, optionally, applied further compression by identifying patterns of base entropy 137 correlation. The resulting ISM, therefore, defines a viral genetic subtype (that can be related to a 138 phylogenetic "lineage", see Comparison of ISM-defined subtypes to clades identified using phylogenetic trees) 139 in the sense that it is a compressed (reduced complexity) representation of a set of genetic features (a.k.a 140 genotype).

141
The ISM pipeline may complement a phylogenetic approach in that it can efficiently identify viral 142 subtypes of the population through genetic hotspots and do not rely on evolutionary model assumptions.

143
ISMs identify subtypes with slight differences between sequences where the sequence identity is ą99%, as is 144 the case of SARS-CoV-2 with OrthoANI of 99.8% (at the end of April 2020) [37]. ISMs include the key base 145 mutations in the marker identification itself. And thus, unlike phylogenetic lineages (i.e., clades and 146 subsequent emergent subtypes), ISM-defined subtypes are expressly differentiated by mutations with high 147 diversity (over the viral population). For example, the ISM label of a subtype can include mutation in 148 SARS-CoV-2's spike protein, which may have an important phenotypic impact.

149
As a succinct and robust identifier, therefore, ISM-based subtyping can facilitate downstream analysis, 150 such as modeling and visualizing the geographic and temporal patterns of genetic variability of SARS-CoV-2 151 sequences obtained from the GISAID database. We have made the pipeline available on Github

Data collection and preprocessing
For the aligned sequences, we merged the sequence with the metadata provided by Nextstrain 5 as of June 17, 2020, based on identification number, gisaid epi isl, provided by GISAID [6]. Given the fast-moving nature of the pandemic, we filtered out sequences with incomplete date information in metadata (e.g. "2020-01") in order to incorporate temporal information with daily resolution. In addition, we filtered out sequences from unknown host or non-human hosts. The resulting final data set contained 45535 sequences excluding the reference sequence. Then, we calculated the entropy at a given position i by: where L is a list of unique characters in all sequences and p k piq is a probability of observing a character k at 169 position i. We estimated p k piq from the frequency of characters at that position. We refer to characters in 170 the preceding because, in addition to the bases A, C, G, and T, the sequences include additional characters 171 representing gaps (-) and ambiguities, which are listed in Supplementary file 2 -Sequence notation [40]. 6

172
Bases like N and -, which represent a fully ambiguous site and a gap respectively, are substantially less 173 informative. Therefore, we further define a masked entropy as entropy calculated without considering 174 sequences containing N andin a given nucleotide position in the genome. With the help of this masked 175 entropy calculation, we can focus on truly informative positions, instead of positions at the start and end of 176 the sequence in which there is substantial uncertainty due to artifacts in the sequencing process. Finally, 177 high entropy positions are selected by two criteria: 1) entropy ą 0.23, and 2) the percentage of N andis 178 less than 25%. Further details about the selection of these two criteria are provided in Supplementary file 1 -179 Masked entropy threshold analysis. In the data set we processed for this paper, the entropy threshold yielded 180 20 distinct positions within the viral genome sequence. We built the Informative Subtype Markers (ISMs) at 181 these 20 nucleotide positions on each sequence. 182 5 https://https://www.gisaid.org/ 6 The sequences are of cDNA derived from viral RNA, so there is a T substituting for the U that would appear in the viral RNA sequence.

7/44
Error correction to resolve ambiguities in sequence data and remove spurious The focus of the error correction method is to resolve an ISM that contains ambiguous symbols, i.e., a 185 nucleotide identifier that represents an ambiguous base call (as detailed in Supplementary file 2 -Sequence 186 notation [40]), such as N, which represents a position that could be A, C, T, or G. Our approach uses ISMs 187 with few or no ambiguous symbols to correct ISMs with many ambiguities. Given an ISM with an error, we 188 first find all ISMs that are identical to the subject ISM's nucleotide positions without error. We refer here to 189 these nearly-identical ISMs as supporting ISMs. Then, we iterate over all positions with an error that must 190 be corrected in the subject ISM. For a given nucleotide position, if all other such supporting ISMs with 191 respect to the said erroneous position contain the same non-ambiguous base (i.e., an A, C, T, or G), then we 192 simply correct the ambiguous base to that non-ambiguous base found in the supporting ISMs. However, 193 when the supporting ISMs disagree at a respective nucleotide position, the method generates an ambiguous 194 symbol which represents all the bases that occurred in the supporting ISMs and compare this artificially 195 generated nucleotide symbol with the original position in the subject ISM. If the generated nucleotide symbol 196 identifies a smaller set of bases, e.g., Y representing C or T rather than N, which may be any base, then we use 197 the generated symbol to correct the original one.

198
When we applied the foregoing error correction algorithm to ISMs generated from the genome data set 199 analyzed in this paper, we found that 90.2% of erroneous ISMs were partially corrected (meaning at least one 200 nucleotide position with ambiguity was corrected for that ISM if not all), and 24.5% of erroneous ISMs were 201 fully corrected (meaning all positions with ambiguity were corrected to a non-ambiguous base (i.e., an A, C, T, 202 or G)). Since one ISM may represent multiple sequences in the data set, overall the error correction algorithm 203 was able to partially correct 96.0% of sequences identified by an erroneous ISM, and 32.4% of such sequences 204 were fully corrected.

205
The error correction method necessarily results in the replacement of ISMs with an ambiguous base at a site by another ISM without an error at that site. We expect, and have observed that the abundance of non-ambiguous ISMs are inflated by the error correction process. Here, we utilize the inflation rate of ISMs to quantify the difference in abundance of an ISM before and after error correction process. The inflation rate is defined by: where N is the abundance of an ISM of interest (typically an ISM with few or no ambiguous bases) before 206 error correction, and N EC is the abundance of that ISM after error correction.

8/44
Quantification and visualization of viral subtypes 208 At the country/region level, we assess the geographic distribution of SARS-CoV-2 subtypes, and, in turn, we 209 count the frequency of unique ISMs per location and build charts and tables to visualize the ISMs, including 210 the pie charts, graphs, and tables shown in this paper. All visualizations in this paper and our pipeline are 211 generated using Matplotlib and Plotly [41,42]. To improve visualization, ISMs that occur with frequency of 212 less than 5% in a given location are collapsed into "OTHER" category per location. Our pipeline then 213 creates pie charts for different locations to show the geographical distribution of subtypes. Each subtype is 214 also labeled with the earliest date associated with sequences from a given location in the dataset.

215
The abundance table based ordination is widely used to visualize community ecology in the 216 microbiome [43]. [44] also used Principal Components Analysis (PCA) to produce a two-dimensional visual 217 summary of the genetic variation in human populations. In our application, we can use the abundances of 218 different ISMs in a country as features to quantify the genetic variation pattern of SARS-CoV-2 sequences. 219 In our analysis, we select countries that have more than 100 viral sequences uploaded in order to have 220 enough ISMs to viably generate such an abundance We use Bray-Curtis dissimilarity [45] to quantify the dissimilarity of ISM compositions between a pair of 226 regions and form a pairwise Bray-Curtis dissimilarity matrix. Finally, we employ PCA to reduce the 227 dimensionality of the pairwise Bray-Curtis dissimilarity matrix, plotting the first two components to visualize 228 the genetic variation patterns of those countries/regions. 229 To study the progression of SARS-CoV-2 viral subtypes in the time domain, we group all sequences in a given location that were obtained no later than a certain date (as provided in the sequence metadata) together and compute the relative abundance (i.e., frequency) of corresponding subtypes. Any subtypes with a relative abundance that never goes above 2.5% for any date are collapsed into "OTHER" category per location. The following formula illustrates this calculation: where ISM ps,cq ptq is the relative abundance of a subtype, s, in location, c, at a date t, N s,c ptq is the total number of instances of such subtype, s, in location, c, that has been sequenced no later than date t and 231 9/44 N c ptq is the total number of sequences in location, c, that has been sequenced no later than date t.

232
Comparison of ISM subtyping to phylogenetic analysis 233 Nextstrain [1] provides a phylogeny method to track and visualize the dynamic of SARS-CoV-2 sequences. 234 To obtain the results presented here, we downloaded the Nextstrain tree data from 235 https://nextstrain.org/ncov on June 17, 2020. Since both the ISMs and the Nextstrain phylogenetic 236 tree were generated based on the GISAID database, they may be easily compared. We present two 237 comparisons in this paper: 1) ISM hamming distance and phylogenetic tree branch length; 2) ISM clusters 238 defined at different entropy thresholds and Nextstrain defined "clades".

239
The highest-abundance ISMs are involved with hundreds, if not thousands, of sequences. For a given ISM, 240 we can find the lowest common ancestor (LCA) node in the phylogenetic tree of all sequences with that ISM. 241 The branch length between the LCA and the root can be considered as the inferred evolutionary distance 242 between the reference sequence and the LCA node. Hamming distance between our ISMs measures the 243 divergence between two clusters of sequences. We can compare branch length between the root and LCA 244 with the Hamming distance between a given ISM and the reference ISM. Then, we compute the Pearson 245 correlation coefficient to measure the correlation between the evolutionary distance from the reference 246 genome (inferred by the phylogenetic tree) and the Hamming distance between ISMs and the reference ISM. 247 In Nextstrain data, there are 5 "clades", namely, 19A, 19B, 20A, 20B and 20C, defined in [25]. Different 248 sequences are assigned to those 5 "clades" based on genetic variations in the sequence. Since our ISMs are 249 clusters of sequences with similar genetic variations, the Nextstrain "clades" provides us a good interface to 250 study how our entropy threshold influences the ISM definition by comparing the overlaps between Nextstrain 251 "clades" grouped sequences and ISM grouped sequences. To measure the similarity between ISM labels and 252 "clade" labels of sequences, we use two clustering metrics, homogeneity and completeness as proposed in [46]. 253 A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a 254 single class. A clustering result satisfies completeness if all the data points that are members of a given class 255 are elements of the same cluster [46]. We vary the entropy threshold to form different sets of ISM clusters of 256 sequences and compare each set with Nextstrain "clades" using homogeneity and completeness.

258
We begin by identifying and mapping the sites that form an ISM for each genome based on sequence entropy. 259 Then, we analyze the properties of ISMs and validate the ISMs generated from SARS-CoV-2 data as of June 260 10/44  conserved across all aligned sequences. Figure 1 shows the overall entropy at each nucleotide position, 281 determined based on calculating the masked entropy for all sequences as described in the Methods section. 282 Notably, at the beginning and end of the sequence, there is a high level of uncertainty. This is because there 283 are more N andsymbols, representing ambiguity and gaps, in these two regions (gaps are likely a result of 284 artifacts in MAFFT's alignment of the viruses or its genomic rearrangement [21], and both ambiguous in Methods section). Importantly, even though the combinatorial space for ISM is potentially very large due to the substantial 290 12/44 number of characters that may present at any one nucleotide position, only certain ISMs occur in substantial 291 quantities in the overall sequence population. Figure 2 demonstrates the rapid decay of the frequency of 292 sequences with a given ISM. In particular, the plot shows that the first three ISMs represent subtypes that 293 have more than 4000 sequences worldwide.

294
Some potential reasons for the rapid drop off in the frequency relative to the diversity of ISMs may 295 include the following: First, since the virus is transmitting and expanding so quickly, and the pandemic is 296 still at a relatively early stage, there has not been enough time for mutations that would affect the ISM to 297 occur and take root. In that case, we would expect the number of significant ISMs to rise over time. Second, 298 the population of publicly available sequences is biased to projects in which multiple patients in a cluster are 299 sequenced at once: e.g., a group of travelers, a family group, or a group linked to a single spreading event 300 (there are sequences from cruise vessels in the database). We expect that the impact of any such clustering 301 will be diminished in time as more comprehensive sequencing efforts take place. Third, ISMs may be 302 constrained by the fact that certain mutations may result in a phenotypic change that may be selected 303 against. In this case, we may expect a steep change in a particular ISM or close relative in the event that 304 there is selection pressure in favor of the corresponding variant phenotype. However, as of yet there has been 305 no solid evidence of mutations within SARS-CoV-2 that are associated with selection pressure, i.e., as being 306 more transmissible or evading antibodies, though studies do suggest the possibility [19,47,48]. ISMs. These represent instances in which there was insufficient sequence information to fully resolve 310 ambiguities. We expect that as the number of publicly available sequences increases, there will likely be 311 additional samples that will allow resolution of base-call ambiguities. That said, it is possible that the 312 ambiguity symbols in the ISMs reflect genomic regions or sites that are difficult to resolve using sequencing 313 methods, in which case the ISMs will never fully resolve. Importantly, however, because of the application of 314 the error correction algorithm, there are fewer spurious subtypes which are defined due to variants arising 315 from sequencing errors, and all remaining ISMs are still usable as subtype identifiers.

316
Potential functional significance of ISM locations 317 After the informative nucleotide positions were identified, we then mapped those sites back to the annotated 318 reference sequence for functional interpretation [13]. As a practical matter, because the ISM is made up of 319 the high-diversity sites within the SARS-CoV-2 genome, it inherently includes the major loci of genetic 320 changes that are being identified in population studies worldwide. The ISM also excludes sites at the ends of 321 RNA polymerase (i.e., synthesis) machinery [49]. One site is located in the reading frame encoding the S 326 surface glycoprotein, which is responsible for viral entry and antigenicity, and thus represents an important 327 target for understanding the immune response, identifying antiviral therapeutics, and vaccine design [50,51]. 328 High-entropy nucleotide positions were also found in the nucleocapsid formation protein, which is important 329 for packaging the viral RNA [52]. A study has also shown that, like the spike protein, the internal 330 nucleoprotein of the virus is significant in modulating the antibody response [53]. Other sites were found in 331 the ORF3a and ORF8, which, based on structural homology analysis do not have known functional domains 332 or motifs, and have diverged substantially from other SARS-related variants which contained domains linked 333 to increased inflammatory responses [54,55].

334
In sum, the majority of high-entropy sites are in regions of the genome that may be significant for disease 335 have an A at position 13 instead of N). Accordingly, ISM abundance inflation due to error correction will be 358 generally conservative, and will not confound population-level analyses of ISM subtypes based on their 359 relative abundance.

360
Sensitivity of ISM labels to the selection of the entropy threshold 361 To demonstrate the influence of the entropy threshold on ISM identification, we show a Sankey diagram in 362  The Sankey diagram further shows that ISMs defined at a lower entropy threshold can "merge" together 370 as the entropy threshold moves higher. For example, TCTTGGGGG and TCTTGTGGG are two different ISMs if we 371 choose 0.6 as the entropy threshold. They can be differentiated by the 6th position (a G/T variation).

372
However, when the threshold moves higher to 0.8, this position is dropped from the ISM, as its entropy now 373 falls below the threshold. As a result, the two ISMs are merged into TTTGGGG. Some ISMs are stably identified 374 throughout, while other ISMs merge together at different entropy thresholds. We can see from Figure 3       the pandemic (i.e., before February 2020) [12].  Table 2 shows the most abundant nucleotide configurations at certain covarying positions and how many 410 variations can be preserved after compression. The most abundant nucleotide configurations cover at least 411 96% of the sequences for each covarying group (the third column in Table 2). We further validate the groups 412 of covarying nucleotide sites identified by the temporal entropy curve in Figure 4 by Linkage Disequilibrium 413 (LD) analysis, which measures the degree of nonrandom association between two loci on a genome [56]. loci [56]. This is in line with previous studies of LD on SARS-CoV-2 [12,57,58], which found, e.g., that 418 positions 8782 and 28144 showed high significant linkage, with an r 2 value of 0.954 [12]. 419 We then select the representative positions with the highest entropy within each covarying group that can 420 cover all of the most abundant nucleotide configurations. Compression reduces the ISM length from 20 to 11 421 nucleotides.  Table 3 shows the mapping between the original 20-NT ISM and 11-NT compressed ISM for ISMs As such, defining subtypes based on compressed ISM will result in the inflation of a few principal subtypes 429 by an amount of around 1%. Compressed ISM subtypes, therefore, conserve the distribution of major ISM 430 subtypes. A more compact and easy-to-use subtype nomenclature may thus be utilized to quantitatively 431 assess the relative subtype abundance at the population level. mutation (corresponding to position 14 in the ISM) which has been discussed in recent studies [19,47].

456
The data further indicate that the United States has a distinct pattern of dominant subtypes. In 457 particular, the subtype with the highest relative abundance among U.S. sequences is CCTGCTAAGGG, first seen 458 on February 20, 2020. This subtype has also emerged as one of the major subtypes in Canada, with the first 459 sequence being found on March 5, 2020. 460 We also found that some states within the United States have substantially different subtype distributions. 461 Figure 6 shows the predominant subtype distributions in the states with the most available sequences. The 462 colors shown on the charts are also keyed to the colors used in Figure 5, which allows for the visualization of 463 commonalities between the subregional subtypes in the United States and the subtypes distributed in other 464 7 As discussed in the Results section, there are a few covarying positions that can be removed for a shorter ISM while still preserving the most of information. Therefore, for simplicity, we present SARS-CoV-2 subtypes in their short forms throughout this paper. In comparison with other methods, we include both the original ISM form and compressed ISM in the discussion.  Figure 5. Major subtypes in countries/regions with the most sequences (in the legend next to each country/region, we show the date when a major subtype was first sequenced in that country/region). Subtypes with less than 5% abundance are plotted as "OTHER". The raw counts for all ISMs in each country/region, as well as the date each ISM was first found in a sequence in that country/region, are provided in Supplementary file 6 -ISM abundance table of 20 countries/regions.  China, as shown in Figure 5. The most abundant subtype in Washington, CCTGCTAAGGG, is also a major 469 subtype in other states in United States. This CCTGCTAAGGG subtype is also found in substantial abundance 470 in Canada as well. This is consistent with the hypothesis that this subtype is endogenous to the US.

471
Regions with similar genetic variant patterns are identifiable in Figure 5, but only at a qualitative level. 472 As described in the Methods section, the ISM abundance table can be used to provide a quantitative analysis 473 of the similarity between the genetic variation patterns of countries and regions. Figure 7 shows a 474 visualization of the difference in genetic subtype patterns between different countries and regions using  ISMs over time) in Figure 8 and Figure 9. As discussed above, through the pipeline we have developed, these 496 plots use a consistent set of colors to indicate different ISMs (and are also consistent with the coloring 497 scheme in Figure 5).

498
In the United States, we can observe a few waves of different subtypes. For example, in the early stage 499 (late January and early February), the predominant subtype is the same as that of Mainland China. In 500 contrast, the most abundantly found subtype in late February and March, CCTGCTAAGGG, is not abundant in 501 either Asia or Europe. However, this subtype has been found in a substantial number of sequences in both 502 Canada and Australia. It is plausible, therefore, that the CCTGCTAAGGG subtype has become abundant as the 503 result of community transmission in the United States, and has been exported from the United States to 504 these other countries. Interestingly, while the CCTGCTAAGGG subtype has been found across the United States, 505 as shown in Figure 6, it has not been found to be substantially abundant in New York. Over time, however, 506 within the United States the dominant subtype has become TCCGCCAGTGG, which is the predominant subtype 507 in New York state (and linked to the dominant subtype in many European countries).

508
As shown in Figure 9 and additional temporal plots for the Netherlands and Spain contained in the 509 Supplementary Material, the subtype distribution in sequences within European countries differs significantly 510 from that of North America and Australia. In particular, as detailed above, the European dynamics of 511 SARS-CoV-2 appear to reflect the theory that in many European countries, initial cases may have been due 512 to travel exposure from Italy, rather than directly from China. For instance, we observe that the United 513 Kingdom data shows the same early subtypes as those of Mainland China which were also observed in 514 Australia and Canada, i.e., CCTGCCAAGGG and CCCGCCAAGGG. The CCCGCCAAGGG subtype emerged as a highly 515 abundant subtype in United Kingdom data in early February. This subtype was also been found with great 516 frequency in the Netherlands and Australia, but not in Spain, suggesting additional viral genetic diversity 517 within Europe for further study.

518
All inferences drawn from observed temporal trends in subtypes based on the genome sequence 519 dataset-whether based on ISM or phylogeny-based methods-will be limited by important caveats, including: 520 1) The collection date of the viral sequence is usually later than the date that the individual was actually 521 infected by the virus. Many of those individuals will be tested after they develop symptoms, which may only 522 begin to arise several days or even two weeks after infection according to current estimates [60]. 2) The 523 depth of sequencing within different regions is highly variable. As an extreme case, Iceland, which has a 524 small population, has 1.3% of all sequences in the complete data set. Italy, on the other hand, had a large 525 and early outbreak but has disproportionately less sequencing coverage (133 sequences). 526

26/44
Evaluating the ability of ISM-defined subtypes to track significant genetic changes during the SARS-CoV-2 pandemic 528 In our results section, we identified a few widespread ISM subtypes, e.g., TCCGCCAGTGG that dominates New 529 York and some ISM subtypes that are unique to a region, e.g., CCTGCTAAGGG that is mostly found in North 530 America. In this section, we show related literature and how their results relate to ours. We primarily use 531 the original 20-nt ISM identifiers in this section, rather than the compressed ISM, in order to discuss all the 532 positions identified by our entropy analysis and relate them to the literature.

533
Subtype prevalent in New York and some European countries TTACTCGTCCACAGTGTGGG 534 (TCCGCCAGTGG in compessed form) 535 This subtype has been dominating the US since mid-March, as shown in Figure 8. In Figure 6, we can see 536 that this subtype dominates many states including New York (first seen early March in New York).

537
Additionally, as shown in Figure 5, this same subtype has been dominant in European countries, first 538 observed in sequencing data in late February. The first detection dates in New York (later) and Europe 539 (earlier) align with the hypothesis of European travel exposure being the major contributor to the New York 540 outbreak of SARS-CoV-2. Various studies have demonstrated the SNV C14408T in ORF1b to be associated 541 with a virus subtype found abundantly in New York as well as multiple European countries [16,61,62], which 542 is designated as an ISM hotspot site 8 in Table 1. These studies also identified a SNV of A23403G in the S 543 spike protein to be heavily associated with the dominant subtype of both Europe and New York, correlating 544 to ISM hotspot site 14 from our analysis. Our temporal entropy plot in Figure 4 further indicates that these 545 two sites are covarying. Lastly, the studies also reported a SNV of G26144T, which corresponds to ISM site 546 16 and has been observed in the predominant subtypes found in Europe and New York. This is the prevalent subtype characteristic within Washington state through the lastest update of the 550 sequencing database analyzed in this paper (June 2020). It has been linked to the endogenous spread of the 551 virus across the United States [18,63]. According to our ISM analysis, this subtype is separated by a 552 hamming distance of 3 from one of the major subtype of the outbreak in China, CCACCTGCCCACAAGGCGGG 553 (the differences are at ISM positions 10, 11 and 12). Viral spread is suspected to be due to primary exposure 554 of an individual from China to Washington state, designating this case as "WA1" [17,61]. "WA1" lineage is 555 27/44 noted to have three characteristic SNVs, namely, C17747T, A17858G, and C18060T which correspond 556 matches with our ISM positions of 10, 11 and 12 respectively [16-18, 61, 62]. While "WA1" is suspected as 557 the primary subtype for viral spread in Washington state, there are cases that have shown additional SNVs, 558 which suggest mutational variation from the "WA1" strain. These SNVs include C8782T and 559 T28144C [16,17,61] and correspond to hotspot sites 6 and 17 respectively. The same major subtypes seen in 560 Washington state were also identified in positive cases in Connecticut (also detected by Figure 6 using our 561 ISM). It is highly probable that there was trans-coastal exposure due to domestic travel from Washington 562 state into Connecticut, due to the various high-volume airports that are present in and around this state [18]. 563 Subtypes including the A23403G/D614G spike protein variant 564 The SNV A23403G (resulting in D614G variant in spike protein) is a major viral mutation that has been 565 observed in the major European countries of Italy, Spain, France, as well as Middle Eastern regions of Turkey 566 and Israel [16,[64][65][66]. Some studies suggest that this D614G variant of the S spike protein provides greater 567 survival and transmission ability to the virus, however there need to be additional studies conducted to 568 confirm these claims [64]. This position corresponds to ISM position 14. Based on our ISM table, we can 569 quickly navigate to this position and plot the abundance of different variants at this position over time.  and LCA, which is considered as the evolutionary distance between a node and the reference sequence, and 587 the Hamming distance between a given ISM and the reference ISM, CCACCCGCCCACAAGGTGGG (CCCGCCAAGGG 588 in short form), which represents the degree of difference (by number of SNVs) between ISMs.
589 Figure 11 shows that Hamming distance (dark-colored) has a high correlation with the LCA branch 590 length (gray-colored). This means that the Hamming distances between ISMs are able to consistently reflect 591 evolutionary distance at a high level. There are a few outliers though; for example, CCACCTGCTCACAAGGCGGG 592 (CCTGTCAAGGG in short form) has higher LCA branch length but lower ISM Hamming distance. This 593 indicates that some evolutionary signals will be missed by grouping sequences by ISM, likely because the 594 signals are contained in lower-entropy genomic regions which are unrepresented in the ISM. Conversely, we 595 observe that the phylogenetic clades identified by Nextstrain are imperfect with respect to their preservation 596 of SNV information. Nextstrain identifies the clades based on whether they contain at least two prevalent 597 SNVs. But, presumably because the clades are identified by whole genome sequence clustering, not every 598 sequence within a clade will necessarily include those SNVs.

599
Moreover, not only can the ISM pipeline effectively define meaningful viral subtypes, but it can also do so 600 with greater computational efficiency than tree reconstruction methods. Fasttree [67], on its fastest setting, 601

29/44
is reportedly the fastest tree reconstruction method (orders of magnitude faster than most machine learning 602 methods and N is the number of sequences, and a is the size of the alphabet. Accordingly, the computational time 606 required to enumerate subtypes using the ISM is substantially reduced, i.e., a function of the thresholded loci 607 reduced and number of sequences instead. One caveat is that the ISM method requires multiple sequence 608 alignment to identify high entropy sites, which can be a computationally intensive process. However, 609 phylogenetic tree methods based on whole genome sequences require that as well. And, ISM identification 610 may be done on new sequences using previous positions between multiple sequence alignment updates.

611
In sum, ISM can provide a compact and effective representation of a sequence as it includes the essential 612 genetic variation information, while also including a substantial amount of molecular evolution information. 613   TCACTCGTCCACAGGGTAAC  TCACTCGTCCACAGGGTGGG  TTACTCGTCCACAGTGTGGG  CCACCCGCCCACAAGGTGGG  CCACCCTCCCACAAGGTGGG  TCACTCGTCCACAGTGTGGG  CCACCTGCCCACAAGGCGGG  TCACTCGTCCACGGGGTGGG  CCACCCGCCCACAAGTTGGG  CCACCTGCCTGTAAGGCGGG  CCACCTGCTCACAAGGCGGG  CCACCC-CTCACAAGTTGGG  CCACCCTCTCACAAGTTGGG  CCACCCTCCCACAAGTTGGG  CCGTCCTCTCACAAGTTGGG  TCACTCTTCCACAGGGTAAC  TCACTCTTCCACAGTGTGGG  -CACTCGTCCACAGGGTAAC  TCACTCGCCCACAGGGTGGG  TCACTCGTCCACDGGGTGGG  CCACCTGCCCACAAGG-GGG  TCACTCTTCCACAGGGTGGG  TCACTCGTCCACAGGG-----CACCCTCCCACAAGGTGGG  TTACTCTTCCACAGTGTGGG  CTACCCTCCCACAAGGTGGG  TCACTCGTCCACRGGGTAAC  CCACCTTCCCACAAGGCGGG  CCACCCGCCCACAAGGCGGG  TCACCCGTCCACAGTGTGGG  TCACCCGTCCACAGGGTGGG  TCACTCGTCYACAGGGTAAC  TCACTCTTCCACGGGGTGGG  TCACTC-TCCACAGGGTAAC  CCACTCGCCCACAAGGTGGG  CCACCTGCCCATAAGGCGGG  CCACCCDCCCACAAGGTGGG  TCACTCGTCCACAGGGTDDS  -CACCC-CTCACAAGTTGGG  TTACTCGTCYACAGTGTGGG  CCACCTGCCCACAAGGTGGG  TCACTCGTCCACAGGGTRDS  -CACTCTTCCACAGTGTGGG  -TACTCGTCCACAGTGTGGG  TCACTCGTCCACARGGTAAC  YCACTCGTCCACAGGGTAAC  TCACTCGTCYACAGGGTGGG  TCACTYGTCCACAGGGTGGG  TCACCCTCCCACAAGGTGGG  clade are identified by a common ISM. Figure 12 shows the homogeneity and completeness as a function of 618 the entropy threshold used to define ISMs. As shown therein, sequences with a common ISM are generally 619 30/44 assigned to a common clade, and sequences from a given clades also often identified by a set of few ISMs. As 620 the entropy threshold increases, ISMs correspondingly moving "upwards" through the phylogenetic tree to 621 better represent a clade, increasing completeness while maintaining high homogeneity. Conversely, as the 622 entropy threshold lowers, ISMs increase in their resolution, corresponding to an increase to almost perfect 623 homogeneity but with low completeness. At a low entropy threshold, each ISM contains sequences that nearly all belong to the same clade (high homogeneity) but the clade contains multiple ISMs (low completeness). As the entropy threshold rises, ISMs gain more sequences (some of which belong to other clades) and the clades contain fewer distinct ISMs. Thus, there is a trade-off with the entropy threshold, but the sweet spot is around 70-80% on both metrics, showing that ISMs capture some aspect of phylogenetics but have their own characteristics.

625
In this paper, we present a pipeline for subtyping SARS-CoV-2 viral genomes based on short sets of highly 626 informative nucleotide sites (ISMs). Our results demonstrate the following key features of ISM-based 627 subtyping. First, the ISM of a sequence preserves important nucleotide positions that can help to resolve 628 different SARS-CoV-2 subtypes. ISMs provide a quick and easy way to track key sets of SNVs which are 629 covarying as the SARS-CoV-2 pandemic spreads throughout the world. The SNVs which consistently covary 630 with the spike protein variant has rapidly become prevalent throughout the world and may be a potential 631 link to increased viral transmission [4,19,47]. Second, ISM-based subtypes are able to capture the majority 632 of phylogenetic relationships between viral genomes that are represented in Nextstrain tree clades. ISM 633 31/44 analysis shows promise as a complement to phylogenetic classification, particularly given the limits of 634 phylogenetics at early stages in the pandemic (e.g., due to uncertainty regarding key assumptions, such as 635 the rate of the molecular clock and confidence in branches) -while also being more computationally efficient. 636 Third, ISM subtyping can provide robust and informative insight regarding the geographic and temporal 637 spread of the SARS-CoV-2 sequences, as well potentially be a way to identify phenotypic variants of the  An important caveat of all viral analyses, including subtyping, is that they are limited by the number of 643 viral sequences available. Small and/or non-uniform sampling of sequences within and across populations 644 may not accurately reflect the true diversity and distribution of viral subtypes. However, the ISM-based 645 approach has the advantage of being scalable as sequence information grows, and with more information, it 646 will become both more accurate and precise for different geographic regions and within subpopulations. We downloaded all SARS-Cov-2 sequences available from and acknowledge the contributions of the Global 659 Initiative on Sharing All Influenza Data (GISAID) EpiFlu database, which has made accessible novel 660 coronavirus sequencing data, including from the NIH Genbank resource [6]. We would also like to other European countries have generally involved the same viral subtypes as those which are most abundant in Italy, as defined by ISM. Indeed, initial reports of cases in various other European countries in late February 2020 were linked to travelers from Italy [68]. The subtypes which are predominant in Italy are found, however, at lower yet notable abundance in countries including Japan, Canada and Australia.
Somewhat surprisingly, though the Italy subtypes were found in other U.S. states, only 88 out of the 1478 sequences from New York in the data set had the same ISM as the two dominant subtypes in Italy (see Supplementary file 7 -ISM abundance table of 25 U.S. states). This suggests that the outbreak in New York may not be linked to travel exposure directly from Italy, but rather from another location in Europe, with the important caveat that some potential subtypes may not have been detected there (due to relatively low number of sequences available from Italy). Indeed, the dominant subtype in New York (TCCGCCAGTGG) was detected in 86 sequence from Iceland and only one of them linked to travel exposure in Italy. However, 30 out of 86 cases linked to exposure in Austria, 6 linked to UK, 2 linked to Denmark, and 1 linked to Germany. This further suggests that it was unlikely that the incidence of the TCCGCCAGTGG subtype in New York is connected to Italy but rather than elsewhere in Europe, but limited sequence coverage in Italy prevents more definitive inference. However, one of the dominant subtypes in Italy, CCCGCCAGGGG, is not abundant in East Asian regions such as Mainland China and Japan, as indicated in this supplementary figure.
Supplementary file 9 -Relative abundance (%) of ISMs in DNA sequences from Australia as sampled over time Australia shows growing subtype diversity as its cases increase over time. Initially, Australia's sequences were dominated by two subtypes that were also substantially abundant in Mainland China (CCCGCCAAGGG and CCTGCCAAGGG). Later, another subtype (CCCTCCAAGGG) starts to emerge. This subtype was less relatively abundant in Mainland China but more highly abundant in sequences from Hong Kong and Singapore (see Figure 5). Then, starting with sequences obtained on February 27, 2020, and subsequently, more subtypes are seen to emerge in Australia that were not found in other Asian countries but were found in Europe. This pattern suggests a hypothesis that Australia may have had multiple independent viral transmissions from Mainland China -or, as noted in the previous discussion, potentially through transmissions from Iranfollowed by potentially independent importation of the virus from Europe and North America. numbers of sequences being sampled-making it less likely that the reference subtype (CCCGCCAAGGG) is simply being missed. In those cases, it appears from Figure 8, 9, and Supplementary file 12 -Relative abundance (%) of ISMs in DNA sequences from the Netherlands as sampled over time that in later times, other subtypes have emerged over time and are becoming increasingly abundant. One potential explanation is that the SARS-CoV-2 is an RNA virus and thus highly susceptible to mutation as transmissions occur [69].
Therefore, as transmissions have continued, the ISM associated with the reference sequence has been replaced by different ISMs due to these mutations. Another plausible explanation for such leveling off in a region is that the leveling off in relative abundance of the subtype represents containment of that subtype's transmission while other subtypes continue to expand in that country or region. The latter could plausibly explain the pattern observed in the United States, where earlier subtypes connected to Asia did not increase in abundance while a putative endogenous subtype, as well as the dominant New York subtype, have significantly increased in abundance (see Figure 8 and accompanying discussion above). Further investigation and modeling of subtype distributions, as well as additional data, will be necessary to help resolve these questions -particularly in view of the caveats described below.
Supplementary file 15 -Acknowledgements of sequences this research is based on A list of sequences from GISAID's EpiFlu Database on which this research is based and corresponding authors and laboratories.