SARS-CoV-2: Possible recombination and emergence of potentially more virulent strains

COVID-19 is challenging healthcare preparedness, world economies, and livelihoods. The infection and death rates associated with this pandemic are strikingly variable in different countries. To elucidate this discrepancy, we analyzed 2431 early spread SARS-CoV-2 sequences from GISAID. We estimated continental-wise admixture proportions, assessed haplotype block estimation, and tested for the presence or absence of strains’ recombination. Herein, we identified 1010 unique missense mutations and seven different SARS-CoV-2 clusters. In samples from Asia, a small haplotype block was identified, whereas samples from Europe and North America harbored large and different haplotype blocks with nonsynonymous variants. Variant frequency and linkage disequilibrium varied among continents, especially in North America. Recombination between different strains was only observed in North American and European sequences. In addition, we structurally modelled the two most common mutations, Spike_D614G and Nsp12_P314L, which suggested that these linked mutations may enhance viral entry and replication, respectively. Overall, we propose that genomic recombination between different strains may contribute to SARS-CoV-2 virulence and COVID-19 severity and may present additional challenges for current treatment regimens and countermeasures. Furthermore, our study provides a possible explanation for the substantial second wave of COVID-19 presented with higher infection and death rates in many countries.


Introduction
The severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) outbreaks have grievously impacted the world in a short span of time. Understanding the factors that govern the examined the correlation between the LD results and physical distance across all continental data sets.
To detect recombination within datasets, we tested individual datasets for pairwise homoplasy index using PhiPack software [23]. All the PhiPack tests were performed with 1000 permutations. We used 'Profile' program available in PhiPack suite with default options for window scan size and step size in all the aligned sequence datasets. Further validation was done using RDP, GENECONV, MaxChi 2 , Chimaera, and 3Seq algorithms available from RDP4 software suite [24] with default parameter settings and 1000 permutations. Admixture 1.3.0 software [25] was used to identify genetic substructure of strains across the continental transmission as follows: variants with MAF � 0.5%, variants in LD (R 2 > 0.5), haplotype blocks, independent variants, nonsynonymous, and synonymous variants. All analyses were iterated for K = 20 and cross validation errors were examined to infer an optimal K cluster. Replicate runs were further processed using CLUMPAK [26] and results for the major modes were illustrated using the ggplot2 data visualization package for the statistical programming language R (https://www.r-project.org/).

Protein structural analysis
The crystal structures of the viral RNA-dependent RNA polymerase (RdRp, PDB ID: 6M71) [27] and the S protein (PDB ID:6VSB) [28] were used as a scaffold to model the amino acid variations observed in the viral strains. The missing amino acids, i.e. invisible in the Cryo-EM structure of the S protein, were modelled-in by using SWISS-Model [29]. DynaMut web server [30] was used to predict the effect of the mutations on the proteins stability and flexibility. PyMol (Molecular Graphics System, Version 2.0, Schrodinger, LLC) was used to generate structural images.

Detection and classification of mutations from global SARS-CoV-2 genome sequences
We analyzed 2431 high quality SARS-CoV-2 genome sequences from six continental groups. 2352 sequences showed substantial genetic differences from the Wuhan reference sequence (NC045512.2). We identified 1010 unique mutations using our variant calling pipeline. 613 variants were nonsynonymous, 387 variants were synonymous, 9 variants were stop-gain, and 1 variant in the 5 0 UTR. We found only 72 variants with MAF � 0.5%, which were used in admixture and haplotype block analyses. Fig 1 shows the distribution of synonymous and nonsynonymous variants in each gene of the SARS-CoV-2 genome with varying MAF thresholds. The genes with the highest percentage of nonsynonymous variants and MAF � 0.5% were: ORF3a, M, and ORF8, while the genes with the highest percentage of synonymous variants and MAF � 0.5% were: ORF6 and ORF10.

Identification of SARS-CoV-2 genetic clusters in different continents
The principal components analysis (PCA) of early spread 2352 SARS-CoV-2 sequences gave three distinct clusters of samples based on their continent of origin (Fig 2). All three clusters diverged from a single point (Fig 2, red circle). The North American cluster showed the least viral genetic variances unlike the samples from Asia and Oceania which harbored the most genetic diversity. The European cluster is well-defined with few interspersed Asian samples, which is an indication of its origin. This clustering in Europe and North America is probably associated with a founder effect where a single mutation was introduced and subsequently transmitted. This suggestion is corroborated by the fact that the collection date of the founder strain is prior to those in the European and the North American clusters (S1 Fig).
The admixture analysis showed a gradual reduction in cross validation (CV) error for iterations up to K = 7 (S2 Fig). Subsequent iterations showed a fluctuating pattern; however, the least error was seen at K = 17. Further, we verified this CV trend in a subset of variants filtered for MAF � 0.5% (comprised of 72 SNVs). Interestingly, we again observed a gradual reduction in CV error up to K = 7 (with best fit at 7) and subsequently, a trend of increasing CV error.
We created subsets of strong LD, weak LD, Haplotype block, nonsynonymous, and synonymous variants from 72 variants at K = 7 across the continents. We separately performed admixture analysis on these subsets. The analysis of the detected seven datasets revealed interesting mosaic patterns (Fig 3A and 3B). Samples from Asia formed largely two clusters (C1and C6 with C1 as dominant); whereas, samples from the European dataset were distributed into six different clusters (C1, C2, C3, C5, C6 and C7 with C2 as dominant); and the North American samples formed four clusters (C1, C2, C4 and C7 with C4 as dominant). The African and Oceanian datasets formed two clusters each ([C2 and C3 with C2 as dominant] and [C3 andC5 with C3 as dominant cluster], respectively) and South American dataset was formed by five clusters (C1, C2, C3, C5 and C6).
In the context of dependent variants (Fig 3C and 3D), a strong LD block (R 2 � 0.5) was mostly observed in two clusters (C1 and C6) and a weak LD block (R 2 < 0.5) was observed in three clusters (C1, C3, and C7) in Asia. In Europe, a strong LD block was observed in 4 clusters (C1, C2, C3, and C5) and a weak LD block was observed in five clusters (C1, C2, C3, C4, and C5; predominantly from C3 and C5). Interestingly, four clusters were common between the strong and weak LD blocks, suggesting that a makeup of four strains that dominate in Europe have significant proportion of strong and weak LD signatures in them. Likewise, in North America, both strong and weak LD were observed among three clusters (C1, C2, and C4) and (C1, C3, and C4) respectively. Further in Africa, Oceania, and South America, strong LD was observed among (C2), (C1) and (C1 and C2) respectively; weak LD was observed in (C3 and C4), (C1, C2, C3, and C5) and (C3 and C5) respectively. Interestingly, proportions of haplotype blocks (Fig 3E), identified using the whole data, followed mostly the pattern observed in the track of strong LD (i.e. the track of Fig 3C) but admixed with weak LD strains of the respective continents.
Of particular note, variations in proportions of strains carrying nonsynonymous and synonymous signatures were also very evident (Fig 3F and 3G). Proportions of two nonsynonymous clusters (C1 and C5, which have admixed with each other) and two synonymous clusters were dominating in Asia. Four nonsynonymous clusters (C2, C4, C5, and C6) and five synonymous clusters were dominating in Europe, while in North America, three nonsynonymous (C1, C3, C6) and two synonymous (C1, C3) clusters were in higher proportions. Similarly, C4 and C6, C5, and C2 nonsynonymous clusters along with C1 and C6, C2, and C5 synonymous clusters were in high proportions in Africa, Oceania, and South America respectively.

PLOS ONE
The results from these tests on the combined dataset suggested the possibility of coinfection at a global level (Table 1). However, significant P-values (< 0.05) for Phi statistics were observed only with European and North American populations: European (NSS test, P-value = 0.001) and North American (NSS and Phi (normal), P-value = 0.007, 0.042, respectively). These significant P-values showed evidence of the presence of recombination events on a continental level i.e. European and North American populations. These two populations gave the highest numbers of informative variants, 276 and 194 respectively. Furthermore, a relatively large number of regions with significant evidence of recombination was observed in these two continents ( Table 1). Moreover, plausible evidence of recombination in North American and European datasets were ascertained using algorithms available from RDP4 software: recombination events in European population were confirmed with significant P-values by the MaxChi 2 , Chimera, and 3Seq algorithms while possible recombination events in North American populations were validated by MaxChi 2 and 3Seq algorithms ( Table 2). On the flip side; African, Oceanic, South American, and Asian datasets showed no recombination in early spread of SARS-CoV-2 virus in the respective continents.

Estimation of LD and haplotype blocks in continental samples
Haplotype block analysis was carried out using two approaches; first, because of the small sample size in African (n = 25), Oceanian (n = 69), and South American (n = 24) datasets, we

PLOS ONE
decided to compare haplotype blocks obtained from pooled datasets of all variants with MAF � 0.5%. Hence, we initially compared LD blocks obtained from the combined dataset and then observed LD among the same variants in each continental dataset. In the second approach, we estimated haplotype blocks only within continental datasets with large sample size such as Asia, Europe, and North America.
The first analysis, using the combined dataset, showed that LD block varies among continental datasets. S3 Fig illustrates the extent of LD variation in haplotype block of the combined pool in each continental dataset. Examination of the variants in haplotype blocks suggested a clear variation in allele frequency between continental datasets. Table 3 shows MAF of 18 variants involved in haplotype block of the combined data in each continental dataset. These differences called for the second approach of estimating haplotype blocks and the extent of LD between variants directly from individual continental datasets having large sample size. Surprisingly, we observed different sets of variants in haplotype blocks, different lengths of haplotype blocks, and differences in nonsynonymous composition in haplotype blocks among the three continents datasets (Fig 4). Table 4 describes characteristics of the haplotype blocks observed in the datasets from Asia, Europe, and North America. Correlation between LD and physical distance suggested that a strong LD was observed in North American sequences throughout the genome compared to sequences from other continents (Fig 5). The South American and European sequences showed strong LD towards the upstream region of ORF1a gene, while North American and Oceanic sequences showed strong LD from the S protein region to the end of the genome. Analysis performed using RDP, GENECONV, MaxChi 2 , Chimera, and 3Seq algorithms. NS = No significant p-value is observed for the recombinant event using respective method. � = The actual breakpoint position is undetermined (it was most likely overprinted by a subsequent recombination event.
= It is possible that this apparent recombination signal could have been caused by an evolutionary process other than recombination. https://doi.org/10.1371/journal.pone.0251368.t002

PLOS ONE
Coinfection with different SARS-CoV-2 strains increases the severity of COVID-19 Structural analysis of SARS-CoV-2 mutations D614G mutation. Wrapp et al. recently solved the cryo-EM structure of the S protein with 3.5 Å resolution [28] (Fig 6A). The S protein has many flexible loop regions that were not visible in the structure, including the RRAR cleavage site. Therefore, we modelled-in the cleavage site and the undetected flexible regions using the cryo-EM structure PDB ID:6M71 as a scaffold [29]. Fig 6B shows the overlay of the S protein from PDB ID:6VSB with the modelled S protein, with an RMSD of 0.25 Å. As shown in the overlay figure, there were more loops present in the modelled structure that were undetectable by Cryo-EM. The RRAR cleavage site (Fig 6C) presents a highly accessible surface region, where the host protease enzyme can readily cleave the S protein [28]. As such, any mutations on the S protein, close to RRAR furin protease cleavage site, might alter its activity. Therefore, the D614G mutation is believed to increase SARS-CoV-2 virulence [9,33]. One possibility is that the change from a negatively charged aspartate to a non-polar glycine may modify the structure and therefore the function of the protein. Charged amino acids form ionic and hydrogen bonds (H-bond) through their side chains and stabilize proteins [34]. The targeted aspartate is present in the loop region, therefore a mutation to a glycine would cause unfolding of the loop and possibly render it more flexible making the furin cleavage site more accessible.

PLOS ONE
Coinfection with different SARS-CoV-2 strains increases the severity of COVID-19 D614 is in close vicinity to T859 of the adjacent monomer's S2 (Chain B); thus, they can form a H-bond (Fig 7) through both sidechains. In addition, backbone H-bonds can be formed with A646 of the same chain. It was documented that S2 domains alter their structure after furin site cleavage [35]. Therefore, the mutation of D614 to G might weaken the stability of S2 and make cell entry more aggressive. It is probably the loss of the H-bond between G614 (S1/Chain A) and T859 (S2/Chain B) that stops the hinging of the S2 domain making it more flexible in the transition state when interacting with the host cell receptor. A thermodynamic analysis showed that D614G mutation resulted in slightly destabilizing the protein with a ΔΔG: -0.086 kcal/mol and increasing the vibrational entropy to ΔΔS Vib 0.137 kcal.mol -1 .K -1 as seen in Fig 8A where the red parts indicate more flexibility. Since this mutation will occur on the trimeric structure of the S protein, all the three domains will be more flexible. Such flexibility will render the furin cleavage site more accessible which is concomitant with the virulence of the D614G mutation. Furthermore, Wrapp et al. suggested that the flexibility observed in the receptor binding domain region (RBD) (Fig 8) may facilitate the binding of ACE2 to the S protein [28]. P314L/P323L mutation. ORF1a and ORF1b produce a set of non-structural proteins (nsp) which assemble to facilitate viral replication and transcription (nsp7, nsp8, and nsp12) [36]. The nucleoside triphosphate (NTP) entry site and the nascent RNA strand exit paths have positively charged amino acids, are solvent accessible, and are conserved in betacoronaviruses [37] (Fig 9). P314L mutation (or Position P323 on the protein structure PDB ID:6M71 because of a frame shift and written as P323L hereafter) is positioned on the interface domain of the RdRp (or nsp12) between A250-R365 residues. Previous studies have shown that the interface domain has functional significance in the RdRp of Flavivirus. In addition, when polar or charged residue mutations were introduced into these sites, viral replication levels

PLOS ONE
were significantly affected [38]. Thus, mutations on nsp12 interface residues may affect the polymerase activity and RNA replication of SARS-CoV-2. Proline is often found in very tight turns in protein structures and can also function to introduce kinks into α-helices. In Fig 10, we investigated the proposed intermolecular bonds that P323 can make, where the backbone COO-group of proline can form H-bonds with the backbone NH groups of T324 and S325 or

PLOS ONE
Coinfection with different SARS-CoV-2 strains increases the severity of COVID-19 the OH-group of S325 side chain. The pyrrolidine, on the other hand, forms hydrophobic interactions with W268 and F275 (Fig 10A). The mutation to leucine tightens the structure and reduces the flexibility with an increase in ΔΔG: 0.717 kcal/mol and a decrease in vibrational entropy to ΔΔS Vib ENCoM: -0.301 kcal.mol -1 .K -1 (Fig 11). Furthermore, leucine possesses a non-polar side chain, seldomly involving catalysis, which can play a role in substrate recognition such as binding/recognition of hydrophobic ligands. L323 backbone COO-forms a H-bond with the sidechain OH-group of S325, in addition to a hydrophobic interaction with W68 and L270 (Fig 10B). L270 positioned on top of the flexible loop region, forming a hydrophobic interaction would therefore displace any water molecules entering the looped region and thereby make it more compact. The overall improved stability of RdRp can make it more efficient in RNA replication and hence increase SARS-CoV-2 replication.

Discussion
Our pairwise homoplasy index tests suggest that, among continental datasets, European and North American sequences have shown evidence for the presence of recombination events (Pvalue = 0.001 and 0.007 respectively); while African, Oceanic, South American, and Asian datasets have shown no recombination events. The recombination events in European population were further confirmed with significant P-values by MaxChi 2 , Chimera, and 3Seq tools, while possible recombination events in North American populations were confirmed by Max-Chi 2 and 3Seq tools. This indicates once more that European and North American populations are at higher risk of coinfection with different SARS-CoV-2 variants simultaneously. The recombination effects might also lead to deletion of big portions of RNA such as the one reported in Holland et al. paper, where an 81-nucleotide deletion was detected in SARS-CoV-2 ORF7a [39]. Similar deletions might decrease viral fitness and affect COVID-19 pandemic trajectory.
Depending on the variance seen within the clustered samples, PCA analysis indicated that clustering in Europe and North America is probably associated with a founder effect where a single mutation was introduced and subsequently transmitted; such founder mutations are 23403AG in Europe and 28144TC in North America. Admixture analysis identified differing

PLOS ONE
number of clusters of viral strains in different continents: Europe with six clusters, South America with five, North America with four, and Africa and Asia with two each. Both strong and weak LD blocks were seen among clusters of strains in every continent; the four dominant strains in Europe have significant proportion of strong and weak LD signatures in them. Proportions of strains carrying missense variants over synonymous variants differ among continents-five clusters of missense variants were dominant in Europe, three in North America while two clusters of missense variants were dominant in Asia, Africa, and South America. Regarding recombination patterns, European and North American continents showed evidence for the presence of recombination events among SARS-CoV-2 genomes indicating continuous evolution amid SARS-CoV-2 viral strains.
Admixture analyses have shown 7 different strains with differential segregation of alleles in SARS-CoV-2 isolates. Upon constricting variants with strong LD, the proportional assignment did not change in Africa and Asia, whereas it changed in Europe, North America, Oceania, and South America. In fact, proportions in Europe (C2 & C3) and North America (C2 & C4) increased excessively suggesting that strong LD sites were present in more than one strain in each of these two continents. The presence of an un-admixed block pattern of strong LD between strains suggests that LD sites were not broken by significant recombination seen in these two continents. This is either due to their physical distance or natural selection. On the contrary, weak LD sites have shown clear admixture between strains. Strikingly, each continent is dominated by a different set of nonsynonymous clusters such as Africa by C4, Asia by C1, Europe by C2, North America by C3, Oceania by C5, and South America by C2. This is also evident from the allele frequency variation seen in each continent.
Further, estimation of continental-wise haplotype block enabled us to identify variations in linked nonsynonymous and synonymous sites in Asia, Europe, and North America. Although selection primarily acts on variation that undergoes amino acid change, many synonymous variants were observed in haplotype blocks. This suggests that these synonymous sites hitchhiked along with nonsynonymous variants due to their physical proximity. Another interesting feature we observed was the variation in the number of nonsynonymous sites between Asia, Europe, and North America. The Asian haplotype block carried three nonsynonymous variants, European haplotype block carried ten nonsynonymous variants, and North American haplotype block carried seven nonsynonymous variants. This suggests that the initial strain, which originated and travelled from Asia, had less functional sites whereas coinfection-led recombination in Europe and North America enriched functional sites in strains.
The structural analysis of the two European strain mutations, D614G located in the spike gene and P314L located in the RdRp gene, indicated that the former mutation may render the furin cleavage site more accessible while the latter would increase protein stability. Around 73% of the European samples have both mutations segregating together; while in Africa, only 11% of the viral samples harbor these mutations. The European strain carries additional mutations, notably the hotspot mutations R203K and G204R, that cluster in a serine-rich linker region at the RdRp. It was suggested that these mutations might potentially enhance RNA binding and replication and may alter the response to serine phosphorylation events [40], which might further exacerbate SARS-CoV-2 virulence.
We understand that other confounding factors such as SARS-CoV-2 testing, socioeconomic status, the availability of suitable medical services, and the burden of other diseases are important contributors to the disparities seen in mortality rates around the world. Nonetheless, it is imperative that more functional studies are conducted to delineate the impact of these variants on SARS-CoV-2 transmissibility, diagnostics, vaccines, and therapeutics. Finally, our data