SARS-CoV-2: Proof of recombination between strains and emergence of possibly more virulent ones

COVID-19 is challenging healthcare preparedness, world economies, and livelihoods. The infection and death rates associated with this pandemic are strikingly variable in different countries. To elucidate this discrepancy, we analyzed 2431 early spread SARS-CoV-2 sequences from GISAID. We estimated continental-wise admixture proportions, assessed haplotype block estimation, and tested for the presence or absence of strains recombination. Herein, we identified 1010 unique missense mutations and seven different SARS-CoV-2 clusters. In samples from Asia, a small haplotype block was identified; whereas, samples from Europe and North America harbored large and different haplotype blocks with nonsynonymous variants. Variant frequency and linkage disequilibrium varied among continents, especially in North America. Recombination between different strains was only observed in North American and European sequences. Additionally, we structurally modeled the two most common mutations D614G and P314L which suggested that these linked mutations may enhance viral entry and stability. Overall, we propose that COVID-19 virulence may be more severe in Europe and North America due to coinfection with different SARS-CoV-2 strains leading to genomic recombination which might be challenging for current treatment regimens and vaccine development. Furthermore, our study provides a possible explanation for the more severe second wave of COVID-19 that many countries are currently experiencing presented as higher rates of infection and death.


Introduction 40
The recent severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) outbreaks have grievously 41 impacted the world by threatening lives and impeding human activity in a short period. Understanding the 42 factors that govern the severity of a pandemic is of paramount importance to design better surveillance 43 systems and control policies [1]. In the case of COVID-19, three variables play a critical role in its spread 44 and severity in a country: the nature of the pathogen, genetic diversity of the host population, and 45 environmental factors such as public awareness and health measures provided by governments [2]. 46 SARS-CoV-2 spike (S) glycoprotein plays an integral role in the viral transmission virulence [3]. The S 47 protein contains two functional subunits S1 and S2 cleaved by FURIN protease at the host cell [4]. The S1 48 subunit contains the receptor binding domain and facilitates interactions with the host cell surface receptor, 49 Angiotensin-converting enzyme 2 (ACE2) [5,6]. The S2 subunit, activated by the host Transmembrane 50 Serine Protease 2, harbors necessary elements for membrane fusion [7]. S protein mutations may induce 51 conformational changes leading to increased pathogenicity [8]. We were pioneers to report the perspective 52 role of S protein D614G (23403A>G) variant located at the S1-S2 proximal junction. The D614G mutation 53 generates conformational changes in the protein structure which renders the FURIN cleavage site (664-54 RRAR-667) flexible and thus enhances viral entry [9]. Currently, the D614G mutation has become the 55 focus of several recent studies for the use in prospective drug targeting strategies [10]. Furthermore, studies 56 on understanding the genetic diversity and evolution of SARS-CoV-2 are emerging [11] . Admixture 57 analyses have been conducted to understand the evolution of Beta-coronaviruses and in particular the 58 diversification of SARS-CoV-2 [12]. Haplotype analysis have also indicated that frequencies of certain 59 haplotypes correlate with virus pathogenicity [13]. 60 Here we extended our study by analyzing the population genetics aspects within genome sequences of 61 SARS-CoV-2 to understand the contiguous spread of SARS-CoV-2, its rapid evolution, and the differential 62 severity of COVID-19 among different continents. We analyzed 2341 viral sequences deposited in GISAID 63 (https://www.r-project.org/). 88

Protein Structural Analysis 89
The crystal structures of the viral RNA-dependent RNA polymerase (RdRp, PDB ID: 6M71) [25] and the 90 S protein (PDB ID:6VSB) [26] were used as model proteins for the structural analysis. The missing aa, 91 invisible by Cryo-EM structure of the S-protein, were modeled-in by using SWISS-Model [27]. DynaMut 92 web server [28] was used to predict the effect of the mutations on the proteins stability and flexibility. 93 PyMol (Molecular Graphics System, Version 2.0, Schrodinger, LLC) was used to generate structural 94 images. 95

Detection and classification of mutations from global SARS-CoV-2 genome sequences 97
We analyzed 2431 high quality SARS-CoV-2 genome sequences from six continental groups. In 98 comparison to the Wuhan reference sequence (NC045512.2), 2352 sequences showed substantial genetic 99 differences. We identified 1010 unique missense amino acid (aa) mutations using our variant calling 100 pipeline. 613 variants were nonsynonymous, 387 variants were synonymous, 3 variants were stop-gain, and 101 1 variant was a 5'-utr variant. We found only 72 variants at MAF ≥ 0.5% which were for admixture and 102 haplotype block analysis.

Identification of SARS-CoV-2 genetic clusters in different continents 107
We performed Principal Component Analysis (PCA) of 2352 SARS-CoV-2 sequences as indicated in Fig  108   2. The PCA analysis gave three distinct clusters of samples based on their continent of origin. All three 109 . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ;https://doi.org/10.1101https://doi.org/10. /2020 genetic variances; unlike, samples from Asia and Oceania which harbored the most genetic diversity. 111 Whereas the European cluster is well defined with few interspersed Asian samples which is an indication 112 of its origin. This clustering in Europe and North America is probably associated with a founder effect 113 where a single mutation was introduced and subsequently transmitted. This suggestion is corroborated by 114 the fact that the collection date of the founder strain is prior to that observed in the European and the North 115 American clusters (S1 Fig). Then, in order to estimate the ancestral allele frequencies and admixture proportions for each SARS-CoV-117 2 sample, admixture analysis was implemented. We used cross validation (CV) procedure to find best (i.e., CV error in RAW (A) and variants filtered for MAF≥0.5% (B). We observed a gradual reduction in CV 120 error for iterations up to K=7 and subsequently, a pattern of an increase-decrease was observed in 121 subsequent iterations; however, the least error was seen at K=17. Further, we verified this CV trend in a 122 subset of variants filtered for MAF ≥0.5% (comprised of 72 SNVs). Interestingly, we again observed a 123 gradual reduction of CV error up to K=7 (with best fit at 7) and subsequently, a trend of increasing CV 124

error. 125
We created subsets of strong LD, weak LD, Haplotype block, nonsynonymous, and synonymous variants 126 from 72 variants at K=7 across the continents. We separately performed admixture analysis on these 127 subsets. The analysis of the detected seven datasets revealed interesting mosaic patterns (Fig 3 A and B). 128 and the weak LD block is observed in three clusters (C1, C3, and C7) in Asia. In Europe, the strong LD 135 block is observed in 4 clusters (C1, C2, C3, and C5) and the weak LD block is observed in five clusters 136 (C1,C2,C3,C4, and C5; predominantly from C3 and C5). Interestingly, four clusters are common between 137 the strong and weak LD blocks, suggesting that a makeup of four strains that dominate in Europe have 138 significant proportion of strong and weak LD signatures in them. Likewise, in North America, both strong 139 and weak LD are observed among three clusters (C1, C2, and C4) and (C1, C3, and C4) respectively. 140 Further in Africa, Oceania and South America, strong LD is observed among (C2), (C1) and (C1 and C2) 141 respectively; weak LD is observed in (C3 and C4), (C1, C2, C3, and C5), (C3 and C5) respectively. 142 Interestingly, proportions of a Haplotype block (Fig 3 E) identified using whole data, mostly followed 143 pattern of strong LD of respective continents, however admixed with a weak LD strain of respective 144 continents. proportions. Similarly; C4 and C6, C5, and C2 nonsynonymous clusters along with C1and C6, C2, and C5 151 synonymous clusters are in high proportion in Africa, Oceania, and South America respectively. 152

Evidence of co-infection in the sequences from continental datasets 153
Here, we tested for the presence or absence of recombination among continental datasets using PhiPack 154 software which effectively differentiates between presence or absence of recombination using three 155 different tests "Pairwise Homoplasy Index (Phi)" (Bruen et al. 2006), "Neighbor Similarity Score (NSS)" 156 (Jakobsen and Easteal, 1996) and "Maximum χ 2 (MaxChi)" [29]. 157 . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ;https://doi.org/10.1101https://doi.org/10. /2020 The results of these tests on the combined dataset suggested the possibility of co-infection at a global level. 158 Among continental datasets, only European (NSS test, P-value = 0.001) and North American [NSS and Phi 159 (normal), P-value = 0.007 and 0.042, respectively] sequences have shown evidence for the presence of 160 recombination events; while African, Oceanic, South American, and Asian datasets have shown no 161 recombination in early spread of SARS-CoV-2 ( Table 1). 162 . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ;https://doi.org/10.1101https://doi.org/10. /2020  in early spread of SARS-CoV-2 in respective continents. 169 . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ; https://doi.org/10. 1101/2020 We carried out this analysis in two ways; first by comparing haplotype blocks obtained from combined 171 dataset of variants with MAF ≥0.5% with each continental dataset. We did this analysis because continental 172 datasets of Africa (n=25), Oceania (n=69), and South America (n=24) were very small to infer high 173 confident LD blocks from respective datasets alone. Hence, we first compared LD blocks obtained from 174 the combined dataset and then observed LD among the same variants in each continental dataset. The 175 second way of analysis was by directly estimating haplotype blocks in each continental dataset with large 176 sample size such as Asia, Europe, and North America. 177 From the first analysis, we observed that LD block, obtained from the combined dataset, varies among  Table 2 shows MAF of 18 variants involved in 181 haplotype block of combined dataset in each continental dataset. These differences called for the second 182 approach of estimating haplotype blocks and the extent of LD between variants directly from each 183 continental dataset having large sample size. Surprisingly, we observed different sets of variants in 184 haplotype blocks, different length of haplotype blocks, and differences in nonsynonymous composition in 185 haplotype blocks among the three continents datasets (Fig 4). Table 3 describes characteristics of the 186 haplotype blocks observed in the datasets from Asia, Europe, and North America. 187 . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ; https://doi.org/10.1101/2020.11.11.20229765 doi: medRxiv preprint . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ; https://doi.org/10.1101/2020.11.11.20229765 doi: medRxiv preprint . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ; https://doi.org/10.1101/2020.11.11.20229765 doi: medRxiv preprint D614G mutation. Wrapp et al. recently solved the cryo-EM structure of the S protein with 3.5 Å resolution 196 [26] (Fig 5 A). The S protein has many flexible loop regions that were not visible in the structure, including 197 the RRAR cleavage site. Therefore, we modeled-in the cleavage site and the undetected flexible regions 198 using the cryo-EM structure PDB ID: 6M71 as a scaffold [27]. Fig 5 B shows the overlay of the S protein 199 from PDB ID:6VSB with the modelled S protein, with an RMSD of 0.25 Å. As shown in the overlay figure,  200 there are more loops present in the modelled structure that were not detected by Cryo-EM. The RRAR 201 cleavage site (Fig 5 C) shows a high surface accessible region, where the viral protein can attach to the host 202 protein. As such, any mutations on the S protein especially close to RRAR site might alter its activity. 203 D614G is a mutation believed to increase SARS-CoV-2 virulence [9]. One possibility is that the change 204 from a negatively charged aspartate to a non-polar glycine may modify the structure and therefore the 205 D614 is in close vicinity to T859 of the adjacent monomer's S2 (Chain B) where they can form a H-bond 210 (Fig 6) through both sidechains. In addition, backbone H-bonds can be formed with A646 of the same 211 chain. It was documented that S2 domains alter their structure after FURIN site cleavage [31]. Therefore, 212 the mutation of D614 to G might weaken the stability of S2 and make cell entry more aggressive. It is 213 probably the loss of the H-bond between D614 (S1/Chain A) and T859 (S2/Chain B) that stops the hinging 214 of the S2 domain making it more flexible in the transition state when interacting with the host cell receptor. 215 Another possibility would be that the mutation to G and the loss of the H-bond to the adjacent chain made 216 the protein more flexible. A thermodynamic analysis showed that D614G mutation resulted in slightly 217 destabilizing the protein with a ΔΔG: -0.086 kcal/mol and in increasing in vibrational entropy ΔΔSVib 0.137 218 kcal.mol -1 .K -1 as seen in Fig 7 A where the red parts indicate more flexibility. Since this mutation will occur 219 . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ; https://doi.org/10.1101/2020.11.11.20229765 doi: medRxiv preprint render the FURIN cleavage site more accessible which is concomitant with the virulence of the D614G 221 mutation. Furthermore, the flexibility observed in the ribosomal binding domain region (Fig 7) may 222 facilitate the binding of ACE2 to the S protein [26]. 223 P314L mutation. ORF1a and ORF1b produce a set of non-structural proteins (nsp) which assemble to 224 facilitate viral replication and transcription (nsp7, nsp8 and nsp12) [32]. The nucleoside triphosphate (NTP) 225 entry site and the nascent RNA strand exit paths has positively charged aa, is solvent accessible, and is 226 conserved in SARS-CoV-2 (Fig 8). P314L mutation is positioned on the interface domain of the RdRp (or 227 nsp12) between A250-R365 residues. Previous studies have shown that the interface domain has functional 228 significance in the RdRp of Flavivirus. In addition, when polar or charged residue mutations were 229 introduced into these sites, viral replication levels were significantly affected [33]. Thus, mutations on 230 nsp12 interface residues may affect the polymerase activity and RNA replication of SARS-CoV-2. Proline 231 is often found in very tight turns in protein structures and can also function to introduce kinks into α-helices. 232 In Fig 9, we investigated the proposed intermolecular bonds that P314 can make, where the backbone COO -233 group of proline can form H-bonds with the backbone NH groups of T and S or the OHgroup of S side 234 chain. Whereas the pyrrolidine forms hydrophobic interactions with the W268 and F275. The mutation to 235 leucine tightens the structure and reduces the flexibility with an increase in ΔΔG: 0.717 kcal/mol and a 236 decrease in vibrational entropy to ΔΔSVib ENCoM: -0.301 kcal.mol -1 .K -1 (Fig 10). Furthermore, leucine 237 possesses a non-polar side chain, seldomly involving catalysis, which can play a role in substrate 238 recognition such as binding/recognition of hydrophobic ligands. L314 backbone COO-forms a H-bond 239 with the sidechain OH-group of S325, in addition to a hydrophobic interaction with W68 and L270. L270 240 is positioned on top of the flexible loop region, therefore forming a hydrophobic interaction, displacing any 241 possibility of water molecules entering the looped region thus making it more compact. The overall 242 improved stability of RdRp can make it more efficient in RNA replication and hence increasing SARS-243 CoV-2 virulence. 244 . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Our pairwise homoplasy index tests suggest that, among continental datasets, European and North 246
American sequences have shown evidence for the presence of recombination events (P-value = 0.001 and 247 0.007 respectively); while African, Oceanic, South American, and Asian datasets have shown no 248 recombination events till now. This again shows that the European and North American continents are at 249 higher risk of having super evolved viruses that can co-infect their hosts. Thus far, these recombination Admixture analysis has shown 7 different strains with differential segregation of alleles in SARS-CoV-2 isolates. 266 Upon constricting variants with strong LD, the proportional assignment did not change in Africa and Asia, whereas 267 it changed in Europe, North America, Oceania, and South America. In fact, proportions in Europe (C2 & C3) and 268 . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review) preprint
The copyright holder for this this version posted November 15, 2020. ;https://doi.org/10.1101https://doi.org/10. /2020 in each of these two continents. Presence of an un-admixed block pattern of strong LD between strains suggests that 270 LD sites are not broken by significant recombination seen in these two continents. This is either due to their physical 271 distance or natural selection. On the contrary, weak LD sites have shown clear admixture between strains. Strikingly, 272 each continent is dominated by different set of nonsynonymous clusters such as Africa by C4, Asia by C1, Europe 273 by C2, North America by C3, Oceania by C5, and South America by C2. This is also evident from the allele 274 frequency variation seen in each continent. 275 Further, continental-wise haplotype block estimation enabled us to identify variation of linked nonsynonymous and 276 synonymous sites in Asia, Europe, and North America. Although selection primarily acts on variation that undergo 277 amino acid change, many synonymous variants were observed in haplotype blocks. This suggests that these 278 synonymous sites hitch-hiked along with nonsynonymous variants due to their physical proximity. Another 279 observed interesting feature was the variation in the number of nonsynonymous sites between Asia, Europe, and 280 North America. Asian haplotype block carried three nonsynonymous variants, Europe haplotype block carried ten 281 nonsynonymous variants, and North American haplotype block carried seven nonsynonymous variants. This 282 suggests that the initial strain which originated and travelled from Asia had less functional sites whereas coinfection-283 led recombination in Europe and North America enriched functional sites in strains. 284 Our preliminary structural analysis of the European strain main mutations, D614G located in the spike gene 285 and P314L located in the RdRp gene, showed that the first mutation will render the FURIN cleavage site 286 more accessible while the latter would increase protein stability. 73% of the European samples have both 287 mutations segregating together; while in Africa only 11% of the sequenced viral samples have them. This 288 is probably the reason behind the elevated mortality rates in Europe. This points out to the fact that the virus 289 has evolved at an alarming rate by introducing two mutations that increase its chances of survival. The 290 European strain harbors additional mutations, notably the hotspot mutations R203K and G204R that cluster 291 in a serine-rich linker region at the RdRp. It was suggested that these mutations might potentially enhance 292 . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ; https://doi.org/10.1101/2020.11.11.20229765 doi: medRxiv preprint further exacerbates SARS-CoV-2 virulence. 294 We understand that other confounding factors like SARS-CoV-2 testing, socioeconomical status, the 295 availability of proper medical services, and the burden of other diseases are important contributors to the 296 disparities seen in mortality rates around the world. It is imperative that more comprehensive studies should 297 be conducted as more patient data emerges from different parts of the world. We also realize that our 298 analysis focuses only on the early spread samples and that we have to keep analyzing the viral sequences 299 to determine if recombination is occurring between other strains and identify any new mutations specially 300 since the world is currently fighting the second wave of the virus. Our data highlight the urgent need to 301 correlate patients' medical/infection history to the viral variants in order to predict in a more accurate and 302 personalized way how different viral strains are influencing this pandemic. 303 . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ;https://doi.org/10.1101/2020.11.11.20229765 doi: medRxiv preprint Flaviviridae Family. Int J Mol Sci. 201516: 12943-12957 CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ;https://doi.org/10.1101https://doi.org/10. /2020  . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ;https://doi.org/10.1101https://doi.org/10. /2020  CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ; https://doi.org/10.1101/2020.11.11.20229765 doi: medRxiv preprint . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ; https://doi.org/10. 1101/2020  CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ; https://doi.org/10.1101/2020.11.11.20229765 doi: medRxiv preprint . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ; https://doi.org/10.1101/2020.11.11.20229765 doi: medRxiv preprint CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ; https://doi.org/10.1101/2020.11.11.20229765 doi: medRxiv preprint . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ; https://doi.org/10.1101/2020.11.11.20229765 doi: medRxiv preprint . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ; https://doi.org/10.1101/2020.11.11.20229765 doi: medRxiv preprint with the side chain of S325 and forms a hydrophobic interaction with L270 which is at the curve of the loop 473 bringing making that region more compact. 474 . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ; https://doi.org/10.1101/2020.11.11.20229765 doi: medRxiv preprint interaction between L314 (L322 in structure) and hydrophobic residues in the moiety. 482 483 . CC-BY-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 15, 2020. ; https://doi.org/10. 1101/2020