Abstract
The world is experiencing one of the most severe viral outbreaks of recent years: the pandemic caused by SARS-CoV-2, the causative agent of COVID-19. As of December 10th 2021, the virus has spread worldwide, with more than 267 million confirmed cases (four times more than one year earlier) and more than 5 million deaths. A great effort has been undertaken to characterize the virus at the molecular level and to track the spread of different variants across the globe, with the aim of understanding their potential effects on transmission capability and fatality rates. Here we focus on the genomic diversity and distribution of the virus in the early stages of the pandemic, to better characterize the origin of COVID-19 and to define the geographical and temporal evolution of genetic clades. By performing a comparative analysis of 75401 reported SARS-CoV-2 sequences (as of December 2020), using as reference the first viral sequence reported in Wuhan in December 2019, we described the existence of 26538 genetic variants, the most frequent of which cluster into four major clades characterized by a specific geographical distribution. Notably, we found the most frequent variant, the previously reported missense p.Asp614Gly in the S protein, as a single mutation in only three patients, whereas in the large majority of cases it occurs together with three other variants, suggesting strong linkage and that this variant alone might not provide a significant selective advantage to the virus. Moreover, we evaluated the presence and distribution in our dataset of the mutations characterizing the so-called “British variant”, identified at the beginning of 2021, and observed that 9 out of 17 are present in only a few sequences, but never in linkage with each other, suggesting a synergistic effect in this new viral strain.
In summary, this is a large-scale analysis of deposited SARS-CoV-2 sequences, with a particular focus on the geographical and temporal evolution of genetic clades in the early phase of the COVID-19 pandemic.
Citation: Cairo A, Iorio MV, Spena S, Tagliabue E, Peyvandi F (2022) Worldwide SARS-CoV-2 haplotype distribution in early pandemic. PLoS ONE 17(2): e0263705. https://doi.org/10.1371/journal.pone.0263705
Editor: Yury E. Khudyakov, Centers for Disease Control and Prevention, UNITED STATES
Received: January 22, 2021; Accepted: January 25, 2022; Published: February 16, 2022
Copyright: © 2022 Cairo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting Information files.
Funding: This work was supported by the Italian Ministry of Health - Bando Ricerca Corrente (RC2020 to Dr. F. Peyvandi). Dr. Iorio MV is supported by the Young Researcher Grant from the Italian Ministry of Health (grant N.GR-2016-02361750) and by a Career Foundation Grant from Berlucchi Foundation.
Competing interests: The authors have declared that no competing interests exist.
Introduction
In the nineteen months since the declaration of the SARS-CoV-2 pandemic [1], the scientific community has been struggling to understand the complexity of this novel coronavirus of still-debated zoonotic origin [2], its clinical symptoms [3, 4], the risk factors and the potential treatments, in the urgent effort to contain the infection, predict potentially serious disease outcomes and find a cure. In 2021, several SARS-CoV-2 vaccines were developed and approved, and several others are still in clinical trials [5]. The circulation of SARS-CoV-2 before the pandemic declaration has been ascertained, and several efforts have been made to track its worldwide spread and the genetic changes that gave rise to different viral strains.
The first case of COVID-19 was reported in Wuhan, China, in December 2019. Despite the relatively low mortality (approximately 2% on average worldwide, as of December 10th 2021, https://ourworldindata.org/mortality-risk-covid#the-case-fatality-rate) and the high percentage of asymptomatic or pauci-symptomatic subjects (over 80%), the outbreak caused a dramatic collapse of the health care system in the hardest-hit countries, such as Italy, where the mortality rate exceeded 14% between May and August 2020.
In the urgent race to find effective drugs and reduce complications, it was immediately clear that a deeper understanding of the genomic diversity of this virus was crucial. Indeed, the existence of different strains and their temporal and geographical distribution could provide relevant information on how the virus spread all over the world, on the possible acquisition of selective advantages, and on the most conserved sequences suitable for vaccine design.
Since the release of the first complete genomic sequence of SARS-CoV-2 on January 5th 2020 [1, 2], thousands of additional sequences have been deposited. Different virus strains have been reported [6], together with the accumulation of recurrent variants [7, 8]. Here, we evaluated the occurrence of different SARS-CoV-2 clades and their geographical distribution starting from 75401 sequences deposited in two major available databases, GISAID and NCBI, from the beginning of the COVID-19 pandemic to September 2020.
Material and methods
Data acquisition
The VCF file containing all variants identified in 77648 SARS-CoV-2 genomes, together with the related information on geographical area, date of sampling and genome length, was downloaded on September 29th 2020 from http://covseq.baidu.com/. Viral sequences were collected from the major available databases: GISAID (74303) and NCBI (3345). Genomes without a detailed date of sampling or geographical localization, or with a sequence shorter than 29000 nucleotides, were excluded. After filtering, 75401 viral genomes were included in the analysis. The list of all analyzed sequences with the collection dates is available in S1 Table.
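The exclusion criteria above can be illustrated with a minimal sketch in plain Python. It is not the authors' code: the record layout, the field names (`collection_date`, `country`, `length`) and the toy accessions are assumptions made for illustration, and a "detailed" date is assumed here to mean a complete year-month-day string.

```python
from datetime import datetime

MIN_GENOME_LENGTH = 29000  # threshold stated in the text


def has_full_date(text):
    """Return True only for a complete YYYY-MM-DD sampling date."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


def passes_filters(record):
    """Keep genomes with a full date, a location and at least 29000 nt."""
    return (
        has_full_date(record.get("collection_date"))
        and bool(record.get("country"))
        and record.get("length", 0) >= MIN_GENOME_LENGTH
    )


# Toy metadata; accessions, field names and values are hypothetical.
records = [
    {"id": "seq_A", "collection_date": "2020-03-14", "country": "country_X", "length": 29903},
    {"id": "seq_B", "collection_date": "2020-03", "country": "country_X", "length": 29880},
    {"id": "seq_C", "collection_date": "2020-04-02", "country": "", "length": 29850},
    {"id": "seq_D", "collection_date": "2020-05-20", "country": "country_Y", "length": 28500},
]

kept = [r["id"] for r in records if passes_filters(r)]
print(kept)
```

In this toy input only `seq_A` satisfies all three criteria; the others fail on date precision, missing location and genome length, respectively.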
Sequence and mutational analysis
The reference sequence used for our analysis is NC_045512.2 (NCBI RefSeq). Multiallelic variants were split using Bcftools [9]. Variants were annotated using SnpEff v.4.5covid19 [10]. Data manipulation and descriptive analysis were performed using the Python package pandas [11]. Viral genomes and their variants were organized in a binary matrix. Clades and haplotypes were identified using only the variants found in at least 1000 viral sequences. We created a sub-dataset containing the 53 variants present in at least 1000 viral sequences (S1 Table), and the different haplotypes were identified and counted using the pandas groupby function (S2 Table). The sequences were classified into clades and derived haplotypes based on the mutational distribution. The frequency of each variant in all countries was calculated (S3 Table) and represented by clustering analysis and the corresponding heatmap, which were produced with R 4.0.0. Only countries with at least 50 viral sequences were included in the clustering analysis.
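As an illustration of the haplotype-counting step, the following sketch builds a small binary genome-by-variant matrix, keeps only the frequent variants and counts identical presence/absence patterns with pandas `groupby`. It is a sketch under stated assumptions, not the authors' script: the genome labels, the matrix values and the toy threshold of 2 (in place of 1000) are invented for the example.

```python
import pandas as pd

# Toy binary matrix: rows are genomes, columns are variant positions,
# 1 = variant present. Genome labels and values are made up.
matrix = pd.DataFrame(
    {
        "241":   [1, 1, 1, 0, 1, 1],
        "3037":  [1, 1, 1, 0, 1, 1],
        "14408": [1, 1, 1, 0, 1, 1],
        "23403": [1, 1, 1, 0, 1, 1],
        "28881": [1, 0, 1, 0, 0, 1],
        "313":   [0, 0, 0, 0, 1, 0],
    },
    index=["g1", "g2", "g3", "g4", "g5", "g6"],
)

# Keep only variants seen in at least MIN_COUNT genomes
# (the paper uses 1000; 2 suits this toy example).
MIN_COUNT = 2
frequent = matrix.loc[:, matrix.sum(axis=0) >= MIN_COUNT]

# Each distinct row pattern over the frequent variants is one haplotype;
# count how many genomes share each pattern.
haplotype_counts = (
    frequent.groupby(list(frequent.columns))
    .size()
    .sort_values(ascending=False)
)
print(haplotype_counts)
```

Grouping on all retained columns at once treats the full presence/absence pattern as the haplotype key, so genomes that differ only in a rare variant (here the column `313`) fall into the same haplotype.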
Results and discussion
The geographical distribution of available sequences is reported in S1 Fig. Almost 58% of all available sequences were obtained from European centers, principally in the United Kingdom, followed by North America (~20%), Oceania (~11%) and Asia (~8%). The samples were collected from December 2019 to August 17th 2020, at different times across different countries. Looking at their temporal distribution, it is worth noting that a few early sequences (January 2020) were obtained in the United States (January 19th 2020), Australia (January 22nd 2020) and Northern Europe (January 23rd 2020) (S2 Fig).
China represents the country with the highest percentage of unmutated viral genomes (3%) (Table 1).
The distribution of unmutated viral sequences shows that the USA, Northern Europe, Australia, South Africa, Brazil, Canada and India were directly involved in the spread of the pandemic after China (Fig 1). By analyzing the minimum number of variants per sequence in each country, we could track the mutational evolution of the virus, its geographical distribution and the timing of its circulation. Importantly, this approach allowed us to hypothesize that some countries severely hit by COVID-19, such as Italy and France, were characterized by the spread of already mutated forms of the virus (Fig 1).
A total of 26539 variants were identified (S4 Table): 57% missense, 28% synonymous, 7% insertions/deletions, 2% stop variants and 7% localized in the untranslated regions (4% in the 3’UTR and 3% in the 5’UTR) (S3A Fig). The number of variants per sequence ranges mainly from 0 to 20; the majority of sequences carried 7 to 9 variants, and only a few (431) had more than 20 variants (S3B Fig). A plausible hypothesis is that, whereas a few variants may have provided favorable features such as improved infectivity, a higher number of variants did not result in a selective advantage. Variants present in at least 1000 sequences are reported in Table 2 (N = 53). As expected, nonsense mutations were not found among the most frequent variants.
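The per-country minimum described above can be sketched as follows. This is a hedged illustration only: the country labels and variant counts are invented placeholders, not values from the dataset, and the threshold "minimum greater than zero" is an assumption about how "already mutated" could be operationalized.

```python
# (country, number of variants relative to the reference) for each genome.
# Labels and counts are hypothetical placeholders.
genomes = [
    ("country_A", 0), ("country_A", 2),
    ("country_B", 0), ("country_B", 5),
    ("country_C", 3), ("country_C", 4),
    ("country_D", 2),
]

# Track the smallest variant count observed in each country.
min_variants = {}
for country, n_variants in genomes:
    current = min_variants.get(country)
    if current is None or n_variants < current:
        min_variants[country] = n_variants

# Countries whose least-mutated genome already differs from the reference
# are candidates for having been seeded by mutated forms of the virus.
seeded_by_mutated = sorted(c for c, n in min_variants.items() if n > 0)
print(min_variants)
print(seeded_by_mutated)
```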
Only the variants identified in at least 1000 viral sequences are reported.
Tracing the mutational evolution of the viral genome has been crucial to evaluate potential functional consequences and to obtain an efficient vaccine, which should preferentially target the more evolutionarily conserved regions. Our analysis underlines that, despite the high number of variants, the regions coding for the polyprotein ORF1ab, the spike (S) and the membrane (M) proteins are the most conserved (S3C Fig). This is not surprising, since the most evolutionarily conserved regions usually encode essential proteins. However, we confirmed the missense variant p.Asp614Gly in the S protein as the most frequent (found in 57820 sequences, 77% of the total) (Table 2), as described in previous studies [7, 8, 12, 13]. This variant has been shown to improve the binding affinity of the S protein to the human ACE2 receptor, reported as the main entry site of the virus into human cells [14], through the cleavage of the S1/S2 domain [15]. Indeed, it was previously shown that SARS-CoV infection can be enhanced by exogenous proteases.
It has been demonstrated that the p.Asp614Gly variant confers higher infectivity [16], competitive fitness and improved transmission in human primary cells and animal models [17]. However, the variant does not seem to be associated with an increased disease severity [16], despite what was initially inferred from the correlation of mortality rate and prevalence of p.Asp614Gly in different countries [8, 18].
Given its pivotal role in receptor binding, membrane fusion and cell infectivity, its capability to induce neutralizing antibodies and T cell immune responses, and the relatively low variability reported so far, the S protein has been considered a valid target for the development of SARS-CoV-2 vaccines [19–21].
It is relevant to notice that, focusing on the 53 most frequent variants, we found the p.Asp614Gly variant as a single mutation in three patients only, suggesting that this variant alone does not provide a significant selective advantage to the virus. Indeed, in the large majority of cases p.Asp614Gly occurs concomitantly with other variants, in particular with three variants, all characterized by similar frequency and located in the ORF1ab gene: the missense p.Pro4715Leu, the synonymous p.Phe924Phe and the c.-25C>T in the 5’ untranslated region. These four variants are indeed in strong linkage and constitute the most common clade (clade 1). We identified 4 different clades and evaluated their temporal occurrence, geographical distribution and potential connections (Table 3).
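The co-occurrence check described above can be sketched in a few lines of Python. The clade 1 positions (241, 3037, 14408, 23403) are those listed for the clade 1 core haplotype elsewhere in the text; the genome labels and their variant sets are toy data invented for illustration, and the code is a sketch rather than the authors' procedure.

```python
# Genomic positions of the four clade 1 variants; 23403 is p.Asp614Gly.
CLADE1 = frozenset({241, 3037, 14408, 23403})
D614G = 23403

# Each genome mapped to the set of frequent variants it carries (toy data).
genomes = {
    "g1": {241, 3037, 14408, 23403},
    "g2": {241, 3037, 14408, 23403, 28881, 28882, 28883},
    "g3": {23403},
    "g4": {8782, 28144},
    "g5": {241, 3037, 14408, 23403, 25563, 1059},
    "g6": set(),
}

# Genomes carrying p.Asp614Gly at all.
carriers = [variants for variants in genomes.values() if D614G in variants]

# Carriers in which it is the only frequent variant.
alone = sum(1 for variants in carriers if variants == {D614G})

# Carriers that also hold the other three clade 1 variants.
with_full_core = sum(1 for variants in carriers if CLADE1 <= variants)

fraction_linked = with_full_core / len(carriers)
print(alone, with_full_core, fraction_linked)
```

A fraction close to 1 for `fraction_linked`, combined with a very small `alone` count, is the pattern the text interprets as strong linkage among the four variants.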
The variants characterizing clade 1 have been identified in 55582 sequences, 74% of the total. This clade, first reported in the UK in January 2020, has spread worldwide and particularly to European countries, where it represents 78% of sequences in (1059 cases), to North and South America (73% and 77%, respectively), Africa (77%) and Oceania (73%) (S4 Fig). Interestingly, this core is present in approximately 93% (201 of 215) of Italian cases.
Clade 1 has likely generated two subclades, subclade_1A and subclade_1B, characterized by a different geographical frequency (Fig 2): subclade_1A is present mainly in Europe (41%), Oceania (52%) and South America (50%), whereas subclade_1B is present mainly in North America (51%) (Table 3).
A visual representation of the spread of the different variants and their clades is provided by the hierarchical cluster analysis reported in Fig 3. The plot highlights the co-presence of the four most common variants constituting clade 1 and their worldwide spread. Additionally, it shows that the three variants characterizing subclade_1A (28881, 28882, 28883) and the two variants specific to subclade_1B (25563, 1059) have an almost mutually exclusive geographical localization.
Only countries with at least 50 viral sequences were included in the clustering analysis.
Clade 2, characterized by two different variants (8782 and 28144), is present in North America, some European countries such as Spain, some regions of Asia (China, Kazakhstan and Thailand), Africa (Ghana and Nigeria) and Australia (Fig 3 and S5 Fig).
Clade 3 (11083, 14805, 26144) and clade 4 (1440) appear to be more common in Europe. Despite the worldwide distribution of clade 1 variants, the case of South Korea is remarkable: it is characterized by an almost exclusive presence of clade 3 and, at lower frequency, of clade 2 (Fig 3). In addition, some regions are characterized by a high frequency of different variants not related to the four major clades. In Singapore, 58% of sequences present a specific pattern of 5 variants (11083, 28311, 13730, 23929, 6312); Spain and Kazakhstan curiously share similar virus strains characterized by 4 variants (28863, 9477, 25979, 28657), observed in 28% and 35% of the sequences, respectively. Likewise, 47% of the Australian sequences are characterized by the presence of 6 variants (7540, 23401, 22992, 16647, 18555, 1163) (Fig 3). The 22992 missense variant, leading to a S477A amino acid change, has been associated with a stronger transmission capacity [22] and higher infectivity, perhaps due to an improved binding of the S protein RBD (receptor-binding domain) to the ACE2 receptor [23]. Other sporadic mutations are related to a specific geographical area, such as variant 313, identified in 37% of Japanese sequences, and variants 18877 and 29829, observed in 69% of sequences from Saudi Arabia and 90% of sequences from the Netherlands, respectively. These four identified clades constitute the core of 1213 different haplotypes, of which 548 are unique, whereas the remaining ones occurred in more than one patient. We report the 20 most frequent haplotypes according to the number of cases (Table 4).
In bold the variants characterizing Clade 1, in italics Clade 2, underlined Clade 3 and double underlined Clade 4.
The geographical distribution of haplotypes 1, 2 and 3 is described in S6 Fig. Haplotype 1 (241, 3037, 14408, 28881, 28882, 28883) is more common in South America and Europe, haplotype 2 (23403, 14408, 3037, 241, 25563, 1059) has mainly spread to North America, and haplotype 3 (241, 3037, 14408, 23403) to Africa and Europe. Haplotype 3 is present in 48% of Italian sequences, and haplotype 1 in 36%.
At the beginning of 2021, a new viral strain characterized by 23 variants (6 synonymous and 17 non-synonymous) and associated with a high diffusion capacity was reported in the United Kingdom [24]. One of these mutations, p.Asn501Tyr, located in the RBD of the S protein, was previously reported to be an adaptive mutation in a mouse model subjected to serial passages of a human SARS-CoV-2 by means of intranasal inoculations [25]. This mutation seems to favor virus binding to the human ACE2 receptor, thus leading to a phenotype of increased virulence in mice [23]. However, the p.Asn501Tyr variant and 8 others of the 17 variants (3267C>T p.Thr1001Ile, 6954T>C p.Ile2230Thr in the ORF1ab gene; 23063A>T, 23271C>A p.Ala570Asp, 23604C>A p.Pro681His, 23709C>T p.Thr716Ile in the S gene; 27972C>T p.Gln27*, 28048G>T p.Arg52Ile in the Orf8 gene; 28977C>T p.Ser235Phe in the N gene) were observed in only a few sequences in our dataset, and never in linkage with each other, suggesting a synergistic effect of the variants in this new viral strain (S3 Table). In particular, p.Asn501Tyr was observed in 2 sequences at the beginning of April, one in Brazil and the other in the USA. Later, starting from June 2020, the same mutation was identified in 34 Australian sequences.
In conclusion, our analysis emphasizes that the most frequent SARS-CoV-2 variants in the early phases of the COVID-19 pandemic clustered into four major clades, characterized by different geographical localization, likely reflecting a temporal and spatial spread of virus strains. Moreover, we determined that the most frequent variant, the missense variant p.Asp614Gly in the S protein, is in strong linkage with three additional variants, suggesting a potential functional cooperation in providing a selective advantage to the virus.
This study has limitations, such as the fact that most sequences in the first year of the COVID-19 pandemic were reported by the UK, and the unavailability of clinical data to correlate the different variants with disease outcome. Nevertheless, our analysis provides a portrait of the temporal and spatial spread of SARS-CoV-2 during the first phases of the pandemic.
During the last year, several novel virus variants have been sequenced. In particular, the World Health Organization (WHO) identified variants of concern (VOCs), which show increased virulence and transmissibility or affect the efficacy of diagnosis, treatment or vaccines [26]. The most recently described VOC, the so-called Omicron, was first documented in November 2021 in South Africa, where it has been responsible for the fourth wave of COVID-19. This variant has raised concerns about its rapid spread and potential escape from vaccines. According to still preliminary data, however, Omicron seems to have increased transmissibility but decreased severity [27]. Further studies are needed to verify whether vaccines retain their efficacy against this variant as well.
Considering the continuous circulation of the virus and the emergence of new SARS-CoV-2 variants, it is extremely relevant to associate the different haplotypes with a specific disease outcome and to understand whether or not the acquired mutations have functional consequences in terms of infectivity, clinical severity and potential responsiveness to specific treatments and to vaccines.
Conclusions
By performing a large-scale analysis of the SARS-CoV-2 sequences reported in the first year of the COVID-19 pandemic, we described the existence of 26538 genetic variants, the most frequent of which cluster into four major clades characterized by a specific geographical distribution. We also found that the most frequent variant, the missense p.Asp614Gly in the S protein, occurs almost exclusively together with three other variants, suggesting strong linkage and a potential cooperative effect on virus fitness.
Supporting information
S1 Fig. Worldwide distribution of available genomic sequences.
The number and the percentage of sequences reported in each continent and country are indicated in panel (a) and (b), respectively. Only countries with more than 200 sequences are shown.
https://doi.org/10.1371/journal.pone.0263705.s001
(PDF)
S2 Fig. Geographical distribution of the collection dates of the first sequenced sample in each country.
https://doi.org/10.1371/journal.pone.0263705.s002
(PDF)
S3 Fig. Mutational spectrum of SARS-CoV-2 genome.
(a) Type of variants are listed with the corresponding number and frequency. (b) Variant number distribution. The bar graph shows the amount of the reported sequences with the same number of variants. (c) Frequency of variants with a direct effect on the virus’ proteins (missense, nonsense and insertion/deletion). Variants are listed with the corresponding number and frequency of reported mutations.
https://doi.org/10.1371/journal.pone.0263705.s003
(PDF)
S4 Fig. Frequency and geographical distribution of clade 1.
https://doi.org/10.1371/journal.pone.0263705.s004
(PDF)
S5 Fig. Frequency and geographical distribution of clade 1-derived haplotypes 1–3.
https://doi.org/10.1371/journal.pone.0263705.s005
(PDF)
S6 Fig. Frequency and geographical distribution of clade 2.
https://doi.org/10.1371/journal.pone.0263705.s006
(PDF)
S1 Table. List of all analyzed sequences with collection dates and presence of the most frequent variants.
https://doi.org/10.1371/journal.pone.0263705.s007
(XLSX)
S2 Table. Haplotypes identified starting from the 53 most frequent variants.
https://doi.org/10.1371/journal.pone.0263705.s008
(XLSX)
S3 Table. Frequency of each variant in all countries.
https://doi.org/10.1371/journal.pone.0263705.s009
(XLSX)
References
- 1. Wu F, Zhao S, Yu B, Chen JM, Wang W, Song Z-G, et al. A new coronavirus associated with human respiratory disease in China. Nature. 2020;579(7798):265–269. pmid:32015508
- 2. Zhou P, Yang X-L, Wang X-G, Hu B, Zhan L, Zhang W, et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature. 2020;579(7798):270–273. pmid:32015507
- 3. Huang C, Wang Y, Li X, Ren L, Zhao J, Hu Y, et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet. 2020;395(10223):497–506. pmid:31986264
- 4. Panigada M, Bottino N, Tagliabue P, Grasselli G, Novembrino C, Chantarangkul V, et al. Hypercoagulability of COVID-19 patients in intensive care unit: A report of thromboelastography findings and other parameters of hemostasis. J Thromb Haemost. 2020. pmid:32302438
- 5. Creech CB, Walker SC, Samuels RJ. SARS-CoV-2 vaccines. JAMA. 2021;325(13):1318–1320. pmid:33635317
- 6. Van Dorp L, Acman M, Richard D, Shaw LP, Ford CE, Ormond L, et al. Emergence of genomic diversity and recurrent mutations in SARS-CoV-2. Infect Genet Evol. 2020;83: 104351. pmid:32387564
- 7. Koyama T, Platt D and Parida L. Variant analysis of SARS-CoV-2 genomes. Bull World Health Organ. 2020;98:495–504. pmid:32742035
- 8. Toyoshima Y, Nemoto K, Matsumoto S, Nakamura Y and Kiyotani K. SARS-CoV-2 genomic variation associated with mortality rate of COVID-19. Journal of Human Genetics. 2020;65:1075–1082. pmid:32699345
- 9. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009; 25(16):2078–9. pmid:19505943
- 10. Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly(Austin) 2012;6(2):80–92. pmid:22728672
- 11. McKinney W, & others. (2010). Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference (Vol. 445, pp. 51–56).
- 12. Gudbjartsson DF, Helgason A, Jonsson H, Magnusson OT, Melsted P, Norddahl GL, et al. Spread of SARS-CoV-2 in the Icelandic population. N. Engl. J. Med. 2020;382: 2302–2315. pmid:32289214
- 13. Biswas NK, Majumder PP. Analysis of RNA sequences of 3636 SARS-CoV-2 collected from 55 countries reveals selective sweep of one virus type. Indian J. Med. Res. 2020;151 (5):450–458. pmid:32474553
- 14. Hoffmann M, Kleine-Weber H, Schroeder S, Krüger R, Herrler T, Erichsen S, et al. SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor. Cell. 2020;181(2):271–280. pmid:32142651
- 15. Bhattacharyya C, Das C, Ghosh A, Singh AK, Mukherjee S, Majumder PP, et al. Global spread of SARS-CoV-2 subtypes with Spike Protein Mutation D614G is Shaped by Human Genomic Variations that Regulate Expression of TMPRSS2 and MX1 Genes. Preprint https://doi.org/10.1101/2020.05.04.075911 (2020).
- 16. Korber B, Fischer WM, Gnanakaran S, Yoon H, Theiler J, Abfalterer W, et al. Wyles, Sheffield COVID-19 Genomics Group, Tracking changes in SARS-CoV-2 spike: Evidence that D614G increases infectivity of the COVID-19 virus. Cell. 2020;182:812–827.e19. pmid:32697968
- 17. Hou YJ, Chiba S, Halfmann P, Ehre C, Kuroda M, Dinnon KH 3rd, et al. SARS-CoV-2 D614G variant exhibits efficient replication ex vivo and transmission in vivo. Science. 2020 Nov 12:eabe8499. pmid:33184236
- 18. Becerra-Flores M, Cardozo T. SARS-CoV-2 viral spike D614 mutation exhibits higher case fatality rate. Int J Clin Pract. https://doi.org/10.1111/ijcp.13525 (2020)
- 19. Wang N, Shang J, Jiang S and Du L. Subunit Vaccines Against Emerging Pathogenic Human Coronaviruses. Front. Microbiol., 28 February 2020 pmid:32265848
- 20. Dong Y, Dai T, Wei Y, Zhang L, Zheng M and Zhou F. A systematic review of SARS-CoV-2 vaccine candidates. Signal Transduction and Targeted Therapy. 2020;5: 237. pmid:33051445
- 21. Krammer F. SARS-CoV-2 vaccines in development. Nature. 2020;586,516–527. pmid:32967006
- 22. Chen J, Wang R, Wang M and Wei GW. Mutations strengthened SARS-CoV-2 infectivity. J Mol Biol 2020;432:5212–5226. pmid:32710986
- 23. Zahradník J, Marciano S, Shemesh M, Zoler E, Chiaravalli J, Meyer B, et al. SARS-CoV-2 RBD in vitro evolution follows contagious mutation spread, yet generates an able infection inhibitor. bioRxiv preprint. https://doi.org/10.1101/2021.01.06.425392
- 24. European Centre for Disease Prevention and Control. Rapid increase of a SARS-CoV-2 variant with multiple spike protein mutations observed in the United Kingdom – 20 December 2020. ECDC: Stockholm; 2020.
- 25. Gu H, Chen Q, Yang G, He L, Fan H, Deng YQ, et al. Adaptation of SARS-CoV-2 in BALB/c mice for testing vaccine efficacy. Science. 2020;369(6511):1603–1607. pmid:32732280
- 26. Han X and Ye Q. The variants of SARS-CoV-2 and the challenges of vaccines. J Med Virol 2021. Dec 9. pmid:34890492
- 27. Maslo C, Friedland R, Toubkin M, Laubscher A, Akaloo T, Kama B. Characteristics and Outcomes of Hospitalized Patients in South Africa During the COVID-19 Omicron Wave Compared With Previous Waves. JAMA. 2021 Dec 30. pmid:34967859