Whole-genome sequencing of SARS-CoV-2 reveals the detection of G614 variant in Pakistan

Since its emergence in China, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has spread worldwide including Pakistan. During the pandemic, whole genome sequencing has played an important role in understanding the evolution and genomic diversity of SARS-CoV-2. Although an unprecedented number of SARS-CoV-2 full genomes have been submitted in GISAID and NCBI, data from Pakistan is scarce. We report the sequencing, genomic characterization, and phylogenetic analysis of five SARS-CoV-2 strains isolated from patients in Pakistan. The oropharyngeal swabs of patients that were confirmed positive for SARS-CoV-2 through real-time RT-PCR at National Institute of Health, Pakistan, were selected for whole-genome sequencing. Sequencing was performed using NEBNext Ultra II Directional RNA Library Prep kit for Illumina (NEW ENGLAND BioLabs Inc., MA, US) and Illumina iSeq 100 instrument (Illumina, San Diego, US). Based on whole-genome analysis, three Pakistani SARS-CoV-2 strains clustered into the 20A (GH) clade along with the strains from Oman, Slovakia, United States, and Pakistani strain EPI_ISL_513925. The two 19B (S)-clade strains were closely related to viruses from India and Oman. Overall, twenty-nine amino acid mutations were detected in the current study genome sequences, including fifteen missense and four novel mutations. Notably, we have found a D614G (aspartic acid to glycine) mutation in spike protein of the sequences from the GH clade. The G614 variant carrying the characteristic D614G mutation has been shown to be more infectious that lead to its rapid spread worldwide. This report highlights the detection of GH and S clade strains and G614 variant from Pakistan warranting large-scale whole-genome sequencing of strains prevalent in different regions to understand virus evolution and to explore their genetic diversity.

Introduction Since its emergence in Hubei province, China, in December 2019, the highly contagious severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has spread across the world, causing more than 70 million cases and 1.6 million deaths till December 13, 2020 [1]. The first case of SARS-CoV-2 infection in Pakistan was reported on February 26, 2020 when a traveler returned from Iran tested positive for the virus [2]. Sustained local transmission has resulted in 432,327 cases and 8,653 deaths in the country as of December 10, 2020 (http://covid.gov.pk/ stats/pakistan).
The virus is a member of Coronaviridae family, genus Betacoronavirus, and has a positive-sense single-stranded RNA genome. The genome is about 30kb and encodes sixteen nonstructural proteins (NSP1-NSP16) and four structural proteins (nucleocapsid, envelop, membrane, and spike glycoprotein). The spike (S) protein is involved in SARS-CoV-2 entry into host cells through binding with angiotensin-converting enzyme 2 (ACE2) receptor. The receptor binding domain (RBD) in the S protein interacts with ACE2 receptor that eventually leads to the fusion of virus to host cell membranes [3]. Recently, the detection of a D614G mutation in the S protein of SARS-CoV-2 and subsequent global spread has received tremendous attention. The D614G mutation has been shown to be present in the S2 domain of spike protein and is critical for cleavage of S1 facilitating the fusion of virus with host cell membrane [4,5]. However, the role of D614G mutation in the emergence of a more transmissible variant of SARS-CoV-2 was first highlighted by Korber et al. through the development of an analysis pipeline [4]. Since the initial report, several studies on the role of D614G amino acid change in enhanced infectivity of SARS-CoV-2 have been documented [6][7][8]. The G614 variant which arose in Europe during February 2020, rapidly spread to other parts of the world occurring in over 74% of all published sequences by June 2020 [9].
As per GISAID nomenclature, the SARS-CoV-2 sequences have been grouped into 6 major clades i.e. L (the clade harboring the Wuhan reference strain), G, GH, GR, S, & V. Genomic epidemiology of SARS-CoV-2 has shown that L and S clades emerged during the early phase of pandemic (January-February) and rapidly declined afterward. In Asia, the L clade was found at a frequency of 48% in January and 2% in July whereas the S clade which was commonly circulating in January (31%), almost disappeared after August. Unlike the L and S clades, the G clade sequences were reported at low frequency during January-February 2020 and progressively increased with a peak in June (25%), following with constant detection till December 2020 (17%). The G clade along with GH and GR started emerging on a global scale during March 2020. The GR clade was found at the highest frequency (49%) in September 2020 and GH (45%) in December 2020 (https://www.gisaid.org/epifluapplications/phylodynamics/).
In case of emerging viruses, genomic epidemiology has proven to be a useful tool for investigating the outbreak and tracking virus evolution and spread [10][11][12]. During SARS-CoV-2 pandemic, genomic epidemiology has been used to track the nature of transmission of virus in several countries and for investigating the introduction of multiple importations [13][14][15][16][17][18][19][20]. To understand SARS-CoV-2 evolution and genetic diversity of circulating strains, transmission patterns, and detection of variants such as G614, whole-genome sequencing is imperative. Although more than 150,000 SARS-CoV-2 genome sequences have been deposited in the GISAID as of December 02, 2020, data from Pakistan is limited (n = 10) (https://www.gisaid. org/). We hereby report the complete genome sequences and phylogenetic analysis of five SARS-CoV-2 strains detected in Pakistan.

Sample selection
The oropharyngeal swabs of five patients that were confirmed positive for SARS-CoV-2 through real-time RT-PCR using commercially available kit genesig1 Real-Time PCR Coronavirus COVID-19 (Primerdesign Ltd., UK) at the Department of Virology, National Institute of Health, Pakistan, were selected for whole-genome sequencing. The rationale for the selection of these samples was quality (RNA concentration of >10 ng/ul) and quantity (>1.5 ml volume) of the samples and low cycle threshold (Ct) values on real-time RT-PCR. Written informed consent was obtained from all study participants and the study design was approved by the Internal Review Board of the National Institute of Health (VIR-PHLD/IRB-2020/05).

RNA extraction and next generation sequencing
The RNA was extracted from oropharyngeal swabs using the QIAamp Viral RNA Mini kit (Qiagen, Hilden, Germany) according to the manufacturer's instructions. The NEBNext Ultra II Directional RNA Library Prep kit for Illumina (NEW ENGLAND BioLabs Inc., MA, US) was used for library preparation and whole-genome sequencing was performed on Illumina iSeq 100 instrument (Illumina, San Diego, US).

Genome assembly
Genomes were assembled using in-house scripts which iteratively mapped reads (3x) with bwa-mem to the WIV04 (MN996528) reference and called the consensus genomes with Geneious (version 10.1.64, gaps filled with reference/intermediate sequences). Final consensus sequences were called with samtools mpileup (-A -d 6000000 -B -Q 0) and piping the output to iVar consensus (-m 2 -n N). Briefly, the samtools pileup command generates a pileup of alleles from the bam files using these parameters: include orphan read pairs (-A), set the maximum depth of reads at each position to 6000000 (-d 6000000), disable base alignment quality (BAQ) computation (-B), minimum base quality for mapping is 0 (-Q 0). The iVar consensus command was run with the following parameters: minimum quality (Default: 20), minimum frequency threshold (Default: 0.03), minimum depth to call a consensus (-m 2), and a character to call in regions with coverage lower than the specified minimum depth (-n N). Minor viral variants were assessed using bam files using the same parameters (samtools mpileup -A -d 6000000 -B -Q 0 piped to ivar variants (-q 20 (default) -t 0.03 (default) -m 2). The significance of minor viral variants was assessed using the Fisher Exact test in iVar to evaluate whether the frequency of the alternate allele is significantly higher than the mean error rate at that position.
These subsampled datasets were combined with all available complete genomes (with less than 1% of ambiguous bases) from surrounding geographic regions: India, Iran, Oman, and available sequences from Pakistan (no complete genomes were available from Afghanistan, Tajikistan, the Xinjiang and Tibet provinces of China) (n = 3625 genomes). Genomes (n = 4176) were aligned with mafft/7.450 and phylogentic relationships were inferred using IQtree/1.6.8 with 1000 bootstrap iterations (-bb 1000 -m GTR+G4). Trees were visualized with the ggtree package from R. The whole genomes sequenced in the current study have been submitted to the GISAID and have the following accession ID's: EPI_ISL_468159 to EPI_ISL_468163.

Results
In Pakistan, the first confirmed case of CVOID-19 was reported on 26 th February, 2020. Till March, a total of 2021 cases and 26 deaths were reported. The number of cases increased rapidly in April (n = 14,778) and May (n = 52,679). However, the peak of pandemic was witnessed in June with 141,010 cases. Since June, Pakistan saw a decline in COVID-19 cases (July n = 65,676; August n = 17,003; September n = 16,657 and October n = 21,164). The total deaths due to COVID-19 in Pakistan were n = 359 in April, n = 1184 in May, n = 2852 in June, n = 1575 in July, n = 328 in August, n = 186 in September, and n = 339 in October (Fig 1).
The clinical and demographic details of the patients enrolled in current study along with real-time PCR cycle threshold (Ct) values are shown in Table 1. These five oropharyngeal samples had the lowest Ct values among the available samples and full genome of SARS-CoV-2 was successfully obtained from these. The sequencing stats of Pakistani strains discussed in the current study are shown in Table 2.

Discussion
As of December 25, 2020, the 22,269 whole-genome sequences of SARS-CoV-2 available on the GISAID from Asia showed that the majority of these strains belong to the GR clade, followed by G and GH. In the present study, three strains (NIH-45090, NIH-45143, and NIH-45579) belong to the GH clade, while two strains (NIH-HAS001 and NIH-44905) clustered in S clade which is very rare in Asia. The global distribution of these clades between January-December 2020 showed that GH was found at low frequency during the start of the pandemic (January-February 2020). However, increased frequencies were reported in March (21%), April (26%), and May (30%). The GH clade strains peaked in June (31%) in agreement with our findings and showed constant presence during July-October 2020. A second rise in infections due to GH clade was observed during November-December 2020 (35% and 45% respectively). On the contrary, S clade was dominant during January-February 2020 (42% and 30% respectively) and since then has shown a rapid decline and was reported with a frequency of 1% in November 2020 (https://www.gisaid.org/epiflu-applications/phylodynamics/). According to the Nextstrain classification and phylogenetic analysis, Pakistani strains (NIH-45090, NIH-45143, and NIH-45579) of 20A clade were closely related to the strains from Oman, Slovakia, United States, and indigenous strain EPI_ISL_513925. Similarly, the Pakistani strains of 19B clade were closely related to the strains from India and Oman. We assume multiple introductions of SARS-CoV-2 strains into Pakistan particularly through international travelers. Eight independent events of importation were observed (based on the analysis of available sequence data from Pakistan) mainly from the neighboring countries especially Oman where a large proportion of Pakistani people are settled and employed for years. These findings suggest that the transmission of SARS-CoV-2 in Pakistan was influenced by very frequent and recurrent introductions through international travelers that lead to the circulation of different clades over time. Although the interventions at a national scale such as bans on indoor and outdoor gatherings may help to restrict and combat the community transmission chains, the frequent introductions through international travelers have also impacted the transmission dynamics. The limited sample size and genomic information available warrants further epidemiological and genomic surveillance for the more accurate estimation of SARS-CoV-2 diversity and transmission behaviors across the country.
Spike protein of SARS-CoV-2 is considered a major target for drug and vaccine design because of its involvement in-recognition of host cell receptor, attachment, and entry [21,22]. Notably, we have found a D614G (aspartic acid to glycine) mutation in the envelope spike protein of indigenous strains (NIH-45090, NIH-45143, and NIH-45579). Moreover, the D614G change was accompanied by three characteristic mutations: 241C>T, 3037C>T, and 14408C>T. The haplotype containing these four mutations has now become the most prevalent form globally; before March, the haplotype was found in 10% of global sequences but increased substantially to 67% during March and 78% between April-May 2020. During the initial outbreak in China, nearly all the strains had aspartic acid at position 614; however, later strains with glycine 614 emerged in Europe, followed by North America, Oceania, and Asia [4]. Recently, Korber et al., reported that the G614 form is more infectious, and infected patients had a higher upper respiratory tract viral load; however, association with disease severity was not observed [4]. In vitro studies conducted by Li et al., involving pseudoviruses showed that the D614G mutation was responsible for increased infectivity [23]. Similar findings have been reported where the G614 mutant showed the highest cell entry among the spike variants [7] as well as efficient infection of 293T-ACE2 cells compared to D614 pseudovirus [6]. Moreover, experimental work by Hou et al., demonstrated that the D614G substitution enhances SARS-CoV-2 infectivity, competitive fitness, and transmission in primary human cells and animal models [24]. In another study, an increased case fatality rate reported in European countries and the East Coast of the United States was correlated with the increased prevalence of the G614 variant [25]. In Pakistan, an upsurge in COVID-19 cases (n = 141,010 cases out of total 213,470; 66%) and deaths (n = 2,852 deaths out of total 4,395; 64.8%) were reported during the month of June 2020 compared with those reported during February-May 2020 (72,460 confirmed cases and 1,543 deaths). This substantial increase in numbers of cases and deaths may indicate widespread circulation of the G614 variant, which needs to be further investigated [4,25]. Although the D614G mutation is increasingly reported in strains worldwide, we have found a novel mutation D830A (aspartic acid to alanine) in strains NIH-44905 and NIH-HAS001 of clade S. Additional molecular epidemiological studies are needed to monitor the circulation of the G614 and A830 variant strains in Pakistan, which might also help to clarify the impact of SARS-CoV-2 genetic variants that might affect disease severity and clinical management. Furthermore, tracking mutations in spike glycoprotein of SARS--CoV-2 is important due to its role in host cell receptor binding and entry as well as eliciting neutralizing antibodies. Since the detection and global spread of the G614 variant, a major concern has been on the impact of D614G mutation on the effectiveness of vaccines as most of the vaccines have been designed using the D614 virus (Lurie et al., 2020). This concern has been addressed by the findings of Weissman et al., which showed that the variant with D614G mutation does not escape neutralization but rather is neutralized at a higher level by serum from vaccinated mice, non-human primates, and humans [26]. However, continuous surveillance of genetic variations in spike protein of SARS-CoV-2 is imperative to detect any escape mutants and for future vaccine development programs.
In summary, we report 29 mutations in the genomic sequences of Pakistani SARS-CoV-2 strains of GH and S clades. Both the clades showed missense mutations in the spike gene; however, clade S had less overall variability compared with the Wuhan reference strain than the GH clade strains. This highlights the fact that circulating SARS-CoV-2 strains in Pakistan have evolved and have sequence variations compared with the early ancestors. Therefore, we recommend whole-genome sequencing of indigenous strains prevalent in different regions of the country to understand virus evolution and to identify strains with unique mutational patterns.