In silico comparative genomics of SARS-CoV-2 to determine the source and diversity of the pathogen in Bangladesh

The COVID19 pandemic caused by SARS-CoV-2 virus has severely affected most countries of the world including Bangladesh. We conducted comparative analysis of publicly available whole-genome sequences of 64 SARS-CoV-2 isolates in Bangladesh and 371 isolates from another 27 countries to predict possible transmission routes of COVID19 to Bangladesh and genomic variations among the viruses. Phylogenetic analysis indicated that the pathogen was imported in Bangladesh from multiple countries. The viruses found in the southern district of Chattogram were closely related to strains from Saudi Arabia whereas those in Dhaka were similar to that of United Kingdom and France. The 64 SARS-CoV-2 sequences from Bangladesh belonged to three clusters. Compared to the ancestral SARS-CoV-2 sequence reported from China, the isolates in Bangladesh had a total of 180 mutations in the coding region of the genome, and 110 of these were missense. Among these, 99 missense mutations (90%) were predicted to destabilize protein structures. Remarkably, a mutation that leads to an I300F change in the nsp2 protein and a mutation leading to D614G change in the spike protein were prevalent in SARS-CoV-2 genomic sequences, and might have influenced the epidemiological properties of the virus in Bangladesh.


Introduction
The pandemic of coronavirus disease referred to as COVID-19 pandemic, which originated in Wuhan, China in December 2019 is ongoing and has now spread to 213 countries and territories. As of October 2020, the pandemic has caused about 45 million cases and over half a million death. This novel virus of the Coronaviridae family and Betacoronavirus genus [1,2] designated as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is the causative agent of COVID-19. Previously two other coronaviruses, namely SARS-CoV and MERS-CoV demonstrated high pathogenicity and caused epidemic with mortality rate~10% and~34% respectively affecting more than 25 countries each time [3][4][5]. However, SARS-CoV-2 has proven to be highly infectious and caused pandemic spread to over 213 countries and territories. Besides its devastating impact in North America and Europe, the disease has now rapidly spread in South America including Brazil, and in South Asian countries particularly India, Pakistan and Bangladesh [6].

PLOS ONE
PLOS ONE | https://doi.org/10.1371/journal.pone.0245584 January 20, 2021 1 / 12 a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 The virus was first detected in Bangladesh in March 2020 [7]. Although infections remained low until the end of March it began to rise steeply in April. By the end of June, new cases in Bangladesh grew to nearly 150,000 and the rate of detection of cases compared to the total number of samples tested increased to about 22% which was highest in Asia [6].
SARS-CoV-2 is a positive-sense single-stranded RNA virus with a genome size of nearly 30kb. The 5' end of the genome codes for a polyprotein which is further cleaved to viral nonstructural proteins whereas the 3' end encodes for structural proteins of the virus including surface glycoproteins spike (S), membrane protein (M), envelop protein (E) and nucleocapsid protein (N) [8]. Like other RNA viruses, SARS-CoV-2 is also inherently prone to mutations due to high recombination frequency resulting in genomic diversity [9][10][11]. Due to the rapid evolution of the virus, development of vaccines and therapies may be challenging. To monitor the emergence of diversity, it is important to conduct comparative genomics of viruses isolated over time and in various geographical locations. Comparative analysis of genome sequences of various isolates of SARS-CoV-2 would allow to identify and characterize the variable and conserved regions of the genome and this knowledge may be useful for developing effective vaccines, as well as in molecular epidemiological tracking. Thousands of SARS-CoV-2 virus genomes have been sequenced and submitted in public databases for further study. This include 66 SARS-CoV-2 genomic sequences submitted from Bangladesh in the Global initiative on sharing all influenza data (GISAID) database, till 11th June 2020 [12]. We conducted comparative analysis of publicly available genome sequences of SARS-CoV-2 from 27 countries to predict the origin of viruses in Bangladesh by studying a time-resolved phylogenetic relationship. Later, we analyzed the variants present in different isolates of Bangladesh to understand the pattern of mutations in relation to the ancestral Wuhan strain, find unique mutations, and possible effect of these mutations on the stability of encoded proteins, and selection pressure on genes.

Genome sequence acquisition
A total of 435 whole genome sequences of SARS-CoV-2 isolated in 27 countries (S1 Table) with high frequency of infection including 64 sequences isolated in Bangladesh (S2 Table), and that of 5 isolates of each month between January to May, 2020 in the selected countries obtained from GISAID [12] were included in this analysis (S3 Table). However, since only a small number of sequences were reported from different African countries, we included all available sequences from the African countries and categorized collectively as African sequence. Reference sequence included in various analysis was the sequence of the ancestral strain from Wuhan, China (NC_045512.2) [13]. All the sequences were quality checked prior to further analysis and any sequence found to contain ambiguous characters, were removed.

Sequence annotation, phylogenetic analysis and clustering
The selected sequences were annotated by Viral Annotation Pipeline and iDentification (VAPiD) tool [14] to identify their coding regions and corresponding amino acid sequences for further analysis. Subsequently, multiple sequence alignment was carried out using Mafft algorithm [15] and a maximum likelihood phylogenetic tree was constructed with IQ-TREE [16] with bootstrap value of 1000 to identify closeness among the sequences. Then the generated tree was reconstructed based on time-calibration by TreeTime [17]. The tree was then annotated and visualized on iTOL server [18] putting the sequences into different clades based on specific mutations proposed in GISAID [19], and further classified as D614G type [20,21]. In addition, another phylogenetic tree containing only SARS-CoV-2 sequences from Bangladesh was constructed and categorized using the same tools, and further clustered with TreeCluster [22] to group them based on their identity with each other.

Analysis of mutations and their predicted effects
For analysis of mutations, sequence were mapped with minimap2 [23], and variants were detected using SAMtools [24]. Furthermore, snp-sites [25] was also used to detect variations among genomes with the reference, and finally common mutations from these two analysis were considered. In addition, SNPeff [26] was used to predict the effect of mutations in terms of changes in amino acid sequences. Subsequently, a haplotype network was generated based on mutations in genome using PopArt by Integer Neighbor-Joining method [27]. Then the direction of selection in sequences from Bangladesh was calculated by the SLAC algorithm [28] in the Datamonkey server [29] to understand their evolutionary pattern as to whether they are stabilizing or changing with time. Finally, since it is important to understand the functional consequences of mutations on corresponding proteins, we used a manually curated neural network DeepDDG [30] to predict the effects of the mutations on protein stability.

Phylogenetic analysis of all SARS-CoV-2 sequences
A total of 435 Genomic Sequences of SARS-CoV-2 reported from various countries which included 64 sequences from Bangladesh and the sequence of the ancestral SARS-CoV-2 isolated in Wuhan, China were analyzed in the time-resolved phylogenetic tree. Sequences from Bangladesh belonged to three different clusters of which one cluster carried 43 of the 64 sequences, and shared the same node with sequence from Germany while they had a common ancestry with isolates from the United Kingdom (Fig 1). The remaining two clusters of SARS--CoV-2 sequences contained 4 and 5 sequences respectively from Bangladesh, and they shared the same node with sequence of SARS-CoV-2 reported from India, and also shared a common ancestry with isolates from Saudi Arabia. Besides, 12 lone sequences that did not belong to any of these clusters were found to have similarity with sequences from Europe including United Kingdom, Germany, France, Italy, and Russia. One of these sequences was closely related to sequence reported from the USA. Subsequently, all SARS-CoV-2 sequences from representative countries were clustered based on some specific mutations sustained, into 7 different clades as mentioned by GISAID. In this analysis, the sequences from Bangladesh were found to be distributed in all clades except V (Figs 1 and 2).

Phylogenetic analysis of SARS-CoV-2 sequences from Bangladesh
In order to understand the evolutionary relationship and possible transmission dynamics of SARS-CoV-2 in Bangladesh at a higher resolution, another time-resolved phylogenetic tree carrying only sequences of the pathogen isolated in various regions of Bangladesh was generated using the sequence of the first SARS-CoV-2 reported from Wuhan, China as a reference. Of the three clusters produced in this analysis, cluster-1 included mostly isolates from Chattogram and one isolate from Dhaka, cluster-2 included isolates from Dhaka, Narayanganj and Chattogram districts, whereas cluster-3 included isolates from Chattogram only. As mentioned above, the isolates from Bangladesh were found to be distributed in all 7 GISAID clades based on specific mutations, except in clade V (Figs 1 and 2). Most isolates of Dhaka and Narayanganj (47 of 52) belonged to the GR clade, whereas those of Chattogram belonged to five different GISAID clades (G, GH, GR, O, and S).

D614G mutations in spike protein
The SARS-CoV-2 sequences were also categorized according to D614G type mutation (Fig 1). The D614G mutation generates an additional serine protease (Elastase) cleavage site near the S1-S2 junction of the Spike protein [21]. All but one sequence from Dhaka and Narayanganj were found to be of G type which carries Glycine at position 614. On the other hand, 5 out of 11of the sequences of isolates in Chattogram were G type while the rest carried Aspartate at 614th position of the spike protein (Fig 2). In addition, the first sequence from Bangladesh carried G614 type of surface glycoprotein, which indicate that this dominant variant was present since the first isolation of SARS-CoV-2 in Bangladesh and the mutant virus might have been imported to the country from Europe. Since spike protein having glycine at 614 position is more stable and presumably increase the transmissibility [20,21], we assume that, introduction of the strain carrying this mutation might have facilitated rapid viral transmission here in Bangladesh. This particular subtype with a non-silent (Aspartate to Glycine) mutation at 614th position of the Spike protein might have also rapidly outcompeted any preexisting subtype here, including the ancestral strain.

Haplotype analysis of SARS-CoV-2 sequences
Relationships among DNA sequences within a population are often studied by constructing and visualizing a haplotype network. We constructed a haplotype network by the Median Joining algorithm and found that 338 of 434 SARS-CoV-2 sequences from representative countries were alike, therefore formed a large haplo group (Fig 3A). However, there were presences of a significant number of unique lineages too consisting of a single or multiple SARS-CoV-2 sequences (Fig 3A). This network demonstrated the closeness of the sequences and their pattern of mutation beyond the geographical boundary. Several SARS-CoV-2 isolates appeared to have sustained certain common mutations along with certain unique mutations. Although a large proportion of sequences from Bangladesh belonged to the common cluster (Fig 3A),

PLOS ONE
Comparative genomics of SARS-CoV-2 in Bangladesh there was a significant number of unique nodes as well due to mutations overtime subsequent to being carried into Bangladesh (Fig 3A). Therefore analysis of the sequences from Bangladesh provided further insight of their mutation patterns. The haplotype network revealed that viruses isolated in Bangladesh had certain unique mutations in them, and as a result they belonged to different haplo groups and no significant cig group (Fig 3A). Most of the isolates sustained a significant number of mutations compared with each other. In addition it further confirms that most isolates from Chattogram ( Fig 3B) were not directly related to those isolated in Dhaka or Narayanganj.
In consideration of phylogenetic relationship, spike protein mutation at position 614, and haplotype network we predict that SARS-CoV-2 were introduced into the country from different sources. Firstly, the viruses in Dhaka were imported from either of the European countries, including United Kingdom and France, while that in Chattogram was possibly imported from the Middle East, in particular the Kingdom of Saudi Arabia. Later on, Dhaka served as the infection hub inside the country to infect other cities like Chattogram and Narayanganj. Since Dhaka is the capital city, enormous number of people live here and uncontrolled movement of large number of people occurs between other districts and the capital city.

Location and predicted effects of the mutations
We detected the presence of 209 point mutations in 64 SARS-CoV-2 isolates from Bangladesh when compared to the reference sequence from Wuhan, China. In addition, 19 isolates were found to have lost significant portions of their genome. In 4 cases these deletions were associated with loss of non-structural proteins such as ORF7 and ORF8 while other deletions were upstream  Table). Among the point mutations, 29 mutations were in the non-coding region of the genome and 180 were in coding regions. Ten of the 29 non-coding mutations were in upstream non-coding region and rest was in downstream non-coding region of the genome. 70mutations in the coding region were synonymous and 110 mutations predicted substituted amino acids. Among twelve predicted ORFs, ORF1ab which comprises approximately 67% of the genome encoding 16 nonstructural proteins had more than 60 percent of the total mutations while gene E encoding envelope protein and ORF7b were conserved and did not carry any mutation. Though ORF1 harbored the highest number of mutations, mutation density was highest in ORF10 considering ORF lengths. Details and distribution of the mutations are presented in (Table 1 and Fig 4) and full analysis report is placed in S3 Table. Among SARS-CoV-2 sequences reported from Bangladesh, 241C>T and 3037C>T changes were the two most abundant mutations found in 58 out of 64 isolates, and were always found simultaneously ( Table 2). Position 241 is located in the non-coding region whereas the mutation in position 3037 was synonymous. On the other hand, 57 sequences were found to harbor 14408C>T and 23403A>G mutations which altered amino acid Pro>Leu and Asp>Gly respectively, and these two mutations were found to be present simultaneously as well. In addition, other co-evolving mutations found were (241C>T, 3037C>T, 14408C>T, 23403A>G), (28881G>A, 28882G>A, 28883G>C), (8782C>T, 28144T>C), (4444G>T, 8371G>T, 29403A>G). Besides the highly abundant mutations, several less common and unique mutations were also present in the sequences analyzed.
As a result of continuous mutations genes are seemingly going through natural selection. ORF6 was predicted to have dN/dS value of 17.8 due to the presence of higher number of missense than synonymous mutations. This finding indicates that ORF6 is rapidly evolving and is highly divergent. The ORF6 protein is an accessory protein whose function is yet to be fully elucidated [31].
The ORF7a and Nucleocapsid phosphoprotein (N) had dN/dS values 1.08 and 1.55 respectively which confer their strong evolution to cope up with challenges under positive selection pressure. ORF10 is predicted to be conversed with dN/dS value 0 while envelope protein and ORF7b did not harbor any mutation and was conserved. On the other hand, ORF3a and surface glycoprotein (S) might approach toward positive selection pressure and evolve but the remaining protein coding genes were under negative selection pressure.

PLOS ONE
Comparative genomics of SARS-CoV-2 in Bangladesh

PLOS ONE
Comparative genomics of SARS-CoV-2 in Bangladesh Among 110 missense mutations, 99 were predicted to destabilize the corresponding proteins while only 11 mutations predictably increase stability. Ten protein stabilizing mutations were present in ORF1 with 1 mutation in ORF3a. On the other hand, destabilizing mutations were distributed among most of the ORFs (Table 3). None of the mutations in structural proteins were predicted to increase stability.
Taken together these results suggest that the SARS-CoV-2 isolated in Bangladesh has been evolving rapidly, and that 90% of the missense mutations are destabilizing the genes encoding virulence as well as structural proteins. This could be the reason that the infection and mortality rate due to SARS-CoV-2 in Bangladesh is apparently significantly lower than that of the United States of America, as well as European and Latin American countries.
In summary, mutation analysis revealed point mutations as well as deletion of base pairs. Deletions of the base pairs were associated with missing non-structural proteins and predictably affected certain viral properties since ORF7a protein is the growth factor of the coronavirus family, induce apoptosis, and promotes viral encapsulation [32][33][34] while ORF8 is associated with viral adaptation by playing role in host-virus interaction [35,36]. Furthermore, we have found that some genes are under positive selection pressure indicating that the virus is fast-evolving presumably to evade host cell's innate immunity; which should be taken into special consideration prior to vaccine development or other treatment strategies.
Finally, a missense mutation at 1163A>T changing the amino acid isoleucine to phenylalanine in Nsp2 protein was found uniquely among 44 isolates in Bangladesh but absent among all the selected sequences from other countries. Nsp2 is a methyltransferase like domain that interacts with PHB and PHB2, and modulates the host cell survival strategy by affecting cellular differentiation, mitochondrial biogenesis, and cell cycle progression to escape from innate immunity [36,37]. In considering death ratio against the infection rate in Bangladesh in view of the function of this non-structural protein, we assume that this unique mutation might turn out to be vital for the observed low death rate in this country. Therefore possible effects of this missense, point mutation might warrant further studies, and should be monitored closely. Table 3. Mutations present in different ORFs in the genome sequences of SARS-CoV-2 isolates reported from Bangladesh.

Remarks
ORF1ab 93 A mutation at 1163A>T (Ile300Phe) in ORF1 operon which lower the structural stability of the protein was found existing in 44 strains of Bangladesh but absent in representative countries S 13 A highly prevalent mutation 23403A>G (Asp614Gly) which ease the viral transmission, dominating throughout the world was found to be present in 57 of 64 sequences in Bangladesh.

ORF3a 9
A mutation at 25563G>T (Gln57His) was found highly prevalent among selected sequences from the USA was also present in Bangladeshi sequences.