Clinical and whole genome characterization of SARS-CoV-2 in India

We report clinical profile of hundred and nine patients with SARS CoV-2 infection, and whole genome sequences (WGS) of seven virus isolates from the first reported cases in India, with various international travel histories. Comorbidities such as diabetes, hypertension, and cardiovascular disease were frequently associated with severity of the disease. WBC and neutrophil counts showed an increase, while lymphocyte counts decreased in patients with severe infection suggesting a possible neutrophil mediated organ damage, while immune activity may be diminished with decrease in lymphocytes leading to disease severity. Increase in SGOT, SGPT and blood urea suggests the functional deficiencies of liver, heart, and kidney in patients who succumbed to the disease when compared to the group of recovered patients. The WGS analysis showed that these isolates were classified into two clades: I/A3i, and A2a (four according to GISAID: O, L, GR, and GH). Further, WGS phylogeny and travel history together indicate possible transmission from Middle East and Europe. Three S protein variants: Wuhan reference, D614G, and Y28H were identified predicted to possess different binding affinities to host ACE2.


Whole genome sequencing
The WGS was carried out at Med Genome Labs Ltd., Bangalore, India. NEB Ultra II directional RNA-Seq Library Prep kit (NEB, Cat# E7760L) was used to prepare libraries for total RNA sequencing. Libraries were pooled and diluted to final optimal loading concentrations for cluster amplification on Illumina flow cell followed by sequencing on Illumina HiSeq X instrument to generate 150bp paired end reads. The quality of the reads was evaluated using Fast QC v0.11.9 [16] and cut adapt v2.9 [17] was used to trim adapters and remove contaminants. The reads were aligned to the human reference genome hg19 using STAR v2.4 [18]. The reads that do not align to the human reference were aligned to the Wuhan reference genome downloaded from NCBI (Ref Seq NC_045512.2) using BWA v.0.7.12 aligner [19]. The reads mapping to the genes were counted using bed tools v2. 26.0 [20]. The aligned reads were sorted, and then variant calling was performed using GATK variant caller v 4.1.0.0 [21]. The variants identified were then annotated to the genes. The variant class, amino acid changes and other relevant annotations were added to the variants. The variants were then compared with the genomes available in NCBI and GISAID database.

Phylogeny reconstruction
Phylogenetic analysis of the WGS data of samples was performed by multiple sequence alignment of the sequences with the genomes available in NCBI and GISAID database. The phylogenetic tree in newick file format was downloaded from Indian Coronavirus Genome Datasets maintained by CSIR Institute of Genomics and Integrative Biology (CSIR-IGIB; http://clingen. igib.res.in/genepi/phylovis/) for Indian SARS-CoV-2 sequences and visualized using FigTree v1.4.4. Further, BLAST was performed for each sequence of the current study on the GISAID database against all the SARS-CoV-2 genome sequences. Top 30 sequence hits for each isolate were downloaded and merged into a single file. This resulted in total of 142 sequences which also included the seven sequences of the current study. The sequence alignment was performed using MUSCLE program available in MEGA v10.1.7, and the phylogenetic tree was reconstructed using Neighbor-joining method and Kimura 2 parameter as nucleotide substitution model with 1000 bootstrap values. For phylogenetic analysis of the S protein, protein sequences of SARS-CoV-2 S protein of all Indian isolates were downloaded from GISAID database. The multiple sequence alignment was performed using MUSCLE algorithm in MEGA v10.1.7. The phylogenetic tree was reconstructed using Neighbor-joining method with 1000 bootstrap validation and p-distance substitution model.

Structural analysis of spike protein
Homology modelling based on the spike sequence was carried out using Swiss Model [22]. Obtained model was further optimized using fragment guided molecular dynamics simulation [23]. Optimized model was visualized in Pymol. Trimeric spike protein optimized above was docked using protein-protein docking using ClusPro Srver; highest neigbours were taken as most probable conformation. Docked conformation was visualized in Pymol [24][25][26]. Binding constant & energy, and dissociation constant values were computed at Web server PRODIGY [27].

Statistical analysis
In the current study, categorical variables were described as frequency rates and percentages; continuous variables were described as mean, median, and interquartile range. Means for continuous variables were compared using independent two sample t-test statistics (for two-tail) when the data were normally distributed. On the other hand, categorical type variables were investigated using a non-parametric Mann-Whitney-U-test. For determining the proportions for categorical variables, a non-parametric Chi-square test was used, then complete data was cross tabulated (rows v/s columns) and analyzed. The Fisher exact test was used when the data were limited. In each case, a two-sided α of less than 0�05 was considered as statistically significant. The statistical analyses were done using SPSS ver. 20 software.

Clinical, radiological, functional, and immunological profile of COVID-19 patients
The current study enrolled patients (N = 109) diagnosed for SARS-CoV-2 during February 2020 to May 2020, admitted to a tertiary referral hospital in Hyderabad, India. Patients exhibiting hypoxia, shortness of breath, altered sensorium, haemoptysis, acute kidney injury (AKI), chronic kidney disease (CKD), aphasia, chest pain, dysphagia, oral ulcers, hypoxia, metabolic encephalopathy, renal failure, tracheostomy care, and sudden syncope were admitted to intensive care unit (ICU). Majority of the patients had frequently reported fever (37%) and cough (39%). Patients with comorbidities such as diabetes (33%), hypertension (31%), CKD (7%), and cardio vascular disease (CVD) (9%) became severely ill, even when intervened through respiratory support along with oxygen, and medical support with antibacterials, antivirals, hydroxy chloroquine (HCQ), and steroids. Conditions such as cancer (1%), septic shock (11%), acute respiratory distress syndrome (ARDS) (40%), CKD (5%), non-ICU sudden collapse (1%), were seen in patients that could not be recovered and succumbed to the disease (Table 1). For clinical comparison in the current study, the patients were grouped in to recovered (R) (N = 57; male (n): 37, female (n): 20, aged 3 years-72 years) and succumbed (S) (N = 52; male (n): 41, female (n):11, aged 2 months-85 years). Median age of recovered patients was found to be 32 years (26y-35y), and death patients to be 61 years (35y-85y). Between R and S groups there was a significant difference (P<0.05) in favour of S group in the onset of certain symptoms (fever, cough, breathing difficulty, and crepts) and comorbidities (diabetes, hypertension, systolic blood pressure, diastolic blood pressure, and asthma) ( Table 1). Significant difference (P<0.05) between these two groups (in favour of S group) was also observed in ECG, viral Ct values, majority of the pathological lung radiological findings ( Table 1, Fig 1). Interestingly, significant difference in favor of S group was noticed with supportive therapy. Supportive therapy was provided with antibiotics (cefotaxime, azithromycin), antivirals (oseltamivir and lopinavir/ritonavir), HCQ, and steroids. This suggests that the treatment started in severely ill patients had no effect in the recovery of the patients. This also suggests that supportive therapy although statistically found significant, has no effective role in disease outcome. Cause of death was significant (P<0.05) due to ARDS, CKD, and septic shock. Other nonsignificant causes of death include meningoencephalitis, cancer and non-ICU related sudden death (Table 1). It was shown previously that 90% of the deaths were due to ARDS followed by CKD (18%), and shock (12%) [28]. Previous studies reported that comorbidities, particularly the CVDs and chronic pulmonary diseases, were important to predict the in-hospital mortality in critically ill patients [29]. Our study reports that majority of patients in the S group had underlying diseases, especially hypertension, lung disease, and heart disease, more often with more than one comorbidity in individual patient. Sepsis was also a common complication found in this study, which might be directly caused by SARS-CoV-2 infection; further research is needed to investigate the pathogenesis of sepsis in COVID-19 illness.
In this study, radiological findings were obtained in only 40% of cases due to various logistic reasons. 35% of S group had infiltrates in two or more lobes, and almost 68% had normal radiological findings in R group (Fig 1). Overall, the findings commonly included patchy ground glass opacity (GGO) (11%) followed by confluent haziness (7%) in S group; in contrast consolidation was equal in both groups. Overall, our study highlights the radiological findings were consistent with the severity of infection varying from infiltrations through GGO to consolidations.
When immune response of R and S groups was compared, the results show a higher WBC and neutrophil count in the S group (Table 2). Neutrophilia is a hallmark of any acute infection. It was also suggested before that aberrant activation of neutrophils contribute to the exacerbated host response in patients with severe COVID-19, predicting poor outcome [30]. Importantly neutrophil infiltration was also observed in pulmonary capillaries of COVID-19 patients leading to acute capillaritis with fibrin deposition, extravasation of neutrophils into the alveolar space, and neutrophilic mucositis [31]. In addition to this, the role of neutrophilia as a source of excess neutrophil extracellular traps (NETs) was also shown before [32]. NETs are web-like structures of DNA and proteins expelled from the neutrophils that capture and kill the pathogens [33]. In addition to being useful, excessive NET formation can trigger a cascade of inflammatory reactions that results in permanent organ damage to the pulmonary, cardiovascular, and renal systems, the most commonly affected organ systems in severe COVID-19 [6,[34][35][36][37][38].
Another interesting finding was that the lymphocyte count was low in S group compared to R group (Table 2); such a decrease in lymphocyte count was reported previously in pneumonia patients [39,40]. This suggests that the increased lymphocytes in the R group may be contributing for the recovery of the patients, while lymphocytes may be damaged in the condition of severe infection condition, thus reducing recovery of S group. Lymphocytes play an important role in the maintenance of immune system function. After virus infection, alteration in total lymphocyte numbers varies with different viral pathogenic mechanisms [41]. Several mechanisms were shown before for virus induced lymphopenia including immune injuries from inflammatory mediators, virus attachment, exudation of circulating lymphocytes into inflammatory lung tissues [39]. Recent studies indicated a clear association of decrease in peripheral lymphocytes with the clinical characteristics of COVID-19 patients [39]. This indicates an impairment of the immune system during the course of SARS-CoV-2 infection. These alterations were also found in patients with pneumonia caused by MERS-CoV and SARS-CoV [42]. Marginal differences were seen in monocyte, eosinophil and platelet counts between R and S groups ( Table 2).
When functional parameters were compared, large deviation of serum glutamic-oxaloacetic transaminase (SGOT), serum glutamic pyruvic transaminase (SGPT) and blood urea was seen when mean values were taken in consideration (Table 2). There was a significant increase in SGPT, SGOT and blood urea in S group compared to R group (Table 2). There was no significant change seen in alkaline phosphatase (AP), which indicates that liver may not be severely affected. Higher SGOT and SGPT levels could be due to effect of infection to both heart and liver [43]. Since blood urea was also significantly increased in S group, the observed increase may also be associated with comorbid diabetes in these patients ( Table 2). In summary, neutrophil and lymphocyte count, the levels of SGOT and SGPT in COVID-19 patients may provide an indication on severity of the infection in Indian population infected with SARS CoV-2, while the comorbid conditions such as diabetes, hypertension and CVD would further enhance the complexity of the disease condition. Death was reported in the patients on an average of 9 (6-12) days of hospitalization, while recovery was noticed with in 16 (8)(9)(10)(11)(12)(13)(14)(15)(16)(17)(18)(19)(20) days of hospitalization (Table 1). A previous study reported number of days from hospital admission to death is 6.35+/-4.51 days [44].
Hospitalization was shorter in death patients, which is most likely due to rapid progression of disease in such severe cases. As per the recent health bulletin released by the Govt. of  [28,47]. It is evident that respiratory distress is the major cause of mortality. It is possible that ageing diminishes the lung function, regeneration, and remodelling thereby enhancing its susceptibility to the disease [48].

Next generation sequencing
From the first reported cases in India that had international travel histories from USA, UK, Italy, Indonesia, United Arab Emirates during 20-02-2020 to 25-03-2020, n = 7 samples were sent to whole genome sequencing (n = 6 from R group and n = 1 from S group). > 4.9 Gb data was obtained by 151bp pair end sequencing. Of the total, more than 80% of bases have base quality > = Q30. The raw reads were pre-processed for adapter and contamination removal. Of the total raw reads,~50-64% of the total reads were left after pre-processing. Out of total pre-processed reads,~50-77% align to the host transcriptome/genome (human-hg19) using STAR aligner. The unaligned reads from host analysis were converted to fastq and then aligned to the SARS-CoV-2 reference genome (Genbank Ref seq NC_045512.2). A total of >200,000 reads mapped to the viral genome for all the samples, and of these 97% of the reads were properly paired. Coverage analysis of the genome showed 100% coverage for all the samples with an average read depth of >225X. Of the total bases of interest~100% of the genome have more than 100X depth. The full-genome viral sequences were deposited in the dataset of GISAID (EPI_ISL_431101, EPI_ISL_431102, EPI_ISL_431103, EPI_ISL_431117, EPI_ISL_438139, EPI_ISL_437626, EPI_ISL_438138) and NCBI GenBank (MT415320, MT415321, MT415322, MT415323, MT477885, MT457402, MT457403).

Genetic variation in the sequences
Mutations among the seven SARS-CoV-2 strains, were identified both at the genome level and corresponding amino acid level throughout the whole genome, with reference to the prototype SARS-CoV-2 Wuhan strain genome (NCBI Ref Seq NC_045512.2) ( Table 3). Two synonymous (15324C>T in ORF1ab; 24130C>T in S) and two nonsynonymous mutations (21644T>C{Y28H} in the S gene; 29303C>T{P344S} in the N gene) were observed in the sample (OUMRK100/2020) collected from patient with travel history to Dubai. One synonymous (3037C>T in ORF1ab) and two nonsynonymous mutations (14408C>T{P4714L} in ORF1ab; 23403A>G{D614G} in the S gene) were observed in the sample (GMCKN318/2020) collected from patient with travel history to Italy. One synonymous (23929C>T in the S gene) and four nonsynonymous mutations (6312C>A{T2016K}, 11083G>T{L3606F}, 13730C>T{A4489V} in ORF1ab; 28311C>T{P13L} in the N gene) were observed in the sample (GMCKN443/2020) collected from patient with travel history to Indonesia. One synonymous (3037C>T in ORF1ab) and four nonsynonymous mutations (14408C>T{P4714L} in ORF1ab; 23403A>G {D614G} in the S gene; 28881G>A{R203L}, 28883G>C{G204R} in the N gene) were observed in the sample (GMCTC469/2020) collected from patient with travel history to UK. One synonymous (23929C>T in the S gene) and four nonsynonymous mutations (6312 C>A{T2016N}, 11083G>T{L3606F}, 13730C>T{A4489V} in ORF1ab; 28311C>T{P13L} in the N gene) were observed in the sample (OUMRK1090/2020) collected from patient with travel history to UAE. Two synonymous (3037C>T, 18877C>T in ORF1ab) and three nonsynonymous mutations (14408C>T{P4714L} in ORF1ab; 23403 A>G{D614G} in the S gene; 25563G>T{Q57H} in the N gene) were observed in the sample (GMCKP1125/2020) collected from patient with Table 3. Nucleotide polymorphisms and amino acid changes throughout the seven genomes.

PLOS ONE
Clinical and genome profile of CoV-2 in India travel history to USA. One synonymous (23929C>T in the S gene) and four nonsynonymous mutations (6312 C>T{T2016K}, 11083G>T{L3606F}, 13730C>T{A4489V} in ORF1ab; 28311C>T{P13L} in the N gene) were observed in the sample (GMCRR1191/2020) collected from patient with cluster infection and no travel history, and who succumbed to the disease. Orf1ab is a poly-protein, which, after translation, is cleaved into 15 nonstructural proteins (NSP). Total 4 nonsynonymous mutations were observed at 4 different positions among different samples. Mutation at position 6312 C>T{T2016K} is on NSP3 in sample GMCKN443/ 2020, GMCRR1191/2020, and OUMRK1090/2020. NSP3 has 10 different conserved domains including the nucleic acid binding domain (cl24732: NAR superfamily). The mutation 6312 C>T{T2016K} falls on relative position of 1198 on NSP3 protein sequence. This position is related to the nucleic acid binding domain. The mutation 11083G>T{L3606F} was observed in NSP6 in sample GMCKN443/2020, GMCRR1191/2020, and OUMRK1090/2020. NSP6 is a 1945 amino acid long peptide containing putative transmembrane domain. In pp1ab it ranges from amino acid 3570 to 3895. Two mutations were observed in NSP11, in which the first mutation 13730:C>T{A4489V} was present in samples GMCKN443/2020 GMCRR1191/2020 and OUMRK1090/2020; and the second 14408:C>A{P4714L} was observed in samples GMCKN318/2020, GMCTC469/2020, and GMCKP1125/2020. NSP11 is RNA-dependent RNA polymerase (RdRp) having one conserved domain SARS-CoV-like RdRp (cd21591) (Fig 2).
N gene codes for nucleocapsid phosphoprotein (N Protein). Four different variants in N protein were reported among six out of seven different samples. The mutation 28311C>T {P13L}, was found in three samples GMCKN443/2020, OUMRK1090/2020, GMCRR1191/ 2020, while one mutation in each was observed in other three samples. The N protein of SARS-CoV-2 shares 90% sequence similarity with the N protein of SARS-CoV [49]. The Nprotein is a RNA binding protein and has three intrinsically disordered regions (IDRs) from residues 1-44, 182-247 and 366-422 in case of SARS-CoV, which may modulate the RNAbinding activity [50,51]. The two mutations out of four in N-protein reported from our analysis (28311C>T{P13L} and 28883G>C{G204R}) are present in IDRs suggesting there may be variation for RNA binding activity among the N-protein of different variants of SARS-CoV-2.
Several mutations were observed among the seven isolates with both synonymous and nonsynonymous mutations with pyrimidine exchanges (C to T or T to C) suggesting the possibility of occurrence of transcription-induced mutations [52]. This could be due to high replication rates of the virus. Nonsynonymous mutations showed higher frequency than synonymous mutations in the ORF regions of the viral genomes suggesting that the selection of mutation may be biased towards viral phenotype changes which may be contributing to the observed variability in the severity of infection reported in different epidemiological studies [53].

Phylogenetic analyses
The genomes sequenced in the current study were found to be in two clades, when analysed with the other available Indian SARS-CoV-2 sequences. Four of the seven sequences belong to clade-I/A3i; of which three sequences (OUMRK1090/2020, GMCRR1191/2020, and GMCKN443/2020) are clustered together, and one sequence (OUMRK100/2020) clustered separately in the same clade. The other three sequences (GMCTC469/2020, GMCKP1125/ 2020, and GMCKN318/2020) belong to the clade A2a in three different clusters (Fig 3). The sequences were further analyzed with other SARS-CoV-2 sequences worldwide (Fig 4). It can be seen that the virus sequence from patient with travel history to Dubai (OUMRK100/2020) clustered with sequences from UAE, Japan, and the original Wuhan-Hu-1 strain. Virus sequences from patients with travel history to Indonesia (GMCKN443/2020), UAE (OUMRK1090/2020), and no travel history (GMCRR1191/2020) clustered together along with two other Indian isolates, and a Singapore isolate. Sample GMCKP1125/2020 isolated from patient who travelled to India from USA via UAE can be seen closely clustered with one USA and three Saudi Arabian isolates. Virus sequence from patient with travel history to UK (GMCTC469/2020); and Italy (GMCKN318/2020) were found to be clustered with other England, Italian; and USA, Russian strains, respectively. The global comparison according to GISAID database revealed that three sequences from the current study: GMC-RK1090, GMC-RK1191, and GMC-KN443 are in the cluster labeled as clade O according the GISAID database. Other four sequences belonged to four different clades: L (GMC-RK100), G (GMC-KN318), GR (GMC-TC469) and GH (GMC-KP1125) (Fig 5).
These findings, suggest that there might be two distinct points of transmission: one is involving Middle East, and other Europe, this could be a transition point during travel for an airway transmission of the virus. A previous study suggested that divergence from a single point of Clade I/A3i suggests a single point of introduction into the country, and that the spread from this could be from a single outbreak [54]. The first sequence from this cluster in India was GMC-KN443 which was from a patient who travelled to India from Indonesia. However, the authors suggested that the strain is similar to those strains from Singapore and Philippines [54]. Furthermore, another study reported that isolate RK100 clustered with the ancestral first isolates from India that were originally reported from Kerala [55].

Analysis of spike protein
Protein sequences for spike protein of SARS-CoV-2 of all Indian isolates were downloaded from GISAID database. Total 53 sequences were obtained in which 4 sequences (EPI_ISL_431101, EPI_ISL_431102, EPI_ISL_431103, and EPI_ISL_431117) were from Hyderabad isolates. The multiple sequence alignment was performed using MUSCLE algorithm in MEGA v10.1.7. The phylogenetic tree was constructed using Neighbor-joining method with 1000 bootstrap validation and p-distance substitution model. The results show four different genotypic variants (Fig 6). Among these, three amino acid variations were observed including reference Wuhan isolate. RK100 showed Y28H mutation, while sample nos KN318, TC469, and KP1125 showed D614G mutation, and samples KN443, RK1090 and RR1191 spike protein are similar to Wuhan reference. Thus, among seven samples there are three phenotypic variants of spike protein.
Total 53 different types of proteins were identified in GISAID database to date. Among seven Hyderabad isolated total 4 different types of Spike proteins are observed (Red).
Structure prediction analysis of spike protein shows that overall structure of isolated viruses may not be significantly affected, while loop regions connected to beta sheets are varied among the spike proteins of four variants. When predicted affinity of spike protein with ACE2 was analysed through docking, the results showed that spike protein of RK100 may be interacted between position 470-490 and 200-220, while KN318, TC469 and KP1125 spike protein may be interacting with ACE2 between position 250-270 and 630-650 (Fig 7). In case of KN443, RK1090 and RR1191, spike protein interacts with ACE2 between position 70-80, 120-160, 170-250 and 470-485. When energy of interaction and binding constant was compared with Wuhan reference, the results showed that the binding affinity RK100 and KN318 decreased by 5615 and 6000-fold, respectively. Thus, suggesting that the virus affinity to ACE2 receptor could be varied among the three variants of the isolates in circulation (Table 4).
We observed 3 distinct variants Y28H (RK100), D614G (KN318, TC469, KP1125), and similar to Wuhan reference NC_045512.2 (KN443, RK1090 and RR1191). These mutations were distinctly different with varied affinities as assessed by theoretical analysis, wherein the affinity of Wuhan was higher followed by Y28H and D614G (Table 4). When RNA reads for ACE2 were quantified, it shows higher variability among the patients, but the ACE2 expression (0.13 to 1.51) ( Table 4) could not be correlated with the predicted affinity of the S protein to ACE2 of isolates and also Ct values. A larger WGS sample size, which is a limitation of the current study, can provide better insights in this area. It was interesting to note comparatively low Ct values (high viral loads) in the Spike G614 variants when compared to D614 variants (E gene: 19±1 vs 22±2; RdRp: 26±0 vs 27±5; ORF: 25±1 vs 28±3).

Conclusion
Clinical profile of COVID-19 in India majorly shows fever and cough; comorbid diabetes, hypertension, and CVD were associated with increased severity of infection. Increased neutrophil count, decreased lymphocyte count, increased levels of SGOT, SGPT and blood urea were also associated with increased severity of infection. The observations in X-ray and CT scan confirms the severity of infection. A limitation of the current study is the inability to retrieve all the data regarding radiological findings. This hindered us to compare the findings extensively in both groups. It is hoped that findings from this study will guide clinicians to identify patients with different prognosis at an early stage based on the clinical characteristics presented by patients, and help in providing appropriate and effective management for the patients. Sequence analysis of seven patients of SARS CoV-2 in India showed they are under two clades I/A3i, A2a; and O, L, GR, and GH clades as per the latest GISAID classification. Analysis of the S protein showed that they belong to three variants with distinct predicted ACE2 interaction. ACE2 expression levels were variable among the patients.  their support, and Medgenome, Bangalore for their technical support in WGS. We are indebted to all the patients and staff from Department of Microbiology and General Medicine, Gandhi Medical College and Hospital who assisted in this study. We sincerely appreciate the researchers worldwide who sequenced and shared the complete genome data of SARS-CoV-2 and other coronaviruses from GISAID (https://www.gisaid.org/). Dr. Kondapi participated in the analysis during his sabbatical period.