Full length genomic sanger sequencing and phylogenetic analysis of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) in Nigeria

In an outbreak, effective detection of the aetiological agent(s) involved using molecular tech-niques is key to efficient diagnosis, early prevention and management of the spread. However, sequencing is necessary for mutation monitoring and tracking of clusters of transmission, development of diagnostics and for vaccines and drug development. Many sequencing methods are fast evolving to reduce test turn-around-time and to increase through-put compared to Sanger sequencing method; however, Sanger sequencing remains the gold standard for clinical research sequencing with its 99.99% accuracy This study sought to generate sequence data of SARS-CoV-2 using Sanger sequencing method and to characterize them for possible site(s) of mutations. About 30 pairs of primers were designed, synthesized, and optimized using endpoint PCR to generate amplicons for the full length of the virus. Cycle sequencing using BigDye Terminator v.3.1 and capillary gel electrophoresis on ABI 3130xl genetic analyser were performed according to the manufacturers’ instructions. The sequence data generated were assembled and analysed for variations using DNASTAR Lasergene 17 SeqMan Ultra. Total length of 29,760bp of SARS-CoV-2 was assembled from the sample analysed and deposited in GenBank with accession number: MT576584. Blast result of the sequence assembly shows a 99.97% identity with the reference sequence. Variations were noticed at positions: nt201, nt2997, nt14368, nt16535, nt20334, and nt28841-28843, which caused amino acid alterations at the S (aa614) and N (aa203-204) regions. The mutations observed at S and N-gene in this study may be indicative of a gradual changes in the genetic coding of the virus hence, the need for active surveillance of the viral genome.


Introduction
It is no longer news that one of the major battle humanity is fighting currently is the fight against SARS-CoV-2. It started in Wuhan, China in December 2019 [1] but has spread to most countries in the world. As the global community battles this novel virus through aggressive screening of individuals for the presence or absence of infection, there is also a salient need for a genetic characterization which will help in the long-term management of the infection. A pool of sequences from various countries of the world would help in assessing the mutational tendencies of the virus as it crosses different geographical conditions and people of different cultures, genetic variations and immune status [2]. Complete genome sequencing of viruses is an essential tool for the development of diagnostics and vaccines, studying virus pathogenicity and virulence, tracking evolutionary paths and studying the genetic association between viruses and their hosts [3]. Optimized sets of primers for Sanger sequencing could be of help for whole genome generation especially for laboratories who lack Next Generation Sequencing platform but have Genetic analyzer. This method though an older methodology in genome sequencing, with its 99.99% accuracy, remains a 'gold standard' for clinical research sequencing (https://www.thermofisher.com/blog/behindthebench/when-do-i-use-sanger-sequencing-vsngs-seq-it-out-7/). A critical determinant for successful full genome Sanger sequencing of any pathogen is the availability of primers that target only the pathogen of interest reliably but not the sequences of host genes or other pathogens [4].
The rapid spread of SARS-CoV-2 across countries calls to questions on whether its evolution is mutation driven. It is reported that genomes with new variations are emerging as the virus moves across diverse environmental conditions [5]. Hence, there is a need for mutation monitoring of the virus. A study reviewed that RNA viruses' mutation rate is higher than that of their hosts and this rate plays an important role in virulence modulation and in the development of traits that are useful for viral adaptation [6]. Also, variations have been noticed at ORF1ab, S, ORF3a, ORF8 and N regions, in which nt28144 in ORF8 and nt8782 in ORF1a showed variation rates of 30.53% and 29.47%, respectively [7]. Variations at the nucleotide level which is proposed to be one of the most significant measures of viral evolvement, also contributes to the adaptation of the virus in every condition it finds itself thus creating a balance between its genetic information and genome variability [8,9].
SARS-CoV-2 is a positive-sense single-stranded RNA virus. Studies showed that it has a zoonotic origin with genetic similarity to bat coronaviruses [10,11]. The whole genome size of the SARS-CoV-2 varies from 29800bp to 30000 bp and it consists of a long open reading frame (ORF1ab) of about 21290bp encoding orf1ab polyproteins, structural proteins such as surface (S) about 3822bp, envelope (E) about 228bp, membrane (M) about 669bp, and nucleoprotein (N) about 908bp. It also has additional 6 accessory proteins, encoded by ORF3a, ORF6, ORF7a, ORF7b, and ORF8 genes [12].
This study explored the use of Sanger technology to sequence and characterize the whole genome of SARS-CoV-2 using specific overlapping primers designed from the various genes of the virus.

Ethics
Ethical approval (IRB/19/025) was obtained from Nigerian Institute of Medical Research Internal Review Board for the collection of samples for this study. Informed consents were filled by participants who agreed for their samples to be taken for diagnosis and research.

Samples
In April 2020, fifty (50) nasopharyngeal swabs were collected from suspected cases at the modified COVID-19 drive-through testing facility at the Nigerian Institute of Medical Research (NIMR), Yaba, Lagos. The samples were collected into a Viral Transport Medium (VTM) taken to a Biosafety level III laboratory within the Institute for inactivation by aliquoting into lysis buffer and leaving to stand at room temperature for 10 minutes. Thereafter, the inactivated aliquots were moved to a BSL II laboratory for RNA extraction and sequencing. The sample used for this study coded 57752 in our database was selected for Sanger whole genome sequencing based on its relatively high Ct value when compared to others.
RNA extraction. The viral nucleic acid from the inactivated samples aliquots were extracted using the QIamp RNA extraction kit (Qiagen, Maryland, United States) according to the manufacturer's instructions.
RT-qPCR. One-step reverse transcriptase (RT) real-time (qPCR) was carried out to detect SARS-CoV-2 using qPCR assay designed by BGI, Shenzhen, China. The process contained 18.5μl of nucleic acid mix, 1.5μl of enzyme mix and 10μl of RNA in a total reaction volume of 30μl. RT-qPCR cycling was performed on QuantStudio 3 (Applied Biosystems) as follows: 50˚C for 20 minutes, 95˚C for 10 minutes, then 40 cycles of 15s at 95˚C and 30s at 60˚C.

Primers
Specific overlapping primers for the amplification of the entire length of the virus were designed in-house from conserved regions using Oligo Primer Analysis Software v.7 to obtain amplicons of 1300-1600bp. These primers were synthesized by Macrogen Europe B.V., Amsterdam, Netherlands.

Nucleic acid amplification
Synthesis of highly structured and long cDNA fragments was performed using SCRIPT cDNA Synthesis Kit (Jena Bioscience, GmbH, Germany) according to the manufacturer's instruction. Gradient PCR was carried out using the primers pairs, to optimize the annealing temperature and amplification of targeted fragments using MiniAmp Plus Thermal Cycler (Applied Biosystems). The PCR process was performed in a total volume of 50μl comprising 10μl of 5x Red Load Taq Master Mix, 5ul each of 10μM forward and reverse primers, 3ul of cDNA template and PCR grade water to make up the volume. PCR cycling conditions: 1 cycle of 94˚C for 2 minutes; 30 cycles of 94˚C for 30 secs, 53-64˚C for 30secs, 72˚C for 2 minutes and 1 cycle of 72˚C for 2 minutes were used. The amplicons were analysed by 1.8% agarose gel electrophoresis.

Cycle sequencing and ABI analysis
In this study, we used Sanger technology to sequence and characterize the whole genome of SARS-CoV-2 using specific overlapping primers designed from the various genes of the virus.
The amplicons were purified using HT ExoSAP-IT (Thermo Fischer scientific) according to the manufacturer's instructions. The purified amplicons were quantified using Qubit 4 Fluorimeter (Thermo Fischer Scientific, CA). Sequencing was performed using the BigDye Terminator kit v. 3.1 and cleaned up with BigDye XTerminator v. 3.1 (Applied Biosystems, Foster City, CA). The purified products of the cycle sequencing were analysed on the ABI 3130xl Genetic Analyser (Applied Biosystems). This process was carried out at the Centre for Human Virology and Genomics, NIMR. Sequences generated were assembled using DNAS-TAR Lasergene 17 SeqMan Ultra in alignment with the reference genome (NC_045512.2).

Phylogenetic analysis
Thirty four (34) whole genome sequences of Sars-CoV-2 were downloaded from GenBank of NCBI (https://www.ncbi.nlm.nih.gov/GenBank) using the blastn search tool with SARS-CoV-2 as query. The sequences were aligned alongside the sequence data generated from this study using MAFFT v. 7 [13] and BioEdit software. Phylogenetic analysis was done using the MEGA X [14], applying the Maximum likelihood method (ML) andTamura-Nei model [15]. The bootstrap consensus tree inferred from 1000 replicates based on General Time Reversible (GTR) substitution model with gamma distribution.

PCR Virus detection and amplification
A Ct-value of 19 was observed on QuantStudio 3 (QS-3) qPCR system for sample NGN57752 using SARS-CoV-2 real time PCR assay developed by BGI, China. These Ct value was less than the cut-off Ct value of 38 as set by the assay. Hence, this result shows the presence of the virus in the sample assayed. The cDNA synthesized from the RNA according to the process described above gave a yield of 4.08 ng/μl. The amplicons from PCR process were loaded on 1.8% agarose gel and ran on CyFox 1 live imaging and gel electrophoresis device (Sysmex Partec, GmbH).
The primers developed and successfully used in this study to generate a whole genome will enable laboratories with access to only Sanger technology, to generate sequence data for their countries. This could increase representation of SARS-CoV-2 genome data from more countries to available genome data in various data banks across the world. The Sanger method, an older technology when compared with NGS, has a lower throughput, is more expensive if you want to use it for several samples and pathogens with large genome length. Also, the quality of Sanger sequences is often not very good at the extremes due to the fact that those are the sites where the primers bind before extension. However, with high quality read lengths of between 500-900bp per primer, it tends to minimize errors during assembly when compared to NGS reads. The primers used in this studies are able to amplify amplicons from RNA with Ct values of � 30 in first round amplification. Fig 1 below shows gel images from the first round amplification using the primers in Table 1 above.

Sequencing and amplification analysis
The PCR fragments were amplified and sequenced using the primers listed in Table 1. The gel images obtained from the gel electrophoresis of the amplicons from the PCR process using the designed primers are shown in Fig 1 below. Sequence data with read lengths ranging from 500-900bp per primer were assembled against the reference strain (NC_045512.2) into a single contiguous sequence using DNASTAR Lasergene 17 SeqMan ultra. The contig was "BLASTed" against the sequence data available in Gen-Bank. The result showed 99.97%identity with the reference sequence (NC_045512.2) and other sequences as highlighted in Table 2. Also, the amino acid identity ranges between 99.52%-100% (Table 2) depending on the region been considered. The sequence has been deposited in GenBank with accession number: MT576584.1. Other sequences generated using sets of primers from Table 1

Variation analysis
In comparing the assembled sequence with the reference sequence-NC_045512.2, there were notable nucleotide differences in some positions as shown in Table 3. Fig 3A-3D shows the points of variations in the assembled sequence chromatograms.

Discussion
Many developing countries in the world are limited in resources to set up latest sequencing platforms, however, in order to attain a highly effective tools for diagnosis and management of an infection on a global level, there is a need to have representative sequences of such infection causing organism generated from across the world. This will help to put into consideration the

PLOS ONE
available strains of such pathogen circulating the globe when vaccines and drugs are been developed. Sanger sequencing platform provides a good alternative method to generate sequences by laboratories that lack NGS methods. This study reports an identity of 99.97% between the full genome sequence generated and the reference strain at the nucleotide level and a 99.95% at amino acid level. Other sequences sampled from GenBank as shown in Table 2 had identity of 99-100% at both nucleotide and amino acid levels. The sequence analysed is not 100% identical to the reference genome, but 99.97% identity shows that there is a minimal or insignificant variation as the virus spreads across several environmental regions, from China to Nigeria at the moment. However, this seemingly insignificant variation, needs to be further studied as its corresponding changes in the protein building block could result in a mutation that will alter the behaviour of the virus. Lu and colleagues [16] established that coronaviruses have an average evolutionary rate of

PLOS ONE
about 10 −4 base substitution site per year with mutations arising at every replication cycle however that has not been the case with SARS-CoV-2 which has remained majorly conserved even after 6 months of its outbreak [7].
Nevertheless, in this study, we have observed some variations at ORF1b, S-region and Nregion leading to changes in the corresponding amino acid sequence as highlighted in Table 3 above. ORF1b which covers from 13468nt to 21555nt of the entire genome partly codes for the RdRp gene which is involved in genome replication, mRNA synthesis, or RNA recombination. It is also important for the survival of viruses and play a role in their evolution [5]. Though there is 100% identity at the RdRp gene among our sequences, the reference and other sequences (Table 2), surveillance is needed as the virus evolves.
The spike (S) gene is an important gene in the virus morphology and plays a major role in attachment of the viral particle to the receptor cells in its host range determination and in membrane fusion [17]. In this study, a nucleotide variation was noticed at nt23363 which resulted in a corresponding amino acid change from D (Aspartic acid) to G (Glycine) at amino acid position 614 as seen in Fig 5 above. We are of the opinion that further studies are

PLOS ONE
necessary to investigate the possible origin and effect of this variation especially with respect to the pathogenesis of the virus.
Furthermore, mutations at the nucleoprotein region nt28841, nt28842 and nt28843, where "AAC" replaces "GGG" of the reference strain were also observed. This led to corresponding variations in the amino acid coding, at position aa203, where K (Lysine) replaced R (Arginine) and at position aa204, where R (Arginine) replaced G (Glycine) as seen in Fig 6 above. It has been suggested that the C-terminal domain of the Nucleoprotein of coronaviruses plays an important role of binding the gRNA packaging signal leading to selective genome incorporation [18,19]. The nucleoprotein has also been associated with viral RNA replication and transcription; protection of the genome and possibly facilitate genome transport to the budding site [20]. Thus, a possible mutation at that region could play a vital role in viral fitness and even pathogenesis. Further investigation is required to elucidate the possible effect of these variations.

Conclusion
The use of Sanger sequencing technology for successful sequencing of whole genome of SARS-CoV-2 in this study is instructive that the technology could be useful for facilities, lacking NGS but have Sanger technology, especially those in resource constrained settings. The variations in the sequence obtained, highlight the need for molecular surveillance to generate more genomic data from Africa, even as SARS-CoV-2 spreads and adapts to various environmental conditions globally.