A short plus long-amplicon based sequencing approach improves genomic coverage and variant detection in the SARS-CoV-2 genome

doi:10.1371/journal.pone.0261014

Fig 1.

Study rationale and assay design.

Panel A illustrates the study’s rationale and sketches the layout of the ARTIC and MRL primer pools across the SARS-CoV-2 genome. Small and long arrows indicate short and long amplicons, respectively. Red arrows indicate mutations with the potential to alter a primer binding site. Panel B shows the design of the three assays developed and assessed in the present study. Assay-1 is based on short amplicons generated with the ARTIC primer pool. Assay-2 is based on long amplicons generated with MRL primers, followed by short-read sequencing on the MiSeqDx. Assay-3 is based on long amplicons sequenced with long-read technology on the MinION platform.

Table 1.

Summary of sequencing metrics for all assays on fifteen samples.

Fig 2.

Virus genome mapped reads and genomic fraction covered in various assays.

Panels A and B show the percentage of sequencing reads mapped to the virus genome (A) and the fraction of the virus genome covered (B) with ARTIC primers. NC is the negative control (human dendritic cell RNA) and PC is the positive control (VR1986D-ATCC SARS-CoV-2 RNA); the remaining samples are fifteen nasopharyngeal swab RNAs from COVID-19-positive patients. Panels C and D show the percentage of sequencing reads mapped to the virus genome (C) and the fraction of the virus genome covered (D) with MRL primers. Panels E and F show the percentage of sequencing reads mapped to the virus genome (E) and the fraction of the virus genome covered (F) in the ARTIC plus MRL Hybrid data set. The NP2 sample could not be analyzed with MRL primers due to limited sample material. Panel G: a Venn diagram summarizes the number of mutations detected uniquely or in common by the individual primer sets (ARTIC or MRL) and by the hybrid analysis. Panel H shows the average read depth at ambiguous mutations in the ARTIC assay; the t-test p value for the read depth comparison is shown.

Fig 3.

Short plus long-amplicon hybrid data provide uniform and maximum genomic coverage.

The Integrative Genomics Viewer (IGV) plot shows sequencing coverage tracks for the ATCC positive control RNA and four patient samples (NP1, NP3, A10 and A4) based on ARTIC primers (short-amplicon data, panel A), MRL primers (long-amplicon data, panel B) and Hybrid data (short plus long amplicons, panel C). The x-axis shows the position in the virus genome and the y-axis shows the individual samples. Grey tracks in the top panel represent ARTIC data, green tracks in the middle panel represent MRL data, and purple tracks in the bottom panel represent Hybrid data. Colored lines on the sequencing coverage tracks indicate detected mutations. Black solid arrows point to genomic intervals that are poorly captured in the ARTIC data set. Blue solid arrows point to three example mutations: Q57H, G172V and A2A.

Fig 4.

Mutation load and pangolin lineage assignment in study samples.

Panel A shows the number of mutations detected in the ATCC RNA and fifteen patient samples in the Hybrid data set. The x-axis shows the study samples. The ATCC RNA is shown as a black bar. Samples NP1-NP5, shown as blue bars, were collected during Jan-Feb 2021. Samples C6-A5, shown as red bars, were collected during May-June 2021. The y-axis shows the total number of mutations detected in each study sample. The Pangolin lineage and the WHO label for that lineage are shown above each bar. Panel B shows the frequency distribution of 214 high-confidence mutations in the study samples. The x-axis shows the genomic position of the detected mutations. The y-axis shows the frequency of each mutation in the study samples. Each bar represents an individual mutation. In this study’s cohort, the 8 tallest bars mark mutations with >40% frequency. Amino acid alterations are shown above each bar for the 8 most common mutations (blue font) and for several variants of concern and variants of interest in the spike gene (red font).

Fig 5.

Phylogenetic analysis on study specimens.

A maximum likelihood tree was constructed to explore the phylogenetic relationship between SARS-CoV-2, SARS-CoV-1, MERS and endemic coronaviruses (HCoV-NL63, HCoV-229E, HCoV-OC43, HCoV-HKU). The whole genome sequences for SARS, MERS and the endemic coronaviruses were downloaded from NCBI. The numbers along the branches give bootstrap support as a percentage of 1000 bootstrap resamplings. Samples NP1-NP5, collected in Jan-Feb 2021, form a clade and are highlighted in pink. ATCCVR RNA (green font) and patient samples C6 and B11 form a separate clade. Samples B9, A5, A12 and A10 (blue font) cluster closely on the tree. Dots labeled Alpha, Beta and Delta represent reference genomes for the respective strains, downloaded from NCBI.

Fig 6.

Long-read sequencing provides uniform coverage across deletion-prone region in the virus genome.

Panels A and B show the percentage of sequencing reads mapped to the virus genome (A) and the fraction of the virus genome covered (B) with Nanopore sequencing data in the positive control (VR1986D ATCC SARS-CoV-2 RNA). Panels C and D show the reads mapped to the virus genome and the genomic fraction covered, respectively, in the combined ARTIC plus long-read (Hybrid-II) data. Panel E sketches known deletions in a 2 kb region of the spike gene. Panel F shows sequencing coverage across a deletion-prone region of the spike gene in the ATCC positive control RNA for short-amplicon ARTIC and long-amplicon MRL data. The top coverage plot (ATCC_PC_ARTIC) shows short-amplicon sequencing coverage on the Illumina MiSeqDx platform, whereas the track underneath shows long-read sequencing data for the ATCC_PC sample generated with MRL long-amplicon primers. Panel G shows a UCSC Genome Browser snapshot of the spike gene region with previously reported deletions in this interval of the SARS-CoV-2 genome.
