Tracking ebolavirus genomic drift with a resequencing microarray

Filoviruses are emerging pathogens that cause acute fever with high fatality rate and present a global public health threat. During the 2013–2016 Ebola virus outbreak, genome sequencing allowed the study of virus evolution, mutations affecting pathogenicity and infectivity, and tracing the viral spread. In 2018, early sequence identification of the Ebolavirus as EBOV in the Democratic Republic of the Congo supported the use of an Ebola virus vaccine. However, field-deployable sequencing methods are needed to enable a rapid public health response. Resequencing microarrays (RMA) are a targeted method to obtain genomic sequence on clinical specimens rapidly, and sensitively, overcoming the need for extensive bioinformatic analysis. This study presents the design and initial evaluation of an ebolavirus resequencing microarray (Ebolavirus-RMA) system for sequencing the major genomic regions of four Ebolaviruses that cause disease in humans. The design of the Ebolavirus-RMA system is described and evaluated by sequencing repository samples of three Ebolaviruses and two EBOV variants. The ability of the system to identify genetic drift in a replicating virus was achieved by sequencing the ebolavirus glycoprotein gene in a recombinant virus cultured under pressure from a neutralizing antibody. Comparison of the Ebolavirus-RMA results to the Genbank database sequence file with the accession number given for the source RNA and Ebolavirus-RMA results compared to Next Generation Sequence results of the same RNA samples showed up to 99% agreement.


Introduction
Filoviruses are emerging pathogens that cause acute fever with high fatality rate and present a public health threat that impacts far outside the immediate area of an outbreak. The average case fatality rate is around 50% and has been reported to be as high as 90% [1]. There  format, enabling screening of 30 or more pathogens in parallel with rapid turnaround time [20,21]. RMAs do not require the complex assembly software or internet connectivity as NGS does for sequence generation, thus enabling data analysis to be completed at a laboratory in the country of the outbreak, rather than relying on data collection at the site and data analysis in another country. The time between completion of wet laboratory protocols and generation of final sequence is as short as 5 hours, with interpretation of data accomplished by moderately trained laboratory personnel. Whole genome sequences of RNA and DNA viruses, including EBOV, have been reported using RMA systems [22,23]. This study presents the design and initial evaluation of an ebolavirus resequencing microarray (Ebolavirus-RMA) system that demonstrates sequencing the major genomic regions of four Ebolaviruses known to cause disease in humans and reveals the ebolavirus glycoprotein coding sequence changes as a recombinant virus is passaged under pressure from a neutralizing antibody.

Resequencing microarray design
A comprehensive survey of the available ebolavirus genomes was performed, and computational methods employed to identify the genomic regions with the highest variability and least variability across the known genomes of EBOV, BDBV, SUDV and TAFV. Along with a literature survey to determine genome regions of known importance, the results of the alignment analysis guided the design of sequence regions (tiles) to be manufactured on the microarray. The resequencing microarray designed has an assembly of tiles covering every base in the selected coding sequences of the four evaluated Ebolaviruses. For computational efficiency, seven short variable region tiles, and six short, conserved region tiles were designed for each Ebolavirus that overlap the coding sequence tiles with a total of 52 short tiles to cover the four Ebolaviruses. Some of the short tiles have multiple slightly different copies to ensure accurate sequence determination where higher variability among known isolates required. Fig 1 illustrates the position of the sequence of those tiles aligned with the EBOV genome sequence. The aligned position of all the microarray tiles is shown in S1 Fig. Six short tiles designed in the NP region ensured detection of the known variants of the Marburg Filoviruses (S1 Fig), though they were not tested experimentally. The array design was sent to Affymetrix, Inc. for manufacturing as previously described [24].
Each tile detector is comprised of a series of overlapping oligonucleotides 25 bases long. Eight sets of probes (one perfect match probe and three mismatch probes, each with one of the three mismatch nucleotides in the 13th position for the plus strand and the complementary minus strand) were generated for each of the 70,208 bases that can be determined by the microarray tiles. This design allows reporting the actual sequence of the virus in the sample even though its sequence is up to 20% divergent from the sequence the design modeled. Sequence composition of each tile is shown in S1 Table. Each of the four Ebolaviruses has tile detectors comprising 11,851 bases of the 18,959-base genomes. Multiple overlapping variable region tiles ensure that most variants will be sequenced. Conserved region tiles are important to provide sequence for computational determination of the genome identity.
The complete description of the microarray is available at ArrayExpress (https://www.ebi. ac.uk/arrayexpress/experiments/) with the accession number A-MTAB-670.

Primer design
Primers were designed to amplify approximately 500-base fragments requiring multiple overlapping amplicons to cover the larger tile detectors. Primer pairs were assembled into 4 pools to achieve multiplex amplification. Ebolaviruses divergence required the design of 2 sets of 4 primer pools: one set for use with BDBV/TAFV, and one set for use with EBOV/SUDV. To achieve sequencing of the cultured recombinant virus, rVSV-EBOVgp-GFP [25], two additional primers were added to the EBOV pool (S2 Table) because the sequence flanking the EBOVgp coding sequence was different from the flanking sequence in the native EBOV genome. Primer pairs were tested in single-plex PCR procedures using genomic RNA templates. Sensitive and specific primers were selected and optimized into four mixtures, using Oligo 7 software (Colorado Springs, CO) to avoid dimer formation, false priming, and internal hairpins. The performance of the primers was visualized using a sensitive capillary electrophoresis system (2100 Bioanalyzer Instrument, Agilent Technologies, Inc., Santa Clara, CA). The primer sequences and the pool to which each belongs are listed in S2 Table.

Source of Ebolavirus and nucleic acid extraction
Samples of the four Ebolaviruses were obtained from BEI Resources (Manassas, VA, BEIRE-SOURCES.ORG). For EBOV isolate Mayinga (NR-31806) and Bundibugyo virus (NR-31812), reference materials were obtained as previously isolated RNA extracts, as described [26]. For Taï Forest virus (NR-44241), Sudan virus (NR-31810), and EBOV variant Makona (NR-49462), irradiated whole cell extracts containing inactivated virions were obtained from BEI Resources. Ebolavirus genomic RNA was isolated from these with the QIAamp1 MinElute1 Virus Spin Kit (Qiagen; cat no.57704) per the manufacture's instruction with the modification that the incubation/extraction step was extended to 1 hour for increased yield.

Multiplex reverse transcription PCR
RNA extracts were measured for quality control in the Nanodrop Spectrophotometer (ND-1000, Thermo Fisher Scientific, Wilmington, DE). Processing followed the previous method generally [18] with the following adaptations. In a clean cabinet, an annealing mix of 4 μL DEPC water, 1 μL of 40 μM random ninemer, 1 μL 10mM dNTPs, and 1 μL each of the internal control templates (1 pg/μL) designed from the sequence of Arabidopsis genes NAC1 and TIM [27] were added to 4 μL (4X10 5 genomic copies) of ebolavirus RNA. The reaction components were mixed gently and placed in a thermocycler 5 min at 65˚C, 5 min at 4˚C. A Reverse Transcription mix of 2 μL SuperScript™ III (200 U/μL) Reverse Transcriptase (Thermo Fisher Scientific, South San Francisco, CA), 8 μL 5X First strand buffer, 4 μL DTT (25mM), and 2 μL of RNAsin (40 U/μL) was prepared and added to the annealing mixture, mixed gently and then heated in a thermocycler 1 min at 25˚C, 60 min at 50˚C, 15 min at 75˚C. A PCR master mix was prepared by adding 100 μL DEPC water, 40 μL 5X GoTaq G2 Buffer, 24 μL MgCl 2 (25 mM), 4 μL dNTP (10 mM), and 8 μL GoTaq G2 5 U/μL (Promega). Forty-four ul of this mix was added to each of 4 pools, along with 1 μL primer mix per pool (See S2 Table for composition of primers in each pool). RT product (5 μL) was added to each pool, mixed gently and heated 1 cycle of: 5 min at 24˚C, 2 min at 94˚C, 11 cycles of: 30 sec at 94˚C, 30 sec at 50˚C, the lower temperature increasing 1˚per cycle to 68˚C then 29 cycles of: 30 sec at 94˚C, 120 sec at 60˚C. After the thermocycling, the samples can be frozen at -20˚or advanced to the microarray processing steps.

Microarray hybridization and sequence analysis
Microarray hybridization, processing and scanning were carried out following the previously published method using a GeneChip assay kit (Affymetrix, Inc.) [28]. Sequence data were generated as base calls using the Gene Chip Sequence Analysis Software (GSEQ, Affymetrix, Inc), which generates a FASTA file containing the sequence of each of the tiles on the microarray. The details of microarray results including the image file (CEL files), base-calling file (CHP) and the final called base sequence (txt files) of each microarray are available at ArrayExpress (https://www.ebi.ac.uk/arrayexpress/) under the accession number E-MTAB-10007 for the Ebola genome resequencing.
Sequence data in the FASTA file is processed by a custom data analysis pipeline, ebola_i2o. A very simple Graphic User Interface allows choice of the pipeline and choice of the FASTA file. With a click on the "run" button, the ebola_i2o executes a series of PERL-based scripts that link together some minor calculations and open source software: BLAST for database searching [29], MUSCLE for multiple sequence alignment [30], Phyml for creating phylogenetic trees [31] and Newick Utilities for processing phylogenetic trees [32], all unseen by the operator. The ebola_i2o pipeline is loaded on a Dell Precision Tower 7910 workstation with Dual Intel Xeon Processor E5-2 650 v3, 64GB memory and two hard drives (1TB and 0.5TB). As a first step the pipeline creates a consensus sequence (determined by plurality) for each of the 52 short tiles. As an example, the nucleoprotein (NP) full length tile overlaps NP Con1. The pipeline compares the two tiles in the area where they overlap and modifies the NP Con1 to have the consensus (determined by plurality) base at each position. Next, a C3 score is calculated for each tile sequence. The C3 score is the total number of nucleotides that occur in runs of three or more consecutive (non-N) base calls, expressed as a percentage of the length of the tile sequence (in bases) [20]. The C3 score is a measure of the quantity of hybridized DNA and the quality of hybridization. Tile sequences with C3 scores greater than or equal to 20.0 are considered to represent true positive hybridization events; tile sequences with lower scores are postulated to be products of cross-hybridization or assay noise.
Tile sequences, including the 6 designed to detect Marburg virus, are searched for homologous sequences in an in-house Filovirus specific database, FilovirDB, that is stored on the Dell Precision Tower 7910 workstation, using the BLAST alignment algorithm. BLAST results for each tile: the top three database hits including their Bit-score, E-value, accession number and scientific name are stored in a BLAST report along with the C3 score for each tile. This BLAST Report is deposited into a date and time stamped folder. A text file with the original sequence report of all the tiles is also added to the folder.
We have identified 13 short segments in each of the genomes of the four human-infecting Ebolaviruses (six conserved regions and seven variable regions, see Fig 1) whose consensus sequences taken together allow us to accurately place a test Ebolavirus in a phylogenetic tree of Ebolaviruses of known strain type with minimal computational time. Unseen by the operator, the pipeline compares these 13 segments and chooses, among the 4 Ebolaviruses, the tile with the highest C3 score, most likely to represent a real hybridization event of a tile with its cognate DNA sequences. Each high-scoring tile sequence is then aligned to the reference alignment file of the corresponding Ebolavirus. This reference alignment file stores the alignment of 160 representative ebolavirus sequences for the given genome segment. Alignments are performed in MUSCLE software [30]. This process is repeated for all 13 short segments (tile + reference sequences), which are subsequently concatenated into a single master alignment that the automatic pipeline uses to create a phylogenetic tree with the method of maximum likelihood and General Time Reversible (GTR) model of substitution, which assumes different rates of substitutions and frequencies of occurrence of nucleotides. The pipeline performs phylogenetic analysis in PhyML software [31], an example of the tree output of the pipeline is shown in S2 Fig, displaying the tree as seen in SeaView [33]. The phylogenetic tree output files are loaded in the folder along with the BLAST Report.
The code for the ebola_i2o pipeline and construction of FilovirDB is freely available at: https://github.com/openbox-bio/ebola_i2o.
To obtain a consensus Ebolavirus-RMA sequence, the FASTA files from the results of 3 independent microarray procedures per nucleic acid sample can be loaded into one of many alignment programs. In our analyses, we used the software package, Sequencher (Gene Codes Corp, Ann Arbor, MI), to align the tile sequences to the reference sequence identified as the closest match in the phylogenetic tree constructed by the pipeline. We generated a final consensus sequence by plurality rule for each sample. A graphic outline of the full analysis pipeline is shown in Fig 2. Though no Marburg virus nucleic acid was obtained to test the full RMA process, mock FASTA files were constructed by inserting the Marburg virus sequence selected for the 6 Marburg tiles into an Ebolavirus-RMA output file with only non-called bases (Ns) in the other tiles. These FASTA files were loaded into the pipeline which assessed the quality of the tile sequences then searched them against the FilovirDB resulting in output files identifying the samples as Marburg virus, demonstrating the performance of the pipeline with the Marburg virus tile sequences.

Next generation sequencing (NGS)
Library preparation and HiSeq system sequencing were performed following established laboratory protocols [34], which are summarized below. RNA samples were processed following the protocol for the Illumina TruSeq Stranded mRNA Sample Preparation Kit. One microgram of total RNA was fragmented and reverse-transcribed into cDNAs (the poly(A)-tailed RNA enrichment was skipped, as viral RNA does not contain poly(A) tails). Double-strand cDNAs were adenylated at the 3' ends and individually indexed, followed by limited-cycle (15) amplification. Paired-end sequencing (100x2 cycles) of multiplexed RNA samples per lane was carried out on an HiSeq 2500 sequencer (Illumina, Inc., San Diego, CA). Fastq files of sequence reads were assembled to the appropriate reference with FDA CBER's High-performance Integrated Virtual Environment (HIVE) Hexagon aligner [35], a cloud-based environment optimized for storage and analysis of extra-large sequence data. The Hexagon alignment was further processed with the HIVE-heptagon sequence profiling tool [36] which formulates a consensus from the base with highest frequency over all reads at each position.
The assembled consensus NGS sequence for an RNA sample was compared to the Ebolavirus-RMA sequence for the same sample with ClustalW operated within the MacVector software (version 17.5.5, MacVector, Inc, Apex, NC). ClustalW parameters were Slow pairwise alignment mode, open gap penalty = 15.0, extended gap penalty = 6.7, delay divergent = 30%, transitions: weighted. The raw Illumina NGS reads and the assembled consensus sequence are available at ArrayExpress (https://www.ebi.ac.uk/arrayexpress/) with the accession number E-MTAB-10102.

rVSV-EBOVgp-GFP antibody neutralization and non-neutralization assay
Replication-competent recombinant vesicular stomatitis virus (VSV) constructs in which the VSV-G envelope protein was replaced with the EBOV glycoprotein (GP), followed by the green fluorescence protein (GFP) gene, termed rVSV-EBOVgp-GFP, were prepared as described previously [25]. Prior to the neutralization experiment, a stock culture of rVSV-EBOVgp-GFP was RNA extracted and six replicate Ebolavirus-RMA chips were processed to produce an RMA consensus of the sequence of the EBOVgp coding sequence. The details of microarray results including the image file (CEL files), base-calling file (CHP) and the final called base sequence (txt files) of each microarray are available at ArrayExpress (https://www.ebi.ac.uk/arrayexpress/ ) under the accession number E-MTAB-10008 for the resequencing of the Ebola glycoprotein gene inserted into Vesicular Stomatitis Virus. The initial stock RNA was also subject to Illumina next generation sequencing for comparison to the Ebolavirus-RMA consensus sequence. The raw reads and the assembled consensus sequence are available at ArrayExpress (https://www. ebi.ac.uk/arrayexpress/) with the accession number E-MTAB-10157. To monitor genome changes, African green monkey kidney (Vero E6) cells were infected with rVSV-EBOVgp-GFP as described in previous neutralization assays [25,37]. Briefly, the EBOV neutralizing human monoclonal antibody (mAb) KZ52 (Integrated BioTherapeutics, Inc.) [38] was incubated at a concentration of 2μg of antibody in 100μl of culture medium or PBS as control with rVSV-EBOVgp-GFP. After 1 hour, the mixtures were added to Vero E6 cells in a 6-well plate. Four days post infection, culture supernatant was recovered for sequencing (Round 1) and 100μl passaged onto fresh Vero E6 cells or treated with 2μg of KZ52, then added to fresh Vero E6 cells. Three days post passage, supernatant was collected as Round 2 and passaged onto fresh Vero E6 cells with the same treatments. Three days post passage, supernatant was collected as Round 3. During all 3 rounds of passage, mAb KZ52 was present in one culture and a parallel culture received PBS control. RNA extraction of each Round of culture supernatant was sequenced by the Ebolavirus-RMA to reveal the sequence of the EBOV glycoprotein. The Round 3 RNA sample was also submitted to Illumina HiSeq NGS and the EBOV glycoprotein assembled for comparison with the RMA result. The raw reads and the assembled consensus sequence are also available at ArrayExpress under the accession number E-MTAB-10157.

ClustalW multiple sequence alignment
Nucleotide sequences were aligned in the software MacVector (MacVector, Inc., Apex, NC) using the ClustalW algorithm according to the instructions in the MacVector User Guide with default parameters.

Design of the Ebolavirus-RMA
The Ebolavirus-RMA was designed to determine the genome sequence of the four Ebolaviruses that pose a threat to human populations. Additional human non-pathogenic Ebolavirues such as Reston virus and Bombali virus were not included to reserve space for coverage of the sequence diversity within the human-infecting viruses. Each virus sequence is represented by 17 to 28 locations on the microarray depending on the Ebolavirus (Figs 1 and S1). The locations, called tile detectors, are prototype sequences that fully cover 4 of the 7 genes in the ebolavirus genome. Prototype or reference sequences were selected based on the consensus of all ebolavirus NCBI database entries prior to the 2013-2016 epidemic. Due to the extent of divergence between the 4 Ebolaviruses, a complete set of tile detectors was designed for each Ebolavirus to ensure that high-accuracy sequence results would be obtained (S1 Fig).
The target genes were selected based on an assessment of which genes have shown variability, demonstrated a role in virulence, potential impact on diagnostics or therapeutics, and are significant for monitoring genetic drift in outbreaks. For example, genetic drift in matrix protein (VP40) and GP [39] as constituents of the envelope of the ebolavirus virion are more likely to affect immune system recognition of the virus than other genes [40]. The virus RNA Dependent RNA polymerase encoded in the L gene and the nucleoprotein (NP) gene, as a consequence of their roles in replicating and coating the single strand-RNA genome respectively, may affect the mutability of the virus and were among the genes that contained unique nucleotide variants in the Makona strains [41]. Most of the assays used for detection and diagnosis of Ebolavirus in the field are currently using either serological or RT-PCR-based nucleic acid tests (NAT) [42,43] that target these 4 genes.
For target regions that are highly variable, multiple tiles were designed for the same sequence region to cover the sequence divergence of the variants (Fig 1, labeled "Var"). To aid in the database search of the output sequence of Ebolavirus, we also included overlapping tiles covering highly conserved regions of sequence that provide anchors (Fig 1, labeled "Con"). This composition of the microarray was designed to achieve the accurate sequencing of the range of nucleotide variability among the species of Ebolavirus rather that complete genome coverage of one reference strain as in other RMAs [23].
Tiles for Marburg virus were included to detect this related Filovirus when a patient with overlapping symptoms is mistakenly diagnosed with Ebola Disease. There is not enough coverage nor an intent to sequence the Marburg virus genome.

Evaluating the Ebolavirus-RMA with repository ebolavirus material
We initially tested the ebolavirus protocol with repository samples of extracted virus RNA or irradiated crude infected Vero cell lysate obtained from BEI Resources (Table 1). With a starting input of 4.0x10 5 copies of viral RNA, base call rates of 98.8-99.6% were obtained for 4 of the 5 Ebolaviruses or variants tested (Table 1). Call rates were lower only for the SUDV strain Boniface, possibly due to specimen matrix interference or divergence from the tile designs.
To assess the accuracy of the sequence generated by the Ebolavirus-RMA and the accuracy of the pipeline in assigning a correct reference sequence, we compared the generated consensus sequence with the sequence file retrieved from GenBank for the accession number provided by the sample supplier. For all three Ebolaviruses and two EBOV variants, a level of agreement of >98% was obtained for 4 of 5 with only SUDV, which had an inferior call rate, having lower (Table 1). Furthermore, our custom-analysis pipeline placed the sequences derived by the microarray on the phylogenetic tree adjacent to the sequences of the repository Ebolaviruses and variants (S2 Fig). To condense the phylogenetic tree display of the results from all the Ebolaviruses, an alignment was performed separately from the pipeline using the consensus sequence from three microarray outputs for each RNA sample and the full-length ebolavirus genomes from the database (Fig 3). Notably, the algorithm placed the EBOV Mayinga isolate (AY142960) and Makona (KJ660347.2) variant, which differ by only 2.9% across their genome sequence, on separate branches of this tree. Therefore, the microarray obtains an accurate sequence, and the pipeline is a suitable way to automate the interpretation of the FASTA outputs and assign the sample the correct identity even among samples that have a similar sequence.
To compare the results of the Ebolavirus-RMA with another leading sequencing platform, we chose EBOV Mayinga isolate as an example, prepared NGS libraries from the RNA sample, processing them on an Illumina sequencer. We obtained full coverage of the 19kb EBOV To estimate the sensitivity of the Ebolavirus-RMA for detection of Ebolavirus and evaluate whether the sequence quality decreases with decreased amounts of input RNA, we serially tenfold diluted samples of EBOV Mayinga, with the lowest dilution including 40 copies per reaction. While the call rate decreases with dilution, the output sequence and search algorithm remained able to correctly assign the sample to the appropriate reference with the lowest number of genome copies tested ( Table 2). Even at 40 copies of input RNA and a weighted mean C3 score of 28 for the EBOV tiles, the Mayinga isolate could be distinguished and assigned to a reference that would serve to identify the Ebolavirus present in a clinical sample, though 400,000 copies would be required to accurately sequence the viral genome.
All the microarray data from this microarray study is available on the ArrayExpress repository (https://www.ebi.ac.uk/arrayexpress/) with the accession number E-MTAB-10007.

Ebolavirus-RMA sequencing of a replicating recombinant virus
To demonstrate that the Ebolavirus-RMA as a system could effectively monitor the genetic drift of the ebolavirus genome as might occur in outbreak scenarios, we followed a previously described approach generating escape mutants with a neutralizing antibody [44,45]. We used the replication competent recombinant virus, rVSV-EBOVgp-GFP. The recombinant virus was engineered by molecular substitution of the coding sequence of the VSV glycoprotein with the EBOV glycoprotein in a plasmid, which, when transfected into cells, accomplishes assembly of infectious virus. The recombinant virus was designed for use in biosafety level 2 laboratory (BSL-2) neutralization assays to evaluate antibody response to Ebola virus disease and vaccination [25,37,46]. Previous studies have demonstrated that this virus performs similarly to wildtype Ebola in neutralization assays performed in a BSL-4 laboratory [46]. We have observed that, after multiple passages in Vero E6 cells with neutralizing antibodies, "escape mutants" evolve that are no longer blocked by the antibody. Identification of base changes in the EBOV glycoprotein coding sequence that could result in these escape phenotypes would demonstrate that the Ebolavirus-RMA would detect EBOV genomic drift. rVSV-EBOVgp-GFP was passaged in the presence or absence of a mAb, KZ52, with a known neutralizing epitope on the ebolavirus GP [47]. Virus stock was collected before passaging (initial stock) and after 1, 2 and 3 passages. In the first passage the mAb KZ52-treated virus was severely restricted in infection rate compared with the untreated virus. By the third passage, the virus in the presence of mAb KZ52 was achieving an infection rate similar to the untreated virus (Fig 4).
RNA was extracted from virus stocks at each passage and sequenced by the Ebolavirus-RMA. Only the EBOV tiles in the GP region of the microarray resulted in base calls. Results are shown in Table 3. The positions where substitutions were detected illustrate several different patterns of genomic drift. At position N(506) the Ebolavirus-RMA results, confirmed by NGS, identified an amino acid substitution known to block binding of the KZ52 antibody [48,49] that only appears in the 3rd round of passage under pressure from the KZ52 antibody. This is a tangible example of the genomic drift that the Ebolavirus-RMA is designed to detect. Substitutions at positions Q(28), I(129) and Q(467) appear not to be responding to selective pressure because they either appear, then revert to the original sequence over the time in culture with or without pressure from the KZ52 antibody, or they appear without selective pressure. At the codon for amino acid V(662), it appears that a base designed in the plasmid used to create the recombinant virus was substituted, detected by Ebolavirus-RMA and confirmed by NGS, when the original virus emerged, changing the sequence to encode isoleucine, which remained fixed in all virus passages. At two positions, V(52) and S(422), the Ebolavirus-RMA base call was not confirmed by NGS suggesting there may be a flaw in the RT-PCR steps or microarray tile at these locations, since the incorrect base call is made with multiple different RNA samples. Future work will determine the source of these variants (see Discussion). However, the overall agreement between NGS sequence of the rVSV-EBOVgp-GFP recombinant virus and the Ebolavirus-RMA results for the same sample taken from the initial stock virus was 99.2% and for the RNA from the KZ52 selected round 3 virus, the agreement was 97.1%, which is comparable to the rate of agreement which was obtained with the Ebolavirus-RMA and NGS in the full-length genome samples.
All the microarray data from the replicating recombinant virus study is available at ArrayExpress (https://www.ebi.ac.uk/arrayexpress/) with the accession number E-MTAB-10008.

Out-group comparison to non-pathogenic Ebola species
To demonstrate that the Ebolavirus-RMA as a system reports a sequence that is specific to the human pathogenic Ebolavirus species, we performed in silico comparisons of the RMA consensus sequences for EBOVs (Mayinga and Makona), BDBV, SUDV and TAFV to the species not pathogenic to humans (Bombali ebolavirus and Reston ebolavirus). Note that the RMA contains probes to sequence 11,851 bases of the 18,959-base genomes, thus a perfect match would have 62.5% identity. These comparisons (Table 4) show that the sequence of each Ebolavirus species or isolate determined by the Ebolavirus-RMA has the highest identity to the sequence deposited in the Genbank NCBI database for the source material RNA. The identity to all other sequence files were low enough to clearly distinguish the divergent species.

Discussion
This paper describes the evaluation of the Ebolavirus-RMA as a rapid technology to obtain the critical sequence of the ebolavirus genome. The goal of a simplified method to monitor genomic drift of Ebolavirus during an outbreak was achieved as demonstrated by the sequencing of a VSV-EBOVgp recombinant virus during passage in culture under selective pressure from a neutralizing antibody. The Ebolavirus-RMA detected a base substitution responsible for an amino acid replacement known to eliminate the binding of the neutralizing antibody. The number of substitutions detected and confirmed by NGS (2 non-synonymous base substitutions) suggest a much higher nucleotide substitution rate than observed for Ebolavirus in nature, 7.06x10 -4 substitutions/site/year [12]. Analysis of the base call output from the scanner software with a custom designed bioinformatic pipeline and alignment of three replicate outputs with the reference sequence selected by the pipeline to form the final consensus achieved a high sequencing accuracy. This workflow results in base calls over the genome sequence covered by the microarray at accuracies of 97.1% to 99.2% agreement with the reference sequence of the Ebolavirus variant. This accuracy was obtained with 4x10 5 genome copies, which is well within the average number of copies in an acutely infected patient of 2.7x10 7 /mL [50]. This capability compares well to the standard NGS technologies that were deployed in the West African and Central Africa outbreaks. With Illumina HiSeq as we used for comparison to the Ebolavirus-RMA results, approximately 0.1-1% of bases are called incorrectly [51]. For a genetically homogenous sample, the effects of Table 4. Percent sequence identity of Ebolavirus-RMA sequence outputs with GenBank NCBI database sequences for Ebolaviruses. Header row contains the name and accession number of each database file. The first column contains the name of each resequencing microarray output which is a consensus of three microarrays. these base miscalls can be mitigated by establishing a consensus sequence from high-coverage sequencing reads [51]. The Oxford Nanopore MinION has been used in EBOV sequencing and has attributes of portability and lower cost than standard NGS. However, the large number of sequence reads require powerful bioinformatic tools and expertise to successfully assemble an ebolavirus genome. In contrast, we have generated a consensus sequence from three microarray results to achieve the level of accuracy described here. The automated pipeline performs downstream bioinformatics processing with little more that the click of a button on the graphic user interface. The time between raw data acquisition and final sequence output, aligned to a reference ebolavirus sequence can be within the same day. The time between initiation of the workflow with an RNA sample and final sequence was 26 hours with the BEI Resource provided specimens. This capability could be enabled in an endemic country and would have established the identification of Ebola virus in the DRC outbreak facilitating the use of the rVSV-ZEBOV-GP vaccine or establishing the genomic identity between the isolates earlier this year in Guinea and the Makona strain thus pointing to the resurgence of the outbreak. Below concentrations of 10 5 RNA copies or in specimens like the Sudan virus, the base call rate was lower so that the platform was no longer able to produce the detailed genome sequence. However, error rates were not affected, and the sequence results obtained were sufficient for identification of the virus infection, which means that the assay has applicability in real-time public health surveillance efforts. Another possible flaw in the system was revealed when Ebolavirus-RMA sequence of the rVSV-EBOVgp-GFP recombinant virus was compared with NGS of the same RNA sample. Though further work is needed to characterize this flaw, the error only occurred when RNA encoding EBOV GP was derived from the recombinant virus. The same codons in RNA from five different Ebolavirus full-length genomes were basecalled correctly by the Ebolavirus-RMA. Future work will re-design this region of the microarray and design primers that are specific for divergent strains like Sudan. We hypothesize that the higher sequence divergence of the Sudan virus from the other 3 Ebolaviruses makes some of the primer pairs unsuitable for amplification and the genome of the sample may be divergent enough from the tile detector sequences to make hybridization inefficient. In future work the primer pools will be further optimized, and additional variant tiles added to the array to improve sequencing of the Sudan virus.

Ebola virus
In summary, comprehensive genome sequencing of whole virus samples and the passagespecific variants detected in the cultured recombinant virus indicate the Ebolavirus-RMA is suited for further evaluation with ebolavirus samples toward acceptance as a tool to evaluate clinical specimens.
Supporting information S1  NR-31806), sequenced with the Ebolavirus-RMA (chip #EBOVMPX004). The sequence output was processed with the ebola_i2o pipeline. The phylogenetic tree output of the pipeline is visualized with SeaView Version 4.7 [33]. Each file in the tree carries the name assigned to that file in the database, FilovirDB. According to this tree, the genomes in FilovirDB most closely related to the Ebolavirus-RMA output sequence are KC242791, KC24801, AF499101 and AY142960. AY142960 is the database sequence entry for the BEI Resources NR-31806 sample. An arrow indicates the location on the tree of the RMA output sequence. This