African swine fever whole-genome sequencing—Quantity wanted but quality needed

The pandemic spread of African swine fever virus (ASFV) genotype II (GTII) has led to a global crisis. Since the circulating strains are almost identical, time and money have been mis-invested in whole-genome sequencing over the last few years. New methods and harmonised protocols for sample selection, sequencing, and bioinformatics are therefore urgently needed.

in-depth characterisation [1,6]. However, because whole-genome sequencing using the existing techniques was still extremely laborious, and research interest decreased after the eradication of ASF from Europe, only a few additional whole genomes were published in the following years (Fig 1).
When ASF re-emerged in 2007, research interest in ASFV increased drastically (Fig 1) [2]. Aided by technical advances in sequencing, e.g., the use of second-generation high-throughput sequencing platforms [7], a few additional whole-genome sequences were published and used as a basis for genetic characterisation, virus comparison, and vaccine development [7]. However, only after the introduction of ASF into the EU in 2014 did ASF become a research priority. Since then, numerous ASFV sequences of the respective strains have been published using the latest sequencing methods with the goal of identifying genetic markers and tracing routes of introduction through molecular epidemiology (Figs 1 and 2) [8]. With the introduction of ASF into Asia, home of the densest population of domestic pigs in the world, the world is now facing the worst pandemic of an animal disease seen to date, and new ASFV whole-genome sequences from Europe and Asia are being published with increasing frequency [9]. Nevertheless, we have to ask ourselves whether resources are well invested, since the knowledge gained from recently published ASF genome sequences does not meet the expectations.

What do we expect from sequencing ASFV?
Expectations are high when it comes to whole-genome sequences of ASFV. As observed for other pathogens, whole-genome sequence information is expected to help unravel disease patterns through molecular epidemiology, assist in tracing outbreaks, foster the understanding of virulence, and create the basis for the design of tailored diagnostic tools and for vaccine development.

Do we meet those expectations?
As of today (May 13, 2020), 70 ASFV whole-genome sequences are publicly available (Fig 2). While at first glance this looks like plentiful material to work with, a deeper investigation reveals problems concerning both quantity and quality.

Quantity
Through partial sequencing of the ASFV B646L gene, coding for the major structural protein P72, 24 genotypes (GTs) have been identified so far [10]. However, the 70 available ASFV whole-genome sequences cover only 13 of these GTs with a bias toward the 2 pandemic genotypes I (GTI) (21 sequences) and II (GTII) (29 sequences) (Fig 2). Thus, strains circulating between the natural hosts in the sylvatic cycle are underrepresented (Fig 2).

Quality
Bioinformatic analyses of ASFV sequences are complicated by genome complexity. Artefacts are caused, e.g., by unsuitable bioinformatics workflows or sequencing platforms that lead to low coverage and misassemblies in the extensive homopolymer and repeat regions [8]. Given these problems, combined with missing data on quality parameters [11] and methodology as well as outdated annotations, many published sequences are not suitable for detailed analyses.

Do we need better methods?
Today, high-throughput sequencing platforms of the second generation that produce short reads (50-500 bp) with high accuracy provide good and reliable results. However, the enormous amount of host sequences in the background of most samples leads to a very low virus-to-host-sequence ratio. For ASFV, shotgun sequencing of untreated organ samples usually provides around 0.05%-0.1% viral reads, while datasets from cell-culture supernatant can contain 1%-5% [8]. To assemble a reliable complete genome sequence (mean coverage of around 50), a minimum of around 60,000 single reads (150 bp) is required. Therefore, shotgun sequencing of organ samples, but also of tissue culture samples, requires a high sequencing capacity of many million reads per sample, leading to huge datasets that need to be handled and stored.

PLOS PATHOGENS
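The read numbers above can be checked with a short calculation. The following is a minimal sketch, not part of the original workflow: the function names are hypothetical, and the genome length of 180,000 bp is an assumption chosen from within the known size range of ASFV genomes (roughly 170-194 kb) because it reproduces the figure of around 60,000 reads. The coverage of 50, the read length of 150 bp, and the viral read fractions are taken from the text.

```python
from fractions import Fraction
from math import ceil


def viral_reads_needed(genome_length_bp, mean_coverage, read_length_bp):
    """Number of viral reads needed to reach a mean coverage.

    Mean coverage = (number of reads * read length) / genome length,
    solved here for the number of reads and rounded up.
    """
    return ceil(Fraction(mean_coverage * genome_length_bp, read_length_bp))


def total_reads_needed(viral_reads, viral_fraction):
    """Total reads to sequence so that `viral_reads` of them are viral.

    `viral_fraction` is best passed as a string such as "0.0005" (0.05%)
    so that Fraction can represent it exactly.
    """
    fraction = Fraction(viral_fraction)
    if not 0 < fraction <= 1:
        raise ValueError("viral_fraction must be in the interval (0, 1]")
    return ceil(viral_reads / fraction)


if __name__ == "__main__":
    # Assumption: 180,000 bp genome (not stated in the text).
    needed = viral_reads_needed(180_000, mean_coverage=50, read_length_bp=150)
    print(f"viral reads needed: {needed:,}")

    # Viral read fractions quoted in the text for organ samples
    # (0.05%-0.1%) and cell-culture supernatant (1%-5%).
    for label, fraction in [
        ("organ sample, 0.05% viral", "0.0005"),
        ("organ sample, 0.1% viral", "0.001"),
        ("cell culture, 1% viral", "0.01"),
        ("cell culture, 5% viral", "0.05"),
    ]:
        print(f"{label}: {total_reads_needed(needed, fraction):,} total reads")
```

Under these assumptions, an untreated organ sample needs total read numbers in the tens of millions to over a hundred million, which illustrates why the host background dominates sequencing cost.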
To overcome this obstacle, methods for target-specific enrichment and host depletion have been implemented and successfully used for ASFV sequencing [8,12]. Employing these techniques has led to a significantly higher virus-to-host ratio with 25%-60% viral reads per dataset. However, even with these techniques, sequencing in homopolymer and repeat regions is still challenging. Therefore, third-generation sequencing platforms such as MinION (Oxford Nanopore Technologies) and PacBio (Pacific Biosciences of California) sequencing, which produce single-molecule ultra-long reads, have been employed [8,13]. Although the accuracy, especially for the Nanopore reads, is usually low, they can provide a backbone that, in combination with the short-read data, allows for the assembly of very high-quality ASFV whole-genome sequences [8]. However, identifying and verifying single-nucleotide differences and variants is still challenging, and the existing methods reach their limit, especially in the long homopolymer stretches of up to 16 G and 17 C nucleotides and in the extensive inverted terminal repeat regions at the genome ends.
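To flag the homopolymer stretches mentioned above in an assembled sequence, a simple scan is often enough. The following is a minimal sketch under stated assumptions: the function name `homopolymer_runs`, the default threshold of 8 nucleotides, and the toy sequence are all hypothetical and only serve as illustration. The run lengths of 16 G and 17 C mirror the figures given in the text.

```python
import re


def homopolymer_runs(sequence, min_length=8):
    """Return (start, base, length) for each homopolymer run.

    A run is a stretch of one repeated nucleotide (A, C, G, or T) that is
    at least `min_length` long. Start positions are 0-based.
    """
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    # One base followed by at least (min_length - 1) copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [
        (match.start(), match.group(1), match.end() - match.start())
        for match in pattern.finditer(sequence.upper())
    ]


if __name__ == "__main__":
    # Toy sequence (invented): a 16-G run and a 17-C run with short flanks.
    toy = "AT" + "G" * 16 + "TA" + "C" * 17 + "AT"
    for start, base, length in homopolymer_runs(toy):
        print(f"position {start}: {length} x {base}")
```

Positions reported this way could be used to mark regions where single-nucleotide differences between assemblies deserve extra scrutiny before they are interpreted as real variants.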

Are we on the wrong path?
Huge financial and technical resources have been dedicated to ASFV whole-genome sequencing to provide a basis for molecular epidemiology for the current pandemic. However, although 29 whole-genome sequences from the corresponding GTII strains from 10 affected countries have been published in the last 10 years, only 2 regions showing significant differences were identified [14], and none of those have proven useful for larger-scale molecular epidemiology or source tracking so far. Instead, the sequences show more than 99.9% nucleotide sequence identity and differ in only very few single nucleotides distributed over the entire genome. These differences show no clear pattern or related change in phenotype and cannot be easily distinguished from sequencing or bioinformatics artefacts [8]. Therefore, most of the ASFV GTII whole-genome sequences published since 2010 do not provide any additional information useful to understand virus evolution or combat the disease, and the financial and technical resources have been more or less wasted.
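An identity of more than 99.9% corresponds to fewer than one difference per 1,000 compared positions. The following is a minimal sketch of how such a figure can be computed from two already aligned sequences. The function name `pairwise_identity`, the choice to skip columns containing gaps or N, and the toy sequences are assumptions for illustration; other definitions of identity exist, and this is not necessarily the one used in the cited studies.

```python
def pairwise_identity(aligned_a, aligned_b, skip="-N"):
    """Return (identity, differing_positions) for two aligned sequences.

    Columns in which either sequence has a character from `skip`
    (by default a gap or an ambiguous N) are not compared.
    Positions are 0-based alignment columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    differences = []
    for position, (a, b) in enumerate(zip(aligned_a.upper(), aligned_b.upper())):
        if a in skip or b in skip:
            continue
        compared += 1
        if a != b:
            differences.append(position)

    if compared == 0:
        raise ValueError("no comparable positions")
    return (compared - len(differences)) / compared, differences


if __name__ == "__main__":
    # Toy data (invented): 1,000 positions with a single substitution.
    reference = "ACGT" * 250
    variant = reference[:10] + "T" + reference[11:]
    identity, positions = pairwise_identity(reference, variant)
    print(f"identity: {identity:.3%}, differences at: {positions}")
```

Listing the differing positions alongside the identity value makes it easier to compare them with regions of low coverage or homopolymer stretches, where artefacts are most likely.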

Is it still worth trying?
Despite the high identity of circulating ASFV GTII strains, rare genetic variants have been observed. These variants include strains with single-nucleotide changes affecting their phenotype [15] as well as viruses showing large genome reorganisations and deletions [16]. Since these variants offer a great opportunity to learn about virus evolution and gene functions, it is imperative to identify and analyse them. Therefore, samples for sequencing should be chosen carefully, with priority given to samples from outbreaks where unusual patterns, for example a lower virulence, have been observed.
Furthermore, to use the available resources most efficiently, the current system of expensive and laborious in-depth sequencing of ASFV strains needs to be changed towards a more targeted approach. Here, novel methods (e.g., the combination of target enrichment by hybridisation capture and multiplex sequencing using Nanopore sequencing or small-scale Illumina platforms such as the iSeq 100) could provide a quick and affordable alternative for screening multiple viral genomes for variations, followed by in-depth characterisation of selected candidates.
In addition, more virus strains from Africa should be sequenced to elucidate ASFV evolution and mechanisms of genetic adaptation as well as the emergence of novel GTs. This would also help to prepare for the future spread of other ASFV GTs that, due to limited cross-protection in vaccinated animals (with a future vaccine against ASFV GTII), might require different intervention strategies.
Therefore, cooperation with researchers based in Africa is essential to solving this global problem jointly.
But not only field samples should be considered. ASFV strains that were sequenced in the past need to be checked and validated using the most up-to-date sequencing methods to remove sequencing or bioinformatics artefacts and make them useful for comparative in-depth analyses. Furthermore, strains that were passaged many times should be analysed to validate their genome integrity prior to their use in experimental studies, and cell-line adapted strains as well as genetically modified variants should also be checked very carefully for off-target effects.
In conclusion, whole-genome sequencing is an essential tool in understanding this extraordinary pathogen and the basis for vaccine development. However, efforts must be made to optimise and harmonise protocols for sample selection, sequencing, and bioinformatics workflows as well as documentation and sharing of data (including raw reads) to use the financial and technical resources most efficiently and generate valuable data, which are desperately needed to stop one of the most devastating animal pathogens of our time.