Comprehensive Phylogenetic Reconstructions of African Swine Fever Virus: Proposal for a New Classification and Molecular Dating of the Virus

African swine fever (ASF) is a highly lethal disease of domestic pigs caused by the only known DNA arbovirus. It was first described in Kenya in 1921 and since then many isolates have been collected worldwide. However, although several phylogenetic studies have been carried out to understand the relationships between the isolates, no molecular dating analyses have been performed so far. In this paper, comprehensive phylogenetic reconstructions were made using newly generated and publicly available sequences of hundreds of ASFV isolates from the past 70 years. Analyses focused on the B646L, CP204L, and E183L genes from 356, 251, and 123 isolates, respectively. Phylogenetic analyses were performed using maximum likelihood and Bayesian coalescence methods. A new lineage-based nomenclature is proposed to designate 35 different clusters. In addition, dating of ASFV origin was carried out from the molecular data sets. To avoid bias, diversity due to positive selection or recombination events was neutralized. The molecular clock analyses revealed that ASFV strains currently circulating have evolved over 300 years, with a time to the most recent common ancestor (TMRCA) in the early 18th century.


Introduction
African swine fever (ASF) is an infectious and contagious hemorrhagic disease of domestic pigs [1]. It is highly lethal, causing up to 100% mortality in naive animals, with devastating effects on pig production and animal trade, and major economic losses in affected countries [2]. First described by Montgomery in 1921 in Kenya [3], ASF has then been observed in most sub-Saharan countries, where it has often become endemic [4]. From Africa, it reached Europe, i.e. Portugal in 1957 and again in 1960, from where it colonized Spain, France, and Belgium. From there, the virus reached Latin America during the 70s-80s. In Europe, ASF remained endemic in the Iberian Peninsula up to the middle of the 90s and the disease is still present in Sardinia [2]. Recently, it has been re-introduced on the borders of Europe, in Georgia in 2007 [5] and then it extended to the Caucasus and Russia [6]. No vaccine is available and disease control is based only on quarantine and animal slaughtering. In this context, its great ability to spread makes the ASF virus one of the most important infectious threats for the domestic pig industry worldwide. African swine fever virus (ASFV) is a large icosahedral and enveloped dsDNA virus; it is the only recognized DNA arbovirus and also the only member of the Asfarviridae family and Asfivirus genus [7]. However, ASFV shares characteristics with the other members of the Nucleo-Cytoplasmic Large DNA virus family [8], suggesting that they all may have had a common ancestor [9], [10].
ASFV is believed to be an ancestral virus of soft ticks (Ornithodoros genus) [11] that infects wild swine such as warthogs (Phacochoerus africanus), bushpigs (Potamochoerus porcus), and giant forest hogs (Hylochoerus meinertzhageni) asymptomatically. The virus replicates in ticks and is then transmitted to wild swine during blood feeding; wildlife is considered the natural reservoir of the virus. The virus can persist in ticks for years, even in quiescent ticks waiting for host feeding. The sylvatic cycle of ASFV established between ticks and wild suids can be maintained indefinitely. This cycle allows the maintenance of virus circulation and probably enables the persistence of ancient viruses and the emergence of new variants. At the laboratory level, virus variants were initially characterized by genome size and enzymatic restriction profiles [12]. A high level of variability is observed mainly within the 35 kb at the 3′ end and the 15 kb at the 5′ end of the genome (170-190 kb) [13], [12], [14]. These two regions contain the multigene families (MGF), which vary in number between isolates and enable virus variability by homologous gene recombination. Moreover, variability is also generated by a change in the number of amino-acid repeats in 14 proteins, including the envelope protein p54 encoded by the E183L gene [15]. More recently, gene sequencing and analysis were introduced to increase differentiation between ASFV isolates collected worldwide. The first group [16] used phylogenetic reconstructions based on the partial sequence of the B646L gene coding for the major capsid protein (MCP) VP72. Their trees showed a very close relationship between West African, European, and South American isolates, all clustered in genotype I. Despite more than 50 years of circulation on three continents, the limited accumulation of genetic changes has made it impossible to discriminate isolates within genotype I.
In contrast, eastern and southern African isolates are more diverse and segregate into 21 additional genotypes [16], [17], [18]. This could be explained by the fact that these viruses are propagated within a sylvatic cycle, in contrast to viruses of genotype 1 that mainly replicate in domestic pigs, although they were secondarily detected in European soft ticks O. erraticus and wild boars in Spain and Portugal. This supports the assumption that the virus diversity may be generated during the sylvatic cycle of the virus [19]. Other genes or genome sequences have been used successfully to discriminate ASFV isolates collected at a regional level. For instance, the B602L gene from the central variable region of the genome (CVR, coding J9L protein), the CP204L gene (coding the phospho-protein P32), and the E183L gene (envelope protein p54) have been used to further split the local isolates [20], [21], [5].
The aim of this study was to reassess the phylogenetic reconstructions and nomenclature of ASFV by including recent sequences and to explore the evolution of the virus based on a comprehensive analysis of the available sequence data sets. Accordingly, three genes were targeted, all of them being the most sequenced and uploaded in public databases. The B646L, E183L, and CP204L genes belong to the most conserved central part of the genome and encode the structural virus proteins VP72 (capsid), p54 (membrane protein), and p32 (membrane protein), respectively. They are also known to generate antibodies in pig [22]. The origin and the evolution of the virus were inferred from these three genes.

Phylogenetic Inference
Sequence analysis. Sequences were aligned by ClustalW with default parameters and then scrutinized and edited using MEGA version 5 software [23]. From the multiple sequence alignments, an index of substitution saturation was calculated using DAMBE software [24] to estimate how much phylogenetic information the sequences retained. DNA polymorphism was also analyzed. The average nucleotide diversity per site between two sequences (π) and the number of segregating sites (i.e. the number of sites where one or several substitutions occurred) were obtained with DnaSP version 5 software [25]. In the segregating sites, the ratio of transitions to transversions was assessed. The average number of nucleotide differences (k) between two sequences was also determined. All this information was used to check the quality of the sequences.
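The three polymorphism statistics named above (segregating sites, k, and π) can be illustrated with a minimal Python sketch. It is not the DnaSP implementation: the function name `polymorphism_stats`, the toy alignment, and the choice to drop every column containing a gap or ambiguous base are assumptions made for illustration only.

```python
from itertools import combinations

def polymorphism_stats(seqs):
    """Return (S, k, pi) for a list of aligned sequences of equal length.

    S  = number of segregating sites
    k  = average number of nucleotide differences between two sequences
    pi = k divided by the number of sites compared
    Assumption: columns with a gap or ambiguous base in any sequence are
    excluded from every calculation (complete deletion).
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    seqs = [s.upper() for s in seqs]
    valid = set("ACGT")
    cols = [i for i in range(length) if all(s[i] in valid for s in seqs)]
    if not cols:
        raise ValueError("no gap-free column to analyse")
    # A column is segregating when more than one base is observed in it.
    S = sum(1 for i in cols if len({s[i] for s in seqs}) > 1)
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(
        sum(1 for i in cols if a[i] != b[i]) for a, b in pairs
    )
    k = total_diffs / len(pairs)
    pi = k / len(cols)
    return S, k, pi

# Hypothetical toy alignment, not taken from the study's data sets.
toy = ["ATGCATGC", "ATGCATGA", "ATGAATGC", "ATGC-TGC"]
S, k, pi = polymorphism_stats(toy)
print(S, k, round(pi, 4))
```

In this toy case the gapped column is ignored, so seven columns are compared across the four sequences.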
Phylogenetic reconstruction. Maximum likelihood reconstructions [32], which generate the trees that best fit the evolution of a set of sequences under a probabilistic model of evolution, were done using TREEFINDER version March 2011 software [33]. The evolution model was selected according to the Akaike information criterion (AIC) [34], the corrected AIC (AICc) [35], and the Bayesian information criterion (BIC) [36], with the number of gamma rate categories fixed at 5. The consensus model given by the three information criteria or, alternatively, the simplest model was selected for the reconstruction. Thus, the B646L tree was constructed under the HKY+G5 model [37], [38]. For E183L, the HKY+G5 and HKY models were selected, and for CP204L, the HKY+G5 and TN+G5 models. The most complex model, GTR [39], was also systematically included and compared with the others. All the reconstructions were done on 1,000 replicates and bootstraps were approximated using the Expected-Likelihood Weights defined by Strimmer and Rambaut (2002) [40] applied on local rearrangements (LR-ELW) as implemented in TREEFINDER.
Bayesian phylogenetic inference was performed using Markov chain Monte Carlo (MCMC) sampling implemented in MrBayes version 3.1 software [41], [42]. According to the best-fit models proposed by TREEFINDER, MrBayes was set with HKY+G5 for B646L, HKY and HKY+G5 for E183L, and HKY+G5 for CP204L. The GTR model was also used for each gene. MCMC was run for a maximum of 10 million trees, or stopped earlier when the run reached stationarity, as measured by a standard deviation of split frequencies either falling below 0.01 or fluctuating randomly above 0.01 for at least 500,000 generated trees. Consensus trees were generated after discarding the first 25% of the MCMC samples as the burn-in phase.
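The selection rule described above (take the model on which AIC, AICc, and BIC agree, otherwise the simplest one) can be sketched as follows. This is an illustration, not TREEFINDER's procedure: the function names, the log-likelihood values, the parameter counts, and the use of the number of alignment sites as the sample size are all assumptions.

```python
import math

def information_criteria(log_lik, n_params, n_sites):
    """Standard AIC, AICc and BIC for a fitted model (lower is better)."""
    aic = 2 * n_params - 2 * log_lik
    aicc = aic + (2 * n_params * (n_params + 1)) / (n_sites - n_params - 1)
    bic = n_params * math.log(n_sites) - 2 * log_lik
    return {"AIC": aic, "AICc": aicc, "BIC": bic}

def consensus_model(candidates, n_sites):
    """candidates maps model name -> (log-likelihood, free parameters).

    Returns the model preferred by all three criteria; if they disagree,
    falls back to the model with the fewest free parameters.
    """
    scores = {
        name: information_criteria(ll, k, n_sites)
        for name, (ll, k) in candidates.items()
    }
    winners = {
        min(scores, key=lambda m: scores[m][crit])
        for crit in ("AIC", "AICc", "BIC")
    }
    if len(winners) == 1:
        return winners.pop()
    return min(candidates, key=lambda m: candidates[m][1])

# Hypothetical fits; the numbers are invented for illustration only.
fits = {"HKY": (-2100.0, 4), "HKY+G": (-2050.0, 5), "GTR+G": (-2048.5, 9)}
print(consensus_model(fits, n_sites=400))
```

With these invented values all three criteria agree; the fallback branch only matters when, for example, AIC favours a richer model than BIC.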
Tree congruence with data sets was tested by submitting them to the statistical test ELW [40] implemented in TREEFINDER. The tree selected for each gene was the one with the highest ELW score.
Since ASFV is the only member of the Asfarviridae family, 37 outgroup viruses for tree rooting were selected in the closest related DNA virus families, the NCLDVs [43], [44], [45]. Because of the high level of nucleotide divergence, multiple sequence alignments were done on the complete amino-acid sequences of the major capsid protein of both outgroup viruses and ASFV isolates (equivalent to B646L protein) using Mega5 software (see File S1). Tree reconstructions were performed on 1,000 replicates using maximum likelihood method set with WAG+G+I+F and WAG+G+I models using ''all sites'' and ''complete deletion'' options, respectively. The topology of the resulting rooted tree was subsequently applied for placing roots on the B646L, E183L, and CP204L trees.
Analysis of selection pressure. Codons under positive selection pressure in DNA coding sequences may evolve faster than the natural evolutionary rate of the virus genome. To avoid bias in the molecular clock analysis, the selection pressure acting on the targeted genes was assessed. The ratio of nonsynonymous (dN) to synonymous (dS) substitutions per site (dN/dS ratio) was calculated and the codons under positive selection pressure were identified using the Codeml software implemented in the PAML 4 package.
ASFV genotyping. Isolate genotyping was assessed by comparing the genetic distance between all B646L sequences. Average intra- and inter-branch distances were globally compared to determine the strength of cluster segregation. Additionally, a haplotype network of the isolates was constructed using TCS1.21 software to identify relationships between isolates potentially poorly represented by conventional phylogenetic tree reconstruction. Lastly, specific nucleotide signatures of the different ASFV clusters were searched using multiple sequence alignments containing only the 67 unique B646L sequences. The three approaches were finally combined to draw conclusions about ASFV genotyping.
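The comparison of average intra- and inter-cluster distances can be sketched in Python as below. The sketch uses the uncorrected p-distance as a stand-in; the distance measure actually used in the study is not specified here, and the function names, cluster labels, and sequences are hypothetical.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites, skipping gaps and ambiguous bases."""
    valid = set("ACGT")
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in valid and y in valid:
            compared += 1
            diffs += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared

def mean_intra_inter(clusters):
    """clusters maps a label to its aligned sequences.

    Returns (mean within-cluster distance, mean between-cluster distance).
    """
    labelled = [(lab, s) for lab, seqs in clusters.items() for s in seqs]
    intra, inter = [], []
    for (la, a), (lb, b) in combinations(labelled, 2):
        (intra if la == lb else inter).append(p_distance(a, b))
    if not intra or not inter:
        raise ValueError("need at least one intra and one inter pair")
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Hypothetical clusters, not real B646L sequences.
toy_clusters = {
    "A": ["ACGTACGTAC", "ACGTACGTAT"],
    "B": ["TCGAACGTAC", "TCGAACGTAC"],
}
intra_mean, inter_mean = mean_intra_inter(toy_clusters)
print(round(intra_mean, 3), round(inter_mean, 3))
```

A large gap between the two means is what indicates strong cluster segregation.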
Molecular dating. Two methods were used in parallel and compared to determine the evolutionary rate and the time to the most recent common ancestor (TMRCA) of circulating ASFV isolates. The first was based on the maximum likelihood method Baseml implemented in the PAML 4 package [46] and the second on the Bayesian MCMC implemented in the BEAST package version 1.6.2 [47]. Codons under positive selection (dN/dS > 1) and recombinant sequences were removed from the multiple sequence alignments to avoid bias in the substitution rate determination and consequently in the TMRCA estimation. The best-fit tree generated in the phylogenetic reconstructions was used as input for Baseml, with HKY+G5 as the evolution model for the B646L, CP204L, and E183L genes. Strict and relaxed molecular clock hypotheses [48] were used to generate dated trees for all genes. These two trees were individually compared with the tree generated without a clock constraint to accept or reject the molecular clock hypothesis. A likelihood ratio test (LRT) and a χ² comparison were performed to support this analysis. For the relaxed molecular clock, branches delineating the different genotypes were individually relaxed.
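For reference, the likelihood ratio test mentioned above takes the following general form. The notation is introduced here for illustration and is not the authors': $\ln L_{\text{clock}}$ and $\ln L_{\text{free}}$ are the maximized log-likelihoods with and without the clock constraint, and $p_{\text{clock}}$ and $p_{\text{free}}$ are the corresponding numbers of free parameters.

```latex
\Lambda = 2\left(\ln L_{\text{free}} - \ln L_{\text{clock}}\right),
\qquad
\Lambda \;\sim\; \chi^2_{\nu}
\quad\text{with}\quad
\nu = p_{\text{free}} - p_{\text{clock}} .
```

The clock hypothesis is rejected when $\Lambda$ exceeds the critical $\chi^2_{\nu}$ value at the chosen significance level. In the textbook case of $n$ sequences sampled at the same time, an unrooted tree has $2n-3$ branch lengths and a strict-clock tree has $n-1$ node ages, giving $\nu = n-2$; with dated tips or partially relaxed branches the parameter counts differ, so $\nu$ should be taken from the models actually compared.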
All analyses performed with the BEAST package were done under an uncorrelated lognormal relaxed clock model. Considering that at least 20% of our sequences were from isolates persisting in wildlife, a constant population size prior was selected. The initial value and the range of substitution rates were estimated from preliminary analyses and entered into the model of evolution. For each gene, two independent runs of 100 million steps were performed, with one tree sampled every 10,000 steps. MCMC samples were examined using Tracer version 1.4 [49]; the first 25% of samples in the chain were discarded as the burn-in phase. The consensus tree was generated as the maximum clade credibility (MCC) tree using TreeAnnotator version 1.4.7 [47]. Only posterior probabilities higher than 0.90 are indicated.
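The sampling arithmetic implied by these settings can be checked with a short sketch. The function name is hypothetical, and the count ignores whether the logger also records the initial state, which would add one sample.

```python
def retained_samples(chain_length, sample_every, burnin_fraction):
    """Return (samples logged, samples discarded as burn-in, samples kept)."""
    n_samples = chain_length // sample_every
    n_burnin = int(n_samples * burnin_fraction)
    return n_samples, n_burnin, n_samples - n_burnin

# Settings stated in the text: 100 million steps, one sample every
# 10,000 steps, first 25% discarded as burn-in.
logged, burnin, kept = retained_samples(100_000_000, 10_000, 0.25)
print(logged, burnin, kept)
```

Under these settings each run logs about 10,000 trees, of which about 7,500 remain after burn-in.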
Tree visualization. All trees were represented and edited in Fig

Comprehensive Phylogenetic Inference of ASFV Depicts 4 Major Lineages
Before phylogenetic inference, data sets and multiple sequence alignments were thoroughly examined to eliminate misalignments and ensure correct framing of coding sequences. All gaps were considered as missing information to avoid artificial nucleotide divergence. None of the methods used in the RDP3 package identified recombination events in the B646L and CP204L sequences. In contrast, several recombination events were detected among the E183L sequences. A total of 17 isolates were subsequently removed from the E183L multiple sequence alignments: 16 were Italian isolates [50] and one was a South African isolate (RSA/85/1). The recombination events were all identical for the Italian isolates (Figure 1). There were no saturated codons in our alignments (DAMBE, p value << 0.03), indicating that the genetic information in the data sets was suitable for phylogenetic studies.
To check the nucleotide composition of the alignments, statistical tests were performed using DnaSP software. The tests gave the number of nucleotide substitutions, the average diversity per site between two sequences (π), and the average nucleotide difference between two sequences (k). The diversity of B646L and CP204L was approximately half that of E183L (Table 1). In addition, E183L showed a clear bias towards non-synonymous mutations. Based on the observed nucleotide substitutions, the minimum and the maximum evolutionary rates were also calculated from each multiple alignment (Table 1). We determined the dN/dS of each gene (Table 1) and the amino acids under positive selection in the alignments. This led to removing 6 nt (2 aa: His4 and Thr28) from the B646L alignments, 9 nt (3 aa: Glu31, Pro123, and Leu176) from the CP204L alignments, and 27 nt (9 aa: Tyr10, Thr23, Asp100, Thr104, Ser122, Pro140, Val142, Glu143, and Ser149) from the E183L alignments for subsequent molecular dating analyses.
The outgroup-rooted trees constructed from the multiple sequence alignments of the major capsid protein amino-acid sequences of 30 ASFVs and 37 outgroup viruses from the NCLDV family showed that the common ancestor of all these viruses connects to the ASFV group within the eastern African isolates, more precisely between genotypes VIII, IX, and X on the one hand and genotype I and the other genotypes on the other hand (Figure 2). Accordingly, the root on all subsequent trees was placed in this position. This reconstruction also shows that the Asfarviridae family is rather divergent from the other NCLDV families.
Phylogenetic trees constructed with B646L sequences showed four major lineages (L) (Figure 3): L1 includes the previously described genotypes I, II, XVII, and XVIII, and L2 includes genotypes III, IV, V, VI, VII, XIX, XX, XXI, and XXII and an ungenotyped isolate (Cro3.5) [51]. L3 includes genotypes VIII, XI, XII, XIII, XV, and XVI and one isolate, TAN/08/MAZIMBU, previously included within genotype XV [52]. L4 gathers genotypes IX and X. Interestingly, the NYA/1/2 isolate ascribed to genotype XIV is the only isolate that does not segregate within one of the four lineages. However, the bootstrap value of its branch is <70%, which makes it difficult to draw any conclusion about this isolate. Further clustering of the isolates within these four lineages is difficult because of the presence of long branches and multifurcations for some isolate groups or sub-lineages. The TCS network analysis showed that conventional phylogenetic reconstruction based on bifurcations may fail to explain the complex relationships between some isolates (Figure 4). The TCS network confirms the existence of the four lineages, which include the same isolates as in the bifurcated reconstructions. However, the TCS network seems to better explain the relationships of isolates within a given genotype (e.g. genotype I or X) or between distinct genotypes (e.g. between genotypes III, IV, XIX, XX, and XXI, or between genotypes IX and X). In these cases the pattern of isolate relationships is not strictly bifurcative. Three paths exist between genotype XIX and genotype XX: through genotypes III or IV and/or XXI, which represent internal nodes of the tree, and two paths exist between genotypes IX and X. Within genotype X, several isolates are internal nodes of the tree, meaning that an isolate can have more than one ancestor, which is inconsistent with bifurcative relationships between isolates.
In attempts to refine the clustering of ASFV isolates, the multiple sequence alignments containing 67 unique B646L sequences were searched for specific molecular signatures (Figure 5). Lineage 1 is characterized by 2 nt, and L2, L3, and L4 by 4, 6, and 12 nt, respectively. Genotype XIV, which is not included in any of the four lineages, is characterized by 8 nt (G88, G93, G162, T214, C240, T258, T333, and T348). However, this is the only virus generating this branch, which in addition is not supported by a high bootstrap value (<70%). Therefore, it cannot yet be considered as a fifth lineage. Lineages can be subsequently sub-divided into sub-lineages: 4 for lineage 1, 3 for lineage 2, 7 for lineage 3, and 2 for lineage 4 (Figure 5). Further sub-divisions can be drawn from the molecular signatures (Figure 5) and all are supported by the evolutionary distance matrix, except for some sub-lineages within L1-1, L1-2, L1-3, L2-2, L2-3, and L4-2-2 (Table 2). The average evolutionary distances inside and between all sub-lineages were 0.0023 and 0.055, respectively.
Figure 1. Localization of recombination events detected in E183L sequence alignment. 16 Italian isolates and 1 South African isolate were detected to be recombinant. Italian isolates are linked and because recombination events take place in the same region of the sequence, these isolates have probably emerged from a common ancestor. doi:10.1371/journal.pone.0069662.g001
Table 2. Estimates of evolutionary distances between ASF lineages and sub-lineages. The average intra-sub-lineage diversity was 0.0023, whereas the average inter-sub-lineage diversity was 0.055. In the matrix, diversities lower than 5 × (intra-sub-lineage diversity) = 0.0115 are shown in grey boxes: all sub-lineages (Lx-x) differ from the others by a higher diversity, except for some sub-lineages within L1.
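The grey-box rule stated for Table 2 (a pair of sub-lineages is flagged when its distance is lower than five times the average intra-sub-lineage diversity) can be written as a small sketch. The function name and the pairwise distances below are hypothetical; only the 0.0023 average and the factor of 5 come from the text.

```python
def flag_weak_pairs(distances, intra_mean, factor=5):
    """Return (threshold, pairs whose distance falls below the threshold).

    distances maps a pair of sub-lineage labels to their average distance.
    """
    threshold = factor * intra_mean
    weak = sorted(pair for pair, d in distances.items() if d < threshold)
    return threshold, weak

# Invented distances for illustration; they are not values from Table 2.
example = {
    ("L1-1", "L1-2"): 0.008,
    ("L1-1", "L2-1"): 0.041,
    ("L3-1", "L4-1"): 0.060,
}
threshold, weak = flag_weak_pairs(example, intra_mean=0.0023)
print(round(threshold, 4), weak)
```

Pairs returned in `weak` correspond to the sub-lineage splits that the distance matrix alone does not support.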

Molecular Dating Leads to a Most Recent Common Ancestor of about 300 Years
The E183L gene was removed from the molecular dating analyses because of the detection of several recombination events and a non-synonymous bias in the gene alignment, both due to strong positive selection exerted by the immune system on this gene. The strict molecular clock hypothesis, meaning an equal substitution rate along every branch of the tree, was rejected for the other two genes by the maximum likelihood analysis performed by Baseml in the PAML software suite. In PAML, the branches were individually relaxed in the tree submitted to the analysis. Several trees with different numbers of relaxed branches were tested. The resulting TMRCAs for the B646L and CP204L genes were highly variable: from 1597 BC to 700 AD, or even undetermined (because of a tree likelihood value of zero at the beginning of the analysis). This high level of heterogeneity in the TMRCA obtained with the maximum likelihood method led us to select the Bayesian approach in the BEAST package. The Bayesian MCMC inference of the two data sets performed with the BEAST package showed a satisfactory convergence in the posterior statistic estimates of the substitution rate. Preliminary analyses were used to set the initial value of μ, the substitution rate parameter in substitutions/site/year (data not shown). Accordingly, the prior distribution of this parameter was set from 0.1×μ to 1×μ. Thus, calibrations of molecular clocks were set at 5.3×10⁻³ substitution/site/year [5.3×10⁻⁴–1.4×10⁻¹] for the B646L gene and 5.36×10⁻³ [5.36×10⁻⁴–1.99×10⁻¹] for the CP204L gene. With these priors, the mean estimates of substitution rates for each gene were finally calculated by BEAST and ranged from 6.6×10⁻⁴ (CP204L) to 6.9×10⁻⁴ (B646L) subst/site/year (Table 3). These results are robust in terms of clock model, rate distribution, and population size parameters. The dated trees generated four lineages as previously described (Figure 6) and, again, the same isolates were found within these lineages. The four lineages were organized differently for the two genes: for the B646L gene, L1 and L2 were on the one hand and L3 and L4 on the other hand; in contrast, the CP204L gene tree rendered different connections, with L1, L2, and L4 together and L3 on the other hand. In both cases, the oldest lineage (TMRCA = 111 years) was L4, which gathers isolates from eastern Africa, the presumed birthplace of ASFV. It was followed by L1 (104 years), L2 (74 years), and L3 (47 years).
Figure 2. Rooted tree constructed from amino-acid multiple alignments of the major capsid protein of ASFV isolates and four outgrouped viruses. The tree was constructed under a WAG+G+I+F model and maximum likelihood method with 1,000 bootstrap resampling. Numbers indicate the statistical value (Expected-Likelihood Weight) of internal nodes, given in percentages (only numbers above 70% are indicated). The outgroup connects ASFV group by the branch from genotypes VIII (MwLil20/1), IX (UgH03), and X (Kenya1950, kn66 and Uganda) to other genotypes. doi:10.1371/journal.pone.0069662.g002

Discussion
Because the localization of the major capsid protein VP72 in the virus core prevents exposure to circulating neutralizing antibodies, the corresponding B646L gene is not expected to be subjected to immune system pressure. Accordingly, only two amino-acid positions were detected as being under positive selection, suggesting no real impact on the evolutionary force. Therefore, the rate of substitutions of VP72 probably bears the information needed to estimate natural virus evolution. The VP72 homologues of closely related virus families have already been used in evolutionary studies [53] and, for a decade, in ASFV phylogenetic reconstructions. In contrast, p54 is an envelope protein and the pressure of the immune system on E183L evolution is revealed by nine amino-acid positions under positive selection, a strong non-synonymous bias, and recombination events within the gene sequence. p32 is also an envelope protein but is involved in translation of viral genes through its interactions with the cellular hnRNP protein [54]. In this context, mutations may be detrimental and, thus, the gene may be subjected to purifying selection, as corroborated by the detection of only three amino-acid positions under positive selection.
The evolution of ASFV mapped through partial genomic sequences and phylogenetic reconstructions shows a certain degree of complexity that may not be well represented by bifurcative methods. However, both bifurcative and network analyses in this study provided a clear clustering into four major lineages (L1 to L4), whereas only three have been described so far [18]. Within these lineages, molecular signatures of the twenty-two already described genotypes were established and two new sub-lineages can be proposed, namely the Cro3.5 isolate and TAN/08/MAZIMBU, previously ascribed to genotype XV. Molecular signatures do not rely on the same number of substitutions and do not have an equal weight. For instance, L1 is characterized by 2 specific nucleic acid positions and 4 synonymous substitutions while L4 is defined by 12 sites and 13 nucleotide substitutions, of which 3 are non-synonymous. Within the L1 lineage, genotype I, which is the most represented in terms of sequences (Europe, West Africa, Caribbean, and South America), is characterized by only one synonymous substitution (A216). This mutation, however, leads to an increase in ASFV codon preference for alanine (GCG to GCA) (http://www.kazusa.or.jp/codon/), which has surely helped to fix the substitution in the lineage for almost 60 years on three continents. Besides the molecular signature, the distance matrix also supports our proposal for a new ASFV classification, which includes the previous genotype subdivision and additional sub-clustering.
ASFV shows a high evolutionary rate relative to that of other DNA viruses [55]. Consequently, this high substitution rate led to very recent TMRCAs: the most recent common ancestor of ASFV strains currently circulating emerged around three centuries ago, in about 1700. It is commonly agreed that ASFV is native to East Africa, as the disease was first described in Kenya in 1921 after a first outbreak in 1903. Then, over decades, ASFV showed a great ability to spread worldwide following major trade routes. In the wild, the virus is thought to be originally a virus of ticks [11], as it infects argasid ticks of the Ornithodoros genus. Ornithodoros, which infest warthogs' burrows, are endophilic ticks, meaning that they need stable temperature and humidity. They are also photophobic, so they do not spread out over long distances. ASFV is transmitted horizontally and vertically between ticks [56], [57] and between ticks and juvenile wild swine that stay in and close to their burrows. Under such circumstances, the virus is not expected to spread much and its genetic drift over long periods may have resulted in isolated spots of diversity maintained by the sylvatic cycle with only few entries of new strains. In contrast, the domestic pig cycle is short, with a dead-end disease essentially transmitted by contact with pigs or pig meat and rarely by tick bites. Accordingly, the phylogenetic trees constructed in this study showed higher diversity within lineages of eastern and southern African isolates subjected to a sylvatic cycle than in lineages of domestic pigs from other regions.
Figure 4. Haplotype network constructed with TCS software. The network shows the same four main lineages that were observed in the bifurcative phylogenetic tree constructed in maximum likelihood under the HKY+G model, but clearly demonstrates that relationships between some ASFV isolates are too complex to be resolved by only bifurcations. doi:10.1371/journal.pone.0069662.g004
New variants are not easy to characterize because of the lack of sequence data from their parent lineage. For example, TAN/08/Mazimbu isolate collected in Tanzania in 2008 and originally placed in genotype XV [52] constitutes in this study a sub-lineage of L3. Thus, it should not be considered as a re-emergence of the TAN/01/1 isolate collected during an outbreak in Tanzania in 2001. Sixteen Italian isolates showed recombination events in the E183L gene and were subsequently removed from the corresponding reconstruction. This does not change the affiliation of these isolates to L1, as demonstrated by B646L and CP204L reconstructions (not shown). However, since all these isolates are linked together and show the same recombination events, assuming they all have emerged from a common recombined ancestor, the possibility that they will form a new sub-lineage within L1 has to be considered.
Two different genes and two methods were used to consolidate TMRCA estimation. The maximum likelihood method using the PAML package showed that a strict molecular clock could not be validated for our set of genes. However, it did not provide consistent results when using a relaxed clock, with TMRCAs from −12000 to 1500. In contrast, the Bayesian approach generated consistent results, with the B646L and CP204L analyses dating the TMRCA at around 1700 AD and a substitution rate estimated at around 6.7×10⁻⁴ subst/site/year. As illustrated by the E183L gene analysis, the effect of the immune system on sequence variability may influence the sequence evolution of some ASFV genes, which may consequently yield a biased TMRCA (data not shown for the E183L gene in this paper). Therefore, the natural evolution of the virus may be well represented by the B646L and CP204L genes, in which neither recombination events, nor non-synonymous bias, nor too many codons under positive selection were detected. The TMRCA scale going back to 1700 AD for all ASFV isolates can be considered with confidence since, within this scale, the TMRCAs of lineages L1-1 and L1-2 were 1943/1955 (for the B646L/CP204L genes) and 1990 (for both genes), respectively. L1-1 is supposed to have emerged in the late 1950s [16] and L1-2 includes mainly isolates from Madagascar that were first introduced in 1998 [58]. The substitution rates determined in this study were much higher than expected relative to other large dsDNA viruses such as the gammaherpesviruses of vertebrates (10⁻⁹ subs/site/year) or even small dsDNA viruses such as the John Cunningham polyomavirus (10⁻⁷ subs/site/year) [55]. With a substitution rate between 10⁻⁴ and 10⁻⁵, ASFV approaches RNA viruses, which usually have 10⁻² to 10⁻⁵ subs/site/year [59].
Like many other large dsDNA viruses [60], ASFV may have coevolved with its host. This means a long and ancient history of the virus in the wild. A high substitution rate combined with recent TMRCAs is not consistent with ancient co-evolution of viruses and their hosts, which in contrast should lead to a low rate of substitution [61]. However, for a virus that replicates at a high level in its host, a low rate of subst/site/replication can still lead to an increased accumulation of diversity, which in turn generates high rates of subst/site/year [62]. This has been described for highly contagious viruses that induce acute forms of infection and show a higher observed rate of subst/site/year [63]. In contrast, an asymptomatic infection of the host may not allow an exponential replication rate. ASFV presents these two characteristics, being asymptomatic in natural African wild swine and soft ticks and highly contagious and lethal in domestic pigs. Consequently, a stochastic event may have occurred around 300 years ago that would explain the emergence of an ancestor common to all ASFV isolated so far in domestic and wild pigs. Our assumption is based on the introduction of domestic pigs in Africa. Domestic pigs have Eurasian and North African ancestral wild boar origins [64]. Even though Plug (2001) [65] claimed pigs were introduced in South Africa between the 3rd and 7th centuries, Swart (2010) [66] believes domestic pigs were not present in eastern and southern African livestock because of the nomadic lifestyle of pastoralists at this time. Domestic pigs may have been brought first by the Chinese around 600 years ago [67], then by the Portuguese 300 to 400 years ago [68], both during their exploration and conquest period for trade opportunities.
Table 3. Summary results of all tests done in BEAST for molecular clocking, models, evolution rates, and the TMRCA obtained.
The assumption of pig introduction from Europe and the Far East was confirmed by phylogenetic analysis revealing contributions of both origins in the genetic pattern of local African pigs [69]. Following the circumnavigation of Africa by European nations during the 15th-17th centuries, pig breed types were introduced during the 16th and 17th centuries [66], mainly by the Portuguese to the East African coast via Goa. Pig breeding then spread slowly northward from Mozambique [68]. The Portuguese did not colonize Kenya for settlement but as a step to India and definitively left the country in 1720 after being defeated by the Arabs in 1698. Despite Arab colonization and the pig-eating taboo, domestic pigs have been eaten since the 16th century by ethnic groups like the Waata in southern Kenya, who were called Walyankuru: ''those who eat pig'' [70]. This may have enabled the virus to spread silently among sensitive pig species. Kenya was then colonized by the British. At the end of the 19th century, the extensive pig industry in the native region of ASFV started after a massive loss of bovine cattle due to a rinderpest outbreak. Pigs were massively imported for breeding by colonizers from Seychelles in 1904 and from England in 1905. Pig farming was free ranging at this time and the first outbreak of ASF was reported in 1907. Trade routes and virus resistance in the environment then enabled further spreading of ASFV.

Supporting Information
Table S1 List of ASFV and NCLDVs isolates and corresponding genes used in this study.