Mitochondrial Phylogenomics of Modern and Ancient Equids

The genus Equus is richly represented in the fossil record, yet our understanding of taxonomic relationships within this genus remains limited. To estimate the phylogenetic relationships among modern horses, zebras, asses and donkeys, we generated the first data set including complete mitochondrial sequences from all seven extant lineages within the genus Equus. Bayesian and Maximum Likelihood phylogenetic inference confirms that zebras are monophyletic within the genus, and the Plains and Grevy’s zebras form a well-supported monophyletic group. Using ancient DNA techniques, we further characterize the complete mitochondrial genomes of three extinct equid lineages (the New World stilt-legged horses, NWSLH; the subgenus Sussemionus; and the Quagga, Equus quagga quagga). Comparisons with extant taxa confirm the NWSLH as being part of the caballines, and the Quagga and Plains zebras as being conspecific. However, the evolutionary relationships among the non-caballine lineages, including the now-extinct subgenus Sussemionus, remain unresolved, most likely due to extremely rapid radiation within this group. The closest living outgroups (rhinos and tapirs) were found to be too phylogenetically distant to calibrate reliable molecular clocks. Additional mitochondrial genome sequence data, including radiocarbon dated ancient equids, will be required before revisiting the exact timing of the lineage radiation leading up to modern equids, which for now were found to have possibly shared a common ancestor as far as up to 4 Million years ago (Mya).


Introduction
The family Equidae, along with Rhinocerotidae and Tapiridae, is one of the three extant families of odd-toed ungulates (Perissodactyla, Mammalia). The Equidae are richly represented in the fossil record throughout the past 55 Million years (My), starting with dog-sized taxa from the North American Eocene [1]. Equid lineages then spread globally and experienced successive episodes of radiations and extinctions during the early Miocene, the late Miocene, and at the end of the Pleistocene [2], becoming adapted to a variety of environments with remarkable variations in body size [1].
While several dozen extinct equid genera have been described [2], all extant equid species are classified in the same genus, Equus. It includes the African wild ass, Equus africanus, which is the progenitor of the Domestic donkey, E. asinus [3], and three living species of zebra, all endemic to Africa. The zebras include the Plains zebra, also called Burchell's zebra, E. quagga (previously E. burchellii); Grevy's zebra, E. grevyi; and the Mountain zebra, E. zebra (with two subspecies, E. z. zebra of South Africa, and E. z. hartmannae of Namibia and Angola) [4]. The genus also includes the Asian wild asses (subgenus Hemionus), E. hemionus and E. kiang, with various recognized subspecies of E. hemionus (e.g E. h. kulan and E. h. onager). Together, the zebras, donkeys, and Asiatic asses make up the non-caballine lineages. The remaining extant lineages, the domesticated horses (Equus caballus) and wild populations sometimes referred to as E. ferus (Tarpan and/or the Przewalski horse E. przewalskii), are classified as caballine horses.
NWSLH refer to a group of equids that were endemic to North America. Despite having gracile limbs similar to the Asiatic wild asses, partial mitochondrial DNA sequences revealed no particular genetic affinity with hemiones, instead the NWSLH were nested within caballine horses [20]. Sussemiones (subgenus Sussemionus) consist of a lineage of Eurasian equids that survived until the Late Pleistocene, at least up to 46 thousand years ago (Kya) [21,22]. No particular genetic relationship with any of the living equid lineages was supported by partial mitochondrial sequences [22], and their evolutionary origin is still debated [21,22]. The Quagga, E. quagga quagga, a morphological variant of the Plains zebra found in South Africa, became extinct in the wild in the late 1800 s [23], and is now commonly accepted as a southern variant of the Plains zebra [4,24].
Inferring the evolutionary relationships among extinct and living equids is an active area of research. So far, analyses have used mainly short fragments of mitochondrial DNA, with highly controversial results suggesting that previously defined lineages should be collapsed into single taxa [20,22,[25][26][27]. In this study we aim to (1) resolve the phylogenetic relationships among all of the extant as well as three extinct lineages of equids, and (2) to investigate the evolutionary timing of lineage radiation within the genus Equus.

Overall Experimental Design
We isolated one partial and 17 complete mitogenomes representing all extant species within the genus Equus (14 modern samples), including three extinct lineages (four ancient samples) ( Table 1). For five out of the 14 modern samples, mitogenomes were generated by PCR amplification followed by a combination of Sanger and GS FLX sequencing. The DNA extracts from the remaining nine modern samples and three of the ancient samples were converted into sequencing libraries and shotgun-sequenced using the Illumina HiSeq2000 platform. The mitogenome of the remaining ancient sample (NWSLH sample MS272) was recovered from a genomic library using the MySelect in solution target enrichment kit (MYcroarray, USA) and sequenced on the Illumina HiSeq2000 platform.

Ethics Statement
Tissue samples were collected by Hans Siegismund [28], or provided by the Smithsonian Institution (Washington D.C.), the Musée des Confluences (Lyon), the Russian Academy of Sciences, the Government of Yukon, or zoological gardens following official agreements with the Natural History Museum of Denmark in the framework of the COSE certificate DK003 of the host institution, which covers CITES-protected species. Permission was obtained from all museums and institutions to access the collections and all samples were on loan for scientific purposes.

Modern Samples
For PCR amplified mitogenomes, we collected 30-50 mg of hair or dried alcohol-preserved tissue and extracted DNA using the DNeasy Blood and Tissue kit (Qiagen, USA) following the manufacturer's instructions except that 360 ml of Buffer ATL and 40 ml Proteinase K were used, and samples were left to lyse overnight at 55uC. We performed PCR amplification using three to six overlapping primer sets (Table S5; [16,17]) as in Vilstrup et al. [29] (Table S4). We pooled purified PCR products from each sample, except for a 1.5 kb fragment, in equimolar amounts and nebulized for 1 min at 2.1 bar to get approximately 600 bp fragments. We then built the fragmented DNA samples into tagged dA-tailed libraries using NEBNext Quick DNA Library Prep Master Mix set for 454 (New England BioLabs, ref: E6090) using a final adaptor concentration of 8 nM. We then pooled the tagged libraries in equimolar amounts and distributed them on one eighth of a plate on the GS FLX. We followed a similar procedure for PCR amplicons from primer sets Pr1 and Pr2, which span the hypervariable region. All tagged reads were sorted, trimmed and mapped to both a donkey and a horse reference genome (Accession Nb. NC001788 and NC001640, respectively) using GS Reference Mapper with default parameters (454 Life Sciences). We allowed for a maximum of one mismatch during index sorting and a minimum size of 80 bp was required during sequence assembly. Consensus sequences were aligned to a reference using Sequencher v4.8 (Genes Code Corporation, Ann Arbor, MI, USA), and the resulting mitogenomes aligned by eye using SE-AL v2.0a11 Carbon (A. Rambaut, Univ. of Oxford). Sanger sequencing of the short 1.5 kb fragment was performed at the Macrogen facility (Seoul, South Korea) with several primers (Table S4).
For shotgun-sequenced mitogenomes, we collected 2 ml blood samples and extracted DNA using the QIAamp DNA Blood Midi kit (Qiagen, USA) following the manufacturer's protocol except that blood digestion was performed with Proteinase K for 30 min at 70uC, and the final product was eluted into 150 ml EB buffer. We then sonicated the DNA extracts for seven cycles of 15 secs/ 90 secs (ON/OFF cycles) on mode High (H) using a Bioruptor XL (Diagenode, Belgium). We selected fragments of 200-300 bp using the EZNA Gel Extraction Kit (Omega Bio-Tek, ref: D2500) and converted each sample into an Illumina library using the NEBNext Quick DNA library prep Master mix set for 454 (New England BioLabs, ref: E6090). We amplified the libraries in a 50 ml volume reaction, with two parallel PCR set-ups per library using half (16 ml) of each DNA library. The final PCR reaction consisted of 5U Taq Gold (Invitrogen, Life Technologies), 4 mM MgCl 2 , 1 mg/ml BSA, 1 mM dNTPs, 1 mM of Primer inPE1.0, 20 nM of Primer inPE2.0, and 1 mM of an Multiplexing Index Primer containing a unique 6 nucleotide index tag (Illumina Inc.). PCR cycling conditions consisted of initial denaturation for 10 min at 92uC, followed by 10-15 cycles of 30 secs denaturation at 94uC, 30 secs annealing at 60uC, and 40 secs elongation at 72uC, and a final elongation step at 72uC for 7 min. We QIAquick (Qiagen, USA) purified each amplified library and then quantified them on an Agilent 2100 Bioanalyser. The libraries were then pooled in equimolar ratios, and sequenced in paired-end mode (26100 bp) on the Illumina HiSeq2000 platform at the Danish National Highthroughput Sequencing Centre.
We trimmed the Illumina reads using default settings in AdapterRemoval [30], except using a minimal read length of 25 bp, and aligned reads against both published mitogenomes and those generated here via PCR amplification using BWA [31], disabling the seed and relaxing the edit distance (option -n 0.03) as suggested by Schubert et al. [32]. We removed reads that mapped to multiple positions and with mapping quality scores ,25 using SAMtools [31]. We removed sequence duplicates using both read start and read end coordinates for collapsed paired-end reads, and we used only the read start coordinates for filtering potential sequence duplicates for uncollapsed read pairs as implemented in MarkDuplicates from the Picard package (http://picard. sourceforge.net). A final alignment was generated for each mitogenome and visually corrected for potential local misalignments. Finally, we called the sequence of each mitogenome using a consensus approach requiring a minimum base coverage of 2 and over 50% of sequence match among reads.

Ancient Samples and Museum Specimens
We extracted three ancient samples (JW328, ACAD2304, QH1) and two museum-preserved dried tissue specimens (CGG10086 and CGG10096) at the Centre for GeoGenetics in laboratory facilities dedicated to the analysis of fossil material and geographically separated from laboratories with amplified PCR products. We used three different extraction protocols depending on the tissue type, and all extractions were accompanied by appropriate controls.
We extracted the two bone samples JW328 (20 mg) and sample ACAD2304 (464 mg) following the silica-based protocol described in Orlando et al. [22]. We extracted the two tissue specimens (CGG10086 and CGG10096) as described in Gilbert et al [33], adding 2 mg/mL Proteinase K (New England Biolabs) to the extraction buffer, which we digested overnight at 37uC, followed by QIAquick purification and elution into 50 ml. We extracted ,50 mg of the hair sample (QH1) as in Gilbert et al. [33] except digestion occurred at room temperature (RT) over three days, after which undigested hair was removed and incubated at RT for an additional three days with more digestion buffer. We concentrated both digests to ,50 ml using 30 KDa centricons and purified the extracts using QIAquick columns, and eluted the concentrated DNA into 50 ml after a five-minute incubation in elution buffer at RT.
We built indexed Illumina libraries for samples JW328, CGG10086, CGG10096, and QH1 using the NEBNext Quick DNA Library Prep Master Mix set for 454 (New England BioLabs, ref: E6090) following the manufacturer's instructions, except that SPRI bead purification was replaced by Qiagen MinElute DNA purifications. The libraries were amplified in two steps, as described in Kampmann et al. [34]. We purified the resulting libraries through QIAquick columns, quantified them using an Agilent 2100 Bioanalyser, and pooled them in equimolar ratios for Illumina sequencing performed in single-read mode (75 bp) on an Illumina HiSeq2000 platform at the Danish National Highthroughput DNA Sequencing Centre.
We built two independent DNA libraries using 21 ml each of DNA extract for sample ACAD2304 using the NEBNext DNA library prep Master mix set for 454 (New England BioLabs, ref: E6070) and following Meyer and Kircher [35], except working with 25 ml reaction volumes and using a final concentration of 0.5 mM Illumina multiplex blunt-end adaptors. At the end of the final fill-in reaction, Bst Polymerase was inactivated following 20 min incubation at 80uC. We then amplified the two libraries for 12 cycles using PCR conditions as described above for the seven modern samples. We purified the PCR products through QIAquick columns and re-amplified them in four parallel reactions (5 ml each) for ten additional cycles. The four amplification products generated per library were QIAquick purified, quantified using an Agilent 2100 Bioanalyser and pooled in equimolar amounts before sequencing in single-read mode (100 bp) on a HiSeq2000 Illumina platform at the Danish National High-throughput DNA Sequencing Centre.
A second Quagga library was also enriched for mitochondrial fragments following the procedure described by Maricic et al. [36] and using 1.5 mg of library as input. Mitochondrial baits consisted of a mix of fragmented, amplified mitogenomes of a Grevy's zebra, a Somali wild ass, and a Kiang (SF, LF and Primer1 primer sets; Table S4). The captured library was purified through MinElute columns and 16 ml was amplified for 12 cycles in a 50 ml volume under the same PCR conditions as the seven modern samples using 0.2 mM postCap primer inPE1.0 and 0.2 mM post index primer. The captured library was sequenced in single-read mode (75 bp) on an Illumina HiSeq2000 platform at the Danish National High-throughput DNA Sequencing Centre.
We assembled complete mitogenomes following the procedure described for assembling Illumina reads from modern equid specimens, except that for sample JW328 we also attempted to map reads against previously reported partial mitochondrial sequences of hippidiforms (accession numbers: GQ324601, GQ324596 and AY152859) and the complete nucleotide sequence of another NWSLH generated in this study following library target-enrichment (see below). The authenticity of our ancient DNA data was then assessed using the mapDamage package [37] ( Figure S3).
The NWSLH sample MS272 was processed separately in a dedicated ancient DNA laboratory at the Pennsylvania State University, USA. We powdered ,500 mg of bone using a Mikrodismembrator (Braun) and extracted DNA using the column based silica extraction protocol as described in Rohland et al. [38]. DNA was eluted into 60 ml of 16TET (16TE including 0.05% Tween20). We prepared an indexed Illumina library using 15 ml of the ancient DNA extract according to the protocol described in Meyer and Kircher [35] with 40 ml reaction volumes. We amplified the library for 25 cycles using a unique indexing primer, and subsequently purified the library using the Agencourt AMPure XP PCR purification kit. We eluted the library in 30 ml of 16TET as described in Meyer and Kircher [35]. We then enriched the indexed library for the mitogenome using the MYselect (now called MYbaits) target enrichment kit from MYcroarray (USA) following the manufacturer's instructions. Custom bait molecules were designed based on the modern Equus caballus mitogenome sequence (horse genome version equCab2.0 http://genome.ucsc.edu/cgi-bin/ hgGateway?db = equCab2 provided by The Broad Institute, Cambridge, USA) at 2X tiling (where each base is covered by two unique bait molecules). To provide enough input DNA for the capture reaction, the library was amplified for another 30 cycles using Phusion Hot Start polymerase (New England BioLabs) and primers IS5_reamp.P5 and IS6_reamp.P7 [35]. The target capture itself was performed to the manufacturer's specifications. We then amplified the enriched libraries once more using the same Phusion Hot Start polymerase setup including primers IS5_reamp.P5 and IS6_reamp.P7. Following a purification using the Agencourt AMPure XP PCR purification kit, we eluted the final library in 30 ml of 16TET and visualized the results on a 2% agarose gel.
The post-capture library enriched for the NWSLH mitochondrial DNA was then sequenced as part of a larger pool containing 47 additional libraries at equimolar ratios on one lane of an Illumina HiSeq2000 platform. Raw reads were used for readmerging using Kircher's MergeReadsFastQ_cc.py script [39], resulting in 8.2 millions of reads that were mapped to the horse genome version equCab2.0 http://genome.ucsc.edu/cgi-bin/ hgGateway?db = equCab2 using BWA [31] and SAMtools [40]. The final consensus sequence was called using SAMtools [40] requiring a minimum coverage of 26.

Data Partitioning and Model Selection
We included 13 previously published mitogenomes in our analyses (Table S1), and used the online Sequence Manipulation Suite [41] to partition protein-coding genes into first, second, and third codon positions. Two additional partitions were generated: one comprising the two rRNA genes (16S and 12S) and another limited to the control region. A highly variable and repetitive section of the control region consisting of tandem repeats was disregarded (from position 16,125-16,368 on horse ref genome JN398398). Phylogenetic analyses were run independently for each gene as well as on the five partitions data set (15,262 bp), the four partitions data set (14,006 bp; excluding the control region), and on the whole unpartitioned data set. Analyses were run both with and without outgroups consisting of six species of rhinoceros and the lowland tapir (Table S1).
We used jModelTest v0.1.1 [42,43] and the corrected AIC to select the most appropriate evolutionary model for each gene and each partition. We used seven substitution schemes, base frequencies +F, rate variation +I and +G with eight categories, and ML optimized base tree for likelihood calculations (Table   S3a-b). Variable sites and nucleotide composition frequencies were retrieved from MEGA 5 [44].

Phylogenetic Analyses
Phylogenetic analyses were run using PhyML online [45,46]. The topology and branch lengths were optimized, and support nodes were estimated using both 500 bootstrap pseudo-replicates and an approximate Likelihood Ratio Test (SH-like) [47]. Topological tests (Approximate Unbiased, AU; and Kishino Hasegawa, KH) were performed using CONSEL v0.3 [48] and site likelihood estimates recovered from PhyML with the -print_site_lnl option. A total of 12 topologies were tested and sorted according to the p-values recovered from AU tests.
RaxML GUI 1.0 [49] was used to compute maximum likelihood analyses on the four and five partitioned data sets using 500 bootstraps. In cases where the best model chosen by jModelTest was not available in RaxML or PhyML the next best model implemented was chosen instead (Table S3a-b). Bayesian phylogenetic analyses were performed using MrBayes v3.1.2 [50]. Two runs of 50 million generations were used for each dataset, and the first 25% of samples from each run was excluded as burnin.  Analyses run with and without outgroups (see Table S9a-b). All dates are in years with 95% confidence interval given in parentheses. N/a = not applicable. Sum = Sumatran rhino; Wool = woolly rhino; Zeb = zebras; Don = Donkey; Donkey = E. africanus and E. asinus; Asses = E. hemionus and E. kiang; Ass = Asses. Node letters in parentheses as in Figure 1. doi:10.1371/journal.pone.0055950.t002

Molecular Dating
To place the estimated phylogeny on a calendar time-scale, we ran several additional phylogenetic analyses using the software package BEAST v.1.5.4 [51]. For each analysis, we assumed the relaxed uncorrelated lognormal molecular clock, first with Birth and Deaths, and then with Yule speciation. We ran separate analyses for the four-and five-partition data sets both with and without a non-equid outgroup (see above). MCMC chains were run for 250-500 million generations with 25% burn-in, and with convergence to stationarity determined by visual inspection using Tracer v1.5 [52].
In the analyses including the non-equid outgroup, the molecular clock was calibrated using a normal prior of 5565 Mya [12,20,53] on the root of the Perissodactyl tree. In addition, we ran separate analyses using a second normal prior calibration of 0.760.1 Mya for the emergence of the Plains zebra [22] and based on the fossil record of E. mauritanieus [54,55]. In analyses excluding outgroups, only the Plains zebra calibration was used. In the absence of precise carbon dating information, ages for both the NWSLH and the Sussemione were sampled from a uniform prior spanning 12 to 100 Kya.
Monophyly was constrained for species represented in the analysis by more than one individual. The subspecies E. h. kulan and E. h. onager were not constrained to a single clade. However, following the results of the ML and Bayesian analyses above, we constrained monophyly on the clade comprising all the zebras, the clade comprising all horses (E. caballus and E. przewalskii), the clade

Branch Tests for Natural Selection
To identify regions of the mitogenome that may be under selection in any given lineage, we performed branch tests for each individual protein-coding gene as implemented in PAML v4.5 [56]. Based on the branch test results, branch-site tests were also run for the NADH dehydrogenase 4 protein gene (ND4) to test the caballine (E. caballus and NWSLH) branch for sites under positive selection.

Results and Discussion
We successfully recovered the complete mitogenome of three extinct equid taxa (NWSLH, Sussemione, and Quagga). We also characterized partial mitochondrial contigs (11-288 bp in length, for a total of 7,108 bp) for a second NWSLH (JW328). Of the modern equids we sequenced 14 full mitogenomes ( Table 1). The complete mitogenomes of the NSWLH, Sussemione, E. hemionus kulan, and the three species of zebra plus the Quagga had not been sequenced prior to this study. The average depth-of-coverage is 387.3 for modern samples and 57.2 for ancient samples (Table 1).
Minimal pairwise distances were observed between conspecific individuals, suggesting that the sequences reported here are in agreement with the known sequence diversity among equids (Table S8). We evaluated nucleotide misincorporation patterns through comparison of Illumina reads with the horse reference mitogenome following the procedure described in Ginolhac et al. [37]. In contrast to modern samples, typical accumulation of C to T and G to A mismatches were observed at sequence ends in the ancient samples, suggesting that the latter were affected by significant levels of post-mortem DNA damage, in particular cytosine deamination ( Figure S3). This, together with the base composition of our sequences (Table S6), higher substitution rates at third codon positions over first and second codon positions respectively (data not shown), and the blank experimental controls used during the whole laboratory procedure highlights the quality our sequence data.

Topological Relationships
Mitogenome alignments were used to perform phylogenetic analyses under a Bayesian and a Maximum Likelihood framework. All methods and sequence partitions considered supported the same topology (Figure 1a-b; Table S2; Figure S2), where all nodes except C, D and G received maximal support (Figure 1a-b). Interestingly, the mitochondrial topology appeared very similar to a study based on 22 partial mitochondrial and nuclear genes [57] and one reconstructed from ca. 55,000 nuclear SNPs [8]. This appears in strong contrast to other recent cases reported among mammals where the phylogenetic signal retrieved from nuclear and mitochondrial genes were in conflict (e.g for brown and polar bears, [58]; for Denisovans and Neanderthals, [59]).
Rooted phylogenetic analyses supported a major division between caballine horses and non-caballine equids, where horses and NWSLH appeared as sister species within caballine horses, in agreement with previous results based on partial mitochondrial sequences [20,22]. This suggests that Asiatic wild asses and NWSLH did not originate from a single ancestral population with gracile limbs, but instead that similarities in their post-cranial skeleton [25,60] derive from convergent evolution. However, another possibility is that gracile limbs is the ancestral state which has been preserved in the Asian asses, while it has been lost in all other lineages, resulting in limb states derived from a more gracile one.
Zebras were found to be monophyletic (node A, Figure 1a-b), with the Mountain zebra (E. zebra) diverging before the Plains (E. quagga) and Grevy's (E. grevyi) zebras, which are sister species. This topology is supported by other studies using a variety of genetic markers [8,12,24,57,60,61]. The now extinct Quagga nests within the Plains zebra clade, confirming their conspecific relationship [24]. Asiatic wild asses (node F, Figure 1a-b) can also be confirmed as monophyletic. The subspecies E. h. onager and E. h. kulan clustered together (albeit with relatively low bootstrap support) in a group divergent from E. kiang.
Sussemiones (as represented by E. ovodovi) were confirmed as a distinct branch within non-caballine equids (Figure 1a-b; Figure  S2), as suggested by previous analyses based on partial mitochondrial sequence information [22]. The Sussemione lineage showed greater genetic distance to caballine horses (Mean = 0.055460.0036) than to non-caballine equids (Mean = 0.042760.0026) ( Table S2; Table S8). Approximately Unbiased and Shimodaira-Hasegawa topological tests showed significantly more support for a topology nesting Sussemiones within non-caballine equids ( Figure S2; item 8 in Table S2). However, even though the full length of the mitogenome was characterized, the exact placement of Sussemiones within noncaballines could not be resolved with high confidence (Figure 1ab, Figure S2; and topological tests presented in Table S2), suggesting that additional phylogenetic information from nuclear genes and/or the identification of e.g rare genomic rearrangements or insertion sites of transposable elements will be required for unraveling the phylogenetic origin of this enigmatic equid lineage [21,62].
In Bayesian analyses, African wild asses and domestic donkeys were supported as a sister lineage to Asiatic wild asses, a pattern compatible with various past suggestions that the subgenus Asinus should incorporate both African and Asiatic asses; however this relationship was not mirrored in ML analyses (node D, Figure 1ab). Therefore, the mitochondrial information should be regarded as inconclusive with regards to the relationships between the main non-caballine groups (Sussemiones, Asiatic wild asses, African wild asses and donkeys, and zebras). Interestingly, SNP chip data comparing .40K autosomal markers [8] and a study combining nuclear and mitochondrial genes [57] could not resolve these relationships with high support either. This suggests that noncaballine equids have likely experienced an extremely rapid radiation. Alternatively, incomplete lineage sorting within the noncaballine ancestral population might explain the persistence of donkey-like mitochondrial sequences in the Asiatic wild ass lineage, resulting in spurious phylogenetic reconstructions.
The emergence of new approaches using NGS, including whole exome capture, or even complete genome characterization, will soon be available to determine whether the non-caballine phylogeny corresponds to a hard or soft polytomy [63,64], or incomplete lineage sorting [65]. For now, we note that many other studies have also found a sister relationship between African wild asses and Asiatic wild asses [8,57,[66][67][68][69] (but see [12]).

Branch Tests for Natural Selection
We used the PAML package [56] to estimate non-synonymous to synonymous rate ratios (dN/dS) along the main branches of the topology ( Figure S1) and compared this model to a null model that assumes one conserved ratio across the whole phylogeny (Table 2). For all genes except COX2, ND2, ND4 and ND5, the null model was better supported and dN/dS ratios were found to be significantly lower than 1 (Table S7), in agreement with the presence of purifying selection acting on mitochondrial genes. For ND4 (NADH dehydrogenase 4), the dN/dS ratio of the ancestral branch leading to caballine horses (horses and NWSLH) was found to be significantly superior to 1 (Table S7), suggesting positive selection for non-synonymous variation during the early evolutionary history of that lineage. In order to identify which amino acid(s) could have been functionally advantageous, we performed branch-site tests in the branch ancestral to caballine horses compared to the rest of the topology for gene ND4. However, this analysis did not identify any amino acids that were supported as having evolved under positive selection (p-value: .0.25).

Divergence Times
We used relaxed molecular clock (log-uncorrelated) methods as implemented in BEAST [51] to date recent lineage divergence events along the evolutionary history of equids. A total of eleven analyses were run including different data set partitions (4-way or 5-way partitioned), different calibration points (the origin of Perissodactyla at 55 Mya, and/or a recent emergence of the Plains zebra lineage at 0.7 Mya), and different models of speciation (Yule versus Birth/Death). The resulting average estimated ages are displayed in Figure 2, Table 2 and Table S9a-b.
The estimated mitochondrial TMRCA for the Equidae varied considerably between our different analyses. When the two outgroup taxa are excluded from the analysis and only the more recent calibration point is used, we estimate the mitochondrial TMRCA of all equids around 4.3 Mya (95% CI: 4.0-4.7 Mya). When the outgroup taxa are included, the average mitochondrial TMRCA increases to 8.6 Mya (95% CI: 6.7-11.9 Mya) (see Table  S9a-b). The younger TMRCA is similar to previously reported divergence estimates for the Equidae (3.8 Mya: [12]; 3.9 Mya: [60]; 4.0 Mya: [22]), however ages bracketing that range have also been proposed (2.3 Mya: [10]; 5.8 Mya: [20]). We believe the older age estimates likely result from the contrast between the deep phylogenetic distance separating equids and other living perissodactyls, and the relatively young origin of equids as a whole. Evolutionary rate estimates have been shown to vary in a timedependent manner [70,71], with deep calibrations leading to slower evolutionary rate estimates. The posterior distribution of mutation rates recovered from our analyses show this trend (Table 3). This may be due to a variety of effects, including substitution saturation, purifying selection, and model misspecification, especially with regard to how rate heterogeneity across sites is compensated [72,73]. The older calibration may therefore be less appropriate for estimating recent divergence times, such as those we aim to address here.
Assuming only the more recent calibration, we find that caballine and non-caballine equids most likely diverged ,4 Mya. This estimate is older than the age of the oldest fossil unambiguously identified as Equus in the paleontological record, which has been dated to 2.1 Mya [74]. According to our estimates, the mitochondrial TMRCA of the lineage leading to domestic horses (E. caballus) is , 364 Kya (95% CI: 336-389 Kya). This is older than that reported in Steiner and Ryder [12] (250 Kya), Achilli et al. [5] (150 Kya), and Lippold et al. [7] (100 Kya), all of whom assumed that the genus Equus emerged ,2 Mya. We should caution that our estimates assume 700 Ky as a calibration date for the TMRCA of Plains zebras. In the current dataset, the latter is estimated based on only three lineages, which likely leads to an under-estimation of the true phylogenetic distance among Plains zebras, and consequently of the mutation rate. Therefore, our molecular clock analyses are expected to provide upper bound estimates for divergence times and TMRCA dates. Additional complete mitochondrial genome data from modern Plain zebras coupled with tip calibration based on a series of radiocarbon dated ancient equid samples [75] will be needed to revisit the exact timing of lineage radiation in equids. For now, we note that most of the extant equid lineages radiate at the transition from the Tertiary to the Quaternary at 2.6 Mya, with a TMRCA for caballine horses estimated at 2.6 Mya (95% CI: 2.4-2.9 Mya) ( Table 2; Table S9b

Concluding Remarks
Here, we provide new data to assess the timing of the appearance of extant and several Late Pleistocene equid lineages, and to investigate their evolutionary relationships to one another based on complete mitogenome sequences. Our results are relevant to taxonomic boundaries: (1) While we confirm the monophyly of zebras, we find a surprisingly deep evolutionary divergence between the Mountain zebra and Plains zebra; (2) our results support recognition of the Kiang as an evolutionarily distinct species, rather than part of a single radiation that includes the Onager and Kulan; (3) we confirm that New World stilt-legged horses are closely related to caballine horses; (4) we show that the enigmatic Late Pleistocene Sussemione is a non-caballine equid that is only distantly related to extant equid lineages; and (5) we confirm that the recently extinct Quagga of South Africa was a subspecies of the widespread Plains zebra. Our study has also revealed a rapid radiation within non-caballine equids, within which details regarding order and pattern of diversification could not be resolved from complete mitogenomes. The massive throughput of current NGS platforms now enables the characterization of the complete nuclear genome of non-model organisms [78]. Such genome-wide approaches will likely provide enough information to resolve the nodes that are presently still problematic. Figure S1 Constrained topology used for positive selection tests in PAML. Branch numbers are shown and are equivalent to branch numbers used in Table S7. (TIFF) Figure S2 The 12 topologies tested in the topological test. Topology number 1 to 12 is equivalent to item number in Table S2. Suss = Sussemione (E. ovodovi); Burch = Plains zebra (E. quagga); Grev = Grevy's zebra (E. grevyi); Moun = Mountain zebra (E. zebra); Ass = African wild ass and domestic donkey (E. africanus and E. asinus); Ona/Kul = E. hemionus (Onager and Kulan). (PDF) Figure S3 DNA fragmentation and Nucleotide misincorporation patterns. Panel A: Ancient sample ACAD2304 E. ovodovi. Panel B: Ancient sample QH1 E. quagga quagga (only shotgun reads are considered in order to avoid the base composition bias resulting from the target-enrichment approach). Panel C: MS272 E. sp. NWSLH Panel D: Modern sample H11 E. zebra hartmannae. The base composition of the reads is reported for the first 10 nucleotides sequenced (left: 1-10) as well as for the 10 nucleotides located upstream of the genomic region aligned to the reads (left: 21 to 210). In addition, the base composition of the last 10 nucleotides sequenced (right: 210 to 21) and of the 10 nucleotides located downstream from the reads (right: 1-10) is also provided, all in relation to the mitochondrial horse reference genome. Nucleotide positions located within reads are reported with a gray frame. Each dot reports the average base composition per position. The figure shows two parallel base compositions; these correspond to the base composition of the two mitochondrial DNA strands, one being relatively enriched in purines (H, heavy), compared to the other one (L, light). The frequencies of all possible mismatches and indels observed between the horse genome and the reads are reported in gray as a function of distance from 59-to 39-ends (first 25 nucleotides sequenced) and 39-to 59-(last 25 nucleotides), except for C.T and G.A that are reported in red and blue, respectively. These frequencies are calculated by dividing the total number of occurrences of the modified base at a given position in a read by the total number of the unmodified base at the same position in the horse genome. (PDF)   Figure S2. Item = Topology number ( Figure S2). (PDF) Table S3 Selected models from jModelTest. The best model selected by jModelTest per gene or codon is displayed according to analysis. A: On full dataset. B: On dataset excluding non-equid outgroup. (PDF) Table S4 Primer pairs used to amplify some of the modern mitogenomes. Ta = annealing temperature, Ext. = extension time, Size = size of the amplicon generated by the primer pair, Location = a rough estimate of the part of the mitogenome the primers amplify based on the horse reference mitogenome (JN398398). * SF and LF are universal mammalian mitogenome amplifying primers (designed in-house by Sandra Abel Nielsen), Pr1 and Pr2 are from study Xu et al. [17], and Pr3 was designed in-house by Ludovic Orlando. (PDF) Table S5 Primer sets used to amplify 5 modern mitogenomes for FLX sequencing. Primer set names as from Table S4, primer sets from Pr3 to 14.3_Pr3 make up shorter regions of the LF fragment. Pr1 and Pr2 are shorter primers to cover the hypervariable region of the control region, while the last four are to fill in gaps throughout the mitogenome. (PDF)   Figure S1. Values superior to 1 are highlighted in bold. P-values from LRT are given in the bottom row, with pvalues #0.02 highlighted in bold. (PDF)