Large geographic distance versus small DNA barcode divergence: Insights from a comparison of European to South Siberian Lepidoptera

Spanning nearly 13,000 km, the Palearctic region provides an opportunity to examine the level of geographic coverage required for a DNA barcode reference library to be effective in identifying species with broad ranges. This study examines barcode divergences between populations of 102 species of Lepidoptera from Europe and South Siberia, sites roughly 6,000 km apart. While three-quarters of these species showed divergence between their Asian and European populations, these divergence values ranged between 0–1%, distinctly less than the distance to the Nearest-Neighbor species in all but a few cases. Our results suggest that further taxonomic studies may be required for 16 species that showed either extremely low interspecific or high intraspecific variation. For example, seven species pairs showed low or no barcode divergence, but four of these cases are likely to reflect taxonomic over-splitting while the others involve species pairs that are either young or show evidence for introgression. Conversely, some of the nine species with deep intraspecific divergence at varied spatial levels may include overlooked species. Although these 16 cases require further investigation, our overall results indicate that barcode reference libraries based on records from one locality can be very effective in identifying specimens across an extensive geographic area.


Introduction
In many cases, DNA barcoding can be an effective tool for both specimen identification and species discovery. In animals, a 648 base pair segment of the mitochondrial cytochrome c oxidase subunit 1 (COI) gene has been adopted as the barcode region [1], [2]. Numerous researchers [...] Republic, Russia), supplemented by a few reliably identified specimens from other areas (Fig 1). Whereas DNA barcode coverage for lepidopteran taxa is generally high for species from central and northern Europe, only a few records are available from South Siberia. We therefore sought to obtain specimens of >100 species shared by these regions. We focused on species with a disjunct arctic-alpine and South Siberian-alpine distribution based on the expectation that they would be likely to show higher intraspecific barcode variation.
Species identification was exclusively based on morphological traits. In general, we analyzed three specimens from South Siberia for each of these species to estimate intraspecific divergence, but only two specimens were available for 18 species whereas for 15 species the number of voucher specimens ranged between 4 and 8. The average number of successfully sequenced specimens per species from Asia was 3.24. By comparison, the number of sequenced specimens was much higher for most European representatives of these species with 16.34 sequenced specimens per species on average. Existing specimens from museum collections were analyzed where possible and were supplemented with material from an expedition to the Russian Altai Mountains from late July to mid-August 2016 [11]. A permit was not required for the Altai specimens as no protected species were collected. Collections in other countries were made in compliance with current legislation. In Finland, permits were issued by the Finnish Centre for Economic Development, Transport, and the Environment to MM under permissions VARELY/441/07.01/2012 and LAPELY/275/07.01/2012, while collecting permits were not necessary for scientific research in Austria/Tyrol. The Nagoya protocol was not applicable because our European material was collected before October 12, 2014 and because the protocol has not been ratified by Russia.
Most sequences considered in this study derive from specimens held in the Tiroler Landesmuseum Ferdinandeum, Innsbruck, Austria; the University of Oulu, Oulu, Finland; the Bavarian State Collection of Zoology, Munich, Germany; and another 25 specimen depositories. Wherever possible, data were supplemented by publicly available sequences in BOLD ([12]; see http://www.boldsystems.org).

DNA sequencing
For freshly collected specimens, a single leg was removed and placed in a 96-well lysis plate that was submitted for analysis to the CCDB (Canadian Centre for DNA Barcoding, University of Guelph, Canada), where DNA extraction, PCR amplification, and sequencing were performed following standard high-throughput protocols [13].
Altogether, 315 specimens of 102 South Siberian species that also occur in Europe were sequenced. Moreover, we examined 1,682 previously published sequences (>500 bp) [10] from specimens of the same species from sites in Europe including Finland (423), Austria (410), Germany (329), Russia (315), and 19 other countries (520) (Fig 1). Information regarding the institutions hosting each publicly available specimen, sample and process IDs, and GenBank accession numbers is available in S1 Table. Further details on each specimen, including complete voucher data and images, are available on BOLD [12] in the public dataset "Lepidoptera of Altai Mountains (DS-LEPEUALT)" under the DOI: 10.5883/DS-LEPEUALT.

Data analysis
The extent of intraspecific sequence variation in COI was estimated for each species under the Kimura-2-parameter (K2P) model of nucleotide substitution, using analytical tools on BOLD v4.0 (http://www.boldsystems.org) and MEGA v.6 [14]. The choice and justification of K2P and other distance measures in barcoding analyses have been debated (e.g., [15]). However, the 'best method' depends on the dataset under consideration, and the effects of different distance measures and models on distances and identification success are generally small (e.g., [16]). Model choice is therefore unlikely to affect the main results of this study, and we applied the K2P method as implemented in BOLD. For each species we obtained four estimates of intraspecific divergence by calculating the arithmetic mean of all pairwise K2P distances among conspecific individuals within the following spatial contexts: (a) 'total intraspecific' (mean distance for all data for each species); (b) 'within Europe' (mean distance for all European samples); (c) 'within Asia' (mean distance for all South Siberian-Central Asian samples); and (d) 'inter Europe-Asia' (mean distance within each species for all pairs of specimens from Europe vs. Asia).
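To make the four estimates concrete, the following is a minimal sketch of the calculation, not the BOLD or MEGA implementation. It assumes pre-aligned sequences of equal length and uses the standard K2P formula d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions. The function names, the region labels "Europe" and "Asia", and the context keywords are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations, product

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """K2P distance between two aligned sequences.

    Sites where either sequence has a gap or an ambiguity code are skipped.
    """
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; every other change is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def mean_pairwise(records, context):
    """Mean K2P distance for one species in one spatial context.

    records: list of (region, aligned_sequence) tuples for a single species,
             with region being "Europe" or "Asia" (hypothetical labels).
    context: "total", "within_europe", "within_asia" or "inter".
    Returns None when the context contains no pair of specimens.
    """
    europe = [seq for region, seq in records if region == "Europe"]
    asia = [seq for region, seq in records if region == "Asia"]
    if context == "total":
        pairs = list(combinations([seq for _, seq in records], 2))
    elif context == "within_europe":
        pairs = list(combinations(europe, 2))
    elif context == "within_asia":
        pairs = list(combinations(asia, 2))
    elif context == "inter":
        pairs = list(product(europe, asia))
    else:
        raise ValueError("unknown context: " + context)
    if not pairs:
        return None
    return sum(k2p_distance(a, b) for a, b in pairs) / len(pairs)
```

Multiplying the returned value by 100 gives the percentage divergence used throughout the text. Note that the 'inter' context pairs every European specimen with every Asian specimen, so it excludes all within-region pairs.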
Furthermore, we examined the potential impact of distribution type on intraspecific divergences. For this analysis, each species was assigned to one of two categories: (a) those with largely continuous distributions across Eurasia, i.e. with known gaps <500 km; and (b) those with highly disjunct distributions, i.e. with gaps between known populations >2,000 km. These two categories basically reflect what has been termed Euro-Siberian versus arctic-alpine and South Siberian-alpine distribution patterns in biogeographic studies [17].
We compared mean intraspecific sequence divergences across the three spatial levels (intra-Europe, intra-Asia, inter-Europe-Asia) using a non-parametric Friedman ANOVA of ranks because of uneven variance and sequence numbers for the 102 species. Total mean intraspecific barcode divergence between the two types of species distributions was compared using a Mann-Whitney U-test. In addition, we examined the strength of isolation by distance within every species. For this purpose, we calculated a Mantel correlation coefficient between the matrix of geographic distances between sampling localities and the K2P distance matrix for every species using the Geographic Distance Correlation tool in BOLD. These correlation coefficients were then tested for association with distribution type and with overall intraspecific sequence divergence using a Mann-Whitney test and a Spearman rank correlation, respectively. Statistical analyses were performed using Statistica 8.0 (StatSoft Inc.).
Finally, we compared the mean and maximum intraspecific divergence for each of the 102 species with its Nearest-Neighbor (NN) distance, because a gap between intraspecific and interspecific variation is essential for DNA barcoding to be effective in specimen identification. For this purpose we used the DS-MARKALL dataset (dx.doi.org/10.5883/DS-MARKALL). It includes >500 bp sequence records for 41,583 specimens representing 5,016 species of Lepidoptera [10]. We limited comparisons to this dataset because it is comprehensive and its identifications are very reliable. Sequences from the present study and from DS-MARKALL were pooled, and a barcode gap analysis was then carried out on BOLD using the K2P model. This analysis estimated the minimum genetic divergence to the NN and both the mean and maximum intraspecific divergences for each species.
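The logic of a barcode gap analysis can be sketched as follows. This is an illustrative sketch only and not the BOLD tool: it assumes a precomputed symmetric matrix of pairwise distances (for example K2P) and one species label per specimen. The function name and the returned keys are hypothetical. For species represented by a single specimen the intraspecific values are set to 0.0, which is a simplifying assumption.

```python
from itertools import combinations


def barcode_gap_summary(labels, dist):
    """Summarize intraspecific divergence and Nearest-Neighbor distance.

    labels: species name for each specimen (list of str).
    dist:   symmetric matrix (list of lists) of pairwise distances,
            indexed in the same order as labels.
    Returns a dict keyed by species name.
    """
    summary = {}
    for species in sorted(set(labels)):
        members = [i for i, name in enumerate(labels) if name == species]
        others = [i for i, name in enumerate(labels) if name != species]

        # All pairwise distances among conspecific specimens.
        intra = [dist[i][j] for i, j in combinations(members, 2)]
        mean_intra = sum(intra) / len(intra) if intra else 0.0
        max_intra = max(intra) if intra else 0.0

        # Minimum distance from any member to any specimen of another species.
        nn_dist, nn_species = None, None
        if others:
            nn_dist, nn_index = min(
                (dist[i][j], j) for i in members for j in others
            )
            nn_species = labels[nn_index]

        summary[species] = {
            "mean_intra": mean_intra,
            "max_intra": max_intra,
            "nn_species": nn_species,
            "nn_dist": nn_dist,
            # A barcode gap exists when the NN distance exceeds the
            # maximum intraspecific divergence.
            "has_gap": nn_dist is not None and nn_dist > max_intra,
        }
    return summary
```

Species for which "has_gap" is false are the candidates for the two situations discussed later: very low divergence from the NN, or deep intraspecific divergence.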

Sequenced species
We collected 1,997 sequences >500 bp from the 102 species. Among them, 54 sequences, most likely because of partially degraded DNA, were not barcode compliant according to the standards in BOLD set out by the Consortium for the Barcode of Life (CBOL), i.e. a minimum sequence length of 500 bp, less than 1% ambiguous bases, the presence of two trace files, a minimum of low trace quality status, and the presence of a country specification in the record. Nevertheless, these 54 sequences were still considered in the analysis as they were correctly placed with their conspecifics in an initial NJ tree. The seven families with the largest numbers of sequences were Noctuidae (551), Geometridae (389), Erebidae (175), Tortricidae (157), Nymphalidae (146), Gelechiidae (144), and Lycaenidae (133).
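The compliance criteria listed above can be expressed as a simple filter. The sketch below is hypothetical: the record field names ("sequence", "trace_files", "trace_quality", "country") and the ordering of trace quality categories are assumptions made for illustration and do not reflect the actual BOLD data model.

```python
# Hypothetical ordering of trace quality categories (assumption).
TRACE_QUALITY_RANK = {"failed": 0, "low": 1, "medium": 2, "high": 3}


def is_barcode_compliant(record):
    """Check one record against the criteria named in the text.

    record: dict with the hypothetical keys "sequence", "trace_files",
            "trace_quality" and "country".
    """
    seq = record["sequence"].upper().replace("-", "")  # drop alignment gaps

    # Minimum sequence length of 500 bp.
    if len(seq) < 500:
        return False

    # Less than 1% ambiguous bases (anything other than A, C, G, T).
    ambiguous = sum(1 for base in seq if base not in "ACGT")
    if ambiguous / len(seq) >= 0.01:
        return False

    # Presence of two trace files.
    if record.get("trace_files", 0) < 2:
        return False

    # Trace quality of at least "low".
    quality = TRACE_QUALITY_RANK.get(record.get("trace_quality"), 0)
    if quality < TRACE_QUALITY_RANK["low"]:
        return False

    # Country specification present.
    if not record.get("country"):
        return False

    return True
```

A filter of this kind only flags records; as described above, flagged sequences can still be retained when they cluster correctly with their conspecifics.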

Intraspecific barcode divergences
Intraspecific barcode divergence was generally <1% with a mean (± SD) of 0.68 ± 0.67% (median: 0.43%; range: 0.00 to 3.46%) for the 102 species. As expected, there were highly significant differences among the three regional comparisons (Friedman ANOVA: χ² = 77.82, df = 2; p<0.0001). Divergences were lowest within the Asiatic samples as expected because they originated from few collecting sites with low numbers of specimens, while divergences within Europe averaged higher, and those between the European and Asiatic samples were highest (Fig 2, Table 1). In post-hoc comparisons, all three pairwise comparisons were highly significant (Wilcoxon tests, p<0.007).

Factors affecting isolation by distance within species
As expected, the extent of sequence divergence between members of a species was often related to the distance between their sites of collection. However, the extent of this isolation-by-distance effect was highly variable among species. Sequence divergences in 56 of the 102 species showed no association with distance, while 13 species showed a weakly significant Mantel correlation (p<0.05) and 33 species showed a strong relationship (p<0.01). Evidence for isolation-by-distance was stronger in species with disjunct (mean Mantel r = 0.59±0.32) than continuous distributions (mean Mantel r = 0.28±0.27; Mann-Whitney test: z = 4.19, p<0.0001; Fig 4). In species with disjunct distributions, the extent of isolation-by-distance was only weakly and non-significantly related to overall sequence divergence (Spearman rank correlation: r_S = 0.40, p = 0.087), and this relationship was even weaker and also non-significant for species with continuous ranges (r_S = 0.20; p = 0.073). The strength of isolation-by-distance patterns within species did not co-vary with the maximum distance between sampling sites (r_S = -0.005, p = 0.96), but it was negatively related to the number of sequences available for a taxon (r_S = -0.27, p = 0.007).
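For readers unfamiliar with the Mantel correlation used above, the following sketch shows one common way to compute it for a single species: a Pearson correlation between the off-diagonal entries of the geographic and genetic distance matrices, with a one-sided permutation test. It is not the Geographic Distance Correlation tool in BOLD, whose internal details are not described here. The function names, the haversine great-circle formula for geographic distance, the Earth radius of 6,371 km, and the default of 999 permutations are all assumptions for illustration.

```python
import math
import random


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def upper_triangle(matrix, order=None):
    """Off-diagonal upper-triangle entries, optionally after permuting specimens."""
    n = len(matrix)
    idx = order if order is not None else list(range(n))
    return [matrix[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mantel(geo, gen, permutations=999, seed=0):
    """Mantel r and one-sided permutation p-value (r greater than expected).

    geo, gen: symmetric distance matrices (lists of lists) for the same
              specimens in the same order.
    """
    rng = random.Random(seed)
    gen_vec = upper_triangle(gen)
    r_obs = pearson(upper_triangle(geo), gen_vec)
    n = len(geo)
    hits = 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)  # permute rows and columns of geo together
        if pearson(upper_triangle(geo, order), gen_vec) >= r_obs:
            hits += 1
    # The observed arrangement counts as one of the permutations.
    p_value = (hits + 1) / (permutations + 1)
    return r_obs, p_value
```

Permuting specimens rather than individual matrix cells preserves the dependence among distances that share a specimen, which is why an ordinary correlation test is not appropriate for distance matrices. With very few specimens per species the number of distinct permutations is small, which limits how small the p-value can become.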

Discussion
Our analysis of DNA barcode sequences from a phylogenetically diverse group of Lepidoptera from Asia and Europe revealed that intraspecific divergences increased with sampling intensity and distance. Nevertheless, intraspecific divergences in most species remained low, with mean K2P divergences averaging 0.68% and exceeding 2.5% in 23 species of the complete sample. However, divergence was >2.5% in just 9 of the 102 species in one or more of the three spatial levels of our analysis. By comparison, the species with divergence above 2.5% showed a mean sequence divergence of 4.62% to European populations of 5,016 species of Lepidoptera. This result corroborates patterns from earlier studies on North American [6] and European Lepidoptera [4], confirming that the barcode region of COI is an efficient tool for species identification, provided that the databases are of high quality, even when the reference sequences used for species identification derive from sites far distant from the locality under study. Irrespective of their origin, most sequences could be unambiguously allocated to a taxonomically defined species, although several cases of high intraspecific divergence may reflect overlooked species (as discussed later). Conversely, 4 of the 7 species pairs (Crambus perlella/monochromella, Crocallis elinguaria/albarracina, Epinotia trigonella/indecorana, Coenonympha tullia/rhodopensis) that either lacked or possessed very limited (<0.5%) divergence from their NN may indicate taxonomic over-splitting rather than the failure of DNA barcoding to discriminate valid species (see [10]). For three other species pairs (Setina irrorella/aurita, Boloria titania/chariclea and Perizoma hydrata/affinitata), the low NN values suggest a recent divergence of valid, morphologically well-defined species or recent mitochondrial introgression. For example, an earlier study suggested that the low NN divergence between P. hydrata and P. affinitata resulted from mitochondrial introgression from P. hydrata to P. affinitata [18].
Our comparisons of European and South Siberian populations revealed regional sequence divergence in about half the species, but most values were well below 2%. In addition, regional barcode variation was similar in species with disjunct distributions and in those with continuous ranges, indicating substantial gene flow in both cases. In part, this may reflect the fact that current distributions of Euro-Siberian Lepidoptera largely result from range expansions in the brief interval since the last glacial maximum, i.e. within less than 15,000 years [19], [20]. However, when intraspecific sequence divergences were examined using an isolation-by-distance approach, they were slightly stronger in species with disjunct ranges.
Despite our limited sampling, some species (e.g. Elachista bedellella, Boloria napaea, B. titania and Plebejus orbitulus) showed clear divergence between South Siberian and European populations (see Table 1). In addition, populations of some species from northern Europe clustered with those from Asia rather than from central Europe (e.g. Xestia speciosa). This pattern likely indicates that formerly glaciated areas in northern Europe were sometimes recolonized by lineages from Asia. All these intraspecific patterns need to be examined in more detail by increased sampling effort in intermediate areas, and should be cross-checked using morphology and nuclear markers to clarify phylogeographic histories. Yet, for the purpose of species identification, we did not encounter any significant barriers, even in these taxa.

High intraspecific divergences: potential cryptic diversity
High intraspecific barcode divergences (>2–3%) may be indicative of the existence of overlooked species of Lepidoptera, but may also be due to mitochondrial introgression from a sister species [21]. Therefore, all such cases should be analyzed in more detail by examining divergence patterns at nuclear loci and morphological characters. We detected high intraspecific divergences (>2.5% maximum divergence) between European and Asian populations for 9 of the 102 species (Table 3). Six of these species have a disjunct distribution, suggesting the possible existence of cryptic species in South Siberia versus Europe. In three other species (e.g. Coscinia cribraria), barcode variation was high even within Europe without an obvious geographical pattern. The remainder of this section discusses these nine species in more detail. All of them group into two or more different BINs [3] (S1 Table), operational taxonomic units which in Lepidoptera are frequently but not always congruent with species boundaries (e.g. [7], [22]). In fact, deep barcode splits may also be caused by pseudogenes, Wolbachia infection, hybridization, etc. [23], and these cases need to be analysed using an integrative approach (e.g., [24]).
1. Caryocolum pullatella (Tengström, 1848) (Gelechiidae). C. pullatella is a Holarctic species that is widespread in northern Europe, but restricted to isolated localities in the Alps and Balkans [25]. As its Palearctic populations include two DNA barcode clusters with allopatric distributions (central/south-east Europe versus north Europe-South Siberia), this may indicate cryptic diversity. The situation potentially gains further complexity when North American specimens are considered, as they include additional BINs, and it requires further assessment.

2. Coscinia cribraria (Linnaeus, 1758) (Erebidae). This morphologically variable species is widely distributed across the Palearctic. Numerous forms and subspecies have been described, including ssp. sibirica (Staudinger, 1892) from the Altai Mountains which was [...]
[...] Populations from the Altai region have been attributed to the nominotypical subspecies, but the clear differences in their external morphology and genitalia [28], coupled with their barcode divergence, suggest they represent a cryptic species.
4. Eana osseana (Scopoli, 1763) (Tortricidae). E. osseana is a widespread Holarctic species, restricted to mountainous areas at the southern limits of its distribution. DNA barcodes indicate two divergent BINs, one from Europe, and a second from the Altai Mountains. As three additional BINs are known from North America, the species requires integrative revisionary work.
5. Eulithis prunata (Linnaeus, 1758) (Geometridae). This species is almost continuously distributed in temperate Eurasia, but is restricted to mountainous areas in the southern parts of its range. Hausmann & Viidalepp (2012) [29] found high COI sequence divergence in E. prunata, with distances reaching 5.9% and at least six divergent haplotypes in Europe and Turkey. South Siberian populations have been assigned to the ssp. leucoptera (Djakonov, 1929), but it may represent a distinct species given its deep barcode divergence from other populations.
6. Gazoryctra ganna (Hübner, 1804) (Hepialidae). G. ganna is an arctic-alpine species with a disjunct distribution. It occurs in the Alps and High Tatra Mountains, northern Finland, and European Russia, as well as at isolated localities to the Far East [30]. Moderate sequence divergence exists between northern and central European populations [8] while those from the Altai Mountains show high sequence divergence from both European clusters. Because of their differing flight times (late afternoon in Asia versus early morning in Europe) and slightly different phenotypes, the Asian specimens likely represent an overlooked species.
7. Ochsenheimeria urella (Fischer von Röslerstamm, 1842) (Ypsolophidae). O. urella is widely although locally distributed in central and northern Europe, including European Russia. A previously doubtful record from the Far East [30], together with our record from Altai [11], indicates a much wider distribution in Asia. Members of this species are placed in two BINs, one shared by the Alps and Finland, and the other by Finland and the Altai Mountains.
8. Pontia callidice (Hübner, 1800) (Pieridae). P. callidice shows a disjunct distribution in the high mountains of Eurasia from the Pyrenees to the Himalayas, and in the subarctic tundra from the Ural Mountains to the Far East. Linked to their geographic isolation, populations show considerable variation in wing patterns and have been assigned to several subspecies. The nominotypical subspecies occurs in the high mountains of Europe (Pyrenees and Alps [...] Altai). Despite this nomenclatural uncertainty, the DNA barcode results indicate that specimens from the Alps belong to a very distinct barcode cluster from those in Russia (Altai), Kyrgyzstan and Tajikistan.
9. Scrobipalpula diffluella (Frey, 1870) (Gelechiidae). In the Palearctic, the genus Scrobipalpula includes a complex of closely related species with disputed taxonomy [25]. S. diffluella shows a typical boreo-montane distribution with most records from northern and central Europe, extending to the southern Urals. Specimens of the newly detected population from the Altai show close morphological similarity with European material, but clear barcode divergence, suggesting cryptic diversity.

Conclusions
This study on a phylogenetically diverse sample of Lepidoptera across a wide geographic range within the Palearctic region corroborates the utility of DNA barcode data for enabling both species identification and species discovery. For most species, unequivocal identifications could be established for samples from a widely distant region (the Russian Altai Mountains), even though available reference data largely derived from regions in north and central Europe. On the other hand, in a few 'species' taxonomically known since Linnaean times, patterns of sequence divergence suggest the possibility of unrecognized cryptic species diversity and demand further assessment using an integrative taxonomic approach. Hence, this study exemplifies the usefulness of well-curated DNA barcode libraries whose power and versatility will expand as more sequence data are collated under strict quality standards.
Supporting information
S1 Table. Accession numbers and BINs. List of species names, sample IDs, process IDs (from the BOLD database), GenBank accession numbers, BINs, and institution/collection storing vouchers. (PDF)