A genetic atlas for the butterflies of continental Canada and United States

Multi-locus genetic data for phylogeographic studies is generally limited in geographic and taxonomic scope as most studies only examine a few related species. The strong adoption of DNA barcoding has generated large datasets of mtDNA COI sequences. This work examines the butterfly fauna of Canada and United States based on 13,236 COI barcode records derived from 619 species. It compiles i) geographic maps depicting the spatial distribution of haplotypes, ii) haplotype networks (minimum spanning trees), and iii) standard indices of genetic diversity such as nucleotide diversity (π), haplotype richness (H), and a measure of spatial genetic structure (GST). High intraspecific genetic diversity and marked spatial structure were observed in the northwestern and southern North America, as well as in proximity to mountain chains. While species generally displayed concordance between genetic diversity and spatial structure, some revealed incongruence between these two metrics. Interestingly, most species falling in this category shared their barcode sequences with one at least other species. Aside from revealing large-scale phylogeographic patterns and shedding light on the processes underlying these patterns, this work also exposed cases of potential synonymy and hybridization.


Introduction
Because of its high mutation rate and clean genealogical signal, mitochondrial DNA has long been used to detail the spatial configuration of genetic lineages [1].However, the growing capacity to generate genetic data now makes it possible to also investigate multiple nuclear markers or entire genomes.While the latter approaches overcome limitations of a single marker [2][3][4] in accurately describing patterns of genetic diversity [5], their application is typically limited to only a few species.The broad adoption of the mitochondrial COI barcode region for animals [6] stimulated the development of BOLD (https://v4.boldsystems.org)[7], a data repository that now holds more than 16 million geo-referenced DNA barcode records.As particular effort has been directed towards Lepidoptera [e.g., [8][9][10][11], phylogeographic assessments are now possible for all species on a continental scale [12].Founded on an updated checklist for the butterfly species of North America, the present study assembles a curated dataset of 13,236 DNA barcode records for this fauna, nearly 15% of which are new.Based on these data, it compiles haplotype maps and networks for each species.This information was integrated with standard indices of genetic diversity to characterize their genetic diversity and genetic structure.

Sampling and COI characterization
The barcode data considered here mainly derive from an earlier compilation [8] reinforced with 1,750 additional records.A total of 13,236 georeferenced records (6838-Canada, 6398-USA) were assembled in the dataset "DS-ATLASNAB" (dx.doi.org/10.5883/DS-ATLASNAB).All 13,236 records included at least 500 unambiguous base pairs.This study provided barcode coverage for 619 species, 90% (619 of 691) of the butterfly taxa known to be resident in Canada and the continental United States (Fig 1).While sampling bias is inevitable, targeted selection of samples from two major natural history collections in North America, namely the Canadian National Collection and the Smithsonian's National Museum of Natural History, helped to alleviate this issue.Sampling considered both the geographic and phylogenetic diversity of the species.The fauna is delineated by oceanic dispersal barriers to the east, west, and north and by the arid regions of northern Mexico to the south.The present checklist is largely based on D'Ercole et al., [8] augmented by recent taxonomic studies to refine the faunal list (S1 Appendix).In a few cases, where the evidence for revised taxonomic status was considered incomplete, proposed changes were not adopted, but these cases are highlighted in the checklist.Zhang et al. [13] provided low coverage (1x) genomic data for all North American butterfly species to advance understanding of their diversification and adaptation, but the results also revealed incongruencies with the current taxonomic system that were considered in this study.
DNA extraction, polymerase chain reaction (PCR), and sequencing followed the established procedures at the Centre for Biodiversity Genomics.DNA extraction utilized a silica-based from Food from Thought program (https:// foodfromthought.ca/)(CFREF Project 000054) supported by the Canada First Research Excellence Fund (https://www.cfref-apogee.gc.ca/) provided salary for JD.The funders had no direct involvement in data collection, analysis or manuscript preparation.There was no additional external funding received for this study.

Competing interests:
The authors have declared that no competing interests exist.method implemented in 96-well plate format [14].PCR volumes and thermal cycling conditions followed methods in DeWaard et al. [15].Although Sanger sequencing can recover small DNA fragments [16], the need to characterize numerous short amplicons is expensive and requires a substantial input DNA.High-throughput sequencing (HTS) not only provides the capability to analyze multiple sets of amplicons simultaneously but also, due to their ability to analyze single molecules, can produce reliable results with low concentrations of template DNA [17].This study employed the Sequel platform to characterize short amplicons generated by multiplexing different primer sets and nested PCR [18].CodonCode Aligner (CodonCode Corporation, http://www.codoncode.com) was employed to assemble trace files into contigs, thereby generating a sequence record for each specimen.Data validation revealed a few sequences that reflected either contamination by non-target species or nuclear mitochondrial elements (NUMTs); these records were excluded.A Neighbor-Joining tree [19] for the 619 species was built to search for unexpected placements, a situation that often reflects operational errors [20], but none were detected.

Genetic analysis
The data were analyzed using methods described by Dapporto et al. [12].Haplotype maps were constructed to depict the geographic distribution of genetic lineages.A matrix of p-distances (with pairwise deletion of missing sites) computed using the records for each species was subjected to Principal Coordinates Analysis (PCoA) (S1A Fig provides an example for Limenitis lorquini).The resultant 2-dimensional space was then overlapped with a colored square whose vertices were yellow, blue, white, and black.Shades of these colors defined the internal space of the square (e.g.S1B Fig) .The overlap between the dimensional space of genetic distances and the colored square allowed assignment of colors to haplotypes-an approach that allowed color similarity to serve as a proxy for genetic similarity (e.g.S1C Fig).
Color choice considered the most common types of color blindness.Haplotypes were then grouped by areas and displayed in circles whose size was proportional to the number of sequences (e.g.S2 Fig) .In addition to haplotype maps [12], the present work established a framework for examining the spatial distribution of closely related species and revealing potential hybrid zones.This analysis treated two or more species as a single entity, so that their haplotypes were included in the same haplotype network and followed the same PCoA color scheme.However, specimens from different species were then displayed on separate maps to clearly locate areas of introgression (e.g., S3 Fig) .Haplotype networks were used to examine relationships among the haplotypes detected in each species.The R package pegas (rmst function) was used to construct minimum spanning trees [21] with each haplotype represented by a circle whose size was proportional to the number of records.Numbers over the connections among haplotypes indicate the number of mutational steps separating them.These haplotype networks followed the color scheme established in the PCoA, which is the same than that employed for the haplotype maps (e.g., S4 Fig) .Comprehensive understanding of the genetic structure of a species requires both the quantification of diversity and knowledge of its distribution across space.A bivariate bubble plot was employed to display the relationship between nucleotide diversity (π) [22] and G ST [23], a measure of regional variation, for each species (e.g., S5 Fig) .The function "nuc.div" of the pegas R package [24] was used to estimate nucleotide diversity.G ST is defined as D ST /H T , where DST = (H T -HS)/H T , and H T is the mean p-distance among all sequences while H S is the mean p-distance among sequences of the single populations.G ST ranges from -1 to 1.Because negative values typically represent artefacts due to low sample size [25], we set them to 0. Species represented by a single haplotype possessed a D ST value of 0 and an undefined G ST , but G ST was set to 0 in these cases as a species with a single haplotype lacks structure.Records were binned to cells of differently sized grids.It needs emphasis that G ST values depend on the size of the grid cells and on the arbitrary placement of the grid on the geographic map.To overcome these limitations, G ST was computed for five cell sizes (i.e., 100x100, 200x200, 300x300 and 400x400 km 2 ) and replicate values were computed by moving cell centers to cell vertices.This approach produced 8 values for G ST and the average of these values was retained for each species.G ST was computed only for species having more than 10 specimens included in at least two areas, with no less than three specimens in each.Similar lower bounds were employed to estimate spatial structure in Scalercio et al. [26].Following Dincă et al. [10], estimates of sampling completeness were obtained with the iNEXT R package [27].This approach generates accumulation curves [28] that can be used to compare the asymptotic value with observed haplotype diversity.Because the accuracy of these estimates depends on sampling intensity, only species with more than ten specimens were analyzed.Following this approach, the number of specimens (N), the number of observed haplotypes (H), the estimated fraction of haplotype diversity retrieved (R), and the number of additional haplotypes which likely remain to be sampled (L) were computed for each species in the Atlas.In addition to estimates of sampling completeness, π, and G ST , Nei haplotype diversity (h) [29], the maximum intraspecific distance, and the nearest neighbor distance were also computed and included in S1 Appendix.

Results and discussion
Taxonomic systems are dynamic, experiencing updates as knowledge of evolutionary relationships increases.This study was founded on a checklist assembled by D'Ercole et al. [8] which was updated based on genomic evidence [13] to produce a new reference list (S1 Appendix).The resulting new checklist includes 72 changes in taxonomic status-16 new names (9 species, 7 genera), 55 hierarchy changes (12 synonymies, 14 taxa moved to a lower rank, 29 taxa elevated to a higher rank).These changes impacted the taxonomic status of 158 North American butterfly species.In addition to these changes, 15 species were transferred between existing genera.
By examining the distribution of genetic diversity in 619 species, this study revealed three main phylogeographic patterns (S1 Appendix).Contrary to findings for Europe, where formerly glaciated areas tended to show low genetic variation [12], high genetic diversity and structure were particularly evident in the northwest (i.e., Alaska, Yukon, British Columbia) (e.g., Erebia pawloskii, Boloria frigga, Oeneis uhleri).Because Beringia in the north was a glacial refugium throughout much of the Pleistocene [30][31][32], populations of numerous species likely experienced recurrent isolation in this region, contributing to the observed diversity [32].This pattern could also be influenced by other factors.Previous work on North American butterflies has shown that sharing of mitochondrial haplotypes is prevalent among closely related species in the north [8].Consequently, it is probable that a portion of the diversity in this region derives from hybridization and introgression of mitochondrial lineages, rather than through intraspecific diversification.Notably, analysis of sequence variation for European butterflies [10,12] indicated the impact of barcode sharing is less than noted in this study [10,12].It is possible that the varied landscape of Europe [33] limited genetic exchange among populations during the recurrent glacial cycles [34].The southernmost regions of United States displayed high genetic diversity and structure (e.g., Thorybes pylades, Calycopis cecrops, Hermeuptychia sosybius).This pattern conforms not only with the traditional model of genetic diversity described for the European biota [35,36], but also with recent genetic findings pertaining to west Palearctic butterflies [8,12].Second, species occurring in proximity to the western mountains typically showed both high genetic diversity and relatively high spatial structure (e.g., Argynnis cybele, Neophasia menapia, Polites draco).This agrees with patterns of genetic diversity observed for the cold-adapted species in Europe, where several differentiated endemic lineages are found in correspondence with the major mountain chains (i.e.Pyrennees, Alps, and Carpathians).These lineages often result from multiple centers of glacial survival in the perialpine areas [37,38].This pattern could also reflect dispersal from glacial refugia and the aggregation of lineages near barriers [32,39] or relaxed natural selection on hybrid populations.Hybrid zones are frequently localized in regions with restricted gene flow and low population density where the negative effects of outbreeding depression are minimized [26,40,41].The present results support this prediction as many species distributed along the western mountain chains share their barcodes with at least one other species (e.g., Icaricia lupini, Euphydryas anicia, Argynnis zerene).Third, although most species displayed either high genetic diversity/ structure, or low diversity/structure, a considerable number showed a modest increase in genetic structure when compared to genetic diversity (e.g., Euphilotes enoptes, Limenitis arthemis, Phyciodes cocyta).As most species in this category displayed barcode sharing, it is probable that hybridization generates hotspots of diversity where different lineages congregate, a situation that results in a lack of genetic structure.
Aside from providing a glimpse of large-scale patterns of intraspecific genetic diversity, this study enables exploration of unique patterns displayed by single species linked to differing biological attributes and environmental scenarios [36,42] (S1 Appendix).For instance, while the primary response to glacial advances for species in western North America was either a retreat to southerly refugia or to the northern Beringian refuge [39], the present analysis highlights a few species (e.g., Euchloe ausonides, Pieris marginalis, Boloria chariclea) that likely retreated into micro-refugia on the margins of the west coast (i.e., Vancouver Island, Haida Gwaii, Alexander Archipelago).This agrees with past evidence supporting the existence of smaller refugia located off the coasts of British Columbia and Alaska [39].Similar evidence is also found in Europe, where extra-Mediterranean source populations north of major mountain chains (i.e., Pyrenees, Alps, and Carpathians) helped to repopulate glaciated regions [43,44].
Studies of sequence variation have significant implications for conservation programs as they can provoke taxonomic revisions, impacting the fundamental units of conservation.The genus Celastrina provides an example as its taxonomy has been volatile over the last 30 years.Scott (1992) [45] viewed all North American specimens as belonging to the Eurasian species C. argiolus.However, subsequent workers argued that the North American Celastrina was composed of multiple species [46,47] and eventually, nine North American endemics were recognized [48] The present results indicate that all North American Celastrina species show close barcode congruence and share haplotypes with at least one other species.This result reinforces conclusions from detailed morphological and phenological analysis [49] which suggest that the supposed Celastrina species are actually conspecific (S1 Appendix).Another example of taxonomic and evolutionary interest involves the genus Erynnis.Two species, E. propertius and E. horatius, are allopatric and display clear morphological (including genitalia) and ecological differences [50].While E. propertius exists in the far west (from southwest Canada to north Baja California), E. horatius is distributed in central and eastern North America (from New Mexico and Colorado up to the east coast).Zakharov et al., [51] detected E. horatius haplotypes in many populations of E. propertius and concluded that introgression from the western E. horatius to eastern E. propertius represented the most likely explanation.However, the increased sampling coverage in this study reveals a different scenario.A third, closely related species, E. meridianus, ranges between these two species (from south Nevada through much of Arizona and New Mexico to Texas) and displays the same haplotypes introgressed from E. horatius to E. propertius.It is probable that the present configuration of mitochondrial lineages arose with E. meridianus as an intermediary.More specifically, the ancestral species of E. meridianus might either have transferred its haplotypes to both E. horatius and E. propertius, or it might have acted as a stepping-stone from E. horatius to E. propertius.Additional barcode coverage coupled with studies on nuclear markers should resolve this case.
This phylogeographic assessment for all species in a large phylogenetic group demonstrates how the use of a well-curated reference library can aid understanding of biodiversity and its connection to underlying processes.Moreover, it can be used to improve the current classification system and to reveal cases of evolutionary interest such as hybridization and genetic introgression.It is worth noting that a similar atlas can be used as a routine practice to improve the quality of DNA barcode reference libraries of poorly known taxonomic groups where the impact of operational errors (e.g., inaccurate taxonomy, misidentified specimens) is large.