BIN overlap confirms transcontinental distribution of pest aphids (Hemiptera: Aphididae)

DNA barcoding is highly effective for identifying specimens once a reference sequence library is available for the species assemblage targeted for analysis. Despite the great need for an improved capacity to identify the insect pests of crops, the use of DNA barcoding is constrained by the lack of a well-parameterized reference library. The current study begins to address this limitation by developing a DNA barcode reference library for the pest aphids of Pakistan. It also examines the affinities of these species with conspecific populations from other geographic regions based on both conventional taxonomy and Barcode Index Numbers (BINs). A total of 809 aphids were collected from a range of plant species at sites across Pakistan. Morphological study and DNA barcoding allowed 774 specimens to be identified to one of 42 species while the others were placed to a genus or subfamily. Sequences obtained from these specimens were assigned to 52 BINs whose monophyly were supported by neighbor-joining (NJ) clustering and Bayesian inference. The 42 species were assigned to 41 BINs with 38 showing BIN concordance. These species were represented on BOLD by 7,870 records from 69 countries. Combining these records with those from Pakistan produced 60 BINs with 12 species showing a BIN split and three a BIN merger. Geo-distance correlations showed that intraspecific divergence values for 49% of the species were not affected by the distance between populations. Forty four of the 52 BINs from Pakistan had counterparts in 73 countries across six continents, documenting the broad distributions of pest aphids.


Introduction
Although aphids (Hemiptera: Aphididae) are important plant pests, their life stage diversity and phenotypic plasticity have constrained the development of effective taxonomic keys in the past [1,2]. With over 5,000 described species, the Aphididae represents the largest family within the Aphidoidea [3]. Most  coordinates, the collection sites were rendered using SimpleMappr.net (Fig 1). Aphids were collected by either beating foliage above a white paper sheet or by removing them from their host plant with a camel hair brush [55]. Collections were transferred into Eppendorf tubes prefilled with 95% ethanol and stored at -20˚C until analysis.

Identification
Aphids were identified using standard taxonomic keys [55,56]. Morphological characters were examined with a Labomed CZM6 stereomicroscope (Labo America) equipped with an ocular micrometer. Each specimen was identified to species-level based on morphology. This identification was later validated by DNA barcode sequence matches on BOLD.
Genetic Engineering (NIBGE), Faisalabad. Individual specimens were placed into 96-well format in a microplate pre-filled with 30 μl of 95% ethanol in each well. Each specimen was photographed dorsally using a 12 megapixel Olympus μ-9000 camera (Olympus America Inc., USA) mounted on a stereomicroscope. Specimen metadata (collection information and taxonomy) and images were submitted to BOLD under the project MAAPH, "Barcoding Aphid Species of Pakistan". DNA extraction, PCR amplification, and sequencing were carried out at Centre for Biodiversity Genomics at Guelph. DNA extraction followed Ivanova et al. [57] with voucher recovery protocol [58]. PCR amplification of the COI-5 0 (barcode region) [24] was performed using primer pair C_LepFolF (forward) and C_LepFolR (reverse) (http://ccdb.ca/ site/wp-content/uploads/2016/09/CCDB_PrimerSets.pdf) in 12.5 μL reactions that included standard PCR ingredients [59]

Data analysis
All barcode sequences were compared with those on BOLD and GenBank to ascertain sequence similarities. Sequence matches on BOLD were obtained using the "Identification Engine" tool whereas nBLAST (http://www.ncbi.nlm.nih.gov/blast/) was used on GenBank.
All sequences meeting quality standards (>500 bp, <1% ambiguous bases, no stop codon or contamination flag) were assigned to a BIN [38] (as of 18 January 2019). BIN discordance and BIN overlap reports were generated using analytical tools on the BOLD workbench. As a test of the reliability of species discrimination, the presence or absence of a 'barcode gap' [62] among species was determined on BOLD. A species was considered distinct when its maximum intraspecific distance was less than the distance to its nearest neighbor (NN). ClustalW nucleotide alignments [63] and neighbor-joining (NJ) analysis [63] were conducted in MEGA6. The NJ analysis employed the Kimura-2-Parameter (K2P) [64] distance model, with pairwise deletion of missing sites, and 1000 non-parametric bootstrap [65] replicates for the nodal support. Bayesian inference was performed in MrBayes v3.2.0 [66] employing the Markov Chain Monte Carlo (MCMC) technique. This analysis was based on representative sequences from 67 aphid haplotypes in the dataset extracted using DNaSP v5.10 [67] with Diaphorina citri (Hemiptera: Psyllidae) as outgroup. The data were partitioned in two ways; i) a single partition with parameters estimated across all codon positions, ii) a codon-partition in which each codon position was allowed different parameter estimates. The evolution of sequences was modelled by the GTR+Γ model independently for the two partitions using the ''unlink" command in MrBayes, and the best model was selected using Find-Model (www.hiv.lanl.gov/cgi-bin/findmodel/findmodel.cgi). The analyses were run for 10 million generations using four chains with sampling every 1000 generations. Bayesian posterior probabilities were calculated from the sample points once the MCMC algorithm converged. Convergence was defined as the point where the standard deviation of split frequencies was less than 0.002 and the PSRF (potential scale reduction factor) approached 1, and both runs converged to a stationary distribution after the burn-in (by default, the first 25% of samples were discarded). Each run produced 100001 samples of which 75001 samples were included. The trees generated through this process were visualized using FigTree v1.4.0.
BOLD was searched for barcode records for the 42 species encountered in this study. The resultant records were examined for BIN assignment [38] and used in a geo-distance correlation analysis to examine the relationship between geographic and genetic distance in each species. Two methods were employed in the latter analysis; the Mantel Test [68] was used to examine the relationship between geographic (km) and genetic (K2P) distance matrices, while the second approach compared the spread of the minimum spanning tree of collection sites and maximum intra-specific divergence [69]. The relationship between geographic and intraspecific distances was analyzed for each species with at least one individual from three or more sites. BINs recovered from Pakistan were also used in BIN-overlap analysis on BOLD to ascertain the incidence of unique BINs in a region, and to estimate overlap in BIN composition.

Results
Morphological analysis facilitated by the barcode data enabled identification of most specimens (774/809) and revealed 42 species, each representing an important crop pest (S1 Table). Another 32 specimens could be placed to a genus while the remaining three could only be assigned to a subfamily (Aphidinae). Overall, the 809 specimens included representatives of 30 genera and five subfamilies (Aphidinae, Calaphidinae, Chaitophorinae, Eriosomatinae, Lachninae) of the Aphididae (S2 Table). Members of the Aphidinae were dominant (n = 780) as the other four subfamilies were represented by just 29 specimens with Chaitophorinae and Lachninae each contributing one specimen (S2 Table). Among the determined genera, Aphis was most common one (n = 306), and was represented by eight identified and three undetermined species. Furthermore, Myzus was the second most frequent genus (n = 170), but it was only represented by one species, Myzus persicae. Rhopalosiphum, the third most abundant (83) genus, was represented by three major pest species (R. maidis, R. padi, R. rufiabdominale). Two species (Aphis astragalina, Periphyllus lyropictus) represented first records for Pakistan whereas two others (Lipaphis pseudobrassicae, Sarucallis kahawaluokalani) were known, but were recorded as Lipaphis erysimi and Tinocallis kahawaluokalani.
The 809 barcode sequences provided two or more records for 36 of the 42 species and single records for the rest (Tables 1 and S1). Maximum K2P divergence values within species ranged from 0-3.6% (mean = 0.1%), whereas distance values within genera were between 0.8-10.3% (mean = 7.4%), and within family (Aphididae) 3.7-17.3% (mean = 9.6%) ( Table 1). Barcode gap analysis examined the ability of barcodes to discriminate the 42 named species. With the exception of one species (Aphis gossypii), where the maximum intraspecific distance (3.6%) overlapped with A. affinis, the maximum intraspecific distance for each species was less than Table 1. Sequence divergence (K2P) in the COI barcode region for aphid species from Pakistan with more than three specimens, genera with three or more species, and families with three or more genera. This analysis only considers specimens that were assigned to a Linnaean species. its NN distance (Fig 2A, Fig 2B). This pattern did not change with increased sample size ( Fig 2C). Nearly all sequences (801/809) fulfilled the criteria for a BIN assignment, and they were placed in 52 BINs. The 774 specimens of the 42 species were assigned to 41 BINs; 38 showed BIN concordance (species members = BIN members), one species (Rhopalosiphum padi) was split (AAA9899, ACF2924), and two species (Aphis affinis, A. gossypii) were merged (AAA3070), while another, Aphis astragalina lacked a BIN assignment due to its low quality sequence (410 bp, 9 Ns). The 32 specimens lacking a species assignment were placed in 9 BINs-three for Aphis and one for each of the other six genera (Acyrthosiphon, Capitophorus, Forda, Hyalopterus, Macrosiphoniella, Schizaphis). The three specimens only identified to a subfamily were assigned to two BINs. NJ analysis (Fig 3) and Bayesian inference (BI) (Fig 4) supported the monophyly of each of the 52 BINs. The NJ and BI also discriminated the species or genera that either lacked (Aphis astragalina) or shared BINs (Aphis gossypii, A. affinis), as they formed distinct branches on the NJ and BI trees (Fig 3, Fig 4).

Distance
Geo-distance correlation analysis for 37 species was conducted by including an additional 5,067 sequences from conspecific individuals deposited on BOLD. This analysis showed that intraspecific divergences in 49% of the species were not affected by expanding analysis to consider their entire ranges (Mantel test, P>0.01) ( Table 2). The other 51%, that were affected by geographic range, included six species with BIN splits and eight with intraspecific divergence higher than >2%. The distributional patterns of aphids detected in Pakistan were further analyzed by examining BIN overlap between Pakistan and other countries, a comparison that involved 9,905 barcode records assigned to the 52 BINs. This analysis showed that 27 of the 52 BINs were recorded from four or more continents while eight were unique to Pakistan (Table 3). Except for Acyrthosiphon malvae and Uroleucon sonchi, all named species (40) analyzed in this study already had barcode records from multiple countries and continents (Table 3).

Discussion
Prior morphological surveys on the aphids of Pakistan have reported the presence of nearly 300 species [70][71][72]. Most of this work focused on specific geographic regions [73] or species attacking crops [74,75]. The current study surveyed aphids across major agricultural areas of Pakistan from a wider range of host plants, but primarily aimed to develop a barcode reference library for the fauna for the first time. Prior studies have begun to construct barcode reference libraries for some pest insect groups, such as aphids in Canada [29], leafminers in USA [76], fruit flies in Africa [45], food pests in Korea [49], thrips in Pakistan [28], looper moths in British Columbia [77], and mealybugs in China [52]. These libraries have stimulated the use of DNA barcoding in biosecurity and plant protection programs [78], but their use revealed the need for expanded parameterization of the libraries in order to improve their utility in diagnosing newly encountered species. Barcode libraries for two major pest insect groups in Pakistan, thrips and whiteflies, have progressed well [28,79], but other groups have seen little attention in this country so far. The current study not only expands on the prior efforts by barcoding another group of insect pests but also maps the global presence of pest aphids by using BINs.
Most aphids analyzed in this study were assigned to a species, but 35 specimens could only be determined to genus or subfamily level. In part, this difficulty reflected the fact that many important pest aphids are cryptic species complexes whose members are almost impossible to discriminate using morphological traits only [42] or their identification was beyond our expertise. For example, Aphis gossypii is a particularly challenging species complex [5,13]; it includes at least 18 morphologically indistinguishable species [80] likely explaining its wide range of primary and secondary host plants [81]. In the present study, DNA barcoding separated all eight species of the genus Aphis that were encountered. Although K2P distances between two species pairs; i) A. affinis and A. gossypii (1.4%), ii) A. astragalina and A. craccivora (0.8%) were low, both NJ analysis and Bayesian inference supported the monophyly of each species. The COI divergences in this study are similar to those reported in prior investigations   Prior studies revealed a strong correspondence between BINs and known species [39], in particular when reference specimens are identified by experts [84]. The same pattern was apparent in this study as 38 of 41 species were assigned to a single BIN. There were only two exceptions; R. padi was assigned to two BINs and A. gossypii-A. affinis were assigned to the same BIN. By comparison, when barcode sequences from conspecific specimens from other countries were considered, 12 of the 42 species showed a BIN split, an outcome which likely indicates incorrectly identified specimens [39]. Interestingly, the BIN (AAA3070) shared by specimens of A. gossypii and A. affinis from Pakistan included 31 additional species names when all records for it on BOLD were considered. Misidentifications and overlooked cryptic species may often cause conflicts between BIN and species morphology [85], but this can only be resolved by detailed taxonomic studies [86]. As well, heteroplasmy, hybridization, and incomplete lineage sorting can also cause BIN-morphology conflicts [87,88]. Furthermore, host affinities of sympatric populations, which have been observed in aphids, also expand intraspecific divergence [89], possibly resulting in BIN splits as we observed in R. padi.
Geo-distance correlations showed that the genetic divergence increased with geographic distance in almost half of the aphid species. Interestingly, the inclusion of conspecific sequences from other regions also increased the incidence of BIN splits. Since these analyses included all the conspecific sequences on BOLD, this outcome may reflect taxonomic errors [90]. Although spatial variation in conspecific sequences sometimes leads to increased intraspecific divergence values [91], it is usually too low to reduce the capacity of DNA barcodes to deliver reliable species identifications [47,92].
BINs are valuable in evaluating the geographic range of aphid species because they circumvent taxonomic uncertainties. In addition, BINs are gaining increased use to estimate species numbers [41] and to understand their distributions [52]. This analysis revealed that 27 of the 44 BINs with prior records on BOLD occurred on four or more continents, highlighting the broad ranges of many pest aphids. For example, BINs for Aphis fabae (black bean aphid), A. nerii (oleander aphid), A. craccivora (groundnut aphid), Acyrthosiphon pisum (pea aphid), Brachycaudus helichrysi (plum aphid), Brevicoryne brassicae (cabbage aphid), L. pseudobrassicae (turnip aphid), R. padi (oat aphid), R. maidis (corn aphid), Macrosiphum euphorbiae (potato aphid), M. persicae (peach aphid), and Therioaphis trifolii (alfalfa aphid) were all recorded from six continents. Interestingly, BINs associated with some of these species were also linked with other species on BOLD. For instance, AAA3070 was linked to 33 other species of Aphis while AAA6213 was associated with 13 species of Macrosiphum, and AAA5565 with nine species of Aphis. Although some of these cases may involve BIN sharing by different species [29], most cases likely reflect misidentifications.
The level of BIN overlap between the aphid fauna of Pakistan is much higher (85%) than levels for moths (44%) [93] and spiders (24%) [94]. This difference, may, be due, in part, to the fact that the winged alates of aphids can disperse long distances and their dispersal capacity with the broad availability of the crop plants that they attack [95]. Consequently, the number of aphid species known from Europe has increased by 20% in the last 30 years [96] reflecting their transport on produced fruits [52], coupled with shifting environmental regimes. Reports suggest that with every 1˚C increase, some 15 additional aphid species were recorded in Europe [97]. In North America, about 18% of all aphid species are introduced, and nearly half are plant pests [98]. Rapid developments in DNA sequencing are enabling the documentation of pest species and their distribution across the globe, but conflicts between taxonomic assignments and sequences have limited the full utility of these data. Given this difficulty, the BIN system provides an alternative path to document and track the pest species on a planetary scale.
Supporting information S1