Two extended haplotype blocks are associated with adaptation to high altitude habitats in East African honey bees

Understanding the genetic basis of adaption is a central task in biology. Populations of the honey bee Apis mellifera that inhabit the mountain forests of East Africa differ in behavior and morphology from those inhabiting the surrounding lowland savannahs, which likely reflects adaptation to these habitats. We performed whole genome sequencing on 39 samples of highland and lowland bees from two pairs of populations to determine their evolutionary affinities and identify the genetic basis of these putative adaptations. We find that in general, levels of genetic differentiation between highland and lowland populations are very low, consistent with them being a single panmictic population. However, we identify two loci on chromosomes 7 and 9, each several hundred kilobases in length, which exhibit near fixation for different haplotypes between highland and lowland populations. The highland haplotypes at these loci are extremely rare in samples from the rest of the world. Patterns of segregation of genetic variants suggest that recombination between haplotypes at each locus is suppressed, indicating that they comprise independent structural variants. The haplotype on chromosome 7 harbors nearly all octopamine receptor genes in the honey bee genome. These have a role in learning and foraging behavior in honey bees and are strong candidates for adaptation to highland habitats. Molecular analysis of a putative breakpoint indicates that it may disrupt the coding sequence of one of these genes. Divergence between the highland and lowland haplotypes at both loci is extremely high suggesting that they are ancient balanced polymorphisms that greatly predate divergence between the extant honey bee subspecies.


Introduction
Genetic adaptation to different environmental conditions is a key process in evolution and speciation. However, identifying the genetic variants involved in adaptation and the underlying regulatory networks and biological mechanisms by which they impact fitness is challenging. There are relatively few instances where the genetic basis of environmental adaptation is well understood [1]. Some examples where genetic variation has been linked to locally adaptive phenotypic differences are the pigmentation differences in the rock pocket mouse driven by variation in at least one melanocortin receptor [2], the industrial melanism of the peppered moth driven by a transposal element in the cortex gene [3,4] and the adaptive evolution of populations of sticklebacks [5], including the pelvic reduction driven by recurrent deletion of a tissue specific enhancer [6].
Studies of highland populations have proven informative for understanding the genetic basis of adaptation [7][8][9][10][11]. First, they inhabit different environments in close proximity to lowland populations. Genetic exchange or recent ancestry between the highland and neighboring lowland populations is therefore likely to result in low differentiation in neutrally evolving markers between highland and lowland populations, making it easier to distinguish loci involved in local adaptation. Secondly, analysis of interconnected populations spanning different habitats affords the opportunity to determine how processes such as convergent evolution [12] or adaptation from standing variation [13] have contributed to their adaptations. For example, genomic analysis of genetic adaptation of human populations living at high altitudes on three continents have revealed that convergent evolution involving selection on variants in different genes related to adaptation to hypoxia are responsible for their adaptations [8,[14][15][16]. Conversely, analysis of freshwater adaptation in sticklebacks has implicated that a suite of genetic variants are present in multiple geographically distant localities, implicating selection on standing variation [5,17].
Most genes demonstrated to be involved in adaptation have effects on morphology or physiology [1,18]. However, some studies have also identified putatively adaptive variation involved in differences in fitness related to behavior [19], such as genes that control variation in burrow architectures of Peromyscus mice [20]. Genome comparisons allow us to identify genes involved in local adaptations to different habitats where the phenotypic nature of these adaptations is not necessarily well characterized [18]. Many adaptations in social insects are likely to be behavioral [21]. In particular, honey bees have sophisticated cognitive abilities, which are needed to efficiently perform the diverse set of tasks necessary for optimal functioning of a colony. Furthermore, efficient foraging requires recognition of floral scents, location of flowers, association with a food reward and advertising food sources with a characteristic dance [22]. Optimal foraging strategies are likely to be variable between habitats and subject to selection [23].
The honey bee Apis mellifera has a large native range incorporating a wide variety of habitats. There is substantial variation in morphology, physiology and behavior across this range, which are likely to represent local adaptations [24,25]. The mountain regions of East Africa are highly complex in their topography with scattered high mountains, most of them of volcanic origin and comprising of three distinct vegetation belts: montane forest, subalpine heathlands and an alpine zone [26]. The average annual temperature of mountain rain forest habitats at 2600 m altitude is only 11.2˚C, remarkably different from lowland savannah regions below 1500 m (20.8˚C) [24].
The honey bees found in the mountain forests differ in phenotype compared to the bees of the savannah. They have been designated as a separate subspecies A. m. monticola Smith 1961 [27], whereas savannah bees have been assigned to A. m. scutellata Lepeletier de Saint Fargeau 1836 [24]. Mountain and savannah bees can be distinguished on the basis of morphometrics, although the status of monticola as a distinct subspecies from scutellata has been a matter of debate [28][29][30][31][32]. Bees from colonies identified as monticola tend to be darker in color, larger and less aggressive than scutellata savannah bees [24,27,33]. Measurement of mating frequencies indicates that levels of polyandry in monticola honey bees are significantly lower than in scutellata [34]. Descriptions of the behavior of monticola honey bees suggest that they can fly at lower temperatures than scutellata colonies, conserve honey stores during times of reduced nectar flow by reducing brood rearing and are less prone to abandon their nests by swarming or absconding [24,27,32,35,36]. It is therefore likely that monticola honey bees possess adaptations for life in cool mountain forests.
The population history of the mountain bees is debated. The mountain refugia hypothesis proposes that mountain bees have survived as small and reproductively isolated populations for thousands of generations [29,31]. Such isolation can be expected to result in distinctly different patterns of genetic diversity compared to the widespread lowland bees. The results of a study based on mtDNA data supported this scenario [31]. However, a separate study of mtDNA and microsatellites did not identify genetic differences between the monticola honey bees and surrounding lowland populations [28], which could suggest that the phenotypes observed in monticola honey bees represent phenotypic plasticity [28] or that they are determined genetically but have not led to reproductive isolation. A perennial hybridization of monticola with scutellata in a transitory zone of altitude has previously been reported [24,27].
In this study, we compare whole-genome sequences from 39 worker bees representing two Kenyan mountain areas that are approximately 100km apart: Mt Kenya and Mau (Fig 1). Each locality includes unmanaged bees from neighboring highland forest and lowland savannah environments that are separated by approximately 1000m in altitude. We aim to clarify the evolutionary origin of these populations and the genetic basis of their adaptation to high-altitude habitats.

Population-scale sequencing of honey bees from highland and lowland habitats in Kenya
We mapped genome variation in the East African mountain honey bees (A. m. monticola) in order to infer the evolutionary history of the population and identify loci involved in adaptation to altitude. We sequenced 39 samples previously collected from four native feral populations [28]: Mt Kenya Forest (MKF, 2,300 m above sea level; n = 10), Mt Kenya Savannah (MKS, 1,100 m above sea level; n = 9), Mau Forest (MF, 2,900 m above sea level; n = 10) and Mau Savannah (MS, 1,900 m above sea level; n = 10) (Fig 1; S1A Table). The bees collected from the highland forest localities are referred to as A. m. monticola (hereafter monticola), whereas lowland savannah samples are referred to as A. m. scutellata (hereafter scutellata). Samples from the highland and lowland regions could be separated by morphometrics in a previous study [28]. We produced 490 million short-reads to generate a 463× dataset spanning the sixteen assembled nuclear chromosomes, unplaced contigs and mitochondrial genome (see Materials and methods). Some unplaced contigs and the mitochondrion in particular were sequenced at extremely high coverage, inflating the average coverage. The assembled nuclear genome was sequenced to 10.4× per sample (80% of the genome was covered by >5× per sample) and unless indicated otherwise, the results below refer to analyses of this data (S1A Table).
We next called single nucleotide polymorphisms (SNPs) across the 39 samples. 8.6 million biallelic SNPs were retained after filtering and imputation (Table 1). For some comparative analyses, we expanded the dataset to include previously published Kenyan honey bee genomes (n = 11; data from [37]) and a worldwide sample of honey bees (n = 98; data from [25]; S1B and S1C Table). This makes it possible to position the Mt Kenya and Mau samples among other honey bee populations and detect uniquely divergent regions in highland genomes. The expanded dataset was produced using the same methods and spanned 13.6 million SNPs. The genome sequence of the Eastern honey bee A. cerana [38] was aligned against the A. mellifera reference genome in order to further facilitate assessments of divergence and distinguish between ancestral and derived variants. The corresponding A. cerana sequence was present at The monticola bee is associated with the isolated highland forests and the scutellata bee occurs in the surrounding lowland savannahs. High mountain peaks where monticola has been found are indicated with black triangles. Mount Kenya and Mau boxes indicate sample locations (grey = monticola; yellow = scutellata). Mt Kenya Forest (average sample elevation 2,300 m above sea level; n = 10), Mt Kenya Savannah (1,100 m asl; n = 9), Mau Forest (2,900 m asl; n = 10) and Mau Savannah (1,900 m asl; n = 10). https://doi.org/10.1371/journal.pgen.1006792.g001 Adaptation in honey bees 78% of the A. mellifera genome. Genome-wide divergence between the two species was estimated as 6.9%.
The mountain refugia hypothesis suggests that monticola populations are small relicts that have been reproductively isolated from lowland scutellata bees [31]. This hypothesis makes predictions about genetic variation in highland bees compared to lowland bees. Assuming that small populations have comparatively low effective population sizes (N E ), we can expect lower levels of neutral variation in monticola than in scutellata under equilibrium [39,40]. The number of SNPs detected within each of the four populations ranged between 5.5-5.9 million, corresponding to nearly identical estimates of nucleotide diversity (π = 0.65-0.67%/bp), the population mutation rate (θ w = 0.75-0.78%/bp) and effective sizes of each population (N E = 470×10 3 -488×10 3 ) (Table 1). We do not observe reduced variation in highland bees.
The hypothesis also predicts that highland populations share a common ancestral population and evolutionary history separate from other bees [31]. Accordingly, we should expect highland genomes to diverge from lowland genomes. We therefore calculated genome-wide F ST between populations using the Reynolds et al. estimator [41]. F ST between the Mt Kenya and Mau monticola and scutellata populations range between 0.05 and 0.068 and they all group with other Kenyan and African bees (neighbor-joining tree; Fig 2A; Table 1). Among the Kenyan bees, the monticola populations cluster on a short separate branch in the population tree (Fig 2A). Likewise, average pairwise genetic distances (d XY ) split monticola and scutellata samples into different groups (Fig 2B). While this could indicate limited degrees of independent evolutionary history, we note that the excess distance between monticola and scutellata samples is very small: d XY is only 1.02x higher between any random pair of highland and lowland samples compared with samples drawn within either habitat. We thus find that monticola populations do not diverge strongly from other Kenyan bees and that highland and lowland bees appear to be nearly undifferentiated.
It should be noted that all Kenyan populations cluster to the exclusion of the Nigerian adansonii and South African scutellata and capensis populations (the "sub Saharan" bees from [25]). Some of this increased divergence may be artificial and result from technical differences in short-read sequencing and mapping technologies used to assemble the extended dataset (Illumina paired-end reads+BWA herein and in [37] vs SOLiD fragments+Lifescope in [25]). For instance, the mean F ST between the four Kenyan populations (Illumina) and the three European M-group (SOLiD) populations is 0.392, whereas it is 0.358 between the three African populations previously sequenced on SOLiD and the same European populations. We therefore estimate the magnitude of this "technology-bias" to be an increase in divergence of 9%, assuming that the distances should be the same. However, this bias does not affect our Adaptation in honey bees comparisons of the highland and lowland populations, which were all sequenced on the same technology and processed identically.
To determine the relationship between the closely related highland forest and lowland savannah bees, accommodating the possibility of contradictory genealogies across the genome, we inferred F ST and the corresponding population interrelationships in 10 kbp segments. In contrast to the whole-genome signal, we found that the most common pattern of relatedness in the genome groups populations by locality (40% of windows; Fig 2C). The pattern that Adaptation in honey bees groups populations by habitat is the least common (29%). By partitioning the inference by chromosome, we further found that the latter pattern is recovered at approximately the same frequency on all chromosomes (25%-31%) except on chromosome 9, where it is significantly enriched (49%; p<10 −5 ; Fisher's exact test; Fig 2D). The major pattern of relatedness across the genome is therefore consistent with exchange of genetic material between local highland and lowland populations, whereas signals that cluster the populations by environment are restricted to a smaller proportion of the genome. Taken together, these analyses suggest that the extant populations of highland and lowland honey bees have the same evolutionary origin and are not isolated from each other. Our results therefore disagree with the mountain refugia hypothesis.

Two distinct regions segregate between highland and lowland populations
Genetic diversity is nearly undifferentiated between highland and lowland populations. Nevertheless, highland genomes appear to contain a small set of loci that are different from lowland genomes. These are putative targets of natural selection. Variants that are shared between the geographically separated highland populations but absent in local and closely related lowland populations could be associated with adaptation to high altitudes. We calculated F ST (Weir-Cockerham estimator [42]) for every SNP segregating between highland samples and lowland samples in order to produce a high resolution map of differentiation across the full dataset (~29 bp/SNP) and detect such loci. The result corroborates the whole-genome estimates above. Divergence is low across the genome: genome-wide F ST is only 0.036 and 7.7 million (97%) of SNPs have F ST <0.1 (Fig 3A and 3B). The striking exceptions are two regions on chromosomes 7 and 9, hereafter called "r7" and "r9". Out of the 24,445 SNPs that segregate with F ST >0.5, only 4 occur on other chromosomes or outside of these regions ( Fig 3A). The same divergent regions are identified when the two highland/lowland population pairs are analyzed independently (S1 Fig). A strong association between highland and lowland habitats and these two chromosomal regions were detected in a genome-wide association study (GWAS) using PLINK, where SNPs within the r7 and r9 regions are clear outliers from the expected distribution of allele frequency differences between two groups as indicated by a Q-Q plot (S2 Fig).
We delineated the r7 and r9 regions by their first and last SNPs with F ST >0.5 (Table 2), respectively. r7 bridges across four scaffolds and is composed of two blocks that together span 0.573 Mbp close to the start of chromosome 7 (i and ii; Fig 3C). Although these blocks are separated by almost 0.9 Mbp and appear to be discontinuous, we suspect that the second smaller block is located on a misoriented scaffold in an ambiguous region of the current assembly. This scaffold (7.5) is minus oriented in the reference genome, whereas upstream scaffolds have unknown orientation. The high similarity in sample genotypes between the two blocks, as well as their shared gene family components (see below), suggest that it should be reoriented. r9 spans 3 scaffolds and 1.639 Mbp on chromosome 9 ( Fig 3C). F ST between highland and lowland populations is~0.7 and~0.3 across the r7 and r9, respectively ( Fig 3C; Table 2). They contain 111,161 SNPs in total, representing only 1.39% of the data, yet exclusion of these SNPs alone removes the split between highland and lowland samples and shifts the interrelationships of the four populations from clustering by environment to cluster by locality ( Fig 3D). It is therefore clear that these narrow but divergent regions stand out against the genomic background and influence the analyses of the genome-wide interrelationships (see Fig 2B above).
The distinct blocks of highly differentiated SNPs with clear boundaries are suggestive of the presence of non-recombining haplotypes with high levels of divergence between them ( Fig  3A). These patterns are unlikely to result from selective sweeps, which would be expected to result in a gradual decay of LD over shorter genetic distance. r7 contains 12,042 SNP S with F ST of 0.8-0.85 but only 2,866 SNPs with F ST of 0.5-0.8 ( Fig 3B). Likewise, r9 contains 3,587 SNPs with F ST of 0.65-0.7 and similarly to r7, no individual bin with F ST >0.5 contains more SNPs than the 0.65-0.7 bin in the region (Fig 3B). The regions therefore appear to be enriched for SNPs at a particular high F ST bin, indicating strong association between the segregating variants. One explanation for these patterns is that the haplotypes represent structural polymorphisms, such as inversions that prevent recombination occurring between them.
To further characterize r7 and r9, we counted the genotypes of every sample at all divergent SNPs (F ST >0.5; Table 2). Worker honey bees are diploid and at every such SNP, a sample can therefore be homozygous for the reference sequence allele (0/0), homozygous for the non-reference allele (1/1) or heterozygous (0/1). The reference sequence is derived from a US managed population and matches the lowland haplotype [43]. Across both regions, we found that 1/1 genotypes are significantly more frequent in highland bees than in lowland bees: 83.7% (n = 267,947) vs. 2.7% (n = 8,484) in r7 and 74.7% (n = 126,136) vs. 13.4% (n = 21,526) in r9, respectively, showing that highland bees have haplotypes that are strongly enriched for non-reference variants (p<10 −5 for both regions; Fisher's exact test; Fig 4). Notably, we detected several samples that appear to be nearly completely heterozygous across either region: these are heterozygous for >88% of genotypes (Fig 4). We detected a few outlier samples that are homozygous for the opposite haplotype compared to the majority of samples from either environment. The same samples are heterozygous or atypical at the two physically linked r7i and r7ii sub-blocks on chromosome 7 (S3A Fig), supporting the idea that they are closely located on the chromosome. This pattern pertains to other samples at the independently transmitted chromosome 9. We performed principal component analyses (PCAs) using the multidimensional scaling algorithm implemented in PLINK [44] to further evaluate these divergence patterns. The PCA was carried out for all SNPs within and outside of the r7 and r9 regions, respectively. In accordance with the F ST -based analyses, we found that divergence between highland and lowland samples was much higher at r7 and r9 than across the rest of the genome and that outlier and heterozygous samples clustered as predicted (S4 Fig). Honey bees have among the highest recorded recombination rates of any animal [45]. Continuous megabase-scale heterozygosity suggests that they have both a lowland and highland haplotype and that meiotic recombination between them is greatly suppressed.
The r7 highland haplotype (r7h) is completely fixed in the Mt Kenya highland samples and absent from the Mt Kenya lowland population, whereas we detect heterozygous samples or outliers in the Mau populations (Fig 4). The r9 highland haplotype (r9h) follows similar, albeit less extreme segregation. We estimate the average population frequencies of both r7h and r9h to be 93% across highland bees and 8% or 21% across lowland bees, respectively. Haplotype frequencies are strongly associated with environment (p<10 −5 for both regions; χ 2 test), corresponding to F ST values of 0.832 at r7 and 0.682 at r9. We performed coalescent simulations in order to determine the probability that the extreme differences in haplotype frequencies we observe between populations could occur in the absence of natural selection on these regions. We used ms [46] to model the evolution of a highland and lowland population and test this alternative scenario. We adopted a basic split model without subsequent gene flow between descendant populations and without recombination. Inclusion of these processes would homogenize genetic variation between the descendant populations. We applied population demographic parameters inferred from the data to model the split (see Methods) to simulate the evolution of 1 million independent loci across the genome using the same sample size as in our dataset (20 vs. 19 diploids). We then estimated F ST between populations using the same methods with the empirical dataset. The split was inferred to have occurred 28,410 generations ago and the average divergence between simulated populations was very close the empirical data (0.037 vs 0.036). However, F ST values as high as those observed for the r7 and r9 haplotypes (r7: This indicates that such levels of divergence between two populations are highly unlikely to occur by drift alone under this scenario. It is also important to note that we observe similarly extreme levels of divergence at the r7 and r9 loci in two independent highland/lowland comparisons (Mau forest vs. savannah and Mt. Kenya forest vs. savannah). There can be no direct contact between the two highland populations due to their geographic isolation, but gene flow can occur between them via the lowland populations, where frequencies of the highland haplotypes are very low. The pattern we observe, where the same haplotype variants at two loci are associated with the highland habitat in two independent comparisons is therefore indicative of selection favoring these haplotypes Adaptation in honey bees in highland environments. The possibility that these patterns could occur in the absence of selection can be ruled out.

Segregation patterns on unmapped fragments
About 13% (29 Mbp) of the honey bee reference genome is not placed on any chromosome. We assessed these sequences separately and detected 982 additional SNPs with F ST >0.5, distributed across 31 scaffolds/contigs (S2A Table). We scanned these SNPs for genotype and haplotype patterns consistent with those in r7 and r9 (Fig 4). We find that 30 of them can be assigned to either r7 or r9 based on the pattern of segregation at the SNPs: 16 fragments spanning 28.6 kbp and 682 SNPs match r7 and 14 fragments covering 62.5 kbp and 299 SNPs match r9 (S2A Table;  To verify the assignments, we scanned the paired-end data for evidence of split read-pairs that could anchor the unmapped fragments to the regions using Delly2 [47] in translocation mode. All 30 fragments assigned to either r7 or r9, but not GroupUn869, contain evidence that place them within haplotype scaffolds or close to their borders (S2B Table). Taken together, the results suggest that they may belong to the two regions, possibly extending them by up to 4-5%.

Highland haplotypes in other parts of Kenya and across the species' range
We detect homozygosity for the highland haplotypes in the single monticola bee collected at Mt Elgon and sequenced by Fuller and co-workers [37]  Highland and lowland haplotypes are highly diverged We assessed genetic differentiation between populations at the r7 and r9 regions. For this analysis, we first subsampled the Kenyan data to contain only the individuals that were homozygous for the major haplotype associated with either environment (Fig 4). We included sequence variation from other honey bee populations for either region in order to analyze the haplotypes in the context of global haplotype diversity within the species. For both regions, we find that the Kenyan lowland bees have haplotypes that are typical for African honey bees (F ST <0.10 against other African bees), while the highland haplotypes diverge strongly from African and other subspecies (F ST >0.5; Fig 5A and 5B; Table 3). Other population interrelationships are consistent with the whole-genome analyses above and previous results (Fig 2A  above; [25]) We next compared the genetic distance between the two haplotypes in order to estimate timing of their divergence. Across the full genome, d XY is 0.67% between any random pair of two haploid genomes (Fig 3D; Table 3). At r7 on the other hand, divergence at non-coding sites is 3.34% (95% CIs: 2.97%-3.49%, 2,000 bootstrap replicates) between highland and lowland haplotypes, dating the split between them to about 3.2(2.8-3.3) million years ago (Fig 5C;  Table 3) assuming a mutation rate of μ = 5.27×10 −9 mutations per base per generation and a one-year generation time. For r9, divergence is 1.34% (95% CIs: 1.28%-1.36%, 2,000 bootstrap replicates; Fig 5D), corresponding to the haplotypes having diverged about 1.28 (1.22-1.30) million years ago. These molecular clock estimates suggest that the r7 and r9 highland haplotypes have originated independently but are both very old, possibly predating the diversification of modern honey bee populations and the colonization of their current ranges by hundreds of thousands of years [25].   Table). By comparing fixed variants between haplotypes against the corresponding sites in A. cerana, it is possible to estimate the number of derived changes that have occurred on each haplotype after the split from their common ancestor. We inferred that 66% of the 9,941 fixed mutations have taken place on the highland haplotype, indicating higher rates of fixation in this haplotype.
To assess functional evolution on each haplotype, we quantified the ratio between fixed non-synonymous and synonymous changes that has occurred on either haplotype since their common ancestor (S3 Table). In the r7 region, there are 560 fixed coding differences between highland and lowland haplotypes, 57% (n = 323) of which we infer to have occurred on the highland haplotype. Of the derived variants fixed on the highland haplotype, 44% of (142/323) are non-synonymous. On the lowland haplotype, only 28% (66/237) of the fixed derived variants are non-synonymous. The proportion of non-synonymous variants that are fixed on the highland haplotype is therefore significantly higher than the proportion fixed on the lowland haplotype (Fisher's exact test; p<10 −5 ). In the r9 region we only detect 28 fixed mutations. Out of these, 21 have occurred on the highland haplotype. Of the derived variants fixed on the highland haplotypes, 62% (n = 13) are non-synonymous. In the r9 lowland haplotype, only 28% (2 out of 7) variants are non-synonymous. These proportions show the same trend as the r7 haplotype, although they are not significantly different. It therefore appears that highland haplotypes have accumulated non-synonymous changes at a substantially higher rate.
The two divergent regions span many divergent genes and mutations that may alter protein function, making it difficult to identify the specific targets of selection in highland bees with full certainty. Both regions contain genes that influence honey bee worker behavior that we consider to be interesting candidate genes for mediating adaptation to the montane forest habitat. The r7 region includes genes encoding four octopamine receptors: AmOctβ1R (oa2), AmOctβ3R/4R (isoforms X2, X1 & X3) and AmOctβ2R (on r7ii), which together contain 12 derived non-synonymous mutations in the highland haplotypes (Fig 6A; S3 Table). Octopamines are biogenic amines and essential neurotransmitters, modulators and circulatory hormones in invertebrates. They interact with specific G protein coupled receptors to increase Ca 2+ or cAMP levels and modulate physiology and behavior in response to environmental stimuli [48,49]. In honey bee workers, octopamine increases responsiveness to sucrose and sensitivity to sensory inputs and regulates olfactory learning and memory formation [50][51][52][53]. The r9 region contains genes for encoding several isoforms of calcium/calmodulin-dependent Serine protein Kinase enzyme (CASK; LOC411347; isoforms identified with BLAST; Fig 6B). CASK interacts with a second Ca 2+ /calmodulin kinase, CaMKII, in a fundamental pathway for memory formation that is shared between humans, fly and honey bees [54][55][56]. We therefore hypothesize that one or both of the divergent haplotypes contain changes to genes that underlie adaptive foraging behaviors at high altitudes. Future work could focus on identifying behavioral differences in honey bees bearing contrasting haplotypes at the r7 and r9 loci.

Genomic signal of divergent haplotype blocks is unaffected by stringency of SNP calling
We observed reduced mapping coverage in many locations across the r7 highland haplotype compared to the lowland haplotype (Fig 6A). For r7h samples, the average genotype depth was 9.2× (0.90x the genome average), with 7% of genotypes missing in the original FreeBayes SNP haplotypes; diamond symbols are d XY for genes (centered at gene-body mid-points); yellow diamonds are octopamine receptors; ii) four-way population interrelationships as in Fig 2C and 2D; iii) genetic diversity on r7h relative to r7l; iv) average call. Among F ST >0.5 outlier SNPs, 2% of genotypes were missing. For r7l samples, the average genotype depth was 11.9× (1.13x the genome average), with 0.5% missing genotypes. Reduced short-read coverage can be expected due to high genetic distance between r7h and the reference genome and the difference in mapping depth is smaller for the less divergent r9 region (Fig 6B). Both regions contain a few regions with very high mapping depth that are shared between both haplotypes (Fig 6A and 6B).
Structural variants such as duplications/deletions that segregate between populations and the reference genome can alter mapping depths and result in incorrect genotyping. To assess the influence of mapping depths for the results, we analyzed a dataset of SNPs where a set of strict filters had been applied. We masked all SNPs and base pairs in the genome which had <30% or >30% read depth, compared to the average mapping depth, or where <50% of samples had been genotyped by FreeBayes. Filters were applied per environment (n = 20 highland vs. n = 19 lowland bees), for each subsample of individuals that were homozygous for either haplotype (Fig 4) or across the Kenyan bee dataset as a whole.
The filters retained~80% of bases across the whole genome and full dataset (156 Mbp/200 Mbp) and for the r9 region (S7A Fig). At r7, 62% of the region passed the filter in the highland subsample (vs. 80% in the lowland subsample; S7A Fig). Of the original outlier SNPs with F ST >0.5, 7,205 (45%) and 5,134 (61%) were retained for r7 and r9, respectively. For both r7h and r7l subsamples, the resulting average genotype depth was~0.95x the genome average and <0.1% of genotypes were missing. We then re-estimated and compared levels of diversity across the whole genome and the r7 and r9 haplotypes, with or without these extra filters. For the same regions, we also compared F ST between all highland and lowland bees and re-inferred haplotypes for r7 and r9 (as in Fig 4). The results are highly congruent between datasets (S7B- S7D Fig). We therefore conclude that poorly mapped regions do not drive the patterns of diversity and divergence that we have inferred across r7, r9 or the genome.

Haplotype breakpoint assessment
Detection of haplotype breakpoints can help to disentangle the nature of putative structural variants. In fly and mosquito, inversion breakpoints have previously been linked to crossover between repeated sequence and mobile elements [57,58]. We scanned the genomic regions around the outermost outlier SNPs of each region ( Table 2) for patterns of divergent read mapping and repeated motifs. These SNPs occur close to scaffold borders and it is possible that the genome assembly is incomplete for these regions.
For r7, we find that the first outlier SNP (pos. 11,056 bp) occurs in a 500 bp region (pos. 11,000-11,500 bp) where scutellata reads map normally but very few monticola reads map (S8A Fig). The monticola reads that contain the first outlier SNP are truncated by BWA to~36 bp and have mate pairs that are either unmapped or mapped to scattered regions in the genome, indicating potentially aberrant alignment. Regular monticola read mapping resumes at pos.~11,420 bp. The region overlaps with repetitive sequence containing three iterations of a 176 bp AluI-like monomer with starting positions 8,814, 10,709 and 10,885, respectively (S8A Fig). This AluI-like element has been experimentally estimated to be a common repeat at honey bee telomeres [59,60]. The two latter motifs correlate with the extremely high mapping per-haplotype sample mapping depth for r7h (grey lines) and r7l (yellow bars) with three >10× mapping peaks shared between highland and lowland haplotypes indicated with stars (per-haplotype average coverage 131×, 13× and 13×, respectively); v) the difference in mapping depth after normalizing across the whole chromosome. Scale bar indicate scaffold coordinates and orientation (? = unknown orientation) in the reference genome sequence. (B) The corresponding data for the r9 region. Shared mapping peak at 13× per-haplotype coverage indicated with star. https://doi.org/10.1371/journal.pgen.1006792.g006 Adaptation in honey bees depth observed in our data (Fig 6A). The average per-sample coverage across three 100 bp windows between positions 10,700-11,000 is 8,300×, approximately 830 times the per-sample genome average (10×). This suggests that many AluI-like short reads from across the genome have been mapped to these motifs. Within the same region, we identify a 26 bp motif with high probability ("aattgataaaggaagggaggaagagg"; p<6.20 x 10 −29 ) using MEME suite [61]. It is repeated 65 times between positions 8,942-10,667 and has high similarities towards a Winged Helix-turn-Helix (HTH) DNA binding protein motif likely containing the optix transcription factor binding site ("tgata", relative score = 1), as predicted by JASPAR [62]. Beside their roles in transcription, proteins with HTH-binding domains are involved in recombination and may cause rearrangements [63,64].
The corresponding downstream region of r7 that may contain a haplotype breakpoint occurs around position 1,511,853, where we detect the first outlier SNP in r7ii ( Table 2). This SNP and the subsequent divergent SNPs are located inside the third intron of the octopamine receptor AmOctβ2R gene (Fig 6; S8B Fig). Due to the strong linkage between r7i and r7ii (S3 Fig), we hypothesize that the first outlier SNP represents one end of the r7 haplotype region and that the 7.5 scaffold should be reoriented to join the octopamine receptor gene family into a continuous block. This region does not appear to contain AluI-like fragments or HTH motifs. As the7ii border is not as repetitive as the r7i upstream border, we designed a longrange PCR experiment to amplify the third intron of AmOctβ2R. If the intron contains the second breakpoint, we expect to be able to amplify the sequence in most reference-like scutellata samples but not in the rearranged monticola samples. We successfully amplified the The SNP-delineated borders of the r9 region occur very close to scaffold ends (<7 kbp). In both cases, the outmost SNPs are located within 500 bp from short read alignment gaps (approx. pos. 1,729,290 and 3,468,837) that are shared between highland and lowland bees, suggesting that the data may be incomplete for these regions and that the actual r9 breakpoints are not mapped.

No clear association with haplotype identity and body color
There is a tendency for monticola bees to be darker in color, whereas scutellata bees are more yellow, although color on its own does not distinguish highland and lowland bees and even varies within colonies [28,30]. We compared the color of our specimens with the identity of their haplotypes at r7 and r9 in order to determine if either of these loci controlled these differences in color (S9 Fig). There was no clear association with color at either of the loci. Five of the 19 specimens collected in the lowlands have uniformly dark abdomens. Of these, three are homozygous for the lowland haplotype at both loci, which would not be predicted if the highland haplotype is associated with dark abdomens. Three of the 20 highland specimens have uniformly dark abdomens. One of these is heterozygous at the r9 locus, whereas the others are fixed for highland haplotypes at both loci. It is therefore clear that there is a much stronger distinction between highland and lowland bees in the genome at r7 and r9 than there is in their color. This suggests that, while these loci may have some subtle effect on body color, they are likely associated with adaptations to highland habitats that have greater fitness consequences than differences in color.

Discussion
This comprehensive study of the genomes of mountain honey bees in Kenya reveals novel insights into their evolutionary history and population structure. Throughout most of the genome, mountain bees are not differentiated from neighboring lowland populations, indicating that they have a recent common origin or are experiencing gene flow (Fig 2). Our results therefore contradict the mountain refugia hypothesis [29,31] but are consistent with gene flow between the highland and lowland areas [24,27]. This evidence indicates that monticola should not be considered a distinct subspecies but rather a regional ecotype tied to mountain forests [32,65].
We identify two haplotype blocks, r7 and r9 (on chromosomes 7 and 9, respectively), which segregate strongly between highland bees at two distinct localities and neighboring lowland bees (Fig 3). These loci are extreme outliers in terms of levels of genetic differentiation between populations, and compared to the rest of the genome and to coalescent simulations (Fig 3, S2  and S5 Figs). These blocks likely represent chromosomal rearrangements such as inversions that provide the basis for adaptation to the mountain forest environment. We find divergence between the haplotypes at r7 and r9 to be extremely high, suggesting that they began diverging much earlier than the split between extant lineages of honey bees. We hypothesize that candidate genes within these regions that have effects on foraging behavior could provide the molecular basis for local adaptation to montane forest habitats.

Inversion polymorphisms likely govern local adaptation in mountain honey bees
The presence of long genomic blocks with high divergence between highland and lowland populations with distinct boundaries indicates that the regions harbor two distinct haplotypes. This is further supported by analysis of patterns of segregation of highly differentiated SNPs in this region among individuals, which made it possible to identify individuals that were homozygous or heterozygous for the two diverged haplotypes (Fig 4). These patterns are indicative of a form of balancing selection, where haplotypes are locally adaptive coupled with repression of interallelic exchange of genetic material through recombination between haplotypes. The most likely mechanism to prevent recombination that would lead to the observed pattern of diverged haplotypes, is a structural rearrangement such as an inversion.
Only one putative inversion breakpoint (on the r7 haplotype) mapped to known sequence in the honey bee genome assembly. We are able to amplify sequence across this breakpoint in lowland bees, but not in most highland bees, consistent with an inversion. We infer that the other breakpoint for the r7 haplotype is located close to the end of chromosome 7. There are many Alu1-like monomers and a repeated helix-turn-helix motif in the vicinity of this region. The Alu family of mobile element has previously been associated with chromosomal rearrangements in mammals [66]. It is possible that Alu-like repeats cause chromosome instabilities and rearrangements also in other taxa and have been involved in the origin of the highland haplotype in mountain bees. We are unable to identify such sequence patterns for the r9 boundaries, which correlate more strongly with scaffold borders and incomplete mapping.
There are two non-mutually-exclusive ways in which an inversion could have an effect on phenotype and fitness. The first is that the inversion mutation itself has an effect on genome function. This could be because the breakpoint disrupts a transcribed region or has an effect on gene regulation. The second is that the suppression of recombination is selectively favored because it maintains associations between co-adapted alleles. Theory suggests that inversions that capture locally adapted alleles in two populations that are connected by gene flow can quickly spread in a population [67]. Recent studies suggest that adaptation by inversions or supergenes is more common than previously thought [68,69].
There are many examples of structural inversions that are involved in local adaptation. For example in sticklebacks [5] several inversions govern local adaptation to fresh water habitats, and are present in many geographically separated regions. Adaptation to environmental clines in Drosophila also correlates with the frequency of cosmopolitan inversions, providing a striking example of rapid evolution [70]. It has been shown that these clines have shifted as a response to global climate change and that increased cold-tolerance may have arisen several times despite the presence of gene flow. Among butterfly species, the mimetic wing patterning of Heliconius numata provides a compelling example of chromosomal rearrangements that lead to co-adapted gene complexes involved in adaption and speciation [71]. A similar mechanism controls Batesian mimicry in the Papilio genus [72,73]. A polymorphic inversion governs worker behavior and reproductive strategies in the fire ant Solenopsis invicta [74,75].
We observe the same haplotypes associated with highland habitats in the two localities studied here (Mau and Mt. Kenya) and in another published dataset from Mt. Elgon [37] whereas these haplotypes appear to be rare outside of montane forest environments. This shared pattern indicates that their high frequencies in highlands is the result of selection on standing variation, rather than selection on new mutations. It is similar to the pattern of genetic adaptation observed in sticklebacks, where the same set of genetic variants, including inversions, are associated with freshwater streams and lakes across the world, but are rare in the oceans that connect them [5].
Putative candidate genes for adaptation to montane forest A number of genes located within the r7 and r9 regions are potential candidates for environmental adaptation in highland bees. r7 contains four out of five of the honey bee octopamine receptor genes in the genome. These are all four cAMP-inducing octopamine β-receptors (AmOctβ1R to AmOctβ4R; Fig 6A; S3 Table), whereas the single Ca 2+ regulating AmOctαR1/ oa1 is located outside of the region, on chromosome 15 [76]. Experiments with microinjections have shown that octopamine can modify neuronal responses in different neuropils of the bee brain [52]. Octopamine signaling affects complex behaviors in bees and has a key role in social division of labor and foraging [77]. Expression of octopamine receptors correlates with worker tasks and age and experimental application of the amine induces foraging in nurse bees [78][79][80]. Increased octopamine levels in honey bees positively affects scouting for new food sources or new nest sites [81]. In particular, octopamine plays a major role in olfactory learning and memory formation in the honey bee [50], important tasks for adapting to different environmental conditions. Interestingly, octopamine has also been shown to be important for stabilizing signaling integrity during hypoxic and thermal stress in other insects [82,83]. The four octopamine receptor genes located within r7 have many fixed differences between highland and lowland bees. Moreover, we infer that a rearrangement breakpoint is present in one of these genes that could potentially disrupt gene function (S8 Fig). We therefore hypothesize that genetic variants in the octopamine receptors, that exert their effects via mediating foraging behavior, are responsible for selective advantage of the highland version of the r7 locus in montane forests.
Another putative candidate for controlling honey bee behavior is the Ent2 gene (LOC55249) on r7, which has been associated with synaptic transmission and associative learning in Drosophila [84] and the Ca 2+ /calmodulin-dependent Serine protein Kinase (CASK) isoforms encoded by genes at one edge of the r9 region (Fig 6B; S3 Table). CASK acts together with CaM-KII to affect long-term memory formation in honey bees [56,85]. Differences in these genes could contribute to foraging performance in highland bees. However, we cannot rule out the possibility that the divergent haplotypes have broader functional implications as they include changes to genes with Drosophila orthologues involved in regulating chromatin (transcriptional activator protein Pur-β; LOC72639; r7), lipid (calcium-independent phospholipase A2-gammalike; LOC726656; r7) and polypeptide-folding functions (prefold in subunit 5-like; LOC411936; r9) as well as muscle development (myosin-2 heavy chain; LOC100576864; r9) [86]. Any of these could be important for adaptation, either independently, or as part of a co-adapted supergene complex, evolving in concert with other genes.
In addition to adaptive sequence evolution, highland haplotypes appear to have been affected by more genetic drift than lowland haplotypes. First, the levels of genetic variation are much reduced on the highland haplotypes compared to the lowland haplotypes (Table 3; Fig  6): by 76% for r7h vs r7l and by 28% for r9h vs r9l. Reduced diversity among highland haplotypes likely equates to lower N E in these haplotypes compared to lowland haplotypes, probably due to them being more geographically restricted. Second, highland haplotypes have accumulated considerably more derived non-synonymous variants than lowland haplotypes (S3 Table), which may indicate accumulation of slightly deleterious variants. This could reflect that N E of the mountain haplotypes may historically have been lower due to restricted distribution to particular habitats compared to the widespread lowland haplotypes and despite prevalent gene flow across the rest of the genome.

Ancient origin of inversions and the evolution of mountain bees
High divergence between the haplotypes at r7 and r9 suggests their origin is ancient. Divergence between haplotypes at these loci is substantially higher than divergence between the major honey bee lineages found on different continents. We estimate r9 to have evolved 1.3 million years ago (MYA) and the r7 3.2 MYA using a molecular clock. Both of these dates are considerably older than estimations of the emergence of extant populations of A. mellifera [25]. There are two main explanations for such an ancient origin. First, it is possible that the haplotypes were present in the ancestral population of A. mellifera, before the split of extant lineages. They could have been involved in local adaptation before modern lineages came to inhabit their current ranges in Europe, Africa and the Middle East. Despite phenotypic similarities to European bees [27], we have not detected the monticola haplotypes outside Africa (Table 3; S6 Fig). A second possibility is that the highland haplotype arose from introgression with another related species in the past. This scenario is inferred to be the case with the haplotype encompassing the EPAS1 gene in humans, which is responsible for high altitude adaptation in Tibetans and inferred to have arisen by adaptive introgression from archaic humans to modern humans [11]. In the case of monticola mountain bees, a potential donor population is not known. All other Apis species are found in Asia and their native distributions have not overlapped with that of A. mellifera until the beginning of the 20 th century when A. mellifera was introduced in East Asian countries.

Conclusion
Our studies of honey bee genomes from highland and lowland populations reveal patterns that are consistent with pervasive gene flow between them, with the exception of two large and divergent blocks on chromosomes 7 and 9. Haplotypes at these loci appear to represent long inversions that are strongly differentiated between populations from different habitats. These loci are reminiscent of supergenes that have been demonstrated to govern adaptation in several other species. Many genes within these blocks are linked to honey bee behavior. In particular, we identify a haplotype breakpoint that disrupts the transcript of an octopamine receptor, part of a family of genes involved in foraging and learning. We therefore hypothesize that these loci govern the behavioral traits that are characteristic for mountain bees and likely constitute local adaptations to the highland environment. High levels of divergence between haplotypes at both loci indicate an ancient origin, suggesting that they were involved in environmental adaptation before the dispersal of honey bees to their present geographic range.

Sample collection and DNA extraction
Female worker honey bees from each highland locality in Kenya (eastern slope of Mount Kenya and Eastern Mau Forest) and lowland samples from neighboring locations were collected as part of a previous study [28]. The highland bees are referred to as A. m. monticola (hereafter monticola), whereas lowland bees are referred to as A. m. scutellata (hereafter scutellata). The monticola sampling sites were closed canopy forest areas above 2000m, whereas the corresponding lowland scutellata samples were collected in savannah vegetation or agricultural land surrounded by savannah vegetation (Fig 1; S1A Table). We used the Maxwell Tissue DNA Purification Kit (Promega) to extract total genomic DNA from the thorax of single honey bees, each from a different colony. Images of individual abdomens were taken using a ZEISS Stereo Microscope unit Stemi 305 with an Axiocam 105 color (Fa. Zeiss, Germany).

Sequencing and read mapping
The 39 DNA samples were barcoded and 2x125 bp paired-end reads were sequenced on an Illumina HiSeq 2500 sequencer. Reads were mapped against v4.5 of the honey bee reference genome (Amel_4.5) [43] using the default settings in the BWA v0.7.12 aligner with the "mem" algorithm [87]. Read groups and duplicates were tagged and marked with Picard v1.118. Indel-realignment and quality score recalibration was performed with GATK v3.3.0 [88], using SNPs from Wallberg et al. [25]. These programs were used with default settings and parameters.
Two datasets were produced. The first dataset contained genome data only from the 39 new libraries and was used for most analyses comparing Kenyan highland and lowland bees. For the second dataset, we used 11 short read archives from [37] (Kenyan samples; NCBI project ID: PRJNA237819; Illumina reads) and 98 archives from [25] (worldwide samples; NCBI project ID: PRJNA236426; SOLiD reads) in order to expand the population sample and facilitate additional comparative analyses. The former dataset was mapped with BWA as above, whereas the latter dataset was mapped with Lifescope™ as in [25]. See S1B and S1C Table for detailed sample information.

SNP calling and imputation
We called single-nucleotide polymorphisms (SNPs) across all samples using the haplotypebased variant detector Freebayes v0.9.20-16 [89] for both datasets. We used the flags "-E 0", "-X" and "-u" to suppress construction of short multi-nucleotide haplotypes from closely positioned polymorphisms and to avoid making composite polymorphisms. We used the flag "theta 0.008" to better match the expected population mutation rate estimated in [25], as compared to the human default value (0.001). Putative SNPs were filtered for quality by accepting only biallelic SNPs with QUAL scores >50. In addition, we removed known problematic positions where a drone closely related to the DH4 individual used to produce the reference genome had been inferred to be heterozygous in [25]. As drones are haploid, they should never be heterozygous. Such SNPs are errors that indicate problematic regions in the assembly where genotyping is unreliable. This filter removed 61,157 SNPs. The procedure is detailed in [25]. The restricted dataset (n = 39 samples) contained 8,593,016 SNPs and the expanded dataset (n = 148) included 13,672,645 SNPs. BEAGLE v3.3.2 [90] was used to impute missing variants and phase haplotypes. We used the flags "iterations = 30", "nsamples = 10" and "lowmem = true" to increase accuracy and reduce memory usage as per the recommendations in the program manual.

Variant annotation
Gene models provided by the latest official gene set (OGSv3.2) [43] in GFF format were used to associate SNPs with genes and annotate synonymous and non-synonymous variants in coding sequences. In order to determine gene names, orthologues and putative functions, the models were cross-referenced with genes in Amel_4.5,the NCBI Annotation Release 103 (AR103) and a comprehensive set of Drosophila Flybase orthologues detected in [25,86]. Custom Perl scripts were used to parse the coordinates and label SNPs. We aligned the genome sequence of the Eastern honey bee A. cerana (v1.0; from Park et al. [38]) against the A. mellifera genome using the whole genome synteny aligner Satsuma v1 [91]. This allowed us to: i) root phylogenetic trees; and ii) use parsimony to distinguish between ancestral and derived variants at SNPs across the divergent haplotypes. In this method, the shared allele between A. mellifera and A. cerana is taken as ancestral and the non-shared allele is taken as derived.

Divergence estimates
We used the F ST statistic to estimate divergence between populations from pairwise differences in allele frequencies. For individual SNPs, we used the F ST estimator of Weir and Cockerham [42]. Outlier SNPs with high F ST compared to the genomic background served as a basis for detecting genomic regions segregating between highland and lowland bees. We applied the Reynolds F ST estimator [41] to produce whole-genome distance matrices between all populations, allowing us to determine whether divergent regions were associated with mutations in highland or lowland bees. Reynolds F ST was also used to compute divergence in 10 kbp windows along the genome in order to detect local support for conflicting genealogies between Kenyan populations. These statistics were estimated using custom Perl scripts. Population interrelationships were inferred from genetic distances using the neighbour-joining algorithm [92] as implemented in PHYLIP v3.696 [93] or SplitsTree v4.14.2 [94] using default settings.
Average pairwise differences were computed to estimate per-base genetic distance between all samples or between the highland and lowland haplotypes detected on chromosomes 7 and 9. These per-base genetic distances were formulated by Nei and Li [95] and are expressed as π when considering nucleotide diversity within populations and d XY when measured between populations or haplotypes. Divergence was calculated using custom Perl scripts. We applied a constant molecular clock to date the origin of the haplotypes from d XY estimated from presumably functionally neutral variants (those located outside of gene bodies). For these calculations, we used a mutation rate μ = 5.27×10 −9 mutations per base per generation, which was previously estimated from neutral divergence between A. mellifera and its sister species A. cerana [25]. We assumed one generation per year as in ref [25]. We estimated 95% confidence intervals by partitioning the haplotypes into 10 kbp windows and bootstrapping the data using per-window d XY estimates (2,000 replicates per region).
We performed a genome-wide association study (GWAS) to study associations between SNPs and the environment, using lowland and highland location for each sample as the case and control phenotype (S1 Table). We used the association (-assoc) function in PLINK v1.9 [44] to implement the test and plotted the observed associations over the expected assuming a random normal distribution in a Q-Q plot using the-adjust and-qq-plot flags. We also used PLINK to make a principal component analysis inside and outside the putative haplotype inversion regions using the multidimensional scaling algorithm (-mds). The regions were defined by the intervals provided in Table 2.

Diversity estimates
To measure genetic diversity within populations, we estimated π (see divergence section above) and θ w (Watterson's theta; the population mutation rate) per base for each population [96]. Locally reduced genetic diversity can be a signal of selection or indicate reduced effective population size at that locus. Relative levels of genetic diversity between highland haplotypes (h) and lowland (l) haplotypes on chromosomes 7 and 9 was computed using the equation: Strongly negative values indicate reduced variation in highland haplotypes. We used the population mutation rate (θ w ) and the honey bee mutation rate (μ) above to estimate effective population size (N E ) using the following equation: Because honey bees are haplodiploid, we used the inheritance scalar of 3 rather than 4, used for diploids.

Demographic simulations
We used the program ms [46] to simulate the neutral coalescent under a basic population split scenario under a Wright-Fisher model, assuming no gene flow, no recombination and constant population size. In this scenario, a hypothetical ancestral honey bee population split into a highland and a lowland population without secondary contact. We used realistic empirical estimates of the model parameters. We first estimated the generation time since the split. Under a model of neutral divergence of two populations from a common ancestor, it is possible to convert F ST into an estimate of time since divergence, measured in units of scaled time, which we define as T = t/3N E , where t is the number of generations since the split. The factor of 3 is applicable for haplodiploids. T can be estimated using the following formula [41]: The average divergence between highland and lowland honey bees (F ST = 0.036) was used to estimate T as 0.01833. We next determined the number of independent loci to simulate. Excluding gaps and undetermined sites, the 16 honey bee chromosomes span 199.7 Mbp in total. Given the extremely high recombination rate across the genome [45], linkage between sites is expected to decay very rapidly and the genome should therefore contain many unlinked loci. r 2 was previously estimated to decay below 0.05 at distances greater than 200 bp in African honey bees [25]. We therefore assumed that the 200 Mbp genome would contain 1 million unlinked loci. Assuming the population mutation rate θ w to be 0.0082 per base pair, as estimated from all data (Table 1), we simulated a coalescent process across a theoretical 1 kbp locus (an arbitrarily chosen size) where θ w is scaled to 8.2. We repeated the simulation 1 million times, instructing ms to export 1 segregating site (SNP) per replicate. We sampled 40 chromosomes for one descendant population (n = 20 diploid workers) and 38 chromosomes for the other (n = 19 diploid workers). Since the coalescent tracks events back in time, the model is implemented using a population join (-ej flag) rather than a population split. The simulation was performed with a single command: ms 78 1000000 -t 8.2 -I 2 40 38 -ej 0.01833 2 1 -s 1 > ms.simulation.data Here, we specify to sample 78 chromosomes, repeat the analysis 1 million times, use the 1 kbp scaled mutation rate of 8.2 (-t 8.2), specify that the chromosome count for two populations is 40 and 38, respectively (-I 2 40 38), model a population join between population 2 and 1 at generation time 0.01833 (-ej 0.01833) and export one segregating locus per replicate (-s 1). This produced 1 million loci for which we calculated Weir and Cockerham F ST individually, as for the empirical data. We produced a distribution of these simulated estimates by partitioning them into F ST bins of 0.01.

Haplotype breakpoint assessment
Distal outlier SNPs were used delineate haplotypes in each divergent region. Read mapping was manually inspected around these coordinates using Tablet 1.16.09.06 [97]. Repetitive element motifs in the haplotype breakpoint region on chromosome 7 were detected with the BLAST [98] and MEME [61] web services.
Some unmapped fragments (aggregated into virtual chromosome GroupUN) had outlier SNPs and genotype patterns consistent with those detected at the divergent haplotypes. In order to collect additional lines of evidence that those candidate fragments may belong to these regions, we performed split-read and paired-end analyses delly2 [47] in translocation mode (-t TRA) for the GroupUN and chromosome 7 and 9 BAM files. The program was used with default settings and the output VCF file was parsed for links between unmapped fragments and either region.
In order to test a breakpoint experimentally, oligonucleotides were developed flanking the putative breakpoint located within Octβ2R (GB49696) in chromosome 7 between positions 1,511,194 to 1,513,199. Oligonucleotide Ex3in_Ex4_1fw TTTTCTTCTCCCCCTTCTTTTC and Ex3inEx4in_1rev TTCCACTATAACCGCTTTTCC were used in a standard PCR reaction setup using high-fidelity Q5 Taq polymerase (NEB Biolabs, UK) and the following cycle conditions: 98˚C 120 sec, 33x 98˚C 30 sec, 58˚C 25 sec, 72˚90 sec and 72˚C 4 min. PCR fragments were size separated on a 1.3% agarose gel (0.5x TBE) on 140 voltage for 2.5 h in 0.5x TBE buffer. Sequence information of a subset of these PCR fragments were obtained following subsequent standard cloning procedure with pGEM-T vector system (Promega, Germany) and double strand sequencing of clones (GATC Biotech, Konstanz, Germany).  Table. (B) A global sample of honey bees from [25]. Symbols as in Fig 4. Sample order as in S1C Table. (C) The r9 region for the African samples and subdivided for the two main scaffolds (scaffold 9.5; n = 2,231 SNPs; scaffold 9.7; n = 6,208 SNPs). (D) Divergence (d XY ) between the two South African (SA) scutellata samples that appear to be heterozygous for the r9h highland haplotype (scu_3 and scu_5) and the Kenyan (KE) bees that are either homozygous for r9h (upper plot) or r9l (lower plot). d XY between South African (SA) scutellata samples homozygous for the lowland haplotype and the same Kenyan bees indicated in yellow. d XY within either group of Kenyan bees indicated in black. (TIF) S7 Fig. Extended filtering and quality control. (A) Proportion of retained sites across the genome or divergent regions r7 and r9 after stringent filtering for mapping depth and sample coverage (see Results section for filters; r7h = r7 highland haplogroup; r7l = r7 lowland haplogroup; r9h = r9 highland haplogroup; r9l = r9 lowland haplogroup). (B) Average genetic diversity across the regions in A based on all data or after filtering. (C) F ST between highland and lowland bees based on all data or after filtering. (D) Haplotype patterns for r7 and r9 after filtering (as compared to Fig 4).