The tempo of linked selection: rapid emergence of a heterogeneous genomic landscape during a radiation of monkeyflowers

—What are the processes that shape patterns of genome-wide variation between emerging species? This question is central to our understanding of the origins of biodiversity and the fundamental principles governing molecular evolution. It is becoming clear that indirect selection on linked neutral variation (hereafter ‘linked selection’) plays a pervasive role in shaping heterogeneous patterns of genome-wide diversity and differentiation within and between species, but we do not know how these signatures of linked selection evolve over time. To fill this critical knowledge gap, we construct the first chromosome-level genome assembly for the bush monkeyflower, and use it to show that linked selection has been a primary architect of heterogeneous patterns of lineage sorting, differentiation, and nucleotide diversity across a recent radiation. By taking advantage of the range of divergence times between the different pairs of monkeyflower taxa, we also show how the signatures of linked selection evolve as populations diverge: linked selection occurring within lineages acts to conserve an ancestral pattern of diversity after a population split, while its joint action in separate lineages causes a common differentiation landscape to rapidly emerge between them. Together, our study demonstrates how pervasive linked selection shapes patterns of genome-wide variation within and between taxa, and provides critical insight into how its singiature evolves during the first 1.5 million years of divergence. Significance What are the processes that shape patterns of genome-wide variation between emerging species? Because nucleotides are linked together on chromosomes, even neutral variants are impacted by selection on mutations that arise at neighboring sites. We show that this phenomenon, referred to as linked selection, was important in causing common patterns of differentiation to evolve between taxa during a radiation of monkeyflowers. This signature begins to emerge shortly after divergence begins, but it takes 1.5 million years to become pronounced. This result fills a critical gap in our knowledge about how genomes evolve, and it shows how linked selection shapes patterns of differentiation soon after a population split, which is critical to our understanding of divergence and speciation.

Abstract-What are the processes that shape patterns of genome-wide variation 26 between emerging species? This question is central to our understanding of the origins of 27 biodiversity and the fundamental principles governing molecular evolution. It is 28 becoming clear that indirect selection on linked neutral variation (hereafter 'linked 29 selection') plays a pervasive role in shaping heterogeneous patterns of genome-wide 30 diversity and differentiation within and between species, but we do not know how these 31 signatures of linked selection evolve over time. To fill this critical knowledge gap, we 32 construct the first chromosome-level genome assembly for the bush monkeyflower, and 33 use it to show that linked selection has been a primary architect of heterogeneous patterns 34 of lineage sorting, differentiation, and nucleotide diversity across a recent radiation. By 35 taking advantage of the range of divergence times between the different pairs of 36 monkeyflower taxa, we also show how the signatures of linked selection evolve as 37 populations diverge: linked selection occurring within lineages acts to conserve an 38 ancestral pattern of diversity after a population split, while its joint action in separate 39 lineages causes a common differentiation landscape to rapidly emerge between them. 40 Together, our study demonstrates how pervasive linked selection shapes patterns of 41 genome-wide variation within and between taxa, and provides critical insight into how its 42 singiature evolves during the first 1.5 million years of divergence.

44
Introduction 57 Since the first discoveries of abundant genetic variation in nature, evolutionary 58 geneticists have sought to understand the processes that shape patterns of polymorphism 59 and divergence within and between species (1-6). The neutral theory explained how 60 mutation and drift could shape genetic variation (3, 7). Despite work suggesting the 61 importance of non-neutral forces (5,(8)(9)(10), it has remained the default assumption of 62 most molecular genetic analyses, partly because of a lack of concrete, alternative models. 63 However, genome-wide studies have revealed heterogeneous patterns of genetic variation 64 that are inconsistent with purely neutral forces (11)(12)(13)(14). These genomic 'landscapes' can 65 be important confounders for work in other fields, such as speciation research (15-17), 66 and they provide intriguing clues in their own right into the ongoing evolutionary forces 67 shaping our own genomes. 68 Heterogeneous genomic landscapes are increasingly understood to be formed due 69 to the indirect effects of selection on linked neutral variation (hereafter, linked selection) 70 (13). For example, variable patterns of genetic diversity (π) have now been observed 71 across the genomes of a diverse range of plants and animals, and appear to have been 72 shaped by variation in the intensity of linked selection across the genome (14,18). This 73 occurs because natural selection reduces the amount of genetic variation available to 74 future generations at linked sites, similar to a reduction in local effective population size 75 (N e ) (19-23). It is important to note that all forms of selection, whether acting on 76 deleterious or beneficial mutations or on epistatic interactions, have linked effects. In 77 theoretical models of recurrent linked selection, its local intensity is determined by the 78 density of targets of selection relative to the recombination rate, such that larger 79 reductions in diversity occur in genomic regions with more frequent selection and less 80 recombination (19)(20)(21)(22). 81 Compared to patterns of within-species diversity, we know relatively little about 82 how linked selection shapes patterns of genome-wide differentiation between emerging 83 species (15) Although rates of differentiation and lineage sorting should be accelerated in 84 genomic regions that have experienced long-term reductions in diversity (24-26), we do 85 not know how long it takes for linked selection to generate heterogeneous patterns of 86 between-species variation (17). Unlike patterns of diversity (π), which are inherited from 87 an ancestral population and maintained in diverging taxa by ongoing linked selection, a 88 heterogeneous pattern of differentiation (F ST ) should emerge gradually owing to the 89 accumulating effects of lineage-specific linked selection following a population split. 90

116
A chromosome-level genome assembly, map, and annotation for the bush monkeyflower 117 To facilitate the analysis of genome-wide variation in this group, we constructed 118 the first chromosome-level reference genome for the bush monkeyflower using a 119 combination of long-read Single Molecule Real Time (SMRT) sequencing reads 120 (PacBio), overlapping and mate-pair short-reads (Illumina), and a high-density genetic 121 map (7,589 segregating markers across 10 linkage groups; Fig. S1; Table S1). Contig 122 building and scaffolding yielded 1,547 scaffolds, with an N50 size of 1,578 kbp, and a 123 total length of 207 Mbp. The high-density map allowed us to anchor and orient 94% of 124 the assembled genome onto 10 linkage groups, which is the number of chromosomes 125 inferred from karyotypic analyses in all subspecies of M. aurantiacus and M. clevelandii 126 (37). Analysis of assembly completeness based on conserved gene space (38) revealed 127 that 93% of 1440 universal single copy orthologous genes were completely assembled 128 (Table S2). Subsequent annotation yielded 23,018 predicted genes.

130
Variation in the extent of lineage sorting across the genome 131 As a first step toward understanding the processes that have shaped patterns of 132 genome-wide variation during this radiation, we inferred phylogenetic relationships 133 among its taxa. Rapid diversification is a hallmark of radiations and can result in 134 extensive phylogenetic discordance between genomic regions due to incomplete lineage 135 sorting (ILS) (39)(40)(41)(42). To do this, we sequenced 37 whole genomes from the seven 136 subspecies and two ecotypes of Mimulus aurantiacus (n = 4-5 per taxon) and its sister 137 species M. clevelandii (n = 3) ( Fig. S2; Table S3). Close sequence similarity allowed us 138 to align reads from all samples to the reference assembly with high confidence (average 139 91.7% reads aligned; Table S3). After mapping, we identified 13.2 million variable sites 140 that were used in subsequent analyses (average sequencing depth of 21x per individual, 141 Table S3). Relationships were inferred among the nine taxa using maximum-likelihood 142 (ML) phylogenetic analysis (43) based on three different datasets: whole-genome 143 concatenation and 500 kb and 100 kb non-overlapping genomic windows. 144 145 146 Figure 1. Evolutionary relationships and patterns of genome-wide variation across the 147 radiation. A) The black tree was constructed from a concatenated alignment of genome-wide 148 SNPs and is rooted using M. clevelandii. The 387 gray trees were constructed from 500 kb 149 genomic windows. The first number associated with each node or taxon is the bootstrap support 150 for that clade in the whole genome tree, and the second number is the percentage of window-151 based trees in which that clade is present. B) Levels of differentiation (F ST ), C) divergence (d xy ), 152 and D) diversity (π) within and among taxa based on the same 500 kb windows. For simplicity, 153 F ST and d xy are shown only for comparisons with the red ecotype of subspecies puniceus.

155
The tree topology obtained from the whole genome (concatenated) dataset ( Fig. 1)  156 confirmed the same phylogenetic relationships as previous analyses based on reduced-157 representation sequencing and five methods of phylogenetic reconstruction (27,33), and 158 were supported by patterns of clustering from principal components analysis (Fig. S3). 159 All seven subspecies formed monophyletic groups with 100% bootstrap support. 160 Relationships within subspecies puniceus were more complex, as the red ecotype formed 161 a monophyletic sub-clade within the paraphyletic yellow ecotype, reflecting the recent 162 origin of red flowers from a yellow-flowered ancestor (

Page 5
Although the whole genome phylogeny provides a well-supported summary of the 164 relationships among the taxa, window-based analyses revealed extensive phylogenetic 165 discordance at a finer genomic scale ( Fig 1A). (6%) trees showed the same taxon branching order as the whole-genome tree. While 169 some of this discordance could be generated by gene flow after divergence (33,42,44,170 45), our data indicate that the majority is due to incomplete lineage sorting (ILS). 171 Specifically, higher levels of discordance were observed at nodes that were separated by 172 shorter internode lengths (r 2 = 0.94, p < 0.001; Fig. S4 Window-based estimates of nucleotide diversity also vary markedly (π; Fig. 1D), ranging 207 from 0.09% to 1.26%, even though mean estimates were very similar among the ingroup 208 taxa (0.37% to 0.53%) and were only slightly lower in M. clevelandii (0.26%). 209 As with tree concordance, these summary statistics showed non-random patterns 210 of variation across broad regions of the genome (p < 0.005; Fig. 2; Fig. S5 Fig. S6; Fig.  211 S7). To account for the large magnitude of variation in these statistics across all nine taxa 212 (for π) or among the 36 pairs of taxa (for d xy and F ST ), we normalized the window-based 213 estimates using z-transformation and plotted them across the genome (Fig. S5). After 214 noting that the genome-wide patterns for each statistic were qualitatively similar among 215 all comparisons, we used principal components analysis to quantify their similarity and 216 extract a single variable (PC1) that summarized this common pattern (Fig. S5). These 217 analyses confirmed that patterns of genome-wide variation were highly correlated across 218 this group of taxa. and is effectively the same as taking the mean window-based scores for each statistic (r 2 228 between PC1 and mean scores > 0.995 for all three statistics). 229 Observing similar differentiation, diversity, and divergence 'landscapes' among 261 these taxa suggests that a common mechanism has been responsible for shaping patterns 262 of genome-wide variation across the radiation. Recent studies have observed correlated 263 genomic landscapes among related taxa, concluding that they were generated by a shared 264 pattern of heterogeneous linked selection (48-51). Indeed, if a region experiences a high 265 level of linked selection across the phylogenetic tree, then it will have both lower 266 diversity (π) within species and lower divergence (d xy ) between species, because 267 divergence is determined in part by levels of diversity in the common ancestor (15, 16, 268 22, 24). In agreement with this prediction, we observed a strong positive correlation 269 between PC1 d xy and PC1 π (r = 0.84), indicating that regions of the genome with lower 270 diversity tended to be less diverged in all taxa (Fig. 3, Fig. S9 for scatterplots and Fig.  271 S10 for results at 100 kb scale). Regions with reduced diversity also tended to show 272 higher differentiation (F ST ) (r = -0.84) and higher levels of tree concordance (r = -0.69), 273 both of which are predicted by models of linked selection (52). 274 275 276

289 290
We next identified factors that cause variation in the intensity of linked selection 291 across the bush monkeyflower genome. Its intensity is determined by the local density of 292 targets of selection relative to the recombination rate (19-22). Specifically, a higher 293 intensity of linked selection is predicted in regions of the genome that are enriched for 294 functional elements, because mutations are more likely to have fitness consequences if 295 they arise in areas that are gene rich. The local recombination rate modulates this effect, 296 because regions unlinked to functional sites evolve independently of them.

297
To test these predictions, we used our annotated genome and genetic map to 298 calculate the number of protein coding genes and the average recombination rate 299 (cM/Mbp) in each 500 kb window ( Fig. 2E-F; Fig. S5; Fig. S6). There was a strong 300 negative correlation between gene count and recombination rate (r = -0.40), leading to 301 large variation in the predicted strength of linked selection across the genome. In 302 addition, we observed strong correlations between PC1 π and both gene count (r = -0.84) 303 and recombination rate (r = 0.44; Fig. 3; Figs. S9 & S10), both of which indicate that 304 variation in the intensity of linked selection has shaped common patterns of diversity 305 across the genome.

306
Despite only having a direct estimate of gene density and recombination rate 307 variation from one subspecies (puniceus), the presence of a common diversity landscape 308 implies that the genomic distribution of these features has been conserved in all taxa after 309 being inherited from their common ancestor. This scenario is consistent with the recent 310 shared history of the group, and explains how a common pattern of heterogeneous linked 311 selection has become shared among them.

313
The signature of linked selection becomes stronger with increasing divergence time 314 Our analyses indicate that linked selection has shaped common patterns of 315 diversity and differentiation across this radiation. By using the different levels of 316 divergence between pairs of taxa, we next test predictions about how these signatures 317 should evolve over time (16). When a population first splits, levels of diversity (π) and 318 divergence (d xy ) are equal, so F ST will be zero across the genome ( Fig. S11 for cartoon 319 explanation assuming a simple model of allopatric divergence, no spatial structure, and 320 large N e ). As divergence proceeds, differentiation increases, but due to variation in the 321 intensity of linked selection across the genome, certain regions become differentiated 322 before others. During the early stages of divergence, when lineage-specific linked 323 selection has had a minor impact on the genome, patterns of differentiation should only 324 weakly mirror the footprint of historical linked selection. However, as the divergence 325 time increases, the cumulative effects of linked selection should strengthen the 326 relationships between F ST and π, gene density, and recombination rate. In contrast, the 327 strength of the correlations between π and gene density and recombination rate should 328 remain similar over time, because the shape of the diversity landscape is preserved by 329 recurrent linked selection despite new mutations arising in each lineage. 330 To test these predictions, we determined if the strength of the relationships 331 between these statistics varied with the level of divergence between taxa. As expected for 332 a pair of taxa that recently split, the correlation between π and d xy is almost perfect 333 between the least divergent pairs of taxa (r ~ 1), but the correlation decays over time as 334 ancestral variants fix and new mutations increase d xy (Fig. 4A). Remarkably, however, 335 the strong correlations between π and gene density (r ~ 0.8) and π and recombination rate 336 In addition to showing that the footprint of linked selection is dynamic, the 343 sequence of divergence times provides novel insight into when linked selection begins to 344 shape patterns of differentiation, and how long it takes for this signature to develop ( Fig.  345 5). In population pairs with the most recent divergence times (d xy = 0.5% -0.7%), linked 346 selection's effects on the differentiation landscape are already apparent, as genome-wide 347 patterns of F ST are moderately correlated with variation in π, gene count, and 348 recombination rate (r ~ 0.4, Fig. 4D-F). This is true even for the parapatric red and 349 yellow ecotypes of subspecies puniceus (d xy = 0.5%), which diverged recently with 350 ongoing gene flow (31, 32). As divergence continues, the effects of linked selection 351 become even more pronounced. In the most divergent comparisons (d xy = 1.5%), the 352 diversity and differentiation landscapes almost perfectly mirror one another (r = -0.94;

364
The color gradient shows the strength of the correlation. Details for each regression can be found 365 in Table S5.

367
Conclusions and implications 368 Facilitated by the first chromosome-level genome assembly for the bush 369 monkeyflower, we show that linked selection has been a primary architect of the common 370 patterns of diversity, differentiation, and lineage sorting across this recent radiation. 371 Genome-wide variation in the intensity of linked selection is conserved among these taxa 372 and is determined by the distribution of functional elements and variation in the local 373 recombination rate. By taking advantage of the range of divergence times between the 374 different pairs of monkeyflower taxa, we provide the first empirical picture of how the 375 signatures of linked selection emerge over time: linked selection occurring within 376 lineages acts to preserve an ancestral pattern of diversity after a population split, while its 377 joint action in separate lineages causes a common differentiation landscape to emerge 378 between them. Average d xy (*10 -2 ) 0.0

389
In addition to providing a dynamic picture of how the genomic landscape has 390 evolved over the first 1.5 millions years of divergence, our study has important 391 implications for the fields of molecular evolution and speciation. For example, even 392 though the impact of linked selection might be expected to vary across the tree of life due 393 to factors like differences in genome size, ploidy, mutation rate, recombination rate, and 394 effective population size (53), our findings support previous studies indicating that little 395 of the genome evolves free of its effects (6,14,18). This suggests that genome-wide 396 patterns of diversity, differentiation, and lineage sorting cannot be understood without 397 taking the effects of linked selection into account. 398 Our work also has implications for interpreting the genomic landscape in light of 399 the speciation process. Although it was initially thought that peaks of differentiation 400 would correspond to genomic regions underlying barriers to gene flow between emerging 401 species, it is now clear that differentiation landscapes are also shaped by widespread 402 selection that is unrelated to speciation (17, 49 Page 11 ecotypes of subspecies puniceus, as any effect of background selection should be 411 nullified when divergence occurs with gene flow (55). Therefore, unless these 412 simulations failed to capture some important aspect of our study system, other forms of 413 selection that are relevant to speciation may contribute to the common signature of linked 414 selection that we, and others, have seen. Although further work is clearly needed to 415 understand the causes of linked selection, our study shows that characterizing its 416 signatures is a critical step in understanding the processes that shape genetic variation 417 within and between populations and species. 418 419

420
Genome assembly, high-density linkage map, and annotation 421 We used a combination of short-read Illumina and long-read Single Molecule, 422 Real Time (SMRT) sequencing to assemble the genome of a single individual from the 423 red ecotype of M. aurantiacus subspecies puniceus (Table S2 for sample collection 424 location). To assemble resulting scaffolds into pseudomolecules, we generated a high-425 density linkage map from an outbred F 2 mapping population ( We used RAxML v8 to construct genome-wide phylogenies from a concatenated 441 alignment of all variable sites and from genomic windows. We tested for a relationship 442 between node concordance (the number of 500 kb window-based trees that recovered a 443 given node from the genome-wide tree) and internode length using the internode 444 distances from the genome-wide analysis. Tree concordance scores were generated from 445 the correlation between the distance matrix from each window-based tree and the whole-446 genome tree. Autocorrelation coefficients for tree concordance scores were calculated in 447 R using custom scripts, and their significance was tested from 1000 random permutations 448 of the genome-wide data. See supplementary methods for more details.

450
Population genomic analyses 451 Estimates of nucleotide diversity (π), differentiation (F ST ), and divergence (d XY ) 452 for 100 kb and 500 kb windows were calculated using Python scripts downloaded from 453 https://github.com/simonhmartin/genomics_general. Principal components analysis was 454 used to summarize patterns of variation in these statistics across all taxa (for π) and taxon 455 comparisons (  (colored points) obtained using a phylogenetic correction. The color gradient shows the 657 strength of the correlation. Details for each regression can be found in Table S5.

Genome Assembly
We used a combination of short-read Illumina and long-read Single Molecule, Real Time (SMRT) sequencing to assemble the genome of a single individual from the red ecotype of M. aurantiacus subspecies puniceus (population UCSD; Table S1). Genomic DNA was isolated from leaf tissue using either ZR plant/seed DNA miniprep kits (Zymo Research) or GeneJet Plant Genomic DNA purification kits (Thermo Fisher). Illumina libraries were generated following the Allpaths-LG assembly pipeline (Gnerre et al. 2011), which included a single fragment library with average 180 bp insert size and three mate pair libraries (average insert sizes: 3.5-5 kb, 5-7 kb, and 7-13 kb). Libraries were sequenced on the Illumina HiSeq 2500 using paired-end 100 bp reads. An initial scaffoldlevel assembly was performed with Allpaths-LG using default parameters and the haploidify function enabled. This assembly yielded 11,123 contigs (N50 = 40.5 kb) and 2,299 scaffolds (N50 = 1,310 kb), for a total assembly size of 193.3 Mbp. Long-read sequencing was performed from the same individual using 12 SMRT cells sequenced on the Pacific Biosystems RS II machine at Duke University. We obtained a total of 6.4 Gbp of sequence, which corresponds to ~21 × coverage of the genome. The PacBio reads were used to re-scaffold the Allpaths-LG scaffolds using Opera-LG (Gao et al. 2016). This reduced the number of scaffolds to 1,547 (N50 = 1,578 kb).
We then manually improved the scaffold containing the flower color gene MaMyb2 (Streisfeld et al. 2013). We first aligned this scaffold to a previously published draft sequence assembly from this same individual (Stankowski et al. 2017), which was generated using Illumina short-reads and the Velvet assembler (Zerbino and Birney 2008). We used long range PCR and cloning to generate Sanger sequences across three regions within 20 kb of MaMyb2 that did not assemble well. Genomic DNA was amplified using Phusion high fidelity polymerase (NEB). PCR products were cloned into the pCR2.1 TOPO-TA vector (Life Technologies), and purified plasmids were sequenced with Sanger technology. Resulting sequences were aligned to the scaffold containing MaMyb2, and new PCR primers were designed to sequence internal fragments until the entire insert was sequenced. Using this approach, we sequenced a total of 9,824 bp across the three regions. The reference sequence in the assembly was corrected manually to match the Sanger data.
Finally, we gap filled the assembly using the PacBio data and the program PBJelly (English et al. 2012). Resulting scaffolds were assembled into pseudomolecules using Chromonomer (http://catchenlab.life.illinois.edu/chromonomer/), according to the online manual. This software anchored and oriented scaffolds based on the order of markers in a high-density linkage map (see below) and made corrections to scaffolds when differences occurred between the genetic and physical positions of markers in the map. A final round of gap filling with PBJelly was performed to fill any gaps that were created by broken scaffolds in Chromonomer. To assess the completeness of the gene space in the assembly, we used both the BUSCO and CEGMA pipelines to estimate the proportion of 956 single copy plant genes (BUSCO) or 248 core eukaryotic genes (CEGMA) that were completely or partially assembled (Parra et al. 2007;Simao et al. 2015). The proportion of these genes present in an assembly has been shown to be correlated with the total proportion of assembled gene space, and thus serves as a good predictor of assembly completeness.

Construction of high-density linkage map
We generated an outbred F 2 mapping population by crossing two F 1 individuals, each the product of crosses between different greenhouse-raised red and yellow ecotype plants collected from one red ecotype and one yellow ecotype population (populations UCSD and LO, respectively; Table S1). We then used restriction-site associated DNA sequencing (RADseq) to genotype F 1 and F 2 individuals. DNA was extracted from leaf material using Zymo ZR plant/seed DNA miniprep kits, and RAD library preparation followed the protocol outlined in Sobel and Streisfeld (Sobel and Streisfeld 2015). Libraries were sequenced on the Illumina HiSeq 2000 platform using single-end 100 bp reads at the Genomics Core Facility, University of Oregon.
Reads were filtered based on quality, and errors in the barcode sequence or RAD site were corrected using the process_radtags script in Stacks v. 1.35 (Catchen et al. 2011;Catchen et al. 2013). Loci were created using the denovo_map.pl function of Stacks, with three identical raw reads required to create a stack, two mismatches allowed between loci for an individual, and two mismatches allowed when processing the catalog. Single nucleotide polymorphisms (SNPs) were determined and genotypes called using a maximum-likelihood (ML) statistical model implemented in Stacks and a stringent χ 2 significance level of 0.01 to distinguish between heterozygotes and homozygotes. We then used the genotypes program implemented in Stacks to identify a set of 9,029 mappable markers. We specified a 'CP' cross design (F 1 individuals coded as the parents), requiring that a marker was present in at least 85% of progeny at a minimum depth of 12 reads per individual, and we allowed automated corrections to be made to the data.
Linkage map construction was performed using Lep-MAP2 (Rastas et al. 2016). The data were filtered using the Filtering module to include only individuals with less than 15% missing data and excluded markers that showed evidence for extreme segregation distortion (χ 2 test, P < 0.01). To assign markers to linkage groups, we used the SeparateChromosomes module with a logarithm of odds (LOD) score limit of 20 and no minimum size for linkage groups (LG). This assigned 7,217 markers to 10 linkage groups, which matches the number of chromosomes in M. aurantiacus. The JoinSingles module was executed again with a LOD limit of 10 to join an additional 877 ungrouped markers to the 10 previously formed LGs. Fifty-seven singles that were not joined at this stage were discarded from the dataset. Initial marker orders were determined using sexaveraged and sex-specific recombination rates using the OrderMarkers module. For each LG, we conducted 10 independent runs using the Kosambi mapping function (useKosambi=1), with the dataset split into seven pseudofamilies to take advantage of parallel processing. When multiple markers had identical genotypes, only the duplicate marker with the least missing data was used in marker ordering. We retained the marker order from the run with the best likelihood. After removing markers with an error rate > 0.05, the ML order was re-evaluated using the evaluateOrder flag. The map contained 8,094 informative loci from 269 F2 individuals, with an average of 3.5% ± SD 3.86 missing data per individual.
After the integration of our assembly and genetic map using the Chromonomer software (Amores et al. 2014), we made corrections to the map order based on the physical position of markers within assembled scaffolds. Using the output of Chromonomer, we identified markers that were out of order in the map compared to their local assembly order and aligned these markers to the assembly from Chromonomer using Bowtie2 v. 2.2.5 (Langmead and Salzberg 2012) with the very_sensitive settings to obtain their physical order. We then re-estimated the map using the evaluateOrder flag in Lep-MAP2 as described above, but with the marker order constrained to the physical order (improveOrder=0) and with all duplicate markers included in the analysis (removeDuplicates=0). After initial map construction, we removed 17 markers with an estimated error rate greater than 5% and estimated the map one last time using the same settings. The final map contained 7,589 markers across the 10 linkage groups.

Genome annotation
Prior to genome annotation, the assembly was soft-masked for repetitive elements and areas of low complexity with RepeatMasker (RepeatMasker Open-4.0) using a custom Mimulus aurantiacus library created by RepeatModeler (RepeatModeler Open-1.0), Repbase repeat libraries (Jurka et al. 2005), and a list of known transposable elements provided by MAKER (Holt and Yandell 2011). In total, 30.99% of the genome assembly was masked by RepeatMasker. Repetitive elements were annotated with RepeatModeler. Hidden Markov Models for gene prediction were generated by SNAP (Korf 2004) and Augustus (Stanke and Waack 2003)and were trained iteratively to the assembly using MAKER, as described by Cantarel et al. (Cantarel et al. 2008). Training was performed on the 14.5 Mbp sequence from LG9. Evidence used by MAKER for annotation included protein sequences from Arabidopsis thaliana, Oryza sativa, Solanum lycopersicum, Solanum tuberosum, Daucus carota, Vitis vinifera (all downloaded from EnsemblPlants on 9 August 2016), Salvia miltiorrhiza (downloaded from Herbal Medicine Omics Database on 9 August 2016), Mimulus guttatus v. 2 (downloaded from JGI Genome Portal on 9 August 2016), and all Uniprot/swissprot proteins (downloaded on 18 August 2016) (Goodstein et al. 2012;Nordberg et al. 2013;Kersey et al. 2016)(Herbal Medicine Omics Database; Uniprot). We filtered the annotations with MAKER to include: 1) only evidence-based information that also contained assembled protein support, and 2) those ab initio gene predictions that did not overlap with the evidence-based annotations and that contained protein family domains, as detected with InterProScan (Quevillon et al. 2005).

Genome re-sequencing and variant calling
We collected leaf tissue from four to five individuals from seven subspecies of M. aurantiacus, including both ecotypes of subspecies puniceus (Table S3; Fig. S2). In addition, we collected leaf tissue from three individuals of M. clevelandii. We extracted DNA from dried tissue using the Zymo Plant/Seed MiniPrep DNA kit following the manufacturer's instructions. We prepared sequencing libraries using the Kapa Biosystems HyperPrep kit, and libraries with an insert size between 400-600 bp were sequenced on the Illumina HiSeq 4000 using paired-end 150 bp reads at the Genomics Core Facility, University of Oregon.
We filtered raw reads using the process_shortreads script in Stacks v1.46 to remove reads with uncalled bases or poor quality scores. We then aligned the retained reads to the reference assembly using the BWA-MEM algorithm in BWA v0.7.15 (Li 2013). An average of 91.7% of reads aligned (range: 82.6-96.0%), and the average sequencing depth was 21x (range: 15.16 -30.86x). We then marked PCR duplicates using Picard (http://broadinstitute.github.io/picard). We performed an initial run of variant calling using the UnifiedGenotyper tool in GATK v3.8 (McKenna et al. 2010) and filtered the data to remove variants with a mapping quality < 50, a quality depth < 4, and a Fisher Strand score > 50. We then used these variants to perform base quality score recalibration for each individual, before performing another run of the UnifiedGenotyper to call final variants. After the second run of variant calling, we removed variants with a mapping quality < 40, a quality depth < 2, and a Fisher Strand score > 60. The final dataset contained 13,233,829 SNPs across the nine taxa. Finally, we ran UnifiedGenotyper with the EMIT_ALL_SITES option to output all variant and invariant genotyped sites.

Phylogenetic analyses
Initially, we used RAxML v8 to reconstruct the evolutionary relationships among the nine taxa by concatenating variant sites from across the genome. To investigate patterns of phylogenetic discordance across the genome, we also built trees from windows across the genome. We phased SNPs using BEAGLE v4.1 (Browning and Browning 2007), using a window size of 100 kb and an overlap of 10 kb. We incorporated information on recombination rate from the genetic map and did not impute missing genotypes. After phasing, we used MVFtools (https://www.github.com/jbpease/mvftools) to run RAxML from 100 kb and 500 kb nonoverlapping windows, with the M. clevelandii samples set as the outgroups. We then visualized the window trees in DensiTree v2.01 (Bouckaert 2010).
To assess concordance between the window-based trees and the whole-genome tree, we converted trees to distance matrices using the Ape package in R (Paradis et al. 2004). We then calculated the Pearson's correlation coefficient between the distance matrix from each window and the whole-genome tree, with a stronger correlation indicating higher concordance with the whole-genome tree. We used one-dimensional autocorrelation analysis to determine if variation in tree concordance was randomly distributed across the genome. This involved estimating the autocorrelation between genomic position and tree concordance for each LG with a lag size of 2 Mbp. The significance of the observed value for each LG was determined from a null distribution of autocorrelation coefficients estimated from 1000 random permutations of the genomewide data.
We also conducted a Principal Components Analysis (PCA) based on all variant sites from across the entire genome using Plink v. 1.90 (Chang et al. 2015). Initially, we ran the PCA with all 37 samples, but we consecutively re-ran the analysis by removing different taxa in order to assess clustering patterns among more closely related samples.

Population genomic analyses
To examine how genome-wide patterns of diversity, differentiation, and divergence varied among taxa, we calculated within-taxon nucleotide diversity (π), between-taxon relative differentiation (F ST ), and between-taxon absolute divergence (d xy ) across nonoverlapping 100 kb and 500 kb windows using custom Python scripts downloaded from https://github.com/simonhmartin/genomics_general. We calculated measures of differentiation and divergence across all 36 pairwise comparisons among the nine taxa, and diversity was estimated separately for each taxon. These scripts estimated π and d xy by dividing the number of sequence differences within a window (either within or between taxa) by the total number of sites in that window. To account for missing data, the script counted the number of differences between each sample, divided by the total number of variant sites that were genotyped within those samples, and then averaged across all pairs of samples. To provide an unbiased estimate of diversity and divergence, we incorporated invariant sites into the calculation by dividing the number of pairwise differences (within and between taxa, respectively) by the total number of genotyped sites (variant and invariant) within a window. F ST was calculated following the measure of K ST (Hudson et al. 1992), equation 9), but was modified to incorporate missing data using the same approach as π and d xy . We filtered the data separately for each taxonomic comparison, so that each site was required to be genotyped in at least three individuals for comparisons within the M. aurantiacus complex or at least two individuals for each comparison involving M. clevelandii.
We summarized the variation in each statistic across comparisons using a Principal Components Analysis (PCA), with taxon or taxon pair as the variables. Thus, across each window, the first principal component of π, F ST , and d xy provided multivariate measures that explained the greatest covariance in the data. We used a one-dimensional autocorrelation analysis and permutation test to determine if the genome-wide patterns of PC1 π, F ST , and d xy departed from a random expectation, as described above for tree concordance (see section 'phylogenetic analyses').
To examine the relationships among PC1 diversity, differentiation, and divergence, we estimated Pearson's correlation coefficient among all three statistics across genomic windows. Further, we estimated correlations among these three statistics and tree concordance, gene density, and recombination rate. Recombination rate was estimated by comparing the genetic and physical distance (in cM/Mbp) between all pairs of adjacent markers on each LG from the genetic linkage map described above. We removed the top 5% of recombination rates, as these represented unrealistically high values of recombination. A minimum of three estimates was required to obtain an average recombination rate estimate within each window. Gene density was calculated from the number of predicted genes in each window, as determined from the annotation described above.
To determine how the correlations among the statistics (diversity, differentiation, divergence, recombination rate, gene count) changed with increasing divergence time, we examined the correlation coefficient among all pairs of statistics individually for each of the 36 pairwise comparisons. Because diversity was measured within taxa rather than between them, we calculated the mean value of π between each pair of taxa. Also, because many of the pairwise comparisons are non-independent, we applied the phylogenetic correction outlined by (Felsenstein 1985;Coyne and A. 1989) to produce a statistically independent set of data points for this analysis.
As a measure of the divergence time between M. clevelandii and M. aurantiacus, we estimated the percent sequence divergence (d xy ) between individuals of M. clevelandii and all subspecies of M. aurantiacus combined. We then converted this value into a divergence time T (in generations) using the equation: T = d xy /(2µ), where µ is the mutation rate, 1.5 x 10 -8 (Koch et al. 2001;Brandvain et al. 2014). This value was then converted into years by multiplying by a generation time of two years. Table S1. Summary of the genetic linkage map constructed using an F2 intercross between the red and yellow ecotypes of subspecies puniceus. The table includes map length in cM for each linkage group (LG), the number of markers associated with each LG, the number of unique map positions, and the average genetic distance in cM between each unique map position. Standard deviations are given in parentheses.
LG    Figure S4. Incomplete lineage sorting due to rapid diversification. Clades separated by shorter internodes (i.e., separated by less time) are recovered less frequently in window-based trees (500 kb windows). This indicates a strong role for incomplete lineage sorting in areas of the tree where diversification is rapid. The % node concordance is the percentage of window-based of trees that contain a given node in the genome-wide tree, and is plotted against the length of the internode separating each clade. Only clades at and above the level of taxon were included. The red points are the predicted values from a 4-parameter logistic function fitted to the data using an iterative least-squares method.  Figure S5. Patterns of variation plotted across each bush monkeyflower linkage group. Z-transformed F ST , d xy , and π in overlapping 500 kb windows (step size = 50 kbp). The gray lines are z-transformed scores for each of the 36 pairwise comparisons (F ST and d xy ) or nine taxa (π), and the blue line is the z-transformed score for the first principal component (PC1). Estimates of tree concordance, gene count and recombination rate (cM/Mbp) are also shown. LG10 Figure S5. continued Figure S6. Common patterns of genome-wide variation mirror variation in the local properties of the genome. Plots are the same as in Fig. 2 of the main text, but for 100 kb windows (step size 10 kb). A) tree concordance; B-D) Z-transformed PC1 for F ST , dxy and π, respectively; E) gene count; and F) recombination rate (cM/Mbp) are ploted across the 10 monkeyflower LGs. LG1 LG2 LG3 LG4 LG5 LG6 LG7 LG7 LG8 LG9 LG10 Tree conc.  Figure S9. Bivariate plots among measures of variation and genomic features across 500 kb genomic windows. Note that this is the same as Figure 3 but with axes units. Also note that the axes are different across rows and columns of the matrix.   Figure S11. A cartoon depicting the gradual build-up of a heterogeneous differentiation landscape by lineage-specific LS. A) When a population first splits (allopatric divergence with large population sizes), patterns of genome-wide diversity (π pops A and B) are identical due to the complete sharing of ancestral variation among them. Thus, there is no pattern of heterogeneous differentiation between them. B) As time passes, LS begins to act separately within each lineage (LS events are indicated by arrows across the chromosome). This functions to maintain the diversity landscape in the face of new neutral mutations (affected areas in each lineage are shown in color) and also causes ancestral variants to be fixed among them. The result is an increase in F ST in affected areas. C) Long after the split, many LS events have occurred in each lineage. Because the heterogeneous patterns of LS have been conserved since prior to the common ancestor of these taxa (redder areas of the chromosome experience a higher rate of LS), the affected areas are similar between the lineages. As a result, the diversity and differentiation landscapes come to perfectly mirror one another. Static, yet ever-changing.
Their reflection grows.