The rate and effect of de novo mutations in a colonizing lineage of Arabidopsis thaliana

Because colonizations and invasions are often associated with genetic bottlenecks, they offer an opportunity to directly observe de novo mutations and their subsequent fate. North America has recently been colonized by Arabidopsis thaliana, and many of the individuals found today belong to a single lineage, HPG1. To determine substitution rates under natural conditions in this lineage, we have sequenced 100 HPG1 genomes from plants collected between 1863 and 2006. We infer that the last common HPG1 ancestor lived in the early 17th century, most likely the time when HPG1 began to colonize N. America. Demographic reconstructions infer substantial population size fluctuations during the past four centuries. Even though changing demographics can undermine the effect of natural selection, we observed that mutations at coding sites were at lower frequency than mutations at other sites, consistent with the effect of purifying selection. Exceptionally, some mutations rose to high frequency and some had measurable effects in root development, consistent with positive selection acting over mutations with an adaptive value. Our work showcases how by applying genomics methods to a combination of modern and historic samples we can learn about plant colonisations and invasions and observe “evolution in action”.

A neighbor joining tree ( Fig. 2A), mul -dimensional scaling (MDS) (Fig. 2B), and a parsimony network (Fig. 2C) confirmed very close relatedness of the HPG1 genomes, with only three apparent intra-HPG1 recombinants (Fig. 2C). Removing these resolved the re cula ons in the parsimony network (Fig. 2D). The remaining 100 samples (Table S1) cons tute a quasi-clonal lineage mostly devoid of effec ve recombina on and popula on structure, and without SNPs in organellar genomes. The very low genome-wide nuclear diversity (π = 0.000002, θ W = 0.00001, 5,013 segrega ng sites) is two orders of magnitude lower than in the na ve range of the species (θ W = 0.007) (ref. 24 ) ( Table S1). The enrichment of low frequency variants in the site frequency spectrum (Tajima's D = -2.84; global= -2.04) (ref. 24 ) and low levels of polymorphism are consistent with a recent bo leneck followed by popula on expansion (Fig. 3). The obvious explana on is that the bo leneck corresponds to a coloniza on founder event, likely by very few closely related individuals, or perhaps only a single plant. To describe intra-HPG1 rela onships in a more sophis cated manner, we used Bayesian phylogene c inference, exploi ng collec on dates for p calibra on of phylogene c trees. The 76 modern individuals formed a largely monophyle c clade, with only four interspersed herbarium samples from the second half of the 20 th century (Fig. 3A, B). Long branches reflected an abundance of singletons, typical of expanding popula ons a er bo lenecks.
To es mate the subs tu on rate in the HPG1 lineage, we used distance-and phylogeny-based methods that take advantage of the known collec on dates. One has to dis nguish between the muta on rate , which is the rate at which genomes change due to DNA damage, faulty repair, gene conversion and replica on errors, and subs tu on rate , which is the rate at which muta ons survive and accumulate a er demographic and natural selec ve processes 25 . Under neutral evolu on, muta on and subs tu on rates should be equal 26 . The simple evolu onary history of the HPG1 natural popula on enables direct es mates of subs tu on rates, and by comparing these with muta on rates calculated in controlled condi ons, we can learn about demographic and selec ve forces. In the distance method, the subs tu on rate is calculated from correla on between differences in collec on me in historic-modern sample pairs, and the number of changes between those pairs rela ve to a reference (Fig. 3C), scaled to the size of the genome accessible to Illumina sequencing. This method resulted in an es mated rate of 2.11*10 -9 subs tu ons site -1 year -1 (95% bootstrap Confidence Interval [CI]: 1.88-2.33*10 -9 ) using rigorous SNP calling quality thresholds. Relaxing the quality thresholds for base calling and minimum genotyped rate affects both the number of called SNPs and the length of the interrogated reference sequence 27 . These largely cancelled each other out, and our es mates were rela vely stable, between 2.1-3.2*10 -9 subs tu ons site -1 year -1 (Table S3). The Bayesian phylogene c approach, which uses the collec on years for p calibra on and assumes a relaxed molecular clock, yielded a similar es mate, 4.0*10 -9 , with confidence ranges overlapping the above es mates (95% Highest Posterior Probability Density [HPPD]: 3.2-4.7*10 -9 ). Based on the results obtained with different methods, we can confidently say that the subs tu on rate in the wild should be between 2 to 5 *10 -9 site -1 year -1 . We recommend Differences in "per genera on" rates could be caused by several factors, such as an imperfect knowledge of the genera on me in the wild (for rates using different genera on mes, see Fig. 3E). In addi on, mutagenic environmental factors, genome background, muta on spectrum, or methodological idiosyncrasies can affect the es mates. For example, transposons, which comprise~8% of the genome and~19% of the SNPs in greenhouse MA lines, had fewer SNPs called than expected in HPG1 (~13%). This is likely due to difficul es when mapping reads to genomic areas with extensive structural varia on 30 , 30 a 30 d 30 could have contributed to the lower subs tu on rate es mates for HPG1 (Fig. 3E, Fig S3C, Table S3). In addi on, the subs tu on spectrum in HPG1 is shi ed to a lower transi on/transversion ra o compared to the MA lines ( Fig. 3C and Fig. S3), which could be caused by methylated cytosines (see Fig. S4 and SOM). Finally, an alterna ve evolu onary explana on for the rate differences is that purifying selec on slows the accumula on of muta ons in the wild by removing deleterious muta ons (Fig. 3E). To find evidence of nega ve selec on independently of dataset comparisons, we looked at the site frequency spectrum of different annota ons within the HPG1 dataset. Medium-frequency variants, which are more exposed to purifying selec on 31 , were more sharply depleted in genomic regions expected to be under greater selec on constraint (genic and nonsynonymous sites) than in puta vely more neutral ones (intergenic or synonymous sites) (Fisher's Exact test, p-value <0.05 for both comparisons, see Fig. S5). Therefore, even if we cannot say with certainty that purifying selec on drives the differences between HPG1 and MA rates, it must be responsible for the differences between different types of sites within HPG1.
The subs tu on rate allows da ng of HPG1's origin. The mean es mate from Bayesian methods was the year 1597 (HPPD 95%: 1519-1660) (Fig. 3A, B). We also used a non-phylogene c method that u lizes the rela onship between the average gene c distance between any two individuals, with the subs tu on rate mul plied by twice the divergence me and the genome size; solving by the divergence me in the equa on we obtained 353 years. When subtracted from the average collec on date of our samples, the corresponding point es mate is 1625, within the confidence interval of the Bayesian es mate. This corresponds to the date of the last common ancestor of HPG1, and should thus be close to the me of introduc on of HPG1 to N. America. (The date is older than our previous es mate, for which we had naively applied the higher greenhouse muta on rate 23 ). Inference of N e through me suggested exponen al popula on growth un l the early 19 th century (Fig 3B, Fig S6C). During the 20 th century the N e trajectory showed oscilla ng pa erns between growth and bo lenecks, which are typical of selfing organisms 32 , and which likely led to a replacement of most HPG1 sublineages, as the modern samples are all very closely related (Fig. 3 A, B). Since we knew both the collec on years and loca ons of origin of the HPG1 samples, we could also analyze the migra on dynamics of HPG1. Although unknown sources of sampling bias could affect our analyses 16 , the phylogeographic models suggested that HPG1 came to cover much of its modern range soon a er its introduc on to N. America (Fig. S6 A,B). We found a significant correla on between collec on date and both la tude and longitude (Fig. 1C) , which we interpret as a net, highly dispersed, movement in a northwestern direc on over me. Addi onal support for this hypothesis comes from an isola on-by-distance signal, which is most consistent with a historic westward migra on and a more recent reverse eastward migra on (Fig. S6 E,F). The apparent source of those new migrants now persis ng along the East coast was the Lake Michigan area .
Finally, while we did not expect to easily find muta ons that have helped the HPG1 lineage to adapt to its new N. American environment, we wanted to determine whether any of the muta ons have measurable phenotypic effects. Focusing on flowering-related, reproduc ve and root traits of likely ecological relevance, we detected significant quan ta ve heritable varia on (Table S4). We used an approach borrowed from GWAS to find SNPs that had increased in frequency (>5%) and were associated with these phenotypes. Because conven onal GWAS relies on recombina on, which is almost absent from our popula on, our approach could not iden fy individual SNPs, but only SNP cohorts distributed across the genome 33 . We found 79 SNPs associated with root traits, of which nine resulted in nonsynonymous changes. We did not find any SNPs associated with flowering me, even though it is thought to be a key player of rapid adapta on in many annual species 34,35 . Nineteen other SNPs, of which four were nonsynonymous, were associated with climate variables ( www.worldclim.org ) even a er correc on for la tude and longitude, and some of the hits overlapped between root traits and climate variables (Table 1, Table S5, Fig. S7). Although a good number of SNPs was associated with phenotypes and/or climate variables, it is not possible to confidently pinpoint individual candidate SNPs, since the extent of whole-genome linkage disequilibrium (LD) in HPG1 is high (Fig. S9 B,C). However, there is a gradient in the extent of LD between SNPs associated with root architecture, which could help to determine par cularly promising candidates for molecular characteriza on and quan fica on of fitness in natural condi ons (Table 1, Fig. S9 D-F). For example, the gene AT5G19330, overexpression of which confers salt tolerance 36 , contains a SNP that was unlinked to other hits and that has risen in the last four centuries to a frequency of 40%. This SNP is likely to change protein func on, as it leads to a subs tu on of cysteine for tryptophan. Another nonsynonymous SNP is located in AT2G38910, encoding a calcium dependent kinase that belongs to a family of factors involved in root hydraulic conduc vity and phytohormone response 37,38 . Remarkably, most derived root-associated alleles, when compared with equally frequent neutral alleles, were first seen in older herbarium samples, most of which were collected near Lake Michigan, the apparent source of modern popula ons (Fig. S8). Altogether, the rise in frequency, older age of some de novo muta ons, and their quan fiable phenotypic effects and clima c correla ons strengthen the hypothesis that they might have an adap ve value and were under posi ve selec on. These results favor Baker's law, which alleges that selfing species can o en adapt to new environments, over the evolu onary dead-end hypothesis of Muller's ratchet. Furthermore, these sugges ve signals of rapid adapta on via de novo muta ons could change the current paradigm that invasive species adapt most o en from sources of standing varia on, either because an incomplete  215  216  217  218  219  220  221   222  223  224  225  226  227  228  229  230   231  232  233  234   bo leneck le residual varia on or because there has been subsequent admixture with na ve species or  secondary colonizers 39,40 . In summary, we have exploited whole-genome informa on from historic and contemporary collec ons of a herbaceous plant to empirically characterize the effect of evolu onary forces during a recent coloniza on. With this natural me series experiment, we could directly es mate the nuclear subs tu on rate in wild A. thaliana popula ons. This parameter, which provides immediate ability to date key events of popula ons, is only known for one other species: Homo sapiens 41 . We have presented evidence that purifying selec on is percep ble already over me scales spanning only a few centuries. Although the colonizing popula on we have inves gated has limited diversity and suffered rapid fluctua ons in popula on size, there appear to be de novo muta ons with phenotypic effects that contributed to rapid adapta on. While A. thaliana HPG1 is not an invasive species, it can teach us about fundamental evolu onary processes behind successful coloniza ons and adapta on to new environments. Our work should encourage others to search for similar natural experiments and to unlock the poten al of herbarium specimens to study "evolu on in ac on".
Online Content Methods, along with any addi onal Extended Data display items, are available in the online version of the paper; references unique to these sec ons appear only in the online paper.

METHODS
Sample collec on and DNA sequencing. Modern A. thaliana accessions were from the collec on described by Pla and colleagues 17 which iden fied HPG1 candidates based on 149 genome-wide SNPs (Table S1). Herbarium specimens were directly sampled by Max Planck colleagues Jane Devos and Gautam Shirsekar, or sent to us by collec on curators from various herbaria (Table S1). DNA from herbarium specimens was extracted as described 42 in a clean room facility at the University of Tübingen. Two sequencing libraries with sample-specific barcodes were prepared following established protocols, with and without repair of deaminated sites using uracil-DNA glycosylase and endonuclease VIII (refs.  Table S1. We also re-sequenced the genomes of twelve Col-0 MA lines 50,51 (Table S2).
Phylogene c methods and genome-wide sta s cs . We used four methods to es mate the rela onships among modern accessions, and between modern and herbarium HPG1 samples: (i) mul dimensional scaling (MDS); (ii) construc on of a neighbor joining tree with the adegenet package in R (ref. 52  We es mated gene c diversity as Wa erson´s θ (ref. 55 ) and nucleo de diversity π, and the difference between these two sta s cs as Tajimas's D (ref. 56 ) using DnaSP v5 (ref. 57 ). We calculated the folded site frequency spectrum (SFS) as well as the unfolded SFS, for which we assigned the ancestral state using the A. lyrata genome 58 . We es mated pairwise linkage disequilibrium (LD) between all possible combina ons of informa ve sites, ignoring singletons, by compu ng r 2 , D and D ' sta s cs. For the modern individuals, we calculated the recombina on parameter rho ( 4N e r ) and performed the four-gamete-test 59 to iden fy the minimum number of recombina on events. All LD and recombina on related sta s cs were determined using DnaSP v5 (ref. 57 ) (see SOM).
Subs tu on and muta on rate analyses. We used genome-wide nuclear SNPs to calculate pairwise "net" gene c distances using the equa on D' ij = D ic -D jc , where D ' ij is the net distance between a modern sample i and a herbarium sample j ; D ic the distance between the modern sample i and the reference genome c ; and D jc is the distance between a modern sample (j) and the reference genome (c). We calculated a pairwise me distance in years between the collec on mes, T 'ij, and calculated the linear regression: D ' = a + bT '. The slope coefficient b describes the number of subs tu on changes per year. We used either all SNPs or subsets of SNPs at different annota ons (genic, intergenic etc.) appropriately scaled by accessible genome length and with confidence intervals determined by bootstrap (see SOM and Fig. S3).
The second approach used Bayesian phylogene cs with the p-calibra on method implemented in BEAST v1. 8 (ref. 54 ). Our analysis op mized simultaneously and in an itera ve fashion using a Monte Carlo Markov Chain (MCMC) a tree topology, branch length, subs tu on rate, and a demographic Skygrid model ( Fig. 3 A,B; see SOM). The demographic model is a Bayesian nonparametric one that is op mized for mul ple loci and that allows for complex demographic trajectories by es ma ng popula on sizes in me bins across the tree based on the number of coalescent -branching -events per bin (ref. 60 ). We also performed a second analysis run using a fixed prior for subs tu on rate of 3*10 -9 subs tu ons site -1 year -1 based on our previous net distance es mate to confirm that the MCMC had the same parameter convergence, e.g. tree topology, as in the first "es mate-all-parameters" run.
Inference of genome-wide selec on. We separately analyzed sequences at different annota ons, since certain regions regions should be under a different selec on regime (less evolu onary constraint) than others. We compared the means and confidence intervals of subs tu on rates in the en re genome and in intergenic regions for both datasets, HPG1 popula on and laboratory Col-0 MA lines. Only within the HPG1 popula on, we also tested for an interac on between low and common allele frequency polymorphisms and puta vely selected and puta vely neutral annota ons (comparisons: en re genome -intergenic, genic -intergenic, nonsynonymous -synonymous). The formal test was Fisher exact test and low and common frequency SNPs were defined in all possible cutoffs from 1 to 45% allele frequency (Fig.  S5). The signal captured is based on the assump on that purifying selec on is more efficient at intermediate frequencies, pushing deleterious variants towards lower frequency in the spectrum.
Associa on analyses and da ng of new muta ons. We collected flowering, seed and root morphology phenotypes for 63 accessions. For associa ons with climate parameters, we followed a similar ra onale as described 61 . We extracted informa on from the bioclim database ( h p://www.worldclim.org/bioclim ) at a 2.5 degrees resolu on raster and intersected it with geographic loca ons of HPG1 samples (n = 100). We performed associa on analyses under several models and p -value correc ons using the R package GeneABEL (ref. 62 ), with phenotypes and clima c variables as response variables and SNPs as explanatory variables; appropriately correc ng for covariates. Resul ng p-values were adjusted with an empirical p-value distribu on generated from 1,000 permuted datasets, or with a double Bonferroni correc on: 5% / (number of SNPs + number of phenotypes tested).     Comparison of genome-wide, intergenic, intronic, and genic subs tu on rates in HPG1 and muta on rates in greenhouse-grown MA lines. Subs tu on rates for HPG1 were re-scaled to a per genera on basis assuming different genera on mes. Confidence intervals in HPG1 subs tu on rates were obtained from 95% confidence intervals of the slope from 1,000 bootstrap (see Table S4 for actual values).

Table 1. Genic SNPs associated with different traits.
Most SNPs first appeared in sample JK2530 collected 1922 in Indiana. For non-synonymous SNPs, the amino acid change and the Grantham score (ranging from 0 to 215), which measures the physico-chemical proper es of the amino acids, are reported. All SNPs in the table were significant (p < 0.05) a er raw p-values were corrected by an empirical p-value distribu on from a permuta on procedure. * highlights those that also passed a double Bonferroni threshold, correc ng by number of SNPs and number of phenotypes (p < 0.0001). LD corresponds to how many other SNP hits are in high  Table S5 for informa on on all significant SNPs and Table S4 for details on phenotypes and clima c variables.   Figure S1. Ancient-DNA-like characteris cs of unrepaired herbarium libraries.
17 Figure S2. Separa on between HPG1 and other North American lineages.
18 Figure S3. Subs tu on spectrum and rates.
19 Figure S4. Rela onship between methyla on and subs tu ons. 21 Figure S5. Enrichment of low variants at puta vely selected annota ons. 23 Figure S6. Phylogeographic inference in HPG1. 24 Figure S7. Density of SNPs along all chromosomes and loca on of SNP hits. 26 Figure S8. Spa al and temporal emergence of root-associated muta ons. 27 Figure S9. Linkage disequilibrium between SNPs with significant trait associa ons. 28 Figure S10. Correla ons of SNP effects and p-values with frequency and age.

Sample collec on and prepara on
Seeds from modern accessions (Table S1) were bulked at the University of Chicago. Progeny for DNA extrac on was grown at the Max Planck Ins tute for Developmental Biology. We used 2 to 8 mm 2 of dried ssue for destruc ve sampling from the herbarium specimens (Table S1).

SNP calling thresholds
To asses the effect of SNP calling thresholds on the muta on rate, we employed three different SHORE v0.9.0 quality thresholds following previous work (see Table S4 . S3), and we chose the regular 32_15 quality threshold and complete informa on for the final es mate (Fig 3 C, E).

Resequencing of Col-0 Muta on Accumula on lines
We also sequenced the genomes of twelve greenhouse-grown muta on accumula on (MA) lines, including ten that had been sequenced at lower coverage before 5,6 (Table S2). We called SNPs, indels and structural variants (SVs), following the workflow and parameters described 7  reference sequence (12% of variants replaced N's in the TAIR9 genome) or gene c differences in the founder plant of the MA popula on compared to the Col-0 reference genome. In addi on, we iden fied 388 segrega ng variants across the twelve lines (Table S2) Fig. 3E).

Iden fica on of bona fide HPG1 accessions and muta ons
Before we could work with the colonizer group HPG1, we needed to carefully iden fy individuals that belong to other haplogroups or that have introgressions from them. We established the rela onships among all samples at three levels of resolu on: (i) the 149 nuclear SNPs used originally to define the HPG1 haplogroup in a global screening 9 (Fig. S2A), (ii) SNPs in the chloroplast genome (where we did not find any variants within HPG1), (iii) and all nuclear genome SNPs ( Fig. S2B-C). At these three levels we performed a mul dimensional scaling (MDS) analysis and built a neighbor-joining tree using the adegenet package in R (ref. 10 ).
Having iden fied these bona fide HPG1 individuals, we wanted to confirm that the diversity has a legi mate origin from de novo muta ons. For that we used the 1001 Genomes resource (1001genomes.org) to verify that the majority of HPG1-specific variants did not originate in the na ve Eurasian range. Subse ng the genomes from this resource to only European accessions, and limi ng the SNP set to those with ≥1% frequency of alterna ve alleles, there were 338 variants out of all 5,181 HPG1 variants that were also found in Europe or Asia (6%). Only one of the reported SNPs associated with phenotypes (see sec on 10) was among these shared variants.
There are several scenarios that can explain these shared SNPs. One is that some HPG1 individuals were moved back to Europe by humans. Another one is that parallel muta ons occurred in North America and Eurasia or that a reversion-muta on happened in some HPG1 individuals.
Given that 10% of all sites in the genome are variable in the 1001 Genomes collec on, this is not an implausible scenario for at least a frac on of shared SNPs. Two addi onal scenarios involve an origin from standing European varia on: (1) the shared variants come from small introgression events that passed our filters above, or (2) the bo leneck was not complete, and while it le no diversity in the chloroplast genome, a few hundred SNPs were passed on in the nuclear genome (given the low number of variants, the colonizers could have been as many as two dozen seeds)

Greenhouse grown MA lines
Muta on rates were es mated for each 31 st genera on greenhouse-grown MA line 5 as the number of muta ons divided by the total bp length of the genome (or a given annota on) and by 31 genera ons (the two MA lines with only three genera ons were excluded from this analysis). Mean and confidence intervals across lines are reported (Table S3). The genome length was determined as all base pairs with coverage higher or equal to 3, and a SHORE mapping quality score of at least 32 in one sample (Table S2).

Net distances
For the "net gene c distances" method, we computed confidence intervals of the b regression slope coefficient ( D ' = a + bT ') using a bootstrap with replacement of 1,000 samples to avoid over-confident confidence intervals due to lack of independence of points 15 . We used either all SNPs or SNPs at specific annota ons to calculate different subs tu on rates and scaled the slope into a per-base rate using all posi ons (of the given annota on) that passed alterna ve or reference call quality thresholds rather than using a single value of genome length (Table S3). For all annota ons we calculated subs tu on rates with three quality thresholds and either full informa on per SNP or allowing a maximum of 50% missing accessions per SNP (see sec on 3 and Fig. S3C). For some annota ons subs tu on rates were not reliable. For instance, in 3' and 5' UTR regions, we did not have enough muta ons (on average~1 SNP difference between any pair), and thus do not report these regions' rates. Transposons showed unstable subs tu on rate es mates that we a ribute to structural varia on rela ve to the reference genome. This likely decreased our ability to map transposon reads correctly and subsequently call SNPs 16 . In contrast, in the MA dataset, transposon structural varia on was probably fairly low since only 31 genera on separate the Col-0 reference genome with each of the ten derived MA lines. This can be the reason that the number of transposon SNPs iden fied in the MA dataset is propor onally larger than in HPG1 (Table S2 and S3).
Therefore, transposon subs tu on rates in HPG1 cannot be trusted.

Bayesian p-calibra on
For the second approach to es mate a subs tu on rate, the Bayesian phylogene cs p-calibra on approach, we performed systema c runs and chain convergence assessments of different demographic and molecular clock models. We found the Skygrid demographic model 17 (Fig. 3B). Addi onally, we used DensiTree 20 to simultaneously draw the 10,000 BEAST trees with the highest posterior probability (Fig. 3A). Since all trees were drawn transparently, agreements in both topology and branch lengths appear as densely colored regions, while areas with li le agreement appear lighter.

Methyla on status of mutated sites
As in many other species, the spectrum of de novo muta ons in the greenhouse-grown A. thaliana Ancestral cytosines with higher methyla on in both A. thaliana Col-0 reference and HPG1 pseudo-reference methylome datasets were more likely to mutate to thymines in HPG1 (Fig. S4 A-D).
Addi onally, the methyla on degree at subs tu ons inside genes was higher in the HPG1 methylome ( Fig. S4 B,D). While some C → T changes could be explained by higher spontaneous deamina ons known to happen more o en at methylated cytosines, also C → A/G subs tu ons were more likely to have been methylated. If this process is common enough, the Ts/Tv ra o should decrease. We are far from understanding differences in Ts/Tv in natural and controlled condi ons, but definitely methyla on status seems to have a strong sta s cal connec on with mutability.

Inference of genome-wide selec on
Since we observed differences between the two muta on accumula on (MA) datasets (the laboratory Col-0 MA lines and the wild HPG1 lines), we tried to infer selec on based on differences in polymorphisms and subs tu on rates. We compared the different subs tu on rates and the 95% bootstrap confidence intervals to assess how iden cal they were (Table S3). Genome-wide subs tu on rate in the HPG1 dataset was significantly lower than that in controlled greenhouse condi ons, even a er correc ng by the mean genera on me of 1.3 years 25 (Table S3). However, these differences could be due to differences in individual genomic annota ons. For instance, coding regions and introns were virtually iden cal between the two datasets, but transposons and intergenic were much lower in HPG1. Furthermore, if the true genera on me is about 2-3 years (Fig.   3), differences in the men oned annota ons would also disappear. Therefore, we tried to inves gate the existence of purifying selec on based on frequency equilibriums only within the HPG1 popula on. We did comparisons between pairs of annota on categories and between pairs of frequency classes (i.e., low and common). In this way, genome-wide, genic and non-synonymous polymorphisms were compared to matched puta vely neutral annota ons: intergenic and synonymous sites (Fig. S5) Because any frequency cut-off is arbitrary, we computed the test with all cutoffs from 0 to 50% frequencies in 1% steps (Fig. S5). This test, which resembles in concept the MK or HKA tests, evidences that there is a deple on of common frequency polymorphisms, which are more exposed to selec on, at the three puta vely selected genomic levels compared to control (quasi-neutral) regions: genome-wide, genes and nonsynonymous sites.

Skygrid coalescent
From the Bayesian phylogene c analyses described previously (sec on 7.2.2), we studied the demographic model es mated via Skygrid. We reconstructed a skyline plot that depicts changes in effec ve popula on size, a measure of rela ve diversity, through me 26 (Fig. 3B). Sampling biases could produce artefactual effects in the Skygrid plot. Nevertheless we expect these to be minor since our dataset has a con nuous sampling over a century (>2 samples per decade) instead of a two-mepoints sampling, as it is common in ancient DNA studies where the rarity of the samples are a limi ng factor 27 . An addi onal BEAST run was performed only with modern samples to verify that the corresponding part of the tree and popula on sizes matched (data not shown). Implementa on of non-phylogene c methodologies for demographic inference exist, e.g. Mul ple Sequen ally Markovian Coalescent (MSMC) 28 , but a er exploring them we concluded their resolu on was insufficient for analyses of the last several centuries. In order to compare with another method, we got rough es mates of the diversity per decade as the average gene c differences between any two samples per decade (Fig S6C).

Phylogeography
We performed another Bayesian phylogene c analysis incorpora ng a geographic loca on trait 29,30 .
For this, Brownian diffusion parameters are es mated by fi ng a con nuous gradient of geographic loca ons along tree branches, star ng from the leaves of the tree for which geographic loca ons are known, i.e., the collec on sites of our samples. We excluded three samples from the West coast of the United States, separated over three thousand kilometers from the rest, since these do not fit a gradual propaga on by Brownian diffusion. We ran this analysis with the parameters described previously (sec on 7.2.2) and sliced the resul ng 3D (temporal and geographical) phylogeny at the early 16 th and late 18 th century using SPREAD so ware 31 (Fig. S6B). Similar to before, we roughly es mated "local diversity" for each sampling loca on by compu ng the average gene c differences to the 10 closest neighbours (Fig. S6D).

Isola on by distance
We employed a heuris c search 32 using an isola on-by-distance pa ern to find the origin of diffusion of HPG1 in North America, and compared it to the phylogeography analyses. We performed a regression between gene c distances on geographic distances for all pairs of samples ( gene c distance~Euclidean geographic distances ). This pa ern, known as isola on by distance pa ern, reflects that as individuals are more geographically apart, they differ more gene cally. We evaluated whether this rela onship was s ll significant for each of our samples separately (i.e., from a focal sample, does gene c distance increase as geographic distance increases?). Only the significant samples were retained and plo ed since those points are the expected origins of migra ons (Fig. S6 E,F). Arrows can be plo ed in the direc on of the maximum slope to illustrate migra on trajectories.
This was done separately for historic and modern samples.

Root
Fi een root phenotypes were scored for ≥ 10 replicates per genotype over a me-series experiment at the Gregor Mendel Ins tute in Vienna, using image analysis as described in detail elsewhere 33 . We used the means per genotypes and per me series for associa on analyses.

Flowering in the growth chamber
We

Fecundity in the field
To inves gate varia on in fecundity in natural condi ons, we grew three replicates of each accession in a field experiment following a completely randomized block design. Seeds were sown from

Quan ta ve gene c analyses
For 63 modern accessions, we measured me to bol ng and flowering, seeds per plant, seed size, and 15 root phenotypes in common chamber or common garden se ngs. For all 100 accessions, clima c informa on from the bioclim database ( h p://www.worldclim.org/bioclim ) was extracted using their geographic coordinates. For historic samples, some loca ons were only known by county name. In this case we assigned the geographic coordinate loca on of the centroid of the county.

Heritability
We performed associa on analyses using the R package GenABEL 34 , with measured phenotypes (p = We did this so the clima c es mates of h 2 , for which we only have one value per genotype, would be comparable with the phenotypic h2 ones (Table S4).

Linear Models
For associa on analyses we first employed a linear mixed model that fi ed the kinship matrix using the mmscore func on, and only three significant SNP hits were discovered using a 5% significance threshold a er False Discovery Rate correc on (FDR). This was expected since we have few variants and these would have originated in an approximated phylogeny structure. We concluded that fi ng the kinship matrix in our model was not appropriate since there would be no residual varia on for associa on with specific SNPs. With this ra onale we employed a fixed effects linear model using the qtscore func on 36 . To reduce the false-posi ve rate we took a conserva ve permuta on strategy by carrying out associa on with over 1,000 randomized datasets (permu ng phenotypes across individuals) and used the resul ng p-value distribu on to correct p-values es mated with the original dataset. SNPs with p-values below 5% in the empirical p-value distribu on were considered significant (Table S5). In clima c models, we included longitude and la tude as covariates to correct for any spurious associa on between SNPs and climate gradients created by the migratory pa ern of isola on by distance.

Evalua on of significance
Significant SNPs were interspersed throughout the genome (Fig. S7) and their p-values and phenotypic effects did not correlate with the minimum age of the SNPs nor with their allele frequency (Fig. S10), something that could have indicated that the significance was merely driven by the higher sta s cal power of intermediate frequency variants. Using QQ plots to assess infla on or defla on of p-values, we observed generally that permuta on corrected p-values were deflated.
Straight series of points in QQ plots indicate iden cal p-values for mul ple SNPs, a pa ern that we a ributed to long range LD, i.e. lack of independence (see Graphic Table S7 for trait distribu ons and QQ plots from each associa on analysis). We also used a False Discovery Rate correc on for the raw p-values using p.adjust in R and, as a sanity check, we used a Bonferroni-corrected threshold, a procedure considered over-stringent in associa on analyses (Table S5). This was calculated as: 5% / (number of SNPs + number of traits) ~ 0.01%.

Context of de novo muta ons associated with phenotypes
For each SNP in our dataset, we determined the ancestral and derived states, by iden fying which allele was found in the oldest herbarium samples. We compared the me of emergence and the centroid of geographic distribu on of the alterna ve alleles of SNP hits to random draws of SNPs with the same MAF filtering (5%) (Fig. S8).

Func onal informa on
On top of phenotypic and clima c associa ons of SNP hits, we also provide a likely func onal effect employing a commonly used amino acid matrix of biochemical effects 37 . Func onal informa on of gene name and ontology categoriza on of SNP hits was obtained from www.arabidopsis.org/portals/genAnnota on/gene_structural_annota on/annota on_data.jsp and www.arabidopsis.org/tools/bulk/go/ (Table 1 and Table S5).

Proof of concept examples
We argue that the power of an associa on approach relies on the fact that HPG1 lines resemble Near Isogenic Lines (NILs) produced by experimental crosses 38 (Fig. S9A). Similar to genome-wide associa on studies (GWA), power depends on many factors, namely the noise of phenotype under study, architecture of phenotypic trait, quality of genotyping, popula on structure, sample diversity, sample size, allele frequency, and recombina on. On one hand, associa on analyses in NILs suffer from large linkage blocks, but confident results can be achieved due to accurate measurement of phenotypes, limited gene c differences between any two lines, and high quality genotypes. In common GWA studies such as in humans, there are mul ple confounding effects. Among the confounders are (1) that any two samples differ in hundreds of thousands of SNPs, and (2) that historical and geographic stra fica on produce non-random correla ons among those SNP differences. This considerably complicates the iden fica on of phenotypic effects at specific genes, and power relies greatly on large sample sizes to achieve the sufficient number of recombina on between markers.
To provide support for the non-synonymous SNP on chromosome 5, at posi on 6,508,329 in AT5G19330, we looked for pairs of lines that carry the ancestral and the derived allele, but that differ in few (or no other) SNPs in the genome. When considering all genic subs tu ons with a minimum allele frequency of 5% (Fig. S9A), we iden fied 20 pairs of lines differing only in the AT5G19330 SNP and another linked SNP (located on a different chromosome, associa on p-value > 0.4). The phenotypic differences in mean gravitropic score of these almost-iden cal pairs were significantly higher than phenotypic differences among all pairs of HPG1 lines, and gene cally iden cal pairs a ending to subs tu ons inside genes (Fig. S9A). Furthermore, this SNP was not in complete linkage with any other SNP hit ( r 2 < 0.5) (Fig. S7D). The same approach was used to examine the SNPs in AT1G54440 (Fig. S7E) and AT2G16580 (Fig. S7F), which represent an intermediate and a high LD example.  Percentage of the variance explained by each dimension given in parentheses. isola on-by-distance pa erns. Three loca ons of modern samples and four loca ons of herbarium samples showed significant slopes (p < 0.05) in the isola on-by-distance pa ern. That is, gene c distance increased when moving away from these geographic loca ons. For one sample of each subset, herbarium (F) and modern (G), a likely migra on trajectory is depicted by an arrow and its isola on-by-distance pa ern is shown.  Table S5).   Correla on between SNP frequency and p-value (A) , frequency and effect (B) , age and p-value (C) , age and effect (D) . All cases were non-significant.

SUPPLEMENTAL TABLES
See appended . pdf file for Tables S1-5 .