Demographic history, linked selection, and recombination shape the genomic landscape of a broadly distributed Pacific salmon

Understanding the impacts of current human activities on within-species genetic variation requires a thorough description of the historical factors that have shaped the genomic and geographical distribution of nucleotide diversity. Past and current conditions influencing effective population size have important evolutionary implications for the efficacy of selection, increased accumulation of deleterious mutations, and loss of adaptive potential under the nearly neutral theory. Here, we gather extensive genome-wide data that represent the extant diversity of the Coho salmon (Oncoryhnchus kisutch) to address three issues. First, we demonstrate that a single glacial refugium is the source of the majority of present-day genetic diversity, with minor but detectable inputs from secondary micro-refugia. We propose a scenario whereby several ancestral populations located south of the ice sheets expanded in postglacial time, swamping out most of the diversity from other putative micro-refugia. Following this expansion, we identify particular populations having undergone continuous declines in population size (Ne). Second, we combine multiple evidence from demographic modelling, analysis of recombination landscape, and genome-wide landscape of diversity to demonstrate that selection at linked sites and Hill-Robertson interference played a major role in shaping genetic diversity across the Coho salmon genome. Third, we demonstrate that this demographic history generated subtle differences in the load of deleterious mutations among populations, a finding that mirrors recent results from human populations. Taken together, we found considerable support for the joint contributions of demographic history and linked selection in the load of deleterious mutations. We suggest that these inferences should be better integrated in conservation genetics of managed fish species which currently focuses largely on within-population adaptation. Author Summary Reconstruction of a species’ past demographic history from genome-wide data allows understanding how historical factors interact with intrinsic genomic properties to shape the distribution of genetic diversity along its genome and its geographic range. Here, we combine genotyping-by-sequencing and whole genome sequence data with demographic modelling to address these issues in the Coho salmon, a Pacific salmon species with rapidly declining census size in some parts of its range, notably in the south. Our demographic reconstructions indicate a linear decrease in genetic diversity towards the north of the species range, supporting the hypothesis of a major southern refugia for the Coho salmon and a northern route of postglacial recolonization. Accordingly, the number of candidate deleterious homozygous derived mutations was higher in northern populations. Demographic modelling also suggested the existence of cryptic refugia that may have been missed with the use of simpler summary statistics. We further showed that the species’ genome was shaped by linked selection and biased gene conversion. In particular, local variation in recombination rates have modulated the efficacy of natural selection. These processes, together with a complex demographic history, can contribute to the load of deleterious mutations – an effect we argue should be taken into account more routinely in conservation genetics studies.


Introduction
Both plant and animal biodiversity are currently declining or disappearing at unprecedented rates due to human activity [1]. This leads to population size reduction, reduced genetic diversity, and to the sixth mass extinction [2]. Before humans became major drivers of changes in species distributions and abundance, long-term climate change had a major influence [3]. The Pleistocene glaciations resulted in major contractions in the geographical distributions of many species into refugia that persisted in unglaciated areas [4]. Postglacial range expansions often led to contacts between ancestral populations previously segregated in different refugia [4,5]. The effects of long-term climate change combined with recent human-induced population declines can foster genetic changes including a loss of genetic diversity, increased inbreeding, increased load of deleterious mutations, and a loss of local adaptation [6].
In this context, it becomes important to understand how demographic history interacts with past and ongoing selection and recombination to shape genetic variation. By disentangling past and current drivers of range-wide genomic diversity, this information can inform management and conservation decisions [7].
Beyond conservation implications, such context provides a unique opportunity to address outstanding questions in evolutionary biology. In particular, what is the role of gene flow in shaping heterogeneous differentiation landscape during population divergence [8,9]? What are the demographic conditions required to generate substantial differences in deleterious load among populations [10]? deleterious mutations [25]. From a conservation standpoint, populations harboring an elevated number of deleterious variants might need to be monitored more closely.
Combining population genomics data with demographic modelling represents a powerful strategy to test alternative hypotheses about historical drivers of existing genomic diversity. Previous studies employing a similar approach have focussed mostly on species with a narrow geographic range, such as small islands [26,27], which are on the verge of extinction [28][29][30][31] are strongly bottlenecked [32], sheddig light on the evolutionary consequences of small population size. Few studies, however, have investigated how historical processes have shaped the geographical patterns in the distribution of genomic diversity in more broadly distributed species, e.g., at the scale of a whole continent. An exception to this observation is the vast literature on demographic reconstructions of human populations. Long-lasting debates in this literature regarding the role of demography in generating mutation load differences among populations [22,33,34] could benefit from studies of species displaying similarly complex demographic histories and broad geographic distributions.
Salmonid fishes are economically important species that have suffered recent demographic declines [35,36]. This is particularly the case for Coho salmon (Oncorynchus kisutch), one of the five anadromous species of Pacific salmon that supports important recreational and indigenous subsistence fisheries, which has suffered dramatic population declines (> 90%) over the last three decades in parts of its range [36,37]. A previous study investigated the range-wide population structure and demographic history of the species and found a cline of decreasing diversity from south to north, as well as some endemic diversity in small putative refugia [38] (see also [39]). This study indicated that Coho salmon may have survived the last glacial maximum (LGM, i.e. the Fraser Glaciation in British Columbia, and the McConnell/McCauley Glaciation in Yukon and Alaska; 23 to 18 Ky ago) in unglaciated areas of Haida Gwaii and Beringia in addition to areas south of the ice sheets. This study, however, predates the genomic era and could not eliminate alternative hypotheses regarding the origin and number of glacial refugia during the LGM. Most importantly, the impacts of confounding factors such as background selection, recombination rate variation, and how these factors may facilitate the accumulation of deleterious mutations could not be studied with the limited number of genetic markers available at the time. In North America, the species is currently distributed from California to Alaska [40]. Unglaciated areas that could potentially serve as glacial refugia persisted both north (e.g. the Beringian refugium in Alaska, the Yukon Territory of Canada and areas of Asia and the Bering Land Bridge) and south (e.g. all of the deglaciated area south of British Columbia, Canada) of the ice sheet [40][41][42]. Other unglaciated areas (e.g. Haida Gwaii in British Columbia) could also have been micro-refugia [43,44]. In this context, distinct demographic scenarios can be tested. Under a first scenario whereby populations expanded north from a single southern refugium, we predict: i) a latitudinal decrease in genetic diversity from south to north along with a pattern of IBD, and ii) ancestral populations located in areas south of the ice sheets. Under a second scenario, populations expanded south from a single northern refugium, and we predict the opposite geographic pattern. The third scenario corresponds to the survival of populations in different refugia where we predict: i) the existence of clearly distinct genetic clusters, and ii) postglacial gene flow with signatures of secondary contacts, with contact zones displaying higher genetic diversity through postglacial admixture between different genomic backgrounds.
In order to test these alternative scenarios, we generated genome-wide data from nearly 2,000 Coho salmon from California to Alaska, one of the most extensive genomic datasets for a non-model vertebrate species to date. First, to resolve the species demographic history, we used a modelling approach that accounts both for barriers to gene flow affecting migration locally, and for linked selection affecting the rate of drift.
Next, we tested the above predictions related to linked selection. Finally, we hypothesized that demographic history and background selection shaped the pattern of deleterious mutation load, both within and among populations. In particular, we hypothesized that postglacial re-colonisation influenced levels of standing genetic variation and favoured the accumulation of deleterious mutations at the expansion front. In these conditions, we predicted that genetic diversity should decrease as a function of the distance from the ancestral populations, while the accumulation of putatively deleterious mutations should increase as a function of the distance to the ancestral populations.

Overall genetic diversity and population structure
A total of 1,957 individuals was sampled from California to Alaska representing 58 sampling locations (mean n = 34 fish per location, Fig 1a, S1 Table) and genotyped using a genotype by sequencing (GBS) method that generated 82,772 high quality filtered single nucleotide polymorphisms (SNPs). Another set of 55 individuals representing 11 sampling locations from the same range (Fig 1a), were whole genome sequenced (WGS ) to ~30X coverage, and used in specific analyses (S2 Table).
Levels of genetic diversity (observed and exepected heterozygosity, πSNP) were highest in formerly deglaciated areas in the south (California, Cascadia, Fig 2a, Fig 1b) and decreased as a function of distance from the southernmost site up to Alaska (r = 0.64, p < 0.0001, Fig 2a, S1 Fig). The Thompson River watershed (Thompson R. hereafter) in southern British Columbia was an exception to this latitudinal pattern and displayed the lowest average level of regional genetic diversity of all sampling locations which we hypothesized to results from bottlenecks in this area. The remaining samples from British Columbia were intermediate in genetic diversity. The distribution of singletons provided further information regarding the most ancestral populations, with older populations expected to have accumulated a higher density of singletons [45]. Counting the number of singletons by sampling site and averaging by regional groups revealed the following differences: Cascadian samples contained the highest number of singletons, with a mean of 1,263 singletons per site. Californian samples had the fewest number of singletons (nMEAN = 55) while Alaska harbored intermediate density (nMEAN = 966). Consistently, WGS data revealed 2.7 times more singletons in Cascadia (Tsoo-Yess River) as compared to Alaska (Kwethluk River), whereas Thompson R. samples contained 6.7 times less singletons, supporting the hypothesis of a pronounced bottleneck in the these populations (S3 Table). Similarly, the occurrence of private alleles was more prevalent among southern than northern populations, being nearly twice as high in Cascadian (n = 10,097) than in Alaskan populations (n = 5,270). Again, populations from the Thompson R. were an exception to this pattern with the lowest level of private polymorphism (n = 1,479, S2  Next we used the βST coefficient to identify ancestral population [46]. Unlike FST estimates [47], this index can account for the non-independence among populations and negative values are indicative of ancestral populations [46]. Here, βST indicated that ancestral populations were located in previously unglaciated areas corresponding to Cascadia (n= 5 localities), California (n = 3 localities) as well as one site from southern British Columbia (Fig 2b, S4 Table). A linear decrease in βST as a function of distance from the southernmost site was observed (r = 0.60, slope = 1.03e-04, p <0.0001) as expected under IBD. Support for this IBD pattern was also observed using FST (S4 Fig, r    We then used Treemix [49] to infer population splits and gene flow (Fig 3b). and Thompson R. also displayed higher genetic drift, in line with evidence based on analyses of genetic diversity and genetic structure. The two most supported migration events occurred from Cascadia south to California and north to the Thompson R. We note that populations followed the south to north arrangement, with the samples from Cascadia displaying less drift than those further north.

Demographic history
In order to assess more formally the occurrence of one or more refugial origins for contemporary populations, we performed the following explicit model-based inferences of population divergence scenarios using ∂a∂i [50]. Our models account for the confounding effects of selection at linked sites and that of the accumulation of local barriers to gene flow in the genome [19,51]. Four major demographic models were statistically compared using groups identified in the PCA and with a focus on previously hypothesized refugia (i.e., Cascadia, California, Haida Gwaii, Alaska). The following models were tested: strict isolation ∂a∂i. In each pairwise comparison one single representative population from a putative refugium was compared against one population from another refugium. Due to the high local structure within groups, we avoided pooling samples into higher order groups (e.g., regional groups) as this would unavoidably bias our results. Models incorporating linked selection and restricted introgression along the genome always received highest support (S5 Table). YAIC (minimum value > 5) confidently discriminated models in 87% of the ∂a∂i comparisons (S5 Table). The SC model received the highest support in 46% of the comparisons, the AM model in 30%, and the IM model in 14%, with 13% remaining unclassified. Scenarios including periods of gene flow clearly outperform scenarios assuming no gene flow. The fact that secondary contact was the bestsupported model suggests that more than one glacial refugia contributed to the recolonization of the contemporary range occupied by Coho salmon.
Assuming a generation time of 3.5 years [52] and mutation rate of 8e -9 bp/generation revealed  Table). Similarly, the Hill-Robertson factor suggested that Ne was reduced to 37%-50% of its initial estimate and that approximately half of the genome was affected (S6 Table). We then investigated historical change in effective population size using the Sequentially Markovian Coalescent in SMC++ [54] and the 55 whole genome sequences, which are representative of 11 populations from California to Alaska (Fig 1, S2 Table).  [54]. Similarly, exact Ne estimates should be interpreted cautiously and only the trend should be considered rather than the exact values. We first described patterns of recombination and then tested correlations among genetic diversity (π), divergence (DXY), and differentiation (FST) parameters measured across the whole genome using 500 kb sliding windows (Fig 5 A-C) and the population-scaled landscape of recombination (P) ( Fig 5D) and gene density ( Fig 5E) to test our predictions using whole genome sequences.

Linked selection shape the Coho salmon genomic landscape
This analysis first revealed a heterogeneous recombination landscape that varied both within and among chromosomes. In particular, recombination rate was higher towards the ends of chromosomes ( Fig   5D), as observed across many species [55] Second, we observed a negative correlation between recombination rate and chromosome length (R 2 = 0.481, p <0.0001, S14 Fig Table). PC1 captured 99% of variance in π and in DXY, indicating that the PC1 was effective at summarizing information about diversity and divergence. All linear correlations among the tested variables were significant (p < 0.0001, S8A Table).
We found a positive correlation between PC1-π and recombination (r = 0.54, S8A Table) and a negative correlation between PC1-π and gene density (r = -0.20) (Fig 6A-D). We also found a positive correlation between PC1-DXY and recombination (r = 0.55) and a negative correlation between PC1-DXY and gene density (r = -0.21). PC1-FST was negatively, albeit weakly correlated with both recombination (r = -0.05) and gene density (r = -0.04, Fig 6B-E and C-F). We observed a correlation of 0.999 between π and DXY, as expected here because of a very recently shared common ancestor ( Fig 6G). We also found a negative relationship between PC1-π and PC1-FST (r = -0.44, Fig 6H) as well as between PC1-DXY and PC1-FST (r = -0.43, Fig 6I). The smaller correlations between statistics involving FST can be explained by the modest variance in FST explained by PC1 (37%). In order to investigate further the effects of linked selection, we used mixed linear models that integrate interactions among tested variables (i.e. recombination, gene density). These revealed a significant effect of both recombination (t = -41.17, p <0.0001, S8B Table) and gene density (t = 5.72, p<0.0001) on PC1-DXY (R 2 = 0.348). The same was true when considering the correlation of PC1-π with recombination ( R 2 = 0.55, Fig   6A- Table). Therefore, our results suggest a role of linked selection in shaping the patterns of diversity along the Coho salmon genome. We then tested whether the efficacy of selection also played a role in shaping patterns of diversity along the Coho salmon genome using the whole genome sequences. We measured ratios of non-synonymous to synonymous polymorphisms (πN/πS) as a measure of the efficacy of natural selection and GC content at the third codon position (GC3) as a proxy of local recombination rates. This metric was preferred as an indicator of localized recombination rate variation at the gene scale, as opposed to the large population scale estimates of recombination (P) over 250kb windows used above. Indeed, biased gene conversion (gBGC) occurs because base mismatch during homologous recombination is known to be biased toward G and C (conversion preferentially favours G+C over A+T bases) [56,57]. Here, we observed that the correlation between GC3 and P ranged from r= 0.3 to 0.72, depending on population. These modest correlations likely reflects the loss of information about fine scale recombination rate variation when using 250kb sliding windows, which also provides added justification to use the GC3 as a proxy. We found a strong correlation between πN/πS and GC3 for all populations (Fig 7, linear models all R 2 > 0.9, p < 0.0001, S10 Table). Next, we observed a negative correlation (linear model, R 2 = 0.28, p = 0.051) between historical Ne measured before the onset of population expansion ~13,000 years ago and πN/πS ratio (S16 To do so, we tested for an increase in the derived allele frequency (DAF), and homozygosity of predicted deleterious mutations at non-synonymous sites. We estimated derived alleles using whole genome information from three outgroups: 1) Chinook salmon (Oncorhynchus tshawytscha) [58], 2) Rainbow trout (Oncorhynchus mykiss) [59] and 3) Atlantic salmon (Salmo salar) [60]. Out of 4,427 non-synonymous mutations identified with the GBS dataset from all 58 populations, PROVEAN (63) predicted a total of 1,297 deleterious mutations in at least one of these populations and for which we were able to identify the derived allele. Deleterious mutations were maintained at lower DAF than synonymous variants (Wilcoxon-tests, all p <0.001). The DAF spectrum ( Fig 8A) were computed for each major region (using a sample of size n = 100) and for each population separately (S17 Fig). DAF were significantly different, among region (Kruskal-Wallis chi-squared = 100.57, df = 5, p-value < 2.2e-16) as well as among population (Kruskal-Wallis chisquared = 638.59, df = 57, p-value < 2.2e-16, S11 Table).They showed that Alaska, Thompson R. and Next, we examined the count of homozygous (proportional to the load under a recessive model [33,34]), heterozygous and of total derived deleterious mutations (proportional to the load under an additive model [33,34]). In particular, we expected that mutations in a heterozygous state should be more frequent than in a homozygous state, especially in populations with higher effective sizes, where selection should be more effective at purging these mutations [23]. We found that 77% of deleterious mutations were maintained in heterozygous states across all samples. Also, fish from Alaska, Haida Gwaii and California populations harbored a significantly higher number of deleterious mutations in a homozygous state when compared to Cascadia or British-Columbia (Wilcoxon-test, p < 0.01) (Fig 8B, S12 Table). When considering the total load of derived deleterious mutations, we found that, on average, there were significantly more putatively deleterious variants per individual in California, Cascadia, and Haida Gwaii populations than in fish from Alaska, British-Columbia or the Thompson R. watershed (S12 and S13 Table, Fig 8C, Wilcoxon-test, p < 0.01), although these differences were modest. Finally, we tested the prediction that distance from the likely ancestral source predicts the deleterious load, as observed in human populations [33]. We found a nearly linear relationship between the distance from the putative origin of ancestral populations (the site with the lowest βST in Cascadia) and the number of derived, homozygous, putatively deleterious mutations (HDD; linear models, p<0.0001; R 2 = 0.13, Fig 8D). Under the hypothesis that higher recombination rate leads to more efficient purging of deleterious mutations we also expected deleterious mutations to be preferentially located in areas of low recombination (S19 Fig). As expected, the GLM revealed that the occurrence of deleterious mutations decreased as recombination rate increased (χ 2 = 4.90, DF = 1, p = 0.027). This effect was stronger when considering non-synonymous variants instead of putatively deleterious variants (χ 2 = 10.07, DF=1, p = 0.0015). Given the negative correlation of recombination with chromsome length, we also found a positive correlation (r = 0.76, p < 0.0001) between the chromosome length and the number of deleterious variants by chromosomes, as expected when purging is more efficienct in areas of higher recombination (S20 Fig). To further examine this, we classified regions by recombination rate into high (rho > 4.516, see methods), low (rho < 1.411) and intermediate classes and found that there was a significant excess of candidate deleterious mutations in areas of low recombination and depletion of candidate deleterious mutations in areas of high recombination (χ 2 = 13.33, DF= 2, p = 0.0012).

Discussion
Coho salmon is an emblematic fish species that has undergone population declines in recent decades throughout its North American range. We generated one of the largest collections of GBS and WGS data to date for a non-model vertebrate species, which revealed: i) a complex demographic history involving population splits, gene flow, and secondary contacts; ii) linked selection and Hill-Robertson interference shaped genetic diversity along the genome, and iii) this demographic history has resulted in modest yet detectable differences in the frequency of deleterious variants across regions and populations, which also varied as a function of recombination rate along the genome. Together, these observations help illuminating the drivers of variation in genetic diversity throughout the genome of a broadly distributed species of economic and cultural importance.

Expansion from a major southern refugium and secondary contact with micro-refugia
Our results revealed the existence of a major ancestral refugium located south of the ice sheets in Cascadia were not in full agreement with SMC++ estimates using WGS (~18 KyA). This, however, is expected given the different assumptions made by the various models compared and that gene flow is expected to delay divergence [62]. The last glacial period (Wisconsin Glaciation) which lasted from ~ 120 Ky to ~11 KyA was interrupted by an interglacial period from 55 to 25 KyA ( [40]). Therefore, our results indicate that neither Beringia nor Haida Gwaii were refugia before the onset of the Wisconsin Glaciation. However, it is possible that individuals from Cascadia already colonized and subsequently diverged in Alaska (Beringia) 55 to 25 KyA. If this was the case, however, the contribution of Beringia is likely minor as no strong footprint of ancestral diversity and no signal of secondary contact was inferred in this area.
While a single major southern refugium hypothesis is well supported by our data, our analyses also revealed patterns consistent with minor contributions of micro-refugia on extent genomic diversity. Indeed, our demographic modelling supported a model of postglacial (10-20 KyA, S6 Table)  between Cascadia and Thompson R. (n = 1). These results are counter-intuitive, because contacts between northern sites and California, but bypassing Cascadia, seem unlikely. A possible explanation is that our statistical power to detect SC involving putative micro-refugia was reduced. Here, indeed, the SC period represented a large proportion (22% on average) of the total divergence time across all models. However, models of SC can easily be confounded with models of isolation with migration when SC represents a large period of time (> 10%) relative to the total period of divergence, as observed in our investigations [19,63].
Still, our inference of SC supports the hypothesis that smaller micro-refugia have persisted along the Pacific Coast [43,64]. In particular, the Haida Gwaii archipelago is known for high endemism, and its role as a refugium for mammals, invertebrates, and angiosperms is well established [64][65][66].  [70]. A potential limitations of our current approach is that we did not incorporate samples from Asia, where the Coho salmon also occurs [41]. Whether this would change our inferences remains an open question. In summary, our modelling approach revealed the presence of multiple cryptic refugia that would otherwise have been missed because their endemic genetic diversity appears to have been largely wiped out by a major demographic expansion out-of-Cascadia, which has clearly been the primary contributor of extent genomic diversity in North American Coho salmon. Knowledge of this non-equilibrium demography will important to help interpret adaptive and deleterious genomics landscape in future studies.

Selection at linked sites shape genome wide variation
We also explored the role of gene flow and linked selection in generating a heterogeneous differentiation landscape along the genome, a hotly debated topic in evolutionary genomics [8,9,71,72]. Disentangling the two processes is of fundamental importance for correctly interpreting the origin of genomic islands of divergence and genome scan results [73,74]. Although a positive relationship between absolute and relative divergence may reveal the presence of barriers to gene flow [9], this is not relevant for early stages of divergence [75], as in the present study. Instead, our modelling approach best supported a role of both gene flow and linked selection. In particular, the role of linked selection is supported by the positive correlations between π or DXY and recombination rate, while FST was negatively correlated with recombination rate [73,75]. The negative correlations between π or DXY and gene density provided further evidence for the role of linked selection [17]. Here, heterogeneous genomic divergence did not arose after speciation, as suggested by Cruickshank & Hahn [9]. Instead, our within-lineage study indicated that linked selection arose within structured populations, in line with recent findings in birds [73]. Similar effects of linked selection were suggested between the more diverged ancestral lineages within Atlantic salmon (>1 MyA) [76]. Given that correlated genomic landscapes have been reported among species diverged for over tens of millions of years [77,78], it would be interesting to investigate how linked selection has also shaped genomic landscapes within the entire radiation of salmonid fishes. Indeed, salmonid species have undergone a whole genome duplication approximately 90 MyA [79] affecting recombination throughout their genome [57,80,81]. Precise mapping of the genomic locations of duplicated regions will allow an improved understanding of how recombination rate varies in these regions althougth we do not expect that will affect our main conclusion related to linked selection. Salmonid genomes are also characterised by pronounced male heterochiasmy [82,83] and here, we observed a significant correlation between recombination and chromosome length, indicative of crossover interference [84]. Both heterochiasmy and crossover interference contribute to heterogeneity in recombination rates and likely favour the effect of linked selection. Given that most nonneutral mutations are deleterious [85], and given the impact of recombination across the genomes of fishes, models of background selection could be the null model against which to test adaptive hypotheses [21]. Our data revealed that the GC3 was a good proxy of recombination and that it was strikingly well correlated with our proxy for the efficacy of natural selection (πN/πS). As expected, this metric revealed differences in the most bottlenecked populations from the Thompson River drainage, where the severity of the bottleneck mimics the effect of domestication bottleneck (e.g. [86]). Indeed, the Thompson R. populations displayed a higher burden of non-synonymous mutations, and we suggest that the strong bottlenecks in these populations had a dramatic impact on the efficacy of natural selection. When considering only genomic regions of low recombination, however, we also found that some California populations also displayed an increased πN/πS ratio. Given that these populations have recently undergone large declines in abundance [68], should be high [82]. Furthermore, we suggest that the severe and recent demographic declines documented in California populations compared to more long term and putatively less severe declines in the Thompson R.
may explain why differences were detected only in genomic regions of low recombination in the former but genome-wide in the latter. Regardless, the non-equilibrium demography, and role of linked selection, in particular background selection, are predicted to favour the accumulation of deleterious mutations, an important finding to manage declining species.

Accumulation of deleterious mutations under complex demography
In the literature, contradictory results have been reported regarding the role of demographic history on the load of deleterious mutations [10,33]. On the one hand, processes such as strong and repeated bottlenecks [24,32], including domestication [86,88], large expansions [89,90] [86] and recently in Isle Royale wolves [26] where decreased population sizes and inbreeding increased the frequency of deleterious recessive mutations. Overall, these results are consistent with recent empirical findings in which small populations or populations at the expansion front carry more homozygous derived deleterious mutations [26,27,86]. These deleterious mutations in homozygous state are expected to be purged by purifying selection [23,92]. Populations from Cascadia, with higher effective population sizes, contain a higher number of putatively deleterious variants in heterozygous state, as expected from population genetics theory. We also found a nearly linear relationship between the number of derived deleterious mutations in homozygous states and geographic distance from the putative refugial source in Cascadia. This relationship mirrors the findings pertaining to the 'out-of-Africa' expansion of human populations [33], although the maximum geographic distance in our study is an order of magnitude smaller than in the human studies, and that the presence of multiple refugia (as opposed to the sole African ancestral origin for humans) may have contributed to obscure somewhat the relationship.
Moreover, pronounced genetic drift or bottlenecks observed in some populations (e.g. Thompson R. and California) may have contributed to reduce the efficacy of selection [24]. Finally, the lower prevalence of deleterious mutation in fish from British-Columbia might also be explained by post-glacial admixture. In some small populations it is possible that these mutation initially reached fixation but subseqently became masked in heterozygous state due to gene-flow and introgression [93]. Similar effect could be due to 582 583 more deleterious mutations being found in genomic regions of low recombination. This is in line with theoretical predictions (e.g. Hill-Robertson Interference discussed above), and also empirical studies in other species [94,95]. Overall, our results suggest that Coho salmon populations with highly reduced population sizes are exposed to higher inbreeding depression -a prediction with major conservation implications and which should be investigated further in future studies. 585 586 587 588

Conclusion
In this study, we presented a rare combined assessment of the relative role of complex demographic history (investigated both by empirical genomic data and modelling) and intrinsic genomic factors (recombination, linked selection) in shaping drivers of the genomic landscape of a broadly distributed species. Complex demographic processes including population expansion, isolation, and secondary contact were revealed through the use of an extensive modelling framework. Moreover, our results highlighted the necessity of accounting for local variation in recombination rate, a key driver of linked selection. Altogether, these processes influence the efficacy of selection and can favour the accumulation of mutations affecting fitness.
Our findings suggest that such approaches offer enormous potential in the field of conservation genomics to disentangle the impacts of historical vs. recent drivers of demographic declines and for assessing the distribution of not only putatively beneficial, but also deleterious variants. We propose that future studies should also integrate in-depth analysis of selective sweeps which will be necessary to investigate into more details how linked selection, through background selection and hitchhiking, acts to maintain deleterious mutations in the genome [96]. Finally, while it has become routine with new genome-wide datasets to focus on the effect of positive selection and documenting patterns of local adaptation [7], this focus ignores the fundamental prediction that most new mutations are likely to be deleterious [85]. An increased focus on deleterious mutations would provide a more nuanced view of genomic evolution in wild species which would benefit both the fields of evolutionary and conservation genetics

Genotyping By Sequencing
A total of 2,088 individuals was collected from 58 sample sites located along the Pacific coast from California to Alaska (S1 Table and Figure 1). DNA was extracted from all individuals and sequenced using a GBS method (protocol detailed in [97]). Reads were aligned to the Coho salmon reference genome v1 (GCF_002021745.1) using bwa-mem 0.7.13 [102]. Samtools v1.7 was used to keep reads with a mapping quality above 20, remove supplmentary alignment and unmapped read. Variants were then called with Stacks v1.46 [98]. To do so, the module "pstacks" was used with a minimum depth of 5, and up to three mismatches were allowed in catalog assembly. The module "populations" was run to produce a vcf file that was filtered with a custom python script.We performed stringent filtering to remove SNPs that were 1) genotyped in less than 60% of the individuals; 2) at a mean depth of sequencing below 7, and 3) with observed heterozygosity above 0.60, thus resulting in 93,000 SNPs. The pipeline for SNP calling is available on github at https://github.com/enormandeau/stacks_workflow/releases/tag/coho_demography_paper. Next, we removed any individuals with more than 5% missing data and finally only kept SNPs present in at least

Genetic diversity and ancestral populations
For each sampling location we estimated the observed heterozygosity and π using vcftools 0.1.16 [99] and hierfstat [100] The most likely ancestral populations were identified using βST [46]. A total of 1,000 bootstraps was performed to obtain the 95% confidence intervals around the βST. Weir and Cockerham's FST estimator θ [47] was computed in vcftools. We measured the relationship between observed heterozygosity, βST, FST and the distance to the southernmost site using linear models. We also verified the relationship between FST and the distance to the southernmost site using Mantel tests with 10,000 permutations. Vcftools was also used to identify singletons (i.e. variants present in one single individual across the whole dataset).
Their distributions were then summed in each locality. We then computed the averaged (min, max and median) number of singletons at the regional level. The scripts are available on github at https://github.com/ QuentinRougemont/utility_scripts

Population structure, admixture and gene flow
Levels of ancestry and admixture proportions were inferred with the snmf function implemented in the R package LEA [101]. We allowed less than 5% of missing data. We then kept a set of SNPs in approximate linkage equilibrium by removing pairs of variants with r 2 greater than 0.2 (option --indep-pairwise 50 10 0.2) resulting in 40,632 SNPs. K-values ranging between 1 and 60 were tested and cross-validations were obtained using the cross-entropy criterion with 5% of masked genotypes. The default value for the regularization parameter was used to avoid forcing individuals into groups and hence underestimating admixture. Similar results were obtained from Admixture [102] and are not presented here. Genetic relationship among all salmon was assessed using a PCA with the R package ade4 [103] based on the LDpruned dataset (40,632 SNPs). We used a 1% minor allele frequency (MAF) threshold and allowed less than 4% missing data. Formal tests of admixture were performed using Treemix [49] using the LD-pruned dataset of 40,632 SNPs and without any MAF threshold. A MDS was also constructed using plink and plotted with the ggplot2 [105] R package. We ran Treemix allowing up to 20 migration events and performed 500 bootstraps replicates of the best model to obtain robustness of the nodes. The "best" model was inferred after inspecting the relevant migration edges by measuring the percentage of variance explained as migration edge were added to the tree as well as by assessing the p-value associated to each migration edge. A total 500 bootstraps replicate was performed under the "best" model and under a model without migration to infer the robustness of the nodes. The scripts are available on github at https://github.com/QuentinRougemont/treemix_workflow

Explicit demographic inferences.
We tested alternative hypotheses of secondary contact between major regional groups (Haida Gwaii, California, Cascadia (Washington-Oregon), Alaska, British Columbia, Thompson R.) of populations by comparing alternative divergence scenarios represent in Fig S12 and initially described in [19,76].
Alternative hypotheses of secondary contacts were tested between major groups by testing the significance of alternative divergence scenarios. The four major models tested included a model of Secondary Contact Gene flow can be asymmetric, so that two independent migration rates m12 (from population 2 to 1) and m21 Models were fitted using the diffusion theory implemented in ∂a∂i [50] and also includes the effect of linked selection and barriers to gene flow as detailed in [19,104]. ∂a∂i uses the SFS as a summary of the data. For a given demographic model, the SFS is computed using diffusion approximation and compared to the empirical SFS using AIC. Here, we started from the whole file containing 200,000 SNPs and used one single SNP per GBS locus, filtered the data to minimize missing data. No MAF was used and singletons were kept to obtain ascertainment-free estimates of demographic parameters. Ideally, no Hardy-Weinberg Equilibrium (HWE) filter should be used for demographic inferences, as this also biases the distribution of allele frequencies. However, to remove paralogs present in the Coho salmon genome, a permissive HWE filter based on a p-value of 0.0001 was used. Here, a total of 69 pairwise comparisons between populations from the major regional groups was performed in order to test for a prevailing pattern. Here, the regional groups considered were all previously unglaciated areas during the LGM, namely California, Cascadia, Alaska

Analyses based on whole genome resequencing data.
We used 55 individuals representing 11 populations from California to Alaska (S2 Table). Each individual was sequenced on an Illumina platform using paired-end 150 bp reads. Reads were processed using fastp for trimming [105], bwa mem v0.7.13 [106]for mapping, samtools v1.7 requiring a minimum quality of 10, and picard to remove duplicates (http://broadinstitute.github.io/picard/). Then SNP calling was performed using GATK [107]. Genotypes were filtered for depth between 10 and 100 reads to remove low confidence genotypes including potential paralogs, displaying high coverage. Then following GATK Best Practices we excluded all sites that did not match the following criterion: MQ < 30, QD < 2, FS > 60, MQRankSum < -20, ReadPosRankSum < 10, ReadPosRankSum > 10. We also generated a vcf file using the samtools mpileup pipeline, merging individuals with bcftools and performing the same stringeant filtering as with the vcf constructed with GATK. Finally, we also generated a separate vcf file using the emit-all-site option to call variable and invariable sites across the whole genome. This vcf file was used in the sliding windows analysis below to test the effect of linked selection. The whole pipeline is available on github We computed pairwise linkage disequilibrium using vcftools with the r 2 statistics calculated in all 11 populations separately. A window of 1,000,000 base pairs was used and all SNPs were included. To reduce the number of SNPs, we allowed no missing data and used a MAF of 10% and a p-value of Hardy-Weinberg disequilibrium of 0.05 in each population, keeping between 2 and 4 million SNPs depending on the population. We then estimated LD decay by plotting LD against physical distance measured in base pairs and using smoothing functions implemented in ggplot2 [108] package in R. vcftools was also used to identify singletons, for which the distribution were counted by localities, as for the GBS dataset.
We used LDHat software [109] to estimate effective recombination rates (P=4.Ne.r where r represents the recombination rate per generation and Ne is the effective population size) along the genome.
Unphased genotypes were converted into LDHat format using vcftools with a minimum MAF of 10% since only common variants are useful for such inferences [109]. Following the authors' guidelines, the genome was split in chunks of 2,000 SNPs with overlapping windows of 500 SNPs to compute recombination rate and data were then merged together. We measured recombination rates for each river as well as globally including all populations except the population from the Thompson R. watershed that was too divergent from the remaining samples. The pipeline to reproduce the analysis is available on github (https://github.com/QuentinRougemont/LDhat_workflow).

Clarifying the role of linked selection
The genetic landscape of divergence was measured by estimating levels of nucleotide diversity (π), gene density, levels of genetic differentiation (FST), levels of divergence (Dxy), and scaled recombination rate along the genome. Estimates of π, FST, and Dxy were computed into 500kb windows using Python scripts available from [110]. Gene density was computed directly from the gff file of the and measured in 500 kb windows. Recombination rates were averaged into 250 and 500-kb windows using a Python script after being estimated with LDhat as described above. We used a PCA to obtain a synthetic view across all 11 π estimates, and all pairwise 55 Dxy and 55 FST separately. This allowed capturing the common variation affecting these three estimates. First, simple Spearman correlations based on linear models testing each (z-transformed) variable projected on the PC1 axis separately were carried out to produce fig 5.
Then Mixed Linear Models were used to test for correlations between either π, Dxy, or FST and either recombination landscape (rho) or gene density windows as explanatory variables, with and without interaction. Correlations were also calculated considering each π, Dxy, and FST by population separately without PCA transformation (S9 Table) which returned patterns that were congruent with those observed with PCA results.

Measuring GC content, Non-Synonynous and Synonymous diversity
The pipeline developed by [111] was used to compute Tajima's π estimator of nucleotide diversity and the GC content over non-overlapping 10-kb windows. Then we concatenated the gene into different classes according to their length and compute πN and πS over 4 mb windows, which allows circumventing the problem associated to low πs values. We then verified the correlation between populations scaled recombination rate (P), measured over 250-kb windows and GC at the third codon position (approximately neutral). To do so, we assigned a P value for each gene according to its position into each 250-kb windows.
Then we averaged P values over the same length class as GC3 classes and compared the median population recombination rate estimates and GC3s over 4-mb windows. Finally, we measured the correlation between GC3s and πN/πS using linear models.

Estimating ancestral and derived alleles
We used the derived allele count as an estimator of deleterious allele count. We used the genomes of three outgroup species, the chinook salmon, the rainbow trout, and the Atlantic salmon, to classify SNPs as ancestral or derived. Whole genome data for the chinook salmon (n = 3 individuals) were provided by one of us (B. Koop, unpublished), whereas rainbow trout (n = 5) and Atlantic salmon (n = 5) data were downloaded from NCBI Sequence Read Archive (rainbow trout, SRA, Bioproject : SRP117091; Salmo salar SRA Bioproject: SRP059652). Each individual was aligned against the Coho salmon V1 reference genome (GCF_002021745.1) using GATK UnifiedGenotyper and calling every SNP using the EMIT_ALL_SITES modes. We then constructed a Python script and determined the ancestral state of the GBS SNPs if 1) the SNP was homozygous in at least two of the three outgroups, and 2) match one of the two alleles identified in Coho salmon. Otherwise, the site was inferred as missing and not used in subsequent analyses.

Measuring damaging impact of non-synonymous alleles
We tested differences in mutation load among populations as follows. The software Provean [112] was used to predict whether a given non-synonymous mutation was deleterious with a threshold score of -2.5 or less using the pipeline available at https://github.com/QuentinRougemont/gbs_synonymy_with_genome. We analysed data in two ways: first we counted the total number of putative homozygous deleterious alleles per individual as well as the total number of deleterious alleles (both in homozygous and heterozygous states) using: Ntotal = 2 Χ Nhomo + Nhetero [34]. These individual values were then averaged per population and major regional groups (i.e., California, Cascadia, British Columbia, Haida Gwaii, Thompson, and Alaska). We then computed derived allele frequencies (DAF) in all sampling locations and across mutation categories (synonymous, non-synonymous, and non-synonymous deleterious) and tested for significant allele frequency differences among populations in non-synonymous and non-synonymous deleterious mutations using Wilcoxon rank sum tests. DAF spectra were constructed for all population separately. For the ease of visualisation, we also constructed DAF spectra by region for a sample of size n = 100 individuals (Fig 7A).
This size was chosen according to the smallest sample size of the three combined Haida Gwaii populations.
Finally, we tested for a preferential enrichment of "hotspots or coldspots" of recombination in deleterious mutations. To define a coldspot, we first computed a lower bound that we defined as the mean RHO -5*standard errors for each chromosome separately. Similarly, 'hotspots" of recombination were identified using a upper bound defined as meanRHO + 5 standard errors. We then tested if the average recombination rate of each of the 250 kb windows was falling below the threshold (for coldspot) or above it (for hotspots). We then tested if recombination hostpots and coldspots contained more or less putatively deleterious mutations than "normally" recombining regions using χ 2 tests. Linear mixed effects models were further used to test if there was a relationship between recombination rate and the distribution of putatively deleterious mutations.
The response variable was the deleterious state considered binomial (0 = non deleterious, 1 = deleterious) and the explanatory variable was the recombination rate. The chromosome identity was included as a random effect. This model was compared against a null model excluding recombination rate. The analysis was replicated but using the synonymous and non-synonymous state as the response variable instead of the putatively deleterious state of the considered mutation. Models were carried out in R using the LME4 package [113]. Finally, we used SnpEff v3.4 [114] to obtain the functional annotations of the putatively deleterious variants. The annotations were comprised of mis-sense variants, non-coding transcripts, 3' and 5' untranslated regions, 5kb up-and down-stream variants, intergenic and intronic, splice acceptor and splice region, stop gained and start loss (S14 Table). We found 35% of deleterious mutations to be missense variants, 38% to be non-coding transcript and 22% to be either upstream or downstream gene variants, with the remaining being spread over the categories.

Ethic Statement
A permit number SIRUL 111722 was obtained to work on DNA sequences.
Raw data will be deposit on NCBI together with Short Read Archive (SRA) accession number. population in two population; Tsc: duration of the secondary contact P: proportion of the genome freely exchanged (1-P provides the proportion of the genome non-neutrally exchanged); Q: proportion of the genome with a reduced effective population size due to selection at linked sites; hrf = Hill-Robertson factor representing the reduction of Ne in the region Q with reduced Ne.
S8 Table: A) Spearman correlation association to the comparison in Figure 6. B) Linear models testing the combined effect of recombination (Rho) and gene density (Gene count). Interaction terms were not significant for Pi and Dxy and were removed. S9 Table: spearman correlation obtained between recombination and Dxy, Fst and Pi when considering each possible pairs of river separately (n = 55) or each river independently (for Pi only). S10 Table: Results of liear models testing the correlation between piN/piS and GC3. S11 Table: Summary of deleterious variation by region. 1)Derived Allele Frequency (DAF) of deleterious mutation, after averaging by rivers and then by major regional group. 2) Count of deleterious mutations in each rivers and then averaged by major regional group.
3) Number of homozyguous derived deleterious mutations by individual, after averaging by rivers and then by major regional group. 4) Number of heterozygous mutations by individuals, after averaging by rivers and then by major regional group 5) Total load of derived deleterious mutations by individuals, after averaging by rivers and then by major regional group. S12 Table: Results of Wilcoxon test (Man-Whitney tests) for differences in derived allele frequencies among major groups for a sample of size 100. S13 Table: Results of Wilcoxon test (Man-Whitney tests) for differences in count of derived homozygous variants and total load among individuals in each region. S14