Genomics of Mesolithic Scandinavia reveal colonization routes and high-latitude adaptation

Scandinavia was one of the last geographic areas in Europe to become habitable for humans after the last glaciation. However, the origin(s) of the first colonizers and their migration routes remain unclear. We sequenced the genomes, up to 57x coverage, of seven hunter-gatherers excavated across Scandinavia and dated to 9,500-6,000 years before present. Surprisingly, among the Scandinavian Mesolithic individuals, the genetic data display an east-west genetic gradient that opposes the pattern seen in other parts of Mesolithic Europe. This result suggests that Scandinavia was initially colonized following two different routes: one from the south, the other from the northeast. The latter followed the ice-free Norwegian north Atlantic coast, along which novel and advanced pressure-blade stone-tool techniques may have spread. These two groups met and mixed in Scandinavia, creating a genetically diverse population, which shows patterns of genetic adaptation to high latitude environments. These adaptations include high frequencies of low pigmentation variants and a gene-region associated with physical performance, which shows strong continuity into modern-day northern Europeans.


Introduction
As the ice-sheet retracted from northern Europe after the Last Glacial Maximum (LGM), around 23,000 years ago, new habitable areas emerged [1] allowing plants [2,3] and animals [4,5] to recolonize the Scandinavian peninsula (hereafter Scandinavia). There is consistent evidence of human presence in the archaeological record from c. 11,700 years before present (BP), both in southern and northern Scandinavia [6][7][8][9]. At this time, the ice-sheet was still dominating the interior of Scandinavia [9] (Fig. 1A, S1 Text), but recent climate modeling shows that the Arctic coast of (modern-day) northern Norway was ice-free [10]. Similarities in late-glacial lithic technology (direct blade percussion technique) of western Europe and the oldest counterparts of northernmost Scandinavia [11] (S1 Text) have been used to argue for a postglacial colonization of Scandinavia from southwestern Europe. However, studies of a new lithic technology, 'pressure blade' technique, which first occurred in the northern parts of Scandinavia, indicates contacts with groups in the east and possibly an eastern origin of the colonizers [7,12,13] (S1 Text). The first genetic studies of Mesolithic human remains from central and eastern Scandinavia (SHGs) revealed similarities to two different Mesolithic European populations, the 'western huntergatherers' (WHGs) from western, central and southern Europe and the 'eastern hunter-gatherers' what routes they followed, how they were related to other Mesolithic Europeans [15][16][17][18][19]24]and to investigate human adaptation to high-latitude environments.

Results and Discussion
We sequenced the genomes of seven hunter-gatherers from Scandinavia (Table 1; S1 Text, S2 Text, S3 Text) ranging from 57.8× to 0.1× genome coverage, of which four individuals had a genome coverage above 1×. The remains were directly dated to between 9,500 BP and 6,000 BP, and were excavated in southwestern Norway (Hum1, Hum2), northern Norway (Steigen), and the Baltic islands of Stora Karlsö and Gotland (SF9, SF11, SF12 and SBj) and represent 18% (6 of 33) of all known human remains in Scandinavia older than 8,000 [25]. All samples displayed fragmentation and cytosine deamination at fragment termini characteristic for ancient DNA (S3 Text). Mitochondrial (mt) DNA-based contamination estimates were <6% for all individuals and autosomal contamination was <1% for all individuals except for SF11, which showed c. 10% contamination ( Table 1, S4 Text). Four of the seven individuals were inferred to be males, three were females. All the western and northern Scandinavian individuals and one eastern Scandinavian carried U5a1 mitochondrial haplotypes while the remaining eastern Scandinavians carried U4a haplotypes (Table 1, S5 Text). These individuals represent the oldest U5a1 and U4 lineages detected so far. The Y chromosomal haplotype was determined for three of the four males, all carried I2 haplotypes, which were common in pre-Neolithic Europe (Table 1, S5 Text). Table 1: Information on the seven Scandinavian hunter-gatherers investigated in this study, including calibrated date before present (cal BP) corrected for the marine reservoir effect, given as a range of two standard deviations, average genome coverage, average mitochondrial (mt) coverage, mt and Y chromosome haplogroups and contamination estimates based on the mt, the X-chromosome for males and the autosomes.  [14][15][16][17].
Thus, many genetic variants found in Mesolithic individuals have not been carried over to modern-day groups. Among the novel variants in SF12, four (all heterozygous) are predicted to affect the function of protein coding genes [28] (S3 Text). The 'heat shock protein' HSPA2 in SF12 carries an unknown mutation that changes the amino acid histidine to tyrosine at a proteinprotein interaction site, which likely disrupts the function of the protein (S3 Text). Defects in HSPA2 are known to drastically reduce fertility in males [29]. Although SF12 herself would not be affected by this variant, her male offspring could carry the reduced fertility variant, and it will be interesting to see how common this variant was among Mesolithic groups as more genome sequence data become available. The high-quality diploid genotype calls further allowed us to genetically predict physical appearance, including pigmentation, and to use a model-based approach trained on modern-day faces and genotypes [30,31] to create a 3D model of SF12's face (S9 Text). This represents a new way of reconstructing an ancient individual's facial appearance from genetic information, which is especially informative in cases such as for SF12, where only post-cranial fragments were available, and future archaeogenetic studies will have the potential to many individuals appearance from past times.

Demographic history of Mesolithic Scandinavians
In order to compare the genomic data of the seven SHGs to genetic information from other ancient individuals and modern-day groups, data was merged with six published Mesolithic  1A, B, S6 Text). This is consistent with a scenario in which SHGs represent a mixed group tracing parts of their ancestry to both the WHGs and the EHGs [15][16][17]20,38].
To investigate the postglacial colonization of Scandinavia, we explored four hypothetical migration routes (primarily based on natural geography) linked to WHGs and EHGs, respectively (S11 Text); a) a migration of WHGs from the south, b) a migration of EHGs from the east across the Baltic Sea, c) a migration of EHGs from the east and along the north-Atlantic coast, d) a migration of EHGs from the east and south of the Baltic Sea, and combinations of these four migration routes. These scenarios allow us to formulate expected genetic affinities for northern, western, eastern, and central SHGs (S11 Text). The SHGs from northern and western Scandinavia show a distinct and significantly stronger affinity to the EHGs compared to the central and eastern SHGs (Fig. 1). Conversely, the SHGs from eastern and central Scandinavia were genetically more similar to WHGs compared to the northern and western SHGs (Fig. 1). Using [16,17] Fig. 1A, S6 Text). These patterns of genetic affinity within SHGs are in direct contrast to the expectation based on geographic proximity with EHGs and WHGs and do not correlate with age of the sample (S11 Text).
The archaeological record in Scandinavia shows early evidence of human presence in northern coastal Atlantic areas [13]. Stable isotope analysis of northern and western SHGs revealed an extreme marine diet, suggesting a maritime subsistence, in contrast to the more mixed terrestrial/aquatic diet of eastern and central SHGs (S1 Text). Combining these isotopic results with the patterns of genetic variation, we suggest an initial colonization from the south, likely by WHGs. A second migration of people who were related to the EHGs -that brought the new pressure blade technique to Scandinavia and that utilized the rich Atlantic coastal marine resources -entered from the northeast moving southwards along the ice-free Atlantic coast where they encountered WHG groups. The admixture between the two colonizing groups created the observed pattern of a substantial EHG component in the northern and the western SHGs, contrary to the higher levels of WHG genetic component in eastern and central SHGs (Fig. 1, S11 Text).
By sequencing complete ancient genomes, we can compute unbiased estimates of genetic diversity, which are informative of past population sizes and population history. Here, we restrict the analysis to WHGs and SHGs, since only SNP capture data is available for EHGs (S7 Text). In current-day Europe, there is greater genetic diversity in the south compared to the north. During the Mesolithic, by contrast, we find higher levels of genetic diversity (S7 Text) as well as lower levels of runs of homozygosity ( Fig. 2A) and linkage disequilibrium ( Fig. 2B) in SHGs compared to WHGs (represented by Loschbour and Bichon, [16,32]) and Caucasus hunter-gatherers (CHG, represented by Kotias and Satsurblia, [32]). Using a sequential-Markovian-coalescent approach [39] for the high-coverage, high quality genome of SF12, we find that right before the SF12 individual lived, the effective population size of SHGs was similar to that of WHGs ( Fig. 2C). At the time of the LGM and back to c. 50,000 years ago, both the WHGs and SHGs go through a bottleneck, but the ancestors of SHGs retained a greater population size in contrast to the ancestors of WHGs who went through a more severe bottleneck (Fig. 2C). Around 50,000-70,000 years ago, the effective population sizes of the ancestors of SHGs, WHGs, Neolithic groups (represented by Stuttgart [16]) and Paleolithic Eurasians (represented by Ust-Ishim [36]) align, suggesting that these diverse groups all trace their ancestry back to a common ancestral group which likely represents the early migrants out-of-Africa, who likely share a common ancestry outside of Africa.

Adaptation to high-latitude environments
With the aim of detecting signs of adaptation to high-latitude environments and selection during and after the Mesolithic, we employed three different approaches that utilize the Mesolithic genomic data. In the first approach, we assumed that SHGs adapted to high-latitude environments of low temperatures and seasonally low levels of light, and searched for gene variants that carried over to modern-day people in northern Europe. As we have already noted, modern-day northern Europeans trace limited amount of genetic material back to the SHGs (due to the many additional migrations during later periods), and any genomic region that displays extraordinary genetic continuity would be a strong candidate for adaptation in people living in northern Europe across time. We designed a statistic, D sel (S10 Text), that captures this specific signal and scanned the whole genome for gene-variants that show strong continuity (little differentiation) between SHGs and modern-day northern Europeans while exhibiting large differentiation to modern-day southern European populations [40] (Fig. 3A; S10 Text). Six of the top ten SNPs with greatest D sel values were located in the TMEM131 gene that has been found to be associated with physical performance [41], which could make it part of the physiological adaptation to cold [42]. This genomic region was more than 200kbp long and showed the strongest haplotypic differentiation between modern-day Tuscans and Finns (S10 Text). The particular haplotype was relatively common in SHGs, it is even more common among today's Finnish population (S10 Text), and showed a strong signal of local adaptation (S10 Text). Other top hits included genes associated with a wide range of metabolic, cardiovascular, developmental and psychological traits (S10 Text) potentially linked to physiological [42].
In addition to performing this genome-wide scan, we studied the allele frequencies in three pigmentation genes (SLC24A5, SLC45A2, having a strong effect on skin pigmentation, and caused by adaptation to high-latitude environments [43,45].

Conclusion
By combining information from climate modeling, archaeology and Mesolithic human genomes, we were able to reveal the complexity of the early colonization process of Scandinavia and human adaptation to high-latitude environments. We disentangled migration routes and linked them to particular archaeological patterns, demonstrate greater genetic diversity in northern Europe compared to southern Europe -in contrast to modern-day patterns -and show that many genetic variants that were common in the Mesolithic have been lost today. These finds reiterate the importance of human migration for dispersal of novel technology in human prehistory [14][15][16][17]23,24,38,[47][48][49][50] and the many partial population turnovers in our past.

Sample preparation
Genomic sequence data was generated from teeth and bone samples belonging to seven (eight, including SF13) Mesolithic Scandinavian hunter-gatherers (S1 Text were UV irradiated (6 J/cm 2 at 254 nm). After removing one millimeter of the surface, approximately 30-100 mg of bone was powderized and DNA was extracted following silicabased methods as in [51] with modifications as in [49,52] or as in [53] and eluted in 25-110 µl of EB buffer. Between one and 16 extractions were made from each sample and one extraction blank with water instead of bone powder was included per six to ten extracts. Blanks were carried along the whole process until qPCR and/or PCR and subsequent quantification.
DNA libraries were prepared using 20µl of extract, with blunt-end ligation coupled with P5 and P7 adapters and indexes as described in [49,54]. From each extract one to five double stranded libraries were built. Since aDNA is already fragmented the shearing step was omitted from the protocol. Library blank controls including water as well as extraction blanks were carried along during every step of library preparation. In order to determine the optimal number of PCR cycles for library amplification qPCR was performed. Each reaction was prepared in a total volume of  treatment was included in order to remove deaminated cytosines [55]. Quantitative PCR (qPCR) was performed in order to quantify the number of molecules and the optimal number of PCR cycles prior to amplification for each DNA library. Furthermore, this step included extraction blanks, library blanks and amplification blanks to monitor potential contamination. All of these negative controls showed an optimal cycle of amplification significantly higher to those of our ancient DNA libraries (>10 cycles) and they were thus deemed as negative. Our experimental results show minimal levels of contamination, which is in concordance with mitochondrial DNA and X chromosome estimates of contamination (see S4 Text and Table 1). Each reaction was done in a total volume of 25 μl, containing 1 ul of DNA library, 1X MaximaSYBRGreen mastermix and 200nM each of IS7 and IS8 [54] reactions were set up in duplicate.The PCRs were set up using a similar system as for the non-damage repair samples (in quadruplicates that were pooled when the PCR products were cleaned), with the difference of using AccuPrime DNA polymerase instead of AmpliTaqGold (Thermofisher) and the following PCR conditions; an activation step at 95°C for 2 min followed by 10-16 cycles of 95°C for 15s, 60°C for 30s and 68°C for 1min, and a final elongation step of 68°C for 5min. Blank controls including water as well as extraction blanks were carried out during every step of library preparation. Amplified libraries were pooled, cleaned, quantified and sequenced in the same manner as non-damage repaired libraries. In order to sequence libraries to depletion, two to eight libraries were pooled together and sequenced until reaching a clonality of >50%, if the clonality was lower, the library was either classified as unproductive or when the sequencing goal (>55X coverage) was reached and further sequencing was deemed unnecessary. Sequencing was performed as above.

Bioinformatic data processing and authentication
Paired-end reads were merged using MergeReadsFastQ_cc.py [56], if an overlap of at least 11 base pairs was found the base qualities were added together and any remaining adapters were trimmed. Merged reads were then mapped single ended with bwa aln 0.7.13 [57] to the human reference genome (build 36 and 37) using the following non-default parameters: seeds disabled -l 16500 -n 0.01 -o 2 [15,16]. To remove PCR duplicates, reads with identical start and end positions were collapsed using a modified version, to ensure random choice of bases, of FilterUniqSAMCons_cc.py [56]. Reads with less than 10 % mismatches to the human reference genome, reads longer than 35 base pairs and reads with mapping quality higher than 30 were used to estimate contamination.
The genetic data obtained from the two bone elements SF9 and SF13 showed extremely high similarities, which suggested that the two individuals were related. Using READ [58], a tool to estimate kin-relationship from ancient DNA, SF9 and SF13 were classified as either identical twins or the same individual. Therefore, we merged the genetic data for both individuals and refer to the merged individual as SF9 throughout the genetic analysis.
All data shows damage patterns indicative of authentic ancient DNA (S3 Text). Contamination was estimated using three different sources of data: (I) the mitochondrial genome [59], (II) the X chromosome if the individual was male [60,61] and (III) the autosomes [62]. Low contamination estimations over the three different approximations were interpreted as data mapping to the human genome being largely endogenous (S4 Text).

Analysis of demographic history
Most population genomic analyses require a set of reference data for comparison. We compiled three different data sets from the literature and merged them with the data from ancient individuals (S6 Text). The three reference SNP panels were: • The Human Origins genotype data set of 594,924 SNPs genotyped in 2,404 modern individuals from 203 populations [16,37].
• A panel of 1,055,209 autosomal SNPs which were captured in a set of ancient individuals by Mathieson et al [18]. Yorubans) in Yorubans of the 1000 genomes project [27]. Those SNPs were extracted using vcftools [63].
These data sets were merged with ancient individuals of less than 15x genome coverage using the following approach: for each SNP site, a random read covering that site with minimum mapping quality 30 was drawn (using samtools 0.1.19 mpileup [64]) and its allele was assumed to be homozygous in the ancient individual. Transition sites were coded as missing data for individuals that were not UDG treated and SNPs showing additional alleles or indels in the ancient individuals were excluded from the data.
Six high coverage ancient individuals (SF12, NE1 [24], Kotias [32], Loschbour [16], Stuttgart [16], Ust-Ishim [36]) used in this study were treated differently as we generated diploid genotype calls for them. First, the base qualities of all Ts in the first five base pairs of each read as well as all As in the last five base pairs were set to 2. We then used Picard [65] to add read groups to the files. Indel realignment was conducted with GATK 3.5.0 [66] using indels identified in phase 1 of the 1000 genomes project as reference [27]. Finally, GATK's UnifiedGenotyper was used to call diploid genotypes with the parameters -stand_call_conf 50.0, -stand_emit_conf 50.0, -mbq 30, -contamination 0.02 and --output_mode EMIT_ALL_SITES using dbSNP version 142 as known SNPs. SNP sites from the reference data sets were extracted from the VCF files using vcftools [63] if they were not marked as low quality calls. Plink 1.9 [67,68] was used to merge the different data sets. We performed principal component analysis (PCA) to characterize the genetic affinities of the ancient Scandinavian genomes to previously published ancient and modern genetic variation.
PCA was conducted on 42 present-day west Eurasian populations from the Human Origins dataset [16,37], using smartpca [69] with numoutlieriter: 0 and lsqproject: YES options. A total of 59 ancient genomes (52 previously published and 7 reported here) (Table S6. Here we used high-coverage Mota [35], Yoruba [27] and Chimp genome as (A) outgroups. The software popstats [70] was used to calculate f4 statistics, in order to estimate shared drift between groups. Standard errors and Z scores for f4 statistics were estimated using a weighted block jackknife (Fig. 1C).

Model-based clustering:
A model-based clustering algorithm, implemented in the ADMIXTURE software [71], was used to estimate ancestry components and to cluster individuals.
ADMIXTURE was conducted on the Human Origins data set [16,37], which was merged with the ancient individuals as described above. Data was pseudo-haploidized by randomly selecting one allele at each heterozygous site of present-day individuals. Finally, the dataset was filtered for linkage disequilibrium using PLINK [67,68]  In addition to ADMIXTURE, we assessed the admixture patterns in Mesolithic Scandinavians using a set of methods implemented in ADMIXTOOLS [37], qpWave [73] and qpAdm [16,17].
Both methods are based on f4 statistics, which relate a set of test populations to a set of outgroups in different distances from the potential source populations. We used the following set of target and source are used as L (with T being the base population) and f4 statistics with outgroups from R (same as above) are calculated. The rank of the resulting matrix is then set to the number of sources minus one, which allows to estimate the admixture contributions from each populations in S to T. The results are shown in Fig. 1.

Runs of homozygosity:
Heterozygosity is a measurement for general population diversity and its effective population size. Analyzing the extent of homozygous segments across the genome can also give us a temporal perspective on the effective population sizes. Many short segments of homozygous SNPs can be connected to historically small population sizes while an excess of long runs of homozygosity suggests recent inbreeding. We restricted this analysis to the six high coverage individuals (SF12, NE1, Kotias, Loschbour, Stuttgart, Ust-Ishim) for which we obtained diploid genotype calls and we compared them to modern individuals from the 1000 genomes project. The length and number of runs of homozygosity were estimated using Plink 1.9 [67,68] and the parameters --homozyg-density 50, --homozyg-gap 100, --homozyg-kb 500, --homozygsnp 100, --homozyg-window-het 1, --homozyg-window-snp 100, --homozyg-window-threshold 0.05 and --homozyg-window-missing 20. The results are shown in Fig. 2A.
Linkage disequilibrium: Similar to runs of homozygosity, the decay of linkage disequilibrium (LD) harbors information on the demographic history of a population. Long distance LD can be caused by a low effective population size and past bottlenecks. Calculating LD for ancient DNA data is challenging as the low amounts of authentic DNA usually just yields haploid allele calls with unknown phase. In order to estimate LD decay for ancient populations we first combine two haploid ancient individuals to a pseudo-diploid individual (similar to the approach chosen for conditional nucleotide diversity, S7 Text). Next, we bin SNP pairs by distance (bin size 5kb) and then calculate the covariance of derived allele frequencies (0, 0.5 or 1.0) for each bin. This way, we do not need phase information to calculate LD decay as we do not consider multilocus haplotypes, which is similar to the approach taken by ROLLOFF [37,74] and ALDER [75] to date admixture events based on admixture LD decay. For Fig. 2B, we used two modern 1000 genomes populations to scale the LD per bin. The LD between two randomly chosen PEL (Peruvian) individuals was set to 1 and the LD between two randomly chosen TSI (Tuscan) individuals was set to 0. This approach is used to obtain a relative scale for the ancient populations and we caution against a direct interpretation of the differences to modern populations as technical differences in the modern data (e.g. SNP calling or imputation) may have substantial effects.
Effective population size: We are using MSMC's implementation of PSMC' [39] to infer effective population sizes over time from single high coverage genomes. We restrict this analysis to UDGtreated individuals (SF12, Loschbour, Stuttgart, Ust-Ishim) as post-mortem damage would cause an excess of false heterozygous transition sites. Input files were prepared using scripts provided with the release of MSMC (https://github.com/stschiff/msmc-tools) and MSMC was run with the non-default parameters --fixedRecombination and -r 0.88 in order to set the ratio of recombination to mutation rate to a realistic level for humans. We also estimate effective population size for six high-coverage modern genomes [76] (Fig. 2C). We plot the effective population size assuming a mutation rate of 1.25x10e-8 and a generation time of 30 years. The curves for ancient individuals were shifted based on their average C14 date.

Detecting adaptation to high-latitude environments
We scanned the genomes for SNPs with similar allele frequencies in Mesolithic and modern-day northern Europeans, and contrast it to a modern-day population from southern latitudes. Pooling all Mesolithic Scandinavians together, we obtain an allele frequency estimate for Scandinavian hunter-gatherers (SHG) which is compared to modern-day Finnish individuals (FIN) and Tuscan individuals (TSI) from the 1000 genomes project [27]. We use the Finnish population as representatives of modern-day northern Europeans (this sample contains the largest number of sequenced genomes from a northern European population). Tuscans are used as an alternative population, who also traces some ancestry to Mesolithic populations, but who do not trace their ancestry to groups that lived at northern latitudes in the last 7-9,000 years. Our approach is similar to PBS [77] and inspired by DAnc [40], for each SNP, we calculate the statistic D sel comparing the allele frequencies between an ancestral and two modern populations: This scan was performed on all transversion SNPs extracted from the 1000 genomes data. Only sites with a high confidence ancestral allele in the human ancestor (as used by the 1000 genomes project [27]) and with coverage for at least six ancient Scandinavians were included in the computation. More information can be found in S10 Text.
Stora Förvar material, and the Norwegian Maritime Museum for excavating Hummervikholmen.
MS and PC thank the participants who volunteered for face reconstruction methods development.    [16,17] estimates of genetic ancestry for each SHG individual. The map also displays the ice sheet covering Scandinavia 10,000 BP (most credible (solid line) and maximum extend (dashed line) following [10]). Newly sequenced sites are shown in bold and italics, SF11 is excluded from this map due to its low coverage (0.1x). Additional European EHG and WHG individuals used in this study derive from sites outside this map. (B) Magnified section of genetic similarity among ancient and modern-day individuals using PCA featuring only the Mesolithic European samples (see S6 Text for the full plot). (C) Allele sharing between the SHGs, Latvian Mesolithic hunter-gatherers [33] and EHGs vs WHGs measured by f 4 (Chimpanzee, SHG; EHG, WHG) calculated for the captured SNPs for the EHGs [18]. Error bars show two block-jackknife standard errors.   525  526  527  528  529  530  531  532  533  534  535  536  537