Increased ultra-rare variant load in an isolated Scottish population impacts exonic and regulatory regions

Human population isolates provide a snapshot of the impact of historical demographic processes on population genetics. Such data facilitate studies of the functional impact of rare sequence variants on biomedical phenotypes, as strong genetic drift can result in higher frequencies of variants that are otherwise rare. We present the first whole genome sequencing (WGS) study of the VIKING cohort, a representative collection of samples from the isolated Shetland population in northern Scotland, and explore how its genetic characteristics compare to a mainland Scottish population. Our analyses reveal the strong contributions played by the founder effect and genetic drift in shaping genomic variation in the VIKING cohort. About one tenth of all high-quality variants discovered are unique to the VIKING cohort or are seen at frequencies at least ten fold higher than in more cosmopolitan control populations. Multiple lines of evidence also suggest relaxation of purifying selection during the evolutionary history of the Shetland isolate. We demonstrate enrichment of ultra-rare VIKING variants in exonic regions and for the first time we also show that ultra-rare variants are enriched within regulatory regions, particularly promoters, suggesting that gene expression patterns may diverge relatively rapidly in human isolates.

166 Our results indicate that the VIKING samples are significantly enriched for ultra-rare SNPs 168 (1.16 fold) and INDELs (1.22 fold) not observed in gnomAD (Table 1). Importantly, the observed 169 enrichment is not driven by a greater individual-specific variation in Shetlanders; in fact, a VIKING 170 individual carries less than two-thirds of the number of ultra-rare singleton variants compared to an Again, the overall enrichment is driven by the shared ultra-rare variants (i.e. ≥doubleton) -3.03 fold 180 ultra-rare SNP enrichment (p = 2.4x10 -12 ) and 2.65 fold ultra-rare INDEL enrichment (p = 1.7x10 -12 ) -181 whereas the two cohorts exhibit very similar levels of individual-specific ultra-rare variation and their 182 difference is not significant. 183 These data suggest that genetic drift has increased the frequency of many ultra-rare variants 184 in Shetland compared to those in Lothian. On average, a Shetland individual carries about 2.6 times 185 more ultra-rare variants shared with at least one other Shetlander, compared to the ultra-rare 186 variants shared within the Lothian individuals (Table 1). There is also a small but significant depletion 187 of very rare known variants (MAF NFE ≤ 1%) in VIKING, again due to the action of genetic drift whereby 188 many rare variants are expected to be lost in the population. 5'UTR (a total length of 9.3M bases), exon (30Mb), intron (906Mb), 3'UTR (27.6Mb) and ncRNA 194 regions (7.3Mb); the remaining 1.1Gb of the mappable regions in the reference human genome are 195 labelled as "non-coding" regions (Materials and Methods). To make data from different regions 196 comparable, we examined the number of variant alleles per megabase and used the same 197 framework as for the genome-wide analysis to quantify the observed differences for each of the 198 considered regions. The full results are available in Table 2 and Table 3 in S2 File and illustrated in Fig  199  9 terms of variant loads for ultra-rare and very rare variants; the results for these two regions are 201 presented in Fig 1.  202 Our results show that VIKING samples are significantly enriched for ultra-rare SNPs in all 203 coding related regions -including exonic regions -while potentially more damaging ultra-rare 204 INDELs are restricted to non-coding and intronic regions. The observed exonic enrichment of ultra-205 rare SNPs is similar to the levels of enrichment seen genome-wide and in non-coding regions, 206 demonstrating that exonic regions in the VIKING cohort have not been protected from the general 207 accumulation of ultra-rare variation in spite of their functional importance. Indeed, the median 208 enrichments seen in exons, 3'UTR and 5'UTR regions are somewhat higher than the genome-wide 209 median enrichment.

215
Significance: at least 95% of the 10,000 subsets have p-value ≤ 8x10 -4 (Bonferroni corrected) and no overlap between the 216 95% CI for the LBC and the VIKING median values (for full results see Fig 4 in S1 File). The higher variance in the 5'UTR and 217 lower variance in ncRNA regions could be explained by their relatively small sizes -9.3Mb and 7.3Mb, respectively.

218
We also annotated variants within predicted functional non-coding regions using the 219 coordinates of 15 chromatin states generated for nine cell types by the NIH Roadmap Epigenomics 220 (1.8Gb) regions (Materials and Methods). Using the same approach as for the genome-wide (Table 1)  223 and coding analyses (Fig 1) to quantify variant loads for each of the chromatin states, we again found 224 that the major difference between the two cohorts is for ultra-rare variant loads (Table 4 and Table 5  225 in S2 File). The observed significant enrichment of ultra-rare SNPs in all predicted regulatory regions 226 was generally indistinguishable from the genome-wide level (Fig 2), suggesting that regulatory 227 regions -similarly to the exonic regions -do not appear to be protected from ultra-rare SNP 228 variants. 229 As for exonic regions, the median enrichment for promoters is generally somewhat higher 230 than the genome-wide enrichment, particularly for predicted promoters active in H1 embryonic 231 stem cells, HMEC primary mammary epithelial cells and NHEK epidermal keratinocyte cells (Fig 2). 232 The results for ultra-rare INDELs (Fig 5 in S1 File) are similar, but due to the small number of 233 INDELs present in these regions, the conclusions are less robust. There is no significant difference in 234 the regulatory regions for known SNPs in any of the 9 cell types ( ≤0.01% n/a n/a n/a ≤0.01% n/a n/a n/a rare 0.80% ≤0.01% n/a n/a 0.72% ≤0.01% n/a n/a very rare 31

264
There is also evidence of genetic drift for VIKING variants shared only with LBC, as well as for 265 variants shared with geographically more distant populations (Table 2). Among the VIKING ultra-rare 266 variants (i.e. not seen in gnomAD), but present in LBC, there are 18,451 SNPs (2.14%) and 1,678 267 INDELs (2.17%) with allele frequency in VIKING at least ten times higher than in LBC. Considering the 268 VIKING variants which are very rare in gnomAD Non-Finnish European population (MAF NFE ≤ 1%), 269 there are 359,275 SNPs (10.49%) and 31,713 (9.35%) INDELs with allele frequency in VIKING at least 270 ten times higher than the maximum allele frequency observed in LBC and all gnomAD populations. genetic drift in shaping the genomic variation in the isolated VIKING cohort. About one tenth of all 275 high-quality variants discovered -10.06% of the SNPs and 8.95% of the INDELs -are either unique to 276 the VIKING cohort or seen at least ten times more frequently in it compared to cosmopolitan WGS 277 populations (LBC and gnomAD). 278 Another line of evidence supporting the founder effect / genetic drift in the VIKING cohort is 279 based on the analysis of the distribution of allele frequencies across polymorphic sites, also known 280 as the site frequency spectrum (SFS) analysis (Materials and Methods). Our analysis is based on the 281 high-quality variants discovered in the callable regions of the 22 autosomal chromosomes in the two 282 cohorts of unrelated individuals, split to known variants (present in gnomAD at any frequency) and 283 ultra-rare variants (not found in any gnomAD population). 284 The proportion of known variants (Fig 6 in   analyses of the VIKING and LBC data based on the ultra-rare SNPs discovered in the two cohorts and 343 included data for protein coding and related regions (Fig 3). A comparison of the fraction of ultra-344 rare variants (FUV) and their densities in VIKING and LBC reveals that 5'UTR, exon and promoter 345 regions show the most extreme shifts, driven by accumulation of ultra-rare variants at a higher rate 346 compared to known variants in VIKING. the results are not readily comparable since their analysis was based on all (rather than only ultra-382 rare) singleton missense variants (regardless of their CADD score and not including nonsense and 383 essential splice variants) as LOF variants and reporting mean instead of median values. Since the 384 major difference in the variant load between VIKING and LBC is due to ultra-rare non-singleton 385 variants (Table 1), we relaxed the singleton requirement above and performed the same analysis 386 considering all ultra-rare variants in the two cohorts (5,365 genes with at least one LOF or SYN 387 variant observed in both cohorts). The result shows a 9.4% enrichment of ultra-rare LOF SNP alleles 388 in the VIKING cohort compared to LBC (p = 0.00064, one-sided Wilcoxon rank sum test). 389 390 391 two distinct categories based on their predicted impact. We developed a more general test, the 395 allelic shift bias (ASB) test, which is designed to assess relaxation of selection in non-coding regions, 396 based on the change in the allele frequency of variants within specific genomic regions across 397 populations, as follows. We selected all SNPs in the VIKING and LBC cohorts found in the gnomAD 398 genome dataset with MAF NFE ≤ 1% in Non-Finnish Europeans. Given their low frequencies, these 399 variants from the ancestral European population are likely to be enriched for SNPs that have been 400 subject to purifying selection. We repeatedly (1000x) randomly selected 269 LBC individuals 401 (matching the VIKING unrelated cohort size, with replacement) and selected MAF NFE ≤ 1% variants 402 shared between this LBC subset and the VIKING cohort. We then computed the mean MAF of such 403 variants for each LBC subset and the VIKING cohort in exonic, promoter, intronic and non-functional 404 intergenic (NFIG) regions (Fig 10 in S1 File). We also calculated the mean MAF of such variants for 405 non-synonymous exonic variants and the predicted deleterious promoter variants (CADD score ≥ 10; 406 predicted top 10% of the most deleterious variants genome-wide). 407 We estimated the strength of the purifying selection in each cohort as the difference 408 between the mean MAF of the selected variants observed in the NFIG regions, where the effect of 409 purifying selection is assumed to be negligible, and the mean MAF in regions assumed to be subject 410 of active purifying selection. If purifying selection acts with the same strength in two populations 411 there will be equivalent MAF differences in the two cohorts between the NFIG regions and the 412 regions being tested. However, in the scenario where purifying selection is weakened in one of the 413 populations, we expect to observe a bias towards smaller MAF differences in this population. The 414 significance of these shifts can then be measured by a nonparametric statistic comparing the 415 distributions of MAF differences between cohorts. addition, ASB testing shows a similarly widespread loss of constraint in VIKING promoter regions, 419 suggesting effects on gene expression. We observe higher MAF of very rare variants at LBC intronic 420 regions compared to VIKING, which is most likely due to the more cosmopolitan nature of the LBC 421 cohort and weaker purifying constraint in intronic compared to exonic and promoter regions.  (Table 9 in S2 File). We performed genotype-to-phenotype analysis 451 in the 500 VIKING individuals for those 13 variants and the 26 related quantitative traits for which 452 data is available, but found no significant associations (nominal p < 0.0019, Bonferroni corrected for 453 the number of traits). This was not surprising, given that we have 80% power with n = 500 and MAF 454 ≈ 0.01 to detect a variant explaining 3% (or more) of the trait variance at that significance level. 455 Variants with such effects sizes are relatively rare in generally healthy cohorts, highlighting the 456 importance of sample size. We plan to investigate the identified variants and their potential 457 phenotype correlations in ~1600 additional VIKING samples whose WES is currently underway. 458 VIKING variants in promoter regions show higher levels of enrichment for ultra-rare variants 459 than other regulatory regions (Fig 2), and analysis of the WGS data of the 269 unrelated VIKING inverse correlation between the observed frequency of a variant and the probability of it being ultra-488 rare [15,19,20,23], we are aware of no study to date that has explicitly investigated ultra-rare variant 489 loads in isolates. By using the gnomAD genomes database as a reference dataset to separate the 490 variants into ultra-rare and very rare but known (i.e. seen in gnomAD and with MAF in Non-Finnish 491 Europeans ≤ 1%), we were able to show that while the VIKING cohort is depleted for very rare known variants, it is enriched for ultra-rare variants compared to a control cosmopolitan population, 493 in particular for those shared by more than one unrelated individual in the isolated population. The 494 discovered ultra-rare and rare VIKING variants which are predicted to be functional and are 495 significantly enriched in the Shetland isolate compared to gnomAD add to the emerging catalogue of 496 ultra-rare variants from isolated cohorts correlated with various traits of medical importance 497 [20,23]. Such variants are illustrative of the potential for the so called "jackpot effect" [25]. 498 The VIKING individuals in this study were recruited as phenotypically 'normal' healthy 499 individuals and represent only our first view of the Shetland isolate, with further recruitment 500 underway. The detailed demographics and history of the Norse diaspora is still an area of active 501 research (e.g. [57]). We look forward to deep WGS data from relevant Scandinavian populations 502 (with compatible sequencing technologies and sample ascertainment) becoming available in the 503 future. Such data, combined with power increasing strategies (e.g. imputation) and continual GWAS 504 Catalog improvements, will provide much greater opportunities for discovering VIKING variants 505 correlated with various phenotypic traits. 506 The availability of high-coverage WGS data allows the interrogation of both SNP and INDEL 507 variant loads in regulatory as well as coding regions. Our results suggest that due to the reduced 508 efficiency of purifying selection, the exonic and regulatory regions in the Shetland isolate exhibit 509 ultra-rare SNP loads equal to the genome-wide level. We observe the same trend for higher levels of 510 ultra-rare INDELs in many VIKING regulatory regions, particularly promoters, but VIKING exonic 511 regions appear to be protected from short ultra-rare INDELs (of length up to 75bp), consistent with 512 the higher expected intolerance to variation in exonic compared to regulatory regions, as well as 513 with the previously reported finding that exonic regions are depleted of long (median size of several 514 kbp) copy number variant deletions [58]. Excesses of functional exonic SNPs in isolated populations 515 have been widely reported before but, to the best of our knowledge, this work is the first to provide 516 empirical evidence that while exonic regions in an isolated population may be enriched for ultra-rare It has previously been shown that primate promoters exhibit an increased rate of evolution 519 compared to other genomic regions [59] and this acceleration of nucleotide substitution rate is most 520 pronounced in broadly expressed promoters [60]. It is also widely accepted that variation in 521 The WGS sequencing and initial processing of the samples used in this study was performed 562 at Edinburgh Genomics, University of Edinburgh. The starting point of our analyses were the gVCF 563 files (GRCh38) we received for the 500 VIKING and 1369 LBC individuals, generated as follows. 564 Demultiplexing is performed using bcl2fastq (Illumina, 2.17.1.14), allowing 1 mismatch when 565 assigning reads to barcodes; adapters are trimmed during the demultiplexing process. BCBio-566 Nextgen (0.9.7) is used to perform alignment, bam file preparation and variant detection. BCBio uses 567 bwa mem (v0.7.13 [62]) to align the raw reads to the reference genome (GRCh38; with alt, decoy 568 and HLA sequences), then samblaster (v0. 1.22 [63]) to mark the duplicated fragments, and GATK 3.4 569 for the indel realignment and base recalibration. The genotype likelihoods are calculated using GATK 570

HaplotypeCaller creating a final gVCF file. 571
We called the variants in each sample individually from its gVCF using GenotypeGVCFs (GATK 572 3.6); the identified INDELs are limited to 75bp, i.e. about half of the read length. The discovered 573 variants for each sample were decomposed and normalized using VT (v0.5772-60f436c3 [64]). The 574 Variants not in the 22 autosomal or the two sex chromosomes, as well as variants with AC = 0 (after 575 decomposition) were excluded from further analyses and the filter value for all the remaining 576 variants was reset to PASS. The variants in each individual VCF were then split to SNPs and INDELs 577 (GATK 3.6). 578 An attempt to filter the variants using GATK's VQSR approach did not produce convincing 579 results -there was no clear separation between the filtered and retained variants in the generated 580 plots. Instead, we adopted a hard-filtering strategy based on the variant call parameters suggested 581 as suitable for hard-filtering by GATK [65]. The cut-off values for these parameters were determined 582 separately for VIKING and LBC cohorts in order to account for potential batch effects since the two 583 cohorts were sequenced at different time points and using different preparation kits -VIKING used 584 the TruSeq PCR-Free High Throughput library, while the earlier sequenced LBC used the TruSeqNano 585 High Throughput library. Using VariantFiltration (GATK 3.6), we marked (FILTER flag in the VCF set to 586 FAIL) SNPs with QD < 7.4/6.9, MQ < 44.0/44.5, FS > 10.0/9.8, SOR > 2.1/2.1, MQRankSum < -2.4/-2.3 587 or ReadPosRankSum < -1.4/-1.4; and marked INDELs with QD < 5.3/4.9, FS > 9.1/8.8, SOR > 2.9/2.6 588 or ReadPosRankSum < -1.8/-1.8 in VIKING/LBC cohorts, respectively. These cut-off values were 589 determined as the boundary to the worst-quality 5% of the variants for each of the parameters, 590 using all variants in the SNP and INDEL VCFs for 23/62 randomly chosen VIKING/LBC samples with 591 mean sequencing coverage >= 30x. The chosen cut-off values are more stringent than those 592 suggested by GATK; however, one of our objectives was to minimize the number of false positive 593 calls. In addition, we also marked as FAIL variants with DP < 10. On average, our approach lead to were 19% and 18%, respectively. It should be noted that in the later step of merging the variants 596 from all samples in each cohort, we used the GATK's KEEP_IF_ANY_UNFILTERED option. This allowed 597 for reconsidering variants which failed to pass the hard filtering in some samples, but were called 598 with sufficient quality in other samples to be considered trustworthy and were therefore kept for 599 further analyses. Our analyses suggest that using this option does not introduce a bias towards rarer 600 variants in more related populations (Fig 11 in S1 File). 601 The individual SNP and INDEL VCFs were lifted over to the human_g1k_v37 reference 602 genome (using picard-2.6.0, http://broadinstitute.github.io/picard) and merged into cohort-wide 603 SNP and INDEL VCFs (CombineVariants, GATK 3.6, using the KEEP_IF_ANY_UNFILTERED option). VIKING and LBC datasets passing the hard-filtering described above, but failing the quality filters in 657 gnomAD, were excluded from further analyses. We refer to the variants which passed both our and 658 gnomAD filtering as "known" variants. Furthermore, from variants found in our datasets, but not 659 found in gnomAD (i.e. ultra-rare variants), we kept for further analysis only biallelic SNPs with allele 660 frequency (AF) in the corresponding dataset ≤ 0.1, with depth of coverage (DP) at least 8 and no 661 more than 60 reads and genotype quality (GQ) ≥ 30; and only biallelic INDELs with AF ≤ 0.1, DP ≥ 12 662 and ≤ 60 and GQ ≥ 40. We refer to those variants as "ultra-rare" (Table 1), noting that some are 663 shared between the VIKING and LBC cohorts. Our tests showed that these ultra-rare variants are 664 generally randomly distributed genome-wide.  Next, we examined the density of SNP markers in the detected long and intermediate ROHs 712 (Fig 8 in S1 File). For long ROHs, we observed a bi-modal distribution for the number of SNP markers 713 discovered per 1Kb ROH length indicating potentially poor coverage/reliability for some ROHs, 714 consistent with the findings in [24]. To address this issue, we excluded from further analysis all long 715 ROHs with less than 2 or 3.5 markers per 1Kb ROH length in the VIKING and LBC cohorts, 716 respectively. The difference between the LBC and VIKING cut-off values (ratio = 1.75) correlates well 717 with the ratio of the total number of SNP markers given as input to bcftools for ROH calling (ratio = 718 1.68, LBC = 16,623,172 SNPs, VIKING = 9,890,893 SNPs). These density cut-offs also appear suitable 719 for intermediate ROHs (Fig 8 in S1 File). 720 721 Annotation of coding regions 722 Using the Ensembl (Genes 92, GRCh37.p13) data, we split the mappable regions in the 723 reference human genome into six categories -5'UTR (a total length of 9.3M bases), exon (30Mb), intron (906Mb), 3'UTR (27.6Mb), ncRNA (7.3Mb) and non-coding (1.1Gb) regions. Note that some 725 regions may be overlapping, e.g. a 3'UTR region of one gene might be 5'UTR region for another, etc. 726 The non-coding regions are defined as genome regions which do not fall in any of the above five 727 categories. applicable to nonsense variants -the closer pLI is to 1, the more haploinsufficient the gene appears (Table 12 in