Insular Celtic population structure and genomic footprints of migration

Previous studies of the genetic landscape of Ireland have suggested homogeneity, with population substructure undetectable using single-marker methods. Here we have harnessed the haplotype-based method fineSTRUCTURE in an Irish genome-wide SNP dataset, identifying 23 discrete genetic clusters which segregate with geographical provenance. Cluster diversity is pronounced in the west of Ireland but reduced in the east where older structure has been eroded by historical migrations. Accordingly, when populations from the neighbouring island of Britain are included, a west-east cline of Celtic-British ancestry is revealed along with a particularly striking correlation between haplotypes and geography across both islands. A strong relationship is revealed between subsets of Northern Irish and Scottish populations, where discordant genetic and geographic affinities reflect major migrations in recent centuries. Additionally, Irish genetic proximity of all Scottish samples likely reflects older strata of communication across the narrowest inter-island crossing. Using GLOBETROTTER we detected Irish admixture signals from Britain and Europe and estimated dates for events consistent with the historical migrations of the Norse-Vikings, the Anglo-Normans and the British Plantations. The influence of the former is greater than previously estimated from Y chromosome haplotypes. In all, we paint a new picture of the genetic landscape of Ireland, revealing structure which should be considered in the design of studies examining rare genetic variation and its association with traits.


Introduction
Situated at the northwestern edge of Europe, Ireland is the continent's third largest island, with a modern-day population of approximately 6.4 million. The island is politically partitioned into the Republic of Ireland and Northern Ireland, with the latter forming part of the United Kingdom (UK) alongside the neighbouring island of Britain. Alternative divisions separate Ireland into four provinces reflecting early historical divisions: Ulster to the north, including Northern Ireland; Leinster (east); Munster (south) and Connacht (west). Humans have continuously inhabited Ireland for around 10,000 years [1], though it is not until after the demographic upheavals of the Early Bronze Age (circa 2200 BCE), that strong genetic continuity between ancient and modern Irish populations is observed [2]. Linguistically, the island's earliest attested language forms part of the Insular Celtic family, specifically the Gaelic branch, whose historic range also extended to include many regions of Scotland, via maritime connections with Ulster [3,4]. A second branch of Insular Celtic, the Brittonic languages, had been spoken across much of Britain up until the introduction of Anglo-Saxon in the 5th and 6th centuries, by which time they were diversifying into Cornish, Welsh and Cumbric dialects [5].
Since the establishment of written history, numerous settlements and invasions of Ireland from the neighbouring island of Britain and continental Europe have been recorded. This includes Norse-Vikings (9th-12th century), especially in east Leinster, and Anglo-Normans (12th-14th century), who invaded through Wexford in the southeast and established English rule mainly from an area later called the Pale in northeast Leinster [6]. There has also been continuous movement of people from Britain, in particular during the 16-17th century Plantation periods during which Gaelic and Norman lands were systematically colonized by English and Scottish settlers. These events had a particularly enduring impact in Ulster in comparison with other planted regions such as Munster. As with the previous Norman invasion, the less fertile west of the country (Connacht) remained largely untouched during this period.
The genetic contributions of these migratory events cannot be considered mutually independent, given that they derive from either related Germanic populations (such as the Vikings and their purported Norman descendants) or from other Celtic populations inhabiting Britain, which had themselves been subjected to mass Germanic influx from Anglo-Saxon migrations and later Viking and Norman invasions [7]. Moreover, each movement of people originated from northern Europe, a region which had witnessed a mass homogenizing of genetic variation during the migrations of the Early Bronze Age, possibly linked to Indo-European language spread [8,9]. However, each event had a geographic and temporal focal point on the island, which may be detectable in local population structure.
Previous genome-wide surveys have detected little to no structure in Ireland using methods such as principal component analysis (PCA) on independent markers, concluding that the Irish population is genetically homogenous [10]. However, runs of homozygosity are relatively long and frequent in Ireland [10] and correlate negatively with population density and diversity of grandparental origins [11], suggesting that low ancestral mobility may have preserved regional genetic legacies within Ireland, which may be detectable in modern genomes as local population structure embedded within haplotypes. This is further supported by the restricted regional distributions of certain Y chromosome haplotypes [12,13].
The haplotype-based methods ChromoPainter and fineSTRUCTURE [14] were recently used to uncover hidden genetic structure among the people of modern Britain [7]. These approaches exploit the rich information available within haplotypes (usually statistically phased) to identify clusters of genetically distinct individuals with a resolution that could not be attained using single-marker methods. In doing so, the People of the British Isles (PoBI) study was able to identify discrete genetic clusters of individuals that strongly segregate with geographical regions within Britain, though notably, structure was undetectable across a large southeastern portion of the island. However, although this study sampled over 2,000 individuals, only 44 were from Northern Ireland with none from the remainder of the island.
Ireland was also excluded from admixture and ancestry analyses due to the confounding effects of the island acting as "a source and a sink for ancestry from the UK". With this focus on a single island, the PoBI study has an obvious limit, despite its title.
Here, we have used the methods of the PoBI study to explore fine-grained Irish population substructure. We first investigate Ireland on its own, then we consider the genetic substructure observed on the island in the context of Britain and continental Europe. Using modern individuals from these two sources as surrogates for historical populations, we apply the GLOBETROTTER model to infer admixture events into Ireland and we consider these in the context of historically recorded invasions and migrations. Our inclusion of Irish data with previously-published data from Britain presents a more complete representation of genetic ancestry in the contemporary populations of the British Isles, providing a comprehensive population genetic perspective of the peopling of these islands.

Results and discussion
Celtic population structure in Ireland
We used ChromoPainter [14] to identify haplotypic similarities within a genome-wide single nucleotide polymorphism (SNP) dataset of individuals from the Republic of Ireland and Northern Ireland (n=1,035, including 44 from the PoBI study), in which local geographic origin was known for a subset (n=588). Clustering the resulting coancestry matrix using fineSTRUCTURE identified 23 clusters, demonstrating local population structure within Ireland to a level not previously reported, with apparent geographical, sociopolitical and ancestral correlates (Fig 1). All clusters were robustly defined, with total variation distance (TVD) p-values less than 0.001 (S1 Table and S2 Table). We projected the ChromoPainter coancestry matrix in lower-dimensional space using principal component analysis (PCA) and, to ease interpretation and for visual brevity with labels, we defined 9 cluster groups that formed higher order clades in the fineSTRUCTURE dendrogram, overlapped in PC space and were sampled from geographically contiguous regions. These cluster groups also showed robust definition by TVD analysis (S3 Table and S4 Table), suggesting they are meaningful. ChromoPainter PCA revealed a tight relationship between haplotypic similarity and geographical proximity, with principal component (PC) 1 roughly describing a north to south cline and PC2 largely describing an east to west cline (Fig 1B).
At a high level, both ChromoPainter PCA and fineSTRUCTURE clustering loosely separated the historical provinces of Ireland (Ulster, Leinster, Munster and Connacht) suggesting that these socially constructed territories may have had an impact on genetic structure within Ireland which is deeply embedded in time. Careful inspection of the tree ordering and the PCA revealed more nuanced relations between the provinces; for example south Leinster clusters share more haplotypes with those from north Munster than with their central and north Leinster counterparts. The geographical distribution of this deep subdivision of Leinster resembles pre-Norman territorial boundaries which divided Ireland into fifths (cúige), with north Leinster a kingdom of its own known as Meath (Mide) [15]. However interpreted, the firm implication of the observed clustering is that despite its previously reported homogeneity, the modern Irish population exhibits genetic structure that is subtly but detectably affected by ancestral population structure conferred by geographical distance and, possibly, ancestral social structure.
ChromoPainter PC1 demonstrated high diversity amongst clusters from the west coast, which may be attributed to longstanding residual ancient (possibly Celtic) structure in regions largely unaffected by historical migration. Alternatively, genetic clusters may also have diverged as a consequence of differential influence from outside populations. This diversity between western genetic clusters cannot be explained in terms of geographic distance alone. South Munster (SMN) and Cork (CRK) clusters branch off first in the fineSTRUCTURE tree and show distinct separation from their neighbouring north Munster clusters (NMN), indicating that south Munster's haplotypic makeup is more distinct from its neighbouring regions and the remaining regions than any other cluster. TVD analysis supports this observation (S1 Table and S3 Table), with the Cork cluster in particular showing strong differentiation from other clusters. This may reflect the persistent isolating effects of the mountain ranges surrounding the south Munster counties of Cork and Kerry, restricting gene flow with the rest of Ireland and preserving older structure.
In contrast to the west of Ireland, eastern individuals exhibited relative homogeneity; a similar pattern was observed in the PoBI study [7], in which all samples in a large region in [...]
To explore this, we estimated the extent of admixture per individual in the Irish dataset from Britain, using samples from the PoBI dataset as a reference [7], along with eighteen ancient British individuals from the Iron Age, Roman and Anglo-Saxon periods in northeast and southeast England [16,17]. Using an unsupervised ADMIXTURE analysis [18], we observed that one of the ADMIXTURE clusters (k=2) comprises the totality of ancestry of several [...]

The genetic structure of the British Isles
The genetic substructure observed in Ireland is consistent with long term geographic diversification of Celtic populations and the continuity shown between modern and Early Bronze Age Irish people [2]. However, this diversity is weaker on the east coast in a manner that correlates with British admixture, suggesting a role for recent migrations in eroding this structure. We therefore further investigated the relationship between Ireland and Britain by generating a ChromoPainter coancestry matrix for all Irish and PoBI data combined (n=3,008). Clustering with fineSTRUCTURE revealed 50 distinct clusters that segregated geographically, both on a cohort-wide and local level (Fig 2). Projecting this coancestry matrix in PC space revealed a striking concordance between haplotypes and geography (sampling regions were defined using Nomenclature of Territorial Units for Statistics 2010; [19]) for ChromoPainter PCs 1 and 4, reminiscent of previous observations for single marker-based summaries of genetic variation within European populations [20].
The principal split in the combined Irish and British data defined two genetic islands, both in the fineSTRUCTURE tree and in ChromoPainter PC1 (Fig 2). This distinction between Irish and British genetic data was particularly pronounced when we applied t-distributed stochastic neighbour embedding (t-SNE) [21] to the ChromoPainter coancestry matrix (Fig 3). t-SNE is a nonlinear dimensionality reduction method that attempts to provide an optimal low-dimensional embedding of data by preserving both local and global structure, placing similar points close to each other and dissimilar points far apart. In principle, a two-dimensional t-SNE plot can therefore summarize more of the overall differences between groups than those described by any two principal components, although the relative group sizes, positions and distances on the plot are less straightforward to interpret. Applying t-SNE to the Irish and British coancestry matrix captured the salient structure described by PCA, and particularly validated that observed in the plot of ChromoPainter PC1 vs PC4. This clearly distinguishes the two islands, discerns their north-south and west-east genetic structure and places Orkney and north/south Wales, whose variation is captured in ChromoPainter PCs 2 and 3 respectively (Fig 4), as independent entities from the bulk of the British data.
[...] Fig 6), indicating that this estimate is not a good reflection of the true [...]
Unlike the PoBI study, Irish data were not specifically selected for longstanding pure ancestry in each geographic region (for example, having four grandparents in a location), but instead represent a repurposed medical dataset.
Our data are therefore more representative of those that are typically used in population-based genome-wide surveys for trait-associated genetic variation; as these studies survey increasingly rare genetic variants in larger populations, the geospatial segregation of rare haplotypes and variants will become increasingly important, especially when environmental effects and interactions play a role [27]. Our observation that these haplotypes are intricately tied to geography in Ireland and Britain highlights the importance of considering fine-grained population structure in future studies.

Data and quality control
Our study included three datasets of genotype data: a population-based Irish ALS case-control dataset (n = 991) incorporating existing [28] and newly-genotyped samples, the People of the British Isles dataset (EGA accession ID EGAD00010000632; n = 2,020) [7] and a pan-European dataset derived from a genome-wide association study (GWAS) for multiple sclerosis (MS; EGA accession ID EGAD00000000120; n = 4,514) [24] (S1 Text: Populations). All Irish subjects provided written informed consent to participate in genetic research and the study was approved by the Beaumont Hospital Research Ethics Committee in Dublin, Ireland. We applied quality control to each dataset using PLINK 1.9 [29] and merged data as detailed in Supplementary Methods (S1 Text: Quality Control).
Briefly, we excluded both infrequent and high-missingness SNPs; individuals with high missingness, excessive heterozygosity or cryptic relationships to other individuals in the data; and finally individuals who had been removed during QC carried out in the source papers.
As the European dataset included patients and controls from a GWAS for MS, we additionally removed SNPs in a 15 Mb region surrounding the strongly associated HLA locus on chromosome 6 (GRCh37 position chr6:22,915,594-37,945,593), as is consistent with previous studies using the data [7,30]. This was to avoid haplotypic bias arising from this association.
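As an illustration of the exclusion step above, the following Python sketch drops SNPs that fall inside the stated GRCh37 interval. The tuple layout and the SNP identifiers are hypothetical and are not taken from the study's pipeline, which used PLINK 1.9; treating both interval ends as inclusive is also an assumption. The sketch only shows the interval logic.

```python
# Interval quoted in the text (GRCh37); both ends are treated as
# inclusive here, which is an assumption of this example.
HLA_CHROM = "6"
HLA_START = 22_915_594
HLA_END = 37_945_593


def in_hla_region(chrom, pos):
    """Return True if a SNP lies inside the excluded interval."""
    return chrom == HLA_CHROM and HLA_START <= pos <= HLA_END


def drop_hla_snps(snps):
    """Keep only SNPs outside the interval.

    `snps` is an iterable of (chrom, pos, snp_id) tuples; this layout is
    hypothetical and chosen only for the example.
    """
    return [s for s in snps if not in_hla_region(s[0], s[1])]


# Hypothetical SNP records, for illustration only.
example = [
    ("6", 22_915_593, "snp_before"),
    ("6", 30_000_000, "snp_inside"),
    ("6", 37_945_594, "snp_after"),
    ("7", 30_000_000, "snp_other_chrom"),
]
kept = drop_hla_snps(example)
region_width = HLA_END - HLA_START + 1
```

Here `kept` retains every record except `snp_inside`, and `region_width` is 15,030,000 bp, in line with the roughly 15 Mb quoted above.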
[...] The Supplementary Methods (S1 Text: Populations and S1 Text: QC) [...]
Geographic information was available for 544 of the 991 Irish samples in the form of home address. To preserve anonymity this was jittered in all maps containing patients (Fig 1 and S5 Fig). For all British and some Northern Irish data, sample location was supplied by the authors of PoBI [7] as membership of 35 sampling regions. Finally, for European data, sampling country was available [24]. Full details of treatment of samples for mapping are available in Supplementary Methods (S1 Text: Mapping).
We phased autosomal genotypes in each dataset and in the merged dataset with SHAPEIT V2 [31] using the 1000 Genomes (Phase 3) as a reference panel [32]. A pre-phasing step was carried out (--check) to remove any SNPs which did not correctly align to the 1000 Genomes reference panel. Samples were then split by chromosome and phased together using default settings and the GRCh37 build genetic map to estimate linkage disequilibrium.

fineSTRUCTURE analysis
To detect population structure we performed ChromoPainter/fineSTRUCTURE analysis [14] on each of the population datasets (Irish, British and European) individually, and then separately on a merge of the Irish and British datasets. In brief, we used ChromoPainter to paint each individual using all other individuals (-a 0 0) using default settings with the exception that the number of "chunks" per region value was set to 50 (-k 50) for all analyses including Irish and British individuals to account for the longer haplotypes observed in these datasets, in keeping with previous studies [7,30]. The fineSTRUCTURE algorithm was then run on the resulting coancestry matrix to determine genetic clusters based on patterns of haplotype sharing. Further details are included in the Supplementary Methods (S1 Text: fineSTRUCTURE analysis).

Cluster robustness
We assessed the robustness of Irish clusters by calculating total variation distance (TVD) as described in the PoBI study [7]. This metric compares the "copying vectors" of a pair of clusters. Here we define the copying vector for a given cluster A as a vector of the average lengths of DNA donated by each cluster to individuals within cluster A under the ChromoPainter model. Hence the magnitude of differences between copying vectors of two clusters reflects the distances between those clusters in terms of their haplotypic sharing with other clusters. TVD can therefore be used to determine whether fineSTRUCTURE clusters detect significant differences in haplotype sharing, and hence ancestry.
We tested whether the observed clustering performed better than chance by permuting (1,000 times) the individuals in each of our cluster pairings into clusters of the same size, and calculating the number of permutations that exceeded our original TVD score. If 1,000 unique permutations were not possible, the maximum number of unique permutations was used instead. P-values were calculated based on the number of permutations greater than or equal to the original TVD statistic. All p-values for Irish clusters were less than or equal to 0.001, indicating robust clustering (S1 Table and S2 Table). We also applied these methods to our Irish cluster groups (Fig 1) and observed that these are statistically distinct (S3 Table and S4 Table).
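The permutation procedure above can be sketched in Python as follows. Two points are assumptions rather than details given here: TVD is taken to be half the sum of absolute differences between two cluster copying vectors, and the p-value is taken as the proportion of permutations whose TVD is greater than or equal to the observed value. The function names and toy copying vectors are hypothetical, and the sketch does not enforce uniqueness of permutations.

```python
import random


def mean_vector(rows):
    """Average per-individual copying vectors into one cluster vector."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def tvd(a, b):
    """Assumed TVD form: half the summed absolute differences."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


def tvd_permutation_p(cluster_a, cluster_b, n_perm=1000, seed=0):
    """Permutation test for the TVD between two clusters.

    Individuals are pooled, shuffled and re-split into groups of the
    original sizes; the p-value is the share of permutations whose TVD
    is at least as large as the observed one.
    """
    observed = tvd(mean_vector(cluster_a), mean_vector(cluster_b))
    pooled = list(cluster_a) + list(cluster_b)
    size_a = len(cluster_a)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        permuted = tvd(mean_vector(pooled[:size_a]),
                       mean_vector(pooled[size_a:]))
        if permuted >= observed:
            hits += 1
    return observed, hits / n_perm


# Toy copying vectors (one row per individual), for illustration only.
toy_a = [[0.60 + 0.01 * i, 0.30 - 0.01 * i, 0.10] for i in range(8)]
toy_b = [[0.30, 0.30 + 0.01 * i, 0.40 - 0.01 * i] for i in range(8)]
observed_tvd, p_value = tvd_permutation_p(toy_a, toy_b)
```

With two clearly separated toy clusters, almost no random re-split reaches the observed TVD, so the p-value is small.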
To provide an additional measure of population differentiation between "cluster groups" we calculated mean FST between groups using PLINK 1.9 [29], which is reported in S5 Table.

Estimating admixture dates
We used the GLOBETROTTER method [23] to infer and date admixture events from Europe and Britain into Ireland separately. GLOBETROTTER uses output from ChromoPainter to estimate the pairwise likelihood of being painted by any two surrogate populations at a variety of genetic distances to generate coancestry curves. Assuming a single admixture event, these curves are expected to decay exponentially at a rate equal to the time in generations since admixture occurred [23]. As the true admixing sources are modelled as a linear mixture of surrogate sources rather than individual sources, this method has the advantage of not requiring exactly sampled source populations. [...] independent analyses. Target and donor clusters for this analysis were defined using the fineSTRUCTURE maximum concordance tree method described in PoBI [7] to ensure homogeneity (Supplementary Methods S1 Text: fineSTRUCTURE analysis); hence, the Irish target clusters that were used differ slightly from those in Fig 1. Briefly, for each surrogate population separately (Europe and Britain) we applied ChromoPainter v2 to paint Ireland and the surrogate population using the surrogate population as donors and generated a copying matrix (chunk lengths) for all individuals, and also 10 painting samples for each Irish individual as recommended. GLOBETROTTER was then run for 5 mixing iterations twice, first using the null.ind:1 setting to test for any evidence of admixture and then the null.ind:0 setting to infer dates and sources. We ran 100 bootstraps for admixture date and calculated the probability of a null model of no admixture as the proportion of nonsensical inferred dates (<1 or >400 generations) produced by the null.ind:1 model, as in the GLOBETROTTER study [23].
Confidence intervals for the date were calculated from the bootstraps for the standard model (null.ind: 0) using the empirical bootstrap method (see S1 Text: Globetrotter analysis of Admixture Dates for further details). A generation time of 28 years was assumed as in previous studies of this nature [7,23] for conversion of all date estimates from generations to years.
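To make the dating logic concrete, the sketch below fits a single exponential to a coancestry curve, converts the fitted rate to years using the 28-year generation time, and computes the proportion of bootstrap dates falling outside 1 to 400 generations. It is not GLOBETROTTER's fitting procedure: it assumes genetic distance is measured in Morgans, that any baseline has already been subtracted so that all curve values are positive, and it uses a simple log-linear least-squares fit. All function names and the simulated values are hypothetical.

```python
import math

GENERATION_YEARS = 28  # generation time assumed in the text


def fit_generations(distances_morgans, curve):
    """Estimate the decay rate of curve ~ a * exp(-rate * distance).

    A straight line is fitted to log(curve) against distance by ordinary
    least squares; the negated slope is the rate, read as generations
    since admixture under the single-event assumption.
    """
    logs = [math.log(y) for y in curve]
    n = len(logs)
    mean_x = sum(distances_morgans) / n
    mean_y = sum(logs) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(distances_morgans, logs))
    sxx = sum((x - mean_x) ** 2 for x in distances_morgans)
    return -sxy / sxx


def generations_to_years(generations):
    """Convert a date in generations to years before sampling."""
    return generations * GENERATION_YEARS


def nonsensical_fraction(boot_dates, low=1, high=400):
    """Share of bootstrap dates below `low` or above `high` generations."""
    bad = sum(1 for d in boot_dates if d < low or d > high)
    return bad / len(boot_dates)


# Simulated, noise-free curve for an event 30 generations ago.
distances = [0.01 * i for i in range(1, 31)]
curve = [0.8 * math.exp(-30 * d) for d in distances]
estimated = fit_generations(distances, curve)
years = generations_to_years(estimated)

# Made-up bootstrap dates; two of the four fall outside 1-400 generations.
null_p = nonsensical_fraction([0.5, 12.0, 33.0, 450.0])
```

On the simulated curve the fit recovers roughly 30 generations, or about 840 years at 28 years per generation.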

Ancestry proportion estimation
We assessed the ancestral makeup of Ireland in terms of Europe and Britain for each Republic of Ireland cluster (see Estimating admixture dates) to explore variation in ancestry across Ireland. To do so we modelled each cluster's average genome as a linear mixture of the European and British donor populations using the method described in the PoBI study [7] and implemented in GLOBETROTTER (num.mixing.iterations: 0). This approach uses the ChromoPainter chunk length output to estimate the proportion of DNA which most closely coalesces with each individual from the donor populations, correcting for noise caused by similarities between donor populations whose splits may have occurred after the coalescence event. This is achieved through a multiple linear regression of the form $Y_p = B_1 X_1 + B_2 X_2 + \dots + B_g X_g$, where $Y_p$ is a vector of the averaged length (cM) of DNA that individuals across cluster P copy from each donor group, normalised to sum to 1 across all donor groups, and $X_g$ is the vector describing the average proportion of DNA that individuals in donor group g copy from other donor groups including their own. The coefficients of this equation, $B_1, \dots, B_g$, are thus interpreted as the "cleaned" proportions of the genome ancestral to each donor group. The equation is solved using a non-negative least-squares function such that $B_g \geq 0$ and the sum of proportions across groups equals 1.
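The constrained regression above can be illustrated with a small Python sketch. It is not the GLOBETROTTER implementation: instead of a dedicated non-negative least-squares routine it uses projected gradient descent onto the set of non-negative coefficients that sum to 1, which targets the same constrained least-squares problem. The function names, step-size choice and toy copying vectors are assumptions made for the example.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {b : b >= 0, sum(b) == 1}."""
    ordered = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, value in enumerate(ordered, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / j
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def mixture_proportions(target, donors, steps=5000):
    """Least-squares mixture weights with b >= 0 and sum(b) == 1.

    `target` is the cluster copying vector and `donors[k]` is the
    copying vector of donor group k (hypothetical inputs).
    """
    g, n = len(donors), len(target)
    # Step size 1 / trace(X^T X), a safe bound on the Lipschitz constant.
    step = 1.0 / sum(x * x for donor in donors for x in donor)
    b = [1.0 / g] * g
    for _ in range(steps):
        fitted = [sum(b[k] * donors[k][i] for k in range(g))
                  for i in range(n)]
        resid = [fitted[i] - target[i] for i in range(n)]
        grad = [sum(resid[i] * donors[k][i] for i in range(n))
                for k in range(g)]
        b = project_to_simplex([b[k] - step * grad[k] for k in range(g)])
    return b


# Toy donor copying vectors and a target built from known weights.
toy_donors = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]
true_weights = [0.5, 0.3, 0.2]
toy_target = [sum(w * d[i] for w, d in zip(true_weights, toy_donors))
              for i in range(3)]
weights = mixture_proportions(toy_target, toy_donors)
```

Because the toy target is an exact mixture of the donor vectors, the recovered `weights` should match `true_weights` up to numerical tolerance.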
To assess uncertainty of these ancestry proportion estimates we again follow PoBI [7] and resample from the ChromoPainter chunk length output to generate Np pseudo individuals for each cluster P. We achieve this by randomly sampling each of the autosomal chromosome [...]
For comparison we implemented an alternative delete-one-chromosome jack-knife approach as in Montinaro et al. [33], and estimated the s.e. as in ref. [34] (S6 Fig and S7 Fig). We also used this linear regression model to determine per-individual ancestry proportion estimates from different British clusters across Ireland, treating each individual as a cluster to enable us to assess whether gene flow from northern Britain had a gradient across Ireland.
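A delete-one-chromosome jack-knife of the kind mentioned above can be sketched as follows. The sketch uses the standard unweighted delete-one formula for the standard error, which may differ from the estimator in ref. [34], and a generic `statistic` callable stands in for re-estimating an ancestry proportion with one chromosome removed. All names and values are hypothetical.

```python
import math


def jackknife_se(per_chrom_values, statistic):
    """Unweighted delete-one jack-knife standard error.

    `per_chrom_values` holds one entry per chromosome and `statistic`
    maps such a list to a single number. Each chromosome is left out in
    turn and the statistic is recomputed on the remainder.
    """
    n = len(per_chrom_values)
    leave_one_out = [
        statistic(per_chrom_values[:i] + per_chrom_values[i + 1:])
        for i in range(n)
    ]
    mean_loo = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((t - mean_loo) ** 2 for t in leave_one_out)
    return math.sqrt(variance)


def mean(values):
    """Simple stand-in statistic for the example."""
    return sum(values) / len(values)


# Toy per-chromosome values, for illustration only.
toy = [1.0, 2.0, 3.0, 4.0, 5.0]
se = jackknife_se(toy, mean)
```

For the sample mean this formula reduces to the usual standard error of the mean, which gives a quick sanity check on the toy values.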

ADMIXTURE
To estimate the proportion of British admixture into Irish clusters, ADMIXTURE [18] was run on the combined PoBI and Irish datasets, alongside eighteen ancient individuals from the Iron Age, Roman and Anglo-Saxon periods of northeast and southeast England [16,17].
Pseudo-haploid genotypes were generated for the ancient genomes at the relevant variant sites, as is standard for low coverage data, and subsequently merged with the modern diploid dataset. Data were then pruned for linkage disequilibrium between SNPs using [...]

PCA and t-SNE
ChromoPainter coancestry matrices were projected in lower-dimensional space using principal component analysis (PCA) and t-distributed stochastic neighbour embedding (t-SNE) [21]. PCA was run using the default approach provided as part of the fineSTRUCTURE R tools [14] [...]
Details for each cluster in the dendrogram are provided in S2 Fig. [...] particularly pronounced for Northern Irish and Scottish individuals that fall within the NICS and SSC cluster groups (Fig 2), respectively.