Complex spatio-temporal distribution and genomic ancestry of mitochondrial DNA haplogroups in 24,216 Danes

Mitochondrial DNA (mtDNA) haplogroups (hgs) are evolutionarily conserved sets of mtDNA SNP-haplotypes with characteristic geographical distribution. Associations of hgs with disease and physiological characteristics have been reported, but have frequently not been reproducible. Using 418 mtDNA SNPs on the PsychChip (Illumina), we assessed the spatio-temporal distribution of mtDNA hgs in Denmark from DNA isolated from 24,642 geographically un-biased dried blood spots (DBS), collected from 1981 to 2005 through the Danish National Neonatal Screening program. ADMIXTURE was used to establish the genomic ancestry of all samples using a reference of 100K+ autosomal SNPs in 2,248 individuals from nine populations. Median-joining analysis determined that the hgs were highly variable, despite being typically Northern European in origin, suggesting multiple founder events. Furthermore, considerable heterogeneity and variation in nuclear genomic ancestry was observed. Thus, individuals with hg H exhibited 95%, and U hgs 38.2% - 92.5%, Danish ancestry. Significant clines between geographical regions and rural and metropolitan populations were found. Over 25 years, macro-hg L increased from 0.2% to 1.2% (p = 1.1*E-10), and M from 1% to 2.4% (p = 3.7*E-8). Hg U increased among the R macro-hg from 14.1% to 16.5% (p = 1.9*E-3). Genomic ancestry, geographical skewedness, and sub-hg distribution suggested that the L, M and U increases are due to immigration. The complex spatio-temporal dynamics and genomic ancestry of mtDNA in the Danish population reflect repeated migratory events and, in later years, net immigration. Such complexity may explain the often contradictory and population-specific reports of mito-genomic association with disease.

Introduction
between R and L, with N and M located intermediately (Fig 3). PC1 seems to reflect time since branching, whereas PC2 reflects geographical distance.

Distribution of haplogroups
The R macro-hg was resolved into hgs (Table 2); however, with the available SNPs it was not possible to differentiate the HV and P hgs from the R macro-hg (see S1 Table for details). A plot of principal component 1 versus principal component 3 (Fig 4) demonstrated a clear clustering of mtDNA SNPs from persons belonging to each hg. The proximity of the clusters comprising U and K, and H and V, respectively, is in accordance with the current phylogenetic mtDNA tree (Fig 1). Likewise, the median-joining graph of the R macro-hg (Fig 5), based on all the called SNPs, disclosed a phylogenetic relationship compatible with that of Fig 1. In addition, all the hgs exhibit considerable complexity, suggesting that each hg is the result of multiple migratory events and thus multiple founder events. The sub-hg distribution for each of the hgs that belong to the R macro-hg is shown in Table 3.
The L, N and M macro-hgs were infrequent (Table 1). M and N could be broken down into hgs as shown in Table 4. The complexity of these macro-hgs, and macro-hg L, as demonstrated by median-joining phylogenetic analysis, S1A-S1C

Geno-geographical affinity of mtDNA haplogroups
An admixture analysis of the persons of different hgs was performed, with results as shown in Fig 6. The major hgs, H and its sub-hgs, have a 90-95% Danish ancestry and ~5% non-Danish European ancestry. However, there is great variation between mtDNA hgs, most pronounced in hg U, and most of the hgs have a notable but varying proportion of admixture from Europe, the Middle East and Central South Asia (Fig 6). This finding is compatible with the very complex M-J diagram of the R-hgs (Fig 5). The genomic ancestries of the macro-hgs M and N differ even more (Fig 6), with M being of South East and East Asian affinity, and N of Danish and European affinity. Macro-hg L, surprisingly, exhibited a predominance of Middle Eastern and European genomic affinity (Fig 6), whereas an African ancestry would be expected [39,41].

Spatial distribution of mtDNA haplogroups
Denmark is divided into five geographical and administrative regions, as shown in Fig 2. The samples for this study were obtained from all regions in Denmark and linked to the postal code of the birthplace. The frequency of the macro-hgs and the most frequent hgs is shown for each administrative region in Table 5. The most marked differences were the relatively low frequency of hg H in the Capital Region (43.4%) compared to the North Denmark Region (48.8%) (p < 0.0001), and the higher frequencies of hgs L, M and R (in total 7.5%) in the Capital Region compared to the North Denmark Region (in total 3.9%) (p < 0.0001). These differences were significant even after Bonferroni correction for the simultaneous comparison of 14 groups (Table 5). In the Capital Region, the L hgs have a frequency of 1.3%, compared with 0.4-0.6% in the other regions, and the M hgs have a frequency of 2.6%, compared with 1.0-1.6% in the other regions. As the L and M hgs are rare in the European population and very frequent in African and Asian populations, the noted difference probably reflects a higher proportion and preferential localization of non-ethnic Danes in the Capital Region. A similar spatial difference was apparent when the hg distributions of persons from the Danish metropolitan areas, comprising the major cities Copenhagen, Aarhus, Odense and Aalborg, were compared with the distributions in the remaining rural areas (S4 Table). In the metropolitan areas, the combined frequency of L and M hgs was 4.4%, as compared to 1.9% in the rural areas.

Temporal distribution of mtDNA haplogroups
The frequencies of the major hgs H, J, T, K and V did not change significantly from year to year in the period from 1981 to 2005 (results not shown).
The frequencies of M and L hgs (Fig 7) increased over the period. The L hgs increased from a constant level of ~0.4% in 1981-1995 to ~1.5% in 2005. The M hgs rose from a constant level of ~1% in 1981-1991 to ~3% in 2005. The proportion of L and M hgs increased significantly (even after Bonferroni correction for the simultaneous analysis of 14 groups) from the period 1981-1986 to the period 2000-2005 (Table 6). The considerable diversity of the M macro-hg (S1B Fig) and the extreme diversity of the L macro-hg (S1C The hg U also increased in proportion each year from 1981 to 2005, albeit not significantly following Bonferroni correction (Table 6). However, when analyzing only the R macro-hg, the proportion of hg U increased significantly (p < 0.002) (Fig 7). In the case of the U-hg, the PCA Thus, there was no evidence for the introduction of novel U-hgs over the period. However, a more detailed analysis revealed that the increase in U-hgs was largely due to an increase in the infrequent sub-hgs U1, U6, U7, and U8 (S3 Table), as all of these increased by more than 40%. An admixture analysis showed that whereas the Danish autosomal genomic ancestry of the hgs U*, U2-U3, U4-9 and U5a&b is in the range of 80.6%-92.5% (Fig 6), comparable to that of the other major European hgs, the Danish ancestry drops to 36.4%-42.3% for U1, U6 and U7. As illustrated in Fig 6, U1 and U7 have strong Middle Eastern (29.4% and 13.6%) and Central South Asian (14.7% and 39.8%) genomic ancestry, and U6 exhibits a very strong Middle Eastern genomic ancestry (53.8%). This suggests that the increase in frequency of hg U is largely due to expansion of hgs brought into Denmark as a result of recent immigration.

Discussion
This study shows that the distribution of mtDNA hgs in Denmark is highly dynamic and complex. It comprises 1.6% of the Danish population over a 25-year period, and is by far the largest , is similar to that seen in the present study (Table 3) e.g. H1 and H3 constitute 37.6% and 6.5% of the H hgs in our study and 39.6% and 5.2% in the Li et al. study [47].
Haplotyping of mtDNA was based on array data with only 418 mtDNA SNPs, and as a consequence not all sub-hgs could be called. Ideally, sub-haplotyping could be refined, without including sub-hg defining SNPs from Phylotree, by the application of clustering approaches. However, this was not attempted, as it would lead to results that could not be compared with other studies. The stringent adherence to specific SNPs meant that a number of persons, normally in the order of 1-2%, albeit somewhat higher for specific hgs (see S2 Table), were not assigned to a specific hg or sub-hg. An advantage of using a limited set of markers is that confounding due to private variants is avoided. The complexity of each of the major hgs, as seen from the M-J networks (Figs 5 and S1A-S1C) with multiple nodes and a span of many mutations between different leaves, suggests that the hgs are not representative of single early founder events, as one mtDNA mutation is expected to occur per 3,000 years [48]. A notable exception is hg A, with a total prevalence of ~0.5% (Table 4), where the largest node (S1A Fig) is associated with "daughter" nodes only 1-2 mtDNA mutations from the major node. It was only possible to identify the sub-hgs A1 and A5, and they only constituted a small fraction of the total hg A. A likely source of the hg A is the Inuit population from Greenland, which is now a self-ruling part of The lack of recombination, high mutation frequency, and fixation of mtDNA hgs have enabled the use of mtDNA in population genetics to study population ancestry, migrations, gene flow and genetic structure [53,54]. Thus, populations on different continents, e.g. Native Americans [55], Africans [56] and Europeans [57], were ascribed specific matrilineal mtDNA hg distributions [39,44].
However, recent studies using autosomal SNP markers have disclosed a considerable ancestral complexity underlying an mtDNA classification in specific admixed populations, and the prediction of a specific mtDNA hg is not possible from a specific continental ancestry based on nuclear genetic markers [58]. A problem with genetic analysis of admixed populations is the lack of temporal resolution. It is often not possible to date a specific population split because differentiation between new and old migrations is impossible. Recent advances in sequencing ancient genomes [59] have made it possible to combine genetic information from ancient humans with archaeological information on the age of skeletal remains [60], thus constructing hg distribution maps with a temporal dimension for e.g. Ice Age and Bronze Age Europe [61,62]. Such maps may help explain demic and cultural exchange [52,63-65].
There is no solid evidence of the presence of humans in Denmark [66,67] prior to the Last Glacial Maximum (LGM) (26.5-19 kYBP) [68]. At that time Denmark was covered in ice, except for the south-western part of Jutland [69], and following the retreat of the ice sheet, peopling became possible from the south [69-72]. The first inhabitants documented were late Paleolithic hunters entering from southern Europe [73]. These hunters are discernible from Bølling time [69] around 12,800 BC. The archaeological remains suggest their transient presence in seasonal hunting periods until the Mesolithic around 9,700 BC [74]. The Hamburgian, Federmesser, Bromme and Ahrensburgian material cultures [69,75], well known from findings in Germany [69], are represented during this period. The earliest anatomically modern humans (AMH), present from around 45-41.5 kYBP [76-78], in whom ancient DNA studies have revealed the presence of mtDNA hgs M and pre-U2 [79], have little similarity to present-day Europeans. However, the Europeans from around 37 kYBP to 14 kYBP have left their  [62]. From 14 kYBP the European population has a strong Near Eastern component [62]. Around 7 kYBP the Neolithic transformation gradually started as the result of a demic dissemination of Neolithic Aegeans [80]. Thus, a minimum of three ancestral populations, i.e. a western European hunter-gatherer, an ancient north Eurasian, and an early European farmer population, are needed to explain present-day European autosomal genome compositions [81]. In Denmark, where Paleolithic ancient genetic data have not been published, ancient mtDNA haplotyping of three Neolithic corpses from 4.2-4 kYBP revealed two U4 and one U5a mtDNA hgs [82], and later samples showed a mixture of hgs also found in northern Germany [79,82]. The presence of mtDNA clades deriving from Paleolithic and Neolithic Europeans in the extant Danish population is thus explained.
In the Bronze Age, an influx of people from the Russian steppe and North Caucasus, bringing the Indo-European language and culture [83], resulted in the last major prehistoric demic change [61] in Europe. In historic times the migrations have been many, particularly in the half-millennium following the fall of the Western Roman Empire [84]. In Denmark, apart from continuous demic exchange with Southern Scandinavia and present-day Germany [69,72], early historic time was mostly characterized by emigration, i.e. the Heruli, Cimbri and Teutones, Burgundians and Vikings [71,72]. The first census held in 1769 AD reported the The distribution of H sub-hgs (Table 3) and the M-J graph of the H-hg (Fig 5), with multiple major nodes and 5-10 variants between leaves, suggest that the H-hg lineages are the result of repeated immigrations, most likely from northern Europe, where a study of 39 prehistoric hg H samples has shown that the distribution of hg H differed between early and middle-to-late Neolithic groupings. Prior to the Neolithic, H hgs have not been found in skeletal remains; an ensemble of Swedish Mesolithic hunter-gatherers all had U-hgs [63]. Whereas H, H5 and H1 were found throughout the Neolithic, H5b, H10, H16, H23, H26, H46, H88, and H89 were seen in early Neolithic samples, and H2, H3, H4, H5a, H6, H7, H11, H13, H82, and H90 in middle-to-late Neolithic samples [87]. The extant Danish H-hg distribution is compatible with contributions from throughout the Neolithic, but estimating when the hgs appeared in Denmark remains unfeasible, as the distribution is similar to that of northern Germany. However, not all carriers of an H-hg have a Danish autosomal genomic ancestry (Table 5), which is compatible with a considerable recent immigration from countries with a European mtDNA distribution (S4 Fig).
The most frequent Danish U-sub-hg is U5 (Table 3), which is an old European mtDNA hg with two major subclades (U5a and U5b) with coalescence time estimates of 16-20 kYBP and 20-24 kYBP, respectively [88]. The U5-hg is the most frequent U-sub-hg after the LGM [79], and the carriers have a Danish autosomal genomic ancestry of around 90% and a European ancestry of 6.1-8.8% (Fig 6). However, several of the less frequent U-sub-hgs have a considerable, ~40-60%, non-Danish and non-European genomic ancestry based on autosomal markers, with U7  6) exhibit a relatively strong non-Danish autosomal genomic ancestry. The M-J graphs (Fig 5) also disclose a considerable variation, much larger than could be attained merely within the timeframe in which Denmark has been populated. This may be explained by an extensive immigration, which for U6 and U7 occurred recently. This is also compatible with the rising frequency of these hgs during the 25-year study period (Fig 7 and S3 Table). The N macro-hg exhibits a high (>90%) combined Danish and European genomic ancestry (Fig 6), suggesting that the major part has been in Denmark for a long time. The major N hgs are the I-hg, which has been found in Meso- and Neolithic Scandinavians [82], hg-X [89] and hg-W (Table 4), and they exhibit (S1 and S2A Figs) a very extensive heterogeneity. All three hgs are old and have a broad, low-frequency distribution in western Eurasia [43], resulting from migratory events from the Near East and Central Asia. These events, and thus their significance for the presence of the hgs in Denmark, cannot be temporally resolved.
Macro-hg-M, while infrequent (1.6%) (Table 1), has also increased in frequency recently (Fig 7) and exhibits extensive heterogeneity (S1 Fig) [90]; the M2, M3, M4 and M6 hgs, constituting 12.5% of M-hgs, are of Indian or Pakistani origin [91]. These findings are compatible with the high genomic ancestry (42.2%) (Fig 6) towards Central South Asia, and a recent entry to Denmark. There is a propensity for location in metropolitan areas (S4 Table) and the Capital Region (Table 5), which is also compatible with recent immigration.
The low, but increasing, frequency of L-hgs (Tables 1 and 6 and Fig 7), predominantly L2 and L3 (Table 1), is also the result of multiple immigration events, as the PCA (S2C Fig) as well as the M-J plot (S1C Fig) reveal an extensive heterogeneity. L2 and L3 hgs are frequent in south, west or east Africa, whereas the contribution from central Africa is small [92,93]. However, the admixture analysis (Fig 6) suggests a much stronger genomic affinity with the Middle East (70.8%) than with Africa (8.4%). This finding could be an artefact of the ADMIXTURE analysis, as the highly variable African genomic ancestry [41] is represented by only 127 genomes (~6% of the total reference set; see Materials and Methods), as compared to 178 genomes from the Middle East, making the definition of African ancestry imprecise. Alternatively, it could be caused by the extensive demic exchange that has occurred through time between the Near East and Northern Africa [85,94]. The number of children born in the period 1980 to 2005 with one or two African parents (see S4 Fig) is compatible with the origin of the L hg persons being African. The L hgs are predominantly located in the Capital Region (Table 5) and in metropolitan areas (S4 Table), which is also compatible with recent immigration.
Immigration to Denmark increased from 1980, when 135,000 immigrants were registered, to 2005, when this number had risen to 345,000 [85]. In the same period the number of descendants of immigrants, i.e. persons that might turn up in this study, rose from 18,000 to 109,000. Roughly 50% of the immigrants were from western countries [85] (S4 Fig). Applied to the increase of about 91,000 descendants, this should give approximately 45,000 descendants of non-western immigrants over the time of study. This is within the order of magnitude to be expected from the frequency distribution of mtDNA hgs in Denmark. There is thus a reasonable concordance between the number of immigrants inferred from the temporal and structural study of mtDNA hgs and the registered births of persons with non-Danish parents.
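The back-of-the-envelope estimate in the preceding paragraph can be written out explicitly. The following is a minimal sketch, not the authors' calculation: it assumes that the roughly 50% western share reported for immigrants also applies to their descendants, and the function name is hypothetical.

```python
def estimate_non_western_descendants(
    descendants_start: int,
    descendants_end: int,
    western_share: float,
) -> float:
    """Estimate the growth in descendants of non-western immigrants.

    Assumption (not stated in the registry data): the western share of
    immigrants also applies to their descendants.
    """
    increase = descendants_end - descendants_start
    return increase * (1.0 - western_share)


# Figures quoted in the text: 18,000 descendants in 1980, 109,000 in 2005,
# and roughly 50% of immigrants from western countries.
estimate = estimate_non_western_descendants(18_000, 109_000, 0.5)
print(f"Estimated non-western descendants: {estimate:,.0f}")
```

The result, 45,500, matches the "approximately 45,000" given in the text; the estimate is only as good as the assumed share.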
Whereas the temporal change in mtDNA distribution can be explained by immigration, it is more difficult to explain the spatial clines (Tables 5 and S4). The frequency of hg-H is higher in Northern Jutland than in other regions, particularly the Capital Region (Table 5). The difference cannot be explained by the comparatively much smaller differences in frequencies of hgs L and M. As the mobility of Danes was fairly restricted until around 1900 [71], it may represent differences caused by centuries of relative isolation of a population north of Limfjorden. In Slovenia, historically confirmed geographically-based sub-stratification of the population [95] has led to extreme differences in mtDNA distributions.
A recent fine-scale gDNA population structure study from the UK [37] revealed considerable geographical heterogeneity and enabled the identification of specific sources of admixture from continental Europe. A gDNA study of admixture in the Danish population [96] showed considerably more homogeneity, but medieval admixture from Slavic tribes in North Germany, as well as a North-South gradient were discernible. Both of these studies limited the participants to persons with local grandparents, whereas our present study was not directed towards previous generations, but rather present-time and prospective.
The association of mtDNA SNPs and hgs with both diseases and functional characteristics of mitochondria has led to a pathogenic paradigm [97] where variation in mitochondrial function is considered to be of paramount importance for development of disease. Specific hgs have also been associated with longevity [98] and likelihood of being engaged in endurance athletic activities [99]. The clinical presentation of diseases caused by specific mtDNA variants depends, in some cases, on the hg background [100]. However, several of these studies are underpowered [101], poorly stratified with respect to sex, age, geographical background [102] or population admixture [103], or have used small areas of recruitment, risking "occult" founder effects [104]. Consequently, there are many contradictory results reported in the literature. To circumvent some of these problems, a recent large study on mtDNA SNPs identified a number of SNPs that were associated with several degenerative diseases [105]; however, the study pooled sequence information from a large geographical area without correcting for potential population sub-structure.
In addition to the gDNA/mtDNA interaction (Fig 6), the demonstrated spatio-temporal dynamics of the mtDNA hg distribution (Table 5 and Fig 7) should be taken into account when designing studies of mtDNA associations with disease, physiological characteristics and pharmacological effects. A prerequisite for genetic association studies is that the individuals are sampled from a homogeneous population, or that cryptic population structure is corrected for; if not, false positive associations may occur [106-108]. It is even more complicated, as a functional interaction between different mtDNA hgs and nuclear genomes, of importance for longevity and insulin resistance, has been described in mice [109]. Thus, the combined variation in mtDNA and gDNA should be accounted for in association studies. This may be done by including the principal components of both in the association analysis.
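As an illustration of including principal components of both gDNA and mtDNA as covariates, the sketch below uses NumPy on simulated data. Everything in it is an assumption made for illustration: the data are invented, the helper names `top_pcs` and `hg_effect` are hypothetical, and a linear model on a continuous phenotype stands in for whatever association model a real study would use. The simulated hg has no true effect on the phenotype; it only differs in frequency between two subpopulations.

```python
import numpy as np


def top_pcs(markers: np.ndarray, k: int) -> np.ndarray:
    """Return the first k principal-component scores of a samples x markers matrix."""
    centered = markers - markers.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]


def hg_effect(phenotype, carrier, covariates=None) -> float:
    """Least-squares coefficient for hg carrier status, optionally adjusted for covariates."""
    columns = [np.ones(len(phenotype)), carrier.astype(float)]
    if covariates is not None:
        columns.extend(covariates.T)  # one column per covariate
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
    return float(coef[1])


# Simulated (invented) data: two subpopulations that differ in nuclear and
# mtDNA marker means, in hg frequency, and in phenotype baseline.
rng = np.random.default_rng(0)
n = 2000
pop = rng.integers(0, 2, size=n)
gdna = rng.normal(loc=pop[:, None] * 1.0, scale=1.0, size=(n, 50))
mtdna = rng.normal(loc=pop[:, None] * 1.0, scale=1.0, size=(n, 10))
carrier = rng.random(n) < np.where(pop == 1, 0.6, 0.2)
phenotype = 1.0 * pop + rng.normal(size=n)  # depends on population only

naive = hg_effect(phenotype, carrier)
pcs = np.column_stack([top_pcs(gdna, 2), top_pcs(mtdna, 2)])
adjusted = hg_effect(phenotype, carrier, pcs)
print(f"naive effect: {naive:.3f}, PC-adjusted effect: {adjusted:.3f}")
```

In this toy setting the unadjusted estimate reflects the population difference rather than the hg, and the adjusted estimate shrinks towards zero. The number of components to include is a design choice that the sketch does not address.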
The need for well-designed and robust association studies in mitogenomics is underscored by the difficulty of performing functional analyses of mtDNA variants. Thus, OXPHOS enzyme activity measurements in cells and tissues [110], bioinformatic analysis of the consequences of altered mitochondrial protein function [111], as well as cybrid studies [112] are all characterized by not taking the mitochondrial-nuclear interaction into account. This is a major problem as changes in mitochondrial function can influence a plethora of nuclear functions, and vice versa [113].
Thus, when studying bi-genomic, i.e. both nucleo- and mito-genomic, disease associations, our results suggest that it is necessary to compensate for gDNA and mtDNA genetic stratification, as well as the interaction between the two sources of variation and spatio-temporal clines, in order to establish a functional significance of a specific mtDNA hg or SNP. To our knowledge, this has never been done, suggesting that previous reports on disease associations and mtDNA haplogroups should be considered preliminary.

Ethics statement
This is a register-based cohort study solely using data from national health registries. The study was approved by the Scientific Ethics Committees of the Central Denmark Region (www.komite.rm.dk) (J.nr.: 1-10-72-287-12) and executed according to guidelines from the Danish Data Protection Agency (www.datatilsynet.dk) (J.nr.: 2012-41-0110). Passive consent was obtained, in accordance with Danish Law nr. 593 of June 14, 2011, para 10, on the scientific ethics administration of projects within health research. Permission to use the DBS samples stored in the Danish Neonatal Screening Biobank (DNSB) was granted by the steering committee of DNSB (SEP 2012/BNP).

Persons
As part of the iPSYCH (www.iPSYCH.au.dk) recruitment protocol, 24,651 singletons (47.1% female), born between May 1 1981 and Dec 31 2005 were selected at random from the Danish Central Person Registry. The singletons had to have been alive one year after birth, and to have a mother registered in the Danish Central Person Registry. Furthermore, it should be possible to extract DNA from the DBS. DBS cards were obtained from the Danish Neonatal Screening Biobank at Statens Serum Institut [114] and DNA was extracted and analyzed as described below. At the time of analysis (2012), the mean age of females was 18.2 years (SD: 6.6 years) and for males 18.8 years (SD: 6.7 years). There was no bias in the geographical distribution of the birthplace of samples (Fig 2).

Genetic analysis
From each DBS card two 3.2-mm disks were excised, from which DNA was extracted using the Extract-N-Amp Blood PCR Kit (Sigma-Aldrich, St Louis, MO, USA) (extraction volume: 200 μL). The extracted DNA samples were whole genome amplified (WGA) in triplicate using the REPLI-g kit (Qiagen, Hilden, Germany), then pooled into a single aliquot. Finally, WGA DNA concentrations were estimated using the Quant-iT PicoGreen dsDNA kit (Invitrogen, Carlsbad, CA, USA). The amplified samples were genotyped at the Broad Institute (MA, USA) using the PsychChip (Illumina, CA, USA), developed by the Psychiatric Genomics Consortium, typing 588,454 variants. Following genotyping, samples with less than 97% call rate, as well as those where the estimated gender differed from the expected gender, were removed from further analysis; altogether 435 (1.8%) samples were removed due to problems with calling mtDNA variants. We then isolated the 418 mitochondrial loci and reviewed the genotype calls before exporting into the PED/MAP format using GenomeStudio (Illumina, CA, USA). Samples were loaded into GenomeStudio (version 2011.a) and a custom cluster was created using Gentrain (version 2); following automatic clustering, all positions with heterozygotes were manually curated. The custom cluster file was trained on genomic DNA and optimized for clustering of AA, AB, and BB. As mtDNA exhibits heteroplasmy, each cluster was manually inspected and adjusted based on Norm R (intensity) and Norm Theta (allele frequency). Samples with low intensities (< 0.08) were considered unreliable calls and rejected. Norm Theta was used to adjust the clusters into two distinct clusters, AA and BB. The few samples that clustered in AB were regarded as missing calls. The data were exported relative to the forward strand using PLINK Input Report Plug-in (version 2.

mtDNA SNPing
Haplotyping of mtDNA was performed manually using the defining SNPs reported in www.phylotree.org [19]. Hierarchical affiliation to macro-hg, i.e. L0-L6, M, N, R, and subsequently to hgs (units more distal in the cladogram, Fig 1) was performed. In some cases it was possible to establish affiliation even to sub-hgs. The call efficiencies of SNPs used in defining haplo- and sub-haplogroup affiliation are summarized in S1 Table. The proportions of samples that could not be unequivocally assigned to hgs or sub-hgs are given in S2 Table, with the appropriate references to the relevant tables in the main manuscript.
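The hierarchical affiliation described above was done manually, but its logic can be sketched programmatically. The example below is an illustration only: the tree is a toy fragment, and every marker name in it (`m_a`, `n_a`, and so on) is invented rather than taken from Phylotree. A sample is walked down the tree for as long as exactly one child clade has all of its defining markers called in the derived state; a missing call stops the descent, which mirrors how some persons could not be assigned to a specific hg or sub-hg.

```python
from typing import Dict, List, Optional

# Toy tree fragment. Marker names are invented placeholders, NOT real
# Phylotree defining SNPs.
TREE: Dict[str, dict] = {
    "root": {"markers": set(), "children": ["M", "N"]},
    "M": {"markers": {"m_a"}, "children": []},
    "N": {"markers": {"n_a"}, "children": ["R"]},
    "R": {"markers": {"r_a"}, "children": ["U", "H"]},
    "U": {"markers": {"u_a", "u_b"}, "children": []},
    "H": {"markers": {"h_a"}, "children": []},
}


def assign_haplogroup(calls: Dict[str, Optional[bool]]) -> List[str]:
    """Return the path of clades supported by the sample's marker calls.

    calls maps marker name -> True (derived allele), False (ancestral allele)
    or None (missing call). Markers absent from the dict count as missing.
    """
    path: List[str] = []
    node = "root"
    while True:
        matches = [
            child
            for child in TREE[node]["children"]
            if all(calls.get(m) is True for m in TREE[child]["markers"])
        ]
        if len(matches) != 1:  # no supported child, or an ambiguous call
            break
        node = matches[0]
        path.append(node)
    return path


# One defining marker of the toy U clade is missing, so the sample stays at R.
partial = {"n_a": True, "r_a": True, "u_a": True, "u_b": None, "h_a": False}
complete = {"n_a": True, "r_a": True, "u_a": True, "u_b": True, "h_a": False}
print(assign_haplogroup(partial))
print(assign_haplogroup(complete))
```

Requiring every defining marker to be called is a strict rule chosen for the sketch; the manual procedure may have tolerated partial evidence.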

Phylogenetic analyses
Phylogenetic analyses were performed by constructing median-joining networks with Network 4.6.1.3 (http://www.fluxus-engineering.com). FASTA-converted sequences of mtDNA SNPs from each person were aligned, and sequences were pre-processed with Star Contraction (maximum star radius 5 for R, N and M, and 1 for L). Median-joining networks were then constructed (using the network parameters: Epsilon 10, Frequency >1, active), followed by post-processing with a maximum parsimony (MP) algorithm [115,116]. Network Publisher was used to post-process the networks [42].

Genomic ancestry analysis
Ancestry estimation was done using ADMIXTURE 1.3.0 [117]. Briefly, a reference population consisting of the Human Genome Diversity Project (HGDP) (http://www.hagsc.org/hgdp/) genotyping SNP data set, supplemented with representative samples of Danes (716 individuals) and Greenlanders (592 individuals) available at SSI from unrelated projects, was used. The final reference data set consisted of 103,268 SNPs and 2,248 individuals assigned to one of nine population groups: Africa, America, Central South Asia, Denmark, East Asia, non-Danish Europe, Greenland, Middle East and Oceania. K, the number of clusters defined, was set to eight, based on principal component analysis clustering (data not shown).
Individuals characterized by different mtDNA hgs or sub-hgs were merged with the reference population data set and analyzed using ADMIXTURE. For prediction of the ancestry of individuals within the mtDNA hgs we created a random forest model [118] based on the reference data set, with the clusters Q1-8 as predictors and population groups as outcome. The prediction was thus supervised. Prediction was done in R version 3.2.2, using the caret package. The distribution of the eight basic clusters in samples of different geographical origin is shown in S1 Fig. As expected, the African-characteristic cluster distribution plays a decreasing role when going from Africa over the Middle East to Central South Asia. Likewise, the Danish cluster distribution is very similar to that of Europe.

Statistics
The statistical significance of differences in mtDNA proportions was assessed using a permutation version of Fisher's exact test [119]. Calculations were performed using R [120]. To assess population stratification [121], principal component analysis (PCA) was performed. When evaluating the significance of multiple comparisons between two populations/groups, Bonferroni correction was used.
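To make the testing procedure concrete, the sketch below shows a simple Monte Carlo permutation test for a difference in hg proportion between two groups, followed by Bonferroni correction. It is a stand-in written in Python with the standard library, not the permutation version of Fisher's exact test in R that was actually used [119,120]. The test statistic (absolute difference in proportions), the function names and the example counts are all assumptions made for illustration.

```python
import random
from typing import List


def permutation_p_value(
    group_a: List[int], group_b: List[int], n_perm: int = 2000, seed: int = 1
) -> float:
    """Two-sided Monte Carlo permutation p-value for a difference in proportions.

    Each list holds 1 for carriers of the hg and 0 for non-carriers.
    """
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = group_a + group_b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def bonferroni(p_value: float, n_tests: int) -> float:
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Invented counts: 4 carriers among 1,000 early samples versus 30 among
# 1,000 late samples, corrected for 14 simultaneous comparisons.
early = [1] * 4 + [0] * 996
late = [1] * 30 + [0] * 970
p_raw = permutation_p_value(early, late)
p_adj = bonferroni(p_raw, 14)
print(f"raw p = {p_raw:.4g}, Bonferroni-adjusted p = {p_adj:.4g}")
```

With a finite number of permutations the smallest attainable p-value is 1/(n_perm + 1), so resolving very small p-values such as those reported in the abstract would require far more permutations or an exact test.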