
NP, ALP, and DR chose data and designed experiments. DR advised on how to compare the methods with previous work. NP analyzed the data. NP, ALP, and DR wrote the paper.

The authors have declared that no competing interests exist.

Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We discuss an approach to studying population structure (principal components analysis) that was first applied to genetic data by Cavalli-Sforza and colleagues. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. We also uncover a general “phase change” phenomenon about the ability to detect structure in genetic data, which emerges from the statistical theory we use, and has an important implication for the ability to discover structure in genetic data: for a fixed but large dataset size, divergence between two populations (as measured, for example, by a statistic like F_ST) that is below a threshold is essentially undetectable, but a little above threshold, detection will be easy.

When analyzing genetic data, one often wishes to determine if the samples are from a population that has structure; that is, whether the population contains genetically distinct subgroups.

A central challenge in analyzing any genetic dataset is to explore whether there is any evidence that the samples in the data are from a population that is structured. Are the individuals from a homogeneous population or from a population containing subgroups that are genetically distinct? Can we find evidence for substructure in the data, and quantify it?

This question of detecting and quantifying structure arises in medical genetics, for instance, in case-control studies where uncorrected population structure can induce false positives [

We focus on principal components analysis (PCA), which was first introduced to the study of genetic data almost thirty years ago by Cavalli-Sforza [

We will use PCA and “eigenanalysis” interchangeably. The latter term focuses attention on the fact that not just the eigenvectors (principal components) are important here, but also the eigenvalues, which underlie our statistical procedures.

PCA has become a standard tool in genetics. In population genetics, we recommend a review paper [

Usually PCA has been applied to data at a population level, not to individuals as we do here. Exceptions are [

In addition to single nucleotide polymorphisms (SNPs) and microsatellites, PCA has been applied to haplotype frequencies [

Using some recent results in theoretical statistics, we introduce a formal test statistic for population structure. We also discuss testing for

Our methods work in a broad range of contexts, and can be modified to work with markers in linkage disequilibrium (LD). The methods are also able to find structure in admixed populations such as African Americans—that is, in which individuals inherit ancestry from multiple ancestral populations—as long as the individuals being studied have different proportional contributions from the ancestral populations.

We believe that principal components methods largely fell out of favor with the introduction of the sophisticated cluster-based program STRUCTURE [

Our implementation of PCA has three major features. 1) It runs extremely quickly on large datasets (within a few hours on datasets with hundreds of thousands of markers and thousands of samples), whereas methods such as STRUCTURE can be impractical. This makes it possible to extract the powerful information about population structure that we will show is present in large datasets. 2) Our PCA framework provides the first formal tests for the presence of population structure in genetic data. 3) The PCA method does not attempt to classify all individuals into discrete populations or linear combinations of populations, which may not always be the correct model for population history. Instead, PCA outputs each individual's coordinates along axes of variation. An algorithm could in principle be used as a post-processing step to cluster individuals based on their coordinates along these axes, but we have not implemented this.

We note that STRUCTURE is a complex program and has numerous options that add power and flexibility, many of which we cannot match with a PCA approach. Perhaps the central goal of STRUCTURE is to classify individuals into discrete populations, but this is not an objective of our method. We think that in the future both cluster-based methods such as STRUCTURE and our PCA methods will have a role in discovering population structure in genetic data; for example, our PCA methods offer a good default for the number of clusters to use in STRUCTURE. In complex situations, such as uncovering structure in populations where all individuals are equal mixtures of ancestral populations, it may remain necessary to use statistical software that explicitly models admixture LD, such as [

In this study we aim to place PCA as applied to genetic data on a solid statistical footing. We develop a technique to test whether eigenvectors from the analysis are reflecting real structure in the data or are more probably merely noise. Other papers will explore applications to medical genetics [

Two important results emerge from this study. First, we show that application of PCA to genetic data is statistically appropriate, and provide a formal set of statistical tests for population structure. Second, we describe a “phase change” phenomenon about the ability to detect structure that emerges from our analysis: for a fixed dataset size, divergence between two populations (as measured, for example, by a statistic like F_ST) that is below a threshold is essentially undetectable, but a little above threshold, detection will be easy.

The theory shows that the methods are sensitive, so that on large datasets population structure will often be detectable. Moreover, the novel result on the phase change is not limited to PCA, but reflects a deep property of the ability to discover structure in genetic data. For example, we present simulations showing that detection of structure begins at the same dataset size whether STRUCTURE or PCA is used; that is, the phase change manifests itself in the same place.

The phase change effect was suggested by a recent paper in theoretical statistics [

The basic technique is simple. We assume our markers are biallelic, for example, biallelic single nucleotide polymorphisms (SNPs). Regard the data as a large rectangular matrix

Set

We verified (unpublished data) that the normalization improves results when using simulated genetic data, and that on real data known structure becomes clearer. (However all the results are just as mathematically valid even without the normalizations.)
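As a concrete illustration, here is a minimal sketch of the normalization just discussed, assuming the data matrix is samples × markers with genotype counts 0, 1, 2, and assuming the allele frequency at a marker is estimated by half the marker mean (the exact frequency estimate used in our implementation may differ slightly):

```python
import numpy as np

def normalize_genotypes(C):
    """Normalize a samples-x-markers genotype count matrix (entries 0, 1, 2).

    Each column (marker) is mean-centered and divided by an estimate of its
    binomial standard deviation sqrt(p(1-p)), where p is the estimated allele
    frequency.  Missing entries (NaN) become 0 after normalization, i.e. they
    are imputed at the marker mean.
    """
    C = np.asarray(C, dtype=float)
    mu = np.nanmean(C, axis=0, keepdims=True)   # per-marker means
    p = mu / 2.0                                # estimated allele frequency
    scale = np.sqrt(p * (1.0 - p))
    scale[scale == 0] = 1.0                     # guard monomorphic markers
    return np.nan_to_num((C - mu) / scale)      # missing -> 0
```

Note that, as discussed later in the text, imputing missing entries at zero (the marker mean) is a simple but reasonable default.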

The methods also are applicable to data such as microsatellites, where there are more than two alleles at a single site. We use a device of Cavalli-Sforza [

An alternative, suggested by a referee, is to use the microsatellite allele length as a continuous variable, and carry out PCA directly after a suitable normalization.

Now we carry out a singular value decomposition on the matrix

The method is fast. In practice, the running time is dominated by the calculation of the matrix product XX′, which requires on the order of m²n operations for m samples and n markers.
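A sketch of the eigenanalysis step, under the same samples × markers convention; the only nontrivial point is that one forms the small m × m product over samples rather than the large n × n product over markers:

```python
import numpy as np

def eigenanalysis(X):
    """Eigendecomposition of the sample covariance of a samples-x-markers
    normalized matrix X (m samples, n markers).  The m x m product XX'/n is
    small when m << n; forming it dominates the running time (order m^2 n).
    """
    m, n = X.shape
    cov = (X @ X.T) / n                  # m x m covariance across samples
    evals, evecs = np.linalg.eigh(cov)   # ascending order
    return evals[::-1], evecs[:, ::-1]   # return in descending order
```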

Most, though not all, previous applications of PCA to analysis of population structure have taken the data to be a matrix where the rows are indexed by populations not by individuals (e.g., [

We note that population labels may be socially constructed. This makes us nervous about employing them in an initial data study. On the other hand, the individual samples certainly do not have any such construct, and even if population labels are available, initial analysis at an individual level allows us to check the meaningfulness of the labels [

Cavalli-Sforza [15] motivated the use of principal components as follows. The first principal component v^{(1)} is the principal eigenvector of the matrix XX′; that is, it maximizes v′XX′v subject to |v| = 1. The second component v^{(2)} maximizes the same expression with the constraint that v^{(1)} and v^{(2)} are orthogonal, and so on. Why would we expect this to reveal population structure? Suppose that in our sample we have just two populations and that each is homogeneous. Choose a vector with coordinates constant and positive for samples from one population, and coordinates constant and negative for samples from the other. Arrange so that the vector coordinates sum to zero. Then, since alleles within a population will tend to agree more than in the sample as a whole, the quantity v′XX′v will tend to be large for such a vector, and the leading eigenvector will tend to separate the two populations.

We now turn to some recent theoretical statistics. Consider an

(A notational issue is that in our genetic data, if

Let

Let λ_i (1 ≤ i ≤ m) be the eigenvalues: λ_1, …, λ_m.

Order the eigenvalues so that λ_1 ≥ λ_2 ≥ … ≥ λ_m.

Johnstone, in a key paper [, showed that the distribution of the suitably normalized largest eigenvalue λ_1 is approximately a distribution discovered by Tracy and Widom [

More precisely, if as

The only difference here is that the

In [

We show in

Conventional percentile points are approximately 0.9793, 2.0234, and 3.2724 for right-tail areas of 0.05, 0.01, and 0.001, respectively.

The complexity of the TW definition is irrelevant to its application to real data. One computes a statistic, and then looks up a

One concern with applying this approach to genetic data is that the entries in the matrix

This theory, originally developed for the case of Gaussian matrix entries, thus seemed likely to work well with large genetic biallelic data arrays. The remainder of this paper verifies that this is the case.

For genetic applications we cannot necessarily assume that all our markers are unlinked and thus independent. For instance, in the International Haplotype Map project [

Suppose we have

There are two unknowns that we can regard as parameters of the Wishart: 1) σ², the variance of the normal distribution used for the cells of the rectangular matrix M; and 2) n, the number of columns of M.

We want to carefully distinguish here between n, the actual number of markers, and n′, the effective number of markers implied by the eigenvalue spectrum, which is what our estimate recovers.

This leads immediately to a formal test for the presence of population structure in a biallelic dataset.

1. Compute the normalized data matrix X from the genotype counts, as described above.

2. Compute the m × m matrix A = XX′.

3. Order the eigenvalues of A so that λ_1 ≥ λ_2 ≥ … ≥ λ_m.

4. Using the eigenvalues λ_i, estimate n′, the effective number of markers.

5. The largest eigenvalue of A is λ_1. Set l = (m λ_1)/(λ_1 + λ_2 + … + λ_m).

6. Normalize l using the centering and scaling constants μ(m, n′) and σ(m, n′); the result is our test statistic, which is compared with the Tracy–Widom distribution.

Our statistic depends on the data only through the eigenvalues λ_i.
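The steps above can be sketched as follows. The centering and scaling constants follow Johnstone's parameterization, and the effective-marker estimate is a moment-based formula that we reconstruct here; both should be read as assumptions for illustration rather than as our exact published equations:

```python
import numpy as np

def tw_statistic(evals):
    """Tracy-Widom test statistic for the largest eigenvalue, following the
    six steps in the text.  n_eff is a moment-based estimate of the
    effective number of markers (a reconstruction, not necessarily the
    exact equation in the text); mu and sigma are Johnstone's centering
    and scaling constants."""
    lam = np.sort(np.asarray(evals, dtype=float))[::-1]
    m = len(lam)
    s1, s2 = lam.sum(), (lam ** 2).sum()
    n_eff = (m + 1) * s1 ** 2 / (m * s2 - s1 ** 2)   # effective marker count
    l1 = m * lam[0] / s1                             # normalized top eigenvalue
    a, b = np.sqrt(n_eff - 1.0), np.sqrt(m)
    mu = (a + b) ** 2 / n_eff
    sigma = (a + b) / n_eff * (1.0 / a + 1.0 / b) ** (1.0 / 3.0)
    return (l1 - mu) / sigma   # compare with TW percentile points
```

On null (unstructured) data the statistic should behave like a Tracy–Widom variate; a strong spike in the covariance drives it far into the right tail.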

We first made a series of simulations in the absence of population structure. (Some additional details are in the Methods section.) Our first set of runs had 100 individuals and 5,000 unlinked SNPs, and the second 200 individuals and 50,000 unlinked SNPs. In each case we ran 1,000 simulations and show in

(A) We carried out 1,000 simulations of a panmictic population, with a sample size of 100 and 5,000 unlinked SNPs; shown is a P–P plot of the TW statistic against the Tracy–Widom distribution.

(B) P–P plot corresponding to a sample size of 200 and 50,000 unlinked SNPs.

It is very important to be able to answer the question: is the next eigenvalue significant? Having found k significant eigenvalues, we apply the testing procedure to the remaining eigenvalues λ_{k+1}, …, λ_m.

We generated genotype data in which the leading eigenvalue is overwhelmingly significant (a large F_ST between two subpopulations).
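For readers who wish to experiment, structured genotype data of this kind can be simulated with the Balding–Nichols beta-binomial model. This is our choice here for a self-contained sketch; the simulations reported in this paper are coalescent-based:

```python
import numpy as np

def simulate_two_pops(n_per_pop, n_markers, fst, rng=None):
    """Simulate biallelic genotype counts (0/1/2) for two populations whose
    allele frequencies have drifted apart by roughly F_ST, using the
    Balding-Nichols beta model.  Returns a samples-x-markers matrix."""
    rng = np.random.default_rng(rng)
    p_anc = rng.uniform(0.1, 0.9, size=n_markers)    # ancestral frequencies
    a = p_anc * (1 - fst) / fst
    b = (1 - p_anc) * (1 - fst) / fst
    geno = []
    for _ in range(2):                               # two subpopulations
        p_sub = rng.beta(a, b)                       # drifted frequencies
        geno.append(rng.binomial(2, p_sub, size=(n_per_pop, n_markers)))
    return np.vstack(geno)
```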

There is a much closer relationship between our PCA and a cluster-based analysis than is at first apparent. Consider a model of genetic structure where there are K subpopulations P_1, P_2, …, P_K, with the allele frequencies of subpopulation i having drifted from the ancestral frequencies by an amount parameterized by F_{ii}.

Suppose we sample (autosomal) genotypes from these K subpopulations, with the divergence between them measured by F_ST.

Thus, natural models of population structure predict that most of the eigenvalues of the theoretical covariance will be “small,” nearly equal, and arise from sampling noise, while just a few eigenvalues will be “large,” reflecting past demographic events. This is exactly the situation that Johnstone's application of Tracy–Widom theory addresses. We also note that on real data (as we show below), the TW theory works extremely well, which shows that the model will be a reasonable approximation to “truth” in many cases.

Thus, we expect that the theoretical covariance matrix (approximated by the sample covariance) will have

As defined, our axes of variation do depend on the relative sample sizes of the underlying populations, so that differences between populations each with a large sample size are upweighted. This is worth remembering when interpreting the results, but does not seem a major impediment to analysis.

We do not review in detail older methods for testing for significance. One technique is the “broken stick” model [

We believe that the application of PCA to genetic data—and our way of analyzing the data—provides a natural method of uncovering population structure, for reasons that are subtle; thus, it is useful to spell them out explicitly. In most applications of PCA, the multivariate data has an unknown covariance, and PCA is attempting to choose a subspace on which to project the data that captures most of the relevant information. In many such applications, a formal test for whether the true covariance is the identity matrix makes little sense.

In genetics applications we believe the situation is different. Under standard population genetics assumptions such as a panmictic population, the natural null is that the eigenvalues of the true covariance matrix are equal, a formal test is appropriate, and deviations from the null are likely to be of real scientific and practical significance. To support this, in our experience on real data we take our null very seriously and attempt to explain all statistically significant axes of variation. Often the explanation is true population structure in the data, but we also often expose errors or difficulties in the data processing. Two examples follow.

In some population genetic data from African populations, the fourth axis of variation showed some San individuals at each end of the axis. This made little genetic sense, and the cause was some related samples that should have been removed from the analysis.

In a medical genetic study, an unexplained axis of variation was statistically correlated with the micro-titer plates on which the genotyping had been carried out. This suggested that the experimental setup was contributing to the evidence for structure, instead of real population differentiation.

In both these cases more careful data preprocessing would have eliminated the problem, but analysis and preparation of large datasets is difficult, and more tools for analysis and error-detection are always of benefit.

In many practical applications, samples will already be grouped into subpopulations (for instance, in medical genetics there are often two populations: cases and controls). It is natural to want to test if our recovered eigenvectors reflect differences among the labeled subpopulations. We therefore fix some eigenvector, and can regard each individual as associated with the corresponding coordinate of the eigenvector. We want to test if the means of these coordinate values in each subpopulation differ significantly. Our motivation is firstly that this is a powerful check on the validity of our (unsupervised) Tracy–Widom statistics, and secondly that the supervised analysis helps in interpretation of the recovered axes of variation.

The conventional statistic here is an ANOVA F-statistic. (See for instance [
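A minimal self-contained version of the one-way ANOVA F-statistic on an eigenvector's coordinates (the textbook formula; no claim that this matches our implementation's handling of edge cases such as empty groups):

```python
import numpy as np

def anova_f(coord, labels):
    """One-way ANOVA F-statistic testing whether the mean eigenvector
    coordinate differs between labeled subpopulations."""
    coord = np.asarray(coord, dtype=float)
    labels = np.asarray(labels)
    groups = [coord[labels == g] for g in np.unique(labels)]
    k, N = len(groups), len(coord)
    grand = coord.mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (N - k))
```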

Plots of the first two eigenvectors for some African populations in the CEPH–HGDP dataset [

Statistics from HGDP African Data

There is excellent agreement between the supervised and unsupervised analysis. The lack of significance of the third eigenvector is an indication that no additional structure was apparent within the Bantu.

Our second study took samples from three regions: Northern Thailand, China (Han Chinese), and Japan. The last two population samples were available from the International Hapmap Project [

In

Plots of the first two eigenvectors for a population from Thailand and Chinese and Japanese populations from the International Haplotype Map [

Statistics from Thai/Chinese/Japanese Data

In the third dataset, which was created and analyzed by Mark Shriver and colleagues [

Statistics from Shriver Dataset

In all the datasets mentioned above, we have very good agreement between the significance of the TW statistic, which does not use the population labels, and the ANOVA, which does. This verifies that the TW analysis is correctly labeling the eigenvectors as to whether they are reflecting real population structure.

Shriver and colleagues [

A recent paper by Baik, Ben Arous, and Péché [

Let L_1 be the lead eigenvalue of the true covariance matrix, all other eigenvalues being σ² = 1, and let λ_1 be the largest eigenvalue of the sample covariance. We will let m and n tend to infinity with n/m tending to a constant γ.

(1) If L_1 < 1 + 1/√γ, then λ_1 behaves just as in the case of no structure: suitably normalized, λ_1 has a Tracy–Widom limiting distribution, and the structure is asymptotically undetectable.

(2) If L_1 > 1 + 1/√γ, then λ_1 separates from the remaining eigenvalues, and the structure is asymptotically detectable.

That is, the behavior of λ_1 is qualitatively different depending on whether L_1 is greater or less than 1 + 1/√γ, and the transition, as the data size grows, becomes arbitrarily sharp. The result as stated above is proved in [ for complex-valued data, but the phase change occurs in the real case as well, according to whether L_1 is greater or less than 1 + 1/√γ.

Consider an example of two samples each of size

This is interesting by itself:

The phase change threshold occurs when F_ST is equal to the reciprocal of the square root of the data size D = nm, independently of the number of individuals and markers, at least when both are large.

At a fixed data size, the expected value of the leading eigenvalue of the data matrix (and the power to detect structure) is of course a continuous function of F_ST; the sharp transition is an asymptotic phenomenon.

Let us take a data size of nm = 2^20 (about one million genotypes), so that the BBP threshold is F_ST = 2^−10. We let the number of samples be 2^k and the number of markers be 2^{20−k}, so that the data size is 2^20.

Now for each value of k we carried out simulations, varying F_ST from 2^−13 to 2^−7. For each simulation, we compute λ_1, the TW statistic, and a P value.

We ran a series of simulations, varying the sample size and the number of markers while keeping the data size fixed at 2^20. Thus the predicted phase change threshold is F_ST = 2^−10. We vary F_ST and plot the results on a log_10 scale.

The phase change is evident. Further, from [


Another implication is that these methods are sensitive. For example, given a 100,000 marker array and a sample size of 1,000, the BBP threshold for two equal subpopulations, each of size 500, is F_ST ≈ 0.0001; structure at an F_ST only a few times this threshold, still a very small divergence, will be readily detectable.
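Under the threshold rule described above, F_ST ≈ 1/√D with data size D = nm, the arithmetic is a one-liner:

```python
import math

def bbp_threshold(n_samples, n_markers):
    """Approximate F_ST detection threshold for two equal subpopulations:
    the reciprocal of the square root of the data size D = n * m."""
    return 1.0 / math.sqrt(n_samples * n_markers)
```

With 1,000 samples and 100,000 markers this gives 0.0001, matching the example above; with a data size of 2^20 genotypes it gives 2^−10.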

The BBP phase change occurs at F_ST = 1/√(nm). To examine behavior near the threshold, we simulated datasets with F_ST below, at, and above this value.

Thus, we will compute three P values for each simulated dataset.

BBP Phase Change: Eigenanalysis and STRUCTURE

Here the BBP threshold is .0025. Below the threshold nothing interesting is found by the TW unsupervised statistic. Above the threshold, the TW statistic is usually highly significant, and the ANOVA analyses show that the true structure has become apparent. At the threshold we see intermediate behavior, with detection succeeding in some runs and failing in others.

Summarizing: below the threshold, neither procedure succeeds with reasonable probability, at the threshold success is variable, and above the threshold success is nearly guaranteed.

In an admixed population, the expected allele frequency of an individual is a linear mix of the frequencies in the parental populations. Unless the admixture is ancient—in which case the PCA methods will fail as everyone will have the same ancestry proportion—then the mixing weights will vary by individual. Because of the linearity, admixture does not change the axes of variation, or, more exactly, the number of “large” eigenvalues of the covariance is unchanged by adding admixed individuals, if the parental populations are already sampled. Thus, for example, if there are two founding populations, admixed individuals will have coordinates along a line joining the centers of the founding populations.

We generated simulated data, by taking a trifurcation between populations with a specified F_ST between them.

We show a simple demography generating an admixed population. Populations

We plot the first two principal components. Population C is a recent admixture of two populations, B and a population not sampled. Note the large dispersion of population C along a line joining the two parental populations. Note the similarity of the simulated data to the real data of

There remain issues to resolve here. Firstly, recent admixture generates large-scale LD which may cause difficulties in a dense dataset as the allele distributions are not independent. These effects may be hard to alleviate with our simple LD correction described below. STRUCTURE [

A third issue is that our methods require that divergence is small, and that allele frequencies are divergent primarily because of drift. We attempted to apply our methods to an African-American dataset genotyped on a panel of ancestry-informative markers [_{ST}

This is an issue for our TW techniques, but not for PCA as such. Indeed, on this dataset the correlation of our principal eigenvector with the estimated European ancestry for each individual recovered by the admixture analysis program ANCESTRYMAP [

Finally, if “admixture LD” is present, so that in admixed individuals long segments of the genome originate from one founder population, simple PCA methods will not be as powerful as programs such as STRUCTURE [

The theory above works well if the markers are independent (that is have no LD), but in practice, and especially with the large genotype arrays that are beginning to be available, this is difficult to ensure. In extreme cases uncorrected LD will seriously distort the eigenvector/eigenvalue structure, making results difficult to interpret. Suppose, for example, that there is a large “block” [

We recommend the following if LD between markers is a concern in the data. Pick a small integer L, the number of preceding markers used to correct each marker.

Set the value used for each marker to be the residual after regressing the marker's normalized genotype values on the L preceding markers. We show the corresponding plots. Note that here the uncorrected statistic is distributed quite differently than the Tracy–Widom distribution. Our suggested correction strategy seems to work well, and should be adequate in practice, especially as most large genotype arrays will attempt to avoid high levels of LD. We would recommend that before analyzing a very large dataset with dense genotyping, one should filter the data by removing a marker from every pair of markers that are in tight LD.
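A minimal sketch of such a residual-based LD correction, assuming ordinary least-squares regression of each marker (column) on the L preceding columns of the normalized matrix:

```python
import numpy as np

def ld_correct(X, L=1):
    """Replace each marker (column) of the normalized samples-x-markers
    matrix X by its residual after least-squares regression on the L
    preceding columns; a sketch of the LD correction described in the text."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    n = X.shape[1]
    for j in range(1, n):
        preds = X[:, max(0, j - L):j]      # m x (<=L) predictor matrix
        beta, *_ = np.linalg.lstsq(preds, X[:, j], rcond=None)
        out[:, j] = X[:, j] - preds @ beta # residual, orthogonal to predictors
    return out
```

By construction each corrected column is orthogonal to the preceding columns used in its regression, which removes the local correlation that would otherwise inflate the top eigenvalues.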

P–P plots of the TW statistic, when no LD is present and after varying levels (

(A) Shows P–P plots of the TW statistic (

(B) Shows a similar analysis with

In the work above on the BBP phase change, we already showed some comparisons between STRUCTURE and our methods. A fair comparison to STRUCTURE is not easy, as the two programs have subtly different purposes and outputs. STRUCTURE attempts to describe the population structure by probabilistic assignment to classes, and we are attempting to determine the statistically significant “axes of variation,” which does not necessarily mean the same thing as assigning individuals to classes.

Our impression, confirmed by

STRUCTURE is a sophisticated program with many features we have not attempted to match. STRUCTURE has an explicit probability model for the data, and this allows extra options and flexibility. It incorporates a range of options for ancestry and for allele frequencies, and has explicit options for modeling microsatellite distributions.

On the other hand, eigenanalysis has advantages over STRUCTURE. First, it is fast and simple, and second, it provides a formal test for the number of significant axes of variation.

One future possibility is to somehow incorporate recovered significant eigenvectors into STRUCTURE—in particular with regard to choosing the number of subpopulations, which is not statistically robust in the STRUCTURE framework. A sensible default for the number of clusters in STRUCTURE is one more than the number of significant eigenvalues under the TW statistic.

The most problematic issue when applying any method to infer population structure is that genotyping may introduce artifacts that induce apparent (but fallacious) population structure.

Missing genotypes by themselves are not the most serious concern. Simply setting the corresponding entries of the normalized matrix to zero (that is, imputing missing genotypes at the marker mean) works reasonably well in practice.

Another issue that may produce confounding effects is if data from different populations or geographical areas is handled differently (which may be inevitable, especially in the initial processing); then, in principle, this may induce artifacts that mimic real population structural differences. Even restricting analysis to markers with no missing data, apart from an inevitable power loss, does not necessarily eliminate the problems. After all, if a subset (the missing data) is chosen in a biased way, then the complementary subset must also be biased.

We have no complete solution to these issues, though there is no reason to think that our eigenvector-based methods are more sensitive to the problems than other techniques [

Another possible source of error, where the analyst must be careful, is the inclusion of groups of samples that are closely related. Such a “family” will introduce (quite correctly from an algorithmic point of view) population structure of little genetic relevance, and may confound features of the data of real scientific interest. We found that this occurred in several real datasets that we analyzed with eigenanalysis and in which related individuals were not removed.

For many genetic datasets, it is important to try to understand the population structure implied by the data. STRUCTURE [

We can only uncover structure in the samples being analyzed. As pointed out in [

Our methods are conceptually simple, and provide great power, especially on large datasets. We believe they will prove useful both in medical genetics, where population structure may cause spurious disease associations [

We justify our estimator of the “effective number of markers.”

Theorem 1.

Let λ_1, λ_2, …, λ_m be the eigenvalues of an m × m Wishart matrix MM′, where M is m × n with entries that are Gaussian with mean 0 and variance σ².

Write S_1 = λ_1 + … + λ_m and S_2 = λ_1² + … + λ_m².

Then estimates of n and of σ², approximately unbiased for large m and n, can be constructed from S_1 and S_2 alone.

Note that in this section we define our Wishart as

Proof:

Let λ_1, λ_2, …, λ_m be as in the statement of the theorem.

To obtain the distribution of the sum λ_1 + λ_2 + … + λ_m, note that it equals Tr(MM′) = Σ_{i,j} M(i,j)², so that (λ_1 + … + λ_m)/σ² is distributed as a χ² with mn degrees of freedom.
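The moment calculation behind the estimator can be made explicit. The following is our reconstruction from the standard Wishart moments; it yields an estimator of n that is approximately unbiased for large m:

```latex
S_1=\sum_i\lambda_i,\qquad S_2=\sum_i\lambda_i^2,
\]
\[
\mathbb{E}[S_1]=mn\sigma^2,\qquad
\mathbb{E}[S_1^2]=\sigma^4\left(m^2n^2+2mn\right),\qquad
\mathbb{E}[S_2]=\sigma^4\,mn(m+n+1),
\]
so that
\[
\mathbb{E}\!\left[mS_2-S_1^2\right]=\sigma^4\,mn(m-1)(m+2),
\]
and the moment-matching estimates are
\[
\hat n=\frac{(m+1)\,S_1^2}{mS_2-S_1^2},\qquad
\hat\sigma^2=\frac{S_1}{m\hat n},
\]
with
\[
\frac{\mathbb{E}\!\left[(m+1)S_1^2\right]}{\mathbb{E}\!\left[mS_2-S_1^2\right]}
=\frac{(m+1)(mn+2)}{(m-1)(m+2)}
=n\left(1+O\!\left(m^{-2}\right)\right)+O\!\left(m^{-1}\right).
```

Note that the denominator mS_2 − S_1² is nonnegative by the Cauchy–Schwarz inequality, vanishing only when all eigenvalues are equal.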

It would be interesting to estimate the standard error for

We next show that normalizing the eigenvalues of an m × m Wishart, so that they sum to a fixed constant, does not affect the asymptotic distribution of the test statistic.

Theorem 2.

Consider the normalized eigenvalues λ_i, originating from an m × n matrix M whose entries are Gaussian with mean 0. Then the asymptotic Tracy–Widom distribution of the suitably normalized largest eigenvalue λ_1 is unaffected by rescaling the eigenvalues to a fixed sum.

Proof:

Let

We now show that this implies that

All three probabilities on the right hand side of

By Johnstone's theorem,

We now turn to genetic (genotype count) data, and analyze the theoretical covariance matrix of the data. We concentrate on the covariance of the sample genotypes at a single biallelic marker. Note that in contrast to the results for a Wishart discussed in Theorem 2, we are now interested in a case where there is population structure, which implies dependence between the samples.

Consider sampling a marker from samples belonging to _{i}_{j}_{j}

We must specify the covariance of the population frequency vector

We assume that there is a hidden allele frequency _{ii}_{ST}_{i}_{i}_{j}_{j}_{i}_{(j)}. This assumes Hardy–Weinberg equilibrium in each of the

Theorem 3.

_{i}^{*} has mean 0. Let V^{*} be the covariance matrix of C^{*} and set

1.

2.

3. _{k}

4. ^{*}

5. If the matrix ^{*}

Proof:

Let _{ij}_{ij}_{ii}_{π}_{(i)} and _{ij}_{π}_{(i),π(j)}. So the covariance structure depends only on the population labels of the samples. It follows that the vector space of _{i}_{i}_{k}_{k}_{k}_{k}_{k}_{k}

^{*}_{k},

We now consider the action of ^{[k]} be the vector whose coordinates are 0 except for samples

The vectors ^{[k]} form an orthonormal basis for _{k}^{[k]}._{k}

In the main paper we subtract the sample mean from the counts

Set

Set
^{*}_{k},

This is enough to prove that

Next, ^{*}^{*}^{*}^{*}^{*}^{*}

The case in which

For completeness, we define the TW density. Our description is taken from [

Let

A table of the TW right-tail area, and density, is available on request.

We believe this work raises some challenges to theoretical statisticians. Our results with genetic simulations would be even more convincing if there were theorems (say for the Wishart case where the data matrix has Gaussian entries) that showed: 1) that using the effective number of markers calculated by

For the data used in

For the data used in

For the data of Mark Shriver and colleagues [

In the eigenanalysis of the Shriver data, we examine no more than two markers as independent regression variables for each marker we analyze, insisting that any marker that enters the regression be within 100,000 bases of the marker being analyzed. This slightly sharpens the results. Varying these parameters made little difference.

For all STRUCTURE runs, we ran with a burn-in of 10,000 iterations with 20,000 follow-on iterations, and no admixture model was used. Computations were carried out on a cluster of Intel Xeon compute nodes, each node having a 3.06-GHz clock.

For our coalescent simulations, we assumed a phylogenetic tree on the populations, and at each simulated marker, ran the coalescent back in time to the root of the tree. At this point we have a set of ancestors

For the admixture analysis that created the plot of

SMARTPCA, a software package for running eigenanalysis in a LINUX environment, is available at our laboratory:

We are grateful to Alan Edelman for very helpful advice and references to the modern statistical literature and a very useful preprint, and to Plamen Koev for a numerical table of the TW-distribution. We thank Craig Tracy for some important corrections and advice, and Jinho Baik for alerting us to his recent work. The comments of three anonymous referees have greatly improved the manuscript. We thank Jonathan Seidman and Dr. S. Sangwatanaroj for early access to the Thai data. We are grateful to Mark Shriver for giving us access to the data of [

Abbreviations: LD, linkage disequilibrium; PCA, principal components analysis.