Population Structure and Eigenanalysis

doi:10.1371/journal.pgen.0020190

Figure 1.

The Tracy–Widom Density

Conventional percentile points are: P = 0.05, x = 0.9794; P = 0.01, x = 2.0236; P = 0.001, x = 3.2730.
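As a rough illustration of how these percentile points might be used, the sketch below maps a TW statistic to the strictest tabulated significance level it reaches. It assumes the quoted points are upper-tail critical values; the names `TW_POINTS` and `coarse_p_bound` are hypothetical and not from the paper.

```python
# Percentile points quoted in the caption, as (P, x) pairs, strictest first.
# Assumption: these are upper-tail critical values of the Tracy-Widom density.
TW_POINTS = [(0.001, 3.2730), (0.01, 2.0236), (0.05, 0.9794)]


def coarse_p_bound(x):
    """Return the smallest tabulated P whose critical value x reaches.

    Returns None when x falls below every tabulated critical value,
    i.e. the statistic is not significant at P = 0.05.
    """
    for p, critical_value in TW_POINTS:
        if x >= critical_value:
            return p
    return None
```

For example, `coarse_p_bound(2.5)` reports the 0.01 level, since 2.5 exceeds 2.0236 but not 3.2730. This only gives a coarse bound; an exact p-value needs the full TW distribution.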


Figure 2.

Testing the Fit of the TW Distribution

(A) We carried out 1,000 simulations of a panmictic population, where we have a sample size of m = 100 and n = 5,000 unlinked markers. We give a P–P plot of the TW statistic against the theoretical distribution; this shows the empirical cumulative distribution against the theoretical cumulative distribution for a given quantile. If the fit is good, we expect the plot will lie along the line y = x. Interest is primarily at the top right, corresponding to low p-values.

(B) P–P plot corresponding to a sample size of m = 200 and n = 50,000 markers. The fit is again excellent, demonstrating the appropriateness of the Johnstone normalization.
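The P–P construction described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes each simulated TW statistic has already been converted to its theoretical cumulative probability, and the helper names `pp_points` and `max_deviation` are hypothetical.

```python
def pp_points(theoretical_cdf_values):
    """Pair each theoretical cumulative probability with its empirical one.

    The input holds one theoretical CDF value per simulation run. After
    sorting, the i-th smallest value (1-based) has empirical cumulative
    probability i / N, where N is the number of runs.
    """
    ordered = sorted(theoretical_cdf_values)
    count = len(ordered)
    return [(u, (i + 1) / count) for i, u in enumerate(ordered)]


def max_deviation(points):
    """Largest vertical distance of the P-P points from the line y = x."""
    return max(abs(empirical - theoretical) for theoretical, empirical in points)
```

A good fit shows up as points close to the diagonal, so `max_deviation` is small. Plotting the pairs with any plotting library gives the P–P plot itself.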


Figure 3.

Testing the Fit of the Second Eigenvalue

We generated genotype data in which the leading eigenvalue is overwhelmingly significant (F_ST = 0.01, m = 100, n = 5,000) with two equal-sized subpopulations. We show P–P plots for the TW statistic computed from the second eigenvalue. The fit at the high end is excellent.


Figure 4.

Three African Populations

Plots of the first two eigenvectors for some African populations in the CEPH–HGDP dataset [30]. Yoruba and Bantu-speaking populations are genetically quite close and were grouped together. The Mandenka are a West African group speaking a language in the Mande family [15, p. 182]. The eigenanalysis fails to find structure in the Bantu populations, but separation between the Bantu and Mandenka with the second eigenvector is apparent.


Table 1.

Statistics from HGDP African Data


Figure 5.

Three East Asian Populations

Plots of the first two eigenvectors for a population from Thailand and Chinese and Japanese populations from the International Haplotype Map [32]. The Japanese population is clearly distinguished (though not by either eigenvector separately). The large dispersal of the Thai population, along a line where the Chinese are at an extreme, suggests some gene flow of a Chinese-related population into Thailand. Note the similarity to the simulated data of Figure 8.


Table 2.

Statistics from Thai/Chinese/Japanese Data


Table 3.

Statistics from Shriver Dataset


Figure 6.

The BBP Phase Change

We ran a series of simulations, varying the sample size m and number of markers n but keeping the product at mn = 2²⁰. Thus the predicted phase change threshold is F_ST = 1/√(mn) = 2⁻¹⁰. We vary F_ST and plot the log p-value of the Tracy–Widom statistic. (We clipped −log₁₀ p at 20.) Note that below the threshold there is no statistical significance, while above the threshold we tend to get enormous significance.
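The arithmetic behind this caption can be checked with a short sketch. It assumes the threshold is 1/√(mn), which is consistent with the stated values mn = 2²⁰ and F_ST = 2⁻¹⁰. The function names are hypothetical.

```python
import math


def bbp_threshold(m, n):
    """Predicted phase-change threshold in F_ST, assumed to be 1 / sqrt(m * n)."""
    return 1.0 / math.sqrt(m * n)


def clipped_neg_log10_p(p, cap=20.0):
    """Return -log10(p), clipped at `cap` as in the figure.

    A p-value of zero (underflow) is treated as reaching the cap.
    """
    if p <= 0.0:
        return cap
    return min(-math.log10(p), cap)
```

Any split of the product gives the same threshold: `bbp_threshold(1024, 1024)` and `bbp_threshold(256, 4096)` both equal 2⁻¹⁰, which is why the simulations can vary m and n while holding mn fixed.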


Table 4.

BBP Phase Change: Eigenanalysis and STRUCTURE


Figure 7.

Simulation of an Admixed Population

We show a simple demography generating an admixed population. Populations A,B,D trifurcated 100 generations ago, while population C is a recent admixture of A and B. Admixture weights for the proportion of population A in population C are Beta-distributed with parameters (3.5,1.5). Effective population sizes are 10,000.
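The admixture weights in this demography can be drawn as in the sketch below, which uses the Beta(3.5, 1.5) parameters from the caption (mean 3.5/5 = 0.7). The mixing rule in `admixed_frequency` is an illustrative assumption, not taken from the paper, and both function names are hypothetical.

```python
import random


def sample_admixture_weights(count, alpha=3.5, beta=1.5, seed=0):
    """Draw per-individual proportions of population A ancestry in population C."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.betavariate(alpha, beta) for _ in range(count)]


def admixed_frequency(weight_a, freq_a, freq_b):
    """Allele frequency for an individual with ancestry weight_a from A.

    Assumption: a simple linear mixture of the parental allele frequencies.
    """
    return weight_a * freq_a + (1.0 - weight_a) * freq_b
```

Because the weights vary between individuals, members of C spread out between the parental populations rather than forming a tight cluster.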


Figure 8.

A Plot of a Simulation Involving Admixture (See Main Text for Details)

We plot the first two principal components. Population C is a recent admixture of two populations, B and a population not sampled. Note the large dispersion of population C along a line joining the two parental populations. Note the similarity of the simulated data to the real data of Figure 5.


Figure 9.

LD Correction with no LD Present

P–P plots of the TW statistic, when no LD is present and after varying levels (k) of our LD correction. We first show this (A) for m = 500, n = 5,000, and then (B) for m = 200, n = 50,000. In both cases the LD correction makes little difference to the fit.


Figure 10.

LD Correction with Strong LD

(A) P–P plots of the TW statistic (m = 100, n = 5,000) with large blocks of complete LD. Uncorrected, the TW statistic fits hopelessly poorly, but after correction the fit is again good. Here, we show 1,000 runs with the same data size parameters as in Figure 2A, m = 100, n = 5,000, varying k, the number of columns used to “correct” for LD. The fit is adequate for any nonzero value of k.

(B) A similar analysis with m = 200, n = 50,000.
