Figure 1.
Conventional percentile points are: P = 0.05, x = .9794; P = 0.01, x = 2.0236; P = 0.001, x = 3.2730.
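As an illustration of how these percentile points might be used, the sketch below maps a computed TW statistic x to the most stringent conventional level it reaches. It assumes the points are upper-tail critical values (large x corresponds to a small p-value); the function name `significance_bracket` is hypothetical and not part of the original work.

```python
# Conventional Tracy-Widom percentile points quoted in the caption,
# listed from most to least stringent as (P level, critical x).
TW_CRITICAL_POINTS = [(0.001, 3.2730), (0.01, 2.0236), (0.05, 0.9794)]


def significance_bracket(x):
    """Return the smallest conventional P level whose critical point x reaches.

    Assumes an upper-tail test. Returns None when x is below the P = 0.05 point.
    """
    for p_level, critical_x in TW_CRITICAL_POINTS:
        if x >= critical_x:
            return p_level
    return None
```

For example, a statistic of x = 2.5 clears the P = 0.01 point but not the P = 0.001 point, so the function reports 0.01.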
Figure 2.
Testing the Fit of the TW Distribution
(A) We carried out 1,000 simulations of a panmictic population, where we have a sample size of m = 100 and n = 5,000 unlinked markers. We give a P–P plot of the TW statistic against the theoretical distribution; this shows the empirical cumulative distribution against the theoretical cumulative distribution for a given quantile. If the fit is good, we expect the plot will lie along the line y = x. Interest is primarily at the top right, corresponding to low p-values.
(B) P–P plot corresponding to a sample size of m = 200 and n = 50,000 markers. The fit is again excellent, demonstrating the appropriateness of the Johnstone normalization.
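The construction of a P–P plot can be sketched as follows. Because the Tracy–Widom cumulative distribution is not available in the Python standard library, this minimal example substitutes a standard normal distribution as a stand-in; the helper names `pp_points` and `normal_cdf` are hypothetical. To reproduce the plots described here, one would replace the draws with simulated TW statistics and `normal_cdf` with a TW cumulative distribution function.

```python
import math
import random


def normal_cdf(x):
    """Standard normal CDF, used here only as a stand-in for the TW CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def pp_points(samples, cdf):
    """Return (theoretical, empirical) cumulative probabilities for a P-P plot.

    For the value of rank r (1-based) among N sorted samples, the empirical
    cumulative probability is r / N and the theoretical one is cdf(value).
    """
    ordered = sorted(samples)
    count = len(ordered)
    return [(cdf(value), (rank + 1) / count) for rank, value in enumerate(ordered)]


# 1,000 simulated statistics, matching the number of simulations in panel (A).
rng = random.Random(0)
draws = [rng.gauss(0.0, 1.0) for _ in range(1000)]
points = pp_points(draws, normal_cdf)

# Largest vertical distance from the line y = x; small when the fit is good.
max_gap = max(abs(theoretical - empirical) for theoretical, empirical in points)
```

Points near (1, 1) correspond to the upper tail of the statistic, which is the region of primary interest for small p-values.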
Figure 3.
Testing the Fit of the Second Eigenvalue
We generated genotype data in which the leading eigenvalue is overwhelmingly significant (FST = .01, m = 100, n = 5,000) with two equal-sized subpopulations. We show P–P plots for the TW statistic computed from the second eigenvalue. The fit at the high end is excellent.
Figure 4.
Plots of the first two eigenvectors for some African populations in the CEPH–HGDP dataset [30]. Yoruba and Bantu-speaking populations are genetically quite close and were grouped together. The Mandenka are a West African group speaking a language in the Mande family [15, p. 182]. The eigenanalysis fails to find structure in the Bantu populations, but separation between the Bantu and Mandenka with the second eigenvector is apparent.
Table 1.
Statistics from HGDP African Data
Figure 5.
Plots of the first two eigenvectors for a population from Thailand and Chinese and Japanese populations from the International Haplotype Map [32]. The Japanese population is clearly distinguished (though not by either eigenvector separately). The large dispersal of the Thai population, along a line where the Chinese are at an extreme, suggests some gene flow of a Chinese-related population into Thailand. Note the similarity to the simulated data of Figure 8.
Table 2.
Statistics from Thai/Chinese/Japanese Data
Table 3.
Statistics from Shriver Dataset
Figure 6.
We ran a series of simulations, varying the sample size m and number of markers n but keeping the product at mn = 2^20. Thus the predicted phase change threshold is FST = 2^−10. We vary FST and plot the log p-value of the Tracy–Widom statistic. (We clipped −log10 p at 20.) Note that below the threshold there is no statistical significance, while above the threshold we tend to get enormous significance.
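The arithmetic behind the threshold can be checked with a short sketch. Reading the caption's product as mn = 2^20 and the threshold as FST = 2^−10, the relation consistent with both is FST = 1/sqrt(mn); that relation is inferred from these two numbers rather than stated in the caption. The function name `bbp_threshold` and the particular (m, n) pairs are hypothetical.

```python
import math


def bbp_threshold(m, n):
    """Predicted phase-change threshold on FST, assuming FST = 1 / sqrt(m * n)."""
    return 1.0 / math.sqrt(m * n)


# Illustrative (m, n) designs that all keep the product m * n at 2**20.
designs = [(2 ** k, 2 ** (20 - k)) for k in (5, 6, 7, 8)]
thresholds = [bbp_threshold(m, n) for m, n in designs]
```

Every design gives the same threshold, which is why the simulations can trade sample size against marker count while holding the predicted phase change fixed.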
Table 4.
BBP Phase Change: Eigenanalysis and STRUCTURE
Figure 7.
Simulation of an Admixed Population
We show a simple demography generating an admixed population. Populations A, B, and D trifurcated 100 generations ago, while population C is a recent admixture of A and B. Admixture weights for the proportion of population A in population C are Beta-distributed with parameters (3.5, 1.5). Effective population sizes are 10,000.
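The admixture step can be sketched as below. The Beta(3.5, 1.5) weights come from the caption; drawing each genotype as two independent alleles at the mixed allele frequency is an assumption made for illustration, as are the function names and the example allele frequencies.

```python
import random

# Beta parameters for the proportion of population A ancestry (from the caption).
ALPHA, BETA = 3.5, 1.5


def draw_admixture_weights(count, seed=0):
    """Draw one population-A ancestry proportion per admixed individual."""
    rng = random.Random(seed)
    return [rng.betavariate(ALPHA, BETA) for _ in range(count)]


def admixed_genotype(weight_a, freq_a, freq_b, rng):
    """Draw a 0/1/2 genotype at one marker for an individual from population C.

    Assumption: the individual's allele frequency is the ancestry-weighted mix
    of the parental frequencies, and the two alleles are drawn independently.
    """
    freq_c = weight_a * freq_a + (1.0 - weight_a) * freq_b
    return sum(rng.random() < freq_c for _ in range(2))


weights = draw_admixture_weights(10000)
mean_weight = sum(weights) / len(weights)

# One illustrative genotype, using hypothetical parental allele frequencies.
example_genotype = admixed_genotype(weights[0], 0.2, 0.6, random.Random(1))
```

The mean of a Beta(3.5, 1.5) distribution is 3.5 / (3.5 + 1.5) = 0.7, so on average about 70% of the ancestry in population C comes from population A, with substantial variation between individuals.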
Figure 8.
A Plot of a Simulation Involving Admixture (See Main Text for Details)
We plot the first two principal components. Population C is a recent admixture of two populations, B and a population not sampled. Note the large dispersion of population C along a line joining the two parental populations. Note the similarity of the simulated data to the real data of Figure 5.
Figure 9.
LD Correction with no LD Present
P–P plots of the TW statistic, when no LD is present and after varying levels (k) of our LD correction. We first show this (A) for m = 500, n = 5,000, and then (B) for m = 200, n = 50,000. In both cases the LD correction makes little difference to the fit.
Figure 10.
(A) P–P plots of the TW statistic (m = 100, n = 5,000) with large blocks of complete LD. Uncorrected, the fit of the TW statistic is hopelessly poor, but after correction the fit is again good. We show 1,000 runs with the same data size parameters as in Figure 2A (m = 100, n = 5,000), varying k, the number of columns used to “correct” for LD. The fit is adequate for any nonzero value of k.
(B) A similar analysis with m = 200, n = 50,000.