PP, EZ, and PD conceived and designed the experiments. PP and PD performed the experiments and analyzed the data. All authors contributed reagents/materials/analysis tools. PP, EZ, and PD wrote the paper.

The authors have declared that no competing interests exist.

Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA), and recent results in theoretical computer science, we present a novel algorithm that, applied to genome-wide data, selects small subsets of SNPs (PCA-correlated SNPs) to reproduce the structure found by PCA on the complete dataset, without use of ancestry information. Evaluating our method on a previously described dataset (10,805 SNPs, 11 populations), we demonstrate that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. We validate our methods on the HapMap populations and achieve perfect intercontinental differentiation with 14 PCA-correlated SNPs. The Chinese and Japanese populations can be easily differentiated using fewer than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. We show that, in general, structure informative SNPs are not portable across geographic regions. However, we manage to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations. Compared to analysis with the measure of informativeness, our methods, although unsupervised, achieved similar results. We proceed to demonstrate that our algorithm can be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals, 7,257 SNPs), we show that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. We subsequently validate these SNPs for structure identification in an independent Puerto Rican dataset.
The algorithm that we introduce runs in seconds and can be easily applied to large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations.

Genetic structure among and within human populations reflects ancient and recent historical events, migrations, bottlenecks, and admixture, and carries the signatures of random drift and natural selection. The complex interplay among these forces results in patterns that could be used as tools in diverse areas of genetics. In population genetics, uncovering population structure can be used to trace the histories of the populations under study [

One of the prevailing methods for identifying population structure is a model-based algorithm implemented in the program STRUCTURE [

Identifying a set of markers that could effectively be used for inference of population structure will reduce genotyping costs and will potentially provide insight into genetic loci that have undergone selective pressures. Several approaches have been used to this end [_{ST}_{ST}_{n}_{ST}

We have developed a novel algorithm to identify a subset of SNP markers that capture major axes of genetic variation in a genotypic dataset without use of any prior information about individual ancestry or membership in a population. Our approach is a greedy deterministic variant of a Monte-Carlo algorithm that has provable performance guarantees [_{n}

We will first develop the theoretical underpinnings of our method and explain its connections to PCA. PCA is a linear dimensionality reduction technique that seeks to identify a small number of “dimensions” or “components” that capture most of the relevant structure in the data. In genetics, given a large number of genetic markers (e.g., thousands of SNPs) for a large number of individuals, PCA and the Singular Value Decomposition (SVD) have been used in order to infer population structure. We note here that SVD is the fundamental algorithmic and mathematical component of PCA; indeed, PCA is equivalent to computing the SVD of a distance matrix representing the data.
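As a minimal sketch of how the SVD yields principal-component coordinates (not the authors' implementation), the Python snippet below decomposes a small, made-up genotype matrix. The matrix values, the +1/0/-1 encoding, and all variable names are illustrative assumptions; the SVD is applied directly to the encoded matrix, without mean centering.

```python
import numpy as np

# Hypothetical toy data: 6 individuals (rows) x 5 SNPs (columns),
# genotypes encoded as +1 / 0 / -1 (an assumed encoding for illustration).
A = np.array([
    [ 1,  1,  0, -1,  1],
    [ 1,  1,  1, -1,  0],
    [ 1,  0,  1, -1,  1],
    [-1, -1,  0,  1,  1],
    [-1, -1,  1,  1,  0],
    [-1,  0, -1,  1,  1],
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                       # number of components kept (chosen by hand here)
coords = U[:, :k] * s[:k]   # coordinates of each individual in the top-k subspace
A_k = coords @ Vt[:k, :]    # best rank-k approximation of A

print(np.round(coords, 2))
```

Each row of `coords` places one individual in the low-dimensional space, and the rows of `Vt[:k]` are the corresponding directions in SNP space.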

Consider a SNP data matrix ^{i}^{i}_{i}

For SNP data matrices ^{i}

Since eigenSNPs are mathematical abstractions and do not correspond to actual SNPs, a natural question arises: is it possible to identify a small subset of actual SNPs (i.e., columns of the original data matrix) such that the top few singular vectors of the matrix containing only the chosen SNPs are strongly correlated with the top few singular vectors of the original SNP matrix? Drineas et al. [

Assume that there are ^{j}

Here, ^{j}

We will pick columns of

Drineas et al. [_{j}_{j}_{j}

In order to compute the scores _{j}

We evaluated our methods extensively on a previously described dataset of 11 populations from around the world [_{n}

We first examined if our algorithms could be used to select a small subset of SNPs that cluster individuals in broad continental regions. The studied populations can be assigned to four different continents: Africa (Mbuti, Mende [East African], Burunge [West African], and African Americans), Europe (European Americans and Spanish), Asia (Mala [South Indian], East Asian, and South Altaian), and America (Nahua and Quechua). A total of 9,419 SNPs were included in our analysis. As discussed later in this report, PCA will recognize much finer resolution than broad intercontinental clustering. So, for this particular experiment we manually set the number of principal components for further analysis to three. The rationale behind this choice is that the first principal component captures 37.4% of the variance in the data, the second an additional 7.5%, the third an additional 3.1%, whereas the contribution of the fourth one drops below 1.5%. (The experimental results would be essentially the same even if four principal components were kept.) These top three eigenvectors were used to cluster the data in four clusters using the

The reported correlation coefficient is averaged over all populations in the respective geographic region or over the broad continental clusters.

Next, we reproduced this result using only a small subset of SNPs. Calculating the scores described in

(A) Raster plot of 255 subjects from four different continental regions with respect to 9,419 SNPs (red/green denotes homozygotic individuals and black denotes heterozygotic individuals).

(B) The scores _{j}

(C) Raster plot of the 255 subjects with respect to the top 30 PCA-correlated SNPs. Notice the patterns formed in the four continental blocks.

(D) Plot of the 255 subjects in the “optimal” 2-D space using the top 30 PCA-correlated SNPs.

(E) Raster plot of the 255 subjects with respect to the top 30 _{n}

We compared the efficiency of the PCA-correlated SNPs that we selected to that of a set of SNPs selected using the informativeness for assignment measure (_{n}_{n}_{n}_{n}_{n}_{n}_{n}

We repeated the aforementioned procedure using sets of randomly selected SNPs. Experiments were repeated 30 times, each time selecting ten to 200 random SNPs. The average correlation coefficient of individual membership using these random SNPs to membership using all available SNPs is shown in

In order to validate our method, we split the studied individuals into a training set (50%) and a test set (50%). This time, we first used both our method and the _{n}_{n}

(A, B) Split of our worldwide sample in 50% training and 50% test set. Average correlation coefficient between true and predicted membership of an individual to a continental region using sets of (A) ten to 200 PCA-correlated or (B) ten to 200 high-_{n}

(C) Application of the SNP panels selected for intercontinental clustering in our worldwide sample, on the HapMap populations (average correlation coefficient between true and predicted membership of an individual to one of three continents is shown).

Finally, we examined the value of the SNP panels that we selected for clustering individuals to different continental regions by testing their performance for assigning individuals from the four HapMap populations (Yoruba in Ibadan, Nigeria; Utah residents with ancestry from northern and western Europe (CEPH); Han Chinese in Beijing, China; and Japanese in Tokyo, Japan) to their true continent of origin (_{n}_{n}

We next tested the efficiency of our methods for detecting population structure in finer detail. To this end, we studied populations that originated from the same geographic region separately, and repeated the empirical analysis described earlier in this report. Three of the populations that we studied are indigenous Africans. We tried to define a subset of SNPs that could be used in order to accurately cluster individuals to each of these populations (_{n}_{n}_{n}

Overlap between the Top 200 PCA-Correlated SNPs and the Top 200 _{n}

Adding the admixed ancestry population of African Americans to this group decreases our ability to perfectly cluster individuals in a distinct population of origin using _{n}_{n}

We next studied two European populations: a Spanish sample and a broadly defined Caucasian sample (_{n}_{n}

For the Asian and American populations, 9,707 and 8,202 SNPs, respectively, were analyzed (_{n}_{n}_{n}

Both our method and the _{n}^{2} between all pairs. Out of the thousands of possible pairs very few are actually in high LD. The same is true for the top 200 _{n}
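A pairwise LD check of this kind can be sketched as follows. This is only an illustration under stated assumptions: r^{2} is approximated here by the squared Pearson correlation of unphased genotype vectors, the threshold of 0.8 is hypothetical, and the function names and toy genotypes are invented for the example. The text above does not specify how r^{2} was computed.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as 0 here
    return sxy * sxy / (sxx * syy)


def pairs_in_high_ld(snps, threshold=0.8):
    """Return index pairs (i, j), i < j, whose r^2 reaches the threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(snps)), 2)
        if r_squared(snps[i], snps[j]) >= threshold
    ]


# Hypothetical genotypes for three SNPs across six individuals.
snps = [
    [1, 1, 0, -1, -1, 0],
    [-1, -1, 0, 1, 1, 0],   # perfectly anti-correlated with the first SNP
    [1, -1, 0, 0, 1, -1],   # nearly uncorrelated with both
]
high_ld = pairs_in_high_ld(snps)
print(high_ld)
```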

Number of Pairs among the Top 200 PCA-Correlated (PCA-c.) and _{n}

We finally explored the feasibility of accurately clustering two populations of related ancestry, the Han Chinese and the Japanese, using data available from the HapMap database [_{n}_{n}_{n}_{n}

(A) Projection of all 90 Han Chinese and Japanese individuals on the top two principal components using PCA on all available SNPs

(B)

(C) Average correlation coefficient between true and predicted membership of an individual to the Japanese or Han Chinese populations, using PCA and _{n}_{n}

We investigated whether SNPs that were selected for assigning individuals to clusters in one continent would be useful in another continent or for intercontinental differentiation. In an effort to answer this question, we tested the panels that we selected in each of the four continental regions that we studied (both using the PCA-correlated measure and _{n}_{n}

Next, we explored the possibility of ascertaining a general SNP panel that could be used for ancestry inference and the study of population structure around the world. Results are shown in _{n}_{n}

(A) Projection of all individuals of nine indigenous populations on the top three principal components using PCA on all available SNPs. (Ten significant principal components were actually detected.)

(B) Average correlation coefficient between true and predicted membership of the individuals to the nine populations, using PCA and _{n}

In order to test how the set of PCA-correlated SNPs is modified each time an additional population is added to the analysis, we studied incrementally distinct subsets of the data and compared the SNPs selected as structure informative in each subset to the panel of 50 SNPs that are sufficient for accurate clustering of the individuals to nine different populations (see experiment described above). The first subset of populations that we analyzed consisted of genotypes for one population from each continent (East African, Spanish, East Asian, and Nahua) and in each round one additional population was added randomly to the analysis. Results are shown in

Incremental Analysis of Nine Populations and Effect on the Selection of PCA-Correlated SNPs

Finally, we investigated the applicability and efficiency of our method to select structure informative SNPs in admixed populations. To this end, we studied two independent samples of Puerto Ricans. The first dataset (Puerto Rican A) is a sample of 192 Puerto Ricans [

It is well known that Puerto Ricans are genetically complex and composed of various proportions of Native American, African, and European genetic origins. We first investigated the Puerto Rican A dataset, and explored the ancestry of the 192 individuals across the African–European and the African–European/American axis (

(A) PCA on 7,259 SNPs typed on Puerto Rican dataset A, as well as Europeans (Spanish and Caucasians), West Africans (Burunge), and Native Americans (Nahua and Quechua) (axes of variation are shown).

(B) Projection of 192 individuals from Puerto Rican dataset A on two significant principal components and variation across the European-West African axis.

(C) Comparison of ancestry coefficient of 192 Puerto Ricans across the West African-European axis and predicted ancestry coefficient using the top 200 PCA-correlated SNPs.

(D) Prediction of West African-European ancestry coefficient in Puerto Rican dataset A using PCA-correlated SNPs versus random SNPs.

(E) Using PCA-correlated SNPs selected as structure informative in Puerto Rican dataset A for ancestry prediction in Puerto Rican dataset B.

We interpreted individuals from the Puerto Rican sample as a combination of European and African ancestry, with the proportion of admixture being equal to the distance of each individual from the centroid of the ancestral population (ancestry coefficient, see
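One way to read this definition is sketched below; it is an interpretation, not the authors' procedure. The sketch assumes each individual is projected onto the line joining the centroids of two ancestral populations in principal-component space, and reports its relative position along that line. The function names, the clipping to [0, 1], and the coordinates are all hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def ancestry_coefficient(individual, centroid_a, centroid_b):
    """Relative position of an individual between two ancestral centroids.

    Returns 0.0 at centroid_a and 1.0 at centroid_b; values are clipped
    to [0, 1] (an assumption made for this sketch).
    """
    axis = [b - a for a, b in zip(centroid_a, centroid_b)]
    offset = [x - a for x, a in zip(individual, centroid_a)]
    t = sum(o * d for o, d in zip(offset, axis)) / sum(d * d for d in axis)
    return min(1.0, max(0.0, t))


# Hypothetical coordinates on two principal components.
europeans = [(-0.5, 0.0), (0.5, 0.0)]
africans = [(1.5, 0.0), (2.5, 0.0)]
coefficient = ancestry_coefficient(
    (0.5, 0.7), centroid(europeans), centroid(africans)
)
print(coefficient)
```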

Finally, we cross-validated these findings by applying the panel of PCA-correlated SNPs that we selected on the Puerto Rican A dataset to infer individual ancestry in the Puerto Rican B dataset. As shown in

Geographic ancestry can be inferred from genotypic data [

Our algorithm is simple and computationally fast (less than one minute for the largest runs presented here) and thus allows the analysis of very large genome-wide datasets with thousands of individuals. Perhaps the most important advantage of our method for selecting PCA-correlated SNPs is that it is nonparametric and does not rely on any assumptions or modeling of the data. We simply detect SNPs that are correlated with the subspace spanned by the top few eigenvectors after determining the number of significant principal components. All other methods in the literature that are used to identify ancestry informative markers either rely on a specific model or are frequency based and demand prior knowledge of the origin of individuals [

It should be noted that the final number of SNPs needed to describe population structure is not directly provided by the method and can only be estimated through empirical evaluation of a specific dataset. Also, in applying

We were unable to compare the SNPs we selected as ancestry informative using our algorithm to published lists of ancestry informative markers [_{n}_{ST}, δ_{n}_{n}_{n}

Dissecting substructure in admixed populations is a central challenge in association studies, especially for common complex disorders [

Our findings demonstrate that to a large extent, SNPs identified as structure informative in one geographic region are not portable for the analysis of populations in a different geographic region, suggesting that the forces that shaped population structure in each geographic region have influenced different parts of the genome. However, analyzing jointly nine populations from around the world and 9,160 SNPs, we showed that using 50 PCA-correlated SNPs we can assign the studied individuals with 100% accuracy to their population of origin. SNPs with high-_{n}_{n}_{n}

Even though our results suggest that our method is powerful enough to be used for the identification of a universal panel of SNPs for the analysis of different populations from around the world, we also showed that each time a new population is added to the analysis, the panel of SNPs needed for population differentiation is modified. So, it should be made clear that we only studied a few representative populations from each continent and much more detailed studies are needed in order to test a universal structure informative SNP panel. This is also true for each of the continental regions that we discussed. We believe that many more population samples should be analyzed in order to accurately define a set of SNPs that could be used to reproduce fine-resolution population structure in a given geographic region.

We have not dealt with the effect of local LD on the results of our algorithm and PCA in general. We showed that given the worldwide dataset that we analyzed here, structure informative SNPs picked by our method are not redundant for the most part in terms of LD. However, as SNP scans become denser, local LD will become a prominent feature of a dataset and we are currently working to see how this affects PCA. At the same time, since our method is not allele frequency based, it is possible that we are able to pick up global correlations among SNPs and haplotype patterns, and more research is necessary to clarify the relationship between the output of PCA and LD.

In summary, we have developed a fast and simple algorithm for the selection of SNPs that uncover the structure of populations without knowing a priori the origin of individuals. After extracting meaningful dimensions from a dataset using PCA, we pick small sets of markers that retain the information carried in the full dataset. We believe that PCA-based algorithms will prove to be an invaluable tool for geneticists in a world of complex and ever-increasing genome-wide data.

The first dataset we used has been described in detail previously [

We transformed the raw data to numeric values, without any loss of information, in order to apply SVD. Our data on a population ^{X}^{X}^{X}_{1}_{2}_{1}B_{1}^{X}_{1}B_{2}^{X}_{2}B_{2}^{X}
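For illustration only, one lossless numeric encoding for a biallelic SNP maps the two homozygous genotypes to +1 and -1 and the heterozygous genotype to 0. The exact values are not legible in the text above, so the mapping, the function name, and the allele labels in this sketch are assumptions.

```python
def encode_genotype(genotype, allele_1, allele_2):
    """Encode a two-character genotype as +1, 0, or -1.

    Assumed mapping: two copies of allele_1 -> +1, one copy -> 0,
    zero copies -> -1. Missing genotypes (None) stay None.
    """
    if genotype is None:
        return None
    first, second = genotype
    if {first, second} - {allele_1, allele_2}:
        raise ValueError(f"unexpected allele in genotype {genotype!r}")
    copies_of_allele_1 = (first == allele_1) + (second == allele_1)
    return copies_of_allele_1 - 1


# Hypothetical SNP with alleles A and G typed in five individuals.
raw = ["AA", "AG", "GA", "GG", None]
encoded = [encode_genotype(g, "A", "G") for g in raw]
print(encoded)
```

Because the three genotypes map to three distinct values, the original calls can be recovered from the numeric matrix.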

In order to handle missing data without rejecting too many SNPs that may contain important structural information, we first removed all SNPs with more than 10% missing entries. (This was done independently for each experiment that we ran.) This results in an average of roughly 2% of missing entries in each SNP. We subsequently filled in the missing entries using a least-squares regression-based technique from [
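The filtering step can be sketched as below. The fill-in step in this sketch is deliberately simpler than the one described: it replaces each missing entry with the mean of the observed entries of that SNP, a plain stand-in for the regression-based technique, which is not reproduced here. The function name and the toy data are hypothetical.

```python
def filter_and_fill(matrix, max_missing=0.10):
    """Drop SNP columns with too many missing entries, then fill the rest.

    matrix: list of rows (individuals); entries are numbers or None (missing).
    Returns (kept_column_indices, filled_matrix).
    Filling uses the per-SNP mean, a stand-in for regression-based imputation.
    """
    n_rows = len(matrix)
    kept, means = [], []
    for j in range(len(matrix[0])):
        observed = [row[j] for row in matrix if row[j] is not None]
        missing = n_rows - len(observed)
        if missing > max_missing * n_rows:
            continue  # more than 10% missing: discard this SNP
        kept.append(j)
        means.append(sum(observed) / len(observed))
    filled = [
        [row[j] if row[j] is not None else m for j, m in zip(kept, means)]
        for row in matrix
    ]
    return kept, filled


# Hypothetical data: 10 individuals x 3 SNPs.
snp_0 = [1, 1, 1, 0, 0, -1, -1, 1, 1, 0]          # complete
snp_1 = [1, None, 1, 1, 0, 0, -1, -1, 1, 1]       # 10% missing: kept
snp_2 = [None, None, 0, 1, 1, -1, 0, 0, 1, 1]     # 20% missing: removed
toy = [list(row) for row in zip(snp_0, snp_1, snp_2)]
kept, filled = filter_and_fill(toy)
print(kept)
```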

Given the filled-in data matrix ^{T}

For concreteness, consider the SNP data matrix that emerges from the HapMap Han Chinese and Japanese populations, where the data matrix ^{1}^{2}^{1}^{2}^{1}^{2}

We summarize the algorithm for selecting PCA-correlated SNPs.

Let

Return the columns (SNPs) of _{j}
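The selection step can be sketched in a few lines of Python. The exact score formula is not legible in the text above, so this sketch assumes the usual leverage-score formulation: the score of SNP j is the squared Euclidean norm of the j-th column of the matrix holding the top k right singular vectors. The function name, the parameters `k` and `r`, and the toy matrix are hypothetical.

```python
import numpy as np


def pca_correlated_snps(A, k, r):
    """Return indices of the r highest-scoring SNPs (columns of A) and all scores.

    Assumed score: squared norm of each SNP's entries in the top-k right
    singular vectors of A.
    """
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k, :]                     # top-k right singular vectors, shape (k, n_snps)
    scores = np.sum(Vk ** 2, axis=0)   # one score per SNP; the scores sum to k
    top = np.argsort(-scores, kind="stable")[:r]
    return top, scores


# Hypothetical toy matrix: 6 individuals x 4 SNPs.
# SNPs 0 and 1 separate the first three individuals from the last three,
# SNP 2 carries an independent pattern, SNP 3 carries no signal.
A = np.array([
    [ 1,  1,  1, 0],
    [ 1,  1, -1, 0],
    [ 1,  1,  0, 0],
    [-1, -1,  1, 0],
    [-1, -1, -1, 0],
    [-1, -1,  0, 0],
], dtype=float)

top, scores = pca_correlated_snps(A, k=2, r=3)
print(sorted(top.tolist()), np.round(scores, 3))
```

Sorting by score makes the procedure deterministic, in contrast to sampling columns with probability proportional to their scores.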

An implementation of our method is posted at

Informativeness for assignment (_{n}

In order to compare two clusterings, we simply compute the correlation coefficient (normalized inner product) of the cluster indicator vectors. This is effectively the Pearson correlation coefficient without the mean centering; recall that our vectors are zero–one vectors. For example, in the HapMap Han Chinese and Japanese experiment described above, given a total of 90 individuals, the ground truth cluster indicator vector for the Han Chinese population is a vector whose first 45 entries are set to one and the remaining entries are set to zero. After running
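The comparison described above can be sketched as follows. The ground-truth vector follows the Han Chinese example in the text; the predicted vector, in which one individual is misassigned in each direction, is hypothetical, as is the function name.

```python
import math


def cluster_correlation(u, v):
    """Normalized inner product of two zero-one cluster indicator vectors.

    Both vectors must contain at least one nonzero entry.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Ground truth: the first 45 of 90 individuals belong to the cluster.
truth = [1] * 45 + [0] * 45
# Hypothetical prediction: one member missed, one non-member included.
predicted = [1] * 44 + [0] + [1] + [0] * 44

score = cluster_correlation(truth, predicted)
print(round(score, 4))
```

A score of one corresponds to identical clusters, and the score decreases as the overlap between the true and predicted clusters shrinks.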

We outline our analysis of the Puerto Rican dataset A (192 individuals). In

In order to determine whether the _{m − k}_{m − k}_{m − k}_{m − k}_{m − k}_{m − k}_{m − k}

Two special cases exist. First, we always keep at least two principal components. In all populations, except for the combination of Europeans and Spanish, the aforementioned algorithm returns at least two principal components anyway. However, if exactly one principal component is kept (in which case the associated subspace is just a line), our PCA-correlated SNPs algorithm can identify the appropriate correlations only after some normalization of the original data (e.g., mean centering). To avoid this unnecessary complication, we keep a minimum of two principal components, which embeds the data in at least the Euclidean plane. The second special case arises when the above algorithm selects too many principal components (e.g., more than 80% of all principal components). In this case, we simply skip dimensionality reduction and directly cluster the original data. This never happens when using all SNPs, but may happen when a small number of SNPs is selected from a very large dataset (e.g., ten out of 10,000 SNPs). In
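The two special cases amount to a small decision rule, sketched below. The function name and its parameters are hypothetical, and the 80% cutoff is the example value quoted in the text rather than a confirmed constant.

```python
def choose_num_components(estimated_k, total_components,
                          min_k=2, max_fraction=0.8):
    """Apply the two special cases to an estimated number of components.

    Returns the number of principal components to keep, or None when
    dimensionality reduction should be skipped and the original data
    clustered directly.
    """
    k = max(min_k, estimated_k)             # never keep fewer than two
    if k > max_fraction * total_components:
        return None                         # too many components: skip reduction
    return k


print(choose_num_components(1, 100))   # one significant component estimated
print(choose_num_components(5, 100))   # typical case
print(choose_num_components(9, 10))    # e.g. ten SNPs drawn from a large dataset
```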

Finally, we should mention that the aforementioned test could potentially be replaced by the test proposed by Patterson et al. [

(A) Examples of transferability of SNPs selected as structure informative in one geographic region (Africa and Europe) for dissecting population structure within a different region. (B) Average correlation coefficient between true and predicted membership of an individual to a particular population within four different geographic regions, using SNPs originally selected for broad intercontinental clustering.

(517 KB PDF)

The number of principal components is shown for _{n}

(71B PDF).

We would like to thank Mark Shriver for providing access to the worldwide dataset that we studied here.

LD, linkage disequilibrium

PCA, Principal Components Analysis

SVD, Singular Value Decomposition
