Ancestral Informative Marker Selection and Population Structure Visualization Using Sparse Laplacian Eigenfunctions

Identification of a small panel of population structure informative markers can reduce genotyping cost and is useful in various applications, such as ancestry inference in association mapping, forensics and evolutionary theory in population genetics. Traditional methods to ascertain ancestral informative markers usually require the prior knowledge of individual ancestry and have difficulty for admixed populations. Recently Principal Components Analysis (PCA) has been employed with success to select SNPs which are highly correlated with top significant principal components (PCs) without use of individual ancestral information. The approach is also applicable to admixed populations. Here we propose a novel approach based on our recent result on summarizing population structure by graph Laplacian eigenfunctions, which differs from PCA in that it is geometric and robust to outliers. Our approach also takes advantage of the priori sparseness of informative markers in the genome. Through simulation of a ring population and the real global population sample HGDP of 650K SNPs genotyped in 940 unrelated individuals, we validate the proposed algorithm at selecting most informative markers, a small fraction of which can recover the similar underlying population structure efficiently. Employing a standard Support Vector Machine (SVM) to predict individuals' continental memberships on HGDP dataset of seven continents, we demonstrate that the selected SNPs by our method are more informative but less redundant than those selected by PCA. Our algorithm is a promising tool in genome-wide association studies and population genetics, facilitating the selection of structure informative markers, efficient detection of population substructure and ancestral inference.


Introduction
Understanding genetic structure of human population is of fundamental interest in many applications. In population genetics, it has been widely used for inference of population evolutionary histories. In medical genetics, spurious associations can arise in the presence of population substructure. Detection and correction of population structure is a necessary step in genome-wide association studies. With the availability of highthroughput genotyping data in genome-wide disease studies, there has been increased interest in population structure. Correctly quantifying and understanding the genetic variation of human population is a challenging task. PCA has been used as a dominant method to identify population structure in the literature [1][2][3]. As a classical statistical tool to achieve dimension reduction, principal components (PCs) are linear combinations of the underlying variables and usually several top PCs can explain a large amount of variation in the whole dataset. For population based case-control association studies, the confounding effect due to population stratification can be effectively counted for by including the top PCs as covariates in a regression setting [3][4][5].
Further identifying a small panel of structure-informative markers that can be used to unravel population structure is also desired, since it can achieve genotyping savings and provide insight to genetic regions that undergone the evolutionary forces. This topic has been extensively studied in the literature [6][7][8][9]. A MCMC based program STRUCTURE [6] has been widely used for assigning individuals to clusters of populations. However, the expensive computing cost becomes impractical for disease studies involving genome scale markers and thousands of individuals. The result is also sensitive to the prior assumption of the number of underlying subpopulations. Other existing approaches such as information theory based informativeness for assignment [7] I n , d and F st are allele frequencies based and require prior knowledge of individuals' ancestral memberships, which limits the application to admixed populations such as African Americans or individuals whose ancestral information is unknown. Recently Paschou et al. [10,11] used the square sums of top PCs' entries as the weights to rank the informativeness of markers, which outperformed the approach of informativeness for assignment using statistic I n on worldwide human populations. Similar PCA based approaches have also been widely used to select a small set of PCA-correlated SNPs to correct population stratification [12][13][14].
However, PCA also has its limitation. It is sensitive to outliers which is caused by the fact that it actually computes the projection that maximizes the preservation of pairwise squared distances. The squaring of distances tends to preserve larger distances at the expense of preservation of short distances. The top PCs emphasize global patterns of the data, while the substructure of the data tends to appear in the lower ranked PCs. In the presence of outliers, pairwise distances involving outliers are significantly larger than other pairwise distances, which makes PCA tend to preserve the outlying structure rather than the bulk of the data. Also, the inclusion of extra PCs for population structure usually leads to power loss in association testing [15].
Motivated from geometric learning, new approaches [16][17][18] based on spectral graph theory [19] have been recently proposed to summarize population structure. Different from PCA, the methods use the idea of shrinkage and they preserve the local dependence structure of the study subjects. The proposed algorithms are nonlinear and robust to outliers, where one regards each subject as a vertice of a weighted graph [19] and makes edges only to its close neighbors, instead of all subjects in the study (see Materials and Methods). This reflects the fact that distances between vertices that are far apart are usually meaningless than closely correlated ones. The weight associated to edges for each pair of subjects measures their degree of being related. This adjacency graph approximates the underlying dependence structure of the sample population and the eigenvectors of the associated graph Laplacian contain useful geometric structure information (for details see the references above). The corresponding Laplacian eigenmap formed by embedding subjects to a lower dimensional Euclidean space via the top few eigenfunctions has locality preserving property. That is, distance between a pair of subjects in the embedded space reflect theirs degree of being correlated. The more they are correlated, the closer they are mapped to. Therefore Laplacian eigenmap clusters subjects who either come from the same discrete subpopulation or share more common ancestry from an admixed population and is ideal from revealing population structure.
Because of the limitation of PCA mentioned above, those PCA based approaches can be potentially problematic in the presence of outliers. In this paper, we use the global HGDP diversity panel to demonstrate that the markers selected based on Laplacian eigenfunctions in a regression setting (see Materials and Methods) are more informative but less redundant than the ones based on PCA approach (see examples below). Additionally, those most informative markers are typically sparse in the whole genome since they usually take only a very small percentage (less than 1%) of the total number of markers. Neither of the existing approaches in the literature has used this sparsity priori. Furthermore, we show that suitably incorporating the sparsity can significantly improve the overall performance on the HGDP panel. Therefore, we propose a sparse version of graph Laplacian eigenfunctions to select structure most informative markers which are also ancestry informative and can also be efficiently used to visualize the underlying population structure and correct population stratification in association studies. To compare the informativeness of selected SNPs with the PCA approach, we split the HGDP dataset equally into a training set and a testing set, and use the standard Support Vector Machine (SVM) [20][21][22] to predict the continental memberships of the samples (see Results for details). On the worldwide population HGDP panel, the proposed sparse Laplacian approach not only outperforms the PCA approach on the population membership prediction, the set of selected markers is strikingly less redundant than that by PCA. Therefore it is valuable for studies involving genome-wide biomarkers of thousands of individuals.

Simulation study of a ring population
We first applied PCA to the covariance matrix of this simulated sample. From Figure 1, one observes that the PC1 and PC2 distinguish the ring species from the two outlier subpopulations well, while the ring structure of the species, together with the two outliers, is detected by lower ranked PC3 and PC4. From Figure 2, one sees that the top two Laplacian eigenfunctions, LAP1 and LAP2, describe the ring structure and the two outlier subpopulations very well. Further comprehensive comparison of PCA and Laplacian eigenfunctions is available in the literature [16,18,23]. Next we used the marker selection procedure described below (see Materials and Methods) and selected top 300 informative markers out of total 10,000 markers. With these selected markers, the sparse Laplacian eigenfunctions, SLAP1 and SLAP2, recover a similar population structure without much information loss. Their correlation coefficients with the LAP1 and LAP2 are respectively 0.9912 and 0.9910. To measure the similarity of the two Figures 2(a) and 2(b), the Mantel test [24] based on pairwise distance is carried out with a highly significant Z-statistic value 2037.11.

Global genomic variation of HGDP-CEPH dataset
After the preliminary data cleaning and normalization (see Materials and Methods for details) we first computed the standard top principal compents of the HGDP global sample. The biplot of PC1 and PC2 distinguishes the seven continents very well, except that there is some overlap of individuals from East Asia and America (see Figure 3 (a)). Next we computed the top Laplacian eigenfunctions with varying parameter e. For large values of e, the biplot of LAP1 and LAP2 gives very similar global patterns observed in biplot of PC1 and PC2. Tuning the e slightly, we can observe some fine local strucuture such as the structures of East Asia and America and their clear classification in Figure 3 (b). Finally we applied the proposed algorithm to identify the most structure informative markers for the top K Laplacian eigenfunctions. Here K~2 in the computation. The loading vectors b i 's are very sparse and have more than 99% of the entries are vanishing. We computed the top two sparse Laplacia eigenfunctions, SLAP1 and SLAP2, using the selected top 1,400 SNPs. Their correlation coefficients with the LAP1 and LAP2 using all data are respectively 0.5275 and 0.5221. The Z-statistic of the Mantel test for the two Figures 3(b) and 3(c) is 1867.04. The biplot of SLAP1 and SLAP2 preserves the essential geographic patterns as observed in biplot of LAP1 and LAP2, see Figure 3 (c). Even more, the clusters of C.S.Asia, E.Asia and America are slightly better separated.

Intra-continent population structure
We also explored the intra-continental structure in the HDGP-CEPH data using Laplacian eigenfunctions. Here we demonstrate it on the Central and South Asian population group consisting of total 207 individuals. The biplot of LAP1 and LAP2 gives almost identical gobal pattern as given by biplot of PC1 and PC2, see Figure 4. The biplot of PC3 and PC4 mainly identifies several outliers faraway from the clustering of the rest individuals. While with a suitably small e, the biplot of LAP3 and LAP4 clearly distinguishes the Burusho subpopulation out. Next, we applied our algorithm to select the most informative SNPs for these top four Laplacian eigenfunctions. With the top 747 SNPs, we recovered the main population structure as the structure above obtained using all available SNPs. Their correlation coefficients with the top four Laplacian eigenfunctions using all data are respectively

Informative SNPs predicts continent membership via Support Vector Machine
To further validate the selected SNPs as signatures of population structure, we study the performance of predicting the continental memberships of the samples using the panel of most informative SNPs. We randomly split the total 940 individuals equally into a training set and a testing set. For individuals in the training set, the class labels are simply assigned to be f1,2,3,4,5,6,7g to stand for their corresponding seven continental memberships of Africa, Middle East, Europe, Central and South Asia, East Asia, Oceania and America. We use a standard SVM [20] to achieve our multi-class classification task with the top most informative SNPs. SVM is a supervised learning method which constructs a hyperplane or set of hyperplanes in a highdimensional space typically for classification and regression tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training datapoints of any class, since in general the larger the margin the lower the generalization error of the classifier. In all the experents carried out, the radial basis function is used as the default kernel function. The experiment is repeated 10 times and the average percentage of correct continental membership prediction is shown in Fig 5. Here we also selected the informative markers using the PCA based approach [10,11]. We reminder the readers that this approach transforms the genotype matrix g differently from the standard normalization by setting the heterogenious genotypes to 0 and the homogenious wild/mild genotypes to +1/21. Here we denote the updated data matrix as A. For optimal performance, we next estimated that the top 18 principal components of AA t are significant, all of whose entries are then summed to select the ancestral informative markers. However, the initial identified set of informative markers by PCA is very redundant, see the summarized distribution of the linkage disequilibrium (LD) measure r 2 in Table 1. Finally we use the designed QR algorithm [11] to select the first 100 less correlated markers among the initial top 500 most informative markers identified by PCA.
From the results in Fig 5, we can see first that the PCA approach is effective as compared with the poor result predicted by random SNPs. Next, with only the top two eigenfunctions the Laplacian approach (LAP) without sparsity consideration, which is equivalent to setting the penalty parameters l's to zero in the general framework (see Materials and Methods), is comparable with the PCA approach using all 18 significant PCs and redundancy removal procedure on prediction performance. Finally as expected, the sparse Laplacian approach (SLAP) improves the performance uniformly and works the best.
The error percentage of assigning individuals to their populations of the three approaches is also provided in Fig 6. There, for example, one can observe that for Americans Laplacian approach has reduced prediction error than PCA and sparse Laplacian has even no prediction errors. The top 500 informative SNPs identified by the proposed sparse Laplacian eigenfunction approach and PCA are both shown in Fig 7. Interestingly, the SNPs of Africa are dominantly green(wild alleles), and the PCA identified SNPs are dominantly red(mild alleles) for the three continents of East Asia, Oceania and America. However, this homogeneity of the alleles for the three continents makes it difficult to distinguish the continent memberships among them. For example, in the PCA experiments of Fig 5 quite a few individuals from American were mistakenly predicted as from East Asia. This is partly due to the clustering of America and East Asia in the biplot of the top principal components, which was observed earlier. While that is a relatively easy task using the SNPs identified by sparse Laplacian approach.
The distribution of these top 500 informative SNPs in the genome is also provided in Fig 8. These markers are relatively uniformly distributed in the genome. Nearby markers are usually redundent in terms of ancestral informativeness because of linkage disequlibrium (LD). The LDs among them are generally small. This pattern suggests that the driving forces that differentiate geographic population structure such as selection, climates, historical events, migration and drift may adapt the whole genome simultaneously rather than a specific region at a time. The top 20 most informative SNPs are provided in Table 2 for interested readers, and the complete set of markers are available upon request.

Discussion
The idea of incorporating regularized regression is that majority of the top Laplacian eigenfunction entries are very close to zero and represent random noise rather than true signals of population structure differentiation. The corresponding biological motivation is that some genomic regions undergone evolutionary processes such as selection or historic events more significantly than majority of the genome, though accumulated evidence [25,26] shows that most of the regions can tell the population diversity. Therefore, suitably forcing small entries of eigenfunctions to be zero with l 1 norm can presumably reduce the random effect and improve the entry precision of informative markers.
However, we emphasize that structure informative markers are usually many and the proposed algorithm selects only the most informative ones. The number of selected informative markers with nonvanishing scores increases as the penalty tuning parameter l 1 decreases. The rankings of the top most informative markers are quite stable as the tuning parameter varies, which may suffice for most applications. However, other unselected random markers can also detect the underlying population structure except that it generally requires a lot more random markers than those top informative ones. For the selection of tuning parameter, generally there is no universal optimal parameters. For the parameters E and t of the undiscovered structures, we usually default t to be 1 and set a large value of E if we are interested in global pattern of the dataset, while setting small values of E will give more details of the local pattern. Also, the set of informative markers selected by sparse Laplacian approach is less redundant than usual Laplacian regression approach is partly due to the property of LASSO [27] that it tends to select a representative rather than a few from a group of correlated variables, which corresponds to the LD of markers in our setting. While the disadvantage of the LASSO type sparse regression is that it could  Generally inclusion of more significant Laplacian eigenfunctions or principal components describing the population structure in our regression setting will improve the overall performance, as the additional eigenfunctions can help locating the specific markers that distinguish the under described subpopulation more efficiently. Here we simply demonstrate that the panel of SNPs selected by our algorithm with K = 2 gives an effective set of informative markers to distinguish seven continents, which is not necessarily optimal. Earlier Lao et al. [28] developed a method based on the informativeness of assignment index I n to find markers that differentiate populations and identified 10 SNPs from Y Chromosome Consortium [YCC] panel to successfully differentiate four geographic regions: western Eurasia, East Asia, Africa and America. Their result shows also that there is considerable lack of power when applying the ascertained SNPs to another independent set of population samples. Here we also provide the informativeness of assignment of the top 20 informative markers for interested readers. Needs to mention, addition to the simple application of the I n approach on the training dataset. One can also employ suitable clustering algorithms such as STRUCTURE [29] and FRAPPE [30] etc.
on the data to infer clusterings of individuals which rather than the predefined individual's membership can then be used to compute I n .
The incorporated standard SVM with multiple classes feature is not necessarily the optimal approach for the task of multiple continental membership prediction. Even the choice of different kernel functions used can produce slightly different results. Here it is just employed to compare the informativeness between the panel of SNPs selected by PCA and ours. It is possible that other classification techniques such as K-means and variations of SVM etc. may improve the performance. Further investigation in this direction is encouraged. However, we point out that the performance generally depends not only on the number of classes to be predicted but also the variance of each class. The larger the variance is, the more difficult the task is. For the continental membership prediction problem we consider above, the variation within each continent is large since each continent contains quite a few subpopulations with a total 52 worldwide subpopulations. Therefore, it is a challenging task. In the case of population membership prediction for the same number of subpopulations instead of continents or other large geographic regions, the difficulty level drops as the variation of each subpopulation generally is much smaller than that of a continent.  In the current study we exclude the reported related and ambiguous samples [31]. Generally speaking, inclusion of atypical or related samples changes the population structure of the samples. Specifically, atypical samples spread away from major population clusters and related samples cluster toward respective subpopulations. The structure identified by the Laplacian approach is less sensitive to outliers by considering only the close neighbors of each individual, compared with PCA. One expects the Laplacian approaches are relatively robust to a set of samples with a small number of related or ambiguous individuals. However, a careful identification of any potential ambiguous or related samples from the genotype data is strongly recommended, as a few softwares such as PREST [32][33][34] are available to achieve such tasks.
In summary, we have developed an algorithm to select population structure informative markers which are also ancestry informative and can be used to recover the original population structure with usually more than 99% genotyping savings. Compared with the PCA approach, the algorithm is not only robust to outliers but also the selected informative markers are less redundant. It is a promising basic tool for the tasks of identifying informative markers and visualization of genetic variation in population genetics and rapidly ongoing genome-wide association studies.

Data
We use the public global population sample HGDP-CEPH dataset consisting of 1043 individuals from 52 populations of seven geographic continents. All individuals were genotyped using 650K SNP array with total 660,918 SNP markers. We did quality control of the SNPs with the following criteria: minor allele frequency larger than 0.01 and missing rate less than 0.10. After the quality control, 647,483 SNPs are retained. The earlierly reported relatives and ambiguous samples were also excluded in the analysis [31]. The final dataset contains 940 unrelated individuals. The missing genotype data were simply replaced with the average of the nonmissing genotype.

Basic Notations
Assume there are total N affected and unaffected individuals in the sample. Let Y j denote the disease status of individual j, i.e., Y j~1 if j is affected, and Y j~0 if j is unaffected. Let g ij denote the matrix of genotype (0, 0.5, 1) of individual j at SNP i, where i~1, . . . ,M. Each SNP i is then normalized by subtracting off the row mean m~1 N X j g ij , and then divide each entry by ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi where p i is a posterior estimate of the allele frequency at SNP i given by p i~1 , all missing entries are excluded from the computation. Let's useĝ g ij denote the normalized genotype matrix of size N|M, then C jk1 M X iĝ g ijĝ g ik denotes the standard sample correlation coefficient between individuals j and k.

Laplacian of weight matrix
Next we summarize the main ingredients of the recent work [17,18] on describing population structure using Laplacian eigenfunctions. For each pair of individuals j and k we assign a distance d jk and weight W jk . Here we set d jk~1 {C jk . The weight i.e., C jk w1{E, and W jk~0 otherwise, where both E and t are some preselected positive real numbers. The t stands for global diffusion scale and in all the computation within the paper we set t~1:0. The E measures the size of each subject's neighborhood in terms of the metric d jk . The motivation of the proposed weight is that one counts only pairs who are genetically close. The selected Gaussian weight is optimal in certain sense, and it has deep connection with heat kernel on a manifold which gives the general solution to heat equation.
Let D be a diagonal matrix of size N|N with row sums of W as entries D jj~P k W jk . The Laplacian matrix of the weight W is defined to be L~D{W . Note that L is a symmetric and positive semi-definite matrix. We restrict to the normalized version L~D { 1 2LD { 1 2 which is also symmetric. We remark that an alternative normalized version of L is given by D {1 L, which is not symmetric and can be regarded as a Markov matrix on the graph since each row sum equals one. These two normalizations of Laplacian share the same spectrum [35].

Laplacian eigenfunctions with sparse loadings
Let f be a function on the graph with value f (i) on the i th vertex. Then the inner product can be written as vf , Lf wṽ 2f is the normalized version. The eigenfunctions of L, denoted as e 0 ,e 1 , . . . ,e N{1 in the increasing order of eigenvalues, are the functions that minimize the weighted variation. That is, Le i~Li e i , where i~0, . . . ,N{1 and L i is the i th associated eigenvalue. Note e 0~( 1, . . . ,1) t is a trivial solution with equal value on every vertex. The top eigenfunctions of L has been recently used to describe population structures [16][17][18].
Next let b~(b 1 , . . . ,b K ) be a matrix of size M|K, where each column b i~( b i1 , . . . ,b iM ) t [R M is a unit vector and K is the number of significant top Laplacian eigenfunctions that one uses to represent the meaningful population structure. We consider the optimization problem below Here l 1 and l 2 are two nonnegative real numbers which serve as the tuning parameters of the regularized terms l 1 and l 2 norms of b i . The i th entry b ai of the loading b a measures the projected signal of the i th marker on the a th Laplacian eigenvector. It is a general belief that the SNPs that are most informative about the population structure are only a few. That is, the loadings of the eigenvectors are sparse. The l 1 norm term serves as penalty for being nonzero and forces majority of the SNPs with small effect or just random noise to have zero loadings for the corresponding eigenvectors. Linear regression with l 1 constraint was first introduced as LASSO to the statistical community by Tibshirani [36]. Later l 2 term was also included in order to have the grouping property for variables sharing group effect, for details see Zou et al [37]. Nowadays sparse regression has been applied in many fields such as compressed sensing and gene expression profiles [38][39][40] and various combinations of penalization terms have been proposed in the literature. In the computation we simply set l 2~1 0 {6 and l 1~1 :0. However, one can choose different values. For the i th marker, we define a rank statistic t i~P K k~1 r k b 2 ki , where r k 's are weights for each eigenvector. Ideally r k measures the percentage of variance of the data explained by the k-th eigenvector. A simple alternative statistic is just t i~P K k~1 b 2 ki with uniform weights. The markers are ranked in the decreasing order of t i 's. The more informative a marker is, the higher it ranks. Majority of the markers have their rank statistic value equal to zero and this reflects the fact that their contribution to the underlying structure is relatively weak.

Whole Genome Scan
For the computational and memory limitation due to large number of SNPs in whole genome studies, for example in scale of a million SNPs, we propose an alternative stepwise iterative genome scan as follows. In first step, one partitions all the available SNPs randomly into multiple groups whose sizes are around a previously set small number, say, 10,000. To reduce the effect caused by the linkage disequilibrium (LD) between closeby SNPs, one tries to partition SNPs that are in strong LD into distinct groups.
Step two, one applies the proposed selection algorithm to each group and selects a proportion of the top SNPs. Then one merges the selected SNPs into a group and apply the above procedures again.

Simulation Study
A ring species. Following reference [2], an equilibrium population is simulated using the softare MS for population genetics developed by Hudson [41]. The population consists of 100 subpopulations which are equal-spacely arranged on a circle and two isolated subpopulations as outliers. Each subpopulation is assumed to consist of equal number of diploids. During each generation backward in time, a fraction m~0:1 of each subpopulation along the circle is made up of migrants from each adjacent subpopulation and there is no gamete swaps between non-adjacent subpopulations. 10,000 SNP loci were independently simulated with one segregation site per locus and ten individuals were sampled from each subpopulation with total 1020 samples.