MB, CP and MG are employees of Pharnext, Paris. There are no patents, products in development or marketed products to declare. This does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials.
Conceived and designed the experiments: MB MG CA. Performed the experiments: MB CP. Analyzed the data: MB CP. Contributed reagents/materials/analysis tools: MB CP. Wrote the paper: MB. Conceived and designed the method: MB MG CA.
Inferring the structure of populations has many applications for genetic research. In addition to providing information for evolutionary studies, it can be used to account for the bias induced by population stratification in association studies. To this end, many algorithms have been proposed to cluster individuals into genetically homogeneous subpopulations. Parametric algorithms such as Structure are very popular, but their underlying complexity and high computational cost led to the development of faster parametric alternatives such as Admixture. Nonparametric approaches offer an alternative to these methods. Within this category, AWclust has proven efficient but fails to properly identify population structure in complex datasets. We present in this article a new clustering algorithm called Spectral Hierarchical clustering for the Inference of Population Structure (SHIPS), based on a divisive hierarchical clustering strategy that allows a progressive investigation of population structure. This method takes genetic data as input, clusters individuals into homogeneous subpopulations and uses the gap statistic to estimate the optimal number of such subpopulations. SHIPS was applied to a set of simulated discrete and admixed datasets and to real SNP datasets from the HapMap and PanAsian SNP consortia. The programs Structure, Admixture, AWclust and PCAclust were also investigated in a comparison study. SHIPS and the parametric approach Structure were the most accurate when applied to simulated datasets, both in terms of individual assignments and estimation of the correct number of clusters. The analysis of the results on the real datasets showed that the clusterings of SHIPS were the most consistent with the population labels or with those produced by the Admixture program.
The performance of SHIPS when applied to SNP data, along with its relatively low computational cost and its ease of use, makes this method a promising solution to infer fine-scale genetic patterns.
Population structure refers to the genetic heterogeneity that exists between individuals of a population. This heterogeneity is a natural phenomenon resulting from biological and evolutionary processes such as natural selection, genetic drift, population migrations or mating processes.
Identifying the underlying structure of populations is often of use for genetic research. It allows the study of evolutionary relationships between populations as well as learning about their demographic histories.
Such analyses are also of great interest for population-based genetic studies such as Genome-Wide Association Studies (GWASs). Notwithstanding the widespread usage of GWASs, their findings have been criticized partly because they are vulnerable to population stratification. This corresponds to the bias induced in situations where the studied populations are genetically heterogeneous and the sampling of cases and controls is imbalanced between the various ancestries. Population stratification is known to lead to finding spurious associations or to missing genuine ones.
Two major strategies have been developed to infer the structure of populations: parametric model-based clustering and nonparametric clustering. Model-based clustering approaches make numerous assumptions on the genetic data and use statistical inference methods to assign individuals to subpopulations. Many of these parametric approaches exist such as for instance Structure
We propose in this paper a novel nonparametric distance-based clustering approach based on a divisive hierarchical clustering method. Our method is based on the idea that it might not be possible to uncover all of the structure in the data when applying a clustering algorithm just once. Fine population structures may not be detected as the corresponding subpopulations are hidden within the major subpopulations detected by the first run of the algorithm.
We therefore implemented a robust statistical framework to iteratively apply a clustering algorithm to the data and so analyze in depth the genetic patterns of the studied populations. This corresponds to a divisive hierarchical clustering strategy. Based on a pairwise distance matrix, the algorithm progressively divides the original population into two subpopulations using a spectral clustering algorithm. The process is then iterated in each of the two subpopulations and so on. This leads to the construction of a binary tree, where each node represents a group of individuals. To determine the final clusters, a tree pruning procedure and an estimation of the optimal number of clusters are applied. In such an approach, both the final clustering of the individuals and the number of clusters are estimated by the method. We call our method ‘Spectral Hierarchical clustering for the Inference of Population Structure’ (SHIPS).
We present in this article the SHIPS algorithm along with several applications to SNP datasets. We consider five scenarios of simulated population structures. The software Genome
We present in this part the strategy of the SHIPS algorithm along with details of each step of the program. We also provide details about the methodologies of the other algorithms compared to SHIPS and the process used to assess all the methods. The simulated and real datasets analyzed are then described.
SHIPS can be described in several steps that are graphically represented in
Computation of a distance matrix that is a similarity matrix
Creation of a binary tree. Each population is divided in two subpopulations and so on (
Pruning of the tree to keep only the relevant branches corresponding to the relevant divisions (
Estimation of the optimal number of clusters
After the initial binary tree is built, the pruning procedure leads at the end of each step to a possible clustering of the individuals. In this example the data is clustered in four, then three, then two clusters (gray nodes) at step
SHIPS is based on a spectral clustering algorithm. A similarity matrix is therefore necessary to apply this clustering method. We decided to consider a similarity matrix based on the allele sharing distance (ASD) that has been previously used to identify genetic patterns among populations
The total similarity between samples
One has to note that any pairwise similarity matrix could be used in the algorithm instead of the one presented here. Examples of such matrices, based for instance on haplotypes instead of genotypes, are presented in
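The allele sharing idea described above can be illustrated with a minimal sketch. It assumes that each genotype is coded as the number of copies of a reference allele (0, 1 or 2) and scores each pair of samples by the average number of alleles shared per SNP. The function name `asd_similarity`, the encoding and the scaling are hypothetical choices made for illustration and may differ from the exact definition used in SHIPS.

```python
def asd_similarity(genotypes):
    """Pairwise allele-sharing similarity for genotypes coded 0/1/2.

    genotypes: list of per-sample lists, one entry per SNP (None = missing).
    Returns an n x n matrix with entries in [0, 2]; a value of 2 means
    identical genotypes at every jointly observed SNP.
    """
    n = len(genotypes)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            shared, used = 0, 0
            for a, b in zip(genotypes[i], genotypes[j]):
                if a is None or b is None:
                    continue  # skip SNPs missing in either sample
                shared += 2 - abs(a - b)  # alleles shared at this SNP
                used += 1
            value = shared / used if used else 0.0
            sim[i][j] = sim[j][i] = value
    return sim


# Toy data: three samples genotyped at four SNPs
samples = [
    [0, 1, 2, 2],
    [0, 1, 2, 2],
    [2, 1, 0, 0],
]
S = asd_similarity(samples)
```

Any other symmetric pairwise similarity could be substituted here without changing the rest of the procedure.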
The binary tree produced by SHIPS is obtained by successively dividing each population in two subpopulations using a spectral clustering algorithm. Spectral clustering methods cluster points using eigenvectors of matrices derived from the initial data. We decided to use the version of this method proposed by Ng et al.
First, the similarity matrix
In a third step, a clustering algorithm is applied to the new vectors
If the population that we wish to split into two subpopulations is deemed homogeneous by the algorithm, the Gaussian mixture model (GMM) clustering creates two clusters, one with all the samples and an empty one. This is a termination criterion that defines the end of a branch of the tree, called a terminal node. In extreme cases, the terminal nodes are all composed of a single sample of the original population, which ensures the convergence of the tree building step of the algorithm.
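A single two-way split in the spirit of the spectral clustering of Ng et al. can be sketched as follows. This is an illustration under stated assumptions, not the SHIPS implementation: the function name `spectral_bipartition` is hypothetical, the similarity matrix is assumed symmetric with positive row sums, and a plain 2-means step stands in for the GMM clustering. As a consequence, the sketch always returns two non-trivial groups and does not reproduce the empty-cluster termination criterion.

```python
import numpy as np


def spectral_bipartition(S, n_iter=50):
    """Split samples in two from a symmetric, non-negative similarity matrix."""
    S = np.asarray(S, dtype=float)
    # Normalized affinity L = D^{-1/2} S D^{-1/2}
    d = S.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(d)
    L = S * inv_sqrt[:, None] * inv_sqrt[None, :]
    # Two leading eigenvectors (eigh returns eigenvalues in ascending order)
    _, vecs = np.linalg.eigh(L)
    X = vecs[:, -2:]
    # Rescale each row to unit length
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Assumption: plain 2-means replaces the Gaussian mixture model
    first = X[0]
    far = X[np.argmin(X @ first)]  # row least aligned with the first row
    centers = np.vstack([first, far])
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels


# Toy similarity matrix with two groups of three samples
S = [
    [1.0, 0.9, 0.9, 0.1, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 0.9, 1.0],
]
labels = spectral_bipartition(S)
```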
The divisive strategy of SHIPS consists in dividing the original population into two subpopulations with the spectral clustering algorithm previously described and iterating this procedure within each subpopulation. This process leads to the computation of a binary tree (
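The recursive construction of the binary tree can be sketched independently of the splitting rule. In the sketch below, `build_tree`, `leaves` and `toy_split` are hypothetical names; the toy rule cuts one-dimensional values at their widest gap and merely stands in for the spectral split. A split that returns an empty side marks a terminal node, mirroring the termination criterion described above.

```python
def build_tree(indices, split):
    """Recursively split a list of sample indices into a binary tree.

    split(indices) must return two lists; an empty list on either side
    signals a homogeneous group, which becomes a terminal node.
    """
    node = {"members": list(indices), "children": []}
    if len(indices) < 2:
        return node
    left, right = split(indices)
    if not left or not right:
        return node  # termination criterion: one empty cluster
    node["children"] = [build_tree(left, split), build_tree(right, split)]
    return node


def leaves(node):
    """Collect the member lists of all terminal nodes, left to right."""
    if not node["children"]:
        return [node["members"]]
    return [m for child in node["children"] for m in leaves(child)]


# Toy split rule (hypothetical): cut at the widest gap between sorted values
values = {0: 0.0, 1: 0.1, 2: 5.0, 3: 5.1, 4: 9.0}


def toy_split(indices, min_gap=1.0):
    order = sorted(indices, key=values.get)
    gaps = [values[b] - values[a] for a, b in zip(order, order[1:])]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    if gaps[widest] < min_gap:
        return order, []  # deemed homogeneous
    return order[: widest + 1], order[widest + 1:]


tree = build_tree(list(values), toy_split)
terminal_groups = leaves(tree)
```

Because every node stores its members as indices, a similarity matrix computed once for the whole dataset can be subset at each split rather than recomputed.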
The tree pruning strategy that we use is reduced error pruning. A quality indicator is defined and calculated for each node of the tree. This indicator is based on the sum of the squared similarities of a node and of its leaves. We define the function calculating the sum of squared similarities within a node
Considering a tree
At each step, the node with the lowest quality value,
A range of possible numbers of clusters,
Several possible estimations of the optimal number of clusters
Let
In the classical version of the gap statistic, the logarithm of
To obtain null reference datasets, we simulate datasets with the same numbers of variables and individuals as the original dataset. Each variable was taken uniformly within
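A gap value computed directly on the within-cluster sum of squares, without the logarithm, can be sketched as follows. Several points are assumptions made for illustration only: null variables are drawn uniformly over the observed range of each variable, the clustering routine is passed in as a function, and the names `within_ss`, `gap_no_log` and `cut_at_gaps` are hypothetical. The rule used to select the optimal number of clusters from these values is not reproduced here.

```python
import random


def within_ss(data, labels):
    """Sum over clusters of squared distances to the cluster centroid."""
    total = 0.0
    for lab in set(labels):
        pts = [p for p, l in zip(data, labels) if l == lab]
        centroid = [sum(col) / len(pts) for col in zip(*pts)]
        total += sum((x - c) ** 2 for p in pts for x, c in zip(p, centroid))
    return total


def gap_no_log(data, k, cluster, n_null=20, seed=0):
    """Mean within-cluster SS over null datasets minus the observed one."""
    rng = random.Random(seed)
    observed = within_ss(data, cluster(data, k))
    # Assumption: null variables are uniform over each observed range
    lows = [min(col) for col in zip(*data)]
    highs = [max(col) for col in zip(*data)]
    null_values = []
    for _ in range(n_null):
        null = [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
                for _ in data]
        null_values.append(within_ss(null, cluster(null, k)))
    return sum(null_values) / n_null - observed


def cut_at_gaps(data, k):
    """Toy clustering: cut the sorted first coordinate at its k-1 widest gaps."""
    order = sorted(range(len(data)), key=lambda i: data[i][0])
    widest = sorted(
        range(len(order) - 1),
        key=lambda g: data[order[g + 1]][0] - data[order[g]][0],
        reverse=True,
    )[: k - 1]
    cuts = set(widest)
    labels, current = [0] * len(data), 0
    for pos, i in enumerate(order):
        labels[i] = current
        if pos in cuts:
            current += 1
    return labels


data = [[0.0], [0.2], [0.1], [10.0], [10.1], [9.9]]
gap_by_k = {k: gap_no_log(data, k, cut_at_gaps) for k in (1, 2, 3)}
```

The default of 20 null datasets matches the default setting reported for SHIPS in the comparison study.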
SHIPS has the advantage of producing in one run of the algorithm nested clusterings of the samples for
The SHIPS algorithm was implemented in R (
This algorithm takes as input parameters a SNP matrix of dimension
A comparison study was conducted to assess the potential of SHIPS. Both simulated and real genotype datasets were considered, and a panel of other methods was applied to the same data to compare performances.
We compared SHIPS to some of the most commonly used clustering algorithms in the genetic field. We first considered the parametric approaches Structure and Admixture. Also we included a nonparametric approach, namely AWclust, and finally we added the alternative clustering strategy PCAclust to the comparison. We briefly describe the methods and the parameters used in this part and a detailed methodology of each of these algorithms is provided in
SHIPS was used with the default parameters, i.e. 20 null datasets simulated for the gap statistic. A reasonable maximum number of clusters was considered for all the methods, for instance, when analyzing a dataset with 10 (known) subpopulations we investigated up to 20 possible subpopulations.
Structure is a parametric algorithm that uses Bayesian statistical inference to cluster individuals. The version 2.3.2.1 was downloaded from
Admixture is also a parametric method that, similarly to Structure, models the ancestry proportions. It is based on the same statistical model but the optimization of the likelihood is enhanced. The program was downloaded from
AWclust uses a hierarchical clustering. The version 2.0 was downloaded from
PCAclust consists in computing a principal component analysis of the genotype data and then applying a clustering algorithm, namely a Gaussian mixture model clustering, to the principal components such as described in
We assessed SHIPS and the other methods on several datasets. We considered simulated datasets where the structures of the populations were controlled, a simulated admixed dataset and real datasets to determine the performances of the different approaches in real situations. For all of these scenarios small datasets of thousands of markers and large datasets of hundreds of thousands of markers were considered. We used several replicates for the small data in order to account for the simulation process or the marker sampling. Only one replicate was used for the large scenarios due to the computational cost of certain algorithms.
We simulated datasets using the software Genome based on the coalescent approach. We considered a first model M1 with no structure of the population in order to determine which methods are capable of uncovering that the data is not structured. We then considered 4 structured models, M3, M5, M10 and M20 with respectively 3, 5, 10 and 20 subpopulations and increasing complexities of population histories.
A) one population B) three subpopulations C) five subpopulations D) ten subpopulations E) twenty subpopulations.
In order to assess the performances of the various algorithms on more realistic situations we simulated a discrete admixed dataset corresponding to the model named Madx. Two real populations from the HapMap phase 3 data, namely the Han Chinese from China (CHB) and the Utah residents with Northern and Western European ancestry from the CEPH collection (CEU), were used in an evolutionary model to produce an admixed population. The evolutionary model consists in randomly mating samples from each of the two original populations and iterating this process over time. The final dataset is composed of the two original populations (CEU and CHB) and the admixed simulated one (named XY). The details of the sampling are provided in
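The random mating idea behind the admixed population can be illustrated with a deliberately simplified sketch. It rests on assumptions that are not taken from the simulation protocol: genotypes are coded 0/1/2, SNPs are transmitted independently (no linkage), the first generation pairs one parent from each source population, and later generations mate at random within the admixed pool. The names `transmit`, `mate` and `simulate_admixed` are hypothetical.

```python
import random


def transmit(genotype, rng):
    """Draw the allele (0 or 1) passed on by a parent with genotype 0, 1 or 2."""
    return 1 if rng.random() < genotype / 2 else 0


def mate(parent_a, parent_b, rng):
    """Offspring genotype: one allele from each parent, SNPs independent."""
    return [transmit(a, rng) + transmit(b, rng)
            for a, b in zip(parent_a, parent_b)]


def simulate_admixed(pop_a, pop_b, size, generations, seed=0):
    rng = random.Random(seed)
    # Generation 1: each offspring has one parent from each source population
    current = [mate(rng.choice(pop_a), rng.choice(pop_b), rng)
               for _ in range(size)]
    # Later generations: random mating within the admixed pool (assumption)
    for _ in range(generations - 1):
        current = [mate(rng.choice(current), rng.choice(current), rng)
                   for _ in range(size)]
    return current


# Toy source populations fixed for opposite alleles at three SNPs
pop_a = [[0, 0, 0]] * 4
pop_b = [[2, 2, 2]] * 4
first_generation = simulate_admixed(pop_a, pop_b, size=5, generations=1)
later_generation = simulate_admixed(pop_a, pop_b, size=5, generations=3)
```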
We also focused on the potential of the methods when applied to real datasets. We first considered the HapMap phase 3 dataset with 9 populations and 1,087 individuals (
This scatterplot uses the first five principal components of a dataset with 20 K SNPs. This graph is only intended to present the general genetic pattern of the dataset and does not exhaustively represent the capability of the PCA to separate the populations.
The PASNPi consortium provides the genotype data of 75 PanAsian and HapMap populations with 1928 individuals and 54,794 SNPs. Among all these populations, certain main groups, defined by the countries of origin, can be highlighted. We focused on 10 subpopulations formed by 443 individuals, from each of these groups (
This scatterplot uses the first five principal components of a dataset with 20 K SNPs. This graph is only intended to present the general genetic pattern of the dataset and does not exhaustively represent the capability of the PCA to separate the populations.
To assess the potential of a clustering method it is important to focus on both the sample assignments and the estimated number of clusters. The quality indicator usually considered is the accuracy, that is, the proportion of individuals that were assigned to the correct populations. This indicator focuses only on the one-to-one relationship between estimated clusters and true populations. We decided not to retain this criterion as it does not exhaustively describe the quality of a clustering method's assignments and does not account correctly for the estimated number of clusters. The indicator we selected to account for both the assignments and the estimation of the number of clusters is the adjusted Rand index
This index focuses on all pairs of samples and considers whether they have correctly been assigned to the same population or correctly been assigned to different populations. That way, in addition to the accuracy criterion, the adjusted Rand index takes into account the fact that certain samples should not be clustered together. The adjusted Rand index is comprised between
For simulated datasets we compared, via the adjusted Rand index, the clusterings proposed by the different methods to the true population labels that are available through the simulation process. For the admixed and the real datasets, no true population labels exist. As a consequence, we provide two quality measures: the adjusted Rand index computed against the population labels provided with the datasets (e.g. CHB or CHD in HapMap) and the same index computed against the partitions produced by Admixture. We selected Admixture as it is one of the most widely used methods for the estimation of population structure. We also represent the admixture proportions of all the methods with barplots. For discrete clusterings these proportions are either 0 or 1.
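For readers who wish to reproduce such comparisons, the adjusted Rand index can be computed from the contingency table of two partitions. The sketch below uses the standard pair-counting form of the index; the function name `adjusted_rand_index` is chosen here for illustration, at least two samples are assumed, and the convention of returning 1 when both partitions are trivial is an assumption of this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same samples."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Both partitions are trivial (one cluster each, or all singletons)
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


same_up_to_names = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
no_structure_found = adjusted_rand_index([0, 0, 1, 1], [0, 0, 0, 0])
```

The second call illustrates why a method that reports a single cluster on structured data obtains an index of zero.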
Several small datasets and one large dataset were investigated for each simulated or real scenario. The average Rand indexes and the average estimated numbers of clusters are the indicators we are interested in.
Average Rand indexes over all small replicates are indicated for each method and each model along with the estimated number of clusters in parentheses. The darker a cell color is, the better the corresponding clustering is.
Rand indexes are indicated for each method and each model along with the estimated number of clusters in parentheses. The darker a cell color is, the better the corresponding clustering is. The software Structure was not applied to large datasets due to its excessive computational cost.
For the model M1, with only one population, SHIPS was always able to determine the correct number of one cluster for all the small and large datasets. This was also the case of Structure and PCAclust. As a consequence these three methods perfectly assigned all the individuals to the correct population and had a Rand index of
The performances of SHIPS, Structure and AWclust were comparable for the models M3 and M5. An average number of 3 and 5 clusters was respectively estimated for all small and large replicates of the models M3 and M5 (except for Structure that was not applied to large datasets). These three methods misclassified on average fewer than 3 individuals, leading to Rand indexes higher than 0.99. PCAclust was able to estimate the correct number of 3 subpopulations in 8 small replicates out of 10 small datasets of the model M3 and in 5 replicates for the model M5. When the number of SNPs increased to 200 K, PCAclust was able to correctly estimate
The model M10, with 10 populations, pertains to a more complex structure of the data. In this scenario SHIPS, Structure and AWclust succeeded in perfectly estimating
In this last simulated model, with the most complex structure and 20 populations, both SHIPS and Structure estimated the correct number of clusters for all replicates and produced individual assignments very consistent with the true populations. AWclust and PCAclust underestimated the number of clusters. AWclust only allows a maximum of 16 clusters to be estimated, and this maximum was reached for this complex dataset. One could wonder whether the clustering assignments would have been better had the maximum number of clusters been more flexible. PCAclust, for its part, was not able to detect the structure of this dataset. Only 4 clusters on average were identified in the small and large datasets as many populations were not separated, leading to a low Rand index close to 0.2. For both small and large datasets Admixture estimated 21 clusters and almost perfectly assigned all the individuals to the correct populations. Even though these clusterings are quite accurate, it is noticeable that 21 was the maximum number of clusters for which the algorithm converged. In other words, it is possible that if the convergence could have been reached for greater values of
SHIPS and Structure were the most accurate methods when applied to simulated datasets both in terms of estimating the correct number of clusters
In order to assess the quality of the clustering methods we were also interested in looking at admixed and real datasets, more representative of the ones encountered in genetic studies. We present the average results over the different small and large replicates, along with details on the assignments performed. In order to account for the fact that there is no “true” structure in real datasets, we considered both the population labels and the labels produced by the program Admixture as structures (also called partitions) of reference.
The first small dataset was used to produce this plot. Populations are separated by black lines and assigned a unique color that is approximately reported on the barplot of each method. For the discrete methods the admixture proportions are either 0 or 1.
This representation is an output produced by SHIPS. The tree structure corresponds to the successive divisions conducted by the algorithm. Each final cluster is represented by a scatterplot of its members. We colored here the individuals according to the population labels.
SHIPS identified 3 distinct populations for the admixed datasets: the two populations of origin (CEU and CHB) and the one simulated as an admixture. Structure, Admixture and AWclust detected two populations. The admixture proportions displayed in
In terms of quality indexes, when comparing to the population labels, SHIPS and PCAclust performed the best as they identified the 3 main discrete populations. When comparing the results to Admixture, Structure is the closest in such a setting and SHIPS and AWclust are in agreement at about 50% as they assigned the samples from the admixed population to another population being a cluster of admixed, CEU or CHB individuals.
The results are quite similar on the large admixed dataset, except for PCAclust, which did not find small subclusters within the CHB populations (
It is interesting to notice that the methods show two kinds of behavior when clustering the admixed individuals. Certain methods assigned them to the population of origin to which they are genetically closest, and others created a specific admixed cluster. These two behaviors are understandable given the nature of the admixture that we considered in this simulation. Indeed, we simulated a discrete admixture, meaning that the admixed samples, even though originating from the CHB and CEU populations, form a discrete cluster. The nature of this structure is therefore more challenging for discrete clustering algorithms such as SHIPS and AWclust but also quite favorable to discrete assignments compared to ‘real life’ admixtures that are usually continuous. The results produced by Structure and Admixture have to be interpreted in the sense that with a continuous admixture only the admixture proportions can properly describe the structure, as there would be no discrete cluster to be identified. Further analyses of these algorithms on continuous admixture would reveal more precisely the behaviors of the algorithms with such population structure and complete the partial results presented here.
Considering all 20 small replicates, SHIPS was able to identify 8 clusters on average (
Admixture estimated 7 ancestral populations in the small datasets. As we can observe on
On the large dataset, results are quite similar except that Admixture estimated 6 ancestral populations. The corresponding assignments were however more consistent with the population labels. The same observation can be made for SHIPS and as a consequence the quality indicator of our new method improved whether we compared it to the population labels or to Admixture.
We first describe the results for the small datasets. On average, over all the small PanAsian datasets, SHIPS estimated 8 clusters. In the majority of the replicates the population from India (IN.TB) was clustered with the Philippines (PI.AT) or Singapore (SG.ID) and the populations from China (CN.WA) and Indonesia (ID.JA) or Japan (JP.ML) were assigned to the same cluster. These clusterings of the data are quite consistent with the labels of the populations and as a consequence SHIPS has the highest Rand index of 0.81 with this reference partition. PCAclust estimated 9 clusters. The CN.WA population was split into several clusters and often assigned to the same clusters as samples from SG.ID and IN.TB or PI.AT and MY.JH. Several other populations were separated according to the population labels and therefore the quality index with this reference is 0.71. Structure identified 5 ancestral populations. The corresponding discrete clustering is however quite distant from the population labels. Indeed, only the MY.JH, TH.MA and part of the SG.ID populations are separated. As a consequence the Rand index compared to the population labels is quite low. Likewise, AWclust has a null Rand index as this method did not determine any structure in the data. Admixture found 6 ancestral populations. The populations IN.TB, JP.ML, KR.KR and TW.HA were assigned to the same cluster like CN.WA and ID.JA. This results in a Rand index of 0.45. When analyzing the admixture proportions (
On the large datasets, SHIPS and PCAclust estimated fewer clusters than on the small datasets. SHIPS estimated 5 clusters and PCAclust 7 clusters. These differences resulted in SHIPS identifying a structure very close to that estimated by Admixture (Rand index of 0.89) while PCAclust's clustering was less in agreement with Admixture (Rand index of 0.25). On the other hand, PCAclust was closer to the population labels partition than SHIPS. One has to note that when setting the number of clusters manually, SHIPS and PCAclust estimated the same structure as on the small datasets. These different behaviors of the methods are therefore due to the size of the dataset, which influenced the estimations of the number of clusters.
The analysis of the real datasets pointed out that, with the population labels as reference partitions, SHIPS was the most efficient method to uncover the population structures, followed by PCAclust. Even though SHIPS produces discrete clusterings, this novel algorithm reached the highest agreement with the clusterings estimated by widely used methods such as Admixture.
We have proposed in this paper a novel clustering approach to infer the genetic structure of populations from SNP data. SHIPS is based on a divisive hierarchical clustering procedure and a pruning strategy followed by the use of the gap statistic to estimate the final number of clusters
SHIPS has proven to be an accurate and precise method both for estimating relevant optimal numbers of clusters and for producing assignments consistent with the reference partitions of the data considered. In the simulated datasets,
The other algorithms considered performed less consistently, missing the structure of either the complex simulated data or the real datasets. These results may be explained by the way each algorithm estimates the number of clusters or by the parameters used for each algorithm. It is interesting to observe that even though Structure and Admixture are based on the same model their performances are notably different. On the simulated datasets, Structure was able to estimate the correct
AWclust generally uncovered the structure of the small and large simulated datasets but failed to properly analyze the real datasets. Whether we considered the population labels or the partitions produced by Admixture as reference for the real datasets, AWclust's clusterings were not in agreement with these references. Only the three main ethnicities were detected in the HapMap data and no structure in the PanAsian data because the optimal number of clusters was underestimated. It is however interesting to notice that when manually setting the number of clusters, the sample assignments were more consistent with both the population labels and the results of Admixture. This can be explained by the gap statistic used by the algorithm that was not able to select the correct values of
In addition to the individuals clustering, both SHIPS and AWclust provide tree structures that allow the analysis of the relationship between populations. The corresponding graphical representations, presented in
The method PCAclust selected the number of principal components to be used for the clustering using the Tracy-Widom statistic (
The performances of this method are however better when applied to real datasets, especially when compared to the population labels. When comparing the clusterings produced by PCAclust to Admixture, the results are more mixed. PCAclust estimated more clusters than Admixture and split populations that this latter algorithm considered coming from the same ancestral populations. A reason might be that even though the two algorithms are somehow linked
The methods discussed here are composed of two parts to analyze the structure of the populations. The first is the ability to assign individuals to relevant clusters and the second is the ability to estimate a proper optimal number of clusters
In terms of ease of use of the algorithms, the nonparametric ones generally have the advantage of demanding fewer input parameters than parametric approaches. In addition to the data, SHIPS needs the maximal number of clusters investigated and the number of null simulations for the gap statistic. Parametric algorithms usually need many input parameters, which often pertain to the underlying statistical models and are therefore more complicated to set. This is the case for Structure; Admixture, however, needs only the maximal number of clusters and the parameter to conduct the cross-validation.
Considering the computation time of the algorithms, PCAclust is the fastest, e.g. taking less than an hour when applied to the PanAsian data. SHIPS and Admixture take a couple of hours while AWclust is close to a day and Structure several days. Even though PCAclust is the fastest algorithm that we considered in our comparison, one has to note that the program does not come as a package and has to be recoded. The other methods that we considered have the advantage of being freely available in the form of packages.
Several particularities of the SHIPS algorithm can be highlighted. The divisive strategy is based on the rationale that a clustering method has to be applied iteratively to the subpopulations in order to detect the cryptic structures that are hidden behind the main structure of the data. SHIPS finely investigates each estimated cluster to determine if it can be divided into several relevant subclusters. This division procedure, which is equivalent to the construction of a binary tree, is conducted using a spectral clustering that takes as input a similarity matrix. This similarity matrix has to be computed only once for all the data, and submatrices corresponding to the subclusters investigated can be extracted at each step. This renders the construction of the tree a fast and efficient part of the algorithm. One has to note that the individual assignment part of the SHIPS algorithm is intimately linked to the choice of a proper similarity matrix. We decided to consider a matrix based on the allele sharing distance as it is fast to compute and led to accurate clustering results. It is however possible to use different matrices that could lead to even better clustering performances
The pruning procedure leads to several possible clusterings of the samples. These configurations are all nested within each other. This allows in one run of the algorithm to get for all possible
SHIPS does not use the same version of the gap statistic as the one used in AWclust. As explained in the Methods section, we decided not to consider the logarithm of the within-cluster sum of squares but directly the sum of squares. This indicator showed better empirical performances to estimate the optimal
Also, we determined through several experiments that repeated applications of the SHIPS algorithm to the same dataset lead to the same clustering results. This robustness of the algorithm confirms that SHIPS is a powerful tool to detect population structure.
The novel clustering approach presented in this paper was applied to SNP data. It produces accurate clustering results and is therefore a promising method to uncover the genetic structure of many populations. Also, one has to note that the methodology of SHIPS, that is, the divisive strategy, the subsequent pruning and the gap statistic, can easily be extended to cluster other sorts of data such as gene expression. Provided that a proper distance matrix is used and that an adequate simulation process for null reference datasets of the gap statistic is applied, various usages of the SHIPS algorithm can be expected.
We thank Marine Jeanmougin for helpful discussions and thank Fabrice Glibert, Gilles Grasseau, Maurice Baudry and Ilya Chumakov for their support. We also thank the 2 anonymous reviewers for their constructive comments.