On Identifying the Optimal Number of Population Clusters via the Deviance Information Criterion

Inferring population structure using Bayesian clustering programs often requires a priori specification of the number of subpopulations, , from which the sample has been drawn. Here, we explore the utility of a common Bayesian model selection criterion, the Deviance Information Criterion (DIC), for estimating . We evaluate the accuracy of DIC, as well as other popular approaches, on datasets generated by coalescent simulations under various demographic scenarios. We find that DIC outperforms competing methods in many genetic contexts, validating its application in assessing population structure.


Introduction
A common problem in modern population genetics is identifying population substructure among a sample of individuals genotyped across a set of neutral genetic markers. Bayesian clustering algorithms such as STRUCTURE [1,2] and BAPS [3] and their derivates [4][5][6][7][8] are commonly used for addressing this problem. Of particular concern to many investigators is estimating the number of subpopulations or clusters K that are necessary and sufficient to explain observed patterns of genetic variation. Part of the reason investigators are concerned with the ''choosing K'' problem is that many of the classification algorithms (including STRUCTURE) require specifying the number of clusters as a parameter in the model. A consequence of this is that the biological conclusions one draws from the data may be artificially dependent on the value of K chosen. In practice, many investigators analyze their data using a range of values for K, reporting the output for all (or a plausible set of) K's and/or employ one of several post hoc statistics [1,4,9] to choose an optimal value for K. The purpose of this communication is to report our experience with the Deviance Information Criterion (DIC) as a statistic for choosing K. By comparing the performance of DIC to other commonly used statistics on simulated data under a variety of population genetic scenarios, we find that it often outperforms other approaches and recommend it be considered by investigators interested in estimating K from genotype data. Its advantage over more complex approaches such as the reversible-jump Markov chain Monte Carlo (MCMC) or the Dirichlet process prior on K, is that calculating DIC requires trivial computational overhead once the MCMC has been run.
Choosing K is a difficult problem in the Bayesian clustering setting, because as K increases, the likelihood of the data increases monotonically, as well as the complexity of the model. Adding more degrees of freedom to the analysis generally improves the overall fit of the model to data. This often results in monotonic non-decrease in the probability of the data given K as K increases [1,9]. A common way of dealing with this class of statistical problems (known as ''model selection'') is to use a penalizing function which weighs the fit of a model versus its complexity. This is the underlying idea behind many model selection statistics such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). The Deviance Information Criterion (DIC) is a recently proposed statistic for model selection when the posterior distribution of parameters in competing models are estimated using Markov chain Monte Carlo, as is the case with STRUCTURE and its derivatives [10].

Results
We applied the Deviance Information Criterion to estimate K for datasets generated by coalescent simulations under various demographic scenarios and for the large-scale genotype data from Human Genome Diversity Panel. We evaluated the accuracy of DIC in comparison with other popular approaches and demonstrate that DIC performs well in a variety of scenarios.

Application to Simulated Data
We performed extensive coalescent simulations using multiple demographic models, including Models Split, Tree, M 0:5 , M 2:0 , M 10 and Inbred (see Section Methods and Figure 1). Models Split and Tree implement the distinct demographic histories during subpopulation formation. Models M 0:5 , M 2:0 , and M 10 are used to investigate the impact of different levels of exchange among subpopulations on the inference of population structure. Model Inbred is designed to test the effect of the confounding factor ''inbreeding''. To evaluate the robustness of our method in the case of scarcity of data, we also simulate the Model Split with n~10 individuals or S~10 SNPs. The last scenario tested is to reduce the splitting time among subpopulations by a factor of ten. This is equivalent to decreasing the genetic distances among subpopulations, which implicitly reflects the various levels of physical distances among populations. Then we ran each data set through InStruct [5] with five MCMC chains for each value of K, retaining a total of 50,000 iterations after a 500,000 iteration burnin period with a thinning interval of ten iterations between retained draws. Figure 2 illustrates the performance of DIC on a randomly selected data set generated under Model Split with true K[f1,2,3,5g. For these four data sets, {DIC always peaked at the correct K values for all the chains. (Note that we choose to plot {DIC because it is often easier to visualize a maximum peak than a minimum trough).
To place our work in a broader context, we also ran these data sets through five methods commonly used to estimate K: (1) the approximate likelihood method implemented in STRUCTURE using both the original and correlated allele frequency model, (i.e., the ''F'' model [1,2]), (2) the DK approach based on running STRUCTURE with both the original and F models [9], (3) Eigenanalysis method (implemented in ''SmartPCA'' software) proposed by Patterson et al. [11] which estimates K as 1 plus the number of significant eigenvalues underlying a principal component decomposition (PCD) of the scaled genotypic value matrix, (4) Structurama which uses a Dirichlet process prior model to partition a sample into subgroups [12,13], and (5) BAPS utilizing the splitting and merging strategy to attain the best classification [3,[6][7][8]. We also conducted preliminary analyses using the regularization method [4], but found that it consistently performed poorly for moderate values of K (e.g., the accuracy was under 50% when K~3 under the Split model).
We assessed the accuracy of each method as the proportion of data sets which correctly recover the value of K used in data simulation using the optimality criterion defined for each approach. For example, for DIC, we used the lowest DIC value observed across five independent MCMC chains run for each of the six values of K. For Eigenanalysis, we assessed accuracy under three significance levels (a[f0:05,0:01 and 0:001g). For Structurama, we chose the partitions of individuals with the highest posterior probabilities under two prior distributions: (1) a noninformative prior on the number of clusters, and (2) a prior distribution with the expected number of clusters equal to the true value of K used to simulate the data. We use the individual clustering mode of BAPS as our simulation does not include admixture.
Under the case of simple population splitting with a high degree of population differentiation, i.e., F ST values around 0:5, we found that the DIC method consistently outperformed other approaches (see Table 1). For example, under Model Split, the accuracy is near 100% for all values of K considered. STRUCTURE, on the other hand, has an accuracy that ranges from 54% to 100% depending on the true K and whether or not the F model is employed. We also observe that the accuracy of DK decays with K, starting at 100% for K~2 and reaching 50% and 64% for the F and non-F models, respectively, at K~5. Eigenanalysis tends to perform well, but is sensitive to the choice of a with smaller values (e.g., 0:1%) of a performing better than higher values (e.g., 5%). The performance of Structurama on simulated data was interesting. It performed perfectly well when K was small (Kv3) but when Kw3, it tended to fail almost completely. We posit that this may be due to the tendency of the Dirichlet process mixture model to overcluster, which results in K being underestimated. An alternative explanation is that the Dirichlet process prior fails to converge within a finite number of iterations in practice, which commonly challenges many other mixture model methods [14]. BAPS performs perfectly well, except in the case of K~4, it drops to 82%. The performance of most methods under the complex splitting model (i.e., Model Tree) was similar to the performance under Model Split. This implies our results are robust to moderate  deviations from the K-wise subpopulation split topology assumed in STRUCTURE.
Migration among subpopulations, on the other hand, can have a profound impact on the accuracy of all approaches. When migration rates are low between subpopulations (Model M 0:5 ), DIC, BAPS, and Eigenanalysis with a stringent p-value cutoff both worked perfectly. STRUCTURE also performed reasonably well with accuracy rates ranging between 84% and 100% (see Table 2). When the migration rates among subpopulations are intermediate F ST *0:06), all methods showed a decrease in accuracy. For example, the accuracy of DIC noticeably decreases with K reaching a low of 54% for K~5 (see Table 3). The original STRUCTURE model also performed poorly with accuracy well below 20%. Interestingly, in the case of strong migration, the F model's accuracy is much higher both for the STRUCTURE and DK statistics. This is probably because the correlated alleles model is doing a good job in modeling patterns of genetic variation among admixed subpopulations. Since InStruct does not implement an F model, we predict that adding the F model to InStruct or implementing DIC within STRUCTURE with the F model would perform as well or better than these statistics. Eigenanalysis also seems to handle the high migration rate scenario well. Its accuracy decreases only slightly with K, compared to the low migration rate case. Intriguingly, the most stringent significance level for high migration does not necessarily perform best, as it does with the slower migration models. This suggests that it may be challenging to find the optimal tuning of a for best classification accuracy when using PCD and a Tracy-Widom approximation to the distribution of p-values. We also observe that Structurama appears to be very sensitive to migration. It clusters all individuals into one group for every data set under Model M 10 , i.e. no matter which prior is used, Structurama incorrectly estimates K~1 for every simulated data set. Our results differ from [12], who found Structurama worked well in estimating K under certain scenarios. We believe the differences may be due to the details of the simulation used. They considered an island model with migration, whereas we used a population-split model with subsequent migration among demes. This slight difference leads to more subpopulation differentiation in their simulations than ours, since they have a longer expected coalescent time between demes than we do. (That is, in our simulations all demes merge, looking back in time, at the time of population splitting). BAPS's accuracy decreases sharply as K increases, implying that it performs poorly in the case of weak population differentiation. When we assessed accuracy under the inbreeding model, assuming undetected inbreeding (such as partial self-fertilization) within subpopulations, we found again that DIC tends to outperform other methods (see Table 3). It is important to note that in calculating DIC, we have used InStruct's inbreeding model whereas the other approaches based on STRUCTURE assume the Hardy-Weinberg equilibrium within clusters. We, and others, have shown that failing to consider inbreeding in the likelihood calculation for STRUC-TURE can lead to spurious signals of population admixture and erroneous inference of the number of subpopulations [5]. This phenomenon appears to cause a large reduction in the accuracy of estimating K by STRUCTURE's F model with only 20% of simulations uncovering the true number of populations underlying the data. Eigenanalysis, which does not account for inbreeding either, likewise overestimates the number of subpopulations, and has an accuracy ranging from 61% to 100% depending on the true value of K. Both Structurama and BAPS are not heavily affected by hidden inbreeding, and have the similar accuracy pattern as under Model Split.
To assess the robustness of DIC in the limit of small data sets, we simulated data under Model Split for n~10 individuals or S~10 SNPs. We found that the accuracy of DIC is robust to the former, but not the latter (Table 4). When the subpopulation size decreases to 10, DIC performs almost as well as with a larger number of individuals per subpopulation. STRUCTURE and DK, on the other hand, show a significant reduction in accuracy as K increases to 5. Eigenanalysis shows a reduction in accuracy only when using a stringent p-value cutoff. When the number of markers is reduced to only 10, DIC's accuracy falls to 42% when K increases to five, which is expected as DIC is an asymptotic approximation that only holds as the sample size is sufficiently large, and the accuracy of STRUCTURE and DK is close to zero. With so few markers, Eigenanalysis fails to provide an output. Structurama also performs poorly under larger values of Ks. BAPS is robust to the decrease in sample size but is strongly affected by reducing the number of markers. While we conclude that DIC is more robust than other approaches to small data sizes, we, of course, expect accuracy to increase with S and so recommend that investigators genotype as many unlinked markers as is economically feasible. Under the Split model, our simulated data sets had a high degree of population differentiation (F ST among clusters was around 0:5).
To investigate the effect of weaker population structure on estimation accuracy, we simulated data with a reduced splitting time of 0.05 in units of 4N e generations. This gives simulated data with F ST among subpopulations in the range of 0:08*0:12. We found that shortening the splitting time, not surprisingly, reduced the accuracy of all methods with results similar to those observed for the strong migration among subpopulations (Model M 10 ). We note, in particular, that the Bayesian methods showed a decrease in accuracy with increasing K. Interestingly, Eigenanalysis performed quite well, particularly using the less stringent significance level (see Table 5), which is consistent with the original results of [11] that their approach can detect very fine-scale population structure.

Application to Human Data
To demonstrate a concrete application of DIC, we have applied the approach with the inbreeding model of InStruct to the Human Genome Diversity Panel (HGDP-CEPH) data from [15], containing 1056 individuals from 52 populations, genotyped at 377 autosomal microsatellite loci. We find that DIC estimates K~6 for these data as shown in Figure 3A. The five clusters we estimate (see Figure 3B) correspond approximately to the geographic regions of Africa, Europe/the Middle East/Central-South Asia, the Americas, East Asia, and Oceania as described by [15]. It is interesting to note that in our classification, we also found evidence that some alleles from the San people of Namibia, Africa, may form a sixth minor cluster with a posterior inbreeding coefficient estimate around 0.20, the highest of all clusters.

Discussion
The Deviance Information Criterion is a simple and effective model selection method for estimating K, the number of clusters underlying a sample of individuals. We anticipate this approach will have wide applications in population structure inference. One important factor affecting our estimation of the accuracy of DIC is the underlying probabilistic model used in InStruct. Since InStruct takes inbreeding into account, it naturally outperforms approaches that do not model non-random mating explicitly. At the same time, since we do not implement the F model, we do poorly when migration rates are high and allele frequencies are similar among clusters. Furthermore, the accuracy of DIC sometimes fluctuates with the quality of the classification of individuals into clusters. As in any complex MCMC framework, the likelihood surface may be multimodal for a given value of K. In practice, we have observed that DIC values may vary substantially among independent MCMC chains for the same dataset, especially for larger K values, due to poor mixing of MCMCs under some scenarios. We recommend that for a given value of K, several chains be run and the minimum value of DIC across chains be used for inference. It is also important to note that population structure is a complex concept with a hierarchical form and multiple levels. DIC infers the best partition of a group of individual genetic materials taken as a whole. To investigate the finer scale of subpopulation structure, we suggest further structure analysis within each inferred cluster.

DIC Statistic
Here we introduce the Deviance Information Criterion formula in details. Denote f (y i jh) for i~1, . . . ,n as the probability of observing individual i's genotype given parameters h of the model which include factors such as subpopulation allele frequencies, probabilities of assignment, inbreeding coefficients, etc. For a given multivariate parameter vector h, we define the deviance as: The above formula is easily recognized as the usual log-likelihood function evaluated at h. [10] defines the Deviance Information  Model Inbred K subpopulations without migration, each subpopulation with a randomly sampled selfing rate.
For Model Split, M 0:5 , M 2:0 , M 10 and Inbred, all subpopulations split from a common ancestral population at a time t~0:5 in the past scaled in units of 4N e generations, where N e is the effective subpopulation size. In Model Inbred, partial selffertilization within subpopulations is taken into account using the same simulation scheme as in [5]. For Models M 0:5 , M 2:0 and M 10 , K~1 is omitted as there is no migration in the case of only one subpopulation. Besides the star-shaped genealogy among subpopulations in Model Split, Inbred, M 0:5 , M 2:0 and M 10 , we also considered the tree topology relationship among subpopulations described in Model Tree as illustrated in Figure 1. For this model, the K~1 and 2 cases are ignored since they are identical to the corresponding Ks under Model Split.
To assess the robustness of our conclusions to changes in sample size, the number of loci genotyped, or population divergence time, we undertook further simulations using Model Split. First, we reduced subpopulation size from n~50 to n~10. Second, we reduced the number of markers used in the analysis from 100 to 10. Third, we reduced the splitting time from the common ancestral population from t~0:5 to 0:05. For each of the nine contexts described above (6 models+3 robustness conditions), we simulated 50 replicate data sets per value of K.