Inference of Population Structure using Dense Haplotype Data

The advent of genome-wide dense variation data provides an opportunity to investigate ancestry in unprecedented detail, but presents new statistical challenges. We propose a novel inference framework that aims to efficiently capture information on population structure provided by patterns of haplotype similarity. Each individual in a sample is considered in turn as a recipient, whose chromosomes are reconstructed using chunks of DNA donated by the other individuals. Results of this “chromosome painting” can be summarized as a “coancestry matrix,” which directly reveals key information about ancestral relationships among individuals. If markers are viewed as independent, we show that this matrix almost completely captures the information used by both standard Principal Components Analysis (PCA) and model-based approaches such as STRUCTURE in a unified manner. Furthermore, when markers are in linkage disequilibrium, the matrix combines information across successive markers to increase the ability to discern fine-scale population structure using PCA. In parallel, we have developed an efficient model-based approach to identify discrete populations using this matrix, which offers advantages over PCA in terms of interpretability and over existing clustering algorithms in terms of speed, number of separable populations, and sensitivity to subtle population structure. We analyse Human Genome Diversity Panel data for 938 individuals and 641,000 markers, and we identify 226 populations reflecting differences on continental, regional, local, and family scales. We present multiple lines of evidence that, while many methods capture similar information among strongly differentiated groups, more subtle population structure in human populations is consistently present at a much finer level than currently available geographic labels and is only captured by the haplotype-based approach. The software used for this article, ChromoPainter and fineSTRUCTURE, is available from http://www.paintmychromosomes.com/.

TEXT S6 Empirical validation of c TEXT S6.1 Calculation of c We first segment the genome into contiguous segments of constant number of chunks d. The number of chunks donated to individual i from j in segment k is x ijk , and d is chosen such that x ijk is approximately independent (for different k and conditional on i and j). This means that different individuals may have a different number of segments if they have different patterns of recombination. In practice, we found that d = 100 works well for the HGDP data, but due to the high LD present in the linked simulation data, there were only an average of 20 chunks per region. We therefore took the whole region to be a segment in this case and computed c using the full 200 region dataset. Then we compute s ij = k (x ijk ) and s 2 ij = k (x 2 ijk ). If individual i has R i segments in total, we can calculate the theoretical variance for x ij by first estimating the rate of inheriting from each other individualP ij = s ij / j s ij and substituting into the multinomial variance: The empirical variance is: This leads (with correction for the known overcounting factor of 2) to the estimate of c: and we simply take the mean value as our estimate: Note that we provide a helper program called 'ChromoCombine' that calculates this, and which can easily use the two options of d described. It also handles summation of multiple files in case of parallelization was used for processing individuals and/or chromosomes separately.

TEXT S6.2 Validation
In this section we present evidence of the effect of varying the rescaling factor 'c' on inference. Note that we view c as a summary of the data in the same way as the coancestry matrix X, and not as a parameter -it is therefore not appropriate to perform inference for it in the standard Bayesian way.
For interpretation of the empirical evaluation presented, we note that when c is 'too large', the effective number of chunks is reduced and therefore any mistakes in population assignment will tend to be under-split, i.e. we will not distinguish efficiently between similar populations. When c is 'too small' our model believes it has more independent chunks than is true and therefore will tend to over-split populations. The smallest c that does not over-split is called efficient, and larger c are called conservative.
We start with the unlinked model in the case where there is no population structure. Provided population sizes are large, we expect the theoretical results derived in Section S4 above to hold. Specifically, the theoretical prediction (for the approximately correct data likelihood) in the case of unlinked data is (Proposition 4 of Section S4.4): which is equal to Equation 1 of the main text: For this section we have generated datasets containing 15000 non-rare (> 5% allele frequency) unlinked SNPs (at varying N ) under the same splitting scenario as the main text. Simulation for each SNP was by a) generating the 'ancestral frequency' f with p(f ) ≈ 1/f (since this is not a probability distribution, we first choose which of 20 bins in the range 0 and 1 the SNP is from, then sample f conditional on this), then b) applying a normally distributed drift matrix for population level drift Σ, giving population level frequency vector g ∼ MVN(f , f (1 − f )Σ), and c) sampling individuals SNP values according to this frequency. (SNPs with empirical frequency below the 5% threshold were resampled). The covariance matrix for the drift was: The theoretical prediction is compared to our empirical estimate of c on this dataset in Figure S1 which shows that our theoretical understanding of c is correct, i.e. that the correlation with the truth is 1 at the predicted value and that both the theoretical and empirical estimates of c are approximately efficient and equal for large N . The empirical estimate is conservative for small N .
Our empirical estimate of c is also applicable in the case of linked data, whether using our linked or unlinked models. For the linked simulated data described in the Results section of the main text, we perform a similar scan of c and N to check that our algorithm is computing an appropriate value of c in realistic circumstances. Figure S2 shows these results.
The value of c estimated by the empirical method again appears to be conservative for small N and approximately efficient for large N . In both cases, the 'truth' (if obtainable from the data) is still obtained for a very wide range of c > c i.e. greater than the estimated value. This demonstrates that exact specification of c is not an important issue for many practical purposes.
Note also that the value of c is significantly larger for linked data (in both the linkage and no-linkage models) than for the case of unlinked data. When the unlinked model is used, correlations between neighbouring loci due to linkage disequilibrium increase the variance between regions. When the linked model is used, c values do not fall substantially below 1, even for very large population sizes. Intuition behind the different behaviour of the linked and unlinked models comes from considering the uncertainty in chunk assignment. For the unlinked model, the number of haplotypes which a particular allele is identical to increases linearly with sample size. For the linked model, each addition individual in the sample has the chance of having a haplotype that is a still better match than any preceding haplotype. For this reason, the uncertainty of assignment of each haplotype does not change substantially as additional individuals are added.
We now describe a scenario in which neither the theoretical nor the empirical estimate of c work well; this is because there is not a single suitable value of c for which our model holds in this case. This is the case of unlinked markers with strong differentiation between populations, large numbers of markers and large sample sizes ( Figure S4).Here the estimated value of c gives confident assignment of incorrect splits. The model predictions break down because individuals within the same population all share SNPs with individuals in other populations. If genetic drift is strong at individual SNPs, then sharing this coancestry can give inappropriately weighted information that the individuals are related to each other. In other words, the assumptions of Section S4, Propositions 3-4 do not hold.
It is important to note that this problem is less dramatic in linked data, and essentially does not arise in the linked model. To see why, we note that these correlations arise when an (unlinked) SNP is found in a individual that is common in another population but rare/absent otherwise. Such SNPs can only arise through strong drift, are not excluded because they are not rare overall, and are interpreted as overly strong evidence of shared ancestry. As N becomes large with population sizes n a ∝ N , most SNPs provide O(N −1 ) information on population level copying proportions (which is why c = O(N −1 )). However, strongly drifted SNPs provide O(1) evidence because they are shared with a high fraction of another population, and not with any other individual. For (truly) linked data, such a SNP will be down-weighted due to the average level of correlation between nearby SNPs, so even in the unlinked model we will infer a larger value of c. For the linked model, all chunks are approximately unique and therefore provide O(1) information per chunk, so a strongly drifted SNP will not have a dramatic influence since all chunks are already 'strongly drifted'. The success of our algorithm for simulated linked data also supports this argument.
For the strong drift and unlinked case, we have developed an alternative algorithm in which the likelihood is modified so that these correlations are accounted for. We do this in effect by considering only within-population counts as important, so that when considering a merge move, the between-population counts are normalised to have the same mean. We re-normalise using: where q i is the index of the population for individual i, and |q i | is the number of individuals in that population. This corresponds to ensuring that all row and column sums copying from population b to population a are equal. This is illustrated in Figure S3 which shows the coancestry heatmap for the unnormalized and normalised cases, as well as the difference between them. The coancestry heatmaps are visually very similar, but the undesired correlation structure is clearly visible in this difference. Some individuals have an elevated number of donated chunks to all individuals within a specific population, leading to a 'striped' pattern. The bottom plots show the same thing but where we consider a potential merged population (merging the most recent split). It is clear that the presence of population structure is preserved under this procedure because the two populations have a different profile within the population being merged. Therefore the standard likelihood applied to both the merged and split matrices can correctly identify both populations due to their different rates of copying within and between, and additionally it is not mislead by the correlated copying from other populations. However, were the only distinction between populations B1 and B2 the copy rate from some third population, this would be normalised out. Note that our simple likelihood is not a well defined entity under this modification, because the data depends on the population assignment. There is however an implicit likelihood induced by the modification of the data which is well defined and is correctly comparable both within states of a given number of populations K and for states of differing K, provided that we consider moving an individual between two populations as creating a merged state between the two populations (which defines the normalisation), and creating a split state corresponding to the move.
Although this procedure is more robust than the use of the raw coancestry matrix, it is not recommended for general use because firstly, it discards information about a split that comes from differential chunk counts from other populations, and secondly, it is not a clearly defined model. We have tested this procedure on the HGDP data (unlinked model, results not shown) and obtain broadly similar results to those quoted in the main text with some subtle splits lost: for example, the Tuscan/Italian split is not fully supported. We recommend using it as a conservative check if the value of c is very low (say less than 0.05) and there is strong structure in the dataset.