Revealing multi-scale population structure in large cohorts

Population structure in genetic data depends on complex demographic processes including geographic isolation, genetic drift, migration, and admixture. Together with technical artifacts, population structure is a prominent confounder of genomic studies. Identifying such patterns is therefore central to the genomic enterprise. Whereas many methods can identify specific types of population structure, few are able to provide simple representations of genomic diversity across a range of scales. Here we investigate an approach to dimension reduction and visualization of genomic data that combines principal components analysis (PCA) with uniform manifold approximation and projection (UMAP) to succinctly illustrate population structure in large cohorts and capture their relationships on local and global scales. We demonstrate using genotype data from the 1000 Genome Project, the Health and Retirement Study, and the UK Biobank that projections using PCA-UMAP effectively cluster individuals who are genetically closely related while placing them in a global continuum of genetic variation. These projections reveal non-trivial population groupings, reflect ethnicity and geography on fine-scale levels, and uncover patterns in the distributions of a variety of phenotypes, establishing PCA-UMAP as a general-purpose approach toexploratory analysis in genomics.

Questions in medicine, anthropology, and related fields hinge on interpreting the deluge of genomic 9 data provided by modern high-throughput sequencing technologies. Because genomic datasets 10 are high-dimensional, their interpretation requires statistical methods that can comprehensively 11 condense information in a manner that is understandable to researchers and minimizes the amount 12 of data that is sacrificed. Both model-based and model-agnostic approaches to summarize data have 13 played important roles in shaping our understanding of the evolution of our species (1). 14 Here we will focus on nonparametric approaches to visualize relatedness patterns among individuals 15 within populations. If we consider unphased single nucleotide polymorphism (SNP) data, an 16 individual genome can be represented as a sequence of integers corresponding to the number of 17 derived alleles carried by the individual at each of the L SNPs for which genotypes are available, 18 with L typically larger than 100, 000. Since each individual is represented as an L-dimensional 19 vector, dimension reduction methods are needed to visualize the data. 20 Principal component analysis (PCA) is often the first dimensional reduction tool used for genomic 21 data. It identifies and ranks directions in genotype space that explain most-to-least variance 22 among individuals. Positions of individuals along directions of highest variance can then be used 23 to summarize individual genotypes. PCA coordinates have natural genealogical interpretations 24 in terms of times to a most recent common ancestor (TMRCA) (2), and are used empirically to 25 reveal admixture (3), continuous isolation-by-distance (4,5), as well as technical artefacts. PCA 26 coordinates are particularly well-suited to correct for population structure in GWAS (6). 27 As sample sizes increase, the amount of information encoded in the lower-variance principal 28 components increases, and researchers typically examine multiple two-dimensional projections to get 29 a sense of the data. While many features of the data can be identified in this manner, other features 30 may be hidden by the projections or hard to interpret. 31 To display more of the high-dimensional features of the data in a two dimensional figure, we can 32 use non-linear transformations that seek to preserve the local structure of the data. A popular 33 method for visualization is t-distributed stochastic neighbour embedding (t-SNE) (8). Whereas PC 34 projection is designed to capture as much variance as possible for a linear transformation, t-SNE seeks 35 a low-dimensional representations of the data that preserves pairwise distances, using a probabilistic 36 Fig. 2. UMAP on the first 10 principal components of HRS data. coloring individuals by estimated admixture from three ancestral populations reveals considerable diversity in the Hispanic population. Individuals with intermediate admixture ratios form a "bridge" connecting three clusters of Black, White, and Hispanic individuals. The projection colored by self-identified race and Hispanic status is presented in figure S3 grounded in Riemannian geometry, algebraic topology, and category theory, and designed to model 43 and preserve the high-dimensional topology of data points in the low-dimensional space (11). 44 The assumption behind UMAP is that data are uniformly distributed on local manifolds in high-45 dimensional space, which can be approximated as fuzzy sets that are patched together to form a 46 topological representation. One can then construct a low-dimensional topological representation 47 that minimizes the differences between the two representations. With genotype data, UMAP creates 48 a neighbourhood around each individual's genetic coordinates and identifies a pre-selected number 49 of neighbours to build high-dimensional manifolds. The end result is a low-dimension representation 50 that groups genetically similar individuals together on a local scale while preserving long-range 51 topological connections to more distantly related individuals.

52
A common practice in dimensional reduction is to first apply PCA to reduce the number of 53 dimensions before performing nonlinear dimensional reduction. In addition to being computationally 54 advantageous, this discards statistical noise that can confound nonlinear approaches: Population 55 structure arising from n isolated randomly-mating demes can be described by the leading n − 1 56 PCs, with the following PCs describing stochastic variation in relatedness (6). Selecting the leading 57 PCs therefore has potential to extract meaningful population structure while filtering out stochastic 58 noise. 59 We explore different strategies to pre-process the data and investigate discrete and continuous 60 population structure patterns present in large datasets of human genotypes: the 1KGP, the Health 61 and Retirement Study (HRS) (12), and the UK BioBank (UKBB) (13).

63
Fine-scale visualization of the 1KGP dataset. The 1KGP contains genotype data of 3,450 individuals 64 from 26 relatively distinct labeled populations(7). Figure 1 shows visualizations using PCA, t-65 SNE, UMAP, and PCA-UMAP (that is, UMAP with PCA pre-processing). Using UMAP and 66 t-SNE on the genotype data presents clusters that are roughly grouped by continent, with UMAP 67 showing a clear hierarchy of population and continental clusters, whereas t-SNE fails to assign many 68 individuals to population clusters. Using either on the top principal components leads to more 69 distinct population clusters and less defined continental structure (see figure S2 for PCA-tSNE).

70
Adding more components results in progressively finer clusters until approximately 20 populations 71 appear using 15 components; further components gradually approach results similar to using the 72 entire genotype data (see figures S1 and S2).

73
Focusing on PCA-UMAP with 15 principal components (figure 1 (iv)), we also find several 74 population clusters that reflect shared ancestries. British individuals from England and Scotland 75 form a cluster mixed with those from Utah who claim Northern and Western European ancestry.  clustering is robust to, e.g., the choice of the number of PCs considered (see figure S2).

91
Finally, contrasting UMAP and t-SNE, we find that UMAP preserves more of the global structure 92 of the data than t-SNE, and is more robust to choices of data pre-processing (figure S2).

93
The genetic continuum of admixed populations. The 1KGP sampled individuals from relatively distinct 94 population groups across the world, which makes the data particularly easy to cluster. Most medical 95 cohorts comprise larger numbers of individuals sampled across extended geographical areas.

96
For example, the HRS contains genotype data of 12,454 American individuals across all 50 states 97 who have provided racial identity (10,434 White, 1,652 Black, 368 Other) as well as whether they 98 identify as Hispanic (1,203 total) and, if so, whether they identify as Mexican-American (705 99 total) (12). We crossed these three variables to form a composite self-reported ethnicity resulting 100 in 10 categories (e.g. White Hispanic Mexican-American), and considered birth regions based on 101 the 10 census regions and divisions used by the US Census Bureau. Admixture proportions for 102 each individual were estimated in (14) by assuming ancestral African, Asian/Native American, and 103 European populations using RFMIX (15). We have scaled these three proportions to values between 104 0 and 255, to color individual points by their estimated admixture represented by RGB where red, 105 green, and blue respectively correspond to African, European, and Asian/Native American ancestry. 106 Using the first 10 principal components and UMAP, we demonstrate projections that present a 107 collection of sub-populations and a continuum of genetic variation.

108
The HRS forms several large groupings and clusters, reflecting both ethnicity (figure S3) and 109 admixture proportions (figure 2). Gradients in admixture proportion are clearly visible within the 110 predominantly Hispanic cluster, but not within the predominantly Black cluster, perhaps because 111 the variance in ancestry proportions is greater among Hispanics. The "White Not Hispanic" (WNH) 112 group forms four interconnected clusters, and these do not correspond to broad geographical areas 113 (figure S34) or clear ancestries represented in the 1KGP (figure S35). The clarity of the interconnected 114 clusters varies by parameterization, but they consistently form a large, roughly connected group. 115 One cluster of WNH individuals -possibly of Italian or Spanish ancestry -consistently appears 116 near the White Hispanic population, and another regularly appears at the periphery of the main 117 cluster and does not cluster with 1KGP populations (see figure S35).

118
Regional patters in the American Hispanic population. By contrast to the WNH individuals, applying 119 PCA-UMAP to self-identified Hispanic individuals reveals clear groupings related to birth region. 120 One separate cluster, highlighted in figure 3, consists almost entirely of individuals born in the 121 Mountain Region of the United States. This cluster is not apparent when looking at a grid of pairwise 122 plots of the first 8 principal components, provided in figure S36, as the signal is distributed along 123 PCs 3, 4, and 6. Even though continental admixture patterns do correlate with UMAP position 124 (figure S37), these do not explain the Mountain Region cluster. The cluster possibly comprises the 125 Hispano population of the Southwest US, who have been present in the Mountain Region area long 126 before the more recent immigrants from Latin America, and whose ancestry is expected to reflect 127 both distinct Native ancestry and population-specific drift relative to other Hispanic populations. A 128 recent preprint discusses the Mountain Region Hispanics and provides a more detailed historical 129 description (16).

130
Population structure in the UKBB reflects local and global genetic variation. The UKBB provides geno-131 type data on 488,377 individuals along with self-identified ethnic background in a hierarchical 132 tree-structured dictionary. Participants provided ethnic background on two occasions. We used the 133 initial ethnicity after finding minimal differences between the two. The dataset is majority White 134 (88.3% British, 2.6% Irish, 3.4% other), with large populations identifying as Black (1.6% either 135 African, Caribbean, or other), Asian (1.9% either Indian, Pakistani, Bangladeshi, or other), Chinese 136 (0.3%), an other ethnic group (0.8%), mixed ethnicity (0.6%), or an unavailable response (0.5%).  using other methods, succinctly illustrating the complex structure of large and multiethnic datasets.

163
The detailed shape of extended clusters is not stable as we vary the number of PCs included. 164 Figure S4 shows a UMAP plot using the top 20 PCs from the UKBB. The shape of the "White British 165 Cluster" is notably different, and we observe finer patterns of geographic variation, yet the qualitative 166 observations made above are maintained. As an alternate visualization of diversity's correlation 167 with geography, we performed a 3D UMAP projection and converted the normalized UMAP values 168 into RGB values, allowing us to plot individuals on a map of Great Britain, emphasizing both 169 spatial gradients of genetic relatedness and increased diversity in urban centers (figure 7). The 170 patterns in rural areas observed are similar to those reported in (17) using the haplotype-based 171 CHROMOPAINTER on British individuals whose grandparents lived nearby.

172
Similarly to UMAP, t-SNE applied to the UKBB data both displays diversity within the "White 173 British" population and identifies clusters among other groups. However, it has three drawbacks: it 174 Fig. 6. The UKBB projected onto two dimensions using PCA-UMAP (as in 4b), with females colored by age-adjusted difference from mean population height (left) and leukocyte counts (right). To protect participant privacy, data has been randomized as explained in the materials and methods section. the difference is not consistent (18,19).  Discussion. Understanding population structure is important to identify confounders in medical 208 genomics and studies of anthropology and human evolution. PCA of genomic data reflects genealogical 209 and geographic data, but visualization in large datasets still requires scanning through a large 210 number of pairwise plots. UMAP condenses these components and comprehensively illustrates 211 information -phenotypic, geographic, and ancestral -contained within genotypes on fine-scale 212 levels and within the context of a global population structure. In large datasets where the number 213 of significant PCs is large, the resulting representation has important advantages over PCA alone 214 and provides a superior visualization to t-SNE.  With these caveats in mind, a priori data visualization plays a central role in quality control, 247 hypothesis generation, and confounder identification for a wide range of genomic applications.

248
Nonlinear approaches, despite their limitations, become increasingly useful as the size of datasets 249 increases: We have shown that UMAP, in particular, reveals a wide range of features that would not 250 be apparent using linear maps. Given its ease of use, breadth of results, and low computational cost, 251 we propose that UMAP should become a default companion to PCA in large genomic cohorts.

253
We used genotype data from 12,454 individuals from the Health and Retirement Study (HRS), genotyped on 254 the Illumina Human Omni 2.5M platform (12). Principal components were computed in PLINK v1.90b5.2 255 64-bit(21) using variants with a minor allele frequency greater than 0.05, Hardy-Weinberg p-value of more 256 than 1 × 10 −6 , and genotype missing rate of less than 0.1, and sample with genotype missing rate of less 257 than 0.1. We used the principal components of genotype data from 488,377 individuals in the UK BioBank 258 (UKBB) as computed by the cohort (13). We used genotype data from 3,450 individuals from the 1KGP 259 project using Affy 6.0 genotyping(7).

264
Both UMAP and t-SNE feature a number of adjustable parameters. Among the parameters that we 265 varied, the number of PCs used in pre-processing of the data has the largest effect for both methods (see 266 figures S1 and S2). 267 We tested different choices for perplexity in t-SNE. The default value of 30 provided comparable 268 performance to other parameter choices. Similarly, we tested different parameter choices for UMAP, with 269 the clearest results generated by specifying 15 nearest neighbours (the default value) and a "minimum 270 distance" between points in low dimensions of 0.5. UMAP developers described "sensible" values for nearest 271 neighbours as between 5 and 50 and minimum distance between 0.5 and 0.001.

272
UMAP and t-SNE projections were carried out on an iMac with a 3.5GhZ Intel Core i7 processor, 32 GB 273 1600 MHz DDR3 of RAM, and an NVIDIA GeForce GTX 775M 2048 MB graphics card.

274
To reduce the potential risks for re-identification from results in this publication, data has been randomly 275 permuted so that the population characteristics are preserved but individual-level data is not presented 276 directly in the figures. We rounded each attribute to an attribute-specific number of bins, and then 277 permuted the data in the following way: For each point (i.e. each individual) in UMAP visualizations, 278 and each attribute, we identified the 9 nearest neighbouring points, and copied the attribute from a 279 randomly selected neighbor (thus allowing for the possibility of one value being printed twice). Because 280 this process is done independently for each visualization, a given point shown on the figure will copy values 281 from different randomly selected individuals. Additionally, spatial coordinates have random noise added 282 (normally distributed about 0 with a standard deviation of 50km) before binning to the nearest 50km.                                    S35. UMAP on the top 10 principal components of the combined HRS and 1KGP datasets. The colors match those used in the separate 1KGP and HRS plots. The large blue clusters are "White Not Hispanic" individuals. Scattered brown dots within these clusters are Utah residents with Northern/Western European ancestry, and British in England and Scotland ("CEU" and "GBR", respectively, in the 1KGP datasets). A cluster of Finnish individuals consistently shows up and is labelled "FIN". Toscani individuals regularly form clusters and are labelled "TSI". However, their locations relative to the other clusters of "White Not Hispanic" are not consistent, sometimes appearing next to clusters of Spanish individuals and other times not. One cluster of "White Not Hispanic" individuals is highlighted in a box -unlike the other clusters, CEU and GBR individuals are not projected here.