Scalable probabilistic PCA for large-scale genetic variation data

Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.


Introduction
Inference of population structure is a key step in population genetic analyses [1] with applications that include understanding genetic ancestry [2][3][4] and controlling for confounding in genome-wide association studies (GWAS) [5]. While several methods have been proposed to infer population structure (e.g., [6][7][8][9][10]), principal component analysis (PCA) is one of the most widely used [6,11]. Unfortunately, the naive approach for estimating principal components (PCs) by computing a full singular value decomposition (SVD) scales quadratically with sample size (for datasets where the number of SNPs is larger than sample size), resulting in runtimes unsuitable for large data sets.
In light of these challenges, several solutions have been proposed for the efficient computation of PCs. One approach taken by many recent scalable implementations (FastPCA [12], FlashPCA2 [13], bigsnpr [14], TeraPCA [15], PLINK2 [16]) takes advantage of the fact that typical applications of PCA in genetics only require computing a small number of top PCs; e.g. GWAS typically use 5-20 PCs to correct for stratification [17]. These methods can be grouped according to their underlying algorithm: blanczos (FastPCA, PLINK2, TeraPCA) or the implicitly restarted Arnoldi algorithm (FlashPCA2, bigsnpr). An alternative approach for efficient computation of PCs takes advantage of the parallel computation infrastructure of the cloud [18]. However, the cost of cloud usage is roughly proportional to the number of CPU hours used by these algorithms, making them cost-prohibitive. Finally, these scalable implementations lack a full probabilistic model, making them challenging to extend to settings with missing genotypes or linkage disequilibrium (LD) between SNPs.
In this work, we describe ProPCA, a scalable method to compute the top PCs on genotype data. ProPCA is based on a previously proposed probabilistic model [19,20], of which PCA is a special case. While PCA treats the PCs and the PC scores as fixed parameters, probabilistic PCA imposes a prior on the PC scores. This formulation leads to an iterative Expectation Maximization (EM) algorithm for computing the PCs. ProPCA leverages the structure of genotype data to further reduce the computation time in each iteration of the EM algorithm. The EM algorithm requires only a small number of iterations to obtain accurate estimates of the PCs resulting in a highly scalable algorithm.
ProPCA obtains a computational speed-up through the integration of the Mailman algorithm [21] into its EM algorithm. The Mailman algorithm allows for fast matrix-vector multiplication when there are a finite number of values (e.g. genotypes) in exchange for additional memory usage. As a result, ProPCA requires more memory than some of the other scalable PCA methods. However, the increased memory consumption is reasonable; often still within the memory available within typical computing environments.
In both simulated and real data, ProPCA is able to accurately infer the top PCs while scaling favorably with increasing sample size. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in less than thirty minutes. To illustrate how the ability to compute PCs in large samples can lead to biological discovery, we leveraged the population structure inferred by ProPCA within the White British individuals in the UK Biobank [22] to scan for SNPs that are not well-modeled by the top PCs and, consequently, identify several novel genome-wide signals of recent positive selection. Our scan recovers sixteen loci that are highly differentiated across the top five PCs that are likely signals of recent selection. While these loci include previously reported targets of selection [12], the larger sample size that we analyze here allows us to identify eleven novel signals including a missense mutation in RPGRIP1L (p = 2.09 × 10⁻⁹) and another in TLR4 (p = 7.60 × 10⁻¹²).
A number of algorithms that analyze genotype data, including methods for heritability estimation and association testing, can be modeled as iterative procedures where the core computational operation is similar to that solved by ProPCA. Thus, the algorithm that we employ in this work can potentially lead to highly scalable algorithms for a broad set of population genetic analyses.

Accuracy
We first assessed the accuracy of ProPCA using the simulation framework described in the Methods. We generated datasets containing 50,000 SNPs and 10,000 individuals across q populations, where q was chosen to be 5 and 10. The populations were simulated with levels of population differentiation typical of present-day human populations (values of F_ST ranging from 0.001 to 0.01) and were small enough that we could compute the full SVD, allowing us to estimate the accuracy of the PCs computed by ProPCA. To measure accuracy, we computed the mean of explained variances (MEV), a measure of the overlap between the subspace spanned by the PCs estimated by ProPCA and the subspace spanned by the PCs computed using a full SVD (Methods). All methods are able to estimate highly accurate PCs (values of MEV close to 1) across the range of parameters (Table 1).

Runtime
We assessed the scalability of ProPCA with increasing sample size (Methods). We simulated genotypes from six populations containing 100,000 SNPs and sample sizes varying from 10,000 to 1,000,000 with F_ST = 0.01.
We compared the wall-clock time for running ProPCA, the SVD implementation in PLINK (PLINK_SVD [23]), FastPCA [12], FlashPCA2 [13], bigsnpr [14], PLINK2 [16], and TeraPCA [15]. The SVD implementation in PLINK could not run in reasonable time on datasets exceeding 70,000 individuals (Fig 1a). While all the other methods scale with sample size, ProPCA is faster than every method compared against (Fig 1b). ProPCA computes PCs in about 30 minutes even on the largest dataset, containing a million individuals and 100,000 SNPs. We similarly explored how each method scales with the number of variants: we repeated our experiment varying the number of SNPs from 10,000 to 1,000,000 while keeping the sample size constant at 100,000 and found similar results (Fig 1c). We further tested runtime as a function of the number of PCs on a simulated dataset containing 10,000 individuals, 50,000 SNPs, and 20 latent populations separated at F_ST = 0.01. We find that PLINK2 and ProPCA scale linearly when computing up to the top 40 PCs (S1 Fig). FlashPCA2 is efficient at computing 2-20 PCs, but its runtime increases when computing a single PC or more than 20 PCs; this trend was reproducible across different datasets. We also tested bigsnpr/bigstatsr, which uses the same underlying algorithm as FlashPCA2, and found similar behavior: it is efficient at computing 1-20 PCs, with a steady increase in runtime beyond 20 PCs.
We tested a final scenario in which each method computed 40 PCs on our two largest simulated datasets: one containing one million SNPs and 10,000 individuals, and one containing one million individuals and 10,000 SNPs (Table 2). We report the mean and standard deviation over ten trials. Figure 1b compares the runtimes of all algorithms excluding PLINK_SVD, which could only run successfully up to a sample size of 70,000. Figure 1c displays the total runtime on datasets containing 100,000 individuals, six subpopulations, F_ST = 0.01, and SNPs varying from 10,000 to 1,000,000. All methods were capped to a maximum of 100 hours and a maximum memory of 64 GB and run using default settings. We were unable to include bigstatsr in the SNP benchmark as it does not allow for monomorphic SNPs. Since ProPCA, FastPCA, and FlashPCA2 are all based on iterative algorithms, their runtimes depend on the details of the convergence criterion. We therefore performed an additional experiment comparing the runtime of ProPCA and FastPCA (for which we could instrument the source code) for a single iteration and found ProPCA to be three to four times faster than FastPCA across the range of sample sizes (S2 Fig).
Measuring the accuracy of the PCs (MEV) as a function of runtime (on datasets with a range of F_ST containing 50,000 SNPs and 10,000 individuals so that we could compare the esti-

Application to real genotype data
We applied ProPCA to genotype data from Phase 1 of the 1000 Genomes project [24]. On a dataset of 1,092 individuals and 442,350 SNPs, ProPCA computes top forty PCs that are qualitatively indistinguishable from those obtained by running a full SVD (S5 Fig). Furthermore, we tested each method's ability to compute 5-40 PCs on this dataset, which is small enough (about 450,000 SNPs and 1,092 individuals) that we could compute the full SVD. We tested all methods at increments of 5 PCs from 5 to 40 PCs and found that all methods performed well across the range tested (MEV ≥ 0.95) (S1 Table). We also applied ProPCA to genotype data from the UK Biobank [22] consisting of 488,363 individuals and 146,671 SNPs after QC. ProPCA computes the top five PCs in about 30 minutes, and the resulting PCs reflect population structure within the UK Biobank, consistent with previous studies [22] (Fig 2a).

Application to scans for selection
Since the PCs in ProPCA are computed as maximum likelihood estimates under a probabilistic model, ProPCA provides a natural framework for applications such as hypothesis testing. By utilizing the statistical assumptions and set up provided by the ProPCA model, we developed a statistical test to search for SNPs that are not well-modeled by the ProPCA model as a means of discovering signals of natural selection (Methods and S1 Text). This statistic relies on the observation that a SNP evolving under positive selection is expected to exhibit differentiation in the frequencies of its alleles that is extreme compared to a typical SNP that is evolving neutrally [25].
Since deviations from the ProPCA model can occur for reasons unrelated to selection, we filtered out SNPs with high rates of missingness, low minor allele frequency (MAF), and presence in regions of long-range LD [26]. The Pearson correlation coefficient between birth location coordinates and the PC score for each individual reveals that the estimated PCs capture geographic structure within the UK (Fig 2b, S7 Fig, S2 Table). We used these PCs to perform a selection scan on a larger set of 516,140 SNPs and report SNPs that are genome-wide significant after accounting for the number of SNPs as well as the number of statistics tested (p-value < 0.05/(6 × 516,140); we use 6 to account for the additional combined test statistic that we describe later). We ensured that the selection statistic for each PC was well-calibrated against a χ²₁ distribution (S8 Fig), and genomic inflation (λ_GC) values for each of the PCs showed no substantial inflation (S3 Table). While our statistic is closely related to a previously proposed statistic to detect selection on PCs (S1 Text), we found that our proposed statistic is better calibrated (S3 Table).
To validate our findings, we utilized birth location coordinates for each individual and assigned them to geographical regions in the UK as defined in the Nomenclature of Territorial Units for Statistics level 3 (NUTS3) classification. We performed a test of association between the allele frequency of the top SNP in each of our novel loci with geographical regions and confirmed that SNPs identified in our selection scan show differences in allele frequencies across specific geographical regions (S6 Table).
One of the novel genome-wide significant loci is RPGRIP1L. RPGRIP1L is a highly conserved gene that encodes a protein that localizes at primary cilia and is important in development [27]. Mutations in this gene have been implicated with neurological disorders such as Joubert syndrome and Meckel syndrome [28], conditions that sometimes also result in additional symptoms such as eye diseases and kidney disorders [29]. The SNP with the most significant p-value in our scan in RPGRIP1L, rs61747071, is a missense loss-of-function mutation A229T that has been shown to lead to photoreceptor loss in ciliopathies [30].
We created an additional variant of our selection statistic that tests for SNPs that are not well-modeled by a linear combination of the first five PCs by summing the per-PC χ²₁ statistics, resulting in a chi-squared statistic with five degrees of freedom. Combining signals across PCs has previously been shown to boost power in association testing [31]. We verified that the resulting combined statistic is also calibrated (S8 Fig, S3 Table). Under this combined statistic, we recover the majority of the loci found on each individual PC, but we also discover four additional novel loci: AMPH (rs118079376, p = 2.64 × 10⁻¹⁰), TLR4 (rs4986790, p = 7.60 × 10⁻¹²), rs9856661 (p = 6.46 × 10⁻⁹), and rs116352364 (p = 5.24 × 10⁻¹¹) (S7 Table).
TLR4 is a member of the toll-like receptor family. The TLR gene family is known to play a fundamental role in pathogen recognition and activation of innate immunity, but TLR4 in particular is involved with proinflammatory cytokines and has a pro-carcinogenic function [32]. The SNP with the most significant p-value at our TLR4 locus is rs4986790, a missense D299G mutation and D259G mutation on two different transcripts for the TLR4 gene. The D299G mutation is of particular interest as this mutation is strongly correlated with increased infection by Plasmodium falciparum, a parasite that causes malaria [33,34].
To better understand the signals of selection that the proposed statistic is sensitive to, we compared the time-scale of our selection hits to those from a recent study designed to detect recent positive selection [35] (S9 Fig). Using estimates of allelic ages for variants in the 1000 Genomes Project [36], we find that the variants detected by the proposed statistic tend to be older on average than those found to have a singleton-density score > 4 in Field et al. 2016 (average age of 19,007 generations for our statistic vs 11,944 generations for the SDS statistic, using the combined mutation and recombination clock). We caution that the interpretation of these results is complicated by the considerable uncertainty in the allelic age estimates. Further, the timing of an episode of selection might post-date the age of the mutation, for example, when selection acts on standing variation. Finally, there is substantial variation in the age estimates of the hits: while the average age is around 19,000 generations, 17 of the 42 putatively selected variants have ages less than 5,000 generations. This suggests that the proposed statistic could be sensitive to both recent and older selection where the resulting allele frequencies are not well-modeled by the PCs.
To further illustrate how the ability to compute PCs in large samples is necessary for biological discovery, we analyzed how many selection signals we discover as a function of sample size by randomly subsampling individuals from the White British population and repeating our analyses (S10 Fig). We find that sample sizes larger than 150,000 individuals are required to retain over 80% of the total signals of selection we discover.

Discussion
We have presented ProPCA, a scalable method for PCA on genotype data that relies on performing inference in a probabilistic model. Inference in this model consists of an iterative procedure that uses a fast matrix-vector multiplication algorithm. We have demonstrated its accuracy and efficiency across diverse settings. Further, we have demonstrated that ProPCA can accurately estimate population structure within the UK Biobank dataset and that this structure can be leveraged to identify targets of recent putative selection.
The algorithm that we employ here to accelerate the EM updates is of independent interest. Beyond PCA, several algorithms that operate on genotype data perform repeated matrix-vector multiplications with the matrix of genotypes. For example, association tests and permutation tests can be formulated as computing a matrix-vector product in which the matrix is the genotype matrix and the vector consists of phenotype measurements. Indeed, the algorithm has been used to accelerate heritability estimation [37]. The idea that SVD computations can leverage fast matrix-vector multiplication to obtain computational efficiency is well known in the numerical linear algebra literature [38], and the algorithms [13,38] implemented in other PCA methods could also exploit these ideas for additional efficiency. Alternative approaches to fast matrix-vector multiplication in the genetics setting rely on sparsity of the genotype matrix. It is important to note that the speedup obtained from the Mailman algorithm does not rely on sparsity and applies even to dense matrices. It would be of interest to contrast sparse multiplication with the Mailman algorithm, and to investigate combining the two approaches so as to leverage both the sparsity and the discrete nature of the genotype matrix.
It is likely that different algorithms and implementations to compute PCs (and, more generally, to infer population structure) will be appropriate depending on the specific application, and the choice involves a number of trade-offs. While ProPCA is computationally efficient, its use of the Mailman algorithm results in a larger memory footprint relative to other methods. The probabilistic formulation underlying ProPCA allows the algorithm to be generalized in several directions. One direction is the application of PCA in the presence of missing data, which often arises when analyzing multiple datasets; we have explored an extension of the ProPCA model to this setting (S1 Text, S11 Fig). While this approach is promising, a limitation of the use of the Mailman algorithm within ProPCA is the requirement of discrete genotypes, which prevents ProPCA from being directly applied to dosages (e.g., imputed genotypes). Another potential future direction is modeling linkage disequilibrium and incorporating rare variants, which have the potential to reveal structure that is not apparent from the analysis of common SNPs [39,40]. Current applications of PCA remove correlated SNPs and singletons, though this has been shown to discard information [12]. One possible way to incorporate LD would be to leverage the connection between haplotype copying models [41] and the multivariate normal model of PCA [42], or to apply a whitening transformation [4]. Further, the observation model can be modified to account for the discrete nature of genotypes [3,43]. A number of non-linear dimensionality reduction methods have also been proposed recently [44,45]; a comparison of these methods to ProPCA, in terms of the statistical structure they aim to detect, their ability to handle missing data, and their computational scalability, would be of great interest.
Finally, leveraging fine-scale population structure inferred from large-scale data to study recent positive selection in human history is an important direction for future work. While the probabilistic model underlying ProPCA leads to a natural model for testing for selection, other hypotheses about models of selection could lead to other tests of selection. The challenge is to design realistic statistical models of population structure while enabling inference at scale.

Principal components analysis (PCA)
We observe genotypes from n individuals at m SNPs. The genotype vector for individual i is a length-m vector denoted g_i ∈ {0, 1, 2}^m. The j-th entry of g_i denotes the number of minor alleles carried by individual i at SNP j. Let G be the m × n genotype matrix G = [g_1 … g_n].
Let Y denote the matrix of standardized genotypes obtained by centering and rescaling each row of the genotype matrix G so that Σ_j y_{i,j} = 0 and Σ_j y²_{i,j} = 1 for all i ∈ {1, …, m}. Principal components analysis (PCA) [11] attempts to find a low-dimensional linear transformation of the data that maximizes the projected variance or, equivalently, minimizes the reconstruction error. Given the m × n matrix Y of standardized genotypes and a target dimension k, PCA finds an m × k matrix W with orthonormal columns and an n × k matrix Z that minimize the reconstruction error ‖Y − WZ^T‖_F, where ‖A‖_F = √(Σ_{i,j} A²_{i,j}) is the Frobenius norm of the matrix A. To solve the PCA problem, we perform a singular value decomposition (SVD) of the standardized genotype matrix, Y = USV^T, and set Ŵ = U_k, where U_k is the m × k matrix containing the k columns of U corresponding to the k largest singular values of Y.
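As a concrete illustration, the SVD route above can be sketched in NumPy on a toy genotype matrix; the sizes and data here are made up for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy genotype matrix: m SNPs x n individuals, entries in {0, 1, 2}.
m, n, k = 100, 30, 5
G = rng.integers(0, 3, size=(m, n)).astype(float)

# Standardize each row: zero mean and unit sum of squares, as in the text.
Y = G - G.mean(axis=1, keepdims=True)
norms = np.linalg.norm(Y, axis=1, keepdims=True)
Y = Y / np.where(norms == 0, 1.0, norms)   # guard against monomorphic SNPs

# PCA via SVD: W-hat holds the k left singular vectors with largest
# singular values; Z-hat holds the corresponding PC scores.
U, S, Vt = np.linalg.svd(Y, full_matrices=False)
W_hat = U[:, :k]                 # m x k, orthonormal columns (top k PCs)
Z_hat = Vt[:k].T * S[:k]         # n x k scores minimizing ||Y - W Z^T||_F
```

By the Eckart-Young theorem, this truncated SVD attains the smallest possible Frobenius reconstruction error among all rank-k factorizations.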

Probabilistic PCA
PCA can be viewed as a limiting case of the probabilistic PCA model [10,19,20]. Probabilistic PCA models the observed data y_i ∈ ℝ^m, i ∈ {1, …, n}, as a linear transformation of a k-dimensional latent random variable x_i (k ≤ m) with additive Gaussian noise. Denoting the linear transformation by the m × k matrix C and the (m-dimensional) noise by ε_i (with isotropic covariance matrix σ²I_m), the generative model can be written as:

y_i = C x_i + ε_i,  x_i ~ N(0, I_k),  ε_i ~ N(0, σ²I_m)

The maximum likelihood estimate of the matrix C in this model has been shown to span the k-dimensional principal subspace of the data Y = [y_1, …, y_n] [46].
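A minimal sketch of the generative model, using a synthetic loading matrix C of our own choosing; with low noise, the top-k left singular subspace of the sampled data should align with the column span of C, as the cited result predicts.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 50, 200, 3
sigma = 0.1

# Generative model: y_i = C x_i + eps_i with x_i ~ N(0, I_k) and
# eps_i ~ N(0, sigma^2 I_m).  C is a synthetic loading matrix.
C = rng.normal(size=(m, k))
X = rng.normal(size=(k, n))
Y = C @ X + sigma * rng.normal(size=(m, n))

# Check alignment between the top-k left singular subspace of Y and
# the column span of C; overlap = 1 means identical subspaces.
U = np.linalg.svd(Y, full_matrices=False)[0][:, :k]
Qc = np.linalg.qr(C)[0]
overlap = np.linalg.norm(U.T @ Qc) / np.sqrt(k)
```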

EM algorithm for PCA
Since probabilistic PCA is a probabilistic model endowed with latent variables, the EM algorithm presents a natural approach to compute the maximum likelihood estimates of the model parameters (C, σ²) [19,20]. The EM algorithm for learning the principal components can be derived as a special case of the EM algorithm for the probabilistic PCA model in which the variance of the observation noise σ² tends to zero, leading to these updates:

E-step: X = (C^T C)⁻¹ C^T Y
M-step: C ← Y X^T (X X^T)⁻¹

Here X = [x_1 … x_n] is a k × n matrix and Y = [y_1 … y_n] is an m × n matrix. Noting that all matrix inversions involve only k × k matrices, the computational complexity of the E-step is O(k²m + k³ + k²n + mnk) while that of the M-step is O(k²n + k³ + k²m + mnk). For small k and large m, n, the per-iteration runtime complexity is O(mnk). Thus, the EM algorithm provides a computationally efficient estimator of the top k PCs when the number of PCs to be estimated is small.
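The two updates above translate directly into NumPy. The sketch below is illustrative: the random initialization, fixed iteration count, and final orthonormalization are our own choices, not prescriptions from the text.

```python
import numpy as np

def em_pca(Y, k, iters=50, seed=0):
    """EM updates for the sigma^2 -> 0 limit of probabilistic PCA.
    Per-iteration cost is O(mnk) for small k; only k x k systems are
    solved.  Returns an orthonormal basis for the estimated principal
    subspace."""
    m, n = Y.shape
    C = np.random.default_rng(seed).normal(size=(m, k))
    for _ in range(iters):
        # E-step: X = (C^T C)^{-1} C^T Y
        X = np.linalg.solve(C.T @ C, C.T @ Y)
        # M-step: C = Y X^T (X X^T)^{-1}
        C = np.linalg.solve(X @ X.T, X @ Y.T).T
    Q, _ = np.linalg.qr(C)   # orthonormalize the converged loadings
    return Q
```

On data with a clear spectral gap, a few dozen iterations suffice for the estimated subspace to match the top-k left singular subspace of Y.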

Sub-linear time EM
The key bottleneck in the EM algorithm is the multiplication of the matrix Y with the matrices computed in the E-step and the M-step. Let ḡ and s denote the vectors of sample means and standard deviations of the genotypes at each SNP. Assuming no entry of s is zero (we remove SNPs that have no variation across samples), the matrix of standardized genotypes Y can be written as:

Y = diag(s)⁻¹ G − ρ 1_n^T

Here diag(x) is an operator that constructs a diagonal matrix with the entries of x along its diagonal, 1_n is a length-n vector with each entry equal to one, and ρ is a length-m vector with ρ_i = ḡ_i/s_i for i ∈ {1, …, m}. The EM updates can then be written as:

E-step: X = Ẽ G − (Eρ) 1_n^T, where E = (C^T C)⁻¹ C^T and Ẽ = E diag(s)⁻¹
M-step: C ← diag(s)⁻¹ (G M) − ρ (1_n^T M), where M = X^T (X X^T)⁻¹

Here Ẽ can be computed in time O(km), while (Eρ)1_n^T and ρ(1_n^T M) can be computed in time O(nk + mk).
The key bottleneck in the E-step is the multiplication of the genotype matrix G by each of the k rows of the matrix Ẽ; in the M-step, it is the multiplication of G by each of the k columns of the matrix M. Leveraging the fact that each element of the genotype matrix G takes values in the set {0, 1, 2}, we can improve the complexity of these multiplications from O(nmk) to O(nmk / max(log₃ n, log₃ m)) by extending the Mailman algorithm [21]. For additional implementation details, see S1 Text.
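The decomposition of Y can be checked numerically. The sketch below (toy sizes, random data) verifies that multiplying by Y on either side reduces to a multiply by the raw genotype matrix G plus O(n + m) extra work, without ever materializing Y.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 200, 500
G = rng.integers(0, 3, size=(m, n)).astype(float)

g_bar = G.mean(axis=1)          # per-SNP mean genotype
s = G.std(axis=1)               # per-SNP standard deviation (assumed > 0)
rho = g_bar / s                 # rho_i = g_bar_i / s_i

# Naive approach: materialize the standardized matrix Y.
Y = (G - g_bar[:, None]) / s[:, None]

# Structured approach: Y = diag(s)^{-1} G - rho 1_n^T, so the expensive
# part of each product is a single multiply by the raw matrix G.
v = rng.normal(size=n)
f = rng.normal(size=m)
right = (G @ v) / s - rho * v.sum()   # Y v
left = (f / s) @ G - f @ rho          # f^T Y
```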

The Mailman algorithm
In the M-step, we need to compute c = Ab for an arbitrary real-valued vector b and an m × n matrix A whose entries take values in {0, 1, 2}. We assume that m = ⌈log₃(n)⌉. Naive matrix-vector multiplication takes O(⌈log₃(n)⌉ n) time.
The Mailman algorithm decomposes A as A = U_n P. Here U_n is the m × r matrix whose columns contain all r = 3^m possible vectors over {0, 1, 2} of length m. We set entry P_{i,j} to 1 if column j of A matches column i of U_n, i.e., A^{(j)} = U_n^{(i)}. The decomposition of any matrix A into U_n and P can be computed in O(nm) time. Given this decomposition, the desired product c is computed in two steps, each of which has O(n) time complexity [21]: first z = Pb, then c = U_n z. The Mailman algorithm provides computational savings in settings where the cost of computing the decomposition of A is offset by the gains from repeated multiplications involving A.
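A compact NumPy sketch of the two steps. The digit encoding here (row i of A is base-3 digit i of the column index) is an implementation choice of ours; U_n is never materialized, since its recursive block structure lets U_n z be evaluated in O(3^m) time.

```python
import numpy as np

def mailman_decompose(A):
    """A (m x n, entries in {0,1,2}) equals U_n @ P, where the columns
    of U_n enumerate all 3^m length-m vectors over {0,1,2}.  P has a
    single 1 per column, so we store only its position: read each
    column of A as a base-3 number (digit i = row i).  Cost: O(mn)."""
    m, _ = A.shape
    return (3 ** np.arange(m)) @ A

def un_times(z, m):
    """Compute U_n @ z in O(3^m) time via the recursive structure of
    U_n: grouping columns by their most significant base-3 digit d
    leaves three copies of U_{m-1} on top and a row of d's below."""
    if m == 0:
        return np.zeros(0)
    z = z.reshape(3, -1)                      # blocks for digits 0, 1, 2
    top = un_times(z.sum(axis=0), m - 1)      # rows 0 .. m-2
    return np.append(top, z[1].sum() + 2 * z[2].sum())  # row m-1

def mailman_matvec(idx, b, m):
    """c = A b in two steps: z = P b (scatter-add), then c = U_n z."""
    z = np.zeros(3 ** m)
    np.add.at(z, idx, b)                      # z = P @ b
    return un_times(z, m)
```

The one-time decomposition is then amortized over the repeated multiplications performed across EM iterations.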
Similarly, in the E-step, we need to compute f^T A. This could be done in O(⌈log₃(n)⌉ n) time by computing A^T f using a decomposition of A^T. A drawback of this approach is the need to store both decompositions, which would double the memory requirements of the algorithm. Instead, we propose a novel variant of the Mailman algorithm that can compute f^T A in O(⌈log₃(n)⌉ n) time using the same decomposition of A (S1 Text).
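One way such a left-multiplication variant can work (a sketch of our own, reusing the same base-3 column encoding of A so that no decomposition of A^T is needed): since f^T A = f^T U_n P, it suffices to build the vector U_n^T f of all 3^m weighted digit sums and then gather the entries selected by P.

```python
import numpy as np

def unT_times(f):
    """Compute U_n^T f (length 3^m) in O(3^m) time: entry k of the
    result is sum_i digit_i(k) * f[i], where digit_i(k) is the i-th
    base-3 digit of k.  Built up one digit (one row of A) at a time."""
    v = np.zeros(1)
    for fi in f:
        v = np.concatenate([v, v + fi, v + 2 * fi])
    return v

# f^T A = (U_n^T f)[idx]: each column j of A equals column idx[j] of
# U_n, so (f^T A)_j is simply entry idx[j] of U_n^T f.
rng = np.random.default_rng(3)
A = rng.integers(0, 3, size=(4, 12))
f = rng.normal(size=4)
idx = (3 ** np.arange(4)) @ A   # base-3 encoding of each column of A
left = unT_times(f)[idx]        # f^T A, same decomposition as for A b
```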
Additional details on efficient implementation of the EM and Mailman algorithms can be found in S1 Text.

Simulations
We simulated genotypes at m independent SNPs across n individuals in which a single ancestral population diverged into q sub-populations with drift proportional to the
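The sentence above is truncated in this copy, but divergence of sub-populations from a single ancestral population with drift controlled by F_ST is commonly simulated with the Balding-Nichols model. The sketch below assumes that model and should be read as illustrative rather than as the paper's exact procedure; the function name and parameters are our own.

```python
import numpy as np

def simulate_genotypes(m, n, q, fst, seed=0):
    """Simulate an m x n genotype matrix for n individuals split evenly
    across q populations that diverged from one ancestral population.
    Balding-Nichols model (illustrative assumption): population allele
    frequencies are Beta-distributed around the ancestral frequency,
    with dispersion controlled by F_ST."""
    rng = np.random.default_rng(seed)
    p_anc = rng.uniform(0.05, 0.95, size=m)            # ancestral freqs
    a = p_anc * (1 - fst) / fst
    b = (1 - p_anc) * (1 - fst) / fst
    pops = np.repeat(np.arange(q), n // q)             # population labels
    P = rng.beta(a[:, None], b[:, None], size=(m, q))  # per-pop freqs
    # Each genotype is a Binomial(2, p) draw for the individual's population.
    return rng.binomial(2, P[:, pops]), pops
```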

Benchmarking
To compare estimated PCs to reference PCs, we computed the mean of explained variances (MEV), a measure of the overlap between the subspaces spanned by the two sets of PCs. Two different sets of k principal components each span a k-dimensional column space. A metric for the performance of a PCA algorithm against a baseline is how much these column spaces overlap, obtained by projecting the eigenvectors of one subspace onto the other and computing the mean length of the projected eigenvectors. If we have a reference set of PCs (v_1, v_2, …, v_k) against which we wish to evaluate the performance of a set of estimated PCs (u_1, u_2, …, u_k), then:

MEV = (1/k) Σ_{j=1}^{k} ‖U^T v_j‖₂

where U = [u_1 … u_k] is a matrix whose column vectors are the PCs which we are testing.
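For orthonormal inputs, the definition above is a one-liner; the helper below is an illustrative implementation.

```python
import numpy as np

def mev(U_est, V_ref):
    """Mean of explained variances: project each reference PC v_j onto
    the column space of the estimated PCs and average the projected
    lengths.  Both inputs are m x k with orthonormal columns; MEV = 1
    means the two k-dimensional subspaces coincide."""
    return np.mean(np.linalg.norm(U_est.T @ V_ref, axis=0))
```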
In practice, when attempting to compute the top k PCs, ProPCA was found to converge faster when computing l PCs for some l > k and retaining the top k PCs. The reason is that, in the initial iterations of the EM algorithm, the estimates of the top PCs are noisy. We set l = 2k in our experiments. While ProPCA could be run to convergence, we found that running it for k iterations already gave accurate results across the range of parameters considered. Our empirical results are consistent with our theoretical result that the EM algorithm converges exponentially fast in the spectral norm of the error matrix [38,47] (S1 Text).
We compared ProPCA to the current state-of-the-art methods for computing PCs from genotype data: the SVD implementation in PLINK (PLINK_SVD [23]), FastPCA [12], FlashPCA2 [13], bigsnpr [14], PLINK2 [16], and TeraPCA [15]. PLINK_SVD refers to an exact computation of PCs using the full singular value decomposition as implemented in the PLINK package. FastPCA [12] is an implementation of the blanczos method, a randomized subspace iteration method [38], while FlashPCA2 [13] is an implementation of the implicitly restarted Arnoldi method [48]. PLINK2 [16] and TeraPCA [15] are reimplementations of the FastPCA algorithm, while bigsnpr [14] is a reimplementation of the FlashPCA2 algorithm designed to utilize disk space as a file backend. We used default parameters for all methods unless otherwise stated. For benchmarking of bigsnpr, we included the creation of the file backend in the timing, as it is required to run any of the computations in the backend. Furthermore, we excluded bigsnpr from some experiments due to the inability of its PCA function to natively handle missing data and monomorphic SNPs. All experiments were performed on an Intel(R) Xeon(R) CPU 2.10GHz server with 128 GB RAM, restricted to a single core, capped to a maximum runtime of 100 hours and a maximum memory of 64 GB.

Selection scan
The White British cohort was identified by the UK Biobank as participants who self-identified as 'British' within the broader-level group 'White' while having similar ancestral backgrounds [22]. For our selection scan, we further filtered the 409,634 individuals in the White British subset to obtain an unrelated subset by removing related individuals (individuals with kinship coefficients greater than 0.0625, i.e., third-degree relatedness, to any other individual, as determined by KING [49]). After removing these individuals, we obtained an unrelated White British subset containing 276,736 individuals.
We inferred the top five PCs using ProPCA on all 276,736 unrelated White British individuals and a filtered SNP set containing 146,671 SNPs (the UK Biobank SNP set). The UK Biobank SNP set consists of SNPs on the UK Biobank Axiom array from which SNPs were removed if they had missingness rates greater than 1.5%, minor allele frequencies (MAF) less than 1%, or fell in regions of long-range linkage disequilibrium. The remaining SNPs were then pruned to pairwise r² less than 0.1 using windows of 1000 base pairs (bp) and a step size of 80 bp.
We developed a selection statistic to search for SNPs whose variation is not well-explained by the ProPCA model, closely related to the selection statistic proposed in [12] (S1 Text). Under the probabilistic PCA model, the normalized genotype matrix is modeled by a low-rank approximation plus Gaussian noise, Y = CX + ε. Given our low-rank approximation of the genotype matrix, Ŷ = CX, we have the residual Y − Ŷ = ε. For a SNP j, the Gaussian noise ε_j ~ N(0, σ²I_n). Projecting this residual onto a PC results in a univariate Gaussian with zero mean and constant variance across SNPs; this variance can be estimated as the sample variance σ̂² of the resulting projections across SNPs. In summary, for SNP j and the k-th PC, we propose as our statistic the squared projection of the residual of SNP j onto the k-th PC divided by σ̂², which is distributed as χ²₁ under the null. The projection of the residual onto a PC allows the signal of selection to be interpreted in the context of deviations from the ancestry captured by that specific PC.
Furthermore, a variant of this statistic, which we call the combined statistic, can be constructed from the selection statistics computed on the individual PCs using the observation that the resulting chi-squared statistics are independent of each other: summing the individual PC statistics yields a combined statistic that follows a chi-squared distribution with one degree of freedom per PC used.
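A sketch of how the per-PC and combined statistics might be computed. The exact residual and scaling details are our reading of the description above; the function selection_stats and its arguments are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.stats import chi2

def selection_stats(Y, C, X):
    """Assumed form of the described statistics: project each SNP's
    residual onto each (unit-normalized) PC-score vector, scale by the
    sample variance of those projections across SNPs (chi-squared with
    1 df per PC), and sum across the K PCs for the combined
    chi-squared statistic with K degrees of freedom."""
    K = X.shape[0]
    R = Y - C @ X                                   # residuals, m x n
    Xu = X / np.linalg.norm(X, axis=1, keepdims=True)
    proj = R @ Xu.T                                 # m x K projections
    stat = proj**2 / proj.var(axis=0)               # per-PC chi2_1 stats
    p_per_pc = chi2.sf(stat, df=1)                  # per-PC p-values
    p_combined = chi2.sf(stat.sum(axis=1), df=K)    # combined statistic
    return p_per_pc, p_combined
```

Under the null (residuals that are pure Gaussian noise), both sets of p-values should be approximately uniform, which is the calibration property checked in the text.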
Using the results from the PCA on the UK Biobank SNP set, we performed our selection scan on a different set of 516,140 SNPs. We generated this set by removing SNPs that were multi-allelic, had genotyping rates less than 99%, had minor allele frequencies less than 1%, or were not in Hardy-Weinberg equilibrium (p < 10⁻⁶).
We performed an allele frequency test for each novel SNP using the Nomenclature of Territorial Units for Statistics level 3 (NUTS3) classification of regions for the UK. The NUTS3 classification defines non-overlapping borders for each region in the UK, allowing us to uniquely map each individual to a region using their birth location coordinates by checking which NUTS3 region they fell into. For each of our novel loci, we then performed a two-tailed Z-test comparing each region's allele frequency against that of all other regions. We corrected for multiple testing using the Bonferroni correction.
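The region-versus-rest comparison can be sketched as a standard two-proportion Z-test on allele counts (a minimal illustration with names of our choosing; the Bonferroni correction then multiplies each p-value by the number of tests):

```python
from math import erfc, sqrt

def two_prop_z(c1, n1, c2, n2):
    """Two-tailed two-proportion Z-test on allele counts: one region
    (c1 alt alleles out of n1 observed alleles) vs all other regions
    pooled (c2 of n2). Sketch of the frequency test described in the text."""
    p1, p2 = c1 / n1, c2 / n2
    p = (c1 + c2) / (n1 + n2)                   # pooled frequency under H0
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # standard error under H0
    z = (p1 - p2) / se
    pval = erfc(abs(z) / sqrt(2))               # two-tailed normal p-value
    return z, pval
```

For example, 30/100 alt alleles in one region vs 10/100 elsewhere gives z ≈ 3.54 and a p-value well below 0.001 before correction.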
Supporting information

S1 Text. Supplementary information. Section S1 contains additional information on the White British analysis. Section S2 compares our selection statistic to other existing selection statistics. Section S3 explores the time-scales of our selection hits. Section S4 explores the application of ProPCA to missing data. Section S5 details the implementation of ProPCA. Section S6 explains our variant of the Mailman algorithm for left multiplication. Section S7 details the convergence of ProPCA in the noiseless setting. Section S8 explores the contribution of the Mailman algorithm to scalability. (PDF)

S4 Figb shows a similar result, but with 100,000 individuals and SNPs varying from 100,000 to 1,000,000 over a single run. All methods were run using default settings. We were unable to run bigstatsr for the SNPs experiment due to a bug that causes the method to crash in the presence of monomorphic SNPs. (PDF)

To further illustrate the importance of large sample sizes for biological discovery, we analyzed how many selection signals we could discover as a function of sample size. We randomly subsampled 10,000, 50,000, 100,000, and 200,000 individuals from the White British population and performed our selection scan. The x-axis denotes sample size in thousands and the y-axis denotes the proportion of total hits discovered. (PDF)

S11 Fig. ProPCA infers more accurate principal components (PCs) in the presence of missing data compared to imputed genotypes. We compared the MEV of eigenvectors calculated from both modes of ProPCA with ground-truth eigenvectors obtained from a full SVD. We evaluated performance at 5% and 20% random missing values at 5 (S11a Fig) and 10 principal components (S11b Fig). We additionally compared ProPCA to mean imputation followed by a full SVD (S11c Fig).
The data consist of simulated genotypes of 50,000 SNPs from 10,000 individuals, drawn from 5 populations for 5 PCs and 10 populations for 10 PCs, separated by a range of F_ST values. This process was repeated ten times to measure variability. Error bars denoting one standard deviation are shown for each point. (PDF)

S12 Fig. The Mailman matrix-vector multiplication contributes to the scalability of ProPCA. We compare the time taken to compute the top five principal components by the EM algorithm underlying ProPCA when used with and without the Mailman algorithm. We performed these comparisons on simulated genotype data containing 100,000 SNPs, six subpopulations, F_ST = 0.10, and sample sizes varying from 10,000 to 1,000,000 individuals. S12 Figa compares the runtime of the EM algorithm with the Mailman matrix-vector multiplication to an EM algorithm in which the genotypes are represented as a matrix of doubles (EM1). With this representation, the EM algorithm could only be applied to sample sizes of at most 70,000 individuals due to memory constraints. S12 Figb compares the runtime of the EM algorithm with the Mailman matrix-vector multiplication to an EM algorithm in which genotypes are stored in a compact representation (EM2). (PDF)

S1 Table. Comparison of the accuracy of methods for estimating principal components on genotype data from the 1000 Genomes Phase 1 project. We compared the accuracy of ProPCA, bigsnpr, FlashPCA2, PLINK2, TeraPCA, and FastPCA when applied to the 1,092 individuals in the 1000 Genomes Phase 1 project. We report MEV averaged over ten trials. FastPCA gave a segmentation fault when estimating ≥ 35 PCs. We ran all methods using default settings. (PDF)

S2 Table. Pearson correlation between principal components and birth location coordinates in the unrelated White British. Pearson correlation between the principal components and birth location coordinates reveals that the principal components capture geographic variation.
P-values from the Pearson correlation t-test are shown in parentheses on the right. (PDF)

S3 Table. Principal component selection scan reveals 12 unique loci under selection across the top five principal components. We obtained 59 selection hits across the first five principal components of the unrelated White British subset of the UK Biobank. We clustered these hits into 12 unique loci by aggregating all significant hits into 1 Mb windows centered on the most significant hits. Other genes with significant hits within the 1 Mb window are listed in the last column. (PDF)

S6 Table. Allele frequency tests between NUTS3 regions at novel loci confirm differences between geographic regions. We performed a two-tailed proportion test for our novel loci, comparing the allele frequency in each individual region from the NUTS3 classification of the United Kingdom against the frequency in every other region. We corrected the p-values using the Bonferroni correction (11 loci × 163 regions). The corrected p-values for regions passing the significance threshold are shown in the table. (PDF)

S7 Table. Combined selection statistic across the top five principal components reveals four additional novel loci. We discovered four additional novel loci using our combined selection statistic from the first five principal components. Loci not found by the individual PC selection statistics are denoted by an asterisk in the rsid column. The chi-squared statistic (one degree of freedom) for each principal component is shown in the last five columns of the table. (PDF)

S8 Table. Selection hits are associated with phenotypes in the UK Biobank. We ran genome-wide association tests for 64 phenotypes in the full release of the UK Biobank for each of our loci. Phenotypes shown reached the genome-wide significance level (p < 5 × 10⁻⁸). (PDF)

S9 Table. Selection hits are associated with phenotypes from the Global Biobank Engine.
We queried the Global Biobank Engine for associations at our loci. The Global Biobank Engine contains GWAS results for many more phenotypes than those available in the UK Biobank. Phenotypes shown are significant at the genome-wide significance level (p < 5 × 10⁻⁸).