Consequences of SNP codings, PCA variants, and PCA graphs for elucidating population structure

PCA is frequently used to display and discover patterns in SNP data from humans, animals, plants, and microbes—especially to elucidate population structure. Given the popularity of PCA, one might expect that PCA is understood well and applied effectively. However, our literature survey of 125 representative articles that apply PCA to SNP data shows that three choices have usually been made poorly: SNP coding, PCA variant, and PCA graph. Accordingly, we offer several simple recommendations for effective PCA analysis of SNP data. The ultimate benefit from informed and optimal choices of SNP coding, PCA variant, and PCA graph is expected to be discovery of more biology, and thereby acceleration of medical, agricultural, and other vital applications.

Introduction likewise a SNP near the origin has small interactions with all Items. More exactly, these 72 interpretive principles address the interaction structure captured in the IPC1-IPC2 plane of a 73 biplot. 74 By definition, S×I interactions inherently involve both SNPs and Items, so a biplot is 75 required for PCA to display S×I interactions. By way of preview, although DC-PCA has this 76 single and straightforward set of interpretive principles that applies to any and all datasets, it has 77 this advantage uniquely: All other PCA variants have multiple and complicated sets of 78 interpretive principles that vary from dataset to dataset.

80
SNP codings 81 Consider a data matrix comprised of a number of SNPs observed for a number of Items, 82 where "Items" is a generic term for the samples, such as individual humans, horses, cultivars of 83 wheat, or races of a pathogen. The original reads of nucleotides (A, T, C, and G) must be coded 84 numerically for PCA, such as a polymorphism of T and C being coded as 0 and 1. We begin 85 with the simplest case of biallelic SNPs. 86 One option, here called SNP coding rare=1, is to code the rare allele as 1 and the 87 common allele as 0. This coding is of special interest for reasons that emerge momentarily. 88 Another option, here called SNP coding common=1, is the opposite: to code the 89 common allele as 1 and the rare allele as 0. This coding is the default in TASSEL, which is 90 widely used for crop plants ([3]; Peter Bradbury personal correspondence, 23 April 2018). 91 The third and final possibility, here called SNP coding mixed, is to code the alleles in 92 some other manner that yields a mixture of rare and common alleles as 1 (and as 0). For 93 example, the variant call format (VCF) distinguishes one Item as the reference genome, and then 94 for each SNP it assigns 0 to the allele of the reference genome and 1 to the other non-reference 95 allelle. VCF is popular because it was developed for the 1000 Genomes Project in human 96 genetics [4], and subsequently has been adopted widely. 97 The SNP coding rare=1 can be generalized for SNP datasets having more than the 2 98 codes of biallelic data. If need be, transpose the data to a set of consecutive integer values 99 starting with 0, such as 0, 1, 2, and 3. Then recode each SNP to give the rarest allele the highest 100 value, and so on, and lastly give the most common allele the lowest value of 0. Similarly, for a 101 diploid species with three codes-one for each of the two homozygotes and another for the 102 heterozygote-if need be, transpose the data to 0 and 2 for the homozygotes and 1 for the 103 heterozygote. Then recode each SNP to give the rarest homozygote the value 2 and the most 104 common homozygote the value 0. These ideas can be elaborated for polyploids such as 105 hexaploid (AABBDD) bread wheat, Triticum aestivum (L.). The other two codings can also be 106 generalized. order to make the ones readily visible. The convention adopted here is to number matrix 110 columns from left to right, and number matrix rows from top to bottom, starting with 1 for the 111 first column and the first row. This simple example is taken from a blogpost by Morrison et al. 112 [5]; also see [6]. The concentration of ones along the matrix diagonal constitutes a single 113 gradient with evident joint structure that involves both Items and SNPs. for the 24 Items, and red dots for the 20 SNPs. The single gradient has been distorted into an 136 arch with its ends involuted toward the middle for Items 1 to 24, and likewise for SNPs 1 to 20. 137 Provided that one knows about this so-called arch distortion, the gradient is still apparent: Both 138 arches move clockwise from Item 1 to 24 and from SNP 1 to 20. Ecologists have known for 139 decades about this arch distortion, also called the horseshoe effect [7,8]. However, an extensive    The statistical meanings of these three sources of variation are fundamentally different.

247
Consider a data matrix with p SNPs and n Items. The SNP main effects concern the means 248 across Items for each SNP, so they constitute a vector of length p. Likewise, the Item main 249 effects concern the means across SNPs for each Item, so they constitute a vector of length n. By  The biological meanings of these three sources are as follows in the present context of a 257 SNPs-by-Items data matrix. For a given SNP, its mean across Items is simply the frequency of 258 the allele coded 1 (presuming that the alternative allele is coded 0). Likewise, for a given Item, 259 its mean across SNPs is simply the frequency of the allele coded 1. Ordinarily, it makes sense to 260 avoid burdening or distracting a PCA graph with such simple information, so it is best to remove 261 main effects. Also, these means often lack any straightforward or interesting biological meaning.

262
Indeed, it may be especially difficult to attach any biological meaning to the Item means, not 263 least because of alternative choices for SNP coding. When SNP or Item means are biologically 264 meaningless, they merely add noise to a PCA graph.

265
By contrast the S×I interactions pertain to the joint structure of SNPs-and-Items, that is, 266 the differential responses or patterns of Items across SNPs, and of SNPs across Items. These  They concern population structure, SNPs-by-Items heat maps, ancestry-informative SNPs, of Items. Second, when a SNPs-by-Items heat map concentrates one of its colors along the 273 matrix diagonal, this represents joint structure of SNPs-and-Items, that is, S×I interactions. By 274 contrast, main effects produce horizontal or vertical stripes, rather than diagonal structure.

275
Third, ancestry-informative SNPs that identify specific groups of humans are instances of joint 276 structure of SNPs-and-Items. Fourth, GWAS associates specific SNPs with specific Items that 277 exhibit a given trait (or do so to different degrees), and this joint structure constitutes S×I 278 interactions-not SNP main effects, and not Item main effects. Fifth and finally, joint structure 279 involving numerous SNPs and numerous Items frequently reflects an underlying causal factor 280 that affects both SNPs and Items. Generalizing beyond these five examples, interests in SNP 281 data are mostly or entirely about their S×I interactions, rather than their main effects.

282
That said, occasionally there can be special circumstances for which main effects are also 283 of interest, besides the S×I effects, so this situation is also addressed in this article. Which of 284 these three sources of variation-SNP, Item, and S×I effects-are of interest depends on the data 285 and the research objectives. It is the prerogative of researchers to decide which of these sources 286 interest them, so we respectfully leave those choices of interests to others.

287
The choice among PCA variants can be explored with a recent study concerning  There are no missing data, every SNP is biallelic, and the two alleles were assigned values of 1 291 and 2. We transposed her data to 0 and 1 for the sake of convenience, and we refer to this as the 292 "received data." It had mixed SNP coding. To obtain SNP coding rare=1, polarity was reversed 293 for 772 of the 1341 SNPs. The oat data are included in these two codings in the supporting 294 information (S1_OatMixed and S2_OatRare1). Table 1 shows the ANOVA table for DC-PCA of the oat data using three SNP codings: 296 rare=1, VCF with oat line 189 as the reference genome (which is of particular interest because it 297 has a larger SS for Item main effects than any of the other 634 possibilities), and the SNP coding 298 for the data as received (and transposed). ANOVA partitions the total degrees of freedom (df) 299 and sum of squares (SS) into three sources, and then PCA partitions the S×I interactions into the 300 first seven IPCs followed by the residual. Indenting of the sources indicates subtotals. For SNP coding rare=1, the total SS is composed of 88.0% for S×I interaction effects, 310 10.7% for SNP main effects, and 1.3% for Item main effects. SNP coding common=1 is not 311 shown in Table 1, but it necessarily has exactly the same ANOVA table as SNP coding rare=1, 312 not only for DC-PCA shown here, but also for all six variants of PCA considered in this article.

313
However, VCF for oat 189 has different percentages, namely 76.4%, 20.0%, and 3.6%, and the 314 received data has 66.7%, 33.1%, and 0.2%. Hence, choices of SNP coding affect the relative 315 magnitudes of these three sources, as well as the relative magnitudes of the IPCs.

316
The application of PCA to a combination of two sources of variation, unlike the single 317 source for DC-PCA in Table 1 Item-Centered PCA also has four possible outcomes. The joint structure involving S×I interaction information is quite obvious for IPC1, with 436 green mostly on the left and red mostly on the right in both panels. Hence, the 121 winter oats  In review, S×I interaction information is displayed by any PCA (or AMMI) biplot, 512 whereas it is completely absent in any PCA monoplot of only Items or only SNPs. An AMMI1 513 biplot displays main and interaction effects without confounding them. Having characterized the mathematical and statistical consequences of choices of SNP codings, 519 PCA variants, and PCA graphs, here we recommend best practices and contrast them with 520 contemporary practices by means of a literature survey.

522
Merits of SNP coding rare=1 523 Based on the results in the previous section, we recommend SNP coding rare=1 for two reasons. drastically-as well as complexify the interpretive principles for these graphs considerably. shows results for the received data with SNP coding mixed, and the brown point at the bottom represents 571 SNP coding rare=1 and common=1. 572 573 574 To characterize contemporary practices, we conducted a literature survey of 125 articles 575 that apply PCA to SNP data. This survey is included in the supporting information 576 (S6_LiteratureSurvey). Apparently the popularity of different codings varies among 577 communities: Researchers with human SNP data often use VCF coding because of the influence 578 of the 1000 Genomes Project, whereas researchers with crop SNP data often use SNP coding 579 common=1 because it is the default for TASSEL. We did not notice any unambiguous 580 specification of SNP coding rare=1. However, beyond these broad observations, we cannot 581 provide quantitative results about contemporary practices because the choice of SNP coding is 582 reported so infrequently.

584
Merits of DC-PCA and AMMI1 585 We recommend DC-PCA for analyzing SNP data, rather than any of the other five variants 586 considered here, for the following reasons.

587
The argument against SNP-Centered PCA has three cases that exhaust the possibilities.

588
First, if the Item main effects are not of interest, as is often the case, then they should be 589 removed. Since SNP main effects have already been removed from SNP-Centered PCA, the 590 result is the recommended DC-PCA. Second, if the Item main effects are of interest and the SS 591 for these effects is small relative to the SS for PC1 and PC2-as happens for the oat example in 592 Table 2-then a biplot using SNP-Centered PCA is wholly ineffective for displaying Item main 593 effects. Third, if the Item main effects are of interest and their SS is large, then an AMMI1 594 biplot is a better alternative because it includes both main and interaction effects, but without 595 confounding them. In no case is there any reason to choose SNP-Centered PCA.

596
The argument against Item-Centered PCA has exactly the same form as the argument 597 against SNP-Centered PCA.

598
The argument against the last three PCA variants, namely SNP-Standardized, Item-

611
Our literature survey found that the PCA variant is specified quite infrequently. We did 612 not encounter any specification of DC-PCA, so its applications to SNP data must be quite rare.

613
On the other hand, the great majority of articles do specify the software used for PCA analysis,  620 We recommend PCA biplots, rather than monoplots. It is axiomatic that a biplot with points for 621 both Items and SNPs is more informative than a monoplot with points for only Items or only Items patterns in both panels reveal S×I interactions.

642
The display of S×I interactions by biplots is crucial because S×I is commonly large.

643
From Table 1

652
The objection may be raised that PCA biplots would be impractical for datasets with 653 many thousands of SNPs, making graphs unworkably cluttered. In fact, high-density PCA 654 graphs appear in the literature routinely, such as Fig 4 in [18] showing results for 54734 humans.

655
Producing biplots in two adjacent panels helps to reduce clutter by separating Items from SNPs.  In our literature survey, we did not encounter even one biplot.

664
The winning combination 665 The contemporary literature on PCA analysis of SNP data exhibits a multiplicity of options for 666 the choices of SNP codings, PCA variants, and PCA graphs. The great multiplicity of 667 combinations of these choices imposes considerable complexity on this article. By contrast, our 668 recommendations for best practices can be expressed in one concise sentence: Choose SNP 669 coding rare=1, PCA variant DC-PCA (or AMMI1), and biplots.

670
Even a single departure from that winning combination can obliterate or preclude display 671 of S×I interactions, that is, display of the joint structure of SNPs-and-Items.

741
Software developers play a key role in determining which data analyses are reasonably 742 easy to perform, and which analysis options are the defaults and hence are used most frequently. 743 We encourage them to consider including SNP coding rare=1, PCA variant DC-PCA, and biplots 744 for DC-PCA and AMMI1 among the readily available options.

745
How much important structure has been present in SNP data ever since it was collected, 746 but cannot be displayed or understood by contemporary practices? More specifically, have S×I Construction of PCA graphs 753 As explained further in the following appendix, each PC is associated with a unit eigenvector for 754 SNPs, a unit eigenvector for Items, and a singular value. For a comprehensive explanation of 755 ways to handle the singular value, see Malik and Piepho [23]. Two brief remarks suffice for 756 present purposes. First the PC scores that appear in this article's biplots were obtained by 757 multiplying both eigenvectors by the square root of the singular value, as is commonly done for 758 biplots. Second, other contexts may call for a different approach, such as a PCA monoplot of 759 Items, for which multiplication by the singular value is preferable because this optimizes the 760 two-dimensional approximation of distances between Items.

761
The axes in PCA graphs are often scaled to obtain a convenient shape, but actually the 762 axes should have the same scale for many reasons [23]. Unfortunately, our literature survey 763 found a correct ratio of 1 in only 10% of the articles, a slightly faulty ratio of the larger scale 764 over the shorter scale within 1.1 in 12%, and a substantially faulty ratio above 2 in 16%, with the 765 worst cases being ratios of 31 and 44. Also, 7% of the articles failed to show the scale on one or 766 both PCA axes. However, the two axes of an AMMI1 biplot contain different kinds of 767 information (main or interaction effects), so they do not need to use the same scale.

768
PCA gives a unique least-squares solution, up to simultaneous sign change of the Item 769 and SNP scores for a given PC. Therefore, different software packages applied to the same data 770 may produce PCs with reverse polarity, but that is mathematically inconsequential.

771
The percentage of variation captured by each PC is often included in the axis labels of 772 PCA graphs. In general this information is worth including, but there are two qualifications.

773
First, these percentages need to be interpreted relative to the size of the data matrix because large 774 datasets can capture a small percentage and yet still be effective. For example, for a large dataset 775 with over 107,000 SNPs and over 6,000 persons, the first two components capture only 0.3693% 776 and 0.117% of the variation, and yet the PCA graph shows clear structure (Fig 1A in [24]).

777
Contrariwise, a PCA graph could capture a large percentage of the total variation, even 50% or 778 more, but that would not guarantee that it will show evident structure in the data. Second, the  The 125 articles applying PCA analysis to SNP data were taken from the literature more or less 790 at random, with some emphasis on agricultural crop species and on researchers at Cornell 791 University. They span many species and many journals. This survey is included in the 792 supporting information (S6_LiteratureSurvey).

794
Oat datasets 795 The oat dataset supplied by Kathy Esvelt Klos is included in two formats: SNP coding mixed is 796 the data as received, except that the original coding of 1 and 2 was transposed to 0 and 1; and 797 SNP coding rare=1, which required polarity reversal for 772 of the 1341 SNPs (S1_OatMixed 798 and S2_OatRare1).     However, when the SS for Item main effects is small relative to that for SNP-by-Item 848 interaction effects, centering by Item has little effect on the Item scores based on SVD. The 849 verdicts on immunity to SNP coding will be nearly the same for DC-PCA and SNP-Centered 850 PCA when Item main effects are small, and SNP-Centered PCA was already proven earlier in 851 this appendix to be immune. Therefore, correlations for Item scores between different SNP 852 codings are expected to be very close to 1 or -1 for DC-PCA, but not exactly 1. A small SS for 853 Item main effects compared to that for SNP-by-Item interaction effects is a necessary and 854 sufficient condition for DC-PCA monoplots of Items to be virtually immune to changes in SNP 855 coding.

857
Acknowledgments 858 We appreciate helpful comments on this manuscript from Peter Bradbury,Samuel Cartinhour,859 Kathy Esvelt Klos, and Kelly Robbins. We also appreciate Kathy Esvelt Klos sharing the oat 860 SNP data with us.