The authors have declared that no competing interests exist.

Conceived and designed the experiments: ZZ YP ESB. Performed the experiments: QW FT. Analyzed the data: QW ZZ. Contributed reagents/materials/analysis tools: QW ZZ FT. Wrote the paper: QW ZZ.

Genome-Wide Association Studies shed light on the identification of genes underlying human diseases and agriculturally important traits. This potential has been shadowed by false positive findings. The Mixed Linear Model (MLM) method is flexible enough to simultaneously incorporate population structure and cryptic relationships to reduce false positives. However, its intensive computational burden is prohibitive in practice, especially for large samples. The newly developed algorithm, FaST-LMM, solved the computational problem, but requires that the number of SNPs be less than the number of individuals to derive a rank-reduced relationship. This restriction potentially leads to less statistical power when compared to using all SNPs. We developed a method to extract a small subset of SNPs and use them in FaST-LMM. This method not only retains the computational advantage of FaST-LMM, but also remarkably increases statistical power even when compared to using the entire set of SNPs. We named the method SUPER (Settlement of MLM Under Progressively Exclusive Relationship) and made it available within an implementation of the GAPIT software package.

Genome-Wide Association Study (GWAS) has become the leading method to identify genes underlying human diseases and agriculturally important traits. However, the genetic variants identified so far only explain a small portion of phenotypic variation

Population stratification and cryptic relationships are two common reasons for the inflation of false positives

The number of individuals in the population largely determines the size of a MLM equation

Efforts have been made to change the computational function from cubic to quadratic, especially for marker screening, which dominates the entire computation for data with high marker density. The Population Parameter Previously Determined (P3D), or Efficient Mixed-Model Association eXpedited (EMMAX), estimates variance components (or their ratio) only once and then fixes them to test genetic markers

The method of compressed MLM

The Factored Spectrally Transformed Linear Mixed Model (FaST-LMM) partitions the cubic function of computing complexity as the product of two parts: 1) the number of individuals and 2) the square of the rank of the relationship among individuals

In this study, we developed a method that dramatically reduces the number of genetic markers used to define individual relationships and remarkably increases statistical power. First, we divide the whole genome into small bins, each represented by its most significant marker. Second, we select only the influential bins. Third, we use a maximum likelihood method to optimize the size and number of bins selected as the pseudo Quantitative Trait Nucleotides (QTNs) underlying the phenotypes. Fourth, in the final test of each marker, the small set of pseudo QTNs is used to define the relationship among the individuals after excluding those that are in Linkage Disequilibrium (LD) with the tested marker, regardless of local distance. We call the algorithm the Settlement of MLM Under Progressively Exclusive Relationship (SUPER).
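The binning idea can be illustrated with a minimal sketch. It assumes the bin size and the number of bins are already given, whereas the method described here optimizes both by maximum likelihood. The function name `select_pseudo_qtns` and its arguments are hypothetical and are not taken from GAPIT.

```python
import numpy as np

def select_pseudo_qtns(chrom, pos, pvalues, bin_size, n_bins):
    """Pick one representative SNP per bin, then keep the n_bins best bins.

    Assumption: bin_size and n_bins are supplied by the caller; the
    likelihood-based optimization of these two parameters is not shown.
    """
    chrom = np.asarray(chrom)
    pos = np.asarray(pos)
    pvalues = np.asarray(pvalues, dtype=float)
    bin_index = pos // bin_size  # bins are defined per chromosome

    best = {}  # (chromosome, bin) -> index of the most significant SNP
    for i in range(len(pvalues)):
        key = (chrom[i], bin_index[i])
        if key not in best or pvalues[i] < pvalues[best[key]]:
            best[key] = i

    # Rank bin representatives by P value and keep the most influential bins.
    representatives = sorted(best.values(), key=lambda i: pvalues[i])
    return representatives[:n_bins]

# Toy input: six SNPs on two chromosomes, bins of 1,000 positions.
chrom = [1, 1, 1, 1, 2, 2]
pos = [100, 900, 1500, 1900, 200, 1200]
pvals = [0.20, 0.001, 0.03, 0.5, 0.04, 0.0005]
picked = select_pseudo_qtns(chrom, pos, pvals, bin_size=1000, n_bins=3)
print(picked)
```

Keeping a single SNP per bin prevents several SNPs from the same association peak from entering the kinship together.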

We developed the SUPER method in the framework of a standard MLM approach, which decomposes the observation (_{ij}

To perform a GWAS, marker effect (

Kinship (K) is a known parameter, which is derived from genetic markers. Consequently, different sets of genetic markers create different kinships. This is the only difference among all the methods compared in this study. We used the efficient algorithm
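To illustrate how the choice of markers changes the kinship, the sketch below builds a kinship matrix from an arbitrary subset of SNPs. The centred cross-product formula is an assumption chosen for simplicity; it is not necessarily the efficient algorithm used in this study. The name `marker_kinship` is hypothetical.

```python
import numpy as np

def marker_kinship(genotypes, snp_index=None):
    """Kinship among individuals from a chosen set of SNPs.

    genotypes: individuals x SNPs array coded 0/1/2.
    Assumption: K = Z Z' / m with column-centred genotypes Z; this is an
    illustrative formula, not necessarily the one used in the paper.
    """
    G = np.asarray(genotypes, dtype=float)
    if snp_index is not None:
        G = G[:, snp_index]          # restrict to the selected markers
    Z = G - G.mean(axis=0)           # centre each SNP across individuals
    return Z @ Z.T / G.shape[1]

rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(5, 40))                 # toy genotypes
K_all = marker_kinship(geno)                            # all 40 SNPs
K_sub = marker_kinship(geno, snp_index=[0, 3, 7])       # a 3-SNP subset
```

Both matrices describe the same five individuals, yet they differ because they are built from different marker sets.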

Our procedure consists of three steps. The first two steps perform the inclusion of pseudo QTNs. The last step performs GWAS with exclusion of the pseudo QTNs that are in LD with the tested SNP.

Step 1: Sort SNPs by their P values or effects, obtained from a preliminary GWAS or genomic prediction for a specific trait.

Step 2: For each bin (segment) on a chromosome, choose the most influential SNP (e.g., with the lowest P value) as the representative for the bin. Then select

Step 3: When testing a SNP in

Solving

Where

Six published datasets from dog, maize, rice, _{1}, F_{2}, and two backcrosses). All dogs were genotyped with 23,500 SNPs at genome-wide coverage.

The maize data contained 282 inbred lines. The genotypes (2,911 SNPs) were released as a tutorial dataset of the TASSEL and GAPIT software packages

The rice data contained 374 inbred lines and 50,000 SNPs randomly sampled from the one million SNPs obtained by genotyping-by-sequencing technology

The

The mouse data contained 688 34th generation advanced intercross lines (AIL) derived from two inbred strains (SM/J and LG/J). The genotype data contained 3,117 SNPs

The Human Framingham Heart Study (FHS) data were downloaded from the database of Genotypes and Phenotypes (dbGaP; phg000005.v5). Total cholesterol (Offspring exam 7) was used as the phenotype for the association study. The present study sample comprised 806 FHS offspring participants who were genotyped using the 100K Affymetrix GeneChip and had fasting blood lipid traits for exam 7. We imputed the missing values using mean values by the program GCTA

A set of SNPs was randomly sampled as causal QTNs for the simulated traits (27, 20, 24, and 20 QTNs for maize,

The distribution of these QTN effects followed a normal distribution with a mean of 0 and variance of 1. Phenotypes were simulated as the following equation: y = additive + residual. For each individual, the total additive effect is calculated as the sum of additive effects across all QTNs. The residual variance was calculated as V
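The simulation scheme can be sketched as follows. The heritability parameter `h2` and the rule that sets the residual variance from it are assumptions made for illustration, because the residual variance formula is not fully recoverable from the text. The function name `simulate_phenotype` is hypothetical.

```python
import numpy as np

def simulate_phenotype(genotypes, n_qtn, h2, rng):
    """Simulate y = additive + residual from randomly sampled QTNs.

    Assumption: residual variance = additive variance * (1 - h2) / h2,
    where h2 is a target heritability chosen by the user.
    """
    G = np.asarray(genotypes, dtype=float)
    n, m = G.shape
    qtn = rng.choice(m, size=n_qtn, replace=False)    # causal SNPs
    effects = rng.normal(0.0, 1.0, size=n_qtn)        # N(0, 1) QTN effects
    additive = G[:, qtn] @ effects                    # sum across all QTNs
    var_a = additive.var(ddof=1)
    var_e = var_a * (1.0 - h2) / h2                   # assumed rule, see above
    residual = rng.normal(0.0, np.sqrt(var_e), size=n)
    return additive + residual, qtn, additive

rng = np.random.default_rng(1)
geno = rng.integers(0, 3, size=(200, 500))            # toy genotypes
y, qtn, additive = simulate_phenotype(geno, n_qtn=20, h2=0.5, rng=rng)
```

In this sketch the realised share of additive variance in `y` only approximates `h2`, because the residuals are themselves a random draw.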

The association tests on the markers were performed by conducting F tests. In the scenario in which QTNs were sampled without any restriction, the empirical distribution of the non-QTN markers was used as the null distribution of type I error. In the second scenario, in which the last chromosome was excluded from QTN sampling, the empirical distribution of the markers on the last chromosome was used as the null distribution of type I error. Power was calculated as the proportion of QTNs that passed the testing threshold for a given type I error (5%). A total of 100 replications were conducted for each method, and the average over the 100 replicates was reported.
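The power calculation described above can be sketched as below. The threshold is taken as the empirical quantile of the null-marker P values at the chosen type I error; the exact quantile rule is an assumption, and the name `empirical_power` is hypothetical.

```python
import numpy as np

def empirical_power(p_qtn, p_null, type1=0.05):
    """Proportion of QTNs passing the threshold set by the null markers.

    Assumption: the threshold is the k-th smallest null P value, with
    k = floor(type1 * number of null markers).
    """
    p_null = np.sort(np.asarray(p_null, dtype=float))
    k = int(np.floor(type1 * len(p_null)))
    if k == 0:
        return 0.0                      # too few null markers for this level
    threshold = p_null[k - 1]
    return float(np.mean(np.asarray(p_qtn, dtype=float) <= threshold))

# Toy example: 100 evenly spaced null P values and four QTN P values.
p_null = np.linspace(0.01, 1.0, 100)
p_qtn = [0.001, 0.03, 0.2, 0.6]
power = empirical_power(p_qtn, p_null, type1=0.05)
print(power)
```

Averaging this quantity over replicates gives the kind of summary reported for each method.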

All the datasets analyzed herein have been previously published. This study did not obtain actual samples from humans or animals.

Through simulations, we demonstrated that the effective components in the small set of selected genetic markers are the QTNs underlying a trait. To remove the confounding between the QTNs and testing markers, the exclusion of QTNs is more effective when LD is used instead of local distance. We examined our proposed method for the practical situations where QTNs are unknown.

We compared SUPER and other popular mixed model methods through a series of simulations. The difference among these methods is how to build kinship. We showed that a small subset of randomly selected genetic markers will not always produce the equivalent statistical power compared to using all genetic markers (

| Method to build kinship | Arabidopsis | Rice | Dog | Maize |
| --- | --- | --- | --- | --- |
| All SNPs, including true QTNs | 0.63±0.0070 | 0.52±0.0063 | 0.59±0.0079 | 0.51±0.0095 |
| All SNPs, excluding true QTNs | 0.63±0.0072 | 0.52±0.0061 | 0.59±0.0083 | 0.52±0.0091 |
| True QTNs only | 0.42±0.0066 | 0.29±0.0064 | 0.40±0.0083 | 0.30±0.0064 |
| SUPER with known QTNs | 0.75±0.0065 | 0.65±0.0057 | 0.72±0.0076 | 0.66±0.0075 |
| SUPER with unknown QTNs | 0.72±0.0063 | 0.60±0.0059 | 0.68±0.0078 | 0.61±0.0084 |

In the above simulation study, the small set of randomly selected SNPs had higher power than the kinship from all SNPs 35% of the time. This finding indicates that the gold-standard kinship derived from all SNPs is not always the best choice. The interesting question is therefore: what type of small subset of SNPs produces higher power than using all the SNPs? We were motivated by the fact that a trait specific kinship derived from weighted SNPs has better prediction accuracy than the kinship derived from all the SNPs in genomic prediction

However, when we applied kinship from all the QTNs for GWAS, we found that statistical power decreased to about 30%, which was much lower than using kinship derived from all SNPs. This result is not surprising because the kinship derived from all QTNs is confounded with the effect of the tested SNP if this SNP is one of the QTNs.

This finding confirmed the strategy for selecting the kinship method for GWAS. When testing a SNP, we remove the SNP from the QTN list if the SNP is a QTN. We then use the remaining QTNs to derive a complementary trait specific relationship for the SNP (

For the real situation, where QTNs are unknown, we developed an algorithm to derive a set of pseudo-QTNs for the SUPER method. The algorithm involves three steps. The first step is to perform a preliminary GWAS to sort SNPs. The second step determines the size and number of bins that give the maximum likelihood for a specific trait. Then, for each bin, the most associated SNP is used as the pseudo-QTN to represent that bin. The size and number of bins are the two parameters chosen for optimization. The third step is to perform the complementary process in GWAS by excluding the pseudo-QTNs that are in LD with the tested SNP. The remaining pseudo-QTNs are used to define the complementary relationship among individuals. In the simulation study where the QTNs were masked, we obtained a statistical power of 61%, lower than the situation with known QTNs, but still much higher than using all SNPs (

We extended our examination of statistical power against the type I error for four methods: SUPER, FaST-LMM-Select, EMMAX, and GLM (

We explored several ways to reduce the computing time of SUPER. First, we examined the effect using P3D/EMMAX

Second, we explored speeding up computation by using a fast method to derive the P values at the first stage of SUPER. Three methods were compared: GLM, MLM

Third, we provided a procedure to determine the threshold of LD between tested SNPs and QTNs. When the threshold is too high (e.g., r^{2} = 100%), QTNs are barely removed, and the result should be similar to that of the complete trait specific kinship. In the opposite case, where the threshold is too low (e.g., r^{2} = 0.01%), QTNs hardly survive the exclusion process. The kinship matrix then retains little information, and the results would be similar to those of the GLM. We observed that a threshold of r^{2} = 10% was best for both maize and rice. This threshold also worked well for the other species (dog and
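The exclusion step can be sketched as follows. Here r^{2} is computed as the squared Pearson correlation between genotype codes, which is an assumption about how LD is measured; the function name `complementary_qtns` is hypothetical.

```python
import numpy as np

def complementary_qtns(genotypes, test_snp, pseudo_qtns, r2_threshold=0.10):
    """Return the pseudo-QTNs that are NOT in LD with the tested SNP.

    Assumption: LD is the squared Pearson correlation of 0/1/2 genotype
    codes. Monomorphic SNPs give an undefined correlation and are dropped.
    """
    G = np.asarray(genotypes, dtype=float)
    x = G[:, test_snp]
    keep = []
    for q in pseudo_qtns:
        r = np.corrcoef(x, G[:, q])[0, 1]
        if r * r <= r2_threshold:       # comparison is False when r is NaN
            keep.append(q)
    return keep

# Toy genotypes: SNP 1 duplicates SNP 0, SNP 2 is uncorrelated with SNP 0.
geno = np.array([
    [0, 0, 1],
    [1, 1, 0],
    [2, 2, 1],
    [0, 0, 1],
    [1, 1, 2],
    [2, 2, 1],
])
remaining = complementary_qtns(geno, test_snp=0, pseudo_qtns=[0, 1, 2])
print(remaining)
```

The remaining pseudo-QTNs would then define the kinship for that tested SNP. Because the comparison is made genome-wide, a pseudo-QTN on another chromosome is also excluded when it is correlated with the tested SNP.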

We examined our findings for a variety of circumstances. We verified the effect of the correlation between QTNs and the non-QTN SNPs. The non-QTN SNPs were used to derive the empirical null distribution of type I error. Two scenarios were examined. In the first scenario, no correlation was found because QTNs and non-QTN SNPs were sampled from different chromosomes. In the second scenario, correlation was possible because random sampling might place QTNs and non-QTN SNPs next to each other. We observed that, in either case, our findings still held. That is, 1) SUPER with known QTNs had the highest statistical power, 2) complete trait specific kinship had the lowest power, 3) kinship from all SNPs was in the middle, and 4) SUPER with unknown QTNs fell between SUPER with known QTNs and the kinship from all SNPs (

We then examined the impact of the magnitude of QTN effect (

We expanded the comparisons of SUPER with EMMAX and FaST-LMM-Select to real traits. The first is from the Advanced Intercross Line (AIL) mouse data

The mouse phenotype is methamphetamine-induced locomotor activity on day 3, measured on 688 Advanced Intercross Lines (AIL). The human phenotype is cholesterol collected by the Framingham Heart Study (FHS) Project. Each dataset was analyzed with three different methods (SUPER, EMMAX, and FaST-LMM-Select), except for the combination of FaST-LMM-Select and the human data. The missing genotypes in the human data were imputed in dosage format, which is not accepted by FaST-LMM-Select. The most significant SNP is highlighted by a horizontal blue line and labeled with its corresponding False Discovery Rate (FDR). The P value threshold of 0.05 (after Bonferroni multiple test correction) is indicated by a horizontal red line.
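The two significance measures mentioned in the caption can be sketched as below. The Benjamini-Hochberg procedure is used here as one common way to compute FDR-adjusted P values; the text does not state which FDR procedure was applied, so this choice is an assumption. The function names are hypothetical.

```python
import numpy as np

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-marker P value threshold after Bonferroni correction."""
    return alpha / n_tests

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P values (an assumed FDR procedure)."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest P value downwards.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

# The mouse data contained 3,117 SNPs, so the corrected threshold is 0.05 / 3117.
threshold = bonferroni_threshold(3117)
fdr = bh_adjust([0.01, 0.04, 0.03, 0.5])
print(threshold, fdr)
```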

| SNP | Chromosome | Position | EMMAX | FaST-LMM-Select | SUPER | Gene |
| --- | --- | --- | --- | --- | --- | --- |
| 5-122651666 | 5 | 125405148 | 5.60E-05 | 2.15E-03 | | |
| mUC-rs13478501 | 5 | 124051672 | 1.47E-04 | 2.81E-02 | | |
| NES14715162 | 5 | 124119050 | 1.28E-04 | 8.81E-03 | | |
| 5-122053167 | 5 | 124768242 | 1.15E-04 | 7.75E-03 | | |
| 5-121026072 | 5 | 123740172 | 2.80E-04 | 4.46E-02 | | |

The second real dataset is from the Human Framingham Heart Study (FHS) project. Missing genotypes were imputed. As FaST-LMM-Select does not accept dosage genotypes, the comparison was performed between SUPER and EMMAX. Manhattan plots of total cholesterol for SUPER and EMMAX are shown in

With the SUPER method, the restriction of the computationally efficient FaST-LMM method is no longer a problem. Their joint usage retains a similar computing speed while remarkably improving statistical power.

The concept of complementary trait specific kinship reflects a landmark in the development of kinship. As essential information in population and quantitative genetics, kinship is traditionally derived from pedigree as an expected chance that two individuals share the same allele by descent

An alternative way to derive kinship is to rely on genetic markers

Obviously, the best kinship to define individual genetic relationship on a complex trait is the one derived from all the QTNs underlying the trait as they define it

Less obvious, however, is that a small proportion of randomly sampled SNPs can have higher statistical power than all SNPs. The increased power might result from a combination of the following factors: 1) the sampled SNPs contain QTNs or SNPs in LD with QTNs, 2) fewer non-QTN SNPs result in less dilution, and 3) a portion of the QTNs, or of the SNPs in LD with QTNs, are excluded and become more detectable.

By chance, a small randomly selected subset of SNPs can have higher power than all SNPs. In general, however, randomly selected subsets have less power. Therefore, randomly selecting a small set of SNPs is unsafe. The goal of this study was to find a better method for choosing subsets. Ideally, the subset contains fewer SNPs than the number of individuals and has the same or higher power than all the SNPs.

FaST-LMM-Select has been undertaken to find small subsets of SNPs

Our study was unique in several ways. Overall, our study gives the biological, inside view of the statistical phenomena observed in the FaST-LMM-Select study. Through a series of simulations, we showed that their finding (that a small set of randomly selected SNPs generates statistical power equivalent to that of all the SNPs) is not always true. In fact, statistical power can be reduced significantly. This result is not surprising as a small random sample of SNPs is less informative than using all the SNPs

Furthermore, we explained why the kinship for GWAS should be specific for a trait and complementary to the tested SNP. We started with known QTNs and showed how different scenarios impact statistical power, such as using all QTNs or using QTNs excluding the one being tested. These simulations demonstrate how a kinship built from all QTNs is confounded with the effects of the tested SNPs, which reduces power relative to a kinship from all SNPs, and how the exclusion of QTNs eliminates the confounding.

We applied the method derived from situations with known QTNs to real-life situations with unknown QTNs. We developed the algorithm to find their representatives (pseudo-QTNs) and demonstrated that the SUPER approach has statistical power close to that achieved with known QTNs. We determined the set of pseudo-QTNs by optimizing bin size and bin number to define the trait through a method of maximum likelihood. In contrast to FaST-LMM-Select, which selects only the top significant SNPs, this set of pseudo-QTNs is chosen as the best combination among all SNPs. It is therefore not surprising that SUPER showed higher power than FaST-LMM-Select.

The top significant SNPs selected in the FaST-LMM-Select study likely include multiple SNPs from each association peak in GWAS. These SNPs are in strong LD among themselves. One obvious disadvantage is that this SNP selection method causes severe dilution. The other disadvantage is that computational time increases by including more SNPs than necessary. The SUPER method avoids this problem by using the pseudo-QTNs. Only one SNP is selected from many SNPs on each peak. Consequently, the optimum number of SNPs used to derive kinship is much smaller.

Moreover, LD is not only caused by local genetic linkage. Many other factors can cause LD between SNPs (e.g., population structure), even when SNPs are on different chromosomes. Therefore, our complementary process is performed genome-wide, and is not limited to the nearby SNPs (FaST-LMM-Select uses a 2 cM interval).

FaST-LMM-Select uses an arbitrary interval (2 cM) as the threshold of exclusion for LD. We use a direct LD measure (r^{2}) instead. We demonstrated that an r^{2} threshold of 10% was robust enough to give the highest statistical power in all species we examined.

Last, but certainly not least, FaST-LMM-Select complements our method. FaST-LMM-Select provides an elegant algorithm to reduce computation time by conducting singular value decomposition only once. Thus, the joint usage of these two methods will provide powerful and flexible tools.

We anticipate that the SUPER method could be used jointly with the CMLM to further improve statistical power. Each individual would still have its group assignment. However, the kinship of groups would be replaced by the assignment of individual QTN to groups. The effects of different assignments remain an open research question.

SUPER has been implemented in the publicly available software package, GAPIT. This method makes it possible to detect a gene with smaller samples, or alternatively, to detect a smaller effect gene with the same sample size.

When the threshold is too large, e.g. r^{2} = 100%, QTNs are barely removed. The result should be similar to the complete trait specific kinship. In the opposite case, when the threshold is too small, e.g. r^{2} = 0.01%, QTNs hardly survive the exclusion process. The kinship does not retain much information and the result would be similar to GLM. Interestingly, we observed that the threshold of r^{2} = 10% worked well for all species.

The authors thank Sara J. Miller and Linda R. Klein for editing the manuscript.