Iterative Usage of Fixed and Random Effect Models for Powerful and Efficient Genome-Wide Association Studies
The proposed method, FarmCPU, was inspired by the method development illustrated in the left panel (a). These methods start with a naïve model (e.g., t-test) that tests markers one at a time, i.e., the ith marker (si), against the phenotype (y) with a residual effect (e). Next, GLM controls false positives by fitting population structure (Q) as covariates to adjust the test on genetic markers, as indicated by the blue arrows. MLM fits both Q and kinship (K) as covariates. However, both Q and K remain constant for testing all the markers; neither receives adjustment from the association tests on markers. MLMM adds pseudo QTNs as additional covariates (S). These pseudo QTNs are estimated through a stepwise regression procedure. Consequently, they receive adjustment from association tests on markers, as indicated by the red arrow, but both Q and K still remain constant for testing all the markers. FaST-LMM-Select, like MLM, controls false positives by fitting Q and K as covariates, but its K is derived from association tests on markers, as indicated by the red arrow. However, Q remains constant. FarmCPU completely removes the confounding between the testing marker and both K and Q by combining MLMM and FaST-LMM-Select while allowing a fixed effect model and a random effect model to run separately. The fixed effect model contains the testing marker and the pseudo QTNs, which control false positives. The pseudo QTNs are selected from associated markers and evaluated by the random effect model, with K defined by the pseudo QTNs. The fixed effect model and the random effect model are used iteratively until convergence is reached, that is, until no new pseudo QTNs are added. The right panel (b) displays the fixed effect model above the dashed line and the random effect model below the dashed line. In the fixed effect model, the t pseudo QTNs (S1 to St) are fitted as covariates to test markers one at a time, e.g., the ith marker (si).
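The alternation between the two models can be sketched as a simple loop. This is only a minimal illustration of the control flow described above, not the FarmCPU implementation: the names `iterate_until_converged`, `toy_scan`, and `toy_select` are hypothetical, the two toy functions stand in for the fixed effect scan and the random effect selection, and the stopping rule (stop when the selection step proposes no marker that is not already a pseudo QTN) is an assumed reading of "no new pseudo QTNs are added".

```python
def iterate_until_converged(scan_markers, select_pseudo_qtns, max_iter=20):
    """Alternate a marker scan and a pseudo-QTN selection step.

    scan_markers(covariates) -> list of P values, one per marker
    select_pseudo_qtns(p_values) -> iterable of marker indices
    Both callables are placeholders supplied by the caller (assumption).
    """
    pseudo_qtns = set()
    # First pass: no pseudo QTNs yet, so this is a plain one-marker-at-a-time scan.
    p_values = scan_markers(sorted(pseudo_qtns))
    for _ in range(max_iter):
        selected = set(select_pseudo_qtns(p_values))
        # Converged: the selection step proposes nothing new.
        if selected <= pseudo_qtns:
            break
        # The updated set goes back into the fixed effect scan.
        pseudo_qtns = selected
        p_values = scan_markers(sorted(pseudo_qtns))
    return sorted(pseudo_qtns), p_values


def toy_scan(covariates):
    """Toy stand-in for the fixed effect model (hypothetical numbers).

    Marker 4 only becomes clearly significant once marker 1 is a covariate.
    """
    p = [0.5, 1e-8, 0.3, 0.7, 0.02, 0.9]
    if 1 in covariates:
        p[4] = 1e-6
    return p


def toy_select(p_values, threshold=1e-4):
    """Toy stand-in for the random effect selection: keep markers below a threshold."""
    return [j for j, p in enumerate(p_values) if p < threshold]


final_qtns, final_p = iterate_until_converged(toy_scan, toy_select)
print(final_qtns)  # prints [1, 4]
```

In this toy run the first scan selects marker 1, the second scan (with marker 1 as a covariate) reveals marker 4, and the third selection adds nothing new, so the loop stops.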
As the pseudo QTNs are fitted as covariates for each marker, Not Available (NA) is assigned as the test statistic for all markers that are also pseudo QTNs, because such a marker is completely collinear with the pseudo QTN. However, each pseudo QTN has a test statistic corresponding to every marker, creating a matrix (lightly shaded) with elements Pij, i = 1 to t and j = 1 to m. The most significant P value of each pseudo QTN (the vector on the right of the shaded area) is substituted for the NA of the corresponding marker. The pseudo QTNs are optimized by using the SUPER method in the random effect model to incorporate both the test statistics from the fixed effect model and the genetic map information in the genotype data. The random effects are the individuals' genetic effects (u), with the variance-covariance matrix, Var(u), defined by the Singular Value Decomposition (SVD) on the pseudo QTNs by using the FaST-LMM algorithm. The updated set of pseudo QTNs goes back into the fixed effect model. The process repeats until no more pseudo QTNs are added.
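The NA substitution step can be illustrated with a short sketch. It assumes the marker P values are held in a list with NaN at the pseudo QTN positions and that the t by m matrix Pij is a list of rows, one per pseudo QTN; the function name `fill_pseudo_qtn_p_values`, the data layout, and the numbers are all hypothetical and chosen only for illustration. It also assumes that entries of Pij are NaN where the tested marker is itself a pseudo QTN.

```python
import math


def fill_pseudo_qtn_p_values(marker_p, qtn_index, qtn_p_matrix):
    """Replace the NA of each pseudo QTN marker with its most significant P value.

    marker_p     : list of m P values, NaN where the marker is a pseudo QTN
    qtn_index    : list of t marker indices that are pseudo QTNs
    qtn_p_matrix : t rows of m values; row i holds the P values of pseudo QTN i
                   obtained while each marker j was being tested (NaN if unavailable)
    """
    filled = list(marker_p)
    for i, j in enumerate(qtn_index):
        observed = [p for p in qtn_p_matrix[i] if not math.isnan(p)]
        if observed:
            # Most significant means smallest P value across all tested markers.
            filled[j] = min(observed)
    return filled


nan = float("nan")
# Five markers; markers 1 and 3 are pseudo QTNs (hypothetical values).
marker_p = [0.2, nan, 0.04, nan, 0.6]
qtn_index = [1, 3]
qtn_p_matrix = [
    [1e-6, nan, 3e-7, nan, 2e-6],     # pseudo QTN at marker 1
    [0.001, nan, 0.0005, nan, 0.002], # pseudo QTN at marker 3
]

filled = fill_pseudo_qtn_p_values(marker_p, qtn_index, qtn_p_matrix)
print(filled)
```

After the call, the two NaN entries are replaced by the row minima of the matrix, while the P values of the ordinary markers are left unchanged.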