A Latent Variable Partial Least Squares Path Modeling Approach to Regional Association and Polygenic Effect with Applications to a Human Obesity Study

Genetic association studies are now routinely used to identify single nucleotide polymorphisms (SNPs) linked with human diseases or traits through single SNP-single trait tests. Here we introduced partial least squares path modeling (PLSPM) for association between single or multiple SNPs and a latent trait that can involve single or multiple correlated measurement(s). Furthermore, the framework naturally provides estimators of polygenic effect by appropriately weighting trait-attributing alleles. We conducted computer simulations to assess the performance via multiple SNPs and human obesity-related traits as measured by body mass index (BMI), waist and hip circumferences. Our results showed that the associate statistics had type I error rates close to nominal level and were powerful for a range of effect and sample sizes. When applied to 12 candidate regions in data (N = 2,417) from the European Prospective Investigation of Cancer (EPIC)-Norfolk study, a region in FTO was found to have stronger association (rs7204609∼rs9939881 at the first intron P = 4.29×10−7) than single SNP analysis (all with P>10−4) and a latent quantitative phenotype was obtained using a subset sample of EPIC-Norfolk (N = 12,559). We believe our method is appropriate for assessment of regional association and polygenic effect on a single or multiple traits.


Introduction
Current genetic association studies in humans, including genome-wide association studies (GWASs) [1], typically involve association of individual SNPs with a trait of interest. Notable drawbacks [2] of such an approach include multiple testing and inability to account for the correlation among SNPs in a region or treat genes as a functional unit [3]. Many attempts were made to account for correlations among SNPs, such as haplotype analysis [4], p-value or odds ratio combination [5][6][7], principal component analysis (PCA) [8], cluster [9], canonical correlation [10], data mining [11][12][13][14], and scan (or slide-windows) statistics [14][15][16]. Regardless the extent to which these approaches have succeeded, they are not developed for integrating multiple related traits underlying a condition or disease. For instance, type II diabetes is linked with fasting glucose, HbA1C, and glucose tolerance, among others; and obesity is another with body mass index (BMI), waist and hip circumference. Ideally, liabilities for developing diseases should be measured on quantitative dimensions [17] with available measurements [17,18], so as to gain more statistical power and facilitate derivation of clinically relevant features [17,19]. The case to combine multiple variants and multiple measurements is compelling and in line with the fact that an increasing number of trait-associated SNPs are identified with the challenge to implement an appropriate weighting scheme for the trait-attributing alleles.
We set to exploit association between multiple SNPs and multiple traits through a latent variable partial least squares path modeling (PLSPM) [20,21] in a context analogous to GWAS: for the discovery sample a set of genetic variants and a latent quantitative trait are modeled through scan statistics and for the replication sample small effects of SNPs from different genes (or genomic regions) are aggregated through polygenic statistics. We examined the performance of the scan statistics with respect to type I error rate and statistical power through computer simulations. Our methods were then applied to 12 regions of GWAS data [22,23] from the European Prospective Investigation of Cancer (EPIC)-Norfolk study.

Study samples
Participants in the EPIC-Norfolk study were men and women aged between 45 and 74 from Norwich and the surrounding towns and rural areas [24,25]. In 2006, a case-cohort study was conducted in which 3,867 individuals were assayed with Affymetrix 500 K genechips among whom subcohort (N = 2,566) was a random sample of the study cohort at baseline and cases were part of the remaining individuals with BMI$30 kg/m 2 (N = 1,301). A total of 2,417 individuals in the subcohort and 1,135 cases with 446,861 SNPs passed quality control and in silico genotypes were obtained according to HapMap (http://www. hapmap.org) [22,23]. An additional sample of 12,559 individuals had complete data on age, sex, BMI, waist and hip circumferences along with 12 BMI associated SNPs in or near genes NEGR1

Anthropometric measurements
The influence of body fat distribution has been linked with body shape named crudely after the fruits and vegetable(s) they resemble most [26,27]. Studies have shown that people with a larger waist have higher risks of hypertension, type 2 diabetes and high cholesterol than those who carry excess weight on the hips [28,29]. The combination of BMI, waist and hip circumferences is also a good predictor of cardiovascular risk and mortality [26,[29][30][31][32]. In this paper, nine types of body shape have been derived from the combination (Table S1) and supported by significant differences in these anthropometric traits by types and sexes. As will soon become clear, adoption of this combination as an approximate quantification of ''body shape'' is furnished through a latent score from formal statistical modelling. Note that the derivation differs from other possible definitions, e.g., http://en.wikipedia.org/ wiki/Body_shape.

The modeling framework
As hinted earlier, our framework resembles structural equation modeling (SEM) with three types of parameters defined: (1) Latent variable scores (j) as combinations of their manifest variables obtained iteratively from an ordinary least squares (OLS)-type algorithm; (2) path coefficients (b's) between dependent (j 2 ) and independent latent variable (j 1 ) by OLS or partial least squares (PLS); (3) loadings (l's) of each block of manifest variables with its latent variables by OLS. In this paper, the Lohmä ller PLSPM algorithm was used [24,26]. The relations between these parameters are shown in Figure 1 and used in two contexts: (1a) scan statistics are used for the detection of the genomic region (j 1 ) -body shape (j 2 ) association in initial data analysis; (1b) the polygenic effect of a set of SNPs (j 1 ) on body shape (j 2 ) is obtained with the replication sample. More information about SEM and PLSPM is available as Information S1.

Non-parametric bootstrap
As the distribution of parameters from PLS is unknown, significant test of path coefficients and loadings were furnished by bootstrap procedures [20,33,34]. A large, pre-specified number of bootstrap samples (5,000), each with the same number of cases as the original sample, were generated. Parameter estimation was done for each bootstrap sample, whose path coefficients or loadings can be viewed as an approximation of the sampling distribution. All bootstrap samples together provided estimators for mean and standard error of each parameter. Significance of a parameter (w) under the null hypothesis: H 0 :w~0 and the alternative H A :w=0 was tested via a normal test in the form ) where se(w) is the bootstrapped standard error [20,21].

Interpretation
Let b ij = the path coefficient between the i-th and the j-th latent variable and l ij = loading between the i-th manifest variable and the j-th latent variable. The interpretation can then be facilitated according to Figure 1: (1) path coefficient (b 21 ) in the structure (inner) model represents an overall effect of the genome region or polygenic effect of a SNPs set (j 1 ) on body shape (j 2 ); (2) R 2 is the proportion of variance explained; (3) With path coefficients and loading obtained from the standardized variables, their product in a given path is a measure of the effect of a specific SNP on a single trait or body shape (j 2 ). For example, the effect of SNP2 on body shape (j 2 ) is l 21 : b 21 , and that on BMI is l 21 : b 21 : l 32 ; (4) Body shape score (BSS), as a combination of waist, hip and BMI with weights l 12 , l 22 , and l 32 , represents a latent quantitative phenotype of body shape such that waist t~wais t tzl 12 s waist : j 2 , hip p~hi p pzl 22 s hip : j 2 , BMÎ I~BM I Izl 32 s BMI : j 2 with the body type determined by (BM I Izl 32 s BMI ) and (WHR R~(wais t tzl 12 s waist )=(hi p pzl 22 s hip ) according to their thresholds (Table S1), and (5) the latent polygenic liability (j 1 ) aggregated by small effects of DNA variants in different genome regions with their weights l 11 , l 21 , …, l p1 is the polygenic risk score (PRS) of the SNP set ( Figure 1b).

Simulation
Simulations were conducted as follows: (1) HapMap phase II CEU data at the brain-derived neurotrophic factor (BDNF) region (Chr 11:27633610..27692970 with 24 SNPs) were used to generate the simulated genotypic data; (2) Based on (1), a large sample of 500,000 individuals was obtained via software gs 2.0 [35] with the 6 th SNP being the causal variant; (3) Quantitative genetic data was generated according to a trivariate normal distribution X~N(m,S), where X~x 1 ,x 2 ,x 3 ð Þis the random vector (waist, hip, BMI) for ''apple- was estimated by published data on genetic predisposition score [18]. Given the increment d on BMI, estimation of waist under fixed hip was obtained by waist t~10:20345z0:62138 : hipz0:99947 : BMI (F~568:25, Pv0:0001, R 2~0 :7635) established by the same ''apple-shaped'' data in the EPIC-Norfolk GWAS; (4) Genotypic data were simulated under various sample sizes from the simulated CEU population (500,000 individuals), and quantitative genetics models with the given d were created by the R mvtnorm package.
The window size had 10 SNPs from the 3 th to the 12 th SNP. Under H 0 , 10,000 simulations given various sample sizes were conducted to assess the type I error. Under H 1 , for each model and a given d, 10,000 simulations were conducted under various sample sizes to assess power. The procedures were implemented with Linux and the R plspm package. Both mvtnorm and plspm packages are available from CRAN. (http://cran.r-project.org/)

Analysis of the EPIC-Norfolk data
Scan statistics were built through the subcohort for association between the 12 regions and body shape, and to contrast with a SNP-wise single trait test performed by linear regressions according to sizes of sliding windows of 1 to 15 SNPs, and the a-level was defined as 1|10 {5 according to the literature [4] for region-based analysis. Polygenic effects on single or latent traits with the PLSPM polygenic statistics were obtained and compared with unweighted sum of BMI-increasing alleles [18] and we also assessed whether j 2 is an appropriate latent quantitative measurement.

Simulation
As shown in Figure 2, the type I error rates of the scan statistics were close to nominal levels (0.01, 0.05) as a function of sample sizes (2a, 2b). Power monotonically increases with sample size, effect size (d), or nominal level (a) (2c-2f). Even with a very small a, for effect size greater than 0.15 and the sample size of up to 4,500, the scan statistics remained to have .80% power (2e, 2f).

Analysis of the EPIC Norfolk data
Single trait results. The model provided the usual association results for single trait adjusted for sex and age including effect size estimate, proportion of variance explained and statistical significance. Results on BMI, waist and hip circumferences were also similar for PRS. Shown in Figure S1 and Table S2 are SEM and results of the 12 SNPs in the 12 gene regions adjusted for sex and age for single trait (a1,b1,c1) as with distribution of their PRS and cumulative effects of these variants (a2,b2 c2). More details can be found in Information S2.
Multi-trait results. As shown in Figure 3, none of the SNPs were significant at 10 24 level according to single-SNP -single-trait tests nor according to sliding window sizes of 1-4 SNPs at the 10 25 level, but smaller p values were obtained for window sizes of 5-11 and 12-15 SNPs. Of particular interest was rs720-4609,rs9939881 at the first intron of FTO with b 21~{ 0:091, P = 4.29610 27 for a sliding window of size 10; its model structure is shown in Figure 4. The standardized overall effect (95% CI) of the genome region on body shape was 20.100 (2014-20.08) without adjustment for sex and age, and 20.09 (20.13-20.07) with adjustment. The effect (95%CI) of a specific SNP on body shape or on a single trait are available 20.09 (20.08-20.08) and 0.07 (20.06-20.05) after adjusting for sex and age, respectively for rs58044769. These results suggest that the location of the causal variant in the 10-SNP loading vector is likely between the rs58044769 and rs11642841 (the sixth SNP). Figure 5a and Table 1 show models and results of the 12 SNPs in the 12 gene regions adjusted for sex and age, where the standardized effect (l SNP ?b 21 ) (95%CI) per allele on body shape was 0.08 (0.07-0.10, P = 7.91610 224 ). The proportion of variance explained was 0.8% by PRS. All genetic variants showed associations with body shape, though some loadings of the SNPs were not significant at a = 0.05 (Table 1). There were substantial variations in standardized effects of each SNP with the largest being rs1121980 (FTO) and rs925646 (BDNF) for all the four traits, followed by rs6538238 (TMEM18), rs17782313 (MC4R) for BMI and hip; rs17782313 (MC4R), rs7132908 (FAIM2) for waist and body shape. Non-standardized effect sizes were largest with rs1121980 (FTO) (0.39), but smallest with rs7647305 (ETV5) (0.05) (see also Figures 5 and S1).
Shown in Figure 5b is the distribution of PRS and cumulative effects of these variants, from which we made the following observations: (1) PRS was normally distributed, with ranges of 0.05-1.69 for body shape, with the majority (68.27%) of individuals ( X X +S) also showing similar patterns of PRS (0.8660.21); (2) for each level of PRS the distribution of body shape had similar pattern according to boxplots, generally normally distributed with range 0.4-1.3 for PRS but skewed with ,0.4 or .1.3; (3) The means of body shape score increased linearly with PRS, with on average each additional unit associated with increments (P) of 2.28 (7.91610 224 ).
Shown in Table 2 and Figure 6 are the distribution of body shape types and characteristics of body shape score in the EPIC-Norfolk replication samples, from which several observations can be made.

Discussion
A latent variable PLSPM framework is outlined for association of multiple SNPs with multiple traits, the behavior of such an association was investigated by simulation study through type I error rate and power. Meanwhile, a polygenic statistic was developed for quantification of a polygenic effect by appropriately weighting trait-attributing alleles. These methods were applied to the study of obesity-related variables in the EPIC-Norfolk study for which a latent score was obtained. Below we compare these with available methods, discuss implications of our findings as with other issues involved and indicate some further work.
Compared to SEM, PLSPM is robust to multicollinearity commonly encountered in GWAS data (such as strong linkage disequilibrium between SNPs and high correlation between traits). It is a ''soft modelling'' approach requiring very few distributional assumptions, variables can be numerical, ordinal or nominal, and no need for normality assumptions, while covariance-based SEM is a ''hard modeling'' with heavy distributional assumptions [20,21]. Through simulation, the scan statistics gave a good approximation of the type I error rate and proved powerful for novel region-based latent quantitative traits analysis, even with very high significant level and a modest single SNP effect size. Our result also agreed with the   literature regarding the optimality of a 10-SNP window [4]. Our scan statistics are embedded with the ''thinking quantitatively framework'' [17] such that there is a theoretical quantitative trait for each qualitative trait and normally distributed polygenic liabilities. Their advantages are as follows: First, it can capture the association between a genomic region and a latent quantitative phenotype of disorder (or trait) all in continuous quantitative dimensions. Second, the model structure provides abundant information for interpretation. Third, fine region of the causal SNP can be located by the loading vector of SNPs in the window (the potential causal variant is probably located between rs58044769 and rs11642841). The latent score of obesity-related variables is a synthetic quantitative phenotype which effectively combines waist, hip and BMI to reflect the risk of obesity in accordance with increasing WHR given increasing BMI. Its derivation is a motivating example for many other disorders and traits, such as diabetes, heart disease and metabolic syndrome.
Analysis of the EPIC-Norfolk discovery sample involving 12 gene regions suggested that the scan statistics are more powerful than single SNP -single trait tests with the size 10 providing the strongest evidence. In particular, the region (rs7204609,rs9939811) within the first intron 1 of FTO gene is of interest, as with some of the reported obesity-susceptibility SNPs near or in the 12 genes [18]. We would like to highlight the utility of PRS. It refers to a set of DNA variants in different genome regions associated with a trait, termed previously as polygenic susceptibility score [36], genomic profiles [37], SNP set [38], aggregate risk scores [39] or genetic predisposition score [18]. Their apparent drawback is the lack of an appropriate scheme for weighting. PRS not only weights the individual risk alleles by the loading vector of the SNP set but also furnishes association analysis between PRS and a latent quantitative phenotype (BSS). Our data showed that PRS was normally distributed, which is consistent with the notion that a theoretical quantitative trait correspond to normally distributed polygenic liabilities (see Figure 5b, Figure S1-(a2,b2,c2)) [17]. Unlike the unweighted estimator, it is also coherent and accurate. For instance, total effects of PRS or a specific SNP on the single trait (BMI, waist or hip circumferences) and on the latent quantitative phenotype (body shape) can be compared by the standardized path coefficient or the product of loading and path coefficients along the path, respectively. The non-standardized path coefficient or the product of loading and path coefficients, total effects of a specific SNP on a single trait and on the latent quantitative phenotype can also be obtained. The mean BMI, waist circumference, hip circumference and body shape score increased in a linear fashion as the PRS increases. The effect of PRS on body shape type can be derived. A reviewer has indicated previous work on multiple linked quantitative trait loci (QTLs) [40,41] that bear some spirit to our use of multiple SNPs. Together with the academic editor they have expressed concerns over the possible impact of population stratification. Fortunately, with availability of genomic data such a concern can be relieved with multiple markers directly [42] or via summary statistics from principal components analysis [43]. The EPIC-Norfolk GWAS has contributed to a variety of consortia, for which the inflation factor derived from per SNP association statistics is always close to one. This is likely to be the result of both homogenous sample and exclusion of outliers at the quality control stage. We believe the analysis as conducted in this report will not be affected. However, in general, it may be necessary to include summary statistics such as principal components as covariates in the model.
A reviewer has questioned the adequacy of body shape as with PLS with a view that body share should be supported by various other measurements such as limb lengths, shoulder widths, etc. However, our interest lies more in utilizing the anthropometric traits from a population study for investigation of health risks. Indeed our results showed that BSS is approximately normal ( Figure S2) and serves as an excellent measurement of body shape types (Figures 6c,  6d). The use of latent trait is also consistent with Fisher's derivation of polygenic effect [17]. At the time the paper was submitted for publication, a form of PLS has appeared for multiple markers [44].
There will be several lines of further research. Firstly, there is an important need to examine the precise nature of regional or polygenic effect on a single trait or a collection of traits, as it may involve both polygenic and pleiotropic effects. This is also the case with GWAS. Long before this work when we reported work using SEM to differentiate pleiotropic effect on obesity-related traits in a GIANT consortium (http://www.broadinstitute.org/collaboration/ giant/index.php/Main_Page) teleconference, a colleague instantly questioned the feasibility across the whole consortium. Secondly, the scan statistics seemed slightly anticonservative and a parametric counterpart is preferable. Thirdly, it will be desirable to catch both linear and nonlinear effects between genome region and latent quantitative trait. Figure S1 SEM of the 12 SNPs in the 12 gene regions adjusted for sex and age for single trait (a1,b1,c1) as with distribution of their PRS and cumulative effects of these variants (a2,b2 c2). (TIF) Figure S2 The distribution of body shape score (BSS). (TIF)