Structural equation modeling for hypertension and type 2 diabetes based on multiple SNPs and multiple phenotypes

Genome-wide association studies (GWAS) have been successful in identifying genetic variants associated with complex diseases. However, association analyses between genotypes and phenotypes are not straightforward due to the complex relationships between genetic and environmental factors. Moreover, multiple correlated phenotypes further complicate such analyses. To resolve this complexity, we present an analysis using structural equation modeling (SEM). Unlike current methods that focus only on identifying direct associations between diseases and genetic variants such as single-nucleotide polymorphisms (SNPs), our method introduces the effects of intermediate phenotypes, which are related phenotypes distinct from the target, into the systematic genetic study of diseases. Moreover, we consider multiple diseases simultaneously in a single model. The procedure can be summarized in four steps: 1) selection of informative SNPs, 2) extraction of latent variables from the selected SNPs, 3) investigation of the relationships among intermediate phenotypes and diseases, and 4) construction of an SEM. As a result, a quantitative map can be drawn that simultaneously shows the relationship among multiple SNPs, phenotypes, and diseases. In this study, we considered two correlated diseases, hypertension and type 2 diabetes (T2D), which are known to have a substantial overlap in their disease mechanism and have significant public health implications. As intermediate phenotypes for these diseases, we considered three obesity-related phenotypes—subscapular skin fold thickness, body mass index, and waist circumference—as traits representing subcutaneous adiposity, overall adiposity, and abdominal adiposity, respectively. Using GWAS data collected from the Korea Association Resource (KARE) project, we applied the proposed SEM process. Among 327,872 SNPs, 24 informative SNPs were selected in the first step (p<1.0E-05). Ten latent variables were generated in step 2. After an exploratory analysis, we established a path diagram among phenotypes and diseases in step 3. Finally, in step 4, we produced a quantitative map with paths moving from specific SNPs to hypertension through intermediate phenotypes and T2D. The resulting model had high goodness-of-fit measures (χ2 = 536.52, NFI = 0.997, CFI = 0.998, GFI = 0.995, AGFI = 0.993, RMSEA = 0.012).


Introduction
Hypertension and type 2 diabetes (T2D) are two of the leading risk factors for atherosclerotic cardiovascular disease, which is a major component of the global burden of disease [1][2][3][4]. These conditions often occur together, and recent studies showed that the presence of T2D increased the risk of hypertension [5,6]. Hypertension and T2D are thought to share common pathways such as obesity, insulin resistance, inflammation, oxidative stress, and mental stress [7]. In addition to lifestyle and environmental factors, genetic factors have also been explored to understand the mechanisms of T2D and hypertension [7,8]. Because obesity-related phenotypes are thought to be a common pathophysiological element underlying T2D and hypertension, understanding the connections among these diseases and factors related to obesity is an important aspect of the search for proper treatments of these diseases.
Genome-wide association studies (GWAS) have been successful in identifying genetic variants associated with complex diseases such as asthma, autism disorder, T2D, and hypertension [9]. A typical approach of GWAS is to report significant single-nucleotide polymorphisms (SNPs) affecting disease; in other words, such studies generally present single-SNP analyses. However, association analyses between genotypes and traits are not straightforward due to the complex relationships between genetic and environmental factors. Moreover, in the presence of multiple correlated phenotypes and/or diseases, such analyses become more complicated. Since a separate analysis of each phenotype or disease ignores dependencies among phenotypes, a multivariate approach considering a joint analysis should be considered. Various genomic studies have been conducted to understand hypertension and T2D individually [10,11]. However, few studies have attempted to model the pathways underlying hypertension through obesity-related traits and T2D [12].
Methodology for mediation to assess the importance of various paths and mechanisms has expanded rapidly in the last few decades [13]. Mediation analysis has traditionally been dealt with in the fields of social sciences and psychology, but more recently it has also been addressed in the fields of epidemiology and health. Mediation processes are framed in terms of intermediate variables, the mediator, that helps explain how or why an independent variable influences an outcome. The mediator significantly accounts for variation in an outcome variable [14]. In the context of mediation analysis, there are many advantages in using the structural equation model (SEM) framework [15]. SEM is a multivariate statistical method that involves the estimation of parameters for a system of simultaneous equations [16]. It can be used when extending a mediation process to multiple independent variables, mediators, or outcomes.
In this study, we present SEM-based approach to studying the associations between genetic variants and phenotypes. In general, SEM is a confirmatory approach rather than an exploratory approach, in that the main focus of SEM is to verify that a model established by a researcher beforehand is supported by the data [17], although it may sometimes seem exploratory since the model can be modified to improve its goodness-of-fit. SEM implies a functional relationship expressed through a conceptual model, path diagram, and mathematical equations. Therefore, it is possible to express, through SEM, the causal relationships in a hypothesized mediation process, the simultaneous nature of the indirect and direct effects, the dual role the mediator plays as both a cause and an effect [15,18]. Various types of models can be used in SEM, including regression, path, confirmatory factor, and growth curve models [19].
SEM has been used in various fields, including genetic analysis [20,21]. Procedures for applying SEM after gene-and pathway-based analysis have been proposed [22]. Also, a method for merging GWAS and gene regulatory networks (GRNs) using the SEM framework has been attempted [23]. More recently, the GW-SEM method, which relies on a diagonally weighted least squares (DWLS) estimator, has been proposed to construct SEM on a genomelevel [24]. The SEM-based genome scan approach for metabolic syndrome has been carried out [25]. Additionally, cross-sectional data analysis to provide a path diagram from genetic variants to metabolic syndrome and disease has been studied. SNPs with significant associations with phenotypic traits were as exogenous predictors [8]. Meanwhile, the approaches to analyzing longitudinal data with pedigrees have also been developed [26,27]. However, there are few studies that have applied SEM to analyze multiple SNPs, intermediate phenotypes, diseases simultaneously.
In this respect, we suggest a procedure for constructing an SEM to investigate the relationships of multiple SNPs and multiple intermediate phenotypes with respect to multiple diseases. Unlike current methods that focus only on identifying direct associations between diseases and SNPs, our method introduces the effects of intermediate phenotypes. We define intermediate phenotypes as disease-related phenotypes distinct from the target itself. Moreover, we simultaneously consider multiple diseases in the model. As a result, we generated a quantitative map simultaneously showing the relationships among multiple SNPs, phenotypes, and diseases.
Our detailed goals were to answer the following research questions.

SEM-based modeling procedure of multiple SNPs and multiple phenotypes
Consider an n×c data matrix A with three blocks for n samples: A = [X|Y|Z]. The first block X contains data from n×p SNPs, Y is a block corresponding to n×q intermediate phenotypes, and Z is a block corresponding to n×r diseases, where c = p+q+r. In order to investigate the multiple phenotypes that reflect joint action of multiple SNPs, we developed an SEM-based modeling procedure summarized in the following four steps. Fig 1 shows a summary of the proposed procedure.
The first step was to select the preliminary informative SNPs. To avoid computational complexity and multicollinearity due to the enormous scale of the SNPs in GWAS, non-contributing SNPs to each phenotype were excluded through a single-SNP analysis using regression or logistic regression models. Since intermediate phenotypes and diseases are likely to be heterogeneous according to demographic factors such age and sex, analyses were conducted using demographic factors as covariates. Instead, it is possible to choose the most significant SNP in each linkage disequilibrium (LD) block. In this paper, a set of significant SNPs for each phenotype or disease will be called SNP block. Thus, q+r SNP blocks consisting of informative SNPs are produced in this step.
The second step was to explore the underlying nature of the factors, or latent variables of the selected SNPs for each SNP block. We begin with theoretical assumption that there exists a common latent construct among the significant SNPs. Fig 2 shows a conceptual framework for the factor analysis. We assumed that SNPs having similar genetic functions or located near the same gene are manifested by the underlying latent factors. In the exploratory factor analysis steps, we performed a procedure of varimax rotation, which is one of the most common method of orthogonal rotation, producing uncorrelated factors and simplified interpretation [28]. In order to construct latent variables efficiently, SNPs with a very low communality were excluded. Here, communality was considered to represent the variance of any SNP that was shared with other SNPs via common factors.
The third step was to investigate the effect of intermediate phenotypes on diseases, or the associations among all these phenotypic variables. This association analysis of multiple phenotypes provided a conceptual framework for constructing an SEM structure in the following step.  In the final step, we applied SEM based on the previously constructed latent variables, or joint SNPs, obtained from step 2. The SEM reflected the relationships between all SNPs, their joint action or latent variables, intermediate phenotypes, disease, and morbidity. A typical process for SEM was performed [13]. That is, after identification of the model, we estimated and tested the model.
Because of the distribution issue of the indirect effect, several approaches have been suggested to test the unstandardized indirect effect [29]. Bootstrapping approach uses the empirical distribution of the statistics to approximate the theoretical distribution of the statistics [30], whereas PRODCLIN method is based on the distribution of a product for testing unstandardized indirect effect [31]. An alternative approach is to use maximum likelihood estimate [32]. It is known that likelihood-based confidence interval should capture the asymmetry on the distribution of the indirect effect [29]. Each method has its pros and cons, but we used maximum likelihood method in this study.
The model was modified if necessary. Finally, we identified the best SEM model with the highest goodness-of-fit measure. While there are no golden rules for assessment of model fit, reporting a variety of indices is necessary. Several fit indices including χ 2 , NFI, CFI, GFI and RMSEA can be considered, where NFI is the normed fit index, CFI is the comparative fit index, GFI is the goodness-of-fit index, AGFI is the adjusted goodness-of-fit index, and RMSEA is the root mean square error of approximation [33]. In general, the smaller the χ 2 value, the better the goodness-of-fit to the data. However, the χ 2 statistic is sensitive to the samples size and it nearly tends to reject the model when the sample size is large [34]. The other indices are independent of the sample sizes. Cut-offs indicating a good fit for each index are NFI �0.95, CFI�0.95, GFI�0.95, AGFI�0.90, and RMSEA�0.07 [33].

GWAS data and phenotypic measurements
We analyzed the GWAS data set from the Korea Association Resource (KARE) project. This project was initiated in 2007 in order to undertake a large-scale GWAS. The 10,038 participants were recruited from two community-based cohorts: Ansung, representing a mainly rural community, and Ansan, representing an urban community [35]. After standard quality control procedures for the subjects and SNPs, a total of 8,842 participants and 327,872 SNPs remained. Of them, 4,183 (47.31%) were male and 4,659 (52.69%) were female, with a mean age of 52.22 years (range, 39-70 years). The KARE data include demographic characteristics such as area, sex, and age, as well as multiple phenotypes related to obesity, T2D, and hypertension that were identified based on a review of the participants' medical records. Our research interests focused on T2D and hypertension as mediated through obesity. Table 1 presents the basic statistics for the data.
We defined T2D using fasting blood glucose (FBG) or blood glucose after 120 minutes (OGTT120), with criteria of FBG � 126 mg/dL, OGTT120 � 200 mg/dL, or the use of antidiabetic medication. Hypertension was defined as systolic blood pressure (SBP) � 140 mm Hg, diastolic blood pressure (DBP) � 90 mm Hg, or the use of antihypertensive medication. As intermediate phenotypes, we considered three traits related to obesity: BMI, WC, and SUB. More specifically, BMI reflects overall body adiposity [36], whereas WC reflects abdominal adiposity (for which visceral adipose tissue is largely responsible), and SUB reflects subcutaneous adiposity [36]. Height (cm), body weight (kg), and waist circumference (cm) were measured using standard methods in light clothes. BMI was calculated as the weight divided by the square of height (kg/m 2 ). SUB was measured using a caliper at a vertical fold taken 1 inch below the lowest point of the shoulder blade (mm).
Hypertension was present in 2393 (27.06%) participants, and 836 (9.45%) of participants had T2DM. The average (±SD) values of the obesity-related variables were 24.60 kg/m 2 (±3.12 kg/m 2 ) for BMI, 23.69 mm (± 10.96 mm) for SUB, and 82.7 cm (± 8.79 cm) for WC (Table 1). We analyzed these data using the proposed procedure. From step 1 to step 3, statistical analyses were done using SAS version 9.4 (SAS Corp., Cary, NC, USA). For the SEM analysis in step 4, Lisrel version 9.1 (Scientific software international, Skokie, IL, USA) was used. The threshold for statistical significance was set at α = 0.05, unless otherwise noted.

Step 1: Associations between single SNPs and phenotypes
Among the 327,872 SNPs in the KARE dataset, we selected informative SNPs affecting intermediate phenotypes and diseases. We conducted single-SNP analyses using simple linear regression models and logistic regression models for intermediate phenotypes and diseases, respectively. Sex, age, and area were used as covariates. The SNPs were regarded as statistically significant when they showed a p-value less than 1.0E-05 in the single-SNP analysis. The obesity-related phenotypes (BMI, SUB, and WC) were associated with 19 SNPs: 7 SNPs for BMI, 5 SNPs for SUB, and 7 SNPs for WC. For the dichotomous definition of T2D, a single SNP was determined to be significant, with a p-value less than 1.0E-05. For the dichotomous definition of hypertension, 4 SNPs were identified as significant. Finally, we selected 24 SNPs included in one of the 5 SNP blocks. The five SNP blocks consisted of significant SNPs for Sub, BMI, WC, T2D and Hypertension, respectively. Table 2 presents a list of the selected SNPs for each SNP block.

Step 2: Construction of latent variables for SNPs
During factor analysis, we investigated the communality between SNPs constructing latent variables. Excluded were three SNPs with very low communality (less than 0.3): rs17178527, which was related to BMI; rs16951883, which was related to SUB; and rs17092358, which was related to WC. A separate factor analysis was performed for each SNP block to obtain latent variables. Table 3 shows the factor loadings obtained by varimax rotation and variance explained for each phenotype. By using factor loadings, the four SUB-related SNPs in the first SNP block were constructed as two latent variables (hereafter, FACTOR11 and FACTOR12). The six BMI-related SNPs were also constructed as two latent variables (hereafter, FACTOR21 and FACTOR22), and the seven WC-related SNPs were composed of three latent variables, (hereafter, FACTOR31, FACTOR32, and FACTOR33). These latent variables may reflect the common joint action of SNPs on their respective phenotype. The latent variables for the SNPs related to diabetes (hereafter, FACTOR41) and hypertension (hereafter, FACTOR51 and FAC-TOR52) were also generated in a similar manner. The SNPs classified as corresponding to the same latent variable are shown in square brackets. The two SNPs comprising FACTOR52 have different signs, meaning that they were found to be related in different directions. Both FAC-TOR11 and FACTOR21 consisted of SNPs from the FTO gene, which is known to be associated with fat mass and obesity. The SNPs comprising FACTOR51 were from ATP2B1, which has been reported to be a hypertension-related gene [37].

Step 3. Investigation of the relationships among variables
We investigated the association among phenotypes by adjusting for the effect of covariates including area, sex, and age. Partial correlation coefficients of the considered phenotypes are presented in Table 4. All three intermediate phenotypes were directly correlated with each other (r>0.547). For hypertension, BMI showed the largest correlation coefficient of the obesity measures (r = 0.194), and WC showed the next largest correlation coefficient (r = 0.180). For T2D, WC showed the largest correlation coefficient (r = 0.122), implying that central body fat may be more closely associated with components of metabolic syndrome, such as T2D and hypertension.
Before the main SEM analysis, we conducted a path analysis between the intermediate phenotypes and diseases to determine the most proper model. First, we set T2D as a risk factor for hypertension according to recent studies [5,6]. Second, we established the paths regarding obesity indicators. As mentioned earlier, we considered three obesity-related phenotypes to capture the various dimensions of obesity including genetic influences on each obesity phenotype: WC reflects abdominal adiposity, BMI reflects overall body adiposity; SUB reflects subcutaneous adiposity [36]. WC is generally known to have a greater association with diabetes or hypertension than the other two obesity indicators. Clinical evidences consistently suggested that the association of insulin resistance (which is thought to be the key common pathway of T2D and hypertension) with WC was stronger than the corresponding associations of insulin resistance with BMI or SUB [36,38]. In our analysis, WC also showed the strongest correlation with T2D, and BMI or SUB showed considerable correlation with WC. Therefore, we have established the ordered paths to maximize the clinical relevance, where WC directly affects the diseases, and BMI or SUB affect diseases through WC. We considered the paths in which BMI or SUB directly affect T2D or hypertension independently of WC. As a result, we established the path from obesity to T2D and hypertension that is shown in Fig 3. It displays the coefficient estimates of the path analysis, which indicate the strength of the effect of the independent variable on the dependent variable. Paths with solid lines indicate that effects were statistically significant (p<0.05), whereas two paths with dotted lines showed statistically insignificant effects.

Step 4: Structural equation model of multiple SNPs and multiple phenotypes
The SEM was constructed based on the joint action of the multiple SNPs and multiple intermediate phenotypes identified in the previous steps. By developing an SEM, we reflected the relationships between all SNPs, their joint action or latent variables, obesity-related phenotypes, and their possible morbidity (i.e., their effect on diseases). In hypothesized model construction, we assumed that the error terms were uncorrelated, which was an important assumption for causal inference in performing mediation analysis. Model coefficients of SEM were estimated by maximum likelihood method using the sample covariance matrix of KARE data, with sex, age, and area as covariates, yielding consistent and efficient estimates. The fit of the hypothesized SEM was improved by allowing measurement errors correlated for rs9926289 and rs9939609, rs7193144 and rs8050136, rs7193144 and rs7193144, rs8050136 and rs8050136. Model diagnosis confirmed that the maximum likelihood estimation method was appropriate. Under the proposed SEM, we considered the causal relationships among endogenous latent variables of WC and BMI. Correlations between latent variables for each phenotype were not large enough to cause multicollinearity between latent variables.
Through the modification process, we identified the best SEM model based on various goodness-of-fit measures (χ 2 = 536.52, NFI = 0.997, CFI = 0.998, GFI = 0.995, AGFI = 0.993, RMSEA = 0.012). These indices indicated that the resulting SEM should have high goodnessof-fit. In contrast, in the analysis of genetic factors, information from 17 SNPs was used to predict T2D and information from 21 SNPs was used to predict hypertension. FACTOR11 and FAC-TOR12, which were latent variables consisting of SUB-related SNPs, showed significant direct effects on subcutaneous adiposity. They also had significant indirect effects on overall adiposity, abdominal adiposity, T2D, and hypertension. Similarly, the latent variables from the SNPs related to BMI (FACTOR21 and FACTOR22) affected overall adiposity directly and affected abdominal adiposity indirectly. FACTOR21 also significantly affected T2D and hypertension indirectly. Of the latent variables from the SNPs related to WC, FACTOR31 and FACTOR33 showed significant relationships with abdominal adiposity and T2D, but FACTOR32 did not. Of the WC-related latent variables, only FACTOR33 showed significance for hypertension. One of the latent variables from the hypertension-related SNPs (FACTOR52) was significantly associated with hypertension, but the other (FACTOR51) was not. However, although FAC-TOR51 consisted of SNPs that had significant individual effects on hypertension in the single-SNP analysis, the latent variables were no longer significant when the various factors were considered together.
Integrating these observations, T2D was found to be significantly affected by subcutaneous adiposity, overall adiposity, and abdominal adiposity, and all the latent variables from the  SNPs except FACTOR32. The largest direct effect on T2D came from abdominal adiposity (b = 0.097, t = 5.21). However, including indirect effects, subcutaneous adiposity had the largest total effect (b = 0.123, t = 11.60). Meanwhile, hypertension was directly and indirectly associated with overall adiposity, abdominal adiposity, and T2D. Subcutaneous adiposity significantly affected hypertension indirectly, but not directly. The largest effect on hypertension was obtained for FACTOR52 (b = 0.384, t = 2.37), and overall adiposity showed the second largest direct effect (b = 0.186, t = 6.88). The results were similar when only direct effects were considered.

Discussion
Hypertension and T2D are important public health concerns, as the prevalence of each is increasing worldwide [39,40]. The coexistence of hypertension and T2D dramatically increases the risk (2-to 4-fold) of cardiovascular disease and all-cause death [4]. Although studies have investigated the effects of obesity-related factors on T2D and hypertension separately, few studies have investigated the pathways underlying hypertension through obesityrelated traits.
In this study, we aimed to improve our understanding of the pathways underlying hypertension and T2D driven by genetic variants and obesity-related traits by conducting a multivariate analysis. In order to achieve these goals, we developed an analytical process consisting of four steps that yielded successful results. In step 1, we investigated GWAS variants that affected intermediate phenotypes and disease, the first research goal. In step 2, we found communalities or similarities among SNPs for each phenotype, the second research goal. Step 3, the path analysis of multiple intermediate phenotypes and diseases, suggested plausible associations among traits, the third research goal.
Step 4 enabled us to achieve our final research goal through an SEM analysis of the associations among multiple SNPs, multiple phenotypes, and multiple diseases. Conclusively we developed a quantitative map simultaneously showing the relationships among GWAS variants, intermediate phenotypes, T2D, and hypertension.
This analysis provides insights into the mechanisms underlying T2D and hypertension. Our findings highlight the importance of subcutaneous adiposity and abdominal adiposity, as well as latent variables from SNPs, as driving elements of T2D in the Korean population. The impacts of latent variables of the SNPs, overall adiposity, abdominal adiposity, and T2D on hypertension were also confirmed. The resulting model had high goodness-of-fit measures.