
Conceived and designed the experiments: NY. Performed the experiments: NY. Analyzed the data: NY. Contributed reagents/materials/analysis tools: NY. Wrote the paper: NY NL DZ JL.

The authors have declared that no competing interests exist.

Complex diseases and traits are likely influenced by many common and rare genetic variants and environmental factors. Detecting disease susceptibility variants is a challenging task, especially when their frequencies are low and/or their effects are small or moderate. We propose here a comprehensive hierarchical generalized linear model framework for simultaneously analyzing multiple groups of rare and common variants and relevant covariates. The proposed hierarchical generalized linear models introduce a group effect and a genetic score (i.e., a linear combination of main-effect predictors for genetic variants) for each group of variants, and jointly estimate the group effects and the weights of the genetic scores. This framework includes various previous methods as special cases, can effectively deal with both risk and protective variants in a group, and can simultaneously estimate the cumulative contribution of multiple variants and their relative importance. Our computational strategy is based on extending the standard procedure for fitting generalized linear models in the statistical software R to the proposed hierarchical models, leading to the development of stable and flexible tools. The methods are illustrated with sequence data in gene

Complex diseases and traits are likely influenced by many common and rare genetic variants and environmental factors. Next-generation sequencing technologies have provided unparalleled tools to sequence a large number of individuals, allowing for comprehensive studies of both common and rare variants. However, detecting disease-associated rare variants and common variants of small effects poses unique statistical challenges. We propose here a comprehensive hierarchical generalized linear model framework for simultaneously analyzing multiple groups of rare and common variants and relevant covariates. The proposed hierarchical generalized linear models introduce a group effect and a genetic score for each group of variants, and jointly estimate the group effects and the weights of the genetic scores. This framework includes various previous methods as special cases, can effectively deal with both risk and protective variants in a group, and can simultaneously estimate the cumulative contribution of multiple variants and their relative importance. The methods are illustrated with sequence data in gene

Many common human diseases and complex traits are highly heritable and are believed to be influenced by multiple genetic and environmental factors. Genome-wide association studies (GWAS) are a powerful way to discover disease-associated factors and to investigate the genetic architecture of complex diseases

Next-generation sequencing technologies have provided unparalleled tools to sequence a large number of individuals in candidate genes, exomes, or even the entire genome, allowing for comprehensive studies of both common and rare variants. In addition to the common problems of handling large numbers of variants, however, detecting disease-associated rare variants and common variants of small effects poses unique statistical challenges

Several approaches along this line have been proposed

All the existing methods have been developed to assess only one group of variants at a time. Since common diseases are likely caused by a complex interplay among many genes and environmental factors, however, it is more appropriate to simultaneously model multiple groups of variants and covariates

We propose here a comprehensive hierarchical generalized linear model (GLM) framework for simultaneously analyzing multiple groups of rare and common variants and relevant covariates. The proposed hierarchical GLMs introduce a group effect and a genetic score (i.e., a linear combination of main-effect predictors for genetic variants) for each group of variants, and jointly estimate the group effects and the weights of the genetic scores. This framework includes various previous methods as special cases, and can effectively deal with both risk and protective variants in a group and can simultaneously estimate the cumulative contribution of multiple variants and their relative importance. The methods are illustrated with sequence data in gene
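To make the structure described above concrete, the following is a minimal sketch of a linear predictor in which each group of variants contributes a group effect multiplied by a genetic score, and the genetic score is a weighted linear combination of the variants' main-effect predictors. The function names, argument names, and numeric values are hypothetical and chosen only for illustration; they are not the notation or the software of the paper.

```python
from typing import Dict, Sequence


def genetic_score(genotypes: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted linear combination of main-effect predictors for one group."""
    if len(genotypes) != len(weights):
        raise ValueError("genotypes and weights must have the same length")
    return sum(w * z for w, z in zip(weights, genotypes))


def linear_predictor(
    intercept: float,
    covariates: Sequence[float],
    covariate_effects: Sequence[float],
    group_genotypes: Dict[str, Sequence[float]],
    group_weights: Dict[str, Sequence[float]],
    group_effects: Dict[str, float],
) -> float:
    """Intercept + covariate terms + sum over groups of (group effect * genetic score)."""
    eta = intercept
    eta += sum(b * x for b, x in zip(covariate_effects, covariates))
    for name, effect in group_effects.items():
        eta += effect * genetic_score(group_genotypes[name], group_weights[name])
    return eta


# Illustrative values only (assumed, not taken from the paper).
eta = linear_predictor(
    intercept=0.5,
    covariates=[1.0, 2.0],
    covariate_effects=[0.5, 0.25],
    group_genotypes={"G1": [1, 0, 2], "G2": [0, 1]},
    group_weights={"G1": [1.0, 1.0, -0.25], "G2": [2.0, 1.0]},
    group_effects={"G1": 0.5, "G2": -1.0},
)
print(eta)
```

Under this sketch, the contribution of a single variant is its group effect times its weight, so a negative weight inside a group with a positive group effect corresponds to a variant acting in the opposite direction.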

Suppose that a population-based association study consists of _{1}, ···, _{n}_{k}_{k}_{k}_{k}

We extend the hierarchical generalized linear model (GLM) of Yi and Zhi _{0} is the intercept, _{ij}_{j}_{ij}_{k}_{k}

The common coefficient _{k}_{k}_{k}_{k}_{k}

The mean of the response variable is related to the linear predictor via a link function
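As an illustration of how a link function connects the mean to the linear predictor, the sketch below shows the two cases that match the normal and logistic regressions used later in the paper: the identity link and the logit link. The function names are hypothetical, and the numerically stable form of the inverse logit is an implementation choice made here, not something stated in the source.

```python
import math


def inverse_identity(eta: float) -> float:
    """Identity link (normal regression): the mean equals the linear predictor."""
    return eta


def inverse_logit(eta: float) -> float:
    """Inverse logit link (logistic regression): maps eta to a probability."""
    # Branch on the sign of eta so that exp() never overflows.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def logit(p: float) -> float:
    """Logit link: maps a probability in (0, 1) to the linear-predictor scale."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


print(inverse_identity(1.3), inverse_logit(0.0), logit(0.75))
```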

Our main goal is to estimate the group effects _{k}

The above hierarchical prior assumes that

If the number of groups is not large, the group effects

For the covariate effects

Our hierarchical GLMs include multiplicative parameters, a common coefficient _{k}_{ij}

For the multiplicative model to be useful, we need informative prior distributions on the multiplicative parameters that allow us to distinguish between the group effects and the individual weights. The prior means _{k}_{k}

The prior means

An alternative choice of the prior means is to set all

Our Bayesian hierarchical GLMs can be fitted using Markov chain Monte Carlo (MCMC) algorithms that fully explore the joint posterior distribution by alternately sampling each parameter from its conditional posterior distribution

Our algorithm updates the coefficients

We initialize our iterative algorithm by setting the parameters (^{th} iteration, and ^{−5}).

At convergence of the algorithm, we summarize the inferences using the latest estimates of the coefficients

Our model fitting strategy is based on extending the well-developed IWLS algorithm for fitting classical GLMs to our Bayesian hierarchical GLMs. The IWLS algorithm is executed in the glm function in R (
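For readers unfamiliar with IWLS, the following is a minimal sketch of the classical algorithm for an ordinary logistic regression, written in plain Python. It illustrates only the standard building block that the text says is extended; it does not include the hierarchical priors or the EM steps of the proposed method, and the helper names (`solve`, `iwls_logistic`) and the toy data are hypothetical.

```python
import math
from typing import List


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def iwls_logistic(X: List[List[float]], y: List[int],
                  max_iter: int = 50, tol: float = 1e-8) -> List[float]:
    """Classical IWLS for logistic regression (no priors, no shrinkage)."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
        mu = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [m_ * (1.0 - m_) for m_ in mu]                    # IWLS weights
        z = [e + (yi - m_) / wi                                # working response
             for e, yi, m_, wi in zip(eta, y, mu, w)]
        xtwx = [[sum(wi * r[j] * r[k] for wi, r in zip(w, X)) for k in range(p)]
                for j in range(p)]
        xtwz = [sum(wi * r[j] * zi for wi, r, zi in zip(w, X, z)) for j in range(p)]
        new_beta = solve(xtwx, xtwz)                           # weighted least squares
        change = max(abs(a_ - b_) for a_, b_ in zip(new_beta, beta))
        beta = new_beta
        if change < tol:
            break
    return beta


# Toy data (assumed): intercept column plus one binary predictor.
X = [[1.0, 0.0]] * 4 + [[1.0, 1.0]] * 4
y = [1, 0, 0, 0, 1, 1, 1, 0]
beta_hat = iwls_logistic(X, y)
print(beta_hat)
```

Each iteration forms a working response and weights from the current fit and then solves a weighted least-squares problem, which is the step a Bayesian extension can modify by adding prior information.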

Our hierarchical multiplicative GLMs include various models as special cases. Although less comprehensive, these reduced models can be useful in some situations, and thus can be used as alternative approaches to analysis of multiple groups of rare and common variants. We here consider two types of reduced models. The first ignores the group effects and directly models the main effects of individual variants. Thus, the linear predictor (1) is reduced to

The second alternative approach is to preset the weights of individual variants using the previous methods
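The idea of presetting the weights can be sketched as follows. The simple-sum score gives every variant a weight of one. The frequency-based weights shown here, proportional to 1/sqrt(q(1 − q)) for minor allele frequency q, are only one illustrative choice that up-weights rarer variants; the exact weights used by the previous methods are not reproduced in this text, so this formula and the function names are assumptions.

```python
import math
from typing import List, Sequence


def simple_sum_score(genotypes: Sequence[float]) -> float:
    """Genetic score with all weights preset to 1."""
    return float(sum(genotypes))


def frequency_weights(mafs: Sequence[float]) -> List[float]:
    """Illustrative preset weights: rarer variants receive larger weights."""
    for q in mafs:
        if not 0.0 < q < 1.0:
            raise ValueError("minor allele frequencies must lie in (0, 1)")
    return [1.0 / math.sqrt(q * (1.0 - q)) for q in mafs]


def weighted_sum_score(genotypes: Sequence[float], weights: Sequence[float]) -> float:
    """Genetic score with preset (fixed, not estimated) weights."""
    if len(genotypes) != len(weights):
        raise ValueError("genotypes and weights must have the same length")
    return sum(w * z for w, z in zip(weights, genotypes))


# Assumed example: one individual, three variants coded as minor-allele counts.
genotypes = [1, 0, 2]
mafs = [0.5, 0.1, 0.2]
print(simple_sum_score(genotypes))
print(weighted_sum_score(genotypes, frequency_weights(mafs)))
```

With the weights fixed in advance, each group enters the model through a single score, so only the group effect remains to be estimated.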

Romeo et al.

The top panel: the histogram of the log-transformed plasma levels of triglyceride and the 25th and 75th percentiles (the black dotted lines). The middle panel: the logarithm of the observed count of heterozygotes (Aa) and rare homozygotes (aa) for each variant in the continuous trait analysis. The bottom panel: the logarithm of the observed count of Aa and aa for each variant in the binary trait analysis. The gray dotted lines show the four groups: common non-synonymous, common synonymous, rare non-synonymous, and rare synonymous.

Romeo et al.

We divided the variants into four groups: common non-synonymous, common synonymous, rare non-synonymous, and rare synonymous. We used a minor allele frequency of 1% as the threshold to distinguish between common and rare variants
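The grouping rule can be sketched as below: compute each variant's minor allele frequency from its genotype counts, then combine the 1% threshold with the synonymous/non-synonymous annotation. The function names and example counts are hypothetical, and treating a frequency exactly equal to the threshold as common is an assumption, since the boundary case is not specified in the text.

```python
def minor_allele_frequency(n_AA: int, n_Aa: int, n_aa: int) -> float:
    """Minor allele frequency from counts of the three genotypes."""
    n_alleles = 2 * (n_AA + n_Aa + n_aa)
    if n_alleles == 0:
        raise ValueError("no genotyped individuals")
    freq_a = (n_Aa + 2 * n_aa) / n_alleles
    return min(freq_a, 1.0 - freq_a)


def assign_group(maf: float, is_nonsynonymous: bool, threshold: float = 0.01) -> str:
    """Label a variant as common/rare and synonymous/non-synonymous."""
    # Assumption: a variant exactly at the threshold is treated as common.
    frequency_class = "common" if maf >= threshold else "rare"
    function_class = "non-synonymous" if is_nonsynonymous else "synonymous"
    return f"{frequency_class} {function_class}"


# Assumed genotype counts for two hypothetical variants.
print(assign_group(minor_allele_frequency(980, 19, 1), is_nonsynonymous=True))
print(assign_group(minor_allele_frequency(995, 5, 0), is_nonsynonymous=False))
```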

We first analyzed the data using the hierarchical multiplicative GLMs (Equations 1–3) with the proposed hierarchical prior distributions (Equations 4 and 5). For comparisons, we then used three alternative methods: 1) Setting all the scale parameters

All the analyses simultaneously fitted all the non-genetic variables (i.e., race, age, sex, and BMI), the two common non-synonymous variants (i.e., 8155_T266M and 8191_R278Q) and the three groups of variants. We used a normal regression and a logistic regression for the continuous and the binary traits, respectively. We set the prior means


The top and bottom panels are for the continuous and binary traits, respectively. The left panel is for the covariates, the two common non-synonymous variants and the three group effects (G1: common synonymous; G2: rare non-synonymous; G3: rare synonymous). The right panel is for the adjusted main effects (the gray dotted line shows the two groups G1 and G2). The points, short lines and numbers at the right side represent estimates of effects, ±2 standard errors, and

The top and bottom panels are for the continuous and binary traits, respectively. The left panel is for the covariates, the two common non-synonymous variants and the three group effects (G1: common synonymous; G2: rare non-synonymous; G3: rare synonymous). The right panel is for the adjusted main effects (the gray dotted line shows the two groups G1 and G2). The points, short lines and numbers at the right side represent estimates of effects, ±2 standard errors, and

The top and bottom panels are for the continuous and binary traits, respectively. The left panel is for the covariates, the two common non-synonymous variants and the three group effects (G1: common synonymous; G2: rare non-synonymous; G3: rare synonymous). The right panel is for the adjusted main effects (the gray dotted line shows the two groups G1 and G2). The points, short lines and numbers at the right side represent estimates of effects, ±2 standard errors, and

The top and bottom panels are for the continuous and binary traits, and the left and right panels are for Simple-Sum and Weighted-Sum methods, respectively. The points, short lines and numbers at the right side represent estimates of effects, ±2 standard errors, and

The left and right panels are for the continuous and binary traits, respectively. The points, short lines and numbers at the right side represent estimates of effects, ±2 standard errors, and

The group effects in our model should be interpreted with caution; a positive group effect does not necessarily mean that the variants increase the phenotype, because for some variants the weights can be estimated to be negative (for example, the rare variant 1313_E40K in our analyses). The right panel of


We used simulations to validate the proposed models and algorithm and to study the properties of the method. Most published simulation studies of rare variants have generated genotypes under a population genetics model for the propagation of rare variants. The best approach, however, is to take real sequence data obtained from many individuals and simulate phenotypes based on the variants in those sequences, making assumptions only about the genetic effects of the variants

We evaluated some factors that may affect the performance of the methods:


| | | Group 1 | Group 2 | Group 3 |
| Number of variants | | 10 (11) | 26 (16) | 44 (34) |
| Scenario | a | | | |
| | b | | | |
| | c | | | |
| | d | | | |
| | e | | | |
| | f | | | |

| | | Group 1 | Group 2 | Group 3 | Group 4 | Group 5 | Group 6 |
| Number of variants | | 5 (5) | 5 (6) | 13 (8) | 13 (8) | 22 (17) | 22 (17) |
| Scenario | a | | | | | | |
| | b | | | | | | |
| | c | | | | | | |
| | d | | | | | | |

We simulated a continuous and a binary phenotype. As in our real data analyses, we simultaneously fitted all the non-genetic variables (i.e., race, age, sex, and BMI), the two common non-synonymous variants (i.e., 8155_T266M and 8191_R278Q) and the grouped variants. We assumed the non-genetic coefficients and the additive effects of 8155_T266M and 8191_R278Q to be their estimated values in the continuous trait analysis (see the top panel of _{i}_{i}
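A minimal sketch of how a continuous and a binary phenotype can be simulated from a given linear predictor is shown below. The generating mechanisms used here (normal residuals for the continuous trait and a Bernoulli draw with logistic probability for the binary trait) are assumptions made for illustration; the exact simulation settings of the paper are not reproduced in this text, and the function name and inputs are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple


def simulate_phenotypes(
    eta: Sequence[float],
    residual_sd: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Tuple[List[float], List[int]]:
    """Simulate one continuous and one binary phenotype per individual."""
    rng = rng or random.Random()
    # Continuous trait: linear predictor plus normal noise (assumed mechanism).
    continuous = [e + rng.gauss(0.0, residual_sd) for e in eta]
    # Binary trait: Bernoulli draw with logistic probability (assumed mechanism).
    binary = [1 if rng.random() < 1.0 / (1.0 + math.exp(-e)) else 0 for e in eta]
    return continuous, binary


# Assumed linear predictors for five hypothetical individuals.
eta = [-1.0, -0.5, 0.0, 0.5, 1.0]
y_cont, y_bin = simulate_phenotypes(eta, residual_sd=1.0, rng=random.Random(1))
print(y_cont, y_bin)
```

Passing a seeded `random.Random` instance makes each replicate reproducible, which is convenient when many replicated datasets are generated.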

For each situation, 1000 replicated datasets were simulated. We calculated the frequencies of each effect estimated as significant at the threshold levels of
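The frequency of significant results across replicates can be computed as sketched below; it estimates power when the effect is truly non-zero and the type I error rate when it is zero. The function names and the example p-values are hypothetical. The Monte Carlo standard error is included as a reminder of the precision that a finite number of replicates provides.

```python
import math
from typing import Sequence


def rejection_rate(p_values: Sequence[float], alpha: float) -> float:
    """Proportion of replicates in which the effect is significant at level alpha."""
    if not p_values:
        raise ValueError("at least one replicate is required")
    return sum(1 for p in p_values if p < alpha) / len(p_values)


def monte_carlo_se(rate: float, n_replicates: int) -> float:
    """Binomial standard error of an estimated rejection rate."""
    return math.sqrt(rate * (1.0 - rate) / n_replicates)


# Assumed p-values from four hypothetical replicates.
p_values = [0.001, 0.02, 0.2, 0.004]
print(rejection_rate(p_values, alpha=0.01))
# Precision of a rate near the nominal 0.01 level with 1000 replicates.
print(monte_carlo_se(0.01, 1000))
```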

Power or Type I error rate for the proposed method (○), Yi and Zhi (×), Simple-Sum (Δ) and All-Variants (+) under the threshold level of 0.01. X8155_T266M and X8191_R278Q are the two common non-synonymous variants, and G1, G2 and G3 are the three group effects (G1: common synonymous; G2: rare non-synonymous; G3: rare synonymous). Red and blue symbols represent results for continuous and binary traits, respectively. The dashed line is the nominal 0.01 level.

Power or Type I error rate for the proposed method (○), Yi and Zhi (×), Simple-Sum (Δ) and All-Variants (+) under the threshold level of 0.01. X8155_T266M and X8191_R278Q are the two common non-synonymous variants, and G1, G2 and G3 are the three group effects (G1: common synonymous; G2: rare non-synonymous; G3: rare synonymous). Red and blue symbols represent results for continuous and binary traits, respectively. The dashed line is the nominal 0.01 level.

Power or Type I error rate for the proposed method (○), Yi and Zhi (×), Simple-Sum (Δ) and All-Variants (+) under the threshold level of 0.01. X8155_T266M and X8191_R278Q are the two common non-synonymous variants, and G1–G6 are the six group effects. Red and blue symbols represent results for continuous and binary traits, respectively. The dashed line is the nominal 0.01 level.

In all the simulation scenarios, the proposed method and the extension of Yi and Zhi

For the groups in which all variants affected the traits in the same direction, the summation of the additive-effect predictors could provide a useful genetic score of these variants, and thus the simple-sum method had reasonable power to detect the group effect (see

For the groups in which 60% of the variants increased disease risk and the others were protective, the summation of the additive-effect predictors provided an inefficient genetic score for summarizing the information in the variants, and thus the simple-sum method had low power to detect the association (see
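The loss of information in the simple sum can be seen in a small deterministic example. Below, a hypothetical group has ten variants, of which six increase risk and four are protective, all with unit effect size (an assumption made only to keep the arithmetic simple). Three carriers with the same simple-sum score have quite different true genetic values.

```python
from typing import Sequence


def simple_sum(genotypes: Sequence[int]) -> int:
    """Unweighted count of minor alleles across the group."""
    return sum(genotypes)


def signed_score(genotypes: Sequence[int], effects: Sequence[float]) -> float:
    """Genetic value when each variant has its own signed effect."""
    return sum(e * z for e, z in zip(effects, genotypes))


# Assumed effects: 60% risk (+1) and 40% protective (-1).
effects = [1.0] * 6 + [-1.0] * 4

carriers = {
    "two risk variants": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "one risk, one protective": [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "two protective variants": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
}

for label, genotypes in carriers.items():
    print(label, simple_sum(genotypes), signed_score(genotypes, effects))
```

Because the simple sum cannot distinguish these carriers, its association with the phenotype is diluted, whereas a score with estimated signed weights can separate them.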

We also evaluated power and type I error at several different levels (e.g.,

We have proposed here a Bayesian hierarchical generalized linear model framework for simultaneously analyzing multiple groups of rare and common variants and relevant covariates. Since complex diseases and traits are likely influenced by multiple genetic variants and environmental factors, the joint analysis of multiple groups of genetic variants can improve the power to detect causal effects and lead to a better understanding of the genetic architecture of diseases. The proposed hierarchical generalized linear models introduce a group effect and a genetic score for each group of variants, and jointly estimate the group effects and the weights of the genetic scores. This can produce ‘optimal’ weights for different variants based on their contributions to the phenotype, yielding an effective summary of the information across variants. The simulation studies show that the proposed method can consistently provide reasonable power even in the presence of both risk and protective variants in a group, and has better power than existing approaches even when all variants act in the same direction. Application of the method to a large published dataset on resequencing of the gene

In addition to the properties described above, our method has several notable features. First, the proposed method can simultaneously estimate the group effects of multiple groups of variants and the individual effects of the variants, allowing us not only to identify significant genes (or groups of variants) but also to assess the relative importance of single variants. Second, our hierarchical model includes various existing methods for rare variants as special cases. This shows that the proposed method is more general than the existing methods, and allows us to conveniently analyze data in different ways. Third, any external information about variants, for example, functional predictions, can be easily incorporated into our hierarchical model by specifying the prior means of the weights for the variants. By doing so, our approach has the additional advantage of accounting for uncertainties about the prior assumptions. Fourth, our approach is based on the generalized linear model framework and thus can deal with various types of continuous and discrete phenotypes and covariates, and can fit any generalized linear model. Finally, the proposed algorithm extends the standard procedure for fitting classical generalized linear models in the general statistical package R to our Bayesian model, leading to the development of stable and flexible software.

Our approach is highly extensible; we have planned several extensions to the proposed method, some of which have been initially implemented in our software BhGLM. The key to our approach is the use of hierarchical prior distributions for the weights and the group effects, so that these multiplicative parameters are identifiable and can be simultaneously estimated from the data. We have proposed to use the hierarchical expression of the half-Cauchy distribution with the innovation of introducing both group- and variable-specific parameters. The half-Cauchy prior is an excellent default choice for many problems
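As a generic illustration of what a hierarchical expression of a half-Cauchy distribution looks like, the sketch below draws a half-Cauchy variate as the absolute value of a normal variable divided by the square root of an independent chi-square variable with one degree of freedom. This is a standard two-level representation and is shown only as an assumption-laden example; it is not the specific group- and variable-specific parameterization of the paper, and the function name and scale value are hypothetical.

```python
import math
import random
from typing import Optional


def sample_half_cauchy(scale: float, rng: Optional[random.Random] = None) -> float:
    """Draw from half-Cauchy(0, scale) via a two-level normal / chi-square construction."""
    rng = rng or random.Random()
    # Level 1: a normal variable with standard deviation equal to the scale.
    numerator = abs(rng.gauss(0.0, scale))
    # Level 2: a chi-square(1) variable, i.e. Gamma(shape=0.5, scale=2).
    chi_square_1 = rng.gammavariate(0.5, 2.0)
    return numerator / math.sqrt(chi_square_1)


rng = random.Random(2024)
draws = sorted(sample_half_cauchy(1.0, rng) for _ in range(20000))
sample_median = draws[len(draws) // 2]
# The theoretical median of half-Cauchy(0, scale) equals the scale itself.
print(sample_median)
```

The heavy tail of this distribution allows a few large effects to escape shrinkage while most draws stay small.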

Although demonstrated with only several groups of variants, our method can be adapted to deal with large-scale sequencing data involving thousands of exomes or candidate genes. For these high-dimensional settings, we need to modify the prior distributions of the group effects and the computational algorithm. We can place a shrinkage prior on the inverse scale in the gamma prior of

Our third extension could incorporate external gene- or pathway-level information into the hierarchical model. Candidate gene or pathway studies usually consist of data at different levels, i.e., genetic variants within multiple candidate genes or pathways which may be functionally related

Our fourth extension could incorporate genetic interactions (gene-gene and gene-environment interactions) into the model. Just as interactions must be considered in standard GWA studies

The proposed hierarchical generalized linear models may provide efficient tools for disease risk prediction and personalized medicine. GWA studies have raised expectations for predicting individual susceptibility to common diseases using genetic variants

The EM-IWLS Algorithm for Fitting Hierarchical Generalized Linear Models.

(DOCX)

We would like to thank Drs. Jonathan Cohen and Helen Hobbs for access to the Dallas Heart Study dataset. We are grateful to three reviewers for their constructive comments, which improved the previous version of the manuscript.