A framework for integrating directed and undirected annotations to build explanatory models of cis-eQTL data

doi:10.1371/journal.pcbi.1007770

Fig 1.

Illustration of BAGEA model components.

(A) The core components of the BAGEA model in the summary statistics formulation. Observed variables are in squares while estimated variables are circled. Given are z_j, the eQTL z-scores for gene j, as well as the LD matrix Σ_j, defining the correlation between summary statistics. Further, z-scores are influenced by the true eQTL effects b_j. These effects in turn depend on directed and undirected annotations, V^j and F^j respectively. While undirected annotations can cover regions of any size, directed annotations have the same size as the genomic variants themselves. The impact of annotations on b_j is estimated from the data via ω and ν. (B) An example of the modeling of different priors of elements of ω using meta-annotations via υ variable vectors. We assume that directed annotations are available for nine annotations, which were derived from three tissues (Liver, Blood and Brain) via three assay types (DHS, H3K27ac and H3K4me3). It is reasonable to assume that for a given eQTL study, particular tissues or cell types are more relevant than others. We model this by introducing a variable υ for each tissue (or cell type) that affects the prior distribution of only those elements of ω that are derived from this tissue, e.g. υ_Liver only affects elements of ω tied to experiments performed in liver. We model different priors for the various assay types analogously. Shown is the resulting network of influences of the variables υ_tissue and υ_assay on ω. (We use the actual group names as indices here, while in the main text, elements of the υ’s and ω are indexed by natural numbers.)
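The influence network in panel B can be written down as a small data structure. The following is a minimal sketch, not the BAGEA implementation: the function names are hypothetical, and the rule that combines a tissue modifier and an assay modifier into one prior variance (a plain product) is an assumption made only for illustration, since the caption does not state it.

```python
from itertools import product

TISSUES = ["Liver", "Blood", "Brain"]
ASSAYS = ["DHS", "H3K27ac", "H3K4me3"]

# One directed annotation (one element of omega) per tissue-by-assay experiment.
annotations = list(product(TISSUES, ASSAYS))


def influence_edges(annotations):
    """Return (upsilon name, omega index) pairs of the influence network.

    Each element of omega is influenced by exactly two upsilon variables:
    the one for its tissue and the one for its assay type.
    """
    edges = []
    for i, (tissue, assay) in enumerate(annotations):
        edges.append((f"upsilon_{tissue}", i))
        edges.append((f"upsilon_{assay}", i))
    return edges


def prior_variances(annotations, upsilon_tissue, upsilon_assay):
    """ASSUMPTION: combine the two modifiers by multiplication.

    This combination rule is illustrative only and is not taken from the text.
    """
    return [upsilon_tissue[t] * upsilon_assay[a] for t, a in annotations]


edges = influence_edges(annotations)
# Indices of the omega elements that upsilon_Liver influences.
liver_targets = [i for name, i in edges if name == "upsilon_Liver"]
```

With this layout, changing a single tissue modifier only changes the priors of the three ω elements derived from that tissue, which is the behaviour the caption describes for υ_Liver.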


Fig 2.

Gene expression variance can be partially explained by directed genome annotations.

The BAGEA model was fitted on genes in the training set (all genes on chromosomes 1 through 15) using monocyte eQTL data on genes with a top nominal p-value below 10⁻¹⁰, and with ExPecto-derived directed annotations. ExPecto includes 2002 annotations in total, of which one of two subsets was used: 253 annotations derived from histone and DHS assays in blood-related cell types (Blood), or, alternatively, 690 annotations derived from TF ChIP-Seq (TF). For each gene j in the test set (all genes on chromosomes 16 through 22 with a top nominal p-value below 10⁻¹⁰), we calculated the directed predictor of expression. As a measure of a predictor’s size, we use its squared magnitude S_j. To evaluate the predictor’s performance, we calculated MSE^dir, the mean squared error (MSE) when predicting gene expression y_j from the directed predictor. To estimate what the smallest attainable MSE^dir would be, we estimated the additive genetic variance in cis via Haseman-Elston regression per gene. (A) The relationship between the MSE of the predictor and its squared magnitude. We sorted results by predictor size S_j and averaged within a sliding window containing 25% of genes, with a step size of 5% of the data. Averaged Directed Predictor Size: the mean value of S_j per window, on the horizontal axis. Averaged Directed MSE (MSE^dir): the average MSE^dir of genes falling into the window, on the vertical axis. The 95% confidence interval for each window was derived by bootstrapping. Most variance is explained by genes in the top quartile when ranked by S_j. (B) The relationship between MSE^dir and the genetic variance for genes in the top quartile when ranked by S_j. Genetic Variance: the estimated additive genetic variance in cis, on the horizontal axis. Directed MSE (MSE^dir): on the vertical axis. 95% confidence intervals for the means of both MSE^dir and the genetic variance are represented as the corners of the red diamond (i.e. the confidence interval for the average MSE^dir is given by the upper and lower corners, whereas the confidence interval for the average genetic variance is given by the right and left corners). A linear regression is plotted as the blue line, with its 95% confidence interval shown in grey.
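The sliding-window averaging used for panel A can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function name and the rounding of window width and step to whole genes are choices made here, and the bootstrap confidence intervals are left out.

```python
def sliding_window_means(sizes, mses, window_frac=0.25, step_frac=0.05):
    """Average predictor size and MSE within windows of genes sorted by size.

    sizes: per-gene squared predictor magnitudes (S_j).
    mses:  per-gene directed MSE values, in the same gene order.
    Returns a list of (mean size, mean MSE) pairs, one per window.
    """
    if len(sizes) != len(mses):
        raise ValueError("sizes and mses must have the same length")
    n = len(sizes)
    # Gene indices sorted by increasing predictor size.
    order = sorted(range(n), key=lambda k: sizes[k])
    # ASSUMPTION: window width and step are rounded to whole numbers of genes.
    width = max(1, round(window_frac * n))
    step = max(1, round(step_frac * n))
    windows = []
    for start in range(0, n - width + 1, step):
        idx = order[start:start + width]
        mean_size = sum(sizes[k] for k in idx) / width
        mean_mse = sum(mses[k] for k in idx) / width
        windows.append((mean_size, mean_mse))
    return windows


# Toy, made-up numbers purely to show the call; not data from the study.
toy_sizes = [float(k) for k in range(20)]
toy_mses = [-s for s in toy_sizes]
toy_windows = sliding_window_means(toy_sizes, toy_mses)
```

Because consecutive windows overlap heavily (25% width against a 5% step), neighbouring points on the resulting curve are strongly correlated, which is worth keeping in mind when reading the per-window confidence intervals.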


Fig 3.

BAGEA, fitted on monocyte eQTL data, selects relevant epigenetic marks and increases directional effect sizes for SNPs close to a TSS.

Parameter estimates when applying BAGEA to monocyte eQTL data, using as directed annotations the histone and DHS ExPecto predictions derived from blood-related cell types (i.e. Blood from Fig 2). (A) For each chromatin assay type, BAGEA models an assay variance modifier that captures the extent to which that assay type is predictive of gene expression. Shown are the square roots of the modifiers for the assay types with the ten highest variance modifiers (out of 17 assay types in total). In the BAGEA model, DHS, H3K27ac and H3K4me3 assays have the largest modifiers. (B) For each cell type, BAGEA models a cell type variance modifier, similar to the assay variance modifier in panel A. Shown are the square roots of the modifiers for the cell types with the ten highest variance modifiers (out of 61 cell types). In the BAGEA model, CD14-positive cells have the largest modifiers. (C) BAGEA reveals the experiments underlying the directed annotations that were most predictive of gene expression. Assay Type x Cell Type: each experiment is a particular assay type performed in a particular cell type. Effect Size (ω_i, for experiment i): the BAGEA-estimated effect on gene expression. Shown are the ten largest directed annotation effect sizes. In the BAGEA model, the experiments using DHS, H3K27ac and H3K4me3 with CD14-positive cells have the largest effect sizes. We also see that most of the 253 annotations are estimated to have an effect close to zero. (D) Shown is the estimated distance modifier of the directed component. We see a characteristic peak around the TSS, implying that the directed annotations are upweighted close to the TSS.


Fig 4.

Directed annotations partially explain gene expression variance in GTEx.

The BAGEA model was fit using various GTEx eQTL data (supplemented with GEUVADIS eQTL data) and with ExPecto-derived directed annotations on genes in the training set (chr1, …, chr15) with a top nominal p-value < 10⁻⁷. ExPecto includes 2002 annotations in total, of which either 1187 histone and DHS annotations from Roadmap (Roadmap) or 690 non-histone ChIP-Seq annotations from ENCODE (TF) were used. For the Roadmap annotation set, we enforced structure on the priors of ω by using the meta-annotations available for cell type and assay type (group-lasso), while for the TF annotation set, each ω_i parameter is controlled by its individual υ_i parameter (lasso). For each gene j in the test set (chr16, …, chr22 and top nominal p-value < 10⁻⁷), we calculated an approximate version of S_j, the squared magnitude of the directed predictor, where the approximation uses external LD information. Further, we calculated an approximate version of MSE^dir, the mean squared error (MSE) when predicting gene expression y_j from the directed predictor. (A) Displayed is the average (approximated) MSE^dir across all genes for each GTEx experiment and annotation subset. 95% confidence intervals were computed by bootstrap sampling. (B) For each GTEx experiment and annotation subset, we sorted results by predictor size S_j and averaged within the top quartile. Displayed is the relationship between the MSE of the predictor and its mean squared magnitude S_j. Averaged S_j, top quartile: the mean value of the directed predictor size S_j in the top quartile, on the horizontal axis. Averaged Directed MSE (MSE^dir): the average MSE^dir of genes falling into the top quartile in terms of S_j, on the vertical axis. The 95% confidence interval for each window was derived by bootstrap sampling. We see that the average squared magnitude S_j is of similar size to the gains in directed MSE, suggesting that BAGEA does not substantially overfit.
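The top-quartile averaging and the bootstrap confidence intervals mentioned in this caption could look roughly like the sketch below. The caption only says "bootstrap sampling", so the percentile bootstrap, the number of resamples and the fixed seed are assumptions chosen here for illustration, and both function names are hypothetical.

```python
import random


def top_quartile(sizes, mses):
    """Keep the genes in the top quartile when ranked by predictor size S_j."""
    order = sorted(range(len(sizes)), key=lambda k: sizes[k], reverse=True)
    keep = order[:max(1, len(order) // 4)]
    return [sizes[k] for k in keep], [mses[k] for k in keep]


def bootstrap_ci_mean(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean.

    ASSUMPTION: a plain percentile bootstrap over genes; the source does not
    specify which bootstrap variant was used.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample genes with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lower = means[int((1 - level) / 2 * n_boot)]
    upper = means[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lower, upper


# Toy, made-up numbers purely to show the calls; not data from the study.
toy_sizes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
toy_mses = [-1.0, -2.0, -3.0, -4.0, -5.0, -6.0, -7.0, -8.0]
top_sizes, top_mses = top_quartile(toy_sizes, toy_mses)
ci_low, ci_high = bootstrap_ci_mean(toy_mses)
```

Resampling genes rather than individual samples treats each gene's MSE as one observation, which matches the per-gene quantities the caption describes.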


Fig 5.

Model fit for GTEx summary statistics selects directional annotations mainly from biologically consistent cell types.

Shown here are various parameter estimates from fitting 13 different GTEx eQTL summary statistics datasets (supplemented with GEUVADIS eQTL data) using histone and DHS ExPecto predictions derived from Roadmap (1187 annotations). (A) BAGEA reveals the experiments underlying the directed annotations that are most predictive of gene expression. GTEx x Roadmap (Rm): each GTEx eQTL dataset highlights particular Roadmap annotations. Shown here are the 10 largest positive effect sizes across all eQTL and annotation pairings. Effect Size: the estimate of ω_i for experiment i. (B) For each chromatin assay type, BAGEA models an assay variance modifier that expresses the extent to which that assay type is predictive of gene expression. Shown here is the distribution of the square roots of the assay variance modifier for any given assay type across all 13 GTEx eQTL datasets. Results are sorted by the maximal value achieved for each assay type, and only the 10 highest-scoring assay types are shown. We see that DNase.all.peaks and H3K27ac annotations dominate. DNase.fdr0.01.peaks was prioritized in Lung tissue, which had the lowest value for DNase.all.peaks among all experiments. The highest value for DNase.all.peaks was achieved in Fibroblast, an experiment that also showed low average MSE^dir.
