Multi-scale inference of genetic trait architecture using biologically annotated neural networks

doi:10.1371/journal.pgen.1009754

Fig 1.

Biologically annotated neural networks (BANNs) allow for efficient multi-scale genotype-phenotype analyses in a unified probabilistic framework by leveraging the hierarchical nature of enrichment studies to define network architecture.

(A) The BANNs framework requires an N × J matrix of individual-level genotypes X = [x₁, …, x_J], an N-dimensional phenotypic vector y, and a list of G predefined SNP-sets. In this work, SNP-sets are defined as genes and intergenic regions (between genes) given by the NCBI’s Reference Sequence (RefSeq) database in the UCSC Genome Browser [50]. (B) A partially connected Bayesian neural network is constructed based on the annotated SNP groups. In the first hidden layer, only SNPs within the boundary of a gene are connected to the same node. Similarly, SNPs within the same intergenic region between genes are connected to the same node. Completing this specification for all SNPs gives the hidden layer the natural interpretation of being the “SNP-set” layer. (C) The hierarchical nature of the network is represented as a nonlinear regression model. The corresponding weights in both the SNP (θ) and SNP-set (w) layers are treated as random variables with biologically motivated sparse prior distributions. Posterior inclusion probabilities PIP(j) ≡ Pr[θ_j ≠ 0 | y, X] and PIP(g) ≡ Pr[w_g ≠ 0 | y, X, θ_g] summarize associations at the SNP and SNP-set level, respectively. The BANNs framework uses variational inference for efficient network training and incorporates nonlinear processing between network layers for accurate estimation of phenotypic variance explained (PVE).
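The partial connectivity described in panel (B) can be illustrated with a small sketch. This is not the authors' implementation: the function names `build_mask` and `forward`, the ReLU activation, and the omission of bias terms are all assumptions made for illustration, and the variational inference step is not shown.

```python
import numpy as np

def build_mask(snp_to_set, n_sets):
    """Return a J x G 0/1 matrix; entry (j, g) is 1 when SNP j belongs to SNP-set g.

    `snp_to_set` is a hypothetical length-J array of SNP-set indices, standing in
    for the gene / intergenic-region annotation described in the caption.
    """
    snp_to_set = np.asarray(snp_to_set)
    mask = np.zeros((snp_to_set.size, n_sets))
    mask[np.arange(snp_to_set.size), snp_to_set] = 1.0
    return mask

def forward(X, theta, w, mask, activation=lambda z: np.maximum(z, 0.0)):
    """Forward pass of a partially connected network with one hidden layer.

    X: N x J genotypes, theta: length-J SNP weights, w: length-G SNP-set weights.
    The activation is a placeholder (ReLU); biases are left out for brevity.
    """
    # Each SNP feeds only the hidden node of its own SNP-set.
    hidden = activation(X @ (mask * theta[:, None]))  # N x G
    return hidden @ w                                 # length-N prediction

# Three SNPs: the first two share a SNP-set, the third sits in its own set.
mask = build_mask([0, 0, 1], n_sets=2)
y_hat = forward(np.eye(3), np.array([1.0, 2.0, 3.0]), np.array([1.0, 1.0]), mask)
```

Encoding the annotation as a mask keeps the hidden layer interpretable: column g of the masked weight matrix contains only the weights of SNPs annotated to SNP-set g.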

Fig 2.

Receiver operating characteristic (ROC) curves comparing the performance of the BANNs (red) and BANN-SS (black) models with competing SNP and SNP-set mapping approaches in simulations (British cohort).

Here, quantitative traits are simulated to have broad-sense heritability of H² = 0.6 with contributions from additive effects only (i.e., ρ = 1). We show power versus false positive rate for two different trait architectures: (A, B) sparse where only 1% of SNP-sets are enriched for the trait; and (C, D) polygenic where 10% of SNP-sets are enriched. We set the number of causal SNPs with nonzero effects to be 1% and 10% of all SNPs located within the enriched SNP-sets, respectively. To derive results, the full genotype matrix and phenotypic vector are given to the BANNs model and all competing methods that require individual-level data. For the BANN-SS model and other competing methods that take GWA summary statistics, we compute standard GWA SNP-level effect sizes and P-values (estimated using ordinary least squares). (A, C) Competing SNP-level mapping approaches include: CAVIAR [45], SuSiE [46], and FINEMAP [44]. The software for SuSiE requires an input ℓ which fixes the maximum number of causal SNPs in the model. We display results when this input number is high (ℓ = 3000) and when this input number is low (ℓ = 10). (B, D) Competing SNP-set mapping approaches include: RSS [26], PEGASUS [25], GBJ [27], SKAT [21], GSEA [43], and MAGMA [23]. Note that the upper limit of the x-axis has been truncated at 0.1. All results are based on 100 replicates (see S1 Text).
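As a rough illustration of how a power versus false positive rate curve of this kind can be computed from per-SNP (or per-SNP-set) scores, consider the sketch below. The function name `roc_points` and its arguments are hypothetical, ties between scores are not treated specially, and the input is assumed to contain at least one causal and one non-causal element.

```python
import numpy as np

def roc_points(scores, is_causal, max_fpr=0.1):
    """Return (false positive rate, power) pairs, truncated at `max_fpr`.

    scores: one value per SNP or SNP-set, larger meaning stronger evidence
            (e.g. a PIP, or -log10 of a P-value).
    is_causal: boolean truth labels from the simulation.
    """
    scores = np.asarray(scores, dtype=float)
    is_causal = np.asarray(is_causal, dtype=bool)
    order = np.argsort(-scores, kind="stable")   # strongest evidence first
    hits = is_causal[order]
    power = np.cumsum(hits) / hits.sum()         # true positive rate
    fpr = np.cumsum(~hits) / (~hits).sum()       # false positive rate
    keep = fpr <= max_fpr                        # mirror the truncated x-axis
    return fpr[keep], power[keep]

fpr, power = roc_points([0.9, 0.8, 0.1, 0.2], [True, False, True, False], max_fpr=0.5)
```

Each point corresponds to declaring the top-k ranked elements significant; averaging such curves over replicates would be one way to summarize many simulations.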

Fig 3.

Scatter plots showing how the integrative neural network training procedure enables the identification of associated SNPs and enriched SNP-sets in simulations (British cohort).

Quantitative traits are simulated to have broad-sense heritability of H² = 0.6 with contributions from additive effects only (i.e., ρ = 1). We consider two different trait architectures: (A, B) sparse where only 1% of SNP-sets are enriched for the trait; and (C, D) polygenic where 10% of SNP-sets are enriched. We set the number of causal SNPs with nonzero effects to be 1% and 10% of all SNPs located within the enriched SNP-sets, respectively. Results are shown comparing the posterior inclusion probabilities (PIPs) derived by the BANNs model on the x-axis and (A, C) SuSiE [46] and (B, D) RSS [26] on the y-axis, respectively. Here, SuSiE is fit while assuming a high maximum number of causal SNPs (ℓ = 3000). The blue horizontal and vertical dashed lines are marked at the “median probability criterion” (i.e., PIPs for SNPs and SNP-sets greater than 0.5) [57]. True positive causal variants used to generate the synthetic phenotypes are colored in red, while non-causal variants are given in grey. SNPs and SNP-sets in the top right quadrant are selected by both approaches, while elements in the bottom right and top left quadrants are uniquely identified by BANNs and SuSiE/RSS, respectively. Each plot combines results from 100 simulated replicates (see S1 Text).
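The quadrant reading of these scatter plots can be expressed as a short sketch. The function name `quadrant_labels` and the label strings are hypothetical; the strict "greater than 0.5" comparison follows the median probability criterion as stated in the caption.

```python
import numpy as np

def quadrant_labels(pip_banns, pip_other, threshold=0.5):
    """Label each SNP or SNP-set by which method selects it.

    pip_banns: PIPs from BANNs (x-axis); pip_other: PIPs from the comparison
    method, SuSiE or RSS (y-axis).
    """
    by_banns = np.asarray(pip_banns, dtype=float) > threshold
    by_other = np.asarray(pip_other, dtype=float) > threshold
    labels = np.full(by_banns.shape, "neither", dtype=object)
    labels[by_banns & by_other] = "both"          # top right quadrant
    labels[by_banns & ~by_other] = "BANNs only"   # bottom right quadrant
    labels[~by_banns & by_other] = "other only"   # top left quadrant
    return labels

labels = quadrant_labels([0.9, 0.9, 0.1, 0.1], [0.9, 0.1, 0.9, 0.1])
```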

Table 1.

Notable enriched SNP-sets after applying the BANNs framework to six quantitative traits in heterogeneous stock mice from the Wellcome Trust Centre for Human Genetics [47].

The traits include: body mass index (BMI), percentage of CD8+ cells, high-density lipoprotein (HDL), low-density lipoprotein (LDL), mean corpuscular hemoglobin (MCH), and body weight. Here, SNP-set annotations are based on gene boundaries defined by the Mouse Genome Informatics database (see URLs). Unannotated SNPs located within the same genomic region were labeled as being within the “intergenic region” between two genes. These regions are labeled as Gene1-Gene2 in the table. Posterior inclusion probabilities (PIP) for the input and hidden layer weights are derived by fitting the BANNs model on individual-level data. A SNP-set is considered enriched if it has a PIP(g) ≥ 0.5 (i.e., the “median probability model” threshold [57]). We report the “top” associated SNP within each region and its corresponding PIP(j). We also report the corresponding SNP and SNP-set level results after running SuSiE [46] and RSS [26] on these same traits, respectively. The last column details references and literature sources that have previously suggested some level of association or enrichment between each genomic region and the traits of interest. See S11–S16 Tables for the complete list of SNP and SNP-set level results.
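The selection rule behind the table (keep SNP-sets with PIP(g) ≥ 0.5, then report the SNP with the largest PIP(j) inside each) might be sketched as follows. The function name `top_snps_in_enriched_sets` and the index-based inputs are hypothetical and are not taken from the BANNs software.

```python
def top_snps_in_enriched_sets(pip_snp, pip_set, snp_to_set, threshold=0.5):
    """Map each enriched SNP-set index to (top SNP index, that SNP's PIP).

    pip_snp: PIP(j) per SNP; pip_set: PIP(g) per SNP-set;
    snp_to_set: SNP-set index for each SNP (assumed annotation format).
    """
    results = {}
    for g, pip_g in enumerate(pip_set):
        if pip_g < threshold:
            continue  # not enriched under the median probability model
        members = [j for j, s in enumerate(snp_to_set) if s == g]
        if not members:
            continue  # SNP-set with no genotyped SNPs
        best = max(members, key=lambda j: pip_snp[j])
        results[g] = (best, pip_snp[best])
    return results

table = top_snps_in_enriched_sets(
    pip_snp=[0.2, 0.7, 0.9, 0.1],
    pip_set=[0.6, 0.3, 0.5],
    snp_to_set=[0, 0, 1, 2],
)
```

Note that the top SNP of an enriched SNP-set need not itself pass the 0.5 threshold, which is why the SNP-level PIP is reported alongside it.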

Fig 4.

Manhattan plot of variant-level association mapping results for high-density and low-density lipoprotein (HDL and LDL, respectively) traits in the Framingham Heart Study [48].

Posterior inclusion probabilities (PIP) for the neural network weights are derived from the BANNs model fit on individual-level data and are plotted for each SNP against its genomic position. Chromosomes are shown in alternating colors for clarity. The black dashed line is marked at 0.5 and represents the “median probability model” threshold [57]. SNPs with PIPs above that threshold are color coded based on their SNP-set annotation. Here, SNP-set annotations are based on gene boundaries defined by the NCBI’s RefSeq database in the UCSC Genome Browser [50]. Unannotated SNPs located within the same genomic region were labeled as being within the “intergenic region” between two genes. These regions are labeled as Gene1-Gene2 in the legend. Double daggers (‡) denote SNPs that are also identified when using SuSiE [46] to analyze the same traits, and hashtag symbols (#) denote SNP-sets that are identified by RSS [26]. Stars (★) denote SNPs and SNP-sets identified by BANNs that replicate in our analyses of HDL and LDL using ten thousand randomly sampled individuals of European ancestry from the UK Biobank [31]. Gene set enrichment analyses for these SNP-sets identified by BANNs can be found in S29 and S30 Figs. A complete list of PIPs for all SNPs and SNP-sets computed in these two traits can be found in S18 and S19 Tables. Results for the additional study with the independent UK Biobank dataset [31] are illustrated in S31–S33 Figs and full results are listed in S21 and S22 Tables.

Table 2.

Top three enriched SNP-sets after applying the BANNs framework to high-density and low-density lipoprotein (HDL and LDL, respectively) traits in the Framingham Heart Study [48].

Here, SNP-set annotations are based on gene boundaries defined by the NCBI’s RefSeq database in the UCSC Genome Browser [50]. Unannotated SNPs located within the same genomic region were labeled as being within the “intergenic region” between two genes. These regions are labeled as Gene1-Gene2 in the table. Posterior inclusion probabilities (PIP) for the input and hidden layer weights are derived by fitting the BANNs model on individual-level data. A SNP-set is considered enriched if it has a PIP(g) ≥ 0.5 (i.e., the “median probability model” threshold [57]). We report the “top” associated SNP within each region and its corresponding PIP(j). We also report the corresponding SNP and SNP-set level results after running SuSiE [46] and RSS [26] on these same traits, respectively. The last column details references and literature sources that have previously suggested some level of association or enrichment between each genomic region and the traits of interest. See S18 and S19 Tables for the complete list of SNP and SNP-set level results. *: Multiple SNP-sets were tied for this ranking. ♣: SNPs and SNP-sets replicated in an independent analysis of ten thousand randomly sampled individuals of European ancestry from the UK Biobank [31].
