Bayesian integrative analysis of epigenomic and transcriptomic data identifies Alzheimer's disease candidate genes and networks

Biomedical research studies have generated large multi-omic datasets to study complex diseases like Alzheimer’s disease (AD). An important aim of these studies is the identification of candidate genes that demonstrate congruent disease-related alterations across the different data types measured by the study. We developed a new method to detect such candidate genes in large multi-omic case-control studies that measure multiple data types in the same set of samples. The method is based on a gene-centric integrative coefficient quantifying to what degree consistent differences are observed in the different data types. For statistical inference, a Bayesian hierarchical model is used to study the distribution of the integrative coefficient. The model employs a conditional autoregressive prior to integrate a functional gene network and to share information between genes known to be functionally related. We applied the method to an AD dataset consisting of histone acetylation, DNA methylation, and RNA transcription data from human cortical tissue samples of 233 subjects, and we detected 816 genes with consistent differences between persons with AD and controls. The findings were validated in protein data and in RNA transcription data from two independent AD studies. Finally, we found three subnetworks of jointly dysregulated genes within the functional gene network which capture three distinct biological processes: myeloid cell differentiation, protein phosphorylation and synaptic signaling. Further investigation of the myeloid network indicated an upregulation of this network in early stages of AD prior to accumulation of hyperphosphorylated tau and suggested that increased CSF1 transcription in astrocytes may contribute to microglial activation in AD. Thus, we developed a method that integrates multiple data types and external knowledge of gene function to detect candidate genes, applied the method to an AD dataset, and identified several disease-related genes and processes demonstrating the usefulness of the integrative approach.


Introduction
Alzheimer's disease (AD) is a complex progressive neurodegenerative disease characterized clinically by impaired episodic memory and other impaired cognitive abilities [1]. To better understand disease mechanisms and to identify novel therapeutic targets, several studies of aging and AD have generated large molecular datasets from blood and/or post-mortem human brain samples. Some of these studies targeted multiple molecular levels and measured, for example, genetic variants, epigenetic modifications, mRNAs, or proteins, in the same set of samples [2][3][4]. However, jointly analyzing genome-wide multi-omic datasets remains challenging and requires novel computational methods to fully utilize these datasets [5].
Integration of multiple data types from the same set of samples has been referred to as vertical data integration [6]. Methods for vertical data integration can be further characterized by the primary goal of the integrative analysis as outlined in recent reviews [5][6][7][8][9][10][11]. While the main objective of the method presented in this work is the detection of genes with consistent differences between cases and controls in multiple data types, we review a wider range of vertical data integration methods in the introduction with a focus on those that were either successfully applied to data from AD or related complex diseases or share methodological similarity with our approach. Among the most frequently used are methods for integrating genetic and transcriptomic data. These have successfully identified genetic variants that affect gene transcription thereby improving our understanding of how the transcriptome mediates the effect of risk variants for various diseases including AD [12][13][14]. Most of these methods regress gene transcription on one or more genetic variants. A simple approach for integrating epigenomic and transcriptomic data is to replace the genetic with epigenetic measurements in the regression model [15][16][17]. Stepwise regression procedures were proposed to study interaction effects between different histone modifications that may not act independently on gene transcription [18]. Moreover, various machine learning approaches were applied to predict gene transcription levels based on transcription factor binding, histone modification or other epigenomic data [19,20]. If the primary goal is to predict the effect of an intervention experiment, the often unknown causal structure between different variables has to be learned [21]. Recently, Bayesian networks were applied to infer directed gene-wise graphs that model the relationships between epigenomic, transcriptomic, pathologic and clinical variables in the AD brain [22]. In the classic gene prioritization setting, the primary goal of an integrative analysis is to detect genes with consistent differences between case and control samples across different data types. The motivation for integrating data in this setting is twofold. First, gene prioritization can be improved assuming that a gene with differences in more than one data type is more likely to be a true positive finding. Second, if a specific epigenetic mechanism of a drug is known, target genes for this drug should ideally not only demonstrate differences in the epigenetic data, but also consistent differences at a functionally more relevant level such as gene transcription or protein expression [23]. Methods for detecting differential genes in a joint analysis of multiple data types were developed for experiments with only a few or no replicates [24,25]. These methods provide probabilistic frameworks for studying the relationship between data types and for classifying genes, but since heterogeneity among replicates is not modelled, statistical inference is challenging and can only be carried out by relying heavily on prior information.
Further need for data integration stems from our constantly increasing knowledge about gene functions and pathways which can be represented as a network where functionally related genes are connected by edges [26]. This prior knowledge can be integrated into an analysis to share information between related genes and thus improve statistical inference and interpretability of the results. For example, gene networks have been used to improve parameter estimation and gene selection in penalized regression models [27][28][29]. In Bayesian models, conditionally autoregressive (CAR) Markov random field priors were frequently used to incorporate gene networks into genome-wide data analyses [25,[30][31][32].
Here, we propose a new integrative method to detect genes with consistent differences between case and control groups across multiple data types. The method is based on an integrative coefficient that summarizes information from multiple data types in a case-control study design. To improve statistical inference, a CAR prior is used in a hierarchical Bayesian model to share information between functionally similar genes defined by a gene network. We applied our method to a large AD case-control study consisting of histone ChIP-seq, DNA methylation, and RNA-seq profiles from 233 subjects. Identified genes were validated using protein data and two independent RNA-seq studies of AD. Finally, in a post hoc analysis, we identified differential networks reflecting AD-related processes. Our new method is outlined in the flow chart S1 Fig and described in detail in the Methods section. The validation of our method and findings from the analysis of the AD data are presented in the Results section. We note that our approach can be adapted and applied to other similar multi-omic case-control studies.

Coefficient for integration of multiple genomic variates
We propose an integrative coefficient Z that summarizes observations made in different genomic data types for the same subjects and genes in a case-control study. Let X ðkÞ ij denote the value observed for gene i in individual j of the patient group in data type k, and Y ðkÞ ij denotes the respective value in the matched control subject. We define Z ij as the sum of standardized The variances of the differences s 2 X ðkÞ Y ðkÞ ¼ 1 nm P i¼1;...;n; j¼1;...;m ðX ðkÞ ij À Y ðkÞ ij Þ 2 are calculated across all genes and used to standardize the differences which may have a different range of values depending on the data type and technical platform used to generate the data. Whether a positive or negative association is expected between the genomic data types is modelled by the factor S (k) 2{−1,1}. If the differences X ðkÞ ij À Y ðkÞ ij for a gene i and individual j show consistent directions as modelled by S (k) across all K data types, the absolute value of Z ij is large. In contrast, a difference in one data type might be cancelled out by a difference in another data type if the directions of the differences do not meet the assumption defined by S (k) . Z ij is also expected to be close to zero if a gene does not have any differences in any data type. Previous work suggested multiplicative instead of additive coefficients for data integration [24,25,33], however, replacing the sum by a product in Eq (1) is a very conservative approach when multiple different data types are modelled. A small difference in a single data type would result in a small multiplicative coefficient even if distinct differences are observed in the other data types. In this work, we jointly analyzed gene transcription, histone modification and CG methylation both at promoters and at exons resulting in K = 4 different data types. To model the negative association between promoter methylation and transcription [34], we set S (k) = −1 for promoter methylation and S (k) = 1 for the other three data types.

Bayesian hierarchical model
Model. The Z ij are assumed to be normally distributed. The normal distributions' means are regressed on two gene-specific effects, H i and U i . The former is simply assigned a normal distribution, while the latter represents a spatial effect sharing information between functionally similar genes. Similarity information is extracted from an external gene network where two functionally related genes i and g are connected by an edge and the weight ω ig of the edge represents the strength or confidence of the relation, The spatially structured effect U i is given an intrinsic Gaussian CAR prior, and the network's similarity values ω ig are employed as weights, where δ i denotes a set ofñ i genes neighboring gene i in the gene network. Further, β 0 is PLOS COMPUTATIONAL BIOLOGY assigned an improper flat prior on R, and the variancesñ and ν H are assigned inverse Gamma distributions, The posterior distributions of the quantities E i ≔U i +H i ,i = 1,. . .,n, are used to classify genes as consistently differential, i.e. as presenting congruent differences between cases and controls across different data types. Specifically, gene i is assumed to be consistently differential if the 99% credible interval for E i lies either above or below zero. An implementation of the model in the BUGS language is given in S1 File. Prior elicitation. The hyperparameters of the distributions for 1=s 2 i ; 1=ñ and 1/ν H are important as they regulate the degree of confidence in the gene-level data and, additionally, in the functional similarities between transcripts reported by the gene network. We suggest an empirical Bayesian approach where the prior in the model is chosen based on the empirical variance observed in the data. To obtain hyperparameters forñ and ν H , we decompose the variability of the gene-wise mean coefficients Z i� ¼ 1 m P m j¼1 Z ij into a non-structural part and a structural part explained by the neighborhood relationship. The structural part of the variance is used to derive a prior forñ, and the remaining variance is used to derive the prior for ν H . Specifically, we assume The hyperparameters in Eqs (4) and (5) are calculated using that E(X) = α/β and Var(X) = α/β 2 for X~Gamma(α,β). To solve the equations, o d i is replaced by the average number of neighbors and the variance of the priors is set to 10 4 .
The parameters s 2 i , i = 1,. . .,n, model the variability of Z ij within a gene across subjects. To derive the hyperparameters in Eq (3), we assume s 2 i � median i2f1;...ng 1 m P m j¼1 ðZ ij À Z i� Þ 2 and use a prior variance of 10 4 . The hyperparameters obtained for the presented analysis are given in S1 Table. other cause of cognitive impairment) [37]. Subjects with a NIA Reagan diagnosis of low likelihood of AD or no AD and a clinical diagnosis of no cognitive impairment were considered as controls (persons with mild cognitive impairment were excluded from these analyses). This definition of AD and control cases resulted in a total of 141 AD cases (40 males, Ø 90.1 years; 101 females, Ø 91.7 years) and 92 control cases (34 males, Ø 83.0 years; 58 females, Ø 85.9 years) who had complete genome-wide molecular data after quality control. For matching AD cases to controls, age of death was stratified into five-year intervals between 70 and 95 years and an additional >95 years stratum. Then, each AD case was randomly matched to a control of the same gender and age stratum.
Matching of data types. Different genomic data types were matched at the transcript (isoform) level following a previously suggested approach [38]. Transcript abundances were estimated from RNA-seq data using RSEM [39]. Only active transcripts with an fpkm value �2 in at least 25% of the samples were considered. The targeted histone mark histone 3 lysine 9 acetylation (H3K9ac) is primarily located at active transcriptional start sites (TSS) [40]. To quantify the H3K9ac level for a given transcript i, we counted the number of ChIP-seq reads aligned within a genomic region R H i of 5,000 bp centered at the transcript's TSS. Often, multiple transcripts of a gene share the same TSS or have TSSs in close genomic vicinity. These transcripts cannot be distinguished at the H3K9ac level, and thus, we merged genomic regions R H i of two or more transcripts if they overlapped and summed their respective fpkm values ( Fig 1A). In total, we observed 23,674 active transcripts which were merged to 14,796 groups of transcripts with disjoint promoter regions R H i , and hence, distinct H3K9ac values. DNA methylation levels were measured by the Illumina HumanMethylation450 BeadChip limiting the methylation data to CG dinucleotides included in the chip design. DNA methylation in promoter regions is generally assumed to be negatively correlated with transcription, whereas the correlation with transcription has been reported to be positive when looking at exon methylation [34]. We calculated the promoter methylation for a transcript by averaging the methylation values of all probes located within the promoter region R M 1 i defined as 2,000 bp upstream of the transcript's TSS. Similarly, exon methylation was calculated by averaging the methylation values of all probes located within the transcript's exonic region R M 2 i . As illustrated in Fig 1A, for groups of transcripts, we combined the regions R M 1 i and R M 2 i of the individual transcripts. Overall, we obtained 10,857 transcripts or groups of transcripts with non-missing values for all four data types. For simplicity, we will use the term gene hereafter ignoring the detail that our features actually represent either a single transcript or a group of transcripts that share a promoter, and thus, that a gene with two or more active promoters is represented by two or more features in our dataset.
Data normalization. Large genomic datasets are inevitably affected by technical confounders. To reduce the effect of these covariates, we used the preprocessed datasets after quality control as described in the respective original publications [41][42][43] and subsequently regressed out technical covariates, biological covariates, and the estimated proportion of neurons in the cell type composition of the neocortical tissue. Pearson residuals obtained from the regression models were then plugged into Eq (1) as our normalized observations X ðkÞ ij and Y ðkÞ ij . Fig 1B depicts the correlation of the residuals of the same gene between different data types. Correlation between the residuals from gene transcription and H3K9ac data were shifted towards positive values, whereas correlation coefficients were almost centered for promoter and exon methylation. A strong correlation is not expected for the majority of genes, since we regressed out known factors like gender that impose a correlation structure between the data types. The remaining correlation structure is caused by AD or by unknown environmental and genetic factors. Environmental and genetic factors likely have a small effect due to the homogeneity of the ROS/MAP cohort.
In more detail, the following regression models were used. For the RNA-seq data (Synap-seID: syn3388564), we log-transformed the transcript-level fpkm values and fitted a linear regression model for each transcript with the covariates RNA integrity score, log-transformed total number of sequence reads, batch, postmortem interval, age of death, gender and proportion of neurons. For the ChIP-seq data (Synapse ID: syn4896408), we used the number of reads observed in the genomic region R H i as outcome and fitted a negative binomial regression model for each transcript with the log transformed total number of reads as offset and the covariates cross correlation, postmortem interval, age at death, gender and proportion of neurons. DNA methylation data (Synapse ID: syn3157275) contained methylation values between 0 and 1 for each CG dinucleotide. We applied beta regression models with the covariates bisulfite conversion rate, batch, postmortem interval, age at death, gender and proportion of neurons.
The proportion of neurons used in the regression models to adjust for changes in cell type composition during the course of AD were estimated from the RNA-seq data. We applied the Digital Sorting Algorithm (DSA) [44] to the expression values of the five neuronal marker genes GABBR2, MYT1L, ARL4C, CADPS and NRXN3 that were previously identified using an external human brain RNA-seq reference dataset of purified cells [42,45]. We observed a decrease of the proportion of neurons from 66.9% to 65.5% in AD subjects (p = 0.01, Wilcoxon rank-sum test, n = 141 AD subjects and 92 controls) indicating the need to adjust for neuronal

PLOS COMPUTATIONAL BIOLOGY
proportion. RNA-seq-derived estimations were also used to adjust the H3K9ac and DNA methylation data since these data were generated from adjacent specimens of the same tissue block.

Functional gene network
Information about the functional similarity of genes was obtained from the HumanNet [46]. HumanNet is a functional gene network consisting of 16,243 genes connected by 476,399 weighted edges. Weights ω ig range between 0.41 and 4.26 (Ø 1.14) and reflect the likelihood of a functional linkage between the two connected genes. The HumanNet was not developed specifically for the human brain and many genes and functional relationships are not observed in the human neocortex. Therefore, we first removed all genes that were not detected in our data. Then, edges of the induced subgraph were removed if the connected genes did not show a correlation coefficient larger or equal to 0.35 (85 th percentile) in an external gene transcription dataset from the Mount Sinai Brain Bank (MSBB) AD study [47]. The MSBB dataset consisted of 753 RNA-seq profiles of human aged and Alzheimer's brain samples generated from four different brain regions (S1 File). The modified network consisted of 6,470 genes connected by 41,093 edges with a mean weight of 1.24. The remaining 4,387 genes in our dataset that were either not represented in the original network or not connected by any edges after pruning remained in the analysis, but for these genes the structural component U i is ignored in Eq (2).

Detection of differential subnetworks
We applied the prize-collecting Steiner tree (PCST) algorithm implemented in the R package PCSF [48] to detect subnetworks that were enriched with consistently differential genes identified by our integrative Bayesian analysis. To detect multiple trees in the network, the algorithm introduces an extra root node connected to each node in the network with cost ω 0 [49]. After the PCST problem has been solved, the artificial root node is removed from the tree and the remaining forest structure F is returned. Specifically, the algorithm maximized the objective function where V F denotes the set of all vertices (genes) in the forest and L F the set of all edges (functional links between genes) in the forest. Variable κ F denotes the number of trees in the forest. The cost for an edge (i, j) was defined as c(ω ij ) = ω c −ω ij with constant ω c set to the sum of the maximum and minimum weight observed in the functional similarity gene network. The optimal solution depends on the two tuning parameters β and ω 0 that need to be specified. β balances the prizes associated with genes, i.e., the absolute values of the integrative statistics jÊ i j, and the costs assigned to the edges. A larger value of β results in larger trees. We set β = 22, which corresponded to the 75 th percentile of the costs c(ω ij ) divided by the 95 th percentile of the prizes jÊ i j in our data. The parameter ω 0 defines the costs for adding a tree to the forest. A larger value of ω 0 results in fewer trees in the optimal solution F. We set ω 0 to the 99 th percentile of bjÊ i j. Our choices for β and ω 0 preferred a solution with a few small trees, which is often better for biological interpretability. The optimal forest F with these settings consisted of nine trees. Three trees consisted of ten or more genes and were studied in more detail. These three trees were extended into the subnetworks by adding all edges to a tree that existed between any two genes of the tree in the initial functional gene similarity network. Thus, each of the three subnetworks corresponds to one single tree found by the PCSF algorithm with the identical set of genes but additional edges. Differential subnetworks were tested for an enrichment of gene ontology (GO) terms from the biological process ontology using the R package topGO [50, 51]. Fisher's test was applied to compare genes within a subnetwork to the background set of all genes included in the initial gene network. GO terms with less than 10 genes were excluded from the analysis.

Model fitting
The hierarchical Bayesian model was implemented in the BUGS language. The code is available in S1 File. The Gibbs sampler implemented by WinBUGS was used to carry out 400,000 iterations after an initial number of 40,000 burn-in iterations. A thinning of 200 was applied resulting in 2,000 samples from the posterior distributions. Median values and 99% credible intervals were obtained from these 2,000 samples to perform inference. Estimates for the parameters β 0 , ν H , andñ, and their respective trace plots are given in S2

Detection of genes with consistent differences across data types in AD
The primary goal of our integrative analysis was to identify genes with consistent alterations of the epigenome and transcriptome in AD. Evidence for differential gene regulation across transcription, H3K9ac, promoter methylation and exon methylation data was summarized by the integrative coefficient Z. In Eq (1), we set S (k) = −1 for promoter methylation and S (k) = 1 for the other three data types to model the negative association between promoter methylation and transcription [34]. Genes were classified as consistently differential if the 99% credible interval for the integrative statistic E i excluded 0. In total, 393 genes were significantly upregulated and 423 genes were significantly downregulated in AD. S3 Table contains the statistics for all genes included in the analysis. The first ten genes sorted by |E i | are shown in Table 1.
Among the differential genes in Table 1 are the neurotransmitter transporters SLC6A9 and SLC6A12, which were associated with cognition in human AD patients and AD model systems [52][53][54]. A recently developed SLC6A9 inhibitor is currently tested in a clinical trial [55]. The downregulated phosphatase DUSP9 and the upregulated kinase CDK18 have been suggested to modulate pathological tau phosphorylation in AD [56,57]. KIF5A is a motor protein that has been reported to be upregulated in AD and may contribute to AD-related mitochondrial defects [58,59]. Another candidate gene is APOD, which has a neuroprotective role and is upregulated in the aging and AD brain [60,61]. Overall, the differential genes identified by our analysis are involved in various biological processes in different cell types reflecting the complexity of AD. Before we highlight some of these processes and discuss novel insights, we first study and validate our integrative model in more detail.
The integrative statistic E i consists of a non-structural component H i and a structural component U i defined by the functional similarity gene network. If functionally related genes demonstrate congruent up-or downregulation in AD, we expect to observe large absolute values for U i . To assess the importance of the network in our analysis, we approximated the fraction of the variance of E i that is contributed by U i , Var(U i )/(Var(U i )+Var(H i )) = 0.06. A fraction of 6% indicates that while H i accounted for most of the observed differences between AD and controls, some functionally related genes were jointly dysregulated in AD. The structural component U i alone was significant for a total of 6 genes.

Assessment of specificity and sensitivity using simulated data
To validate our model and prior development, we simulated a dataset based on the 92 control subjects, which were randomly split into 46 cases and 46 controls. A total of 10,000 genes were randomly selected and assigned to 500 pathways with 20 genes each. Differences between cases and controls were simulated for half of the pathways by adding or subtracting a value Δ to the observed variables X ðkÞ ij ; k ¼ 1; . . . ; 4, of all genes in the pathway in the case samples. Then, a noisy gene network was simulated consisting of 47,500 correct edges between genes of the same pathway and 99,800 incorrect edges between genes of different pathways.
Only 19 out of the 5,000 non-differential genes were falsely classified as differential based on a 99% credible interval. The method correctly identified 1,173 of the 5,000 differential genes resulting in an FDR of 0.016 and a sensitivity ranging from 0.027 to 0.535 depending on the magnitude of the simulated difference Δ as depicted in Fig 2A. Notably, even the largest simulated difference Δ = 0.15 corresponded to a small standardized effect size of 0.083 (Cohen's d). Next, we compared the performance of our Bayesian model to alternative approaches using our simulated dataset. For our Bayesian approach, we observed an area under the receiver operating characteristic curve (AUC) of 0.84 (Fig 2B), which is a modest improvement over the gene-wise one-sample t-tests applied to the Z values (AUC of 0.80). A moderated t-test on the Z values as implemented in the limma software performed equally well as the regular t-test (AUC of 0.80) [62]. All three approaches operate on the integrative coefficient Z and therefore leverage information from all four data types and consider the directionality of the effects observed in the different data types. Alternatively, an integrative analysis can be performed as a meta-analysis on significance levels or rankings obtained from separate analyses of the different data types. To implement this strategy, we first ran paired two-tailed t-tests on the simulated X ðkÞ ij and Y ðkÞ ij values separately for each of the k = 1,. . .,4 data types to obtain a p-value per gene and data type. Subsequently, the results from the different data types were combined by either calculating meta-p-values using the z-score method (AUC of 0.76), or by calculating the genes' mean ranks (AUC of 0.75), or by applying the Robust Rank Aggregation method (AUC of 0.75). The latter method was specifically developed for integrating multi-omic data [63]. However, when combining p-values or gene ranks, information about the directionality of the effect sizes observed in the different datasets is lost resulting

PLOS COMPUTATIONAL BIOLOGY
in lower AUC values. Finally, we added the results obtained from applying a t-test to only one data type (transcription data as example) to our comparison in Fig 2B. As expected, the AUC of 0.71 was smaller compared to the integrative methods, since a large fraction of the data was not used.
To study how the Bayesian method performs if a non-informative network is given, we randomly permuted the edges of the simulated gene network. Fig 2C shows the results from the Bayesian method with the random network (AUC of 0.80) next to the unchanged results from the t-test for a better comparison. These results indicate that the improvement of the Bayesian method depicted in Fig 2B stems from the information provided by the gene network, and that if a random network is given, the Bayesian methods performs equally well as the t-test. Finally, we studied the effect of different sample sizes on the methods by repeating the simulation study on subsets of n = 46 ( Fig 2D) and n = 20 (Fig 2E) samples. While the performance of all methods declined with smaller sample sizes, the Bayesian method maintained an advantage as more relative weight was given to the prior. In summary, the simulation study demonstrated that the Bayesian model with the selected priors results in a small false positive rate, and, if an appropriate network is given, performs better than simple gene-wise t-tests on the integrative coefficients or methods that integrate p-values or ranks.

Validation in independent datasets
We first studied whether genes with consistent differences in epigenomic and transcriptomic data identified by our analysis also presented differences at the protein level. To do this, we utilized a targeted proteomics dataset generated from the same sample type and sample collection as our multi-omic data: DLPFC samples of ROS/MAP participants (Synapse ID: syn10468856). The targeted proteins were candidate genes from previous AD studies [64] and measured by liquid chromatography-selected reaction monitoring [65]. We applied the same case/control definition as for the main analysis resulting in 393 AD cases and 214 control subjects. Each protein was tested for difference in abundance in AD versus control subjects, adjusting for gender, age and postmortem interval (S1 File). Overall, we observed a positive correlation (Pearson's r = 0.54) between the integrative statisticÊ i and the observed differences in the protein data based on 98 proteins encoded by genes considered in our integrative analysis (Fig 3A). Out of 18 differential genes from our integrative analysis, 9 genes demonstrated significantly altered protein levels in AD (family-wise error rate � 0.05); the direction of effect was consistent between the two sets of results.
Since the protein data is limited to selected AD candidate genes and was not generated from an independent cohort, we additionally validated our findings in two RNA-seq datasets from other AD sample collections. The first dataset consisted of 79 AD samples and 37 control samples from the inferior frontal gyrus included in the MSBB study [47]. We identified 747 genes that were classified as differential in our integrative analysis and passed the detection threshold in the inferior frontal gyrus samples of the MSBB dataset (S1 File). When comparing AD to control samples, a majority of 601 out of the 747 genes showed a change in transcription consistent with the results from the integrative analysis (Fig 3B). These changes were significant at an unadjusted p-value of 0.05 for 102 out of 354 upregulated genes and for 97 out of the 393 downregulated genes. Similar results were observed for the second dataset of temporal cortex samples from the Mayo LOAD study (n = 71 control samples, n = 80 AD samples) [4]. We detected 759 of our differential genes in the temporal cortex (S1 File), and 553 of these genes showed a consistent increase or decrease in AD (Fig 3C). At an unadjusted p-value of 0.05, 154 out of 359 upregulated and 200 out of 400 downregulated genes were validated in the Mayo LOAD study.

Differential subnetworks
When we analyzed the posterior distributions we found that about 6% of the variance of E i is contributed by U i , suggesting that some parts of the gene network are collectively dysregulated in AD. Such a subnetwork of jointly differential genes often represents a disease-related biological process and is easier to interpret than single genes. To identify differential subnetworks post hoc, we applied a prize-collecting Steiner tree (PCST) algorithm [48,66]. The objective of the PCST algorithm was to find a subnetwork that maximizes the sum of jÊ i j of the genes in the subnetwork minus the costs for the edges c(ω ij ) needed to construct the subnetwork.
Three differential subnetworks with at least 10 genes were identified. The first subnetwork (Fig 4A) was enriched with genes involved in myeloid cell differentiation (S4 Table) and reflected the immune component of AD [67]. The network included myeloid transcription factors such as NFIC [68], and cytokines such as CSF1 and the corresponding receptor CSF1R, which have recently been studied in the context of microglia activation [69][70][71]. Cellular functions of the upregulated genes RHOQ and TRIP10 include endocytosis and regulation of cell shape and motility [72,73]. To verify that this gene network is transcribed by myeloid cells, we compared the genes' transcription levels in an external RNA-seq dataset of purified human brain cells (Fig 4B) [45]. Further, all five significant genes in the network were also differentially transcribed in the MSBB or the Mayo LOAD dataset (Fig 4C). We note that the myeloid genes that we prioritize are different from the well-validated myeloid AD susceptibility genes that have emerged from genome-wide association studies.
The second differential network (S3 Fig) was enriched for the Gene Ontology term protein phosphorylation (S5 Table). Protein phosphorylation regulates various cellular processes by altering protein activity, localization and stability, and this mechanism has been implicated in AD [74]. For example, the gene PRKAA2 (alias AMPK) encodes a kinase that regulates cellular energy homeostasis, is activated by amyloid-β, and phosphorylates tau at multiple sites [75][76][77]. Another kinase directly involved in the phosphorylation and accumulation of tau is TTBK1 [78,79]. Interestingly, TTBK1 also phosphorylates TDP-43, a protein which forms pathologic aggregates in aged and AD brains [80,81]. MAP2K4 (alias MKK4) has been suggested to phosphorylate tau [82] and to modulate amyloid-β toxicity [83]. Most kinases and phosphatases were depicted in the right half of the network (S3 Fig). The lower left part of the network included two genes, TUBA1B and TUBB2A, that encode major constituents of microtubules, which are disrupted by hyperphosphorylated tau in AD [84]. The tubulin genes were connected to the mitochondrial fission gene DNM1L (alias DRP1) in the network. The protein DNM1L interacts with amyloid-β and hyperphosphorylated tau, causing mitochondria fragmentation and thereby affecting mitochondrial health and axonal transport in AD neurons [85][86][87]. Thus, the lower left part of the network reflected impaired energy metabolism in AD synapses. The upper left part of the network consisted of dysregulated genes of the ubiquitin proteasomal system, such as PSMD2, BTRC, CUL9 and UBQLN1 [88]. UBQLN1 is involved in the degradation of PSEN1 and APP, two proteins which are essential for the generation of amyloid-β peptides (APP is the gene that encodes the amyloid-β peptide) [89,90]. Overexpression of UBQLN1 alleviates symptoms in some AD mouse models [91]. Thus, while this complex network captures several different processes, we refer to the network as protein phosphorylation network because of the enrichment with kinases and phosphatases (S5 Table). Many genes of this network were transcribed by neurons ( S3B Fig). The third differential network was characterized by the GO term synaptic signaling (S6 Table) and mainly consisted of downregulated synaptic genes (S4 Fig). For example, RGS7 regulates synaptic plasticity by modulating the signaling pathway downstream of the GABA B receptor [92]. RPH3A, another downregulated gene involved in synaptic signaling, correlates with cognitive decline and is specifically downregulated by amyloid-β [93]. Similarly, the glutamate receptors GRIA2 (alias GluR2) and GRIN2A (alias GluN2A) have been shown to be reduced in the postsynaptic density in AD and are associated with memory deficit [94,95]. Overall, this network, which is mainly transcribed by neurons (S4B Fig), reflects abnormalities in synaptic signaling and a reduction of synaptic density, which is a hallmark of AD and occurs before neuronal cell death [96,97].

Upregulation of the myeloid network is associated with amyloid-β pathology and promoted by CSF1 expressing astrocytes
To further investigate the role of the biological processes underlying the three differential networks in neurodegeneration, we leveraged the RNA-seq data from the Mayo LOAD study. In addition to 80 AD and 71 control samples, the dataset from the Mayo LOAD study also included 30 samples with a post mortem diagnosis of pathologic aging [4]. Individuals with pathologic aging have widespread cortical amyloid-β plaque deposits but demonstrate no or

PLOS COMPUTATIONAL BIOLOGY
only minimal neurofibrillary tau pathology and are cognitively non-impaired [98]. It is unclear whether pathologic aging is an early stage of AD or whether this condition develops in individuals who have protective factors that block processes downstream of amyloid-β pathology [99]. We summarized the transcriptional activity of a network in the Mayo cohort by calculating the first principal component of the normalized RNA-seq transcription profiles of the network's genes (S1 File). First, we verified that the three networks were differentially transcribed between AD and controls in this independent cohort (Fig 5A-5C). As expected, the myeloid cell differentiation network was significantly upregulated in AD compared to controls (p = 0.001, Wilcoxon rank-sum test), and the protein phosphorylation (p = 0.002, Wilcoxon rank-sum test) and the synaptic signaling (p<0.001, Wilcoxon rank-sum test) networks were significantly downregulated in AD. Interestingly, the individuals diagnosed with pathologic aging demonstrated an upregulation of the myeloid cell differentiation network (p = 0.047, Wilcoxon test) to a level similar to that seen in AD subjects (Fig 5A), whereas the protein phosphorylation and synaptic signaling networks were not dysregulated in this group of samples (Fig 5B and 5C). These findings indicate that the upregulation of the myeloid cell differentiation network does not require tau pathology and is probably an early event in the pathogenesis of AD preceding tau pathology and neuronal dysfunction that manifest as impairment in cognitive function. These results are consistent with a recent study suggesting that microglia interact with amyloid-β pathology to contribute to tau proteinopathy and downstream cognitive decline [100].
The myeloid cell differentiation network consisted of 14 genes of which 5 were classified as differential (Fig 4A). The cytokine CSF1, one of the 5 differential genes, is an interesting candidate gene, because of its role as a regulator of myeloid cell frequency and function during homeostasis and inflammation [69,101]. Previous studies of the corresponding receptor CSF1R in AD mouse models showed that blocking CSF1R reduced microglia density and attenuated the burden of AD pathology in the animals [70,102]. To investigate which cell types trigger CSF1 signaling, we employed single-nucleus RNA-sequencing (snRNA-seq) data from n = 48 subjects from the ROS/MAP study (S1 File) [103]. As shown in Fig 5D, CSF1 was primarily transcribed in astrocytes, oligodendrocytes and oligodendrocyte progenitor cells. Next, we confirmed that CSF1R was exclusively transcribed by myeloid cells in the human prefrontal cortex (Fig 5E). Finally, we tested which of the three cell types that transcribed CSF1 contributed to the differences between AD and controls observed at the tissue level. Interestingly, an upregulation of CSF1 in AD was only observed in astrocytes (p = 0.016, Wilcoxon rank-sum test), but not in oligodendrocytes or oligodendrocyte progenitor cells (Fig 5F). Although the sample size of the snRNA-seq data is limited, these findings suggest that astrocytes activate microglia cells via CSF1 signaling. The alternative CSF1R ligand IL34 was not detected as differential in our integrative analysis (S3 Table).

Discussion
Large multi-omic datasets are becoming more common in biomedical research and require novel integrative bioinformatics approaches to fully harness their potential. We developed an integrative method to detect genes consistently altered in multiple data types in case-control studies. In addition to integrating information from different data types, our method also utilizes functional gene similarity to share information across genes and thereby improve statistical inference.
Information from different data types is aggregated by the integrative coefficient given in Eq (1) at the gene level. Data was matched to genes based on genome annotation in our AD study, however, other data types, e.g. some enhancer marks, may require a more complex matching strategy as outlined elsewhere [38,104]. After matching, we observed primarily positive gene-wise correlations between transcription and H3K9ac, whereas no clear trend was observed for promoter or exon methylation (Fig 1B). This may reflect the complex relation between DNA methylation and transcriptional activity in the brain, including the role of hydroxymethylation in neurons [105], but we also note that a large correlation between the residuals of different data types should generally not be expected, since we regressed out the effects of major factors such as age, gender and proportion of neurons that impose a correlation structure on the data. The remaining correlation structure was likely caused by unknown genetic and environmental factors as well as the AD status, which may not affect many genes in all data types. To limit the effect of data types that are not associated with the outcome, we modelled an additive instead of a multiplicative coefficient suggested by previous studies [24,25]. Further, multiplicative coefficients follow a more complex product distribution and the sign of the coefficient is difficult to interpret if more than two data types are involved. In Eq (1), the factors S (k) model the relationship between data types so that the sign of the coefficient corresponds to an up-or downregulation, respectively. The factors S (k) can often be chosen

PLOS COMPUTATIONAL BIOLOGY
based on prior knowledge derived from studies like Encode or Roadmap Epigenomics [106,107]. A hierarchical Bayesian model is used to study the distribution of the integrative coefficient. An innovation of the model is the representation of the differences between AD cases and controls by a non-structural component H i and a structural component U i which shares information between functionally related genes. Based on the results in a recent comparative review, we selected the HumanNet to define functional similarity [108]. Functional similarity networks are constructed from various datasets from different tissues and organisms and are not brain specific. Thus, we had to customize the network and prune edges which were not supported by an external brain RNA-seq dataset in order to better utilize the information contained in the network for our specific analysis. After pruning the network, we observed that overall approximately 6% of the AD effects were contributed by U i indicating that parts of the gene network were jointly dysregulated in AD. The fraction of variation attributable to the structural component depends on the gene network and the structure of the studied data and may vary between diseases and tissues. In other domains like spatial epidemiology, a wide range of values has been observed that can be as large as 71% in extreme cases [109]. Future studies will have to show whether a fraction of 6% as observed in this study is a common value for genome-wide molecular data. We validated our model using simulated and independent data from other studies. The simulation study was important to demonstrate that our prior choices result in a reasonable small FDR of 0.016 when using 99% credible intervals to classify genes. Further, the simulation study showed that our method outperforms a one-sample t-test on the integrative coefficients, if an appropriate network is given. The one-sample t-test on the integrative coefficient resembles a paired t-test as the coefficient is the sum of the differences between the matched samples across data types, and thus, can be expected to be powerful in the setting of a matched casecontrol study. Consequently, when a random network was given, both methods performed equally well indicating that the advantage of the Bayesian method stemmed from the information provided by the network. Methods that integrate results from separate analyses were inferior as these methods ignore the directionality of the observed differences in the different data types. Further, these methods do not provide a statistical framework for assessing significance and controlling error rates. For example, the z-score approach as included in the comparison will likely result in inflated p-values since the data sets are not independent.
Comparing our results from the integrative analysis with the protein data and the external transcription data revealed that a majority of our findings can be reproduced at the protein level and at the mRNA level in an independent cohort, even though the aim of our method was not to predict differences at the protein or transcription level, but to identify genes with consistent differences across the given data types. In line with the validation results, we found that many of the differential genes given in Table 1 have been studied as candidate genes for AD. Thus, we anticipate that the complete result from the gene-wise analysis (S3 Table) is a useful resource for AD candidate genes. However, we take these results further, prioritizing a subset of genes that may be of greater interest: based on the gene-wise results, we studied which parts of the gene similarity network were collectively dysregulated in AD. Three different dysregulated AD subnetworks were identified: myeloid cell differentiation, protein phosphorylation, and synaptic signaling. Similar network-based approaches have been suggested to reveal disease related pathways which may not become obvious in a gene-wise analyses [110]. In contrast to single genes, a network signature can usually be replicated more robustly in model systems or independent datasets, and thus, can be helpful in follow-up studies to address questions such as the temporal progression of these three processes during the course of AD.
In this study, we further investigated the status of the three differential networks in pathologic aging, which is characterized by high amyloid-β loads similar as in AD but a lack of distinct tau pathology [98]. Consistent with the normal cognition of individuals with pathologic aging, the synaptic signaling network was not altered compared to controls. We also found no evidence for altered transcription of the protein phosphorylation network; however, the myeloid cell differentiation network was upregulated to a similar level as observed in AD. Although it is unclear whether the amyloid-β aggregation in pathologic aging reflects an early stage of AD, these findings support the hypothesis that microglia are already activated at the preclinical stage of AD before accumulation of hyperphosphorylated tau. An interesting member of the myeloid cell differentiation network is CSF1 because of its role as a regulator of myeloid cell numbers and functions [101]. A few studies of AD focused on the corresponding receptor CSF1R as a potential therapeutic target. Microglia cells depend on CSF1R signaling [111,112] and treatment of AD mice with CSF1R inhibitors results in reduced microglia activation and improved memory function [70,113], but little is known about the cells that contribute to CSF1R triggering in AD. Using snRNA-seq data, we showed that the upregulation of CSF1 observed at the tissue level is primarily caused by astrocytes in human AD brains. Altogether, our findings suggest that astrocytes contribute to microglial activation by expressing CSF1 at an early stage of AD preceding tau accumulation. Whether CSF1 overexpression by astrocytes is directly provoked by amyloid-β cannot be concluded from our data. Interestingly, activated microglia in return secrete signals that induce reactive astrocytes illustrating the complex relationship between these two cell types during the pathogenesis of AD [114].
In summary, we proposed a novel method for the joint analysis of multiple genome-wide datasets that utilizes external information about functional gene similarity. We applied the method to transcription, histone acetylation and DNA methylation data from a large AD study and discovered multiple well-known and new target genes as well as AD processes. Further study of one of these processes indicated that astrocytes may contribute to microglia activation by CSF1 expression at early stages of AD. Our approach can be adapted to analyze other multi-omic case-control datasets and thereby promotes integrative analyses to fully utilize these complex datasets.

S1 Fig. Schematic overview of the integrative multi-omic analysis.
Blue boxes indicate input datasets, red boxes indicate our novel integrative analysis, and green boxes indicate the explorative post hoc analysis of differential subnetworks. The genomic data from our AD case-control study consisted of four different data types, which were matched to genes and summarized by the integrative coefficient Z ij . Z ij modeled the differences observed across data types for gene i when comparing sample j to its matched control sample. The distribution of Z ij was modeled by a hierarchical Bayesian model to identify consistently differential genes. The Bayesian model incorporated a gene network (HumanNet) to share information between functionally related genes. Genes with consistent differences in the epigenomic and transcriptomic data between AD and control samples were the primary result of our integrative analysis (S3 Table). To further analyze and interpret the results, we subsequently employed an explorative network-based approach to detect AD-related subnetworks (green boxes). Therefore, we annotated the genes in the HumanNet with the integrative statisticÊ i derived from our Bayesian model and used a prize-collecting Steiner tree (PCST) algorithm to identify subnetworks enriched with consistently differential genes.