Bayesian variable selection with graphical structure learning: Applications in integrative genomics

Significant advances in biotechnology have allowed for simultaneous measurement of molecular data across multiple genomic, epigenomic and transcriptomic levels from a single tumor/patient sample. This has motivated systematic data-driven approaches to integrate multi-dimensional structured datasets, since cancer development and progression is driven by numerous co-ordinated molecular alterations and the interactions between them. We propose a novel multi-scale Bayesian approach that combines integrative graphical structure learning from multiple sources of data with a variable selection framework—to determine the key genomic drivers of cancer progression. The integrative structure learning is first accomplished through novel joint graphical models for heterogeneous (mixed scale) data, allowing for flexible and interpretable incorporation of prior existing knowledge. This subsequently informs a variable selection step to identify groups of co-ordinated molecular features within and across platforms associated with clinical outcomes of cancer progression, while according appropriate adjustments for multicollinearity and multiplicities. We evaluate our methods through rigorous simulations to establish superiority over existing methods that do not take the network and/or prior information into account. Our methods are motivated by and applied to a glioblastoma multiforme (GBM) dataset from The Cancer Genome Atlas to predict patient survival times integrating gene expression, copy number and methylation data. We find a high concordance between our selected prognostic gene network modules with known associations with GBM. In addition, our model discovers several novel cross-platform network interactions (both cis and trans acting) between gene expression, copy number variation associated gene dosing and epigenetic regulation through promoter methylation, some with known implications in the etiology of GBM. Our framework provides a useful tool for biomedical researchers, since clinical prediction using multi-platform genomic information is an important step towards personalized treatment of many cancers.


Introduction
The last decade has seen a proliferation of multi-platform genomic data, aided partly by the rapid evolution and declining costs of modern technologies, producing high-throughput multi-dimensional data. It is now technologically and economically feasible to collect diverse data on matched patient/tumor samples at a detailed molecular resolution across multiple modalities such as genomics (DNA copy number), epigenomics (methylation), transcriptomics (mRNA/gene expression) and proteomics. Such large scale coordinated efforts include worldwide consortiums such as the International Cancer Genome Consortium (ICGC; icgc. org), The Cancer Genome Atlas (TCGA; cancergenome.nih.gov) and more recently the Genomic Data Commons (GDC; gdc.cancer.gov), which have collated data over multiple types of cancer on diverse molecular platforms, to accelerate discovery of molecular markers associated with cancer development and progression. The resulting analytical challenges are to integrate these vast amounts of data into models that accurately predict the complex pathophysiology of cancer and to translate this complexity into clinically actionable outputs, towards the holy grail of precision medicine.
Initial studies in cancer genomics relying on single platform analyses (mostly gene expression-and protein-based) have discovered multiple candidate "druggable" targets such as KRAS mutation in colon and lung cancer [1], BRAF in colorectal, thyroid, and melanoma cancers [2], and PI3K in breast, colon and ovarian cancers [3]. However, it is believed that integrating data across multiple molecular platforms has the potential to discover more coordinated changes on a global (unbiased) level [4]. Through data integration, we espouse the philosophy that cancer is driven by numerous molecular/genetic alterations and the interactions between them, with each type of alteration likely to provide a unique but complementary view of cancer progression. This offers a more holistic view of the genomic landscape of cancer, with increased power and lower false discovery rates in detecting important biomarkers [5], and translates to substantially improved understanding, clinical management and treatment [6].
Our methods are motivated by a TCGA based study in glioblastoma multiforme (GBM), where-in diverse platform-specific features are obtained at genomic, epigenomic and transcriptomic resolutions across matched tumor samples. Our goals are two-pronged: first assess dependence within and between platform-specific features, and second, incorporate the dependence in finding important molecular markers associated with relevant clinical outcomes. Integrating data across platforms has sound biological justifications due to interplay of features between and within the platforms. For example, between platforms, attributes at the genomic/DNA level such as methylation and copy number variation can directly affect mRNA expression, which in turn is known to influence clinical outcomes such as cancer progression times and pathobiology of the tumors. Within platform interactions arise from pathway-based dependencies (e.g. functional and signaling pathways) as well as dependencies based on chromosomal/genomic location (e.g. copy number data). Furthermore, the molecular features are inherently on different scales: discrete (copy number variation) and continuous (DNA methylation and mRNA expression). In addition, there exist substantial prior knowledge on pathway/graphical interactions between these genes (e.g. from public databases and literature), which can be incorporated to achieve improved estimation, increase signal to noise ratio and more refined biological interpretations. Our proposed approach combines all the above aspects to develop an integrative model for predicting clinical outcomes.
There has been a growing but limited literature on statistical and computational approaches exploiting the information garnered from data integration in relating the platforms to the clinical outcome-which is usually the goal of translational research in finding markers of cancer progression. Choi et al. [7] propose a double layered mixture model to jointly analyze copy number and gene expression data. Recently, Wang et al. [5] and Jennings et al. [8] proposed integrative Bayesian analysis of genomics data (iBAG, in short), which models biological relationships between genomic features from multiple platforms, and subsequently uses the estimated relationships to relate the platforms to a clinical outcome. However, iBAG assumes independence between genes in discovering mechanistic relationships between platforms at a gene-centric level, which may not be biologically practical as genes are known to lie on functional or cell signaling pathways [9].
Given that the associations between genes and gene products can be captured efficiently via networks, there is a growing variable selection literature for graph structured genomic covariates coming from a single platform [10]- [13] which account for the inherent dependencies in relating genetic biomarkers to the clinical outcome of interest. Such approaches either assume a known network structure on covariates (supervised), or estimate the graph from the raw data without considering prior knowledge (unsupervised). Both these classes of approaches have critical drawbacks. Supervised approaches may not be practical in genomic studies, since no existing and curated knowledge can be considered as complete and the gene network is likely to vary over different conditions, tumor types and biological processes. On the other hand, unsupervised approaches may often lead to inaccurate estimates because of the low signal to noise ratio [14], especially for high throughput genomic data typically collected on a low/moderate number of replicates. In these scenarios, there is an increasing recognition of the practical advantages of including prior biological knowledge when estimating gene networks from the data [15], which is not accounted for in existing structured variable selection approaches. Moreover to our knowledge, the existing structured variable selection approaches consider data from a single platform and are not equipped to handle mixed covariates from multiple platforms, which may give rise to different sets of between platform interactions not captured in a single platform analysis.
Unlike previous approaches incorporating prior information to estimate the graph based on single platform data ( [16], [17]), the focus of our current work is to combine multi-platform and multi-scale genomic data, and prior knowledge on the gene network, to estimate the graph for mixed variables, and subsequently use structures in the estimated graph to inform variable selection via a novel clique based approach. In addition, the proposed network estimation approach involves a belief parameter to control the degree of fidelity to the prior knowledge in order to guard against mis-specification. Concisely stated, the major novelties of our approach are: (i) estimating graphical models for mixed data from multiple platforms, while incorporating prior graphical knowledge; (ii) developing a structured variable selection approach, which accounts for correlated groups of predictors within and across platforms, and can identify individual and groups of significant covariates related to the outcome (iii) allowing for both cis-and trans-acting relationships between molecular features, and providing appropriate controls for multicolleanarity and multiple testing. These goals are achieved via a principled multi-scale approach which involves a prior informed Bayesian graphical model for mixed variables in the first stage, which is then used to inform a subsequent Bayesian structured variable selection framework (see Fig 1). The above features make our approach distinct from existing structured variable selection approaches which typically focus on single platform data with known network knowledge [18].

Methods
We focus on a univariate continuous response, y 2 <, to be regressed on a p-dimensional vector of mixed covariates x = [x 1 , . . ., x D Ã ] obtained from D Ã (! 2) distinct platforms. However, we note that our approach can be generalized to binary or ordinal outcomes via latent variable based thresholding approaches [19], [20]. In our notations, x j is the covariate vector corresponding to the j-th platform which has p j features, j = 1, . . ., D Ã , so that P D Ã j¼1 p j ¼ p. We note at the outset that the proposed approach allows for unequal number of measurements for different platforms, and hence it is possible to accommodate additional or missing measurements in one or more platforms. This is a useful feature, for example, when one has to include a methylation measurement that is far away from a gene but is highly relevant for its expression. Let us denote the n × 1 vector of responses as y and the n × p dimensional covariate matrix as X = [X 1 , . . ., X D Ã ], where the covariates may be continuous, binary or ordered categorical. The mixed covariates have an underlying graphical structure which is to be estimated (e.g. panel (c) in Fig 1), while incorporating prior existing graphical knowledge, denoted by G 0 (e.g. panel (b) in Fig 1). This is the graphical modeling or structure learning step, which is used to inform the subsequent structured variable selection step. The above steps comprise our two step approach which is described in detail below. A schematic diagram of our integrative modeling approach. Panel (a) shows the heatmaps of the genes by sample matrix constructed from data for three platforms; panel (b) depicts the prior graph constructed using previous studies; while panel (c) is the estimated graph of the genes within and across the platforms. The dashed arrows determine graphical structure and the solid arrows represent the regression model incorporating graphical dependencies. Red and green lines in panel (c) represent high negative and positive partial correlations under the estimated graph, while all other edges with lower absolute partial correlations are depicted with watermark lines. We have also provided an interactive version of S1 Interactive Plot.

First stage: Integrative structure learning
The graphical modeling approach for mixed data models ordered categorical variables by rounding continuous latent variables [19], [20], and specifies a graphical model jointly on the observed continuous covariates and the latent continuous variables. The graph for mixed data involves the vertex set V = {1, . . ., p} and edge set E, and is used to: (1) model dependence between features within and across platforms-in our application, measurements for different platforms are available for the same set of genes, so that the joint modeling across platforms allows for both cis-acting (localized to a gene) and trans-acting (across gene locations); and (2) detect potentially overlapping subgroups of features within and across platforms, which define functional modules that work together to drive the clinical outcome. Such modules correspond to cliques in the graph, which are defined as a subgroup of V such that each node in this subgroup is connected to every other node in the subgroup.
Without loss of generality, let x O i Þ denote the covariate vector for the i-th subject, with the superscripts C, O denoting continuous and ordinal (and/or binary) covariates respectively. Let z O denote the generic notation for the latent continuous variable corresponding to ordinal predictor x O , and consider the following graphical model for mixed covariates where N [D] denotes a Gaussian distribution with truncated domains defined by the hyperrectangle D, M o − 1 is the number of ordinal levels, p O is the number of ordinal covariates, and O $ π(OjG 0 ) corresponds to a continuous shrinkage prior which depends on prior graph knowledge G 0 (to be described in the sequel). Under the generic continuous shrinkage specification (1), the MCMC samples can be simulated from the posterior Subsequently a post-MCMC step can be implemented in order to obtain the graph estimatê G by thresholding absolute partial correlations corresponding to the estimated precision matrixÔ, as elaborated in Section 3.
We use a continuous shrinkage prior on O as it enables us to update all elements of the precision matrix at every iteration, thus utilizing the full prior knowledge on all edges to drive inferences. We note that discrete mixture approaches [21] based on reversible jump Markov chain Monte Carlo may not be able to visit a sizable proportion of the edges even for moderate dimensional graphs under a finite number of Markov chain Monte Carlo runs. In such a case, these edges will not be updated at all, and will instead correspond to the initial choice of the adjacency matrix relying on the prior graph. Such an approach will not satisfy our objective of learning all possible edges of the graph from the data while incorporating prior knowledge, and hence we choose a continuous shrinkage approach over discrete mixture alternatives.

Incorporating prior graph information
As mentioned before, there exists a huge amount of literature/databases describing the functional behaviors of genes, as characterized in metabolic, signaling and other regulation pathways. These include publicly available information on genes, biological pathways, Gene Ontology (GO) terms, gene-gene interaction networks e.g. Kyoto Encyclopedia of Genes and Genomes (KEGG) and Ingenuity Pathway Analysis (IPA) [22] or context-specific literature in various tumor types [23]. These sources can be queried to yield prior known connectivity graph between genes that can be brought into the network inference towards more biologically plausible structures.
Let G 0 be the prior graph having vertex set V = {1, . . ., p} and edge set E 0 , with the corresponding adjacency matrix A 0 = (a 0,ij ), where a 0,ij is the inclusion indicator for edge (i, j). Throughout this article, we consider undirected graphs so that a 0,ij = a 0,ji for all (i, j). We propose an approach which specifies an exponential prior on the diagonals and double exponential priors on the off-diagonal elements of O. Further, the shrinkage parameters are assigned a mixture distribution which incorporates prior knowledge. In particular, we have the following hierarchical formulation, where p = {p ij : i 6 ¼ j, i, j = 1, . . ., p} are mixture weights, M + is the set of positive definite matrices, λ is the vector of shrinkage parameters with dimension p(p + 1)/2, and κ ij is the belief parameter for edge (i,j), for i 6 ¼ j (κ ij = κ ji under an undirected graph). In (2), the shrinkage parameters λ shrinks the precision off-diagonals corresponding to absent edges towards zero, and is modulated by the prior graph information via the mixing proportions p. These mixing proportions are modulated by prior graph knowledge, and involve belief parameters which control the degree of fidelity to such knowledge.
Role of belief parameter: To understand the role of the belief parameter in prior specification, observe that E(p ij ) = (a 0,ij κ ij + a p )/(κ ij + a p + b p ), which implies that for large κ ij >> b p , E(p ij ) % 1 when a 0,ij = 1, and E(p ij ) % 0, when a 0,ij = 0. In extreme case when κ ij ! 1, we have p ij ! 1 when a 0,ij = 1, and p ij ! 0 when a 0,ij = 0, which encourages small and large values of λ ij respectively, for small values of a λ /b λ . This suggests that as κ ij ! 1, the prior realizations of |ω ij | will be away from zero when G 0 suggests the edge (i, j), and they will be very close to zero otherwise.
Through the use of a belief parameter, we can control the degree of confidence we place on the available prior graph information. This is a useful feature in enabling investigators to be flexible i.e. either skeptical or fairly confident about the prior knowledge, as the situation demands. In practice, we expect the belief parameter to be calibrated based on domain knowledge, by assigning large values of the belief when investigators are reasonably certain of the prior knowledge, and near zero values when such knowledge is absent or doubtful. For example, in many genomic applications (including ours), there is sufficient prior knowledge on within pathway interactions, but scant information about between pathway dependencies. When one is not sure about the choice of the belief parameter, we can let the data determine it's value under a griddy Gibbs approach. More details about calibration of the belief parameter, as well as the griddy Gibbs approach, can be found in the sequel.

Second stage: Regression and structured variable selection
In the second step, called structured variable selection, we incorporate the structural knowledge represented by the estimated graphĜ in regressing the outcome of interest on covariates.
Although we consider continuous outcomes, it is straightforward to extend our approach to binary or ordinal outcomes via thresholding the latent continuous variables. We consider the following linear regression model Here γ = { γ j , j = 1, . . ., p} 2 Γ (the model space) is the vector of variable inclusion indicators, with γ j = 1 if the jth candidate predictor is included in the model and γ j = 0 otherwise, β γ = {β j : γ j = 1, j = 1, . . ., p} is the p γ × 1 vector of the regression coefficients with p γ ¼ P p j¼1 g j being the size of model γ, X γ is the n × p γ covariate matrix (excluding an intercept) containing the predictors in model γ and having the i-th row as x γ,i . Further, we have a $ Nðm a ; s 2 a Þ; Z $ Gaða Z ; b Z Þ; as the intercept and residual precision, respectively, while η 1 is the shrinkage parameter for the double exponential (DE) prior on the fixed effects. We address uncertainty in subset selection through pðγjĜÞ depending on the estimated graph structure on the covariates, while π(β j ) characterizes the prior knowledge of the size of the coefficients for the j-th predictor, j = 1, . . ., p.
Priors on model space: The prior on the model space γ $ pðγjĜÞ is defined using clique indicators. Let C 1 , . . ., C q , denote the cliques identified by the estimated graphĜ. The cliques are indicative of (potentially overlapping) groups of associated genetic features within and across platforms and gene locations. Denote the clique inclusion indicators as g C k , k = 1, . . ., q, and let us define the prior on the model space as follows where π controls the sparsity of clique inclusions, under a multiplicity adjusted prior [24]. We call the resulting approach in (3) and (4) Bayesian variable selection with structure learning (BVS-SL), a schematic representation of which is presented in Fig 2. We note that when all the cliques are disjoint with q < p, the model loosely resembles a clustering approach allowing for different magnitudes of effects within a selected cluster. In the special case when q = p, our method reduces to the usual stochastic search variable selection (SSVS) approach, with Laplace priors on the fixed effects. We focus on cliques as a building block in our structured variable selection approach, since (a) it is a systematic way of defining sub-groups of connected nodes in the graph which makes them intuitively appealing to work with, in structured variable selection problems involving correlated groups of predictors; and (b) cliques represent foundational blocks in a graph which have been used successfully in literature, to define probability distributions for Markov random fields under the Hammersley-Clifford Theorem [25], as well as to define likelihoods under decomposable graphs [21]. Although we focus on cliques, we note that the proposed approach can be generalized in a straightforward manner to alternate sub-groups of nodes having incomplete relationships. However, it is not immediately clear how one would define such incomplete subgroups in a manner which will facilitate variable selection, and this issue warrants a more thorough investigation.
Variable selection, clique identification and multiplicity controls: Variable selection proceeds by first identifying important cliques by computing the clique specific marginal inclusion probabilities. Individual significant covariates are then identified by including all covariates residing in significant cliques, and subsequently eliminating unimportant covariates which have near zero effect sizes from this set. This elimination proceeds via a post-MCMC approach which computes point-wise 95% credible intervals for β j using MCMC iterations for which the j-th covariate was present in one or more significant cliques, and excluding all covariates for which the credible interval spans zero. This step, in addition to the multiplicity adjusted priors over cliques, enables control over false positives. Finally, non-zero effect sizes of all variables in a selected clique is expected to provide protection against collinearity for members within a clique. Thus, our approach is designed to attain a desirable balance between detecting true positives and true negatives, a claim which is supported by our simulation studies.
The variable selection approach espouses the philosophy that features in each clique are indicative of functional modules which work in coordination to drive the outcome. This assumption is somewhat similar to graph constrained penalized regression approaches [10] which assume that two neighboring genes in a network should be more likely to (or not to) participate in the same biological process simultaneously. These articles smooth the covariate effects for connected variables in the graph while encouraging sparsity and similar fixed effects, whereas no such assumptions are made under our approach. The proposed approach encourages covariates residing in important and unimportant cliques to be simultaneously included and excluded from the model respectively in order to tackle collinearity. However, the variables belonging to important cliques may not end up being included together in the final estimated model under the post-processing step which is designed to control false positives.
We examine if the inherent collinearity in the variables within a clique will potentially hamper the post-processing step desgined to exclude unimportant variables using 95% point-wise credible intervals. Letting γ denote the model involving all covariates belonging to at least one significant clique, the corresponding regression coefficients are drawn from the posterior distribution which is multivariate Gaussian with mean A À 1 X T γỹ and covariance . . . ; t 2 p ÞÞ, and τ is the latent scale parameter under a scale mixture of Gaussians representation for the double exponential priors (see the posterior computation section for more details). Clearly, the posterior mean and variance are well-defined as long as ðX T γ X γ þ diagðt 2 1 ; . . . ; t 2 p ÞÞ À 1 is non-singular, which is the case in most practical scenarios where the latent scale parameter values are learnt in a data driven manner. Note that the invertibility of the covariance matrix A is assured due to the term diagðt 2 1 ; . . . ; t 2 p Þ which adds to the diagonals of X T X, ensuring positive definiteness. The above facts imply stable estimates for the variance for the estimated regression coefficients corresponding to the variables included via important cliques, which results in a successful elimination of false positives in the post-processing step, as evidenced in our extensive numerical studies.

Posterior computation
The posterior computation for the proposed approach contains two independent sets of steps, corresponding to the two stages, as described below.
Computation for Graphical Model Estimation: The graphical model estimation for mixed covariates proceeds via sampling the latent underlying continuous variables corresponding to the ordered discrete covariates, followed by drawing the joint precision matrix of (x C , z O ) under formulation (2). We adapt the procedure in Johnson and Albert (2001) to the case of dependent covariates, for posterior updates of the latent continuous covariates under the following posterior distributions where z O i ðÀ jÞ represents the vector of latent underlying variables for the i-th subject and excluding the j-th measurement, z L l;j ¼ max Once the latent variables have been updated at each MCMC iteration, we sample O using the method described in Wang [26], while λ ij is updated using the posterior where 1(Á) is an indicator function, δ ij $ Bernoulli(p ij ) and p ij is drawn from a Beta posterior. Following Wang [26], the point estimate of the graph is obtained as a post-MCMC step by including the (i,j)-th edge if and only ifr ij =E W ðr ij jXÞ > 0:5, wherer ij is the posterior partial correlation estimate under the continuous shrinkage approach, and E W ðr ij jXÞ represents the posterior mean of the partial correlation under the reference distribution W ¼ Wishartð3; I P Þ.
Note that the belief parameter is either chosen apriori or it can be updated using a griddy Gibbs sampling step as well.
Computation for Structured Variable Selection: The computation strategy described above yields an estimate of the graph, which is used to inform the variable selection approach in the second step as described here. The Gibbs sampling alternates as follows Step 1: Sample g C j ; j ¼ 1; . . . ; p, from Bernoulli(p þ j ) posterior distributions where p þ j is the posterior inclusion probability for the j-th variable.
Step 2: Given γ, sample the fixed effects β γ under a scale mixture of Gaussians representation for the double exponential distribution, as in [27].
Step 6: Sample the intercept α from a posterior distribution which is Gaussian.
Hyperparameter Choices: Below, we list the hyper-parameters used in the model, and elucidate the values we use for them, along with the justifications for such choices.
• The belief parameter κ is edge specific and chosen to have a high or low value according to whether we have high confidence on the prior graph knowledge or not. In the event where one is unsure about the level of confidence, a griddy Gibbs approach can be used, as outlined in S1 Appendix.
• Hyperparameters a p , b p , for π(p ij ) in Eq (2) are chosen such that the ratio a p /b p is small (we choose a p /b p % 0.1). This is because the prior mean for the edge inclusion probability is given by E(p ij ) = (a 0,ij κ ij + a p )/(κ ij + a p + b p ), which implies that E(p ij ja 0,ij = 1) % 1, and E(p ij ja 0,ij = 0) % 0, for large values of the belief parameter, and a small value of a p /b p . In the case of no prior information (i.e. when κ ij = 0,) we have E(p ij ) = a p /(a p + b p ), which is small for small values of a p /b p , resulting in sparse graphs.
• We choose hyperparameters a λ , b λ for π(λ) in Eq (2) such that a λ /b λ is small. As explained previously, this encourages |ω ij | to be close to zero or away from it, depending on whether the prior information suggests the absence or presence of the corresponding edge.
• In Eq (3), we chose the prior on the residual precision as η $ Ga(0.1, 1) in the linear regression model, so as to encourage a residual distribution with thick tails corresponding to a non-informative prior which can accommodate large errors.
• The shrinkage parameter in the Laplace prior for the regression coefficients in Eq (3) is modeled under a conjugate Gamma distribution as Z 2 1 $ Gað1; 2Þ, which is close to the choice in the seminal Park and Casella (2008) [27] paper, and works well in a variety of simulation scenarios. The prior density is designed such that it approaches 0 sufficiently fast as Z 2 1 ! 1 (to avoid mixing problems), and it is relatively flat and places high probability near the maximum likelihood estimate, as recommended in [27].
• We choose the hyperparameters a π = 0.1, b π = 1 for the prior on clique inclusion probabilities in Eq (4), to encourage a small number of cliques to be included in the regression model, which facilitates multiplicity control. This choice works well for controlling false positives for a wide variety of numerical experiments, in our experience.

Simulation studies
We perform simulation studies to assess the variable selection and prediction performance for the proposed approach under several scenarios with varying dimensions and association structures for the covariates. The goal of our simulations is to examine the performance of our approach with existing unstructured variable selection approaches which do not take into account underlying structure information, by either assuming independence among covariates, or accounting for dependence in a way which is not tailored towards underlying network knowledge.
We implement the proposed approach both without and with prior knowledge corresponding to κ = 0 and κ = 50 respectively. In the first case, the graph is estimated completely from the data, and in the second case, the prior graph is taken to be the true graph G 0 used to generate the data. The same value of the belief parameter (50) was used for all edges corresponding to strong confidence; however, we also examine the effect of varying the belief parameter as well as prior mis-specification as elaborated in S2 Appendix. We compare the proposed approach to stochastic search variable selection (SSVS) [28] assuming independence of predictors, the penalized joint credible regions approach (PenCred) by Bondell and Reich [29], and the spike and slab approach or SSL [30] which fuses the Bayesian spike and slab approach with elements of penalized likelihood estimation. We also compared the performance with penalized approaches such as Lasso [31], elastic net [32], and smoothly clipped absolute deviation or SCAD [33], using R packages 'lars', 'elasticnet' in CRAN, and 'SSL' in the authors' website, respectively. The PenCred approach accounts for dependence within predictors, while the other approaches do not explicitly account for any such dependence but are they are widely used variable selection approaches. In addition, results were also included under a sparse fused lasso approach (Flasso) similar to the one described in [18], which encourages the coefficients of related features to share similar magnitudes under the penalized criteria where γ and λ are penalty parameters controlling the sparsity and the similarity between coefficients for connected variables in the graph, respectively, and E denotes the edge set in the given graph. This approach was implemented via the fusedlasso function in the genlasso package in R (https://cran.r-project.org/ web/packages/genlasso/index.html). For the proposed approach, the cliques under the estimated graphĜ was determined via the 'igraph' package in R, and 10,000 MCMC iterations were run with a burn in of 3000. The training and test sample sizes were 100 each, and we consider p = 40, 80. All results are reported over 50 replicates.
For Cases I(a)-(d) stated below, the data was generated from a linear regression model having p covariates out of which nine were ordinal (generated by thresholding the continuous latent variables) and taking values 0-4, and the rest were continuous. The true inclusion status is set to g 0 j ¼ 1; j ¼ 1; . . . ; 8; 23; 24; with four discrete variables included, and g 0 j ¼ 0 otherwise. The continuous covariates and the continuous latent variables for discrete covariates were generated using a multivariate Gaussian distribution with covariance S T . We consider different block-diagonal structures for S T (listed hereafter), specifying subgroups of predictors with varying partial correlations. The true graph G 0 which was used for BVS-SL with κ = 20, was obtained by including all edges (i, j) with jS À 1 T ði; jÞj > 0:0001. Case I(a): This case corresponds to high partial correlations with the precision matrix having four sub-blocks and all precision diagonals being 1. The first sub-block (4 × 4) has off-diagonals as 0.95, the second and third sub-blocks (4 × 4 each) have precision off-diagonals as 0.7, and the fourth sub-block (p À 12 Â p À 12) is identity. The true coefficient vector was

Case I(c):
This case corresponds to a block diagonal with two sub-blocks-one having an AR(1) structure for the precision matrix with S À 1 T ði; jÞ ¼ 0:95 jiÀ jj ; i; j ¼ 1; . . . ; 8; and the other sub-block being identity. The coefficients were same as those in Case I(a).

Case I(d):
This case corresponds to S T having the same structure as S À 1 T in Case I(c). The coefficients were same as those in Case I(b).

Case II:
We used the network for 48 genes in the TCGA data analysis to construct the inverse covariance matrix. In particular p was chosen as 48 × 2 and the inverse covariance matrix has a block diagonal form with sparse associations across two equally sized sub-blocks of dimension 48 × 48 each, and the associations within each sub-block being determined by the gene network information provided in [34]. Data was generated from a Gaussian graphical model having the true coefficient vector as (0. Cases I(a)-(d) capture the different simulation scenarios with distinct platforms, where measurements within platforms are captured via an auto-regressive structure or they are uncorrelated, and there are no connections across platforms. The unequal sized sub-blocks represent measurements which are available on only one platform but are not available on others. On the other hand, Case II resembles the TCGA data example with two equally sized platforms, where the associations within each platform is determined via prior network knowledge [34] and there being sparse associations across platforms.
Performance evaluation: One can obtain an ordered sequence of regression models by varying the cut-off for the marginal inclusion probability under Bayesian approaches and varying the penalty parameter for frequentist approaches. To evaluate the ordering of the models, we look at receiver operating characteristic (ROC) curves which plot the sensitivity versus 1-specificity, and precision recall characteristic (PRC) curves which plot the precision (ratio of true positives to the total number declared as positive) versus sensitivity. From the ROC and PRC curves presented in Figs 3-6, it is clear that BVS-SL with and without prior graph knowledge essentially always dominate the competing curves, while also having a significantly higher area under the curve as shown in columns 2 and 3 in Tables 1-4. Moreover, BVS-SL demonstrates a significantly and uniformly higher power when the false discovery rate is controlled at 10%, which points towards a superior performance in tackling collinearity for a given multiplicity level, as shown in column 4 in Tables 1-4. Prediction and Variable Selection: In addition to looking at the ordered sequence, we also investigated the predictive performance of each approach, as well as to assess the point estimate under the optimal model. The point estimate is selected using the Bayesian information criterion under PenCred, Lasso, and elastic net, while the median probability rule along with subsequent thresholding (using credible intervals) is used for BVS-SL, and the median probability rule is used for SSVS. We report the model size (MS) and false positives (FP) under the point estimate. This point estimate is also used for prediction under PenCred, Lasso, and elastic net, while the posterior predictive distribution is used under BVS-SL and SSVS. We look at the predictive performance in terms of out of sample mean squared error (RMSPE) and out of sample coverage of 95% predictive intervals (COV 95 ). The coverage refers to the proportion of test sample values contained within predictive intervals. The predictive intervals correspond to credible intervals for the Bayesian approaches BVS-SL and SSVS, whereas for PenCred, as well as the frequentist approaches, they correspond to pseudo confidence intervals that are constructed as ðxβ À 1:96s 0 ; xβ þ 1:96s 0 Þ, where σ 0 is the true residual variance.
It is seen from the first and last columns in Tables 1-4 that the proposed approach has superior performance in terms of out of sample prediction and 95% coverage with respect to competitors for almost all cases. The number of true covariates (MS-FP) detected under the proposed approach, as well as the coverage, is essentially always the best or the second best among all the approaches considered. We also see from the second last column that while the SSVS may have an advantage compared to BVS-SL with no structural information in terms of controlling false positives, the BVS-SL with κ = 20 essentially has similar or better multiplicity control compared to SSVS, thus demonstrating the advantages of incorporating prior information. On the other hand, the SSVS demonstrates drawbacks in terms of collinearity, as evidenced by smaller model sizes, and poor power to detect true positives for a given level of false discovery. Moreover, it is also evident that the SSL approach may have a lower FP rate in some cases; however this is possibly due overly sparse models reported by SSL (evident from the small model sizes) which can also result in a poor overall performance under the method. Finally, we note that the fused lasso approach may result in some improvements in variable selection performance over alternate approaches not incorporating prior knowledge, but it essentially always has less accurate performance compared to the proposed method. Moreover, the predictive performance under the fused lasso approach may not be optimal and even lower than generic approaches assuming independence between predictors. In summary, it is clear that the proposed approach seems to perform well both in terms of variable selection and prediction, while simultaneously tackling the conflicting issues of collinearity and multiplicity in the presence of correlated predictors.
Sensitivity to link function: In order to examine the performance of our approach when the link function which is used to relate the latent underlying variable to the discrete variables is mis-specified, we performed additional experiments where the ordinal variables in Cases I(a)-(d) were replaced with binary variables generated via a logit link. In particular, we modified model (1) as follows     where O T ¼s and 0 i $ Gaðñ=2;ñ=2Þ,s 2 ¼ p 2 ðñ À 2Þ=ð3ñÞ, withñ ¼ 7:3. We note that varðz O ij Þ ¼s 2 0 À 1 i and the off-diagonal elements of O T encode within and between platform interactions. The latent variables z O are thresholded at zero to yield binary predictors which marginally follow a logistic distribution. We considered different structures for O T similar to Cases I(a)-(d), incorporating prior graph information G 0 on the mixed covariates via the inverse covariance matrix. The true inclusion status is set to g 0 j ¼ 1; j ¼ 1; . . . ; 8; 23; 24; with four binary variables included, and g 0 j ¼ 0 otherwise. We examine the graph estimation performance of our method when the mixed covariates are generated as above, and compare these results with the scenario when a probit link is used to generate the latent variables which can be implemented by settings 2 ¼ 0 À 1 i ¼ 1 in (5). The results presented in Table 4 clearly suggest that (a) irrespective of the link used to generate the binary variables, a higher value of the belief parameter results in better graphical model estimation performance; and (b) the sensitivity and the specificity of the estimated graphs are very similar under both the links, even though there may be possible differences in the precision matrix estimation accuracy. Based on the above findings, we conclude that there are no systematic differences in terms of graphical model estimation, when the latent variables are generated under different links, which illustrates the robustness of the proposed approach.

Integrative network analysis of TCGA glioblastoma data
Our motivating dataset arises from a TCGA-based study in glioblastoma multiforme (GBM), which is the most common and aggressive form of primary brain cancer in human adults. The TCGA data portal provides multiple levels of molecular data for a large cohort of GBM tumor specimens. Each qualified specimen was assayed using multiple assays among which we concentrate on the following: messenger RNA (mRNA) expression using HT-HG-U133A (Affymetrix) arrays, DNA methylation (METH) using HumanMethylation27K (Illumina) and Table 3. Simulations for Case II, training sample size = 100, test sample size = 100. BVS-SL(κ) represents the Bayes variable selection with belief parameter κ for all edges. Pencred, SSVS, Lasso, EL, SCAD, SSL, and Flasso represent the penalized joint credible regions approach, stochastic search variable selection, L 1 penalized regression, and elastic net, the smooth clipped absolute deviation, the spike and slab lasso, and sparse fused lasso respectively. MSPE: out of sample predictive MSE; Pwr(10% FDR) is sensitivity controlling for 90% specificity; MS: estimated model size; FP: false positives, and Cov 95 is coverage under 95% predictive intervals. The true model size is 10. DNA copy number (CN) HG-CGH-244A (Agilent) arrays. All the resulting data from the three platforms are pre-processed, normalized and annotated to the gene level. We focus our analysis on 48 genes that overlap with the three critical signaling pathways-RTK/PI3K, p53, and Rb, which are involved in migration, survival and apoptosis progression of cell cycles in cancer [23]. These pathways are dominantly dis-regulated in GBMs, as confirmed by integrative analyses of TCGA GBM samples [35]. Furthermore, the activity of these pathways is seen to vary across molecular subtypes, suggesting potential for therapeutic targeting (via inhibition of receptor tyrosine kinase activity) and prognostic assessment [23]. Thus reconstructing the topology and connectivity of these genes and pathways and evaluating the downstream impact on GBM prognostic time can shed light into the underlying cellular and biological mechanisms involved in the evolution of the GBM disease process. Thus our covariate matrix consists of 48 genes mapped to these core pathways from D Ã = 3 platforms (mRNA, METH, CN) resulting in p = 48 × 3 = 144 regressors. Note that mRNA and METH are continuous, while CN is discrete having three categories corresponding to loss, gain, or neutral. The outcome is log-transformed survival times for 233 patients which is regressed on the covariates using an accelerated failure time model. Among 233 patients, 70 were censored, whose survival times were imputed using Kaplan-Meier imputation. Prior knowledge: The prior knowledge on the graphical structure between these 48 genes is based on previous studies in GBM [34], and is denoted as G 0,pr (shown in panel (b) of Fig 1). This prior graph is obtained by assessing sequence mutations, copy number alterations and proteins and confirm and extend the observation that GBM alterations tend to occur within specific functional modules. The prior graph in our analysis comprises 144 nodes, across the 3  platforms, and is constructed so as to preserve the prior graphical structure G 0,pr within the platforms, while allowing the data to infer interactions between two different platforms. Thus the prior graph can be concisely written as: G 0 = G 0,pr I 3 , where represents the Kronecker product of the two matrices. Since we have strong prior knowledge about within platform interactions, we choose a high value for the belief parameter (κ = 50) within platforms. However we are unsure of the between platform associations and hence we choose a near zero κ value corresponding to these interactions, so that we learn these interactions directly from the raw data without imposing a strict prior belief. Results for survival-time association: We first surveyed the main prognostic (multi-platform) markers that were associated with the survival time of the GBM patients. The marginal inclusion probabilities of the variables using our analysis are presented in Table 5, with a corresponding plot in Fig 7 in the manuscript. We select the posterior probability threshold to infer important features under a false discovery rate criteria controlled at a pre-specified level, similar to the method described in [36]. In particular, we can choose a threshold ϕ θ for posterior probabilities so as to control the average Bayesian FDR at level θ, which essentially implies that we expect 100θ% of the significant markers to be false positives. To obtain such an estimate, first sort the posterior probabilities for all markers in ascending order to yield pr (j) , j = 1, . . ., p.

Method
Then ϕ θ = pr (z) , where z ¼ max fj Ã : j ÃÀ 1 P j Ã j¼1 pr ðzÞ yg. Under a level 0.2 (corresponding to a posterior probability threshold of 0.7), and after thresholding, seven genes are significantly associated with progression through various mechanisms. They are (a) HRAS, TP53, CCND1, BRAF and CDKN2C, through copy number, (b) GRB2 through methylation, and (c) MDM2 through mRNA. Of these CDKN2C and GRB2 are positive drivers of progression, while the remaining genes are negatively associated with progression. HRAS is a member of the RAS oncogene family, whose negative effect on Glioblastoma is previously observed on the overall and progression-free survival [37]. CCND1 belongs to the Cyclin D family of cell cycle regulators, which are known to be up-regulated and amplified in malignant glioma [38]. Similarly, MDM2 the inhibitor of the tumor suppressor TP53, is established to be a candidate gene associated with short progression [39]. TP53 copy-number itself is associated with poor progression of GBM via deletion [40]. Although, there is no evidence of BRAF amplification in GBM, a previous study established that BRAF amplification via gene duplication event activates the MAPK signaling in low-grade glioma [41]. Moreover, CDKN2C is a well characterized tumor suppressor gene associated with many cancers and known to be deleted in Glioblastoma [42]. On the other hand, GRB2 is a key protein in epidermal growth factor receptor signaling in the Glioblastoma tumoroginesis pathway [43].
Clique analysis: The important cliques are identified ass those which have significant marginal clique inclusion probabilities. The clique analysis depicted multiple interesting two-way interactions. In certain cases, the multiple cliques containing the same molecular probe but with different partners have highly significant marginal inclusion probabilities. For instance AKT1 (METH) clique interaction with many different molecular probes is significant ( Table 6). These cliques constitute both tumor suppressing as well as activating interactions. The cliques involving AKT1 (METH), PTEN (mRNA) and AKT1 (METH), PIK3R2 (mRNA) can be construed as tumor suppressing, while cliques involving AKT1 (METH), CCND1 (mRNA) and AKT1 (METH), GRB2 (CN) probably are tumor activating. The diverse biological functionality of the cliques represent the inherent biological subtypes within GBM [44].
Neighborhood analysis: In addition to detecting important prognostic markers for GBM, we also examine the estimated graph (panel (c) of Fig 1) within and across platforms. We take a closer look at the neighborhood of GRB2, which plays a central biological role in this molecular network as a trigger of the RAS signaling upon the activation of upstream receptor tyrosine kinase family members. The presence of three important tumor suppressor genes of GBM in the neighborhood of GRB2 (RB1, CDKN2B and PIK3CG) is interesting, although they have no direct interaction with GRB2. RB1 and PIK3CG seem to lose their functionality through DNA methylation, while CDKN2B through copy number loss, enabling the RTK/RAS activation cascade via GRB2. These events reinforce the previous illustration in GBM that hypermethylation and deletion of RB1 and CDKN2B respectively contribute to the loss of tumor suppressor function [45]. The partial correlations of genes between the platforms is demonstrated via clustering heatmaps in Fig 8. From the Figure, it is clear that there is a enrichment of positive correlations between the mRNA and copy number data, and an enrichment of negative correlations between the mRNA and DNA methylation data, which further supports our biological-hypothesis driven integrative models.
We performed additional data analysis where (a) no graph information was used (κ = 0); and (b) only 75% confidence was placed on the prior graph knowledge for within platform interactions, which was implemented by setting (κ + a p )/(κ + a p + b p ) = 0.75. The results are presented in detail in S2 Appendix and point to considerable overlap between the different analyses results. The proposed approach could be further improved by accounting for non-linear graphical connections, as well as non-linear relationships between the outcome and predictors. Although the proposed method relies on cliques in the estimated graphical structure to account for collinearity, the approach can be generalized more incomplete connections between variables if it is possible to define such subgroups in a meaningful manner. Moreover, in practical applications where tumor heterogeneity is present, it is reasonable to expect subgroups of subjects corresponding to different but unknown tumor types to have different gene networks. In such cases, the proposed approach needs to be extended to unsupervised clustering approach incorporating a distinct gene network for each cluster. In addition, another potential improvement would be to propose a more efficient computational strategy which allows for greater scalability in terms of the number of covariates, which would enable us to construct genome-wide networks. In summary, network science is a rapidly evolving field with the main focus on the exploration of structural properties and dynamical behaviors of complex networked systems [47]- [49], and the proposed approach makes an important and timely contribution to this research area.

S1 Interactive Plot. Interactive version of Fig 1.
We have generated an additional interactive pdf figure containing subpanels (a) and (c) in Fig 1, which enables to reader to zoom in a look at these panels of the diagram in greater detail. (PDF) S1 Appendix. Calibration of the belief parameter. This Appendix contains guidance on the choice of the belief parameter based on the degree of confidence that one would like to put on the prior belief. (PDF) S2 Appendix. Sensitivity to prior knowledge. This Appendix contains further results on the sensitivity of the BVS-SL approach when the mis-specification of the prior knowledge is varied in simulations. It further contains separate analysis of the TCGA data when (a) no graph information was used (κ = 0); and (b) only 75% confidence was placed on the prior graph knowledge for within platform interactions. (PDF) S1 Data. TCGA data used in the article. The data file is a. Rdata file which contains the survival time and censoring status for subjects, as well as the copy number variation, gene expression and methylation measurements of the probes matched to the 48 genes considered in the real data analysis.