Bayesian Mixture Models for Assessment of Gene Differential Behaviour and Prediction of pCR through the Integration of Copy Number and Gene Expression Data

We consider modeling jointly microarray RNA expression and DNA copy number data. We propose Bayesian mixture models that define latent Gaussian probit scores for the DNA and RNA, and integrate between the two platforms via a regression of the RNA probit scores on the DNA probit scores. Such a regression conveniently allows us to include additional sample specific covariates such as biological conditions and clinical outcomes. The two developed methods are aimed respectively to make inference on differential behaviour of genes in patients showing different subtypes of breast cancer and to predict the pathological complete response (pCR) of patients borrowing strength across the genomic platforms. Posterior inference is carried out via MCMC simulations. We demonstrate the proposed methodology using a published data set consisting of 121 breast cancer patients.


Biological Background
Copy number and arrayCGH. Human beings have two copies of each gene, defined as a segment of DNA. The normal copy number of a gene is therefore two. Copy number aberration (CNA) refers to cytogenetic events in which the DNA replication process is disrupted such that the gene either is replicated multiple times (copy number gains) or loses one or both copies (copy number loss) in newly generated cells. Comparative Genomic Hybridization (CGH) has emerged as a dominant technique for detecting CNA [1], especially when combined with microarrays. The resulting arrayCGH techniques [2], [3], [4] and [5] measure thousands or millions of genomic targets or ''probes'' that are spotted or printed on a glass surface. These probes usually span the whole genome with a resolution of the order ranging from 1 MB (one million base pairs) for BAC (bacterial artificial chromosome), to 50-100 kb (kilo base pairs) for more recent microarrays. In an arrayCGH experiment, a DNA test sample of interest is labeled with a dye (say Cy3) and then mixed with a diploid reference sample labeled with a different dye (say Cy5). The combined sample is then hybridized to the microarrays and intensities of both colors are measured through an imaging process. The quantity of interest is the log 2 ratio of the two intensities for each color. The collection of the intensity ratios then provide useful information about genome-wide changes in copy numbers between the two samples. Since the reference sample is presumed to be diploid, the intensity ratio is determined by the copy number of the DNA in the test sample. If the copy number of the test sample is also two, then the theoretical log 2 intensity ratio equals zero. If there is a single copy loss in the test sample, the theoretical ratio is log 2 1=2~{1 assuming all the cells in the test sample lost one copy of the DNA fragment. If there is a single copy gain, the theoretical ratio is log 2 3=2~0:58: Multiple copy gains are called amplifications, and the corresponding theoretical intensity ratios are log 2 4=2 , log 2 5=2 , etc. When both copies are lost, the theoretical ratio is {? and a large negative value is usually observed in experiments.
Integration of DNA copy number and RNA expression. Expression microarrays measure RNA expression which, by the central dogma of molecular biology, are resulted from the transcription of DNAs. Microarray technology for measuring RNA gene expression has been well known to the statistical community, and its review is omitted here. Naturally, we are prone to think that CNAs impact the intensities of the relative RNA expressions in that more copies of DNA should lead to higher levels of RNA expression. It is therefore of great interest to study the intensity of such interaction, if there is any, between aCGH and RNA expression measurements on different genes.
Gene expression and copy number variation data have been broadly studied, to assess differential expression of genes [6] and to find segments along the DNA that show CNAs [7], [8]. Statistical and computational models for integrating different types of data are becoming a popular topic in the recent literature, even though only few considered full model-based approaches. [9] was among the earliest to investigate the direct association between the two types of data in breast cancer cell lines and tissue samples, and their approach was based mainly on descriptive statistics. Van Wieringen and Van de Wiel [10], attempting to mitigate the high noise in the raw expression measurements of the DNA and RNAs, proposed a sampling model for RNA expression incorporating estimated probabilities of corresponding CNAs. They subsequently developed nonparametric adaptive tests to study whether the estimated copy number variations in the DNA level would induce differential gene expression at the RNA level. More recently [11] presented a double-layered mixture model (DLLM ) that directly modeled segmental patterns in the copy number data to produce CNA profiles, and simultaneously scored the association between copy number and gene expression data. The DLMM assigned high scores to elevated or reduced expression measurement only if the expression changes are observed consistently across samples with copy number aberration.
An important biological premise to the description of the model is that by integrating DNA copy number and RNA expression data, we will gain more knowledge about the underlying biological process. For example, a high or low correlation between a copy number aberration (CNA) for a gene marker and its abnormal RNA expression would indicate different carcinogenic mechanism and therefore different treatment selections [12] [13].
We describe a Bayesian Mixture Model that converts the noisy raw intensity measurement of the DNA and RNAs into probability of expression, which are subsequently modeled as latent parameters. Thus the integration of the two platforms is realized by joint modeling the probabilities of expression through a probit regression. Our aim, however, is not only to evaluate the relative contribution of large genetic variants such as CNAs, to gene expression but also make inference using both differential expression of the genes and differential copy number variations of the same set of genes. Moreover our full model-based approach allows us, after new information on the patients in the study are acquired, to exploit the latent integrated structure of our model and achieve better predictive performances for the clinical outcome of new patients coming into the study.
In the next paragraph we present a motivating example with matched arrayCGH and microarray samples from breast cancer patients. In the materials and methods section we introduce probability models with a particular focus on the probit regression that allows for integration of both platforms, along with some simulation studies. Thus, in the result section, the focus is on posterior inference of the interaction between the two platforms, differential behaviour, which takes into account both differential gene expression and differential CNA, and prediction of the pCR of patients after treatment. A final remark is provided in the discussion section.

Motivating Example
We consider data in breast cancer consisting of 121 patients from three disease subgroups, ER+, HER2+, and triple negative (TN). ER+ patients have present estrogen receptors -a protein related to hormone and regulation of gene expression -in their cancer cells. HER2+ patients are instead those whose tumor cells test positive for a protein called human epidermal growth factor receptor 2. Finally TN patients lack three ''receptors'' in their cancer cells: ER, HER2, and progesterone receptors. ER+ and HER2+ patients were therefore collapsed in the same group, in order to compare TN patients versus others.
On a slightly reduced set of 116 patients we have a measure, formalized as a dichotomous variable, on their positive or negative pCR to treatment. Numerosities are specified in the table 1.
The mRNA expression data was obtained with Affymetrix U133A gene chips. The data was normalized with MAS5 algorithm, scaled to target intensity of 600 and log2 transformed. The expression profiles of the cancers are available at GEO accession number GSE22093 [14]. The DNA copy number data was generated with Agilent 4x44K CGH arrays, processed as log2 ratios of the intensities of the two colors, and is available at ArrayExpress accession number E-TABM-584.
ArrayCGH and microarray RNA experiments have been performed using the 121 breast samples to obtain the copy number data on 22,944 probes and RNA expression data for 11,306 genes. We then mapped 22,944 probes to the 11,306 genes, which gave us a matching between the probe ids on the aCGH and the gene ids on the microarrays.

Ethics Statement
All the research used public data, published in 2009 in the following paper: ''Molecular characterization of breast cancer with high-resolution oligonucleotide comparative genomic hybridization array'' written by Andre F. et al. and published in Clinical Cancer Research [14].

Sampling model for w and y
On arrayCGH, the experimental unit is probe b belonging to gene g. On RNA microarray, the experimental unit is gene g. Denote w bt the log2 intensity ratio for probe b at sample t, and y gt the RNA expression level for gene g at sample t, b~1,:::B g~1,:::,G, and t~1,:::T: Denoting fb [ gg the set of arrayCGH probes corresponding to gene g, the matched copy number and RNA expression data for sample t is then We propose mixture models for w and y and introduce latent variables representing the differential expression status of the DNA and RNA, respectively. We then integrate the two models by constructing a prior probit regression linking the latent variables from both platforms.
We use a mixture model [15] to introduce trinary latent indicator variables for the CNA state for each probe and the differential expression (DE) state for each gene. Specifically, let e w bt take values in the set f{1,0:1g , respectively corresponding to the copy-loss (v2 copy number), copy-neutral (~2 copy number), and copy-gain (w2 copy number) states and e y gt take values in the set f{1,0:1g , respectively corresponding to the under-, normal-, and over-expression states. Conditional on e w bt and e y gt , the sampling models for copy number log2 ratios w bt and for gene expression y gt are given by In (2), the mixture model for gene expression data y gt includes a gene effect m g and a sample effect a t . This is not the case in the mixture model for aCGH data w bt . The main reason is because w bt is already a log ratio between the cancer sample copy number and the reference sample copy number and therefore the and y z={ g define the tail overdispersion with respect to normality, associated with copy losses or gains for aCGH and under-or over-expression for microarrays.

Latent probit scores and probit regression
Anticipating the integration of both platforms using a regression model, we further introduce latent Gaussian variables z w bt and z y gt to define a probit scores for the trinary indicators e w bt and e y gt . Specifically, define Before we introduce the probit regression for integration, we present a prior for z w bt that allows for inference of different CNAs across different conditions, in our case of breast cancer data, different subtypes of breast cancer. Let x t is a clinical categorical covariate indicating which subgroups the patient belongs to, we assume that where fx t~j g,j~1,0 respectively if the patient belongs to TN subgroup or not, z w b , a probe-specific mean, describes a baseline CNA status (e.g., a reference subtype) and d w g a trinary indicator accounting for differential CNA in the two subtypes, following a prior distribution given by The integration of the two platforms is easily done using the latent probit scores and a linear model. First, we introduce a genelevel score for the aCGH data, defined as z w Keeping in mind that there is a natural biological causal relationship between DNA copy number change and altered gene expression for the corresponding RNAs, we assume that where x t is the clinical binary covariate mentioned above, while d y g and d yw g trinary indicators accounting respectively for differential gene expression in TN subgroup and interaction between the two measurement for gene g , following similar prior to the one mentioned above for d w g . Markov dependence across probes. A Markov dependence is assumed across the probes and it is defined in the following conditional prior on the probe specific effect. Define z w~( z w 1 ,:::,z w B ): Assuming that the index b is ordered according to Figure 2. Posterior probabilities of positive interaction between the two platforms (A), differential CNA (B) and differential joint behaviour (C) after simulation 2. The red dots highlight posterior probabilities of genes which are claimed by the model to show respectively positive interaction between the two platforms, differential CNA and differential joint behaviour. doi:10.1371/journal.pone.0068071.g002 (3) Figure 3. Posterior probabilities of differential CNA (on the x-axis) and differential expression (y-axis) obtained respectively through the marginal models on CNA data and gene expression data (A). Black dots highlight posterior probabilities of genes which are claimed by the model to show joint differential behaviour (A). Comparison between differences in means of the gene expression data and posterior locus proximity on the chromosome, the dependence across adjacent probes is described as follows. Let z 1 *N(0,1) and for b [ f2,:::,Bg: In this formulation the parameters b~(b 1 ,:::,b b{1 ) can be directly interpreted as partial correlation coefficients, defining the strength of dependence between log 2 ratios associated with probes that are adjacent on the chromosome.
Priors. The last step is the specification of the priors for the set of parameters that index the sampling model. We assume conditionally conjugate priors. Denoting G(a,b) a gamma distribution with mean ab, we assume Particular attention is given to the formulation of the prior for with s 1 much larger than s 2 and k 1 fixed at 1. The prior for b 's is given by for b [ f1,2,:::,B{1g , with t 2 v1 so that the marginal variance of z b 's is bounded above. Note that this model assumes that adjacent probes are equally correlated, characterized by b 's and t 2 . Alternatively, one could model the correlation between probes as a function of their genomics distances, and this can be easily achieved by modeling b b{1 as a distance between probes b and b{1, for example. Finally we assume conditionally conjugate priors for the gene and slide specific effects subject to X a t~0 . Finally, the normal range of variability in mRNA expression the tail over-dispersion parameters

Modified Probability Model for the prediction of pCR
The idea of this section raises from the question of whether or not we could use the same latent structure underneath gene expression and copy number variation data to make inference on a clinical outcome of new patients in the study, in particular u t , the pCR of patients to treatment.
The chosen approach is to state a model for y gt and w bt , p(w bt ,y gt Dh) , and to assume a Bernoulli distribution for u t . This leads us to the sought model p(u t Dw bt ,y gt ) and posterior probabilities of u t being 1 give us a measure for the prediction of the outcome of the new patient.
The advantages of our model with respect to, for example, a simple logistic regression p(u t Dy gt ,w bt ) are mainly the noise probability of differential expression (B). Comparison between sample correlations and posterior probabilities of positive interaction between platforms (C). doi:10.1371/journal.pone.0068071.g003 in equation (4) and (5) (with Bernoulli priors with probability 1{p and p very close to 1) allow for a reduction of the number of covariates (genes) and avoid the problem of overestimation.
In summary, as a new patient comes into a study and we have measurements of his gene expression and copy number variation, we run the model p(w bt ,y gt Dh) and assume for his clinical outcome u t a Bernoulli distribution with probability p . Through MCMC methods we obtain updated posterior probabilities of u t being 1 that give us a measure for the prediction of his outcome.
In this particular case the outcome refers to the pCR to the   treatment of patients in a breast cancer study, which is defined as a complete disappearance of the tumor with no more than a few scattered tumor cells detected by the pathologist in the resection specimen [16]. As before we use a mixture model (equations 1 and 2) [15] to introduce a trinary latent indicator variables for the CNA state for each probe and the expression level state for each gene, and latent Gaussian variables z w bt and z y gt to define a probit scores for the trinary indicators e w bt and e y gt (3). The next two equations embody our assumption that positive or negative clinical response of patients could be related to differential behaviour of a small subgroups of the 11,306 genes, i.e. copy number variation and gene expression. We assume where u t is the clinical outcome mentioned above, measured on the 116 patients, and d wu g is a binary indicator introduced for controlling the number of covariate in the regression.
The integration of the two platforms is implemented as a regression with the probit scores, z y gt Dz w gt *N(a g zd yu g q g u t zz w gt l lg ,1), ð5Þ where l lg characterizes the relationship between the two platform, d yu g is a binary indicator introduced for controlling the number of covariate in the regression and u t is the same variable as above.
As new patients tz1,:::,tzn come into the study, and supposedly they do not have an information on pCR, an assumption on their outcome is made, as follows: u tzi * iid Bernoulli (p), i~1,:::,n: so that we can learn about u t through the above prior and p(w bt ,y gt Du t ,h) , using Bayes formula and MCMC methods. The  Bernoulli probability p was set to be equal to the sample proportion of patients with positive pCR.
Priors. Priors are defined as in section 2.4, with the only exception of the regression parameters p g and q g , and the binary indicators d wu g and d yu g . For both the first parameters an informative prior is assumed p g *N(p p g ,s(p p g ) 2 ) q g *N(q q g ,s(q q g ) 2 ) withp p g ,q q g and the variances estimated using the data. While for the two indicators with p very close to 1 to allow for the selection of a very small subgroups of genes as covariates in the two regressions.
A summary of the model is given in the lower part of Figure 1.

Bayesian Multiplicity Control
Posterior inference for the proposed model is carried out using MCMC simulations by a Gibbs sampling scheme, iterating from the complete set of full conditionals reported in the appendix.
Since the analysis deals with high throughput gene expression data and our final aim is that of selecting interesting genes [17] multiple comparison problems arise.
A useful generalization of frequentist Type-I error rates to multiple hypothesis testing is the false discovery rate (FDR) introduced in Benjamini and Hochberg [18], and reviewed in a Bayesian framework by Storey [19], [20].
Let d g denote the indicator for gene g being differentially expressed under two biological conditions of interest (in our case we will be facing two different indicators d g1 and d g2 whether the comparison is ER+ vs TN or HER2+ vs TN).
H 0g : d g~0 ; H 1g : d g~1 : Let d g denote an indicator for rejecting the g-th comparison and D~X G g~1 d g denote the number of rejections; it is defined as the fraction of false rejections, relative to the total number of rejections. As such it is neither Bayesian nor frequentist. Under a Bayesian perspective, since the only unknown quantity is d g in the numerator, it can be defined an expected FDR. Let r g~P (d g~1 DY ) , then It was proved by M€ u uller et al. [21] that under several loss functions that combine false negative and false discovery counts the optimal rule is of the following form d Ã g~I (r g wt) . The problem is now that of specifying t so that the FDR is controlled at a desirable level.
An algorithm that allows us to compute FDR levels for number of discoveries, and therefore to select differentially expressed genes so that the FDR level is controlled at level a , consists in sorting, from the lowest to the highest, the marginal posterior probabilities p g~( 1{r g ) , to obtain (p (1) ,:::,p (G) ) . Thus, if p (1) =1wa, we do not reject any null hypothesis; otherwise, if (p (1) zp (2) )=2wa, we reject H (1) only. We iterate this procedure until the first time X G g~1 p (g) =Gwa , and reject H (1) ,:::H (G{1) .

Simulation Study
We perform a small simulation study and generate data in a way that the last 50 (out of 1,000) genes show joint differential behaviour in copy number and RNA expression. We firstly generated two matrices for gene expression (y gt ) and copy number log2 ratios (w bt ) , respectively of dimensions G|T and B|T, with B~2000 probes, G~1000 genes (exactly two probes per gene) and T~50 samples. The clinical covariate x t is set to be 1 for the first 10 patients and 0 for the remaining 40 patients. Sample and gene effects were generated from the corresponding priors in the model, a t *N(0,s 2 a ) subject to X a t~0 and m g *N(h m ,s 2 m ) . Observed log2 ratios and expression values were sampled from two Gaussian distributions, respectively centred at a t zm g and 0. To induce differential joint behaviour for the last 50 genes, we did the following: for RNA expression, we generated y gt *U({10,0) for g[f950,:::,1000g and t[f1,:::,10g; for copy number, we generated w bt *U({2,0) for b[f1900,:::,2000g and t[f1,:::,10g; The second simulation study generates data from the proposed mixture model. We started from setting l d yw g to be 2 for the first 50 genes and 0 for the remaining 950. and generated the latent scores , randomly with proportions respectively 30% and 70%, a g *N(0,1) for g[f1,2,:::,1000g and z y gt *N(a g zx t b d y g zl d yw g z w gt ,1) . Once the latent scores are generated, using (1 and 2), we generate gene y z={ g~+ 10 and s 2 g~1 , g [ f1,2,:::,1000g: In both cases roughly 2000 iterations were needed for convergence of the MCMC chain.
For the sake of simplicity, we report only results for the second simulation. In Figure 2 we show the posterior probabilities of positive interaction between platforms (fd yw g =0g ), differential CNA (fd w g =0g) and joint CNA and RNA differential expression (fd y g =0,d w g =0g ). As we expected, posterior probabilities of positive interaction between platforms for the first 50 genes and posterior porbabilities of differential CNA and differential joint behaviour for the first 100 genes are among the highest.
While these simulations merely show that our proposed models achieve what is expected, we direct attention to selection of differentially behaved genes with multiplicity control and then data analysis based on breast cancer samples.

Results
We applied our model to the breast cancer data set. As comparison, we also applied a simpler version of our models by setting l d yw g~0 for all the genes. The simpler models assume that the gene expression and copy numbers are independent and therefore there is no integration. We call these simpler versions ''marginal models''.
In the upper plot of Figure 3 dots refer to the posterior probabilities of DNA copy number amplification, P(d w g~1 Dw b(g)t ) , and over expression, P(d y g~1 Dy gt ) , based on the marginal models; black dots highlight the list of over-expressed genes which jointly showed copy number amplification obtained through the integrated model. As expected the joint model selects, coherently, mostly genes in the upper right corner, but still differently from the intersection between the marginal ones.
A simple model checking was achieved plotting posterior probabilities of differential gene expression and difference in means of the gene expression measurements for TN and non TN group. Following the same criteria, we plotted posterior probabilities of positive interaction between platforms and sample correlations. Lower plots of Figure 3 show, respectively, a very good match between no difference in sample means and low posterior probabilities of differential expression, and between strong positive sample correlations and high posterior probabilities of positive interaction between platforms.
Our main focus was on five lists of interesting genes: under (over)expressed genes which jointly showed DNA copy number deletion (amplification) in TN subgroup, under (over)-expressed genes conditional on DNA copy number aberration only in TN subgroup and genes which showed positive interaction between the two platforms. We therefore respectively defined N r g~P (d w g~{ 1,d y g~{ 1Dw b(g)t ,y gt ) y gt ) where t~1,:::T and b(g) indicates all the probes belonging to the gene g.
FDR levels were computed with the algorithm presented in the previous section for the distinct r g 's, and genes were selected choosing a cutoff a~0:05 The lists of selected genes could be of greater interest for clinicians since they indicate which genes show differential expression and copy number variation in TN patients versus patients who tests positively for ER and HER2 receptors.
On the other hand, for prediction of pCR, we split the data sets into a training set and a test set; the training set, consisting of 94 patients, was used to obtain samples from the posterior distribution of the parameters while the test set, consisting of 22, to check for prediction performances through the ROC curve. Both sets were randomly selected, and numerosities with respect to pCR of training and test samples are reported in table 2. We constrained numerosities in order for the test sample to be equally balanced between positive and negative pCR, and for the training sample to respect proportions of the original data set. The adopted method for the estimation of the smoothed ROC curve is LLoyd and Yong's one [22], which is proved to perform better than the empirical estimation. They proposed to estimate this curve from kernel smoothing of the distribution functions of the diagnostic measurement underlying the binary decision rule, i.e. the conditional posterior probabilities of positive pCR, and showed the significant accuracy achieved by this method for realistic sample size compared with the empirical estimation.
As mentioned above, the tests we performed were done on a sample of 22 patients, for which we had previously measured their pCR, and are based on the posterior probabilities of the clinical outcome being 1, P(u t~1 Dw b(g)t ,y gt ) , obtained running the Gibbs Sampler for 30.000 iterations. We performed the same analysis using marginally the two platforms and obtaining respectively posterior probabilities P(u t~1 Dw b(g)t ) and P(u t~1 Dy gt ) . These posterior probabilities, obtained through the joint and marginal models, are showed in Figure 4.
The ROC curves are compared in Figure 5 and such comparison confirms our choice of borrowing information between the two genomic platforms, since the ROC curve corresponding to the integrated model has by far the highest Area Under the Curve, slightly below 0.9.
We finally tried and compared our method with a simple logistic regression with LASSO variable selection (LLR) [23] [24], whose corresponding ROC curves are plotted in Figure 6. We performed the analysis using the package glmnet in R, and set the elastic net mixing parameter a to 1. The penalty is defined as and a~1 correponds to the Lasso penalty, which in this case gave the best prediction performances.
We therefore plotted in Figure 7 the smoothed ROC curves based on posterior probabilities of pCR obtained through the integrated model and on predictive probabilities obtained through LLR using only copy number variation data. The AUC under the curve obtained through our integrated model shows to be much higher that the one under the curve obtained through LLR.

Discussion
We have introduced a Bayesian hierarchical model to integrate two types of genomics data, copy number and RNA expression. The proposed model can be easily extended to multiple platforms, with modification to the modeling of latent probit scores. Since the entire statistical inference is based on a coherent probability model, scientific questions can be addressed with probability statements, allowing for reporting uncertainty measures such as FDR. This is the main advantage of the proposed models over existing ones.
In table 3 we reported the list of genes which show jointly over expression and copy number amplification in TN patients, which was of great interest for clinicians and was also the list associated with the lowest FDR levels. Gene MYC appeared in the list and the result is promising since MYC is a key regulator of cell growth, proliferation, metabolism, differentiation, and apoptosis and MYC deregulation contributes to breast cancer development and progression and is associated with poor outcomes. Multiple mechanisms are involved in MYC deregulation in breast cancer, including gene amplification, transcriptional regulation, and mRNA and protein stabilization, which correlate with loss of tumor suppressors and activation of oncogenic pathways [25].
Breast cancer has been classified into 5 or more subtypes based on gene expression profiles, and each subtype has distinct biological features and clinical outcomes. Among these subtypes, basal-like tumor is associated with a poor prognosis and has a lack of therapeutic targets. MYC is overexpressed in the basal-like subtype and may serve as a target for this aggressive subtype of breast cancer. Tumor suppressor BRCA1 inhibits MYC's transcriptional and transforming activity [25]. Loss of BRCA1 with MYC overexpression leads to the development of breast cancer, especially, basal-like breast cancer. As a downstream effector of estrogen receptor and epidermal growth factor receptor family pathways, MYC may contribute to resistance to adjuvant therapy. Targeting MYC-regulated pathways in combination with inhibitors of other oncogenic pathways may provide a promising therapeutic strategy for breast cancer, the basal-like subtype in particular [26].
As far as the model is concerned, there are a few possible weaknesses in the procedure, mainly related to the prior specification for parameters d 0 g s , related to differential expression and prediction. We were dealing with highly parametrized models and few observations data sets, reason why we chose some easier shortcuts in order to achieve faster MCMC convergency. Some interesting modifications of our prior specifications are now to be implemented, since we found in literature new and more efficient approaches to the issue of sparsity, such as the horseshoe prior [27].
Also, it was very hard to compare our models' performances with other methods, either due to the lack of codes or to the scarcity of works on the specific topic of prediction using integrated genomic platforms; we therefore chose a simple LASSO logistic regression which showed to be a poor fit for this particular data and this is mainly due to the high correlation between them.
Future work includes the development of models for integration of three or more platforms, and the extension to new type of genomics data, such as next-generation sequencing (NGS) data. In the latter case, the main challenge is the inclusion of a model for the count data from the NGS experiment. The intuitive statistical method for such an extension would be a graphical model, where network priors will be considered treating each platform as a node, and edges among the nodes will be interpreted as dependence between platforms.
Finally, all this project was focused on a specific data set, with rather particular features. The natural hierarchical structure and correlation between DNA and RNA makes very hard to think of the application of our methodology to different problems, though an interesting path to follow could be that of demographical sciences, where this hierarchical structure could be found for example in data at country level and regional level.