Detecting disease-associated genomic outcomes using constrained mixture of Bayesian hierarchical models for paired data

Detecting disease-associated genomic outcomes is one of the key steps in precision medicine research. Cutting-edge high-throughput technologies enable researchers to unbiasedly test if genomic outcomes are associated with disease of interest. However, these technologies also include the challenges associated with the analysis of genome-wide data. Two big challenges are (1) how to reduce the effects of technical noise; and (2) how to handle the curse of dimensionality (i.e., number of variables are way larger than the number of samples). To tackle these challenges, we propose a constrained mixture of Bayesian hierarchical models (MBHM) for detecting disease-associated genomic outcomes for data obtained from paired/matched designs. Paired/matched designs can effectively reduce effects of confounding factors. MBHM does not involve multiple testing, hence does not have the problem of the curse of dimensionality. It also could borrow information across genes so that it can be used for whole genome data with small sample sizes.

In the followings, we briefly describe the eLNN model. For details of eLNN and other models, please refer to the corresponding references. All these 4 models assume that samples are independent.
The eLNN model assumes that there are 2 clusters of gene probes: (1) gene probes nondifferentially expressed (NE) between cases and controls and (2) gene probes differentially expressed (DE) between cases and controls. Let Z g,mc+2 . . .
i denote the gene expression profile for the g-th gene across two different types of tissues with m c and m n tissues, respectively. Without loss of generality, we assume the first m c elements of Z g correspond to the first type of tissue samples (cancer tissue samples) and the second m n elements of Z g correspond to the second type of tissue samples (normal tissue samples).
Similarly, for the g-th DE gene probe, its marginal distribution of Z g is (A5)

B The marginal distributions of eLNNpaired
For simplicity, we use the original parameters, rather than re-parameterized parameters, in the following formulae.
In the derivation of the marginal distributions, we used the following facts about the Pearson type VII distribution and the property of a probability density function.
The density of the Pearson type VII distribution (symmetric 3-parameter family) is where B is the Beta function iii and The marginal distribution for over-expressed (OE) gene probes is The marginal distribution for under-expressed (UE) gene probes can be derived similarly The marginal distribution for non-differentially expressed (NE) gene probes is simpler and has the form:
Since z g 's are unknown, we integrate out z g and get the expected log likelihood function: In the M-step, we maximize the log likelihood over parameters π and ψ. We used the built-in function optim of the statistical software R and its L-BFGS-B optimization procedure to do the maximization.
In the followings, we derive the gradients of Q. Since π 1 + π 2 + π 3 = 1, we can eliminate It is also worthy to point out the fact that g (z g1 +z g2 +z g3 ) = G.
We take partial derivative of Q w.r.t π 1 and π 2 : solve the equations for π: We derive the Hessian matrix to show Q reaches its maxima w.r.t. π: viii So for any x = (x 1 , x 2 ) ′ = (0, 0) ′ , we have: This indicates that the Hessian matrix is negative definite.
For the rest derivatives, we list them by clusters: Then we have Then we have .
We iterate the E-step and M-step until the difference of parameters π and ψ between two consecutive iterations is small or the number of iterations has reached the allowed maximum number.

D Initialization of the EM algorithm
To get initial estimates for the 10 parameters in ψ and the two parameters in π, we first run limma to get raw p-value of moderated t-test for each gene probe, based on which we can partition the dataset into three clusters (OE, UE, and NE) and then estimate π by the cluster proportions. Next, within each cluster, we can use moment estimator to get crude estimates of the model parameters for each cluster (µ 1 , k 1 , α 1 , and β 1 for OE; µ 2 , k 2 , α 2 , and β 2 for UE; and α 3 and β 3 for NE). We used sample median and sample median absolute deviation (mad) to estimate mean and standard deviation. We then re-parameterize these estimated parameters to get a rough estimate of ψ.