MC EMiNEM Maps the Interaction Landscape of the Mediator

The Mediator is a highly conserved, large multiprotein complex that is involved essentially in the regulation of eukaryotic mRNA transcription. It acts as a general transcription factor by integrating regulatory signals from gene-specific activators or repressors to the RNA Polymerase II. The internal network of interactions between Mediator subunits that conveys these signals is largely unknown. Here, we introduce MC EMiNEM, a novel method for the retrieval of functional dependencies between proteins that have pleiotropic effects on mRNA transcription. MC EMiNEM is based on Nested Effects Models (NEMs), a class of probabilistic graphical models that extends the idea of hierarchical clustering. It combines mode-hopping Monte Carlo (MC) sampling with an Expectation-Maximization (EM) algorithm for NEMs to increase sensitivity compared to existing methods. A meta-analysis of four Mediator perturbation studies in Saccharomyces cerevisiae, three of which are unpublished, provides new insight into the Mediator signaling network. In addition to the known modular organization of the Mediator subunits, MC EMiNEM reveals a hierarchical ordering of its internal information flow, which is putatively transmitted through structural changes within the complex. We identify the N-terminus of Med7 as a peripheral entity, entailing only local structural changes upon perturbation, while the C-terminus of Med7 and Med19 appear to play a central role. MC EMiNEM associates Mediator subunits to most directly affected genes, which, in conjunction with gene set enrichment analysis, allows us to construct an interaction map of Mediator subunits and transcription factors.


EM algorithm
This section is an extended version of the EM Section in the main text. We report the results that arise from the EM algorithm when applied to our situation. The calculations involve only elementary algebra but are sometimes tedious. According to [1], the likelihood function ( We assume edge-wise independent priors, π(Θ, H) = π S (Θ)·π E (H), and π S (Θ) = i,j π S (Θ ij ), π E (H •k ) = k π E (H •k ).
1.1. The general EM algorithm. Throughout this section, the data D resp. the matrix R is considered given and xed. We want to nd the maximum a posteriori estimateΘ for the signals graph, Θ t+1 = arg max Θ Q(Θ; Θ t ) + log π S (Θ) , which is usually a much easier task than solving (1.4) directly. In the following, both steps are described in detail.
Then, the terms in Q(Θ; Θ t ) can be rearranged We seek for an expression for (1.6) which is amenable to analytic maximization strategies. Let V = R E be an m-dimensional vector space, which is spanned by the unit column vectors e k ∈ V , k ∈ E, and let e 0 = 0 ∈ V . We assume further that the prior for H factorizes into priors for each eect, (1.9) π E (H) = k∈E π E k (He k ) .
Let d j be the j-th unit column vector of dimension n, and d 0 the n-dimensional null vector. The NEM model assumes that each eect assigns to at most one signal, so π E k (v) = 0 for each vector v ∈ {d j , j = 0, ..., n}, k ∈ E, and (1. 10) π E k (d j ) = π jk , j = 0, 1, ..., n, and n j=0 π E jk = 1 .
The m × m unit matrix is denoted by E. Be aware of the identity E = k∈E e k e T k . We take advantage of the fact that the trace of a quadratic matrix is a linear function, and that tr(AB) = tr(BA) for arbitrary (compatible) matrices A, B.
Analogously, (1.13) P (D | H, Θ t ) ∝ exp(tr(Θ t HR T )) = k∈E f k (He k , Θ t ) , with f k (v, Θ t ) = exp(g k (v, Θ t )). For convenience we suppress the dependence of g k on Θ (and write g k (v) instead of g k (v, Θ)) and of f k on Θ t (and write f k (v) instead of f k (v, Θ t )). Let W = {0, 1} n . The evaluation ofQ(Θ; Θ t ) can be simplied considerably. For r = 1, ..., m, let (1.14) We introduce two more terms, By reverse induction we prove the formula Am . The induction step is completed by Note that for a deterministic prior, xing an eects gene As a further simplication we assume edgewise independent priors: (we may disregard the cases in which π S ij ∈ {0, 1}, because this means that the corresponding edge Θ ij is xed as absent or present and is therefore not subject to optimization). The log of the prior is then a linear function in each Θ ab : This implies that the objective function (1.24)Q(Θ; Θ t ) + log π S (Θ) is a polynomial in the variables {Θ ab | a = 1, ..., n; b = 1, ..., n} of total degree 1. The partial derivatives of the objective function with respect to Θ ab are therefore constant, i.e., independent of Θ (Note that Θd j equals the j-th column of Θ, so g k (d j , Θ) = e T k R T Θd j is linear in the entries of Θ): (1.31) In the general case of an arbitrary prior π S (Θ), it can be dicult to nd a global optimum of the objective functionQ(Θ; Θ t )+log π S (Θ). However it is not necessary to nd a global optimum, it is sucient to nd a Θ t+1 that increases the value of the objective function over the current valueQ(Θ; Θ t ) + log π S (Θ t ).
It has been shown in [3] that such a "stepwise" EM still converges to a local maximum of P (Θ | D). Therefore, we start with Θ = Θ t and go through all edges Θ ab in a random order and check whether alteration of Θ ab improves the objective function. If yes, we perform this change in Θ and continue until all edges were checked. The resulting Θ is our new Θ t+1 . (1) Initialize Θ 0 (2) Proposal step: Given Θ n , draw a candidate Θ from the proposal distribution q(Θ n → Θ ) (3) Acceptance step: With probability min(1, L(Θ )·π(Θ )·q(Θ →Θn) L(Θn)·π(nn)·q(nn→Θ ) ), let Θ n+1 = Θ (accept). Otherwise, let Θ n+1 = Θ n (reject). (4) Increment n by one and repeat steps 2. and 3. until convergence Convergence. Starting from a (generally) randomly chosen initial parameter value, it takes some time until the chain converges to the true probability distribution. Thus, the rst part of the chain, the so-called burnin phase is removed and only the so-called stationary phase is used for further analysis. Thus, based on a Θ t = {a → c}, EMiNEM is not able to cross the low-probability states to arrive at the high probability state, changing only one edge at the same time. Chain length. Each chain consists of 60.000 steps. A major component of Markov Chain Monte Carlo sampling is the decision after how many steps the chain has converged, i.e., how many steps have to be excluded at the beginning of the sequence, such that the nal part reliably represents the desired posterior distribution. Here, this decision is trivial: traceplots of the simulation runs showed, that after well less than the 60.000 steps the chain converges to one nalΘ, i.e., the MCMC sampling can be seen as an additional EM algorithm (see Fig. S3.3). The MCMC runs of the Mediator data showed the same behavior (see Fig. S4.4). Thus, any information drawn from this nal part of the sequence is good.
However, for reasons of consistency, and since the eect gene attachment is updated every 5000 steps (as described before), only parameters according to the nal attachment, i.e., the 5000 last parameters of the sequence are retained.
Prior information. Since nothing is known about the signals graph, a uniform prior is chosen (i.e., edge frequency = 0.5). For the eects graph, an (edgewise independent) prior is calculated for each eect node, as explained in the main text. However, if additional prior information on either the signals graph or the eects graph is available, its incorporation can speed up convergence time.
Initial parameter. The chain is initialized with a randomly sampled signals graph, based on a sparse edge frequency. Independence of the chain from the initial signals graph (which is an essential property of Markov Chains) has been veried by simulation.
Proposal function. At each step, a new candidate parameter Θ is suggested based on the last one (Θ n ).
The crux is to choose a proposal function q(Θ n → Θ ), such that it is a trade-o between steps that are big enough to scan the whole parameter space in reasonable time, but small enough to still being accepted. Simulation has shown that randomly selecting 1.5 · |S| edges and replacing them according to the predened edge frequency results in both sucient acceptance and good mixing of the chain.
Acception / Rejection. The newly suggested parameter is accepted (and added to the chain), if otherwise it is rejected and the old parameter is added once again. Note that Θ is the signals graph suggested by the proposal function, whileΘ is the corresponding local maximum derived by EMiNEM, as explained in the main text. We included an additional prior π sparse (Θ) = j,k f for sparsity of the sampled graph (f edge is the expected relative edge abundance of the sampled graph).
The corresponding weighting parameter w sparse = 0.5 has been determined empirically during simulation. Moderate variation of w sparse did not change the results qualitatively (data not shown).
Resulting signals graph. The Markov Chain provides an approximation of the posterior distribution of the sampled parameters. We extract one resulting signals graph from this chain by weighting all edges by their frequency in the last 5000 steps and only retaining those that appear in at least 50%. In general, this marginalization might result in a loss of information because dependencies between edges are not considered any more. However, since the Markov Chain in our case converges very fast to a unique, dominating signals graph which then will be extracted as the resulting graph, there basically is no marginalization and so this problem does not arise here.

A theoretical motivation for the sampling of local maxima. EMiNEM is viewed as a function
EM : Θ →Θ = EM (Θ), which maps the signals graph space M S onto the space N = EM (M S ) of local maxima of the posterior. The current paragraph is devoted to constructing a sequence in N that provides a representative sample of P | N , the restriction of the posterior probability P to N . Our task is complicated by the fact that we cannot construct functions that sample from N directly, because the calculation of each member requires the application of EMiNEM. Instead, we use Metropolis-Hastings Markov Chain Monte Carlo (MCMC) sampling [46] to construct a sequence in M S , and lift it to N (see Supplements S2.2 for details on our implementation). Let (Θ i ) i=1,2,... be a sequence of signals graphs in M S obtained by MCMC sampling from the distribution P . The corresponding sequence (EM (Θ i )) i=1,2,... is then an approximate empirical sample from the distributionP (Θ) = {P (Θ | D); EM (Θ) =Θ} ≈ P | N (Θ) on N . This approximation is valid under the assumption that the probability of P (Θ | D) is substantially larger than P (Θ | D) for all other Θ ∈ EM −1 (Θ), which is presumably the case. However, the convergence speed of this Markov chain is very slow, the reason being implicit in the assumption: In order to move from one local maximum to a dierent one, the underlying Markov Chain in M S needs to traverse regions of substantially lower probability. We remove this obstacle by sampling then an approximate empirical sample from the distribution The last approximation assumes that the pre-image ofΘ under EM has a similar size c for allΘ ∈ N .
In any case, we expect the relative probabilityP (Θ1) P (Θ2) to be dominated by the quotient P (Θ1|D) P (Θ2|D) , which justies our approximation in Equation (2.1) for the purpose of nding high-scoring graphsΘ.
2.4. Empirical Bayes estimation of the eects graph prior. The attachment probability H i jk of eect node k to signal node j, based on one signals graph Θ i , is: The new attachment probability, based on the preceding N = |chain| 12 steps of the Markov Chain, is In our approach, we do not sample from M S directly, but we sample from a set of local maxima N . This set is much smaller and develops slower than M S , as can be seen in the traceplots. Note that this set changes every epoch, since the prior is updated empirically.  Above, the signals graph is shown, below the corresponding R matrix, clustered according to the gene attachment (rows: perturbations, columns: eects on measured genes). Red color indicates a positive log-ratio value, blue color indicates a negative log-ratio value.
The stronger the color of a eld R kj , k ∈ E, j ∈ S, the higher the probability that the measured data is due to the fact that there actually is an eect of signal j on gene k, or, that there is no eect, respectively. stronger the color of a eld R kj , k ∈ E, j ∈ S, the higher the probability that the measured data is due to the fact that there actually is an eect of signal j on gene k, or, that there is no eect, respectively. The R matrix here is the same as in Fig. S3.1, but the ordering of genes (columns) is dierent, since it depends on the gene attachment derived by the MCMC sampling.
noise parameters µ and δ. Θ true and H true are randomly sampled according to | S | true and | E | true and the true data matrix (| S | true × | E | true ) is calculated according to these graphs. A noisy log-odds ratio matrix is then calculated based on the true eects by sampling its values from two normal distributions with (mean = − µ 2 , sd = δ) and (mean = µ 2 , sd = δ), respectively. µ and δ have been chosen such that the α − level and the β − level have the values described in the main text. A simulated NEM and the corresponding prediction of MC EMiNEM, for | S |= 8 and β − level = 49% are shown in Fig. S3.1 and Convergence. The convergence of the Markov chain is important in order to get a representative sample of parameter values. It has been veried in simulation runs, as outlined in the main part of this paper.
Traceplots for the example mentioned above (Fig. S3.1, Fig. S3.2) are shown in Fig. S3.3 (all edges) and   Since various signals graphs can yield the same local maxima, the sampled graphs vary strongly throughout the whole sampling process, while the local maxima vary slower and in a more restricted model space and converge in the second half of the Markov chain.
This behavior has been discussed extensively in the main text.  for a given eect, the attachment probability is the same for any signal node (or no signal node at all) ). Obviously, even though the initial prior for the eects graph is calculated according to the data matrix, the entropy is still very high. During the Empirical Bayes procedure, some eects turn out to be quit deterministic, while others remain exible until the end of the Markov chain.
Random. For each NEM, 5000 random signals graphs have been sampled, according to the parameters described in the main text. Every (unique) graph has than been weighted by its posterior and a consensus signals graph has been built including all edges with a (weighted) value of ≥ 0.5. This is the most trivial method for parameter estimation.
As expected, this method yields quit good results for small numbers of signal nodes, where the probability of randomly drawing reasonable graphs is higher. However, for larger number of signal nodes, independent of the noise level, this method is not able to detect the correct edges at all.
EMiNEM. This method is based on the random sampling approach, except that not the sampled signals graphs but their corresponding local maxima have been weighted and combined to a consensus signals graph.
EMiNEM is slightly better than random sampling, because by only taking into account local maxima unlikely graphs are excluded from the consensus. However, it still relies on random drawing of signals  Nessy. Nessy is a publicly available NEM implementation, introduced by [1]. Unlike (MC) EMiNEM it's a maximum full likelihood / posteriori approach, where not only the maximum for the signals graph but also for the eects graph should be identied. Since no prior knowledge regarding the signals graph is available, but a sparse graph is assumed, Nessy is initialized with the empty graph.
MC EMiNEM is a maximum marginal posteriori approach, it only calculates the maximum for the signals graph and marginalizes over the eects graph. For good data, with low amount of noise, the eects graph is clearly identiable and MC EMiNEM and Nessy perform comparably. However, for higher noise the calculation of the maximum eects graph is error-prone and the risk of getting stuck in the wrong model is high, so Nessy is clearly outperformed by MC EMiNEM there.
nem. nem is the original NEM implementation, publicly available through Bioconductor [8]. Recently, [9] published a review of all currently available NEM algorithms, where they recommend the Bayesian greedy hillclimbing approach for small networks as the method of choice. It calculates the original NEM score by integrating over all eects graphs. According to these ndings, we applied nem on the log-odds ratios with the following parameters: inference="nem.greedy" and type="CONTmLLBayes". Again, we chose a signals graph prior and an eects graph prior as described in the main text. Med7C/Med10/Med19/Med21. The S. cerevisiae strains used were derivatives of SLY101: MATα ade-can1-100 cyh2r his3-11,15 leu2-3,112 trp1-1 ura3 [11]. Samples were grown in SD (synthetic complete) medium medium overnight, diluted to an OD600 of 0.1 the next day and grown to a nal OD600 of 0.8.
Cells were centrifuged at 4,000 rpm for 1 min and cell pellets were immediately ash frozen in liquid nitrogen. Total RNA was extracted using the RiboPure-Yeast Kit (Ambion/Applied Biosystems), following the manufacturer's protocol RNA was extracted as above. Labeling of samples was performed using the 4.2. Data processing. Data processing has been done using R [12].
The arrays were read in and transformed to expression values one by one, using expresso() from the R/Bioconductor package ay [13] with the following parameters for background correction and summarization: bgcorrect.method="rma",pmcorrect.method="pmonly",summary.method="avgdi". Some arrays included S.pombe probes, they were ltered to S.cerevisiae. The median expression values were centered to zero (on the log-scale) for each array(this step has only the purpose of generating a sensible average expression distribution in the subsequent quantile normalization step). The expression values were log2 transformed and quantile normalization was performed afterwards using quantile.normalization() from the ay-package.
The R/Bioconductor package limma [14] was used for further assessment of dierential gene expression.
A design matrix was constructed that takes into account batch-specic eects as well as subunit-specic eects. The linear regression model was tted using lmFit(). Finally, the log-odds ratios corresponding to subunit specic eects were extracted using ebayes(). The essential four lines of codes are listed below (the design matrix design is the standard design matrix for multiple groups vs. one reference experiments, expr are the expression values derived from the data, lodsmat is the desired log-odds matrix R (see main text)): with respect to these contrasts were removed from the subsequent analysis. Additionally, genes that do not react to any perturbation (here with a log-odds ratio < 0 in all cases) were removed. It is important to check the nal result R = (R jk ) for artifacts. Sometimes, R contains erratic, extraordinarily high entries that dominate the likelihood term. This is due to the fact that limma performs a variant of the t-test, in which the variance term is estimated. Although this is done in a robust manner, it may still sometimes underestimate the true variance, leading to overly signicant results. We advise the user to manually threshold the entries in the R-matrix (e.g., to the 99% quantile of the entries in the R matrix), if outliers in the R-matrix occur. In our Mediator application, this was not necessary.
4.3. Gene set enrichment analysis for transcription factor targets. The gene set enrichment analysis was done according to [15], using the R/Bioconductor package mgsa (version 1.2.0) [16], with the fol- Comparison with cluster analysis. In Fig. S4.7, the clustering of Mediator subunits and genes based on fold changes and log-odds ratios is depicted. Both approaches lead to an almost identical (isomorphic) dendrogram, which also agrees well with the MC EMiNEM's signals graph (if edge directions are ignored).
This means that the coarse grouping of Mediator subunits can already be read o the expression proles.
However, MC EMiNEM provides more detailed information on the hierarchical structure of the Mediator organization, as well as on the attachment of eects.  log-ratio value, blue color indicates a negative log-ratio value. The stronger the color of a eld R kj , k ∈ E, j ∈ S, the higher the probability that the measured data is due to the fact that there actually is an eect of signal j on gene k, or, that there is no eect, respectively. There exists an edge Med10 → Med21 as well as Med21 → Med10.The similarity between the two perturbations is also clearly visible in the perturbation prole.
Thus, in the following, the two Mediator subunits are treated as one node in the NEM. perturbations, columns: eects on measured genes). Red color indicates a positive logratio value, blue color indicates a negative log-ratio value. The stronger the color of a eld R kj , k ∈ E, j ∈ S, the higher the probability that the measured data is due to the fact that there actually is an eect of signal j on gene k, or, that there is no eect, respectively. There exists an edge Med10 → Med21 as well as Med21 → Med10.The similarity between the two perturbations is also clearly visible in the perturbation prole.
Thus, in the following, the two Mediator subunits are treated as one node in the NEM. indicates a negative log-ratio value. The stronger the color of a eld R kj , k ∈ E, j ∈ S, the higher the probability that the measured data is due to the fact that there actually is an eect of signal j on gene k, or, that there is no eect, respectively. A detailed discussion of the results can be found in the main text.  Since various signals graphs can yield the same local maxima, the sampled graphs vary strongly, while the local maxima vary slower and in a more restricted model space and converge faster. This behavior has been discussed extensively in the main text.     # start estimation process and visualize results net = nem(lodsmat, inference="mc.eminem", models=models, control=control); plot(net); A short introduction to MC EMiNEM is provided above (see the nem package vignette for more details).
It also lists all parameters that have to be set in order to run MC EMiNEM: type and inference dene the method to be used (here: MC EMiNEM); Sgenes (the signal names) is dened by the input data; Pe (the eects graph prior) and Pm.frac_edges (the sparsity prior) incorporate prior knowledge and are no parameters in the proper sense; mcmc.nsamples (the length of the stationary phase), mcmc.nburnin (the length of the burn-in phase), eminem.sdVal (the width of the proposal function) and eminem.changeHfreq (the Empirical Bayes parameter) only inuence the length of the Markov chain and its mixing properties, i.e., as long as the chain is long enough, they do not aect the nal distribution. Hence, the only parameter that has to be adjusted is the weight of the sparsity prior lambda. As already discussed in the Supplementary Section S2.2, the sparsity prior itself is important, but moderate variation of lambda did not change the results qualitatively.