The relation between crosstalk and gene regulation form revisited

Genes differ in the frequency at which they are expressed and in the form of regulation used to control their activity. In particular, positive or negative regulation can lead to activation of a gene in response to an external signal. Previous works proposed that the form of regulation of a gene correlates with its frequency of usage: positive regulation when the gene is frequently expressed and negative regulation when infrequently expressed. Such network design means that, in the absence of their regulators, the genes are found in their least required activity state, hence regulatory intervention is often necessary. Due to the multitude of genes and regulators, spurious binding and unbinding events, called “crosstalk”, could occur. To determine how the form of regulation affects the global crosstalk in the network, we used a mathematical model that includes multiple regulators and multiple target genes. We found that crosstalk depends non-monotonically on the availability of regulators. Our analysis showed that excess use of regulation entailed by the formerly suggested network design caused high crosstalk levels in a large part of the parameter space. We therefore considered the opposite ‘idle’ design, where the default unregulated state of genes is their frequently required activity state. We found, that ‘idle’ design minimized the use of regulation and thus minimized crosstalk. In addition, we estimated global crosstalk of S. cerevisiae using transcription factors binding data. We demonstrated that even partial network data could suffice to estimate its global crosstalk, suggesting its applicability to additional organisms. We found that S. cerevisiae estimated crosstalk is lower than that of a random network, suggesting that natural selection reduces crosstalk. In summary, our study highlights a new type of protein production cost which is typically overlooked: that of regulatory interference caused by the presence of excess regulators in the cell. It demonstrates the importance of whole-network descriptions, which could show effects missed by single-gene models.


Model description
We consider a cell that has a total of M transciptionally regulated genes, which can be either active or inactive. We assume that each gene is regulated by a single unique transcription factor (TF) -its cognate TF. Each gene has a short DNA binding site to which a TF can bind to affect its regulatory state. A fraction 0 ď p ď 1 of the genes is regulated by activators and the remaining p1´pqM genes are regulated by repressors. When no activator is bound, activator-regulated genes are inactive (or active at a low basal level) and only become active once an activator TF binds their binding site. In contrast, repressor-regulated genes are by default active, unless a repressor TF binds their binding site and inhibits their activity. We assume that different environmental conditions require the activity of different subsets of proportion 0 ď q ď 1 of these genes, while the remaining fraction 1´q should be inactive. As both activity and inactivity of genes can be attained by means of either activator or repressor regulation, our model distinguishes between four sets of genes: (i) a ď q, p activator-regulated genes which are active, (ii) q´a repressor-regulated genes which are active, (iii) p´a activator-regulated genes which are inactive, and (iv) p1´pq´q`a repressor-regulated genes which are inactive.
The special cases in which all the genes have the same form of regulation, either repression or activation (namely p " 0 or p " 1), were studied in a previous work [1]. We assume the system is generally at steady state, such that the required gene expression pattern does not change in time and all molecular concentrations are fixed. We then consider the average crosstalk over different gene sets of the same size. This represents a series of different gene expression patterns required in different external conditions. We assume that the system only seldom shifts from one steady state to another, such that the transient time needed for gene regulation to equilibrate following each transition is negligible. We do not consider any form of feedback exerted by the products of these genes. Rather, we assume an idealized situation in which all necessary regulators are present exactly at the time and quantity needed. Any deviation from these conditions is expected to increase crosstalk levels.
Hence, our analysis refers to a lower bound of crosstalk levels.
Each gene is associated with a short regulatory DNA sequence (binding-site), to which its specialized cognate TF preferentially binds to affect its regulatory state, either positively or negatively. Although the regulatory sequences of different genes differ from each other and we assume that each TF is specific to the regulation of only one unique gene, TFs are known to have limited specificity to their DNA targets and can occasionally bind slightly different sequences, albeit with lower probability [2]. We define cases when a TF binds a noncognate binding site or when a binding site that should have been bound remains unoccupied, as 'crosstalk', potentially leading to an undesired regulatory outcome. To quantitate the probability of these events, we use the thermodynamic model of gene regulation [3,4,5,6], which asserts that the occupancy of regulatory binding sites by TFs determines the expression level of the genes associated with these binding sites. The probability of this occupancy depends on the copy number of active TF molecules available to bind and on the binding energy between the binding site and TF. This binding energy is determined by the number of mismatches between the particular binding site sequence and the consensus sequence of that TF. We assume full symmetry in the biophysical properties of the binding sites associated with different genes: all have the same sequence length and equal binding energy to their cognate TFs, and all genes have the same dynamic range of expression.
x bound refers to crosstalk when the binding site should be bound (by either activator or repressor) but is either unbound or bound by a non-cognate molecule. x unbound refers to crosstalk when the binding site should remain unbound and no cognate binder is available, but is still bound by some non-cognate molecule. E a is the energy difference between cognate bound and unbound states. The expression ř j‰i C j e´ d ij then captures the sum of all interactions with foreign regulators that binding site i might receive.
To account for the sum of all non-cognate interactions received by a particular binding site in a model of multiple TF species, we define an average measure of similarity between binding site i and all other binding sites j ‰ i [1]: S i is defined as the average of the Boltzmann factors taken over the distribution of mismatch values P pdq between binding sites i and j, @j. In the last equality in Eq. (S2), we assume that all available TFs are found in equal concentrations C j " C{T, @j, where C is the total TF concentration and T is the total number of available TF species. We assume full symmetry between binding sites i, such that each binding site i has the same distribution of mismatches d ij with respect to all the other genes, hence S i " S @i. The value of S can either be estimated using binding site data (see example in Fig 4) or analytically calculated under different assumptions on the pairwise mismatch distribution P pdq. Following our symmetry assumptions, the crosstalk probabilities in Eq. (S1) are independent of the gene identity i, such that we only need to distinguish between the four different regulatory states.
In the following, we use rescaled similarity defined as s " S¨M , which represents a sum of all non-cognate interactions at a binding site.

For which TF usage t˚is crosstalk maximized?
For a fixed s X˚pt, sq has a maximum at a certain t value, which we denote t˚(marked with a black circle on Fig 2A). We find that t˚" t˚psq and its value monotonically increases with s. For low similarity values (s Ñ 0), it asymptotically approaches the value of 2{3, with the limit lim sÑ0 t˚" 2{3.
For s ą 0, t˚ą 2{3 and approaches t Ñ 1 for high s. See Fig A for illustration. Find file argumentWhichMaximizesCrosstalk.nb for more formal formulation.

Optimal TF concentrations in the two strategies
The optimal concentration which minimizes crosstalk is c˚" 0 in regime I, c˚" 8 in regime II, and c˚" CM " in regime III. The concentration in each strategy is obtained by choosing the corresponding value of t, i.e., t busy " p1´pq`2 min pp, qq´q and t idle " p1´pq´2 min p1´p, qq`q.
where C 1 and C 2 are the per species concentration of activators/repressors, A and pT´Aq the number of activator/repressor regulated genes, and C the total concentration of TFs as used in the main calculation of our model. We numerically tested which combination of concentrations C 1 and C 2 leads to lowest X˚value. Surprisingly, we found that minimal crosstalk is achieved by equal concentrations for all transcription factors, activators and repressors, i.e., C 1 " C 2 . In other words, adding an additional degree of freedom of second concentration can only increase the crosstalk values.
The intuitive explanation is that it only matters if a gene is regulated or not, but not which type of regulation it employs. This is in similarity to our previous result, where minimal crosstalk only depends on the number of available TFs, such that regulation by activators crosstalk is a mirror image of regulation by repressors alone [1].

Non-uniform similarity values
In the basic model, we assumed uniform similarity values s i " s for all genes, which allowed us to obtain analytical solutions for crosstalk. As data show (see Fig 4), similarity values vary between genes even within the same organism. Here, we relax this simplifying assumption to test its significance. We analyze a special case with two subsets of genes, each with a different similarity value. The two subsets are of relative size r 1 and r 2 (r 1`r2 " 1), and similarity values s 1 and s 2 , correspondingly. In each subset, there is a weighted proportion of regulated genes, t i " r i t for i P t1, 2u. We use fixed values for s 1,2 and then calculate s " r 1 s 1`r2 s 2 for each pr 1 , r 2 q combination. As before, the total crosstalk X˚is computed by summing crosstalk contributions of all individual genes and then numerically minimizing X˚with respect to the TF concentration. We still allow only equal concentrations for all available TFs. In Fig B, we plot X˚vs. the proportion r 1 for different fractions of available TFs, t. We compare X˚values obtained for uniform and non-uniform s. We find that non-uniform s provides lower crosstalk than uniform s. This is obtained, however, at the cost of higher TF concentration C˚needed for the non-uniform similarity.
2 Achievement of minimal crosstalk 2.1 Minimal crosstalk is always obtained by one of the two extreme reg-

ulation strategies
The 'busy' and 'idle' modes explained in the main text are the two extreme regulation strategies. Intermediate strategies, where some genes follow the first strategy and others follow the second, are possible. However, minimal crosstalk is always obtained by one of the two extremes.
We denote the proportion of TF species following 'idle' and 'busy' modes by t 1 and t 2 , respectively (see Fig C). A combination of the two modes would lead to a linear combination of the fraction of TF species, t mixed " αt 1`p 1´αqt 2 , with α P r0, 1s. Since X˚ptq is a concave function of t with a single maximum, minimal X˚will always be obtained at the edges of the t domain, α " 1 or α " 0. Thus, any mixed strategy would always bring about higher crosstalk than the extreme ones X˚pt mixed q ě min pX˚pt 1 q, X˚pt 2 qq (see Fig C).

Relation between t and the choice of regulatory strategy yielding lowest crosstalk
The non-monotonic dependence of crosstalk on TF usage, t, can explain the non-trivial transition between the 'busy' and 'idle' modes in the pp, qq phase space, as shown on Fig 3B. There we illustrate for each pp, qq which of the two modes yields lower crosstalk. Recall that X˚ptq has a maximum at t " t˚psq, such that it is an increasing function of t for t ă tå nd a decreasing function of t for t ą t˚. The relationship of t idle and t busy in regards to t˚determines which strategy is more advantageous. As t idle ă t busy @t (see Eq. S12 below), it follows that if X˚is increasing with t, idle mode is more advantageous (lower t Ñ lower X˚ptq). Conversely, if X˚is decreasing function of t, busy mode leads to lower crosstalk X˚(higher t Ñ lower X˚ptq). To address which mode is more advantageous, we examine t idle pp, qq and t busy pp, qq at each pp, qq value (see Eq. 5 in main text). We distinguish three cases: 1. For t idle , t busy ă t˚ñ idle mode is the most advantageous strategy.
2. For t˚ă t idle , t busy ñ busy mode is the most advantageous strategy.
3. For t idle ă t˚ă t busy ñ which strategy is optimal depends on exact values of pp, qq.
We summarize these results in Fig  The three case for all combinations between t idle pp, qq, t busy pp, qq, and t˚psq in the pp, qq space. Each case is shown in different color. For t idle , t busy ă t˚(white area), idle mode leads to lower crosstalk; for t˚ă t idle , t busy , busy mode is more advantageous; the third region, for which t idle ă t˚ă t busy , is partitioned between the two strategies. The boundary between the optimal two modes is shown in red dashed line. For this plot we used s " 0.01.

Maximization and minimization of TF usage: the four combinations
The number of genes that can be associated with activators and repressors is restricted by the number of regulators of each type. When we require that a proportion q of the genes is active and a proportion p of the regulators are activators, we distinguish four cases depending on the relative magnitudes of these variables: • q ă p and q ă 1´p, • q ą p and q ă 1´p, • q ă p and q ą 1´p, • q ą p and q ą 1´p.
In Fig E, we illustrate how TF usage is maximized and minimized in each of these cases (one of them appeared as Fig 2D in the main text).

Crosstalk expression for 'busy' strategy
In the busy mode, the proportion of active genes that are regulated by activators is a " min pp, qq. The proportion of TFs involved in the busy strategy equals: t busy " p1´pq`2 min pp, qq´q. (S6) To obtain the lower limit on crosstalk, we use t busy in the equation for X˚ptq and obtain: Xb usy " p1´p`2 min pp, qq´qq¨´´spp`q´2 min pp, qqq`2 a s pp`q´2 min pp, qqq(

Crosstalk expression of 'idle' strategy
Similarly, in the 'idle' mode, the proportion of active genes that are regulated by activators is a " q´min p1´p, qq. The proportion of involved TFs is then: and the lower bound on crosstalk in the idle mode equals: (S9)

Crosstalk expression of both strategies in regime I
The lower limit on crosstalk in regime I is described by X˚ptq " t and: Xb usy " p1´pq`2 min pp, qq´q, (S10a) Xi dle " p1´pq´2 min pp, qq`q. (S10b)

Crosstalk expression of both strategies in regime II
The lower limit on crosstalk in regime II is described by X˚ptq " 1´t{p1`stq and: Xb usy " 1´p 1´pq`2 min pp, qq´q 1`s rp1´pq`2 min pp, qq´qs , (S11a) Xi dle " 1´p 1´pq´2 min pp, qq`q 1`s rp1´pq´2 min pp, qq`qs . (S11b) The proportion of TFs in busy mode is always higher than in idle mode. This can be easily shown by: We distinguish between four cases: • q ă p and q ă 1´p ñ ∆t "´2q`2q`2q " 2q ě 0, • p ă q and 1´p ă q ñ ∆t "´2q`2p1´pq`2p " 2p1´qq ě 0, In all cases, the difference ∆t ą 0, which shows that for any value of parameters, the proportion of TFs is always larger (or equal in the extreme case of p P t0, 1u) in 'busy' mode compared to 'idle'.

Idle mode minimizes, busy mode maximizes the number of transcrip-
tion factor species t.
The busy and the idle mode maximize and minimize the fraction of TF species, respectively. This can be shown by taking a system, where a proportion of p genes is activator-regulated, while the rest p1´pq is repressor-regulated. Within the activator-regulated genes, k 1 are active and k 2 " p´k1 are inactive. Moreover, within p1´pq repressor-regulated genes, k 3 are active genes and k 4 " p1´pq´k 3 inactive genes. Within these, several constraints exist: where q is the proportion of active genes. The total proportion of TF species is t " k 1`k4 " 2k 1`p 1´pq´q. Of course, due to the definition of the system, it holds: • q´k 1 ď 1´p.
In attempt to see how the total proportion of TF species changes if we change different parameters of the system (k i ) with fixed q and p, we compute the derivative of t: Therefore, the change of TFs with increasing k 1 is linear and 4 different scenarios exist. For each, the constraints described above must be met. Therefore: 1. if q ă 1´p and q ą p: • t is minimized by k 1 Ñ 0 ñ k 2 " p, k 3 " q, k 4 " p1´pq´q, which is the idle mode, • t is maximized by k 1 Ñ p ñ k 2 " 0, k 3 " q´p, k 4 " 1´q, which is the busy mode.
In Fig F, we plot the difference in optimal crosstalk ∆X˚" Xi dle´Xbusy , where black areas denote the anomalous regime. With increasing similarity, the anomalous regime grows and covers an increasingly larger portion of the phase space. For sufficiently high similarity values, the idle strategy will always lead to the most optimal crosstalk for any value of pp, qq.  The differences between regulatory strategies depend on s. We plot the minimal (blue) and maximal (red) value of ∆X˚" Xi dle´Xbusy over the entire region of p P t0, 1u and q P t0, 1u as a function of similarity, s. The point where the maximal value (red) becomes negative (at s « 5), is where the busy mode completely vanishes and idle mode leads to lowest crosstalk (for any pp, qq).
7 Probabilistic gene activity model

Probabilistic model description
So far, we considered a deterministic model in which the numbers of active genes and available TF species were fixed, resulting in a single crosstalk value per pp, qq configuration.
In reality, these numbers can temporally fluctuate, for example, because of the burst-like nature of gene expression [7,8]. In the deterministic model, we also assumed uniform gene usage, such that all genes are equally likely to be active. In reality, some genes are active more frequently than others.
To account for this, we study the following crosstalk in a probabilistic gene activity model. We assume independence between activities of different genes, where each gene i has demand (probability to be active) D i . We then numerically calculate crosstalk for a set of genes. This approach enables us to incorporate a varying number of active genes and a non-uniform gene demand and compare our results to the deterministic model studied above.
Assume that to comply with its demand D i , each gene i is regulated with probability γ i , i " 1...M , where γ i " D i if regulation is positive and γ i " 1´D i if it is negative. We assume that the TF species needed for these genes are available in the cell. The distribution f t ptq of the fraction of TF species follows Poisson-Binomial distribution [9] with mean and variance of: Assuming that the number of genes is large, M " 1, the central limit theorem applies here: we approximate the probablity distribution of t, f t ptq, by a Gaussian distribution with the mean value and variance as given with Poisson-Binomial distribution (mean value xγ i y i and standard deviation σ t " xγ i p1´γ i qy i , with x¨y i representing the average over all genes) [9].
Since minimal crosstalk X˚is a function of t (Eq. 4, main text), X˚becomes a random variable and its distribution reads: where g´1 l pX˚q " t l represents the l´th branch of the inverse function (for some X˚values there exist two solutions t l that satisfy the inverse equation) [9]. The solutions for g´1 l pX˚q and its derivative exist and can be analytically computed, which enables us to solve for the distribution of crosstalk f X˚p X˚q: where the computation of both xty and X˚pxtyq is straightforward given γ i . We discuss below the conditions under which this approximation holds. The distribution of X˚is typically narrow, such that for practical purposes the distribution mean provides a very good estimator of crosstalk values. The Gaussian approximation of distribution of t follows from the central limit theorem. A numerical example is shown in Fig H (a) and (c).
The linearity assumption is fulfilled for t values that are much lower than t˚(t˚being the value at which X˚reaches maximum). For visual example of linearity of X˚ptq for t ă t˚, see

The probabilistic gene activity model leads to a distribution of the number of active genes -Example
In the probabilistic model, the fraction of active genes q becomes a random variable, rather than being fixed, as we assumed before. We demonstrate this in an example below. The crosstalk behavior in the probabilistic case can be obtained as a superposition of the relevant deterministic cases taken with their corresponding weights.
(S19) Therefore, in this example, on average, one gene is active.  The TF with the shorter consensus sequence will bind easily to the binding site of the second TF. Therefore, the similarity between the transcription factor with the shorter consensus sequence and longer binding site will be high. However, the transcription factor with the longer consensus sequence and a shorter binding site would have a lower similarity, as it is less likely that the long transcription factor binds to the short binding site.
Indeed, the matrix of pairwise similarity values is asymmetric. Clearly, we observe many vertical lines of similar value (TFs that easily bind many binding sites -yellow strips, or that are very unique to only few binding sites -blue strips), but much weaker signatures of rows (binding sites that are very similar or very dissimilar to all others). This demonstrates that the similarity value between a transcription factor and a binding site is dominated by the transcription factor properties, and much less by the binding site's. High similarity between a transcription factor and other binding sites is highly correlated with short consensus sequences of that transcription factor. The similarity measure of a gene i, S i , is determined as the contribution of all noncognate transcription factors:

Alternative calculations of similarity values and crosstalk from data
Similarity values and crosstalk of S. cerevisiae and other organisms were estimated [1] based on PCM data of the TFs in our previous work, but using a different computational approach.
The main difference between these two calculations is that in the first approach, the similarity was calculated between a consensus sequence of a particular TF and an ensemble of binding sequences of the same length randomly drawn from a uniform distribution. It was shown there analytically that the average similarity between random sequences of length L and uniform energy mismatch per position is simply S " p1{4`3{4 exp p´ qq L , hence only this effective needs to be calculated. In contrast, in the current work, we calculate similarity between actual pairs of TF and non-cognate binding sites.
Another major difference is the calculation of mismatch energy penalties. By using the PCMs, we obtain an energy matrix which gives energy penalties for every position and every nucleotide separately. In the previous work, only an effective uniform for all positions was calculated using either of two approaches: (i) information method [10] or, (ii) pseudo-count method (also used here) [11,12]. In the information method, the total information of the motif was calculated and then an effective eff which evenly distributed this information between all L positions, was calculated. The advantage of this method is the avoidance of pseudo-count usage, which could bias the results. Its major drawback is the lack of positionspecific energy information which we need to calculate similarity between actual pairs of This difference in estimating energy penalties leads to a different approach for computing similarity measures. In our approach, we use energy matrices to compute the energy of binding for every pair of TF-BS, i.e., E ij .
The two distinct approaches lead to different, but similar, distributions of similarity measures for a given gene j, s j´F ig L. The main difference are long tails of the current approach. The median value of the similarity with the previous approach in S. cerevisiae was medianps consensus-random q " 0.8¨10´4, while in the current approach, we obtain medianps TF-BS q " 1.4¨10´4.
Differences between s values obtained in these two approaches can emanate from various sources: • sequences of actual binding sites are not well captured by a uniform distribution be-cause of biases in favor of AT-content. For example in our S. cerevisiae dataset, we have 31% of nucleotide A, 21% of C, 22% of G, and 26% of T.
• actual binding sites can vary in length; taking the relative position with best match is clearly non-random. See in Fig L a comparison to s values calculated when the relative position is randomly selected.
• equally partitioning the total energy of the motif between all its positions consistently under-estimates the similarity.
• if actual TF-BS are considered, insufficient data can lead to biases in similarity estimates.

Using real transcription factor copy numbers
So far, we assumed that all transcription factors are present in equal concentrations (Eq. 1, main text), to simplify the calculations and enable the analytical derivation of the lower bound on crosstalk. We then minimized crosstalk with respect to the TF concentrations, such that these concentrations and the energy gap E a between cognate bound and unbound states were all left out of the minimal crosstalk expression (Eq. 4, main text). In general crosstalk does depend on the TF concentrations and can be numerically calculated for general concentration values. Such calculations require an extension of the crosstalk minimization procedure and additional parameter values which were not necessary for the lower bound. In this section we demonstrate this calculation using experimentally measured proteome data of S. cerevisiae. We follow the model described in [13] for a single gene and generalize it for our case.
We obtain the copy number values for all 23 of TFs measured in [14]. For proteins which were below the detection level of 50 molecules/cell, we take half the detection level of 25 molecules/cell. Generally speaking, a transcription factor molecule can be in one of three reservoirs: bound to its cognate site, bound to any non-cognate site, free in the cytoplasm or bound non-specifically to the DNA. Non-specific binding is independent of the DNA sequence and has no effect on crosstalk, but only effectively reduces the TF availability. As until now, we focused on minimal crosstalk assuming the TF availability is optimized we did not include non-specific binding in our expressions (Eq. 2, main text). Now, in order to properly account for the available TF copy numbers we add it to the expressions. We use the grand-canonical ensemble formulation to estimate the fugacities, i.e., the available number of TF molecules.
We write down the mass balance equation for each of the 23 TF molecules (compare to Eq. 40 in [13]): where A j is the total number of molecules of the j-th TF, and N ns is the total number of available non-specific binding sites, N j is the total number of cognate binding sites for j-th TF and N k is the total number of non-cognate binding sites (which are cognate for the other TFs). Θ ns j is the occupancy of non-specific binding site by the j-th TF: where λ j represents the fugacity of j-th TF, E ns j the binding energy of j-th TF to nonspecific site, E a the energy of the unoccupied state, and the sum in the denominator goes over all TFs. The product N ns Θ ns j gives the total number of molecules of j-th TF, bound to all non-specific sites.
The second term in Eq. S21 represents the total number of j-th TF molecules that are bound to non-cognate binding sites. These are cognate sites for all k ‰ j-th TFs. Each term in the sum represents the number of j-th TF molecules bound to cognate binding sites of k-th TF. Similarly, Θ nc jk is the occupancy of one cognate binding site of k-th TF by the j-th TF: where E jk represents binding energy of j-th TF to cognate binding site of k-th TF. E jj represents energy of cognate binding of j-th TF.
The last term in Eq. S21 represents the number of j-th TF molecules bound to one of its N j cognate sites. Again, the occupancy of a cognate site by the j-th TF equals: This gives us a set of 23 coupled non-linear equations with 23 variables λ j (the fugacity values of all TFs). To set the energy scale we define for every binding site that its energy level when bound by its cognate factor is zero. The parameters we need to specify in to numerically solve these equations are: • A j the total number of molecules of the j-th TF species per cell. We obtain these values from [14].
• E ns j the binding energy of the j-th TF to a non-specific site. We assume that nonspecific sites are random DNA sequences [5]. Hence, we calculate these energies by averaging energy contributions at each position using the TF energy matrix and sum the individual contributions.
• N ns , the total number of non-specific sites. The yeast genome length is roughly 10 7 nucleotides, but we assume that only 10% of it is accessible [10]. Hence we assume that 10 6 non-specific binding sites are available.
• N j , the number of cognate binding sites of j-th TF. Here we simply take the number of genes regulated by the j-th TF.
• E jk , the binding energy of the j-th TF to a cognate binding site of the k-th TF. These values are used in calculation of similarity matrix S ij´s ee Fig. J.
• E a the energy gap of a binding site between its state when occupied by its cognate factor (which we set as zero) and its unoccupied state. We assume the same value for all TFs. Unfortunately, measurement of this parameter are rare. We found estimates of this energy gap only for a few bacterial [5] and yeast [2] TFs. In the following we show crosstalk calculations for several different values of this parameter.
The binding energy is usually sequence dependent. However, when the binding becomes very unfavorable, other contributions come into play, thus effectively setting a bound on the binding energy. Therefore, we have used this bound on all binding energies E ns j and E jk .
Using all the details described above, we can numerically obtain all fugacity values λ j , for all j and then calculate the total crosstalk.
However, the solutions of crosstalk using the correct fugacities seem to be relatively sensitive to the threshold values we set. Here we show solutions for a range of realistic values (Fig. M left). All results using real concentrations exhibit higher crosstalk estimates compared to our lower bound where optimal concentrations were used (Fig. M).  [14] is higher than the minimal crosstalk calculated for optimal TF concentrations. We illustrate crosstalk values as a function of the proportion of available TF species t, using measured TF concentrations [14], for various energy thresholds (left) and energies of unoccupied binding sites E a (right). For comparison, optimal crosstalk curve is added (black). We defined E " 0 as the state with of the binding site when a cognate TF is bound. In the left figure we used E a " 14, and in the right figure we used E threshold " 18. The gray dots at t " 1 represent points where all measured TFs copies were included in the crosstalk calculation. Otherwise, for each t ă 1 we drew 300 times by random a subset of TFs to be present, such that only t proportion of the genes are regulated. Fugacities of TFs that were not chosen were set to zero.
9 Comparison of our global crosstalk model to single-gene models Previous models of regulatory crosstalk [15,16] studied crosstalk at the single gene level and focused on the choice between activation and repression as the leading mode of regulation.
The main innovation of our model is the consideration of multiple genes and multiple regu- where the limit c Ñ 8 was taken in Eq. 2 in the main text. The total crosstalk, namely the average fraction of genes in any crosstalk state, is then which, for s ! 1, leads to X « p1´tq, which is the proportion of non-regulated genes. The large increase of crosstalk comes from optimization of crosstalk values of individual genes.
We minimize it for the case when gene is regulated (x bound ), obtaining large concentrations of all TFs that are required to regulate. This will enforce that each of these individual genes will suffer barely any crosstalk when it is being regulated. However, by doing this we overlook the inevitable large TF concentrations needed that are likely to cause crosstalk to other genes that are not being regulated. Therefore, the main contribution to crosstalk comes from genes that should be unregulated (i.e., binding sites that should be unoccupied) but are instead bound by the ample non-cognate TFs.
See Fig. N

A TF which is both an activator and a repressor
In this section we explore the case that every TF has a dual role: it serves as an activator for one gene and as a repressor for another gene. This means that the genes are now organized in pairs, such that each pair of genes shares a common TF where one of the genes is positively regulated and the other is negatively regulated by that TF. The total number of TFs is then half of their number in the basic model, in which every gene was regulated by a unique TF.
That however, also constrains the ability to individually determine the regulatory state of each gene, in contrast to the basic model case. We begin by examining the two extreme scenarios: 1. If the two genes in every pair are always in opposite regulatory states, namely when one needs to be active the other one needs to be inactive, both require that the TF is present (absent) at the same time. We call this the non-conflict scenario. Then the number of regulated genes is twice the number of TFs in use in the basic model. The expressions for single-gene crosstalk and total crosstalk probabilities (Eqs. 2-3, main text) are then slightly modified: x bound " e´E a`c s{2 c{t`e´E a`c s{2 , (S28) where the only difference with respect to Eq. 2 in main text is the factor two in cs{2.
The factor two is present as the rescaled similarity s is rescaled by half of the total TF species compared to the basic model. We have already shown [1] (SI, p. 12) that the case that each TF regulates Θ genes is equivalent to an effective similarity scaling s Ñ s{Θ. As to first order in s, X˚" ? s (main text, Eq. 4), this yields an effective reduction of crosstalk by a factor of ? 2.
2. Alternatively, if the two genes in each pair always need to be in the same regulatory state, one of them should be regulated by its cognate TF and the other should be left with unoccupied binding site. This means that only one of the genes requires the TF to be present, but the other one favors its absence to reduce crosstalk. Here we assume that if at least one gene requires the TF, then the TF is present. We call this the conflict scenario.
The single-gene and total crosstalk probabilities now read: x bound " e´E a`c s{2 c{t`e´E a`c s{2 , (S31) x unbound " cs{2`c{t c{t`e´E a`c s{2 , (S32) The change with respect to Eq. 2 in the main text is the additional term c{t in the x unbound expression as the cognate TF is present, even though it is not required. We define binding of cognate TF when not required as crosstalk. Furthermore, the weights in crosstalk X represent the proportion of regulated (unregulated) genes, i.e., t{2 and Minimal crosstalk when every TF regulates two genes: one as an activator and the other as a repressor couples between the regulatory states of the genes sharing a common TF. When both genes in all pairs require their TF to be either present or absent (non-conflict), crosstalk is lowered (red curve) compared to the basic model of the main text where every TF regulates only one gene (blue). In contrast, if all paired genes have opposite demands for the TF presence, crosstalk is considerably larger (green dot).
We also illustrate combinations of these two extremes (magenta for p=0.25 and black for p=0.75), both still showing much higher crosstalk compared to the basic model. Similarity s " 10´2.

Combinatorial regulation
So far we studied regulatory architectures in which each gene is regulated by a single TF (although each TF could regulate multiple genes, as in the previous section). There are however known cases in which a particular gene is regulated by a combination of distinct TF species. We study a model for such an architecture in this section. We assume that every gene is regulated by two distinct and unique TF species having separate and non-overlapping binding sites. Binding to one site does not affect binding to the other in any way: neither inhibits it nor makes it more favorable by any form of cooperativity. We assume an ANDgate logic, such that only if both binding sites are bound by the cognate factors the gene is regulated, but not if only one of them (or neither) is bound and the other is unoccupied or bound by a non-cognate.
As opposed to the basic model where energy levels referred to a single binding site, we now refer to the energy levels of a pair of binding sites regulating a common gene. They can now be different binding states with the following statistical weights: • w 1 " e´2 Ea if both binding sites are unoccupied, • w 2 " e´E a¨c¨s if one binding site is unoccupied while the second one is occupied by non-cognate TF.
• w 3 " pc¨sq 2 if both binding sites are occupied by non-cognate TFs.
• w 4 " e´E a¨c t if one binding site is unoccupied while the second one is occupied by the cognate TF.
• w 5 "`c t˘2 if both binding sites are occupied by the cognate TFs.
• w 6 "`c t˘¨p c¨sq if both binding sites are occupied, one by the cognate and the other by a non-cognate TF.
The single gene crosstalk probabilities now read: x bound " w 1`2 w 2`2 w 4`w3`2 w 6 w 1`2 w 2`2 w 4`w3`w5`2 w 6 " e´2 Ea`2 e´E a pcs`c{tq`pcsq 2`2 c 2 s{t e´2 Ea`2 e´E a pcs`c{tq`pcsq 2`p c{tq 2`2 c 2 s{t , x unbound " w 3 w 1`2 w 2`w3 " pcsq 2 e´2 Ea`2 e´E a cs`pcsq 2 , (S40) X " t x bound`p 1´tq x unbound (S41) w 2 , w 4 , and w 6 have the pre-factor 2 because they can apply to either of the two binding sites.
In Fig. P we illustrate the minimal crosstalk in this case, compared to the basic model.
We find that the AND-gate configuration leads to lower minimal crosstalk for low t, but higher crosstalk for high t, compared to the basic model. The difference between these models comes from the double number of TFs used in the AND-gate, which lead to higher crosstalk for high t, but also from the stricter definition of what is considered crosstalk for genes that should be left unregulated. To demonstrate this, we can define a more lenient definition of x unbound : x lenient unbound " 2w 2`w3 w 1`2 w 2`w3 " pcsq 2`2 e´E a cs e´2 Ea`2 e´E a cs`pcsq 2 , (S42) where a state that should be unbound and is only partially occupied by one non-cognate TF, is already considered crosstalk. This leads to an elevation of X˚for all t, compared to the basic model.