A Null Model for Pearson Coexpression Networks

Gene coexpression networks inferred by correlation from high-throughput profiling such as microarray data represent simple but effective structures for discovering and interpreting linear gene relationships. In recent years, several approaches have been proposed to tackle the problem of deciding when the resulting correlation values are statistically significant. This is most crucial when the number of samples is small, yielding a non-negligible chance that even high correlation values are due to random effects. Here we introduce a novel hard thresholding solution based on the assumption that a coexpression network inferred by randomly generated data is expected to be empty. The threshold is theoretically derived by means of an analytic approach and, as a deterministic independent null model, it depends only on the dimensions of the starting data matrix, with assumptions on the skewness of the data distribution compatible with the structure of gene expression levels data. We show, on synthetic and array datasets, that the proposed threshold is effective in eliminating all false positive links, with an offsetting cost in terms of false negative detected edges.


Text A The Pearson Correlation Coefficient 1
Text B A 3-dimensional example of Proposition S1 5 Text C Main moments of the Pearson correlation 6 Text D The restricted secure thresholdp k 6 Text E Functional relations undetected by correlation networks 7

Text A The Pearson Correlation Coefficient
Let x, y ∈ R n with n ≥ 3. Then the Pearson Correlation Coefficient (PCC) ρ between x and y is defined as: where x, y denote the arithmetic means 1 n n i=1 x i and 1 n n i=1 y i respectively.
Define two new random variablesx andỹ as follows: where σ x and σ y are the standard deviations of x and y. From the definition, the following identities immediately descend: Sincex (andỹ) lies on the unit sphere because x = 1, Equation (S2) yields that = 0 , and the same holds forỹ, too. Thus we can rephrase Equation (S2) as follows: Proposition S1. Let x, y,x,ỹ be as in Equation (S1). Thenx,ỹ ∈ S n−1 ∩ H S n−2 , where H is the vectorial hyperplane defined as n i=1 w i = 0 and w i are the coordinates of R n .
An example for n = 3 of the situation described in Proposition S1 is shown in Section Text B.
To prove now Equation (2), we first combine Equation (S1) and Equation (S2) to obtain that ρ(x, y) = ρ(x,ỹ) =xỹ = cos β , where β is the angle between the two vectorsx andỹ. Equation (S3) and Proposition S1 yield that P (|ρ(x, y)| > p) is the proportion between the area of the spherical cap in n − 2 dimensions included within an angle β from x and the whole surface of the (n − 2)-dimensional sphere [1]. A compact formula for the area A cap n−1 (r) of a (n − 2)-dim spherical cap is given in [2] as: and, since the area of the whole surface is the thesis follows from the position r = 1. In Proposition S1 the transformed vectors are assumed to be uniformly distributed on the spherical surface. This assumption holds in the case of a normal distribution, but it does not hold in general. However, in the following paragraph we show that it is a good approximation, since x and y are independent. In fact, Equation (2) can be generalised to other distributions [3][4][5][6], when data skewness can be bounded [1].
Let G δ (n, p) be an empirical distribution generated by k couples of two vectors x, y ∈ R n sampled according to a given distribution function δ. Let then be the t-error function evaluating the difference between the theoretical distribution F (n, p) and the empirical distribution G δ (n, p). Hereafter we report the results of the simulations for k = 50000 and n = 8, 20, 100, where δ is one of the following three distribution functions: • U (min, max), the uniform distribution in [min, max]; • N (µ, σ), the normal distribution with mean µ and standard deviation σ; • L(µ log , σ log ), the lognormal distribution with mean-log µ log and standard deviation-log σ log .
In particular, in Table S1 we list the values of E 2 (F, G δ ) and in Figure S1 we display the curves of the Cumulative Distribution Functions (CDF) of G δ (n, p) corresponding to the three functions δ, separately for the three different values of n = 8, 20, 100. The distribution parameters we used are min = 0, max = 0, µ = 0, σ = 1, µ log = 2, σ log = 3. Regardless of the value of n, the empirical distribution fits the exact formula Equation (2) when x and y are uniformly sampled, while it does not fit the same equation when the two vectors come from extremely skewed distributions such as the lognormal. Table S1. Error function E 2 (F, G δ ), for n = 8, 20, 100 and different

4/10
Text B A 3-dimensional example of Proposition S1 Consider a dataset Y consisting of n = 3 samples described by m = 100 genes. Then Y can be represented by 100 points in [0, 1] 3 ⊂ R 3 as shown in Fig. S2(a). The new variables are built through a two-stages procedure applied to each gene. First the mean is subtracted, so the transformed dataset lies on the hyperplane H described in Proposition S1 as displayed in Fig. S2(b,c). Finally. each gene is normalized to unitary variance, and the resulting dataset lies on S n−1 ∩ H, which is the circumference in Fig. S2(d).

Text C Main moments of the Pearson correlation
Finally, we conclude deriving the mean and the variance of the function |ρ| starting from Equation (2). The density function f (n, p) can be computed as Using the above expression for f (n, p), the two moments follow straightforwardly: Text D The restricted secure thresholdp k We list here the analogue of Table 1 of Main Text for the restricted secure threshold for k = 2 and k = 5.

Text E Functional relations undetected by correlation networks
Correlation networks, like other univariate methods, are unable to capture relations between genes when the independence hypothesis does not hold. In these situations, quite common throughout -omics studies when the number of genes are much larger of the number of samples, the correlation values (e.g., PCC) between expressions result negligible or small even when the corresponding genes are (functionally) related.
Hereafter we show three cases demonstrating the aforementioned behaviour.
1. Toy model Consider a simple system with four genes g 1 , g 2 , g 3 , g t , where the expression of gene g t depends on the expression of {g i } i=1,2,3 according to the linear rule g t = g 1 + 1 2 g 2 + g 3 . Moreover, suppose that the expression of g 1 , g 2 and g 3 is respectively uniformly, normally and gamma distributed, i.e. g 1 ∈ U (0, 1), g 2 ∈ N (1, 0) and g 3 ∈ Γ(a = 0.1, s = 1), where the density of the gamma distribution is given by f (x) = 1/(s a Γ(a))x a−1 e − x s . Finally, randomly extract the expression of g 1 , g 2 , g 3 on 100 samples, compute PCC(g i ,g t ) and repeat the experiment 10000 times. Although, by definition, g t is strongly functionally related to g 1 , g 2 and g 3 , the corresponding average PCC are quite low, namely µ σ PCC(g 1 ,g t ) 0.70 0.12 PCC(g 2 ,g t ) 0.59 0.15 PCC(g 3 ,g t ) 0.28 0.27 and thus the links between g t -g 2 and g t -g 3 are likely to be not detected in coexpression networks. 2. ODE model A similar situation occurs when the dynamics along time of the gene expressions is (more realistically) driven by a system of ordinary differential equations. Consider for instance the toy model on three genes g 1 , g 2 and g 3 described in [7]: g 1 is repressed by g 3 , g 2 is activated by g 1 and g 3 is activated by both g 1 and g 2 .ġ 1 = k 1,s 1 1 + k 1,3 g 3 − k 1,d g 1 g 2 = k 2,s k 2,1 g 1 1 + k 2,1 g 1 − k 2,d g 2 where the reaction rate constants are set as follows: k 1,s = 2, k 2,s = 2, k 3,s = 15, k 1,d = k 2,d = k 3,d = 1, k 2,1 = k 3,1 = 1, k 3,2 = 0.01, k 1,3 = 100. Setting to zero all the initial conditions and solving the above ODE system for the time interval t ∈ [0, 5] and time step δt = 0.01 we obtain three time series on 501 points corresponding to the dynamics of the gene expressions for g 1 , g 2 and g 3 , whose curves are plotted in Fig. S3. Again, despite the strong functional relation among g 1 , g 2 and g 3 , some of the corresponding PCC values are quite small, namely PCC(g 1 ,g 2 )=0.085 and PCC(g 1 ,g 3 )=-0.288, while PCC(g 2 ,g 3 )=0.927 for the only link which is likely to be inferred by the coexpression approach.

A promoteromic example
The third example comes from the mammalian promoterome atlas of the FANTOM5 project [8].