Stability Indicators in Network Reconstruction

The number of algorithms available to infer a biological network from a dataset of high-throughput measurements is overwhelming and keeps growing. However, evaluating their performance is infeasible unless a ‘gold standard’ is available to measure how close the reconstructed network is to the ground truth. One indirect measure of reliability is the stability of the predictions under data resampling. We introduce NetSI, a family of Network Stability Indicators, to quantitatively assess the stability of a reconstructed network in terms of the inference variability induced by data subsampling. To evaluate network stability, the main NetSI methods combine a global/local network metric with a resampling (bootstrap or cross-validation) procedure. In addition, we provide two normalized variability scores over data resampling to measure edge weight stability and node degree stability, from which we derive a stability ranking for edges and nodes. A complete implementation of the NetSI indicators, including the Hamming-Ipsen-Mikhailov (HIM) network distance adopted in this paper, is available in the R package nettools. We demonstrate the use of the NetSI family by measuring network stability on four datasets against alternative network reconstruction methods. First, the effect of sample size on the stability of inferred networks is studied in a gold-standard framework on yeast-like data from the GeneNetWeaver simulator. We also consider the impact of varying modularity on a set of structurally different networks (50 nodes, from 2 to 10 modules), and then of a complex feature covariance structure, showing the different behaviours of standard reconstruction methods based on Pearson correlation, the Maximal Information Coefficient (MIC) and a False Discovery Rate (FDR) strategy.
Finally, we demonstrate a strong combined effect of different reconstruction methods and phenotype subgroups on a hepatocellular carcinoma miRNA microarray dataset (240 subjects), and we validate the analysis on a second dataset (166 subjects) with good reproducibility.


INTRODUCTION
The problem of inferring a biological network structure from a set of high-throughput measurements (e.g. gene expression arrays) has been addressed by a huge number of deeply different solutions published in the literature over the last fifteen years. Nonetheless, network reconstruction suffers from being an underdetermined problem, since the number of interactions is far larger than the number of independent measurements (De Smet and Marchal, 2010): thus any algorithm has to strike a compromise between accuracy and feasibility, allowing simplifications that inevitably undermine the precision of the final outcome, for instance by including a relevant number of false positive links (Kamburov et al., 2012). This makes the inference problem "a daunting task" (Baralla et al., 2009), not only in terms of devising an effective algorithm, but also in terms of quantitatively interpreting the obtained results. In general, the reconstruction accuracy is far from optimal in many situations, with several pitfalls (Meyer et al., 2011) related both to the methods and to the data (He et al., 2009), up to the extreme situation of many link predictions being statistically equivalent to random guesses (Prill et al., 2010). In particular, the size (and the quality) of the available data plays a critical role in the inference process, as widely acknowledged (Logsdon and Mezey, 2010; Gillis and Pavlidis, 2011; Miller et al., 2012). All these considerations support deeming network reconstruction a still unsolved problem (Szederkenyi et al., 2011).
Despite the ever-rising number of available algorithms, only recently have efforts been made towards an objective comparison of network inference methods, highlighting current limitations (Altay and Emmert-Streib, 2010; Krishnan et al., 2007) as well as relative strengths and disadvantages (Madhamshettiwar et al., 2012). Among these, it is worth mentioning the international DREAM challenge (Marbach et al., 2010), whose key result in the last edition advocated integrating predictions from multiple inference methods as an effective strategy to enhance performance, taking advantage of the algorithms' complementarity (De Smet and Marchal, 2010). Nevertheless, algorithm uncertainty has so far been assessed only in terms of performance, i.e. the distance of the reconstructed network from the ground truth, wherever available, while little has been investigated with respect to the stability of the methods. Stability is of particular interest when no gold standard is available for the given problem, so that there is no way to evaluate the algorithm's accuracy, leaving stability as the sole rule of thumb for judging the reliability of the obtained network. Here we propose to tackle the issue by quantifying inference variability with respect to data perturbation and, in particular, data subsampling. If a portion of the data is randomly removed before inferring the network, the resulting graph is likely to differ from the one reconstructed from the whole dataset and, in general, different subsets of data will generate different networks. Thus, in the spirit of applying reproducibility principles to this field, one has to accept that the inferred/non-inferred links are just an estimate, lying within a reasonable probability interval. In brief, we propose a set of four indicators allowing the researcher to quantitatively evaluate the reliability of the inferred/non-inferred links. In detail, we quantitatively assess,
for a given ratio of removed data and a given number of resamplings, the mutual distances among all inferred networks and their distances to the network generated by the whole dataset, with the idea that the smaller the average distance, the stabler the network. Moreover, we provide a ranked list of the stablest links and nodes, where the rank is induced by the variability of the link weight and the node degree across the generated networks, the less variable being the top ranked. As network distance we employ the HIM distance (Jurman et al., 2012), which represents a good compromise between local (link-based) and global (structure-based) measures of network comparison.
As a first testbed in a controlled situation, the four indicators are computed on a synthetic dataset for different instances of a correlation network with different measures, highlighting the impact of an FDR filter on the network reconstruction method. Finally, we show the use of the stability measures in comparing the relevance networks inferred from a miRNA microarray dataset with paired tissues extracted from a cohort of 241 hepatocellular carcinoma patients (Budhu et al., 2008). The data have two phenotypes, related to disease (tumoral or non-tumoral tissue) and to patient sex (male or female), allowing the construction of four different networks displaying different levels of stability.
Due to the relevant computational workload, all the analyses were run as R and Python scripts on multicore workstations and on the FBK HPC facility Kore Linux cluster.

METHODS
Before defining the four stability indicators we briefly summarize the main definitions and properties of the HIM network distance.

HIM Network Distance
The HIM distance (Jurman et al., 2012) is a metric for network comparison combining an edit distance (Hamming (Tun et al., 2006; Dougherty, 2010)) and a spectral one (Ipsen-Mikhailov (Ipsen and Mikhailov, 2002)). As discussed in (Jurman et al., 2011), edit distances are local, that is, they focus only on the portions of the network affected by differences in the presence/absence of matching links. Spectral distances evaluate instead the global structure of the compared topologies, but they cannot distinguish isomorphic or isospectral graphs, which can correspond to quite different conditions within the biological context. Their combination into the HIM distance represents an effective solution to the quantitative evaluation of network differences.
Let N_1 and N_2 be two simple networks on N nodes, described by the corresponding adjacency matrices A_1 and A_2, with entries a^(1)_ij, a^(2)_ij ∈ F, where F = F_2 = {0, 1} for unweighted graphs and F = [0, 1] for weighted networks. Denote by I_N the N × N identity matrix, by 1_N the unitary N × N matrix with all entries equal to one and by 0_N the null N × N matrix with all entries equal to zero. Finally, denote by E_N the empty network with N nodes and no links (with adjacency matrix 0_N) and by F_N the undirected full network with N nodes and all possible N(N − 1) links (whose adjacency matrix is 1_N − I_N). The Hamming distance is the sum of the absolute differences between the matching entries of the two adjacency matrices:

Hamming(N_1, N_2) = Σ_{i≠j} |a^(1)_ij − a^(2)_ij|.

To guarantee independence from the network dimension (number of nodes), we normalize this function by the factor η = Hamming(E_N, F_N) = N(N − 1):

H(N_1, N_2) = (1 / (N(N − 1))) Σ_{i≠j} |a^(1)_ij − a^(2)_ij|.   (1)

When N_1 and N_2 are unweighted networks, H(N_1, N_2) is just the fraction of different matching links (over the total number N(N − 1) of possible links) between the two graphs. In all cases H(N_1, N_2) ∈ [0, 1], where the lower bound 0 is attained only for identical networks A_1 = A_2 and the upper bound 1 is reached whenever the two networks are complementary, A_1 + A_2 = 1_N − I_N.

Among spectral distances, we consider the Ipsen-Mikhailov distance IM, which has been proven to be the most robust in a wide range of situations (Jurman et al., 2011). Originally introduced in (Ipsen and Mikhailov, 2002) as a tool for network reconstruction from its Laplacian spectrum, the definition of the Ipsen-Mikhailov metric follows the dynamical interpretation of an N-node network as an N-atom molecule connected by identical elastic strings, where the pattern of connections is defined by the adjacency matrix of the corresponding network. The dynamical system is described by the set of equations of motion

ẍ_i + Σ_j A_ij (x_i − x_j) = 0,  i = 1, …, N.   (2)

We recall that the Laplacian matrix L of an undirected network is defined as the difference between the degree and the adjacency matrices, L = D − A, where D is the diagonal matrix with the vertex degrees as entries. L is positive semidefinite and singular (Chung, 1997; Atay et al., 2006; Spielman, 2009; Tönjes and Blasius, 2009), so its eigenvalues can be ordered as 0 = λ_0 ≤ λ_1 ≤ … ≤ λ_{N−1}. The vibrational frequencies ω_i of the network model in Eq. 2 are given by the eigenvalues of the Laplacian matrix of the network, λ_i = ω_i², with λ_0 = ω_0 = 0. The spectral density of a graph is defined as the sum of Lorentz distributions

ρ(ω, γ) = K Σ_{i=1}^{N−1} γ / ((ω − ω_i)² + γ²),

where γ is the common width and K is the normalization constant defined so that ∫_0^∞ ρ(ω, γ) dω = 1. The scale parameter γ specifies the half-width at half-maximum, which is equal to half the interquartile range. The spectral distance ǫ_γ between two graphs G and H on N nodes with densities ρ_G(ω, γ) and ρ_H(ω, γ) can then be defined as

ǫ_γ(G, H) = [ ∫_0^∞ (ρ_G(ω, γ) − ρ_H(ω, γ))² dω ]^{1/2}.

The highest value of ǫ_γ is reached, for each N, when evaluating the distance between E_N and F_N. Defining γ̄ as the (unique) solution of ǫ_γ(E_N, F_N) = 1, we can now define the normalized Ipsen-Mikhailov distance as

IM(G, H) = ǫ_{γ̄}(G, H),

so that IM(G, H) ∈ [0, 1], with the upper bound attained only for (G, H) = (E_N, F_N). Finally, the HIM distance is defined as the product metric of the normalized Hamming distance H and the normalized Ipsen-Mikhailov distance IM, normalized by the factor √2 to set its upper bound to 1:

HIM(N_1, N_2) = (1/√2) [ H(N_1, N_2)² + IM(N_1, N_2)² ]^{1/2}.

We can represent the HIM distance in the [0, 1] × [0, 1] Hamming/Ipsen-Mikhailov space, where a point P(x, y) represents the distance between two networks N_1 and N_2 with coordinates x = H(N_1, N_2) and y = IM(N_1, N_2), so that the norm of P is √2 times the HIM distance HIM(N_1, N_2). The same holds for weighted networks, provided that the weights range in [0, 1]. In Fig. 
1 we provide an example of this representation of the HIM distance between networks of four nodes. Roughly splitting the Hamming/Ipsen-Mikhailov space into four main zones I, II, III and IV as in Figure 1, we can say that two networks whose distance corresponds to a point in zone I are quite close both in terms of matching links and of structure, while those falling in zone III are very different with respect to both characteristics. Networks corresponding to a point in zone II have many common links, but their structure is rather different, while a point in zone IV indicates two networks with few common links but similar structure. Full mathematical details about the HIM distance and its two components H and IM are available in (Jurman et al., 2012).
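The definitions above can be sketched in a few lines of Python. This is not the nettools implementation: in particular, the width γ is fixed here and the IM component is normalized by dividing by ǫ_γ(E_N, F_N), instead of solving ǫ_γ(E_N, F_N) = 1 for γ as in the paper; both choices give IM(E_N, F_N) = 1.

```python
import numpy as np

# Integration grid for the Lorentzian spectral densities (assumed wide
# enough to cover the Laplacian frequencies of the graphs compared).
_GRID = np.linspace(0.0, 12.0, 3000)
_DX = _GRID[1] - _GRID[0]

def hamming(A1, A2):
    # Normalized Hamming distance H (Eq. 1): mean absolute difference of
    # the off-diagonal adjacency entries, in [0, 1].
    N = A1.shape[0]
    off = ~np.eye(N, dtype=bool)
    return np.abs(A1 - A2)[off].sum() / (N * (N - 1))

def _density(A, gamma):
    # Lorentzian spectral density built on the vibrational frequencies
    # omega_i = sqrt(lambda_i) of the Laplacian L = D - A (lambda_0 = 0
    # dropped), normalized to unit area on the grid.
    L = np.diag(A.sum(axis=1)) - A
    lam = np.clip(np.linalg.eigvalsh(L), 0.0, None)
    rho = np.zeros_like(_GRID)
    for w in np.sqrt(lam[1:]):
        rho += gamma / ((_GRID - w) ** 2 + gamma ** 2)
    return rho / (rho.sum() * _DX)

def _eps(A1, A2, gamma):
    # Spectral distance eps_gamma: L2 distance between the two densities.
    d = _density(A1, gamma) - _density(A2, gamma)
    return np.sqrt((d ** 2).sum() * _DX)

def ipsen_mikhailov(A1, A2, gamma=0.08):
    # Normalized IM distance (simplification: fixed gamma, division by
    # eps_gamma(E_N, F_N) instead of solving for gamma).
    N = A1.shape[0]
    E = np.zeros((N, N))
    F = np.ones((N, N)) - np.eye(N)
    return _eps(A1, A2, gamma) / _eps(E, F, gamma)

def him(A1, A2, gamma=0.08):
    # HIM distance: Euclidean combination of H and IM, scaled to [0, 1].
    return np.sqrt(hamming(A1, A2) ** 2
                   + ipsen_mikhailov(A1, A2, gamma) ** 2) / np.sqrt(2)
```

By construction him(E_N, F_N) evaluates to 1 and him(A, A) to 0 for any network A.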

Stability indicators
We now introduce the four stability indicators for a given subset of the original data and a given number of replicates, producing a set of corresponding inferred networks. The first two indicators concern the stability of the entire network, measuring the mutual distances of the networks inferred from the different replicates and their distances to the network constructed on the whole dataset. The other two indicators concern instead the stability (and thus the reliability) of the single nodes and links, in terms of the variability of their respective degree and weight. In Fig. 2

FDR effect on correlation networks
As a first experiment, we want to assess the different levels of stability in a correlation network inferred from a set of synthetic high-throughput signals when the inference (absolute value of Pearson correlation) is computed with or without False Discovery Rate control (see for instance (Jiao et al., 2011)). As correlation measures, we use the classical (absolute) Pearson correlation of WGCNA (Horvath, 2011) and the novel correlation measure called Maximal Information Coefficient (MIC), a component of the Maximal Information-based Nonparametric Exploration (MINE) statistics (Reshef et al., 2011; Speed, 2011; Nature Biotechnology, 2012). For a set of values n < m and an adequate number of resamplings r = min{20, C(m, n)}, with C(·, ·) the binomial coefficient, we compute the indicators I_j(n, r) for j = 1, . . ., 4 for all the used algorithms, following the general procedure below.

2. Choose two integers n, r with n < s and r ≤ C(s, n), and build a set D^(n,r) = {D_1, . . ., D_r}, where D_i is a dataset built by choosing n samples from D.
3. Reconstruct, by using the same algorithm ALG, the corresponding networks N D i on the subsampled data.

4. Compute the four indicators I_1(n, r), . . ., I_4(n, r) on the inferred networks N_{D_i} and on N_D (see Fig. 2 for the formal definitions).

5. For each set of values I_i compute the mean, the range (defined as the difference between maximum and minimum value) and the 95% studentized bootstrap confidence intervals (Davison and Hinkley, 1997) as implemented in the R package boot (Canty and Ripley, 2012).
6. Comparative analysis of the statistics of the four indicators I_1, . . ., I_4 describes the level of confidence (stability) in the network N_D, in its links and in its nodes.
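Steps 1 to 5 can be sketched as follows, using absolute Pearson correlation as the reconstruction algorithm ALG and a plain normalized Hamming distance as a stand-in for the HIM metric to keep the sketch self-contained. The I_1/I_2 labels follow the informal description in the Introduction (I_1 for distances to the full-data network, I_2 for mutual distances); the formal definitions are those of Fig. 2.

```python
import numpy as np
from itertools import combinations

def pearson_network(X):
    # ALG: absolute Pearson correlation network with zero diagonal,
    # weights in [0, 1]; X has samples on the rows, features on the columns.
    A = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(A, 0.0)
    return A

def net_distance(A1, A2):
    # Stand-in for the HIM metric: normalized Hamming distance on the
    # weighted adjacency matrices.
    N = A1.shape[0]
    return np.abs(A1 - A2).sum() / (N * (N - 1))

def stability(X, n, r, seed=0):
    # Step 1: infer N_D on the whole dataset.
    # Steps 2-3: build r datasets of n samples each and reconstruct N_{D_i}.
    # Step 4: compute the distances to N_D and the mutual distances.
    # Step 5: summarize with mean and range.
    rng = np.random.default_rng(seed)
    s = X.shape[0]
    full = pearson_network(X)
    nets = [pearson_network(X[rng.choice(s, size=n, replace=False)])
            for _ in range(r)]
    to_full = [net_distance(Ni, full) for Ni in nets]
    mutual = [net_distance(Ni, Nj) for Ni, Nj in combinations(nets, 2)]
    summary = lambda v: {"mean": float(np.mean(v)),
                         "range": float(np.max(v) - np.min(v))}
    return {"I1": summary(to_full), "I2": summary(mutual)}
```

On random Gaussian data, shrinking the subsample size n visibly inflates both indicators, reproducing the qualitative behaviour discussed in the Results.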

Data generation
As a synthetic benchmark for evaluating differences between the Pearson and MIC correlation measures, and to assess the impact of the FDR filter on the construction of a correlation network, we built a dataset S consisting of 100 measurements (samples) of 20 variables (features) f_i, from which we constructed the corresponding correlation networks on 20 nodes. The dataset S was generated starting from its correlation matrix M_S, which was randomly generated under three constraints (for Corr the Pearson correlation): strong correlations within the two groups {f_i : 1 ≤ i ≤ 5} and {f_i : 6 ≤ i ≤ 10}, and a weaker correlation of about 0.3 within {f_i : 11 ≤ i ≤ 15}. The correlation matrix M_S is plotted in Fig. 4: clearly, the correlation values in the three groups defined by the above constraints represent true relations between the variables, while all other, smaller correlation values are due to the underlying random generation model for M_S.

The correlation networks are built as follows (see also Fig. 3):

1. Let D be a dataset with m samples described by q features, and let C(h, k) = |cor(x_h, x_k)|, where x_j is the j-th feature of D across the m samples and cor is a correlation measure.
2. Build the standard correlation network N_D using the rule a_hk = C(h, k).
3. Build the FDR-controlled (at p-value ℘ = 10^{−z}) correlation network M^℘_D using the rule a_hk = C(h, k) if (h, k) ∈ F_z and a_hk = 0 otherwise, where F_z is the set of feature pairs whose correlation is judged significant by the FDR procedure at level ℘.

Results
Starting from the dataset S we built five correlation networks, using MIC, absolute Pearson correlation without FDR correction (WGCNA) and absolute Pearson correlation with FDR correction, at p-values ℘ = 10^{−2}, 5 · 10^{−3}, 10^{−4}. The plots of three of the networks are displayed in Fig. 5. As expected, while the WGCNA network with the strictest FDR correction ℘ = 10^{−4} discards all links as not significant apart from the edges connecting the two disjoint sets of nodes {f_i : 1 ≤ i ≤ 5} and {f_i : 6 ≤ i ≤ 10} (the strongest correlations in the matrix M_S), WGCNA and MIC generate two fully connected networks with a majority of weak links. We then computed the four indicators I_1, . . ., I_4 for all five networks described above, in the setup described in Sec. 2.2. The main statistics for the indicators I_1 and I_2 are reported in Tab. 1 and displayed in Fig. 6. As expected, the ratio of discarded data has a strong impact on both I_1 and I_2: in the leave-one-out case the indicators' values are close to zero regardless of the algorithm, while in the k-fold cross-validation case the stability worsens for decreasing values of k, in terms of both mean and confidence intervals. This means not only that the networks inferred from a subset of the data lie at a larger distance, both mutually and from the network reconstructed from the whole dataset, but also that these distances have larger variability. From the point of view of the different algorithms involved, the stricter the p-value in the FDR-controlled WGCNA networks, the stabler the networks, with uncontrolled WGCNA and MINE as the worst performers in terms of stability. This is due to the fact that they take into account all possible correlation values, even though most of the smaller values do not represent existing relations between variables but are rather a noise effect. As a first result, then, we showed that the use of an FDR control procedure for correlation helps stabilize the inference procedure, improving the performance of a method already acknowledged as effective (Allen et al., 2012).
We now discuss the stablest links and nodes in the three cases WGCNA, WGCNA FDR 1e-4 and MIC: in particular, in Tab. 2 and 3 we show the top-ranked links and nodes, ordered by the range-over-mean ratio of their weights across all the k4 resamplings. The results collected in the tables are consistent with the structure of the starting correlation matrix M_S and with the behaviour of the inference algorithms. For the WGCNA case, the top 20 stablest links are those of the two fully connected subgroups F_{1,5} = {f_i : 1 ≤ i ≤ 5} and F_{6,10} = {f_i : 6 ≤ i ≤ 10} with the largest Pearson correlation values in M_S. The same applies to WGCNA FDR 1e-4 (with approximately the same values of weight range over weight mean as for WGCNA), for which these 20 links are the only existing ones (see Fig. 5). Among the following ranked links in WGCNA, those belonging to the F_{11,15} = {f_i : 11 ≤ i ≤ 15} group (whose correlation of about 0.3 was imposed as a constraint for M_S) emerge, with a couple of exceptions, but with larger instability values (0.33-0.78 vs. 0.03-0.14). The remaining links are the unstablest, displaying Range/Mean values always larger than 0.83: they are the randomly correlated links of M_S. It is interesting to note that the MIC network, due to the nature of the MIC statistic aimed at detecting relations between variables other than linear, displays similar but not identical results: the values of Range/Mean are confined to a narrower interval and, although many links belonging to the F_{1,5} and F_{6,10} groups are highly ranked, some of them can also be found in much lower positions of the standing. Similar considerations hold for the ranking of the stablest nodes: for WGCNA, the top-ranking nodes are those in F_{1,5} and F_{6,10} (with similar Range/Mean values), with those in F_{11,15} coming next, leaving the remaining five as the most unstable, with higher Range/Mean values. These five nodes, on the contrary, are the stablest for WGCNA FDR 1e-4: in fact, they are not wired to any other node, while the F_{11,15} nodes show definitely higher values. Indeed, although the nodes in F_{11,15} have degree zero in the WGCNA FDR 1e-4 network inferred from the whole S, some links involving them appear in some of the resamplings on subsets of the data. To conclude, in the MIC case the ranking values again span a much narrower range than in the other two cases, and the obtained ranking has most of the nodes in F_{1,5} in top positions, while for the other nodes the relation with the structure of M_S is very weak. Finally, the analogous tables for the other data subsampling schemas (LOO, k2 and k10) are almost identical.
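The Range/Mean edge ranking used in Tab. 2 can be sketched as follows; treating the 0/0 case of never-inferred edges as perfectly stable is an assumption here (cf. the (*) entries of Tab. 3).

```python
import numpy as np

def edge_stability_ranking(networks):
    # Rank all edges by the range-over-mean ratio of their weights across
    # the r resampled networks: the smaller the ratio, the stabler the edge.
    # Assumed convention: edges with zero mean weight (never inferred in
    # any resampling) get ratio 0 and rank as perfectly stable.
    W = np.stack(networks)                   # shape (r, N, N)
    mean = W.mean(axis=0)
    wrange = W.max(axis=0) - W.min(axis=0)
    ratio = np.zeros_like(mean)
    mask = mean > 0
    ratio[mask] = wrange[mask] / mean[mask]
    N = mean.shape[0]
    edges = [(h, k, float(ratio[h, k]))
             for h in range(N) for k in range(h + 1, N)]
    return sorted(edges, key=lambda e: e[2])   # stablest first
```

The node ranking of Tab. 3 is obtained in the same way, replacing the edge weights with the node degrees of each resampled network.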

miRNA network on a Hepatocellular Carcinoma dataset
Investigating the relations connecting human microRNAs (miRNA) and how they evolve in cancer has recently been a key topic for researchers in biology (Volinia et al., 2010; Bandyopadhyay et al., 2010), with hepatocellular carcinoma (HCC) as a notable example (Law and Wong, 2011; Gu et al., 2012). In the following example, we use the stability indicators I_1, . . ., I_4 on a recent miRNA microarray dataset with two phenotypes to highlight differences in the corresponding inferred networks. As reconstruction algorithm we use the Context Likelihood of Relatedness (CLR) approach (Faith et al., 2007), belonging to the relevance network class of algorithms and generating undirected weighted graphs with weights bounded between zero and one. In particular, interactions are scored by the mutual information between the corresponding expression levels, coupled with an adaptive background correction step. Although suboptimal when the number of variables is much larger than the number of samples, CLR performs well in terms of prediction accuracy, and some CLR predictions in the literature were later experimentally validated (Ambroise et al., 2012).
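The CLR scoring scheme can be re-implemented schematically as below: a histogram-based mutual information estimate followed by a per-variable background z-scoring. The final rescaling of the scores to [0, 1] by their maximum is an assumption made here to match the bounded weights mentioned above, not necessarily the original normalization.

```python
import numpy as np

def mutual_info(x, y, bins=8):
    # Simple histogram-based mutual information estimate (in nats).
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

def clr_network(X, bins=8):
    # CLR: z-score each MI value against the background distribution of
    # the MI values of its two variables, keep the positive part, and
    # combine the two z-scores; X has samples on the rows.
    q = X.shape[1]
    M = np.zeros((q, q))
    for i in range(q):
        for j in range(i + 1, q):
            M[i, j] = M[j, i] = mutual_info(X[:, i], X[:, j], bins)
    mu = M.mean(axis=1)
    sd = M.std(axis=1) + 1e-12
    Z = np.maximum((M - mu[:, None]) / sd[:, None], 0.0)
    A = np.sqrt(Z ** 2 + Z.T ** 2)
    np.fill_diagonal(A, 0.0)
    if A.max() > 0:
        A /= A.max()        # assumed rescaling to weights in [0, 1]
    return A
```

The background correction is what distinguishes CLR from a plain mutual-information relevance network: an edge scores highly only if its MI stands out against the typical MI values of both of its endpoints.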

Data description
We start from the Hepatocellular Carcinoma dataset introduced in (Budhu et al., 2008) and later used in (Ji et al., 2009), publicly available at the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) under accession number GSE6857. The dataset collects 482 tissue samples from 241 patients affected by hepatocellular carcinoma (HCC). For each patient, a sample from cancerous hepatic tissue and a sample from the surrounding non-cancerous hepatic tissue are available, hybridized on the Ohio State University CCC MicroRNA Microarray Version 2.0 platform, consisting of 11520 probes collecting expressions of 250 non-redundant human and 200 mouse microRNAs (miRNA). After a preprocessing phase, including imputation of missing values as in (Troyanskaya et al., 2001) and discarding of probes corresponding to non-human (mouse and control) miRNAs, we end up with the dataset HCC of 240+240 paired samples described by 210 human miRNAs, with the cohort consisting of 210 male and 30 female patients. We then split the whole dataset HCC into four subsets combining the sex and disease status phenotypes, collecting respectively the cancer tissues of the male patients (MT), the cancer tissues of the female patients (FT) and the corresponding two datasets of non-cancer tissues (MnT, FnT).

Results
Using the CLR algorithm, we first generated the four networks inferred from the whole sets of data corresponding to the combinations of the two binary phenotypes: a portrait of the resulting graphs is depicted in Fig. 8, discarding links whose weight is smaller than 0.1. As a first observation, the four networks have different structures: for instance, the tumoral tissue graphs are more connected than the controls, and the female graphs more than the corresponding male ones (see for instance the density values in Fig. 8). Their mutual HIM distances are reported in Tab. 7, together with the corresponding two-dimensional scaling plot, showing that the networks corresponding to the female patients (and, in particular, the one inferred from cancer tissue) are notably different from those arising from the subsets of data for the male patients. We then computed the stability indicators I_1 and I_2 in the setup described in Sec. 2.2; the corresponding statistics are collected and displayed in Tab. 4 and Fig. 9.
The different impact of sample size on network stability is immediately evident: the networks corresponding to male patients have smaller values of I_1 and I_2 (and thus are much stabler) than the corresponding female counterparts, and this effect is even stronger than the one due to the ratio of the chosen subsets of data: the leave-one-out stability for FT and FnT is worse than the k10 and k4 stability for MT and MnT. On the other hand, while control and cancer networks display similar levels of stability in the male group at all subsampling ratios, in the female group the network associated to the controls is much stabler than the matching cancer network, and this becomes more evident as the size of the subset used for inference gets smaller, in particular for k = 2.
Finally, to show how indicators I_3 and I_4 can be used to extract information about the stability of links of interest, we first rank all links according to their weight Range/Mean value in all four cases MT, MnT, FT, FnT, and then point out six links worth a comment, listed in Tab. 5. Link (a) is top ranking in all four cases, as expected, since hsa-mir 321No1 and hsa-mir 321No2 denote essentially the same miRNA (identical or with very similar sequences (Ambros et al., 2003)). The same applies to links (b) and (c), but in these cases the stability of the two links in the FnT network is not as good as in the other three cases, probably due to the presence of noise in the data. Link (d) is interesting because of the difference in its stability between the male and the female networks, indicating a link probably associated to sex rather than to HCC. The behaviour of link (e) is even more singular: it is one of the stablest links for the FT network, while it is not even picked up as a link by CLR in the FnT network. Finally, link (f) is a very well known connection in the literature, strongly associated to cancer (Volinia et al., 2010; Braun et al., 2008; Georges et al., 2008), as confirmed by its high stability in the MT and FT networks only.

CONCLUSIONS
We introduced a suite of four stability indicators for assessing the variability of network reconstruction algorithms as a function of a data subsampling procedure. The aim is to provide researchers with an effective tool for comparing either the inference algorithms or the investigated datasets. Two indicators are based on a normalized distance between networks and are global, giving a confidence measure on the whole inferred network, while the other two are local, associating a reliability score to the network nodes and detected links. They are of particular interest when no gold standard is known for the studied task, so that they can work as a surrogate for algorithm accuracy. We demonstrated their consistency on a synthetic dataset, and we showed their use on a high-throughput microarray experiment with two widely known inference methods, WGCNA and CLR.
Address correspondence to:

Figure 1 :
Figure 1: An example of HIM distance. (a) Network A (top) and Network B (bottom); (b) Representation in the Ipsen-Mikhailov and Hamming distance space of the HIM distance between network A and networks B, C and D, where C is the fully connected network and D is the empty one.
we detail the mathematical formulation of the four indicators: the smaller the indicators' values, the stabler the indicators' targets. In particular, for all experiments on both synthetic and biological datasets we used n = s − 1, r = 1 [leave-one-out stability, LOO for short], and 20 different instances of k-fold cross-validation (discarding the test portion) for k = 2, 4, 10 (denoted by k2, k4 and k10 in what follows), and thus n = ⌊s(k − 1)/k⌋ and r = 20k.

1.
Given a dataset D with s samples and p features, reconstruct (with a chosen algorithm ALG) the network N_D on the whole dataset D; denote the p nodes of N_D by x^D_1, . . ., x^D_p and its edge weights by a^D_hk, for h, k = 1, . . ., p.

Figure 3 :
Figure 3: Construction of a FDR-corrected correlation network.

Figure 4 :
Figure 4: The correlation matrix M S used to generate the synthetic dataset S

Figure 5: Correlation networks inferred from the dataset S using (a) absolute Pearson, (b) absolute Pearson with FDR correction at p-value 10^{−4} and (c) MIC. Node label i corresponds to feature f_i, node size is proportional to node degree and link colors identify different classes of link weights.

Figure 6:

Figure 8: CLR networks (and corresponding density values) inferred from the 4 subsets (a) Male Tumoral (MT), (b) Male non Tumoral (MnT), (c) Female Tumoral (FT) and (d) Female non Tumoral (FnT) of the dataset HCC. Links are thresholded at weight 0.1, node position is fixed across the four networks, node size is proportional to degree and edge width is proportional to link weight.

Figure 9:

Table 1 :
Statistics (mean, bootstrap confidence intervals and range) of the stability indicators I 1 and I 2 for different instances of the WGCNA and MIC networks on the dataset S and for different values of data subsampling.

Table 3 :
Top-ranked nodes, ordered by degree range over degree mean across all 20 resamplings of the k4 4-fold cross-validation, for the three algorithms WGCNA, WGCNA FDR 1e-4 and MIC. (*) indicates that Ratio and Mean are both zero.

Table 4 :
Statistics (mean, bootstrap confidence intervals and range) of the stability indicators I 1 and I 2 for the CLR inferred networks on the datasets MT, MnT, FT, FnT, for different values of data subsampling.

Table 5 :
Position in the weight Range/Mean ranking in the four cases MT, MnT, FT, FnT for six miRNA-miRNA links.