Conceived and designed the experiments: PL SH. Analyzed the data: PL RL MCO. Wrote the paper: PL SH.

The authors have declared that no competing interests exist.

In many applications, one is interested in determining which of the properties of a network module change across conditions. For example, to validate the existence of a module, it is desirable to show that it is reproducible (or preserved) in an independent test network. Here we study several types of network preservation statistics that do not require a module assignment in the test network. We distinguish network preservation statistics by the type of the underlying network. Some preservation statistics are defined for a general network (defined by an adjacency matrix) while others are only defined for a correlation network (constructed on the basis of pairwise correlations between numeric variables). Our applications show that the correlation structure facilitates the definition of particularly powerful module preservation statistics. We illustrate that evaluating module preservation is in general different from evaluating cluster preservation. We find that it is advantageous to aggregate multiple preservation statistics into summary preservation statistics. We illustrate the use of these methods in six gene co-expression network applications including 1) preservation of cholesterol biosynthesis pathway in mouse tissues, 2) comparison of human and chimpanzee brain networks, 3) preservation of selected KEGG pathways between human and chimpanzee brain networks, 4) sex differences in human cortical networks, 5) sex differences in mouse liver networks. While we find no evidence for sex-specific modules in human cortical networks, we find that several human cortical modules are less preserved in chimpanzees. In particular, apoptosis genes are differentially co-expressed between humans and chimpanzees. Our simulation studies and applications show that module preservation statistics are useful for studying differences between the modular structure of networks. Data, R software and accompanying tutorials can be downloaded from the following webpage:

In network applications, one is often interested in studying whether modules are preserved across multiple networks. For example, to determine whether a pathway of genes is perturbed in a certain condition, one can study whether its connectivity pattern is no longer preserved. Non-preserved modules can either be biologically uninteresting (e.g., reflecting data outliers) or interesting (e.g., reflecting sex-specific modules). An intuitive approach for studying module preservation is to cross-tabulate module membership. However, this approach often cannot address questions about the preservation of connectivity patterns between nodes. Thus, cross-tabulation based approaches often fail to recognize that important aspects of a network module are preserved. Cross-tabulation methods make it difficult to argue that a module is

Network methods are frequently used in genomic and systems biologic studies, but also in general data mining applications, to describe the pairwise relationships of a large number of variables

This article describes several module preservation statistics for determining which properties of a network module are preserved in a second (test) network. The module preservation statistics allow one to quantify which aspects of within-module topology are preserved between a reference network and a test network. For brevity, we will refer to these aspects as connectivity patterns, but we note that our statistics are not based on network motifs. We use the term “module” in a broad sense: a network module is a subset of nodes that forms a sub-network inside a larger network. Any subset of nodes inside a larger network can be considered a module. This subset may or may not correspond to a cluster of nodes.

Many cluster validation statistics proposed in the literature can be turned into module preservation statistics. In the following, we briefly review cluster validation statistics. Traditional cluster validation (or quality) statistics can be split into four broad categories: cross-tabulation, density, separability, and stability statistics

While many cluster validation statistics are based on within- and/or between cluster variance, several recent articles used prediction error to evaluate the reproducibility (or validity) of clusters in gene expression data

No. | Preservation statistic | | | Network | Ref. netw. input | | | Test netw. input | | | Used in composite | | |
 | Name | Eq. | Type | | Lbl | Adj | | Lbl | Adj | | | | |
1 | coClustering | Supp. | Cross-tab | not used | yes | no | no | yes | no | no | no | no | no |
2 | | Supp. | Cross-tab | not used | yes | no | no | yes | no | no | no | no | no |
3 | −log(p-value) | Supp. | Cross-tab | not used | yes | no | no | yes | no | no | no | no | no |
4 | | 8 | Density | general | yes | no | no | no | yes | no | no | no | yes |
5 | | 9 | Density | general | yes | no | no | no | yes | no | no | no | no |
6 | | 10 | Density | general | yes | no | no | no | yes | no | no | no | no |
7 | | 11 | Connect. | general | yes | yes | no | no | yes | no | yes | yes | yes |
8 | | 12 | Connect. | general | yes | yes | no | no | yes | no | yes | yes | yes |
9 | | 13 | Connect. | general | yes | yes | no | no | yes | no | no | no | no |
10 | | 14 | Connect. | general | yes | yes | no | no | yes | no | no | no | no |
11 | | 27 | Separab. | general | yes | yes | no | no | yes | no | no | no | no |
12 | | 19 | Den.+Con. | cor | yes | no | yes | no | no | yes | yes | yes | no |
13 | | 20 | Connect. | cor | yes | no | yes | no | no | yes | yes | yes | no |
14 | | 21 | Density | cor | yes | no | yes | no | no | yes | yes | yes | no |
15 | | 22 | Den.+Con. | cor | yes | no | yes | no | no | yes | yes | yes | no |
16 | | 23 | Connect. | cor | yes | no | yes | no | no | yes | yes | yes | no |
17 | | 24 | Connect. | cor | yes | no | yes | no | no | yes | no | no | no |
18 | | 28 | Separab. | cor | yes | no | yes | no | no | yes | no | no | no |
19 | | 1 | Compos. | cor | yes | yes | yes | no | yes | yes | | | |
20 | | | Compos. | cor | yes | yes | yes | no | yes | yes | | | |
21 | | 34 | Compos. | cor | yes | yes | yes | no | yes | yes | | | |
22 | | 35 | Compos. | general | yes | yes | no | no | yes | no | | | |

Term | Definition |

(Undirected) Network | Generally speaking, an undirected network consists of nodes (for example, gene expression profiles), and connection strengths between pairs of nodes. The connection strengths can be either categorical (connected vs. unconnected), or continuous between 0 (no connection) and 1 (strongest connection). |

Adjacency matrix | The connection strengths in an undirected network can be represented by the |

Correlation network | This type of network is built from numerical data |

Gene co-expression network | In gene co-expression networks, the nodes represent genes (or probesets of a microarray) measured across a given set of microarray samples, and the connections represent the strength of co-expression. Various measures of co-expression can be used, for example Pearson or robust correlation (in which case the co-expression network is also a correlation network), information-theoretic methods such as mutual information, and other measures of co-expression similarity. |

Sub-network | A subnetwork of a network can be any collection (subset) of nodes from the network, together with the adjacencies (connection strengths) between the nodes. Thus, a subnetwork of a network also forms a (smaller) network on its own. |

Module | A network module is a subset of nodes that forms a sub-network inside a larger network. Any subset of nodes inside a larger network gives rise to a module. This subset may or may not correspond to a cluster of nodes. |

Cluster | A cluster of nodes within a network is usually defined as a group of nodes that are strongly connected. Many definitions and algorithms for finding clusters in data have been proposed in the literature. |

Network density | The mean adjacency (connection strength) among all nodes in the network. |

Connectivity | For each node, the connectivity (also known as degree) is defined as the sum of connection strengths with the other network nodes: |

Intramodular connectivity | Intramodular connectivity measures how connected, or co-expressed, a given node is with respect to the nodes of a particular module. Thus, intramodular connectivity is also the connectivity in the subnetwork defined by the module. The intramodular connectivity may be interpreted as a measure of module membership. |

Module eigennode | The module eigennode |

Eigennode-based connectivity | For the i-th vector |
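The network concepts defined in the table above (network density, connectivity, and intramodular connectivity) can be illustrated with a minimal sketch. It assumes a symmetric adjacency matrix stored as a nested list with entries between 0 and 1; the function names and the toy matrix are hypothetical and are not part of the accompanying R software.

```python
from itertools import combinations


def network_density(adj):
    """Mean adjacency over all pairs of distinct nodes."""
    pairs = list(combinations(range(len(adj)), 2))
    return sum(adj[i][j] for i, j in pairs) / len(pairs)


def connectivity(adj):
    """Connectivity (degree) of each node: sum of adjacencies to the other nodes."""
    n = len(adj)
    return [sum(adj[i][j] for j in range(n) if j != i) for i in range(n)]


def intramodular_connectivity(adj, module):
    """Connectivity restricted to the sub-network spanned by the module nodes."""
    return {i: sum(adj[i][j] for j in module if j != i) for i in module}


# Toy weighted network with 4 nodes (hypothetical values).
adj = [
    [1.0, 0.8, 0.6, 0.1],
    [0.8, 1.0, 0.7, 0.2],
    [0.6, 0.7, 1.0, 0.0],
    [0.1, 0.2, 0.0, 1.0],
]
density = network_density(adj)
k = connectivity(adj)
k_im = intramodular_connectivity(adj, [0, 1, 2])  # module = nodes 0, 1, 2
```

In this toy example, node 3 is only weakly connected, so restricting attention to the module formed by nodes 0 to 2 changes the connectivities very little.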

Even when modules are defined using a module detection procedure, cross-tabulation based approaches face potential pitfalls. A module found in the reference data set will be deemed non-reproducible in the test data set if no matching module can be identified by the module detection approach in the test data set. Such non-preservation may be called the

We distinguish network statistics by the underlying network. Some preservation statistics are defined for a general network (defined by an adjacency matrix) while others are only defined for a correlation network (constructed on the basis of pairwise correlations between numeric variables). Our applications show that the correlation structure facilitates the definition of particularly powerful module preservation statistics. Preservation statistics 4–11 (

It is often not clear whether an observed value of a preservation statistic is higher than expected by chance. As detailed in

Because preservation statistics measure different aspects of module preservation, their results may not always agree. We find it useful to aggregate different module preservation statistics into composite preservation statistics. Composite preservation statistics also facilitate a fast evaluation of many modules in multiple networks. We define several composite statistics.
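The idea of aggregating statistics can be sketched as follows. This is only an illustration under stated assumptions: the individual statistics are assumed to be already standardized (Z statistics), they are summarized by the median within each type (density and connectivity), and the two medians are then averaged. The function name, the dictionary keys and the numeric values are hypothetical.

```python
from statistics import median


def composite_preservation(z_density, z_connectivity):
    """Average of the median density Z and the median connectivity Z."""
    z_dens = median(z_density.values())
    z_conn = median(z_connectivity.values())
    return (z_dens + z_conn) / 2


# Hypothetical standardized statistics for one module.
z_density = {"dens_a": 12.0, "dens_b": 8.0, "dens_c": 10.0}
z_connectivity = {"conn_a": 3.0, "conn_b": 5.0, "conn_c": 4.0}
summary = composite_preservation(z_density, z_connectivity)
```

Using the median within each type makes the summary less sensitive to a single extreme statistic, while the final average gives density and connectivity equal weight.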

For correlation networks based on quantitative variables, the

Since biologists are often more familiar with p-values as opposed to Z statistics, our R implementation in function modulePreservation also calculates empirical p-values. Analogous to the case of the Z statistics, the p-values of individual preservation statistics are summarized into a descriptive measure called

The Z statistics and permutation test p-values often depend on the module size (i.e. the number of nodes in a module). This fact reflects the intuition that it is more significant to observe that the connectivity patterns among hundreds of nodes are preserved than to observe the same among say only

Several studies have explored how co-expression modules change between mouse tissues

We consider a single module defined by the genes of the gene ontology (GO) term “Cholesterol biosynthetic process” (CBP, GO id GO:0006695 and its GO offspring). Of the 28 genes in the CBP, 24 could be found among our 17104 genes. Cholesterol is synthesized in liver and we used the female liver network as the reference network module. As test networks we considered the CBP co-expression networks in other tissue/sex combinations.

Each circle plot in

The module is defined as a signed weighted correlation network among genes from the GO category Cholesterol Biosynthetic Process. Module preservation statistics allow one to quantify similarities between the depicted networks. The figure depicts the connectivity patterns (correlation network adjacencies) between cholesterol biosynthesis genes in 4 different mouse tissues from male and female mice of an F2 mouse cross. The thickness of the line reflects the absolute correlation. The line is colored in red if the correlation is positive and green if it is negative. The size of each black circle indicates the connectivity of the corresponding gene; hub genes (i.e., highly connected genes) are represented by larger circles. Visual inspection suggests that the male and female liver networks are rather similar and show some resemblance to those of the adipose tissue. Module preservation statistics can be used to measure the similarity of connectivity patterns between pairs of networks.

We now turn to a quantitative assessment of this example. We start out by noting that a

Quantitative evaluation of the similarities among the networks depicted in

The

We briefly compare the performance of our network based statistics with those from the IGP method

Here we study the preservation of co-expression between human and chimpanzee brain gene expression data. The data set consists of 18 human brain and 18 chimpanzee brain microarray samples

A. Hierarchical clustering tree (dendrogram) of genes based on human brain co-expression network. Each “leaf” (short vertical line) corresponds to one gene. The color rows below the dendrogram indicate module membership in the human modules (defined by cutting branches of this dendrogram at the red line) and in the chimpanzee network (defined by branch cutting the dendrogram in panel B.) The color rows show that most human and chimpanzee modules overlap (for example, the turquoise module). B. Hierarchical clustering tree of genes based on the chimpanzee co-expression network. The color rows below the dendrogram indicate module membership in the human modules (defined by cutting branches of dendrogram in panel A.) and in the chimpanzee network (defined by branch cutting the dendrogram in this panel.) C. Cross-tabulation of human modules (rows) and chimpanzee modules (columns). Each row and column is labeled by the corresponding module color and the total number of genes in the module. In the table, numbers give counts of genes in the intersection of the corresponding row and column module. The table is color-coded by

The most common

We now turn to approaches for measuring module preservation that do not require that module detection has been carried out in the test data set.

A. The summary statistic

Since the modules of this application are defined as clusters, it makes sense to evaluate their preservation using cluster validation statistics.

While composite statistics summarize the results, it is advisable to understand which properties of a module are preserved (or not preserved). For example,

For co-expression modules, one can define an alternative density measure based on the module eigengene (

A. Heatmaps and eigengene plots for visualizing the gene expression profiles of the yellow module genes (rows) across human brain microarray samples (columns). In the heat map, green indicates under-expression, red over-expression, and white mean expression. The module eigengene expression depicted underneath the heat map shows how the eigengene expression (y-axis) changes across the samples (x-axis) which correspond to the columns of the heat map. The eigengene can be interpreted as a weighted average gene expression profile. The color bar below the eigengene indicates the region from which the sample was taken: light blue color indicates cortical samples, magenta indicates cerebellum samples, and orange indicates caudate nucleus samples. Scatter plots B.–D. show that the connectivity patterns of the yellow module genes tend to be highly preserved between the two species. B. Scatter plot of gene-gene correlations in chimpanzee samples (

Although density based approaches are intuitive, they may fail to detect another form of module preservation, namely the

A related but distinct connectivity preservation statistic quantifies whether intramodular hub genes in the reference network remain intramodular hub genes in the test network. Intramodular hub genes are genes that exhibit strong connections to other genes within their module. This property can be quantified by the

Another intramodular connectivity measure is

To further illustrate that modules do not have to be clusters, we now describe an application where modules correspond to KEGG pathways. KEGG (Kyoto Encyclopedia of Genes and Genomes) is a knowledge base for systematic analysis of gene functions, linking genomic information with higher order functional information

Here we present the composite statistics

The first column presents summary preservation

Since KEGG pathways are not defined via a clustering procedure it is not clear whether cluster preservation statistics are appropriate for analyzing this example. But to afford a comparison, we also report the findings for the IGP statistic

To understand which aspects of the pathways are preserved, one can study the preservation of density statistics (

The lack of preservation of the apoptosis pathway cannot be explained in terms of low module size.

This application outlines how module preservation statistics can be used to study the preservation of KEGG pathway networks. The analysis presented here is but a first step towards characterizing molecular pathway preservation between human and chimpanzee brains, and should be extended through more detailed analyses with additional data sets in the future. A limitation of our microarray data is that they measured expression levels in heterogeneous mixtures of cells. KEGG and GO (gene ontology) pathways all essentially describe interactions that take place within cells. So when data have been generated from a heterogeneous mixture of different cell types, it is possible that these relationships are somewhat obscured. It is not obvious that all of the elements of a KEGG pathway should be co-expressed, particularly since the pathways describe protein-protein interactions.

We briefly describe an application that quantifies module preservation between male and female cortical samples. The details are described in Supplementary

In Supplementary

Our preservation statistics allow one to evaluate whether a given module is preserved in another network. A related but distinct data analysis task is to construct modules that are present in several networks. By construction, a consensus module can be detected in each of the underlying networks. A challenge of many real data applications is that it is difficult to obtain independent information (a “gold standard”) that allows one to argue that a module is truly preserved. To address this challenge, we use the consensus network application where by construction, modules are known to be preserved. This allows us to determine the range of values of preservation statistics when modules are known to be preserved. In Supplementary

In

The (average linkage) hierarchical cluster trees visualize the correlations between the preservation statistics. The preservation statistics are colored according to their type: density statistics are colored in red, connectivity preservation statistics are colored in blue, separability is colored in green, and cross-tabulation statistics are colored in black. Note that statistics of the same type tend to cluster together. A derivation of some of these relationships is presented in Supplementary

We derive relationships between module preservation statistics in the sixth section of Supplementary

To illustrate the utility and performance of the proposed methods, we consider 7 different simulation scenarios that were designed to reflect various correlation network applications. An overview of these simulations can be found in

The first column outlines 6 (out of 7) simulation scenarios. Results for the seventh simulation scenario can be found in Supplementary

No. | Statistic | Type | Network | Simulation scenario | Mean | ||||||

1 | 2 | 3 | 4 | 5 | 6 | 7 | |||||

1 | coClustering | Cross-tab | not used | 1 | 4 | 3 | 1 | 4 | 2.6 | ||

2 | Cross-tab | not used | 1 | 4 | 4 | 1 | 3 | 2.6 | |||

3 | −log(p-value) | Cross-tab | not used | 1 | 4 | 3 | 1 | 4 | 2.6 | ||

4 | Density | general | 4 | 4 | 3 | 3 | 4 | 1 | 1 | 2.9 | |

6 | Connectiv. | general | 2 | 3 | 4 | 3 | 3 | 3 | 4 | 3.1 | |

12 | Den.+Con. | cor | 4 | 4 | 4 | 4 | 1 | 1 | 2 | 2.9 | |

13 | Connectiv. | cor | 3 | 4 | 4 | 4 | 1 | 3 | 4 | 3.3 | |

14 | Density | cor | 3 | 4 | 3 | 2 | 4 | 1 | 1 | 2.6 | |

15 | Den.+Con. | cor | 4 | 3 | 4 | 4 | 3 | 1 | 2 | 3 | |

16 | Connectiv. | cor | 2 | 4 | 3 | 3 | 1 | 3 | 4 | 2.9 | |

17 | Connectiv. | cor | 4 | 2 | 4 | 3 | 3 | 1 | 1 | 2.6 | |

18 | Separabil. | cor | 1 | 1 | 3 | 3 | 1 | 1 | 1 | 1.6 | |

19 | Composite | cor | 3 | 4 | 4 | 3 | 3 | 3 | 4 | 3.4 | |

21 | Composite | cor | 4 | 4 | 2 | 4 | 4 | 3.7 |

Since no thresholds can be defined for the statistic

No. | Statistic | Type | Network | Simulation scenario | Mean | ||||||

1 | 2 | 3 | 4 | 5 | 6 | 7 | |||||

19 | Composite | cor | 3 | 4 | 4 | 3 | 3 | 3 | 4 | 3.4 | |

21 | Composite | cor | 4 | 4 | NA | 4 | 2 | 4 | 4 | 3.7 | |

IGP | Dens+Sep | General | 1 | 4 | 3 | 3 | 3 | 1 | 1 | 2.3 | |

IGP perm. p | Dens+Sep | General | 1 | 4 | 3 | 2 | 2 | 1 | 1 | 2.0 |

In the following, we describe the different simulation scenarios in more detail.

In the

In the

In the

In the

In the

In the

Additional descriptions of the simulations can be found in Supplementary

Preservation statistics described in this article have been implemented in the freely available statistical language and environment R. A complete evaluation of observed preservation statistics and their permutation

This article describes powerful module preservation statistics that capture different aspects of module preservation. The network based preservation statistics only assume that each module forms a sub-network of the original network. Thus, we define a module as a subset of nodes with their corresponding adjacencies. In particular, our connectivity preservation statistics (

For a special class of networks, called approximately factorizable networks, one can derive simple relationships between network concepts

Our applications provide a glimpse of the types of research questions that can be addressed with the module preservation statistics. In general, methods for quantifying module preservation have several uses. First and foremost they can be used to determine which properties of a network module are preserved in another network. Thus, module preservation statistics are a valuable tool for validation as well as differential network analysis. Second, they can be used to define a global measure of module structure preservation by averaging the preservation statistic across multiple modules or by determining the proportion of modules that are preserved. A third use of module preservation statistics is to define measures of module quality (or robustness), which may inform the module definition. For example, to measure how robustly a module is defined in a given correlation network, one can use resampling techniques to create reference and test sets from the original data and evaluate module preservation across the resulting networks. Thus, any module preservation statistic naturally gives rise to a module quality statistic by applying it to repeated random splits (interpreted as reference and test set) of the data. By averaging the module preservation statistic across multiple random splits of the original data one arrives at a module quality statistic.
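The resampling idea described above can be sketched in a few lines. The sketch assumes that the data are stored as a list of samples (rows) by module variables (columns), and it uses the correlation between the pairwise correlations of the two halves as the preservation statistic; this choice, the function names, the number of splits and the simulated data are all illustrative assumptions rather than the procedure implemented in the R software.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def pairwise_correlations(samples):
    """Correlations between all pairs of variables (columns)."""
    columns = list(zip(*samples))
    return [pearson(columns[i], columns[j])
            for i, j in combinations(range(len(columns)), 2)]


def correlation_preservation(ref_samples, test_samples):
    """Agreement of the correlation pattern between reference and test data."""
    return pearson(pairwise_correlations(ref_samples),
                   pairwise_correlations(test_samples))


def module_quality(samples, statistic, n_splits=20, seed=1):
    """Average a preservation statistic over repeated random half splits."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_splits):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        # First half plays the role of the reference set, second half the test set.
        values.append(statistic(shuffled[:half], shuffled[half:]))
    return sum(values) / len(values)


# Simulated module: three variables driven by a shared signal, one noise variable.
rng = random.Random(0)
samples = []
for _ in range(40):
    s = rng.gauss(0, 1)
    samples.append([s + rng.gauss(0, 0.1), s + rng.gauss(0, 0.1),
                    -s + rng.gauss(0, 0.1), rng.gauss(0, 1)])
quality = module_quality(samples, correlation_preservation)
```

Any other preservation statistic that takes a reference set and a test set could be passed as the `statistic` argument.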

We briefly point out situations when alternative procedures may be more appropriate. To identify modules that are present in multiple data sets it can be preferable to consider all data sets simultaneously in a consensus module detection procedure. For example, the consensus module approach described in application 6 results in modules that are present in multiple networks by construction. To identify individual genes that diverge between two data sets, one can use standard discriminative analysis techniques. For example, differentially expressed genes can be found with differential expression analysis and differentially co-expressed genes can be found using differential co-expression analysis

While cluster analysis and network analysis are different approaches for studying high-dimensional data, there are some commonalities. For example, it is straightforward to turn a network adjacency matrix (which is a similarity measure) into a dissimilarity measure which can be used as input of a clustering procedure (e.g., hierarchical clustering or partitioning around medoids)
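As a small illustration of this conversion, the sketch below turns an adjacency matrix into a dissimilarity matrix by taking one minus the adjacency, which is one simple choice among several possible ones. It then identifies the pair of nodes that an agglomerative clustering procedure would merge first. The function names and the toy matrix are hypothetical.

```python
from itertools import combinations


def adjacency_to_dissimilarity(adj):
    """Dissimilarity = 1 - adjacency, with zeros on the diagonal."""
    n = len(adj)
    return [[0.0 if i == j else 1.0 - adj[i][j] for j in range(n)]
            for i in range(n)]


def closest_pair(dissim):
    """Pair of distinct nodes with the smallest dissimilarity."""
    return min(combinations(range(len(dissim)), 2),
               key=lambda pair: dissim[pair[0]][pair[1]])


adj = [
    [1.0, 0.8, 0.6, 0.1],
    [0.8, 1.0, 0.7, 0.2],
    [0.6, 0.7, 1.0, 0.0],
    [0.1, 0.2, 0.0, 1.0],
]
dissim = adjacency_to_dissimilarity(adj)
first_merge = closest_pair(dissim)
```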

The proposed composite preservation statistics

Although not the focus of this work, we mention that a major application of density-based statistics is to measure module

The proposed preservation statistics have several limitations including the following. First, our statistics only apply to undirected networks. Generalization of our statistics to directed networks is possible but outside of our scope.

A second limitation concerns statistics of connectivity preservation that are based on correlating network adjacencies, intramodular connectivities, etc., between the reference and the test networks. Because Pearson correlation is sensitive to outliers, it may be advantageous to use an outlier-resistant correlation measure, e.g., the Spearman correlation or the biweight midcorrelation
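The sensitivity of the Pearson correlation to a single outlier can be seen in a small constructed example. The sketch below compares the Pearson correlation with the Spearman correlation (computed as the Pearson correlation of average ranks) on hypothetical values in which one extreme observation reverses the sign of the Pearson correlation. The biweight midcorrelation is not covered here.

```python
def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def average_ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: a decreasing trend plus one extreme last observation.
reference = [1, 2, 3, 4, 5, 6, 7, 8]
test = [8, 7, 6, 5, 4, 3, 2, 100]
r_pearson = pearson(reference, test)
r_spearman = spearman(reference, test)
```

Here the Pearson correlation is positive because of the single outlier, whereas the Spearman correlation remains negative, in line with the trend of the other seven observations.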

A third limitation is that a high value of a preservation statistic does not necessarily imply that the module could be found by a

A fourth limitation is that it is difficult to pick thresholds for preservation statistics. To address this issue, we use permutation tests to adjust preservation statistics for random chance by defining Z statistics (Equation 29). The R function modulePreservation also calculates empirical p-values for the preservation statistics. A potential disadvantage of permutation test based preservation statistics (compared to observed statistics and

A fifth limitation is computational speed when it comes to calculating permutation test based statistics (e.g.

A sixth limitation is that the different preservation statistics may disagree with regard to the preservation of a given module. While certain aspects of a module may be preserved, others may not be. In our simulation studies, we present scenarios where connectivity statistics show high preservation but density measures do not and vice versa. Since both types of preservation statistics will be of interest in practice, our R function modulePreservation outputs all preservation statistics. Although we aggregate several preservation statistics into composite statistics, we recommend considering all of the underlying preservation statistics to determine which aspects of a module are preserved.
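A toy example may help to see how density and connectivity statistics can disagree. In the sketch below, the test network contains the same connection strengths as the reference network but arranged between different node pairs, so that the mean adjacency (a density measure) is unchanged while the correlation between reference and test adjacencies (a connectivity measure) is strongly negative. The function names and adjacency values are hypothetical.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def symmetric(n, strengths):
    """Build a symmetric adjacency matrix from strengths listed pair by pair."""
    adj = [[1.0] * n for _ in range(n)]
    for (i, j), a in zip(combinations(range(n), 2), strengths):
        adj[i][j] = adj[j][i] = a
    return adj


def module_adjacencies(adj, module):
    """Adjacencies between all pairs of module nodes."""
    return [adj[i][j] for i, j in combinations(module, 2)]


def mean_adjacency(adj, module):
    """Density measure: mean adjacency among the module nodes."""
    values = module_adjacencies(adj, module)
    return sum(values) / len(values)


def adjacency_correlation(ref, test, module):
    """Connectivity measure: correlation of reference and test adjacencies."""
    return pearson(module_adjacencies(ref, module),
                   module_adjacencies(test, module))


module = [0, 1, 2, 3]
ref = symmetric(4, [0.9, 0.8, 0.1, 0.7, 0.2, 0.1])
test = symmetric(4, [0.1, 0.2, 0.7, 0.1, 0.8, 0.9])  # same strengths, rearranged

density_ref = mean_adjacency(ref, module)
density_test = mean_adjacency(test, module)
cor_adj = adjacency_correlation(ref, test, module)
```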

While we describe situations when cross-tabulation based preservation statistics are not applicable, we should point out that cross-tabulation statistics also have the following advantages. First, they are often intuitive. Second, they can be applied when no network structure is present. Third, they work well when module assignments are strongly preserved and the modules remain separate in the test network. In the first section of Supplementary

We note that the interpretation of gene co-expression relationships depends heavily on biological context. For example, in a dataset consisting of samples from multiple tissue types, co-expression modules (that is, modules defined by co-expression similarity) will often distinguish genes that are expressed in tissue-specific patterns (e.g.,

Although elucidating the functional significance of identified co-expression modules requires substantial effort from biologists and bioinformaticians, the importance of co-expression modules lies not only in their functional interpretation, but also in their reproducibility. Because transcriptome organization in a given biological system is highly reproducible

Given the above-mentioned limitations, it is reassuring that the proposed module preservation statistics perform well in 6 real data applications and in 7 simulation scenarios. Although it would be convenient to have a single statistic and a corresponding threshold value for deciding whether a module is preserved, this simplistic view fails to realize that module preservation should be judged according to multiple criteria (e.g., density preservation, connectivity preservation, etc). Individual preservation statistics provide a more nuanced and detailed view of module preservation. Before deciding on module preservation, the data analyst should decide which aspects of module preservation are of interest.

Due to space limitations, we have moved our description of cross-tabulation based preservation statistics to the first section of Supplementary

Our methods are applicable to weighted or unweighted networks that are specified by an

To simplify notation, we introduce the function

A network represented by its adjacency matrix can be characterized by a number of network concepts (also known as network indices)

The

The Maximum Adjacency Ratio (MAR)

The clustering coefficient

Many network analyses define modules, that is subsets of nodes that form a sub-network in the original network. Modules are labeled by integer labels

Here we describe module preservation statistics that can be used to determine whether a module that is present in a reference network (with adjacency

Intuitively, one may call a module

Other network concepts may be used to obtain a summary statistic of a module. For example, our R function modulePreservation also calculates preservation statistics based on the mean

Connectivity preservation statistics quantify how similar connectivity of a given module is between a reference and a test network. For example, module connectivity preservation can mean that, within a given module

If module

Correlation networks are a special type of undirected network in which the adjacency is constructed on the basis of correlations

An important choice in the construction of a correlation network concerns the treatment of strong negative correlations. In

The default method for defining modules in weighted correlation networks is to use average linkage hierarchical clustering coupled with dynamic branch cutting

Many module construction methods lead to correlation network modules comprised of highly correlated variables. For such modules one can summarize the corresponding module vectors using the first principal component denoted by

The module eigennode

Both

The specific nature of correlation networks allows us to define additional module preservation statistics. The underlying information carried by the sign of the correlation can be used to further refine the statistics irrespective of whether a signed or unsigned similarity is used in network construction. To simplify notation, we define

To measure the preservation of connectivity patterns within module

The concept of the module eigennode also gives rise to several preservation statistics that in effect measure module density, or, from a different point of view, how well the eigennode represents the whole module. For example, one can use the proportion of variance explained (defined in the fifth section in Supplementary

The

Our statistic

Intuitively, if the internal structure of a module is preserved between a reference and a test network, we expect that a variable with a high module membership in the reference network will have a high module membership in the test network as well; conversely, variables with relatively low module membership in the reference network should also have a relatively low module membership in the test network. In other words, intramodular hubs in the reference network should also be intramodular hubs in the test network. For a given module

A network module is distinct if it is well separated from the other modules in the network. A distinct module in a reference network may be considered well preserved in a test network if it remains well separated from the other modules in the test network. In the following, we describe several separability based preservation statistics. Denote by

In clustering applications based on Euclidean distance it is customary to measure module distinctiveness, or separability, by the between-cluster distance. For correlation networks we propose to measure module separability by 1 minus the correlation of their respective eigennodes. Specifically, for two modules

In the sixth section of Supplementary

Our separability statistic is conceptually related to the separability score used in

Typical values of module preservation statistics depend on many factors, for example, network size, module size, and the number of observations. Thus, instead of attempting to define thresholds for considering a preservation statistic significant, we use permutation tests. Specifically, we randomly permute the module labels in the test network and calculate corresponding preservation statistics. This procedure is repeated
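The permutation procedure can be illustrated with a minimal sketch. It assumes that the preservation statistic is the mean adjacency among the module's nodes in the test network and that the permutation results are summarized as a Z score (observed value minus the permutation mean, divided by the permutation standard deviation). The function names, the number of permutations `n_perm`, and the toy network are hypothetical choices made for illustration; this is not the modulePreservation implementation.

```python
import random
import statistics


def module_density(adj, nodes):
    """Mean adjacency over all distinct pairs of the given nodes."""
    pairs = [(i, j) for k, i in enumerate(nodes) for j in nodes[k + 1:]]
    return sum(adj[i][j] for i, j in pairs) / len(pairs)


def permutation_z(adj, labels, module, n_perm=200, seed=0):
    """Z score of a module's density against randomly permuted module labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    nodes = [i for i, lab in enumerate(labels) if lab == module]
    observed = module_density(adj, nodes)

    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute module labels in the test network
        perm_nodes = [i for i, lab in enumerate(shuffled) if lab == module]
        null.append(module_density(adj, perm_nodes))

    sd = statistics.stdev(null)
    if sd == 0:
        raise ValueError("permutation distribution has zero variance")
    return (observed - statistics.mean(null)) / sd


# Toy test network (hypothetical): nodes 0-3 are tightly connected,
# all other pairs are weakly connected.
n = 12
test_adj = [[1.0 if i == j else (0.9 if i < 4 and j < 4 else 0.1)
             for j in range(n)] for i in range(n)]
labels = ["blue"] * 4 + ["grey"] * 8

z = permutation_z(test_adj, labels, "blue")
```

Because the "blue" nodes are far more densely connected than a random set of four nodes, the resulting Z score is large and positive. Expressing results on the Z scale makes modules of different sizes more comparable than raw statistic values.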

In the sixth section of Supplementary

The relationships derived in Supplementary

Our simulated as well as empirical data show that the separability tends to have low agreement (as measured by correlation) with the other preservation statistics (

Since

It seems intuitive to call a module with

The modulePreservation R function calculates multiple preservation

In some applications such as the human vs. chimpanzee comparison described above, one is interested in ranking modules by their overall preservation in the test set, i.e., one is interested in a relative measure of module preservation. Since our simulations and applications reveal that

While all examples in this article relate to correlation (in particular, co-expression) networks, we have also implemented methods and an R function that can be applied to general networks specified only by an adjacency matrix. For example, this function could be used to study module preservation in protein-protein interaction networks. We also define a composite statistic

A detailed description of the methods is provided in Supplementary

In the second section, we briefly review a hierarchical clustering procedure for module detection. Many methods exist for defining network modules. In this section, we describe the method used in our applications but it is worth repeating that our preservation statistics apply to most alternative module detection procedures.

In the third section, we review the definition of signed and unsigned correlation networks. Correlation networks are a special case of general undirected networks in which the adjacency is constructed on the basis of correlations between quantitative variables.
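As a minimal sketch of the distinction, the code below builds both kinds of adjacency matrix from pairwise Pearson correlations. It assumes the conventions commonly used in weighted correlation network analysis, namely |cor|^β for unsigned networks and ((1 + cor)/2)^β for signed networks. The soft-thresholding power `power` and its default value are hypothetical choices for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_adjacency(columns, power=6, signed=False):
    """Weighted correlation network adjacency.

    columns: one numeric sequence per variable (node).
    power:   soft-thresholding exponent (assumed value, for illustration).
    signed:  False -> |cor|**power; True -> ((1 + cor) / 2)**power.
    """
    n = len(columns)
    adj = [[1.0] * n for _ in range(n)]  # diagonal set to 1 by convention
    for i in range(n):
        for j in range(i + 1, n):
            r = pearson(columns[i], columns[j])
            base = (1.0 + r) / 2.0 if signed else abs(r)
            adj[i][j] = adj[j][i] = base ** power
    return adj


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.0, 4.0, 3.0, 2.0, 1.0]  # perfectly anti-correlated with x

unsigned_adj = correlation_adjacency([x, y], signed=False)
signed_adj = correlation_adjacency([x, y], signed=True)
```

In the unsigned network the two anti-correlated variables are strongly connected, whereas in the signed network their adjacency is close to zero. This is why the sign of the correlation carries additional information.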

In the fourth section, we present module quality statistics, which are implemented in the modulePreservation R function. While our main article focuses on statistics that measure preservation of modules between a reference and a test network, we briefly discuss the application of some of the preservation statistics to the related but distinct task of measuring module quality in a single (reference) network. More precisely, the density and separability statistics can be applied to the reference network without a reference to a test network. The results can then be interpreted as measuring module quality, that is, how closely interconnected the nodes of a module are or how well a module is separated from other modules in the network.
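The idea of using a density statistic as a quality measure within a single network can be sketched as follows. The sketch assumes that density is the mean adjacency over all distinct node pairs in the module. The function name and the toy adjacency matrix are hypothetical.

```python
def module_density(adj, module_nodes):
    """Mean adjacency over all distinct node pairs within a module."""
    pairs = [(i, j)
             for k, i in enumerate(module_nodes)
             for j in module_nodes[k + 1:]]
    if not pairs:
        raise ValueError("a module needs at least two nodes")
    return sum(adj[i][j] for i, j in pairs) / len(pairs)


# Toy reference network (hypothetical): nodes 0-2 form a tight group,
# node 3 is only weakly connected to the rest.
ref_adj = [
    [1.0, 0.8, 0.6, 0.1],
    [0.8, 1.0, 0.7, 0.2],
    [0.6, 0.7, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]

tight = module_density(ref_adj, [0, 1, 2])  # (0.8 + 0.6 + 0.7) / 3
loose = module_density(ref_adj, [0, 1, 3])  # (0.8 + 0.1 + 0.2) / 3
```

A high value for the first node set and a low value for the second reflect that only the first forms a closely interconnected module. No test network is involved in this calculation.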

In the fifth section, we review the notation for the singular value decomposition and for defining a module eigennode. The section describes conditions when the eigenvector

In the sixth section, we investigate relationships between preservation statistics in correlation networks.

The KEGG database and many textbooks describe these fundamental pathways in more detail, but the following terse descriptions may be helpful. The Wnt signaling pathway describes a network of proteins best known for their roles in embryogenesis and cancer, but also involved in normal physiological processes in adult animals. The Hedgehog signaling pathway is one of the key regulators of animal development conserved from flies to humans. The apoptosis pathway mediates programmed cell death. Endocytosis is the process by which cells absorb molecules (such as proteins) from outside the cell by engulfing them with their cell membrane. The Transforming growth factor beta (TGF-

Preservation statistics of human brain modules in chimpanzee samples and vice-versa. This table contains observed preservation statistics and their permutation Z scores of human brain modules in chimpanzee samples and vice-versa. Columns indicate the reference data set, test data set, module label (color), module type, module size, observed preservation statistics, their Z scores, empirical p-values, and Bonferroni-corrected empirical p-values. The grey (improper) modules contain all unassigned genes, and the gold module is a random sample representing the entire network as a single module.

(0.03 MB CSV)

Preservation statistics of male human brain modules in the corresponding female samples and vice-versa. This table contains observed preservation statistics and their permutation Z scores of male human brain modules in the corresponding female samples and vice-versa. Columns indicate the reference data set, test data set, module label (color), module type, module size, observed preservation statistics, their Z scores, empirical p-values, and Bonferroni-corrected empirical p-values. The grey (improper) modules contain all unassigned genes, and the gold module is a random sample of genes representing the entire network as a single module.

(0.26 MB CSV)

Preservation statistics of female mouse liver modules in the corresponding male samples. This table contains observed preservation statistics and their permutation Z scores of female mouse liver modules in the corresponding male samples. Columns indicate the reference data set, test data set, module label (color), module size, observed preservation statistics, their Z scores, empirical p-values, and Bonferroni-corrected empirical p-values.

(0.02 MB CSV)

Preservation statistics of consensus modules across the data sets in which they were identified. This table contains observed preservation statistics and their permutation Z scores of consensus modules across the data sets from which the consensus modules were obtained. Columns indicate the reference data set, test data set, module label (color), module type, module size, observed preservation statistics, their Z scores, empirical p-values, and Bonferroni-corrected empirical p-values. The grey (improper) modules contain all unassigned genes, and the gold module is a random sample representing the entire network as a single module.

(0.34 MB CSV)

Preservation statistics of simulated modules. This table contains observed preservation statistics and their permutation Z scores of simulated modules in our simulation studies. Columns indicate simulation model, module label, simulated status (preserved or non-preserved), observed preservation statistics, Z scores, empirical p-values, and Bonferroni-corrected empirical p-values. The grey (improper) modules contain all unassigned genes, and the gold module is a random sample representing the entire network as a single module.

(0.16 MB CSV)

Detailed methods description. A detailed description of the methods is provided which contains the following sections. First, we describe standard cross-tabulation based module preservation statistics. Specifically, we present three basic cross-tabulation based statistics for determining whether modules in a reference data set are preserved in a test data set. These statistics do not assume that a test network is available. Instead, module assignments in both the reference and the test networks are needed. Second, we briefly review a hierarchical clustering procedure for module detection. Many methods exist for defining network modules. In this section, we describe the method used in our applications but it is worth repeating that our preservation statistics apply to most alternative module detection procedures. Third, we review the definition of signed and unsigned correlation networks. Correlation networks are a special case of general undirected networks in which the adjacency is constructed on the basis of correlations between quantitative variables. Fourth, we present module quality statistics, which are implemented in the modulePreservation R function. While our main article focuses on statistics that measure preservation of modules between a reference and a test network, we briefly discuss the application of some of the preservation statistics to the related but distinct task of measuring module quality in a single (reference) network. More precisely, the density and separability statistics can be applied to the reference network without a reference to a test network. The results can then be interpreted as measuring module quality, that is, how closely interconnected the nodes of a module are or how well a module is separated from other modules in the network. Fifth, we review the notation for the singular value decomposition and for defining a module eigennode. The section describes conditions when the eigenvector E is an optimal way of representing a correlation module. It also reviews the definition of the proportion of the variance explained (PVE) by the eigennode. We derive a relationship between PVE and the module membership measures kME, which will be useful for deriving relationships between preservation statistics. Sixth, we investigate relationships between preservation statistics in correlation networks. An advantage of an (unsigned) weighted correlation network is that it allows one to derive simple relationships between network concepts (Horvath and Dong 2008). We characterize correlation modules where simple relationships exist between i) density-based preservation statistics, ii) connectivity-based preservation statistics, and iii) separability-based preservation statistics. Apart from studying relationships among preservation statistics in correlation networks, we also briefly describe relationships between preservation statistics in general networks.

(0.17 MB PDF)

Details regarding module preservation between human and chimpanzee brain networks. In this document we provide detailed results regarding the preservation of human brain modules in chimpanzee brains.

(0.22 MB PDF)

Detailed description of the human brain. In this document we provide detailed results of Application 4: Preservation of cortical modules between male and female samples.

(2.51 MB PDF)

Detailed description of female mouse liver modules in male mice. Detailed results of Application 5: Preservation of female mouse liver modules in male mice.

(3.82 MB PDF)

Detailed description of the consensus module application. Here we study preservation of consensus modules constructed previously, namely the consensus modules across human and chimpanzee brain samples, across samples from 4 tissues of female mice, and across samples from male and female mouse livers.

(1.41 MB PDF)

Detailed description of the simulation study. Detailed performance analysis of the proposed module preservation statistics in seven simulation scenarios. The design and main results of the simulations are summarized in

(2.78 MB PDF)

We thank our UCLA collaborators Tova Fuller, Chaochao Cai, Lin Song, Jeremy Miller, Dan Geschwind, Giovanni Coppola, Aldons J. Lusis, Art Arnold, and Roel Ophoff for their input. The mouse data were generated by the lab of A.J. Lusis.