
Conceived and designed the experiments: VV DJ. Performed the experiments: VV DJ. Analyzed the data: DJ. Contributed reagents/materials/analysis tools: VV DJ. Wrote the paper: VV DJ. Other: Developed the web site: VV.

The authors have declared that no competing interests exist.

Although the use of microarray technology has seen exponential growth, analysis of microarray data remains a challenge to many investigators. One difficulty lies in the interpretation of a list of differentially expressed genes, or in how to plan new experiments given that knowledge. Clustering methods can be used to identify groups of genes with similar expression patterns, and genes with unknown function can be provisionally annotated based on the concept of “guilt by association”, where function is tentatively inferred from the known functions of genes with similar expression patterns. These methods frequently suffer from two limitations: (1) visualization usually gives access only to group membership, rather than to specific information about nearest neighbors, and (2) the resolution or quality of the relationships is not easily inferred.

We have addressed these issues by improving the precision of similarity detection over that of a single experiment and by creating a tool to visualize tractable association networks: we (1) performed meta-analysis computation of correlation coefficients for all gene pairs in a heterogeneous data set collected from 2,145 publicly available microarray samples in mouse, (2) filtered the resulting distribution of over 130 million correlation coefficients to build new, more tractable distributions from the strongest correlations, and (3) designed and implemented a new Web-based tool (StarNet,

Correlations were calculated across a heterogeneous collection of publicly available microarray data. Users can access this analysis using a new freely available Web-based application for visualizing tractable correlation networks that are flexibly specified by the user. This new resource enables rapid hypothesis development for transcription regulatory relationships.

Several approaches to microarray data analysis make use of clustering techniques

Synthesis and visualization of publicly available data remains a challenge for biologists. Available microarray data is thus typically not exploited beyond the scope of the original experiment. Visualization platforms such as Cytoscape

Dynamic Bayesian networks offer a viable approach for the discovery of gene regulatory network topology

A total of 2,145 array samples were selected for the Affymetrix whole genome mouse 430 2.0 array platform. Data were normalized and scaled using justRMALite. The resulting distribution of over 130 million Pearson correlation coefficients was filtered to produce various distributions of the strongest relationships. Correlation data and Entrez Gene annotations were used to populate a new database. StarNet was developed to allow users to query the database, to create and draw correlation networks local to their gene of interest on the fly, and to view supporting information about the genes in those networks.

We present a user-directed approach to network elucidation, and provide an intuitive Web-based interface (StarNet,

We selected 2,145 sample hybridizations performed on the Affymetrix GeneChip Mouse Genome 430 2.0 Array which are available from the Gene Expression Omnibus (GEO)

Features on the array were mapped to Entrez Gene

Pearson correlation coefficients were calculated for all pairwise comparisons of genes on the array using Octave. This yielded 132,787,956 coefficients for each cohort.
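The pairwise computation can be illustrated with a minimal sketch. It is written in Python rather than Octave, the gene names and expression values are hypothetical, and the gene count of 16,297 is an inference: it is the value of n for which n(n−1)/2 equals the reported 132,787,956 coefficients.

```python
from itertools import combinations
from math import comb, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def all_pairs(expression):
    """Map each unordered gene pair to its correlation across samples."""
    return {
        (g1, g2): pearson(expression[g1], expression[g2])
        for g1, g2 in combinations(sorted(expression), 2)
    }


# Toy expression matrix: hypothetical gene names, four samples each.
toy = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
correlations = all_pairs(toy)

# Number of unordered pairs among 16,297 genes (inferred gene count).
n_pairs = comb(16297, 2)
```

At the scale of the real data set a vectorized matrix computation would be needed; the explicit loop here only makes the definition visible.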

Several subsets of the collection of correlation coefficients were built. First, we selected the 20,000 (20K) largest positive correlation coefficients. This procedure was repeated for 40,000 (40K) and 100,000 (100K) coefficients. The 20K, 40K, and 100K sub-distributions were also formed for the largest negative coefficients. We further considered the union of positive and negative “extreme tails”, for each of the three sizes. This procedure was executed for both the full and cardiac cohorts, yielding a total of 18 different sub-distributions.
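The tail selection described above can be sketched as follows. This is an illustrative Python version, not the MySQL and Perl procedure used for StarNet; the function name `extreme_tails` and the toy coefficients are hypothetical.

```python
import heapq


def extreme_tails(correlations, k):
    """Return the k largest, the k smallest, and the union of both tails.

    Assumes the data hold at least k positive and k negative coefficients,
    so the largest k are all positive and the smallest k are all negative.
    """
    items = list(correlations.items())
    positive = dict(heapq.nlargest(k, items, key=lambda kv: kv[1]))
    negative = dict(heapq.nsmallest(k, items, key=lambda kv: kv[1]))
    union = {**positive, **negative}
    return positive, negative, union


# Hypothetical coefficients keyed by gene pair.
toy = {
    ("a", "b"): 0.95,
    ("a", "c"): -0.80,
    ("a", "d"): 0.10,
    ("b", "c"): -0.60,
    ("b", "d"): 0.90,
    ("c", "d"): 0.05,
}
pos, neg, both = extreme_tails(toy, 2)
```

Applying three sizes (20K, 40K, 100K) to three tail types (positive, negative, union) in two cohorts gives the 3 × 3 × 2 = 18 sub-distributions.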

To guarantee that each gene on the array is represented in our distributions, a “genecentric” distribution was built. The ten largest positive correlations to each gene were selected, with the proviso that the p-value of the correlation was less than 0.05. This was repeated for negative correlations, and again the union of positive and negative correlations was considered. This procedure was carried out for both the full and cardiac cohorts, yielding an additional six distributions.
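A minimal sketch of the genecentric selection for positive correlations is given here. The record layout `(gene1, gene2, r, p_value)` and the function name `genecentric` are assumptions made for illustration; the negative case would mirror it with the sign test and sort order reversed.

```python
from collections import defaultdict


def genecentric(records, per_gene=10, alpha=0.05):
    """Keep, for every gene, its strongest significant positive partners.

    records: iterable of (gene1, gene2, r, p_value) tuples.
    Returns {gene: [(partner, r), ...]} sorted by decreasing r.
    """
    by_gene = defaultdict(list)
    for g1, g2, r, p in records:
        if r > 0 and p < alpha:
            # Each pair is a candidate neighbor for both of its genes.
            by_gene[g1].append((g2, r))
            by_gene[g2].append((g1, r))
    return {
        gene: sorted(partners, key=lambda item: item[1], reverse=True)[:per_gene]
        for gene, partners in by_gene.items()
    }


# Hypothetical records: (gene1, gene2, r, p-value).
toy = [
    ("a", "b", 0.90, 0.001),
    ("a", "c", 0.70, 0.010),
    ("a", "d", 0.95, 0.200),  # strong but not significant: dropped
    ("b", "c", 0.40, 0.030),
]
top = genecentric(toy, per_gene=1)
```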

We built two further classes of “specialty” distributions, each a variant on the genecentric distribution. First, the genecentric construction was repeated, but constrained to those genes whose GO

Both sets of correlation coefficients were loaded into a MySQL database, and the partitioning of the set of correlation coefficients was executed using MySQL database calls scripted with Perl.

The database was also populated with Entrez Gene data and Gene Reference Into Function (RIF) files available at NCBI's FTP site (

The network construction algorithms were implemented in Perl. The algorithms allow a variety of choices for defining network topology. For details see the user manual at

The CGI script that takes user input from the StarNet submission page and produces the results pages was written in Perl. Graphs are drawn using AT&T's Graphviz package (

For further details regarding scales, procedures used to build graphs of known interactions, and other details of the visualization script, see the white paper and user manual available at

StarNet takes a user-specified gene as input, as well as the parameters indicated below. Using the distributions described above, a network is then drawn centered about the specified gene. The gene of choice is level 0; those genes to which it is directly connected by correlations from the distribution of choice, and using the graphing methodology of choice, are level 1, etc. Two graphs are produced, one for the cardiac cohort and one for the full cohort. In addition to the correlation graphs we provide
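The notion of levels can be illustrated with a breadth-first traversal. This is a sketch under the assumption that a gene's level is its shortest-path distance from the central gene; StarNet's own graphing methodologies are described in its user manual and may differ. The function name and gene labels are hypothetical.

```python
from collections import deque


def network_levels(edges, center, max_level):
    """Assign each reachable gene its distance (level) from the center gene."""
    neighbors = {}
    for g1, g2 in edges:
        neighbors.setdefault(g1, set()).add(g2)
        neighbors.setdefault(g2, set()).add(g1)

    levels = {center: 0}
    queue = deque([center])
    while queue:
        gene = queue.popleft()
        if levels[gene] == max_level:
            continue  # do not expand past the requested depth
        for partner in sorted(neighbors.get(gene, ())):
            if partner not in levels:
                levels[partner] = levels[gene] + 1
                queue.append(partner)
    return levels


# Hypothetical undirected correlation edges.
toy_edges = [("g0", "g1"), ("g0", "g2"), ("g1", "g3"), ("g3", "g4")]
levels = network_levels(toy_edges, "g0", max_level=2)
```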

To use StarNet (available at

For further details on and explanation of parameter choices, in particular for choices of network topology, see the user manual available at

Pearson correlation coefficients between genes on the array were computed using Octave (full cohort: n = 2,145, cardiac cohort: n = 239). A two-tailed t-test was used to compute p-values for each coefficient. After using the Fisher z-transformation to normalize the correlation coefficients, confidence intervals in the normalized setting were computed, and the inverse of the Fisher z-transform applied to yield confidence intervals in our original variables.
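The confidence interval construction can be sketched in Python as follows. The standard error 1/√(n−3) and the 95% level are assumptions of this sketch, since the text does not state them. The t statistic shown is the usual one for testing a zero correlation with n−2 degrees of freedom. Converting it to a p-value requires a t-distribution CDF, which the Python standard library does not provide, so that step is omitted.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def t_statistic(r, n):
    """t statistic for testing r = 0, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1.0 - r * r))


def fisher_ci(r, n, confidence=0.95):
    """Confidence interval for r via the Fisher z-transformation."""
    z = atanh(r)                      # Fisher z-transform
    se = 1.0 / sqrt(n - 3)            # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return tanh(z - crit * se), tanh(z + crit * se)  # back-transform to r


# Same illustrative r = 0.5 evaluated at the two cohort sizes.
cardiac_low, cardiac_high = fisher_ci(0.5, 239)
full_low, full_high = fisher_ci(0.5, 2145)
t_cardiac = t_statistic(0.5, 239)
```

For a given coefficient, the interval narrows as the sample size grows, so the full cohort (n = 2,145) gives tighter intervals than the cardiac cohort (n = 239).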

Enrichment of GO terms was evaluated using the hypergeometric test, following the recommendations of Rivals and colleagues
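As an illustration of the hypergeometric test, the sketch below computes a one-sided upper-tail p-value. The choice of a one-sided test, the function name, and all counts are assumptions for the example and are not taken from the analysis itself.

```python
from math import comb


def hypergeom_upper_tail(population, annotated, drawn, hits):
    """P(X >= hits) when drawing `drawn` genes without replacement.

    population: genes in the background set
    annotated:  background genes carrying the GO term
    drawn:      genes in the network being tested
    hits:       network genes carrying the GO term
    """
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    return sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(hits, upper + 1)
    ) / total


# Hypothetical counts: 20 background genes, 5 carrying the term,
# a network of 4 genes, 3 of which carry the term.
p_value = hypergeom_upper_tail(20, 5, 4, 3)
```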

For each of our distributions we computed the mean, standard deviation, and skew. Skew was computed without bias correction. We ran several tests for normality on the distributions: Kolmogorov-Smirnov, Lilliefors, and Jarque-Bera. All were run at the 5% significance level. The Kolmogorov-Smirnov test was run using the sample mean and standard deviation as the parameters of the normal distribution against which our empirical distribution was compared. As the sub-distributions were all found to be non-normal, we used the Mann-Whitney rank sum test to compare respective sub-distributions from the cardiac and full cohorts. All tests were performed using MATLAB.
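The summary statistics can be sketched as follows, in Python rather than MATLAB. Skew without bias correction is taken to be the moment ratio m3 / m2^(3/2). The n−1 denominator for the standard deviation is an assumption of this sketch, as the text does not say which was used.

```python
from math import sqrt


def summarize(values):
    """Return (mean, standard deviation, skew) of a sample.

    Skew is the uncorrected moment ratio m3 / m2**1.5. The standard
    deviation uses the n - 1 denominator, which is an assumption here.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    std = sqrt(m2 * n / (n - 1))
    skew = m3 / m2 ** 1.5
    return mean, std, skew


# Hypothetical right-skewed sample.
mean, std, skew = summarize([1.0, 2.0, 3.0, 10.0])
```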

The main contribution of this work is the creation of StarNet, a new freely available Web-based tool that facilitates the reconstruction of transcription regulatory networks by enabling rapid hypothesis development and by providing provisional gene groups for focused modeling efforts. A brief description of the tool's features and usage, along with links to supporting documentation, is given in the

The genecentric distribution and the distribution of the 100,000 (100K) largest correlations (see

| Distribution | Mean | Standard deviation | Skew | Genes represented |
| --- | --- | --- | --- | --- |
| Cardiac 20K Negative | −0.7951 | 0.0231 | −1.2851 | 2,746 |
| Full 20K Negative | −0.5150 | 0.0264 | −1.6448 | 3,486 |
| Cardiac 20K Positive | 0.9568 | 0.0117 | 1.2876 | 1,534 |
| Full 20K Positive | 0.9559 | 0.0167 | 0.5900 | 1,494 |
| Cardiac 40K Negative | −0.7755 | 0.0259 | −1.2907 | 3,712 |
| Full 40K Negative | −0.4944 | 0.0282 | −1.5856 | 4,734 |
| Cardiac 40K Positive | 0.9458 | 0.0141 | 1.0664 | 2,067 |
| Full 40K Positive | 0.9361 | 0.0239 | 0.5426 | 2,265 |
| Cardiac 100K Negative | −0.7457 | 0.0304 | −1.2628 | 5,122 |
| Full 100K Negative | −0.4648 | 0.0309 | −1.5276 | 6,670 |
| Cardiac 100K Positive | 0.9263 | 0.0192 | 0.8767 | 3,362 |
| Full 100K Positive | 0.8957 | 0.0386 | 0.5479 | 4,077 |
| Cardiac Genecentric Negative | −0.5907 | 0.1342 | 0.3833 | 16,297 |
| Full Genecentric Negative | −0.3568 | 0.1048 | 0.4748 | 16,295 |
| Cardiac Genecentric Positive | 0.7678 | 0.1126 | −0.2972 | 16,297 |
| Full Genecentric Positive | 0.6856 | 0.1371 | 0.1511 | 16,297 |

The mean of the 100K most positive correlations in the cardiac cohort (0.9263) is statistically different from the mean of the 100K most positive correlations in the full cohort (0.8957) with p<1e-16 (Mann-Whitney rank sum test). The same is true of the means of the 100K most negative correlations in the cardiac and full cohorts (p<1e-16), as well as of both the positive and negative genecentric distributions (p<1e-16). The correlations in the cardiac cohort show a general trend of being more extreme than those in the full cohort (

a: The cardiac cohort (dark blue) and full cohort (light blue) Genecentric distributions. b: The largest positive correlations in the cardiac cohort. The highest 335 of these associations have a Pearson correlation of 1 (to 16 decimal places), and these form a notable spike at the tail of the distribution (arrow). c: The 335 associations indicated in panel b are composed of 80 genes in the 22 groups of genes shown here. d: The genomic structure of the Pcdhgb1 family of genes (drawn with the UCSC Genome Browser).

The full cohort represents a large number of regulatory states from a variety of tissues, whereas the cardiac cohort represents a smaller number of related regulatory states. The bimodal distributions from the full cohort thus represent an average state consisting of coregulatory and correlative relationships that are weaker, on average, than those of the cardiac distributions. There are two main factors that contribute to this observed difference between the cardiac and full cohort sub-distributions:

At the positive tail of both the cardiac and full cohort highest correlations, there is a spike of 335 correlations equal to 1 (to a precision of 16 decimal places,

As a representative example, below we analyze

a: Full cohort correlation network with

We have noted that known markers for cardiac development appear together more frequently in full cohort networks than in cardiac cohort networks. In the full cohort, where the network is constructed from a more general milieu of associations, genes specifically active in embryonic stages are prominently associated, although with smaller correlations, on average, than in the cardiac network. Upon examining a finer resolution of associations in the cardiac/development milieu alone (i.e., upon conditioning the measured associations on a narrower range of tissue types), we find that the known markers for embryonic or cardiac tissue types that are prominent in the full cohort networks are frequently displaced by other genes that are more highly correlated within the narrower milieu of the cardiac cohort. Higher correlations between expressed genes are an expected result of conditioning on a narrower field of tissue types. We also expect that a systematic comparison of the high-ranking correlations from each of the cohorts, where networks are built about selected genes implicated in cardiac development, will reveal previously uncharacterized features of cardiac transcriptional regulatory networks.

Correlation, while indicating relationships, does not imply causality. For this reason, the networks built by StarNet should not be viewed as directional, or as indicating that any given gene in the graph is a direct influence on any other. Important relationships are captured by correlation, however, and may thus suggest further experimentation or modeling. Recent work has indicated the utility of correlation as a measure of gene co-expression relationships. For example, Reiss and colleagues

The full cohort and cardiac cohort networks given here as examples of StarNet's analysis are not immediately amenable to quantitative comparisons. One obvious obstacle to comparison is that the networks do not have any nodes in common besides the central node. Many networks drawn with StarNet do have several nodes in common, but the common nodes are frequently a minority of the total network nodes. The networks are constructed with arbitrary cutoffs for the highest correlations with a given node, so many biologically important associations may be missing from a particular network. One approach to comparing these networks would then be to create a ‘super-network’ for each cohort, where all the unique nodes from the full and cardiac network are combined, and a completely connected network is created from the original distribution (full or cardiac) of correlation coefficients. These completely connected networks can be analyzed using the tools of social network analysis

The methodology and algorithms developed to create StarNet may be easily applied to other organisms and other platforms, and any subset of the arrays may be selected as a cohort. Future efforts will expand the utility of StarNet in these areas and will consider comparisons of more than two cohorts.

StarNet adds a useful tool to the repertoire of the biomedical scientist. It is easy to use, and the results are readily interpretable. It can be used in conjunction with the other tools at the biologist's disposal, either as a tool for generating hypotheses for new experimental investigations, or as the first step towards reconstructing and modeling transcriptional regulatory networks.

We thank Ben Bolstad for his help in implementing justRMALite. Cynthia Meininger, David Zawieja, Harris Granger, Hung-Chung Huang, Tetsuya Tanaka, and James Littlejohn provided valuable feedback on the development of the Web interface.