A Scale-Free Structure Prior for Graphical Models with Applications in Functional Genomics

The problem of reconstructing large-scale, gene regulatory networks from gene expression data has garnered considerable attention in bioinformatics over the past decade with the graphical modeling paradigm having emerged as a popular framework for inference. Analysis in a full Bayesian setting is contingent upon the assignment of a so-called structure prior—a probability distribution on networks, encoding a priori biological knowledge either in the form of supplemental data or high-level topological features. A key topological consideration is that a wide range of cellular networks are approximately scale-free, meaning that the fraction, , of nodes in a network with degree is roughly described by a power-law with exponent between and . The standard practice, however, is to utilize a random structure prior, which favors networks with binomially distributed degree distributions. In this paper, we introduce a scale-free structure prior for graphical models based on the formula for the probability of a network under a simple scale-free network model. Unlike the random structure prior, its scale-free counterpart requires a node labeling as a parameter. In order to use this prior for large-scale network inference, we design a novel Metropolis-Hastings sampler for graphical models that includes a node labeling as a state space variable. In a simulation study, we demonstrate that the scale-free structure prior outperforms the random structure prior at recovering scale-free networks while at the same time retains the ability to recover random networks. We then estimate a gene association network from gene expression data taken from a breast cancer tumor study, showing that scale-free structure prior recovers hubs, including the previously unknown hub SLC39A6, which is a zinc transporter that has been implicated with the spread of breast cancer to the lymph nodes. Our analysis of the breast cancer expression data underscores the value of the scale-free structure prior as an instrument to aid in the identification of candidate hub genes with the potential to direct the hypotheses of molecular biologists, and thus drive future experiments.


Gene Regulatory Networks and Gene Expression Data
Knowledge of the interactions among genes and gene products that occur within a cell is vital for understanding cellular behavior. These activities are largely a consequence of gene expression, the process whereby genes transcribe signature mRNA molecules that are translated into gene products of numerous kinds and functions. As it happens, genes do not express independently of one another; instead, their activities are coordinated in a complex system of control in which distinguished genes, called transcriptions factors, regulate the expression of other genes via their gene product proxies.
An undirected network G is a mathematical object consisting of a set of nodes and a set of unordered pairs of nodes called undirected edges. It differs from a directed network, which is also denoted by G, in that the latter is defined in terms of ordered pairs of nodes known as directed edges. Applying these straightforward abstractions to cellular processes has gained currency throughout the biosciences, so much so that a network mind-set has become a necessary precondition for thinking about systems of gene regulatory interactions. For the purposes of this paper, a gene regulatory network is a directed network in which genes are identified with nodes and regulatory interactions with directed edges. From a purely statistical standpoint, it is best to regard a gene regulatory network as a convenient depiction of the true regulatory interactions of a system that, in reality, must be estimated from data.
Indeed, the network approach toward understanding gene regulatory systems only came to prominence in response to the advent of DNA microarray technology, which makes the profiling of mRNA expression levels for individual genes possible on a genome-wide scale. A typical experiment consists of a library of n expression profiles, each one a snapshot of the expression levels for p genes under a different experimental condition. The raw expression profile data is preprocessed and then arranged by row in an n|p data matrix, D. In practice, not only is gene expression data notoriously noisy [1], but to make matters worse the number of samples is typically at least an order of magnitude smaller than the number of genes, that is, n%p (the ''small n, large p'' problem), making the inference of regulatory interactions a challenging statistical problem [2].

Graphical Models
Graphical models [4], [5], [6] are a suite of probabilistic models, widely used for estimating large-scale gene regulatory networks from gene expression data [7]. In this framework, genes are identified with the random variables of a multivariate distribution X with covariance matrix S, and each row of D is taken as a random sample from X . The conditional independence structure of X defines a network with the random variables as nodes and conditional dependencies latent between the random variables as directed or undirected edges; a diversity of models arise from the extent to which the dependencies are resolved [8].
Relevance networks comprise the simplest class of graphical model with absent edges corresponding to marginal independencies between the components of X . These networks have long been used in the analysis of genetic data [9]. But in terms of identifying regulatory interactions, relevance networks are bound to be misleading because marginal independence alone cannot discriminate among direct and indirect dependencies.
GANs provide a better alternative, circumventing this drawback by appealing to conditional independence as a criterion for edge exclusion. Gaussian graphical models (GGMs) are the gold standard. In a GGM, a pair of nodes do not share an edge when their underlying random variables from X are conditionally independent given all of the remaining random variables. However, GGMs too are not without disadvantages, as their estimation can be computationally intensive in a ''small n, large p'' setting [10]. A class of GANs, bridging the gap between relevance networks and GGMs, has been advanced with this consideration in mind where absent edges are identified with lower order conditional independencies [11], [12], [13].
Lastly, Bayesian networks are a variety of graphical model founded on a more refined notion of conditional independence, conferring directionality to the edges; they are also well-established as a methodology for estimating gene regulatory networks [14].

The Structure Prior
Inference within the graphical modeling paradigm amounts to an often painstaking exercise in covariance estimation and model selection. We defer a discussion on the problem of covariance estimation to the Methods section. That is because our interest pertains to model selection, which in a Bayesian setting is accomplished by sampling from the posterior distribution over the appropriate space of networks using either heuristic searches or else Markov chain Monte Carlo (MCMC). The term P(DDG) is the likelihood and p(G) the structure prior, that is, a prior assigning a probability to each possible network. The role of the structure prior is to direct inference toward graphical models consistent with biological prior knowledge, which may come in the form of a priori topological considerations or from a posteriori sources apart from the dataset. As far as the latter is concerned, previous research has concentrated on Bayesian networks [15], [16], [17], [18]. On the other hand, biologically-motivated topological assumptions are a consistent feature of graphical models tailored for genetic data. Heuristic search strategies often include implicit assumptions concerning network sparsity [19], [20], [21], [22]. In instances in which the structure prior is given explicit specification, standard practices include using a uniform prior capped at a small number of potential regulators per gene [23], or assigning it as a sparse random network [24], [25].

Random and Scale-Free Networks
The theory of random networks was given its first systematic expression by Erdös and Rényi [26], [27]. According to the theory, a p-node random network is defined by an eponymous, generating algorithm -the Erdös-Rényi (ER) model -that works by connecting each pair of nodes in an initially empty network independently with probability b. This simple procedure gives rise to a probability distribution over the space of p-node networks, which is used to define the so-called random structure prior, p r (G). The degree distribution P(k) -the probability that a given node is connected to k other nodes -of a random network is binomially distributed according to where the degree, k, of a node denotes the number of edges incident upon it. It follows, therefore, that degree in a random network has a strong central tendency, implying that the average degree of a random network is representative of the degree of a typical node. Empirical studies, however, have firmly established that a wide variety of large-scale networks in nature, society, and technology exhibit heavy-tailed degree distributions that cannot be accounted for by random network theory [28], [29], [30]. This property is often approximately described by a power-law degree distribution, P(k)!k {c , over a large range of k with exponent c typically between 2 and 3. A network that follows a power-law is called scale-free. It gets this name because the functional form of P(k) is retained under a scaling of the argument k by a constant factor a: P(a|k)!P(k). The scale-free property is thought to be a key organizational feature of cellular networks [31], and analyses suggest such an architecture for the gene regulatory networks of the model organisms S. cerevisiae [32] and C. elegans [33].

Introducing a Scale-Free Structure Prior
Proposing a structure prior which incorporates the scale-free property is the topic of this paper. We define the scale-free structure prior, p sf (G), according to the probability of a network under a simple, scale-free network model. As for the underlying network model itself, a multitude of candidates have been proposed in the literature [34]. They fall into two broad categories: 1) growing models, where a network is generated via the successive addition of nodes and edges to a small seed network, and 2) nongrowing models, where to a fixed number of nodes, pairs of nodes are chosen randomly and connected by edges.
The growing model approach employs a handful of simple universal mechanisms, thought to underpin disparate natural phenomena, to drive the stochastic evolution of networks toward power-laws. Preferential attachment is, perhaps, the best known mechanism. The idea works something like this: the probability of attaching an edge from a newly added node to a node already in the system is roughly proportional to the degree of the old node. The Bárabasi-Albert (BA) model [35] is the latter-day progenitor of a wide variety of preferential attachment models. The BA model generates a network via the successive addition of nodes and edges to a small seed network. At each step, a node is added to the system with a fixed number of emanating edges, which are subsequently preferential attached to the existing nodes. The resulting network follows a power-law with c~3 on average.
Preferential attachment is not considered to be the main driving force behind genome evolution; instead, gene (node) duplication and point mutations (edge dynamics) play the dominant role in shaping of gene regulatory networks [36]. The duplication model as formulated by [37] is such a network model, which in an analysis by [38] is suggested to approximately follow a power-law.
By contrast, in the non-growing approach, each node is assigned a fixed weight with the probability of a particular network depending on those weights. The ER model is an example of a non-growing model with uniform weights. Another non-growing model is the static model [39], which is a generalization of the ER model that has been shown to follow a power-law with c tunable to any value greater than 2, depending on the specification of the model parameters; see Methods for details. We use the static model to define p sf (G). Indeed, this model is an appealing candidate for the purpose as the probability of a network is easy to compute compared with growing models of similar complexity. Moreover, the static model actually includes the ER model as a limiting case.

A New Metropolis-Hastings Sampler for Networks
We implement an MCMC algorithm with p sf (G) for GGMs adapted from [25], although it is important to point out that our methodology applies to graphical models in general. Reworking the algorithm is not simply a matter of plugging in a formula for p sf (G) because it depends furthermore on a labeling of the nodes of G. Confronted with this complication, we design a novel Metropolis-Hastings sampler that solves the problem by including a node labeling, s, which is defined in the Methods section, as a variable in the state space, thereby allowing it to be estimated.

Summary of Contributions
In this paper, we advance a scale-free structure prior, p sf (G), for graphical models defined by the formula for the probability of a network under the static model. Our objective is to compare the performance of this prior with the commonly used random structure prior, p r (G), in the arena of simulation as well as with a real data example. We choose GGMs for this purpose, modifying the MCMC algorithm of [25] to include p sf (G). As mentioned above, one challenge of implementing p sf (G) is that, unlike with p r (G), it requires a labeling of the nodes of G. We address this issue by introducing a Metropolis-Hastings sampler that includes the node labeling as a variable in the state space.
In a simulation study, we generate networks with given degree distributions together with Gaussian data in accordance with their implied conditional independence structures. As a case study we show that p r (G) and p sf (G) are equally effective at recovering a random network, but that p r (G) is comparatively ineffective at recovering a scale-free network. In the full simulation study, we confirm that the aforementioned result holds, illustrating our main conclusion: p sf (G) recovers random networks on an equal footing with the p r (G), yet surpasses it in recovering scale-free networks. Finally, we illustrate our methodology by analyzing a real gene expression dataset taken from a breast cancer tumor study by [40], showing that in contrast with the random structure prior, the scale-free structure prior recovers hubs, including the estrogen regulator FOXA1 and the zinc transporter SLC39A6, which was previously unrecognized as a hub.

Network Notation
The terms ''network'' and ''graph'' are used synonymously throughout this paper. An undirected network G~(V ,E) is a mathematical object defined by a set of nodes V~v 1 ,v 2 , . . . ,v p È É together with a set of undirected edges E consisting of unordered The set of all pnode, undirected networks is denoted by G p . A directed network is defined in an analogous manner, save that the elements of E are ordered pairs v i ,v j À Á called directed edges; v i is called the parent and v j the child.
It should be understood that a network refers to an undirected network, and likewise an edge is to be understood to mean an undirected edge. However, the following definitions are applicable to both undirected and directed networks. An empty network has no edges, that is E~1, while, in a complete network E is defined as the cross product V |V . A subnetwork of G is a network whose node set V ' is a subset of V , and whose edges are a subset of E restricted to V '. The subnetwork of G induced by a given subset of nodes V '(V is the subnetwork containing all edges from E that connect nodes in V '. Two nodes are said to be neighbors when they are connected by an edge. And, a network is itself connected when every pair of nodes is connected by a sequence of neighbors. Finally, a node labeling s~s 1 ,s 2 , . . . ,s p À Á is a permutation of the integers 1,2, . . . ,p, applied to the nodes of G so that each v i [V is represented by the integer s i ; see Figure 1. This node labeling is used later for defining the structure prior.

Gaussian Graphical Models
In this section we sketch out the theory of GGMs essential to this paper. A detailed overview of the GGM estimation procedures outlined here is described in [25], while [8] is a good starting point for understanding the niche they occupy in the larger context of graphical models.
Let X~X 1 ,X 2 , . . . ,X p À Á be a p-dimensional Gaussian random vector with zero-mean and positive definite covariance matrix, S. Two random variables X i and X j are not conditionally independent given the remaining variables in X if, and only if, there is a corresponding nonzero entry in the precision matrix, V~S {1 [41]. The conditional independence structure of X can be represented by a network, G, where X i is the value at node v i and there is an edge between v i and v j when X i and X j are not conditionally independent. A GGM for X is the family of pdimensional Gaussian distributions from N(0,S), constrained by the structure of G.
Fitting a GGM to a given dataset -a task known as covariance selection -amounts to identifying zeros in the estimated precision matrix. In the classical setting when nwp, ensuring that S is positive definite, this is typically accomplished by inverting the estimated covariance matrix and then applying statistical tests to identify any entries significantly different from zero [4]. With genomic data, however, ''small n, large p'' is the norm and, consequently, S will not generally be invertible.
This problem can be addressed in one of two ways. One way calls for restricting inference to pairwise independencies conditioned on fewer than all p{2 remaining random variables. A relevance network, for example, is constructed by estimating the pairwise correlations between all random variables, connecting any pair with correlation exceeding a specified cutoff value [9]. A related approach goes one step beyond a relevance network by estimating a GAN based on not only marginal but also first-order conditional independencies [11].
A more ambitious approach is to compute satisfactory small sample estimates for S and V using Bayesian methods. Empirical Bayesian solutions are exemplified by shrinkage estimates [21] and sparsity encouraging lasso regression estimates [22]. Meanwhile, the full Bayesian scheme of [42] works by marginalizing over S to compute the likelihood term in (1), using a prior that constrains elements of the precision matrix to zero depending on G: The term P(DDS,G) is multivariate Gaussian, while the prior P(SDG) is hyper-inverse Wishart with hyperparameters W, a positive definite dispersion matrix, and dw0, a degrees of freedom parameter. Jones et. al [25] advise a small value for d as a reflection of ignorance, and take W as the diagonal matrix tI, which assumes that the underlying Gaussian variables have common variance. A consequence of this assignment is that d can be used to specify t by making use of the fact that the marginal prior mode for each variance term is var(X i )~t=(dz1). GGM theory comes equipped with powerful techniques for computing the likelihood function when the underlying network is decomposable. Roughly speaking, a decomposable network can be broken down into distinguished subsets of nodes called maximal cliques. A clique is a subset of nodes whose induced subgraph is complete, and is called maximal when it is not contained within a larger complete subgraph. Computing the likelihood for a subset of X corresponding to a maximal clique is particularly tractable because the density is just an unrestricted multivariate Gaussian. Hence, when a network is decomposable, the evaluation of the likelihood term in (1) reduces to the computation of many likelihoods of smaller dimension [43]. We will return to these issues in the section on our MCMC implementation.

The Static Model
A network model is a stochastic algorithm for generating networks that may depend on a vector of parameters, h~h 1 ,h 2 , . . . ,h M ð Þ . Associated with any model is a probability distribution, assigning a probability P(GDh,s) §0 to each G[G p , where s is a node labeling of G.
The static network model [39] works by first assigning a weight w i~s ,v p where m, the Zipf exponent, is a tunable parameter in (0,1). To generate a network, G, the following step is repeated p|K (p {m %K%p 1{m ) times: select nodes v i and v j with probabilities w i and w j and connect them with an edge, unless v i~vj or v i and v j are already connected, in which case no edge is added to the network. The overall model parameter is h~(m,K).
In order to work out the functional form of the degree distribution, it is enough to notice that, on average, each node acquires edges in proportion to its weight. Supposing for a moment that k i denotes the degree of node v i , we may write this as k i !s {m i . The probability distribution over the k i 's is known as Zipf's law, and it has been shown to be equivalent to a power-law degree distribution with c~1z1=m [44]. It follows that the static model generates networks that follow a power-law with 2vcv? depending on the choice of m. A rigorous derivation of the powerlaw appears in a comprehensive analysis of the static model by Lee et al. [45]. In the case when 1=2vmv1, the exponent, c, lies between 2 and 3, which is the most interesting range of values from the point of view of scale-free architecture. In contrast, for values of mv1=2, which corresponds to cw3, the tail of P(k) is less pronounced. In the limit of m?0, or equivalently c??, each weight w 1 , . . . ,w p tends to 1=p, resulting in the ER model with edge inclusion probability b~1{(1{2=p 2 ) pK . To be clear, the static model actually includes the ER model as a special case.
A formula for the probability of a network is provided in the same analysis. The probability that nodes v i and v j are connected in the final network is 1{(1{2w i w j ) pK , which is well-approximated by e {2pKwiwj when p is large. The probability of a network, then, is given by overall product of the edge inclusion probabilities assuming independence.

A Scale-Free Structure Prior
The structure prior is generically defined as where P(GDh,s) is the probability of a p-node network, G, under a certain network model given a vector of parameters, h, and a node labeling, s; the summation is over all permutations of s. It is obvious from the definition that h and s are hyperparameters and must be dealt with accordingly. In the case of s, each one of the p! possible node labeling assignments is alloted uniform weight. In our work, we additionally impose that p(h) is a uniform prior, leaving the details to be described below within the contexts of specific network models. The simplest means of dealing with uncertainty about graphical structures is to assign uniform weight either to each G[G p , that is, p(G)~1=DG p D, or to the subspace of decomposable networks [43]. This approach is in fact a special case of the probability of a network under the ER network model when b~1=2. A related approach uses a prior that distributes probability mass uniformly according to the number of edges as opposed to individual networks [42]. More recently, the ER model has been explicitly employed as a structure prior [24], [25]. The random structure prior, p r (G), is formally defined by where h~(b) and T~p 2 is the number of possible edges. A node labeling would be superfluous due to symmetry. To foster sparsity b may be fixed at 2=(p{1) so that the expected number of edges comes out to be p when (5) is taken over all networks. Although strictly speaking the value will be somewhat lower in the decomposable case [25]. The approach taken in this paper goes one step further by simply taking p(b) as uniform over the unit interval.
As explained above, the static model with parameter h~m,K ð Þ is a generalization of the ER model that is accommodating to scale-free topologies. We define the the scale-free structure prior, p sf (G), according to the probability of a network under (3). The static model has two parameters, h~m,K ð Þ, and they are not exactly independent as the domain of K is a function of m. This means that the prior p(m,K) must actually be treated as the product of p(m) and p(KDm). We take each term to be uniform over its respective domain.

MCMC Implementation
MCMC algorithms are commonly used for sampling from highdimensional probability distributions such as those encountered in modern bioinformatic applications [46], [47], [48]. In this section, we describe a Metropolis-Hastings sampling scheme for updating the state variables G, h, and s. Our main interest is in inference for the posterior P(GDD). We take the approach of estimating the target distribution P(GDD,h,s)!P(DDG)P(GDh,s)p(h)p(s) ð6Þ with p(s)~1=p! and then marginalize over h and s to obtain P(GDD). In the process, any h m (m~1, . . . ,M) or s i (i~1, . . . ,p) can be estimated from a histogram of values, constructed from an MCMC chain. While methodology for sampling from P(GDD,h)!P(DDG)P(GDh)p(h) is well-established for GGMs [43], [42], [25], the concept of including s as a state space variable is new to our work. In principle, it is possible to marginalize over all permutations of s at each step, that is, P(GDh)~(1=p!) P s P(GDh,s). This approach, however, quickly becomes unfeasible as the number of nodes becomes large. What is more, very few assignments for s actually capture scale-free network structure, making the marginalization difficult to estimate by random sampling. Instead, we include s in the MCMC directly. We describe a Metropolis-Hastings sampler for s below, and provide an implementation in C computer code for decomposable GGMs, built largely on the work of [25].

Metropolis-Hastings Sampler
Updating G. The space of decomposable graphs can be traversed by adding or deleting a single edge in the transition from a current network, G, to a proposed network, G' [49]. In an arrangement of this sort, G and G' will have nearly identical maximal cliques, leading to extensive cancellation in the likelihood ratio P(DDG)=P(DDG') [42]. This coupled with the closed form expressions for (2) in the decomposable case, results in considerable computational savings in comparison with the same computations for non-decomposable models. However, in the transition from G to G', special care is required to preserve decomposability. To that end, a theorem of [43] provides easily verifiable, necessary and sufficient conditions to determine whether or not a network is decomposable. In their implementation [25], a transition is accomplished by first deciding to either add or delete an edge to G by the flip of a coin. Next the appropriate move is made at random to obtain G' as shown in Figure 2. If G' happens to be non-decomposable, then it is rejected outright.
Updating h. Each hyperparameter h m (m~1, . . . ,M) is updated as follows: select a value for h m' uniformly from (h m {e,h m ze) for a given step size e, rejecting when h' m falls outside its domain h mmin ,h mmax ð Þ . Updating s. In order to obtain s' we select an integer h[f1, . . . ,p{1g at random, find nodes v i and v j such that s i~h and s j~h z1, and then exchange the values of s i and s j ; see Figure 2.

Network and Parameter Estimation
Estimating G. An MCMC sample of the posterior (1) becomes increasingly threadbare as the number of variables grow, so much so that the frequency of a network in a chain is an inadequate approximation to its true probability, even for problems of moderate dimension. So too for the maximum posterior network -the single most probable network in a chain -unless its probability mass dominates a possibly multi-modal landscape, comprising a near-infinity of alternative models, its status as a representative estimator is questionable [50]. This is even more important in our implementation, as we carry the model parameters through the computation. Alternatively, a more representative estimator can be pursued by exploiting marginal probabilities of edge inclusion, which do reflect posterior density. We took our estimated network to be the network of all edges in the sample with marginal probability greater than c, which we denote byĜ G c p ; the subscript p denotes the structure prior.

Simulation Design
We carried out a simulation study in order to evaluate the relative performance of the random and scale-free structure priors. In our experiments, we generated trees invested with a variety degree distributions that can be thought of as falling along a spectrum ranging from binomial to scale-free on through to more extreme heavy-tail forms, called crumple trees, culminating finally with a star tree. For each tree, we generated multivariate Gaussian data under the assumption that a tree represents the true underlying conditional independence structure of a GGM. We then ran our Metropolis-Hastings sampler for both structure priors in an effort to recover each true tree from the data.
Data generation. In order to simulate trees we more or less relied on the stochastic algorithm of [51]. Their approach rests on specifying a formula for the degree distribution, P(k), for a p-node connected tree. Then, roughly speaking, they use MCMC to draw a tree that is maximally random under P(k).
The reason for restricting our simulation to trees is that data satisfying their implied conditional independence structures can be generated by a simple iterative procedure. With this end in mind, it is convenient to imagine the edges as being directed according to index so that an edge from v i to v j implies that ivj. The procedure begins with simulating X 1 , which is identified with node v 1 , as a standard normal random variable, Z 1 . Next, any X j corresponding to a child of v 1 is simulated as X j~( X 1 zZ j )= ffiffi ffi 2 p . The step X j~( X i zZ j )= ffiffi ffi 2 p is then repeated from parent v i to child v j until all nodes have been reached. The scaling factor, ffiffi ffi 2 p , ensures that each X i [X has unit variance.
Performance measures. Let TP (true positive) denote the number of edges correctly identified by the estimated network with FP (false positive), FN (false negative), and TN (true negative) defined similarly. The positive predictive value, PPV~TP=(TPzFP), and the sensitivity, Se~TP=(TPzFN), are reported for each estimated network. While it is often customary to include specificity, TN=(FPzTN), along with PPV and Se, its conspicuous absence here is for good reason. Since GANs are sparse, TN is sure to be very large in comparison to FP. As a result, even a moderate change to FP will have little influence on the specificity, making this an unsuitable measure of performance.

Simulated Example
This section serves as a prelude to an extensive simulation study, illustrating our methodology by means of a simple example. Specifically, we set p~250 and generated a binomial tree and a scale-free tree with c~2:3, and then simulated an n~75 Add Delete Figure 2. An example of the Metropolis-Hastings transition step. The current network, G, with node labeling s~3,4,2,1,5 ð Þis updated to a proposed network, G', by adding/deleting a single edge to/from G at random. This picture shows two possible ways for G to be updated. In one instance, a new edge is added between v 1 and v 3 to obtain G', while the other G' is obtained by deleting the the edge between v 1 and v 4 . As for the proposed node labeling, an integer h, in this example h~4, is selected randomly from 1,2,3,4 f g . From there the node labels s 2~h~4 and s 5~h z1~5 are swapped to get s'~3,5,2,1,4 ð Þ . doi:10.1371/journal.pone.0013580.g002 observation dataset from each. In each case, we attempted to recover the true tree from the scaled dataset using our Metropolis-Hastings sampler implemented with 1) p r (G), and 2) p sf (G).
For each chain, the Metropolis-Hastings sampler was run for 10 7 steps after a burn-in of 10 6 , starting from the empty network and identity node labeling. The value for the step size, e, required for updating the hyperparameters was set to 0:05 with m min~0 :01. As for the hyper-inverse Wishart parameters, we choose d~3 which fixes t at 4 since the data was standardized. The values of the hyperparameters were recorded at every 100'th step after burn-in. The runtime for the Metropolis-Hastings sampler with p r (G) on a dual 1:8 GHz PowerPC G5 processor was 5:92hrs for the binomial tree and 5:67hrs for the scale-free tree. The corresponding runtimes with p sf (G) were 15:20hrs and 12:93hrs.
The results of the case study are shown in Table 1. In this experiment, our expectation that p sf (G) will recover the scale-free tree more accurately than p r (G) is confirmed. It should also be noted that p sf (G) was able to recover a reasonable value for the scale-free exponent, too. Not to mention that it recovered the binomial tree on par with p r (G), thereby allaying the potential drawback that it would infer a heavy-tailed network, even from binomial data. Remember, this can be explained by the rather large value of b c c. Recall that when c is large, p sf (G) actually approximates p r (G). And although it may seem odd that p sf (G) fared slightly better on the binomial tree, the disparity falls within the boundaries of sampling variation. More precisely, we ran the Metropolis-Hastings sampler 10 times for each structure prior, starting each run from a different random seed, and found that the standard deviation of the sensitivity was 0:02 in each case. Finally, we ran the uniform structure prior on both trees, but decided against including the results in Table 1 due to very poor performance. Table 2 contains the results of our main simulation. In the previous section, we focused on two particular trees: one binomial, the other scale-free. This time we generated 25 trees (p~250) under each model listed in the table together with accompanying datasets of n~75 observations. The models listed as scale-free, not including the BA model, and the crumpled one were generated from a two-parameter family of distributions [51]. The parameter setting for generating the crumple trees was a~8 and c~2. The simulation settings used for each MCMC run are identical to those of the case study. Finally, the values of PPV , Se, and b c c reported in the table are averaged over the 25 chains. The simulation was run on the supercomputer, Tsubame [52]. The system has a total of 639 Sun Fire 64600 nodes. Each node has 8 AMD Opteron Dual Core model 880, 2.4GHz cpus.

Extended Simulation
Just as with the simulated example, p sf (G) recovers the binomial trees equally as well as p r (G). In fact, the PPV agreed to two decimal places, while the Se was actually a little higher under the scale-free structure prior. This slight discrepancy can be accounted for by noting that the standard deviation of Se was 0:04 for both priors. Also as expected, the more heavy-tailed the underlying trees become, the more p sf (G) outperforms p r (G). The difference becomes huge in the extreme case of a star tree. Moreover, p sf (G) demonstrated the ability to roughly recover the scale-free exponent of the underlying tree.

Real Data Example
We demonstrate our methodology on a subset of the gene expression data from a breast cancer study by [40] that was originally analyzed in [25]. The dataset (Dataset S1) consists of expression profiles for p~150 genes related to the estrogen receptor gene ESR1 (also known as ER-alpha) derived from n~49 tumor samples. This gene is an estrogen-activated transcription factor key to the proliferation of cancerous cells that is found to be overexpressed in luminal type A and B breast cancers. The overall level of ESR1 expression is higher in type A than in type B with the former correlating with better prognosis [53].
The Metropolis-Hastings sampler was run on the standardized data with both the random structure prior and its scale-free counterpart, yielding the corresponding GANsĜ G c pr andĜ G c psf . For comparison's sake, the edge inclusion threshold, c, was tuned for each run so that the resulting GAN comprised exactly 150 edges; the value of c is 0:370 for p r (G) and 0:747 for p sf (G). In both cases, the Metropolis-Hastings sampler was started from the empty network with identity node labeling and 11|10 6 iterations were run with the first 10 6 discarded as burn-in. The hyperparameter assignments were identical to those of the simulated examples. The runtime on a dual 1:8 GHz PowerPC G5 processor was 2:70hrs with p r (G) and 11:93hrs with p sf (G).
At this stage, comparing the performance of the scale-free structure prior in a broader context is of key importance. To this end, we used the software packages ARACNE [54], [55] and BANJO [23] to analyze the gene expression data as well. ARACNE constructs a relevance network based on estimated mutual information between all pairs genes, but there is a twist. After a relevance network is inferred by connecting any pair of genes with mutual information greater than a certain cutoff value, some edges suspected to represent indirect interactions are eliminated using the data processing inequality principle. We chose the cutoff value to be 0:2735 so that the number of estimated edges was 150, while all other program arguments were set at their default values. The code itself was run in a matter of minutes. BANJO, on the other hand, constructs a Bayesian network from discrete data using a heuristic search strategy to explore the space of p-node, directed networks without cycles. Each network happened upon in the search is ranked using a Bayesian Dirichlet equivalence scoring metric. We discretized the      decentralized with no single gene dominating the network. Additionally, the estimated value of exponent c in the static model was 2:28, in line with findings in the literature for gene regulatory networks [31]. Turning now to Figures 3(C) and (D), it is interesting to see that the topology of the relevance network echoes that of the GAN inferred using the scale-free structure prior. The same can be said for the Bayesian network and the random structure prior GAN. Of course, a more exquisite experimental technique is the only sure-fire way to validate the individual regulatory interactions suggested by these graphical models. These results, however, are telltale in one respect. In a study comparing different reconstruction methods on simulated data [56], it was reported that BANJO performs well only when n&p, while ARACNE shows good performance even when n%p. The topological dissimilarity between the two GANs is again made evident by a visual inspection of their degree distributions, plotted in Figure 4. The most abundantly connected node inĜ G c pr has degree 15, whereasĜ G c p sf contains four nodes with degree exceeding this value; the largest hubs correspond to the genes FOXA1 (HNF-3A), SLC39A6 (LIV-1), and E2F3 (KIAA0075) and have degree 50, 28, and 18, respectively. The main hub FOXA1 is a forkhead box family transcription factor that is necessary for optimum expression of roughly half of all ESR1regulated genes [57]. In a recent study [58], it was found that FOXA1 is expressed predominantly in luminal type A carcinomas, making it a potential marker of good prognosis. Previously unrecognized as a hub, SLC39A6 functions as a zinc transporter, and was identified in [59] to be highly expressed in ESR1-positive tumors as well as showing a highly significant association with the spread of breast cancer to the lymph nodes. Meanwhile, E2F3 is a transcription factor that has been shown to regulate numerous genes involved in cell cycle progression [60].
Finally, both GANs agreed with the relevance network on some established regulatory interactions as can be seen in Figure 5. For instance, FOXA1 is connected to AR (androgen receptor), which is known to regulate estrogen receptor expression [61]. FOXA1 has also been shown to play a direct role in the transcription of the TFF1 (pS2) gene [62], and our work agrees with [24] on the role of TFF3 (ITF) as an intermediary. By contrast, the Bayesian network agreed on very few of these interactions. Part of the explanation is likely to rest in using the maximum posterior network as the estimated network. As we drew attention to in the section Network and Parameter Estimation, a single network of high posterior probability may be a less representative estimator than an network consisting of edges that occur with high frequency in an MCMC chain. Another possible contributing factor is that the number of observations was insufficient for BANJO, but what is also unclear is the extent to which discretizing the expression data affected the quality of the inference.

Discussion
The main purpose of this paper has been to introduce a scalefree structure prior, p sf (G), for graphical models with a view toward the inference of large-scale GANs from datasets consisting of few observations, n, for a comparatively large number of variables, p. It is important to point out that the true network need not follow a power-law in order for the scale-free prior to be applicable; rather, p sf (G) is a convenient distribution that can account for heavy-tailed degree distributions -a crucial limitation of the random structure prior. That said, we have shown in simulated examples that p sf (G) performs markedly better than the random structure prior at recovering networks characterized by heavy-tailed degree distributions. What is more, p sf (G) proved versatile enough to recover random networks on par with p r (G) itself. Above all, our analysis of the breast cancer expression data illustrates the practical value of the scale-free structure prior as an instrument to aid in the identification of candidate hub genes with the potential to direct the hypotheses of molecular biologists, and thus drive future experiments.
A node labeling s~s 1 ,s 2 , . . . ,s p À Á , that is, a permutation of the integers 1,2, . . . ,p applied to the nodes of G so that each v i [V is represented by the integer s i , is an essential prerequisite for any MCMC implementation of the scale-free structure prior. The reason is that the scale-free network model underlying p sf (G), or any other scale-free network model for that matter, is so elaborate that the nodes are not interchangeable in regard to computing the probability of G. And, although easily overshadowed by p sf (G) itself, our new Metropolis-Hastings sampler for s is an innovative contribution in its own right. Our sampler uses a simple pair swapping strategy for updating s, and one future topic of research is to investigate the comparative performance of more ingenious update schemes. More research is also required in order to assesses how accurately s can be estimated.
We take pains to point out that while our implementation is for GGMs, the methodology described here applies to graphical models more generally. For instance, p sf (G) could be applied crudely to Bayesian network inference by simply ignoring edge directionality, or else the underlying static model could be modified to have directed edges. The latter approach raises an interesting consideration: in gene regulatory networks, according to the prevailing wisdom [31], it is actually only the out-degree distribution that follows a power-law. By contrast, the in-degree of a node is usually small and its distribution is better approximated by a sort of restricted exponential function. While this distinction gets blurred when inference is conducted with undirected graphical models, Bayesian networks provide an obvious incentive for taking it into account. Indeed, Bayesian networks may prove to be a more promising area of application because they currently able to handle much larger networks than GGMs [63].
Although the static model is not biologically motivated, it is a defensible choice as an underlying model for p sf (G) on the grounds that it is a simple model with the potential to describe any network topology; not to mention that it includes the ER model as a limiting case. But there is more, implementing a structure prior based on a growing network model poses some added difficulties because not only will the probability of a network depend on the choice of seed network, but evaluating P(GDh,s) will result in a greater expenditure of computational resources as the edge inclusion probabilities depend on the order in which they were added to the network.
All the same, we implemented two other scale-free structure priors based on growing models; one on the Poisson-growth, preferential attachment model [64], and another on the biologically meaningful duplication model. In the former case, we were able to get away with using a single node as the seed network, and we found that while this prior recovered heavy-tailed networks as well as p sf (G), yet it understandably struggled to accurately recover random networks. Meanwhile, the duplication model based structure prior was highly sensitive to the choice of seed network in addition to being unstable due to the complexity of the model. One future avenue of research is to adapt these models, or the MCMC implementation, to be more applicable for use as a prior distributions. The primary motivation for doing so is that the model parameters have biological meaning, and their estimation could prove of independent interest.
The estimation of network model parameters has been an incidental aspect of our work; however, it is related to the quite different problem of fitting network models to known biological networks. Likelihood and likelihood-free methods have been developed [65], [66] in order to fit a hybrid preferential attachment/duplication and divergence model to some proteinprotein interaction networks, obtaining estimates of the model parameters. These methodologies assume that the ordering of the nodes in time, that is s, is known, but in most cases this information is unknown. In the future, our Metropolis-Hastings sampler could very well be applied to this problem.
Software is available from the corresponding author upon request.

Supporting Information
Dataset S1 This file contains the gene expression data that we analyzed in our paper. Found at: doi:10.1371/journal.pone.0013580.s001 (0.05 MB TXT)