Highly scalable maximum likelihood and conjugate Bayesian inference for ERGMs on graph sets with equivalent vertices

The exponential family random graph modeling (ERGM) framework provides a highly flexible approach for the statistical analysis of networks (i.e., graphs). As ERGMs with dyadic dependence involve normalizing factors that are extremely costly to compute, practical strategies for ERGM inference generally employ a variety of approximations or other workarounds. Markov chain Monte Carlo maximum likelihood estimation (MCMC MLE) provides a powerful tool to approximate the maximum likelihood estimator (MLE) of ERGM parameters, and is generally feasible for typical models on single networks with as many as a few thousand nodes. MCMC-based algorithms for Bayesian analysis are more expensive, and high-quality answers are challenging to obtain on large graphs. For both strategies, extension to the pooled case—in which we observe multiple networks from a common generative process—adds further computational cost, with both time and memory scaling linearly in the number of graphs. This becomes prohibitive for large networks, or cases in which large numbers of graph observations are available. Here, we exploit some basic properties of the discrete exponential families to develop an approach for ERGM inference in the pooled case that (where applicable) allows an arbitrarily large number of graph observations to be fit at no additional computational cost beyond preprocessing the data itself. Moreover, a variant of our approach can also be used to perform Bayesian inference under conjugate priors, again with no additional computational cost in the estimation phase. The latter can be employed either for single graph observations, or for observations from graph sets. As we show, the conjugate prior is easily specified, and is well-suited to applications such as regularization. Simulation studies show that the pooled method leads to estimates with good frequentist properties, and posterior estimates under the conjugate prior are well-behaved.
We demonstrate the usefulness of our approach with applications to pooled analysis of brain functional connectivity networks and to replicated x-ray crystal structures of hen egg-white lysozyme.


Introduction
Networks are relational structures composed of individual entities (vertices or nodes) together with a set of pairs or ordered pairs of entities (ties or edges) that share a specific relationship. Networks arise in many scientific fields, ranging from biology and epidemiology to social science and engineering. For example, social science researchers are frequently interested in interpersonal networks such as friendships. A common "unpooled" strategy for analyzing multiple such networks fits a separate model to each individual network, subsequently combining the resulting estimates by taking their respective means and medians as point estimates of the parameters of a group-level representative model. A similar approach was used by [36] in studying friendship across schools, with separate models fit to each network and the resulting statistics then summarized to infer general patterns. Under such a framework, the time complexity of inference scales linearly in the number of graphs, making this generalization expensive as the number of networks grows (particularly if the individual networks are themselves large). A second problem with the unpooled approach is that it may be difficult or impossible to find a model that is both estimable on each individual network and that includes all effects of substantive interest. For instance, where the model sufficient statistics for a particular network are sufficiently extreme (in a sense to be clarified below), the MLE may not exist; this condition is common for effects involving subgroup interactions when said subgroups are small. Importantly, this failure need not mean that the model family is generally inappropriate or ill-behaved, instead stemming from limitations in the ability to fit some models to a single graph realization. A natural way to avoid such problems is pooled estimation. In the ERGM context, pooled estimation has been studied, e.g.,
in the case of independently observed intraschool friendship networks [58], where the adjacency matrices representing the observed networks of each distinct school are aggregated into a block-diagonal matrix with structural zeros assumed for all off-diagonal blocks. However, the high cost of performing MCMC MLE on such a pooled network greatly limits the scale of cases that can be considered in this way. Similar schemes using hierarchical Bayesian models have also been proposed [59], but require high-quality MCMC simulations that can be computationally demanding when the number of network observations is large. Importantly, all of these methods share the property that likelihood calculations must de facto be performed for each network separately (whether those networks are notionally joined together in one large synthetic network as part of a pooling scheme or treated separately), which greatly increases both storage and time complexity (especially for large graphs).
This cost poses a substantial barrier in applications such as neuroscience or biophysics, where pooled inference on large collections of networks is of potential interest. Importantly, however, some of these cases have the special property that the collections of networks to be analyzed (or large subsets thereof) involve equivalent vertex sets (with either absent or equivalent covariates). For instance, [40] study two collections of approximately 1,000 networks representing independently drawn local energy minima for structures of wild type and E22G variants of Aβ1-40, a protein that plays a key role in the etiology of Alzheimer's disease. Each collection represents a series of independent draws from a respective common graph distribution, with identical vertex properties for graphs in each set. Likewise, in neuroscience settings one may (as in, e.g., [41]) observe anatomically defined networks on collections of subjects that are (at least provisionally) considered exchangeable within groups, and that can be usefully modeled as draws from a common graph generating process. Although less common to date in social science settings, collections of exchangeable networks with equivalent vertex sets can arise, e.g., from behavior in human subject experiments [60], semantic networks extracted experimentally or from texts [61], or as the result of replicated outcomes of agent-based simulations [62]. In such cases, it is possible to perform pooled ERGM inference at vastly lower cost than is possible with conventional techniques; indeed, one can obtain computational costs for estimation that are identical to the single-graph case.
In this paper, we propose a novel method for fitting multiple graph observations embedded in equivalent node sets using a pooled ERGM approach. By exploiting a simple property of statistical exponential families, we are able to convert the problem of pooled inference for a (possibly large) graph set to the problem of inference for a single pseudo-graph of size equal to an individual input graph, subsequently correcting the single-graph information matrix for the sample size of the pooled data set. We also show that a minor adjustment to this technique can be used to perform maximum a posteriori (MAP) estimation under conjugate priors at no additional cost, either for a single graph or for multiple graph observations. Because our technique works entirely within the mean value space of the chosen ERGM family, it can be performed via data adjustments with existing software intended for single-graph estimation, and is compatible with any estimation method that works via sufficient statistics (including the widely used Geyer-Thompson [63,64] and stochastic approximation [65] methods). In addition to pooling information and providing inexpensive Bayesian answers (using the Laplace approximation to the posterior distribution), our approach provides a simple and effective mechanism for regularization, with particular virtue in resolving the "convex hull problem" that frequently arises in discrete exponential families; we describe a simple approach to prior specification that is well-suited to this purpose, and that can be easily extended to provide more informative priors when appropriate background information is available.
The remainder of this paper is organized as follows. We begin with general concepts and notation for ERGMs in Section 1.1, and present our framework for scalable inference under both frequentist and Bayesian settings (including issues of prior specification) in Section 2. In Section 3, we employ simulation studies to examine the performance of our approach, and the behavior of posterior inference as a function of prior weight. In Section 4, our proposed methods are demonstrated on two different multiple network applications: brain functional connectivity networks from multiple subjects (Section 4.1); and protein structure networks from replicated x-ray crystal structures of hen egg-white lysozyme (Section 4.3). Finally, we close with a brief discussion and comment on potential future work.

Exponential Family Random Graph Models
Consider an order-n graph, G, represented via adjacency matrix Y on support Y_n, such that Y_ij corresponds to the state of the edge between vertices i and j; we make no particular assumptions about Y_n (e.g., it may consist of directed or undirected graphs, with or without loops, and may be valued), save that all elements of y ∈ Y_n are real and finite. An exponential family random graph model (ERGM) for Y is then given by

\Pr(Y = y \mid \theta) = h(y) \exp\left\{ \eta(\theta)^\top g(y) - \psi(\eta(\theta)) \right\}, \quad y \in \mathcal{Y}_n, \qquad (1)

where g : Y_n → R^p is a vector of real-valued sufficient statistics capturing network features of interest (which may implicitly incorporate, e.g., nodal or dyadic covariates) and θ ∈ Θ ⊆ R^q is a vector of (curved) model parameters mapped to canonical parameters by η : Θ → R^p [19]. The reference measure h determines the baseline behavior of the ERGM distribution when η(θ) = 0, and plays an important role in fixing the shape of the distribution when edges are valued [32]. In general, computation involving (1) is challenging due to the intractable nature of the log-partition function (i.e., normalizing factor),

\psi(\eta(\theta)) = \log \sum_{y' \in \mathcal{Y}_n} h(y') \exp\left\{ \eta(\theta)^\top g(y') \right\},

as Y_n is astronomically large (with |\mathcal{Y}_n| = 2^{n(n-1)/2} in the binary undirected case), the summand is generally too rough for naive Monte Carlo strategies to converge well, and ψ rarely has a closed-form solution. In the context of iid draws from the same ERGM pmf, we obtain the (homogeneous) pooled ERGM,

\Pr(\mathbf{Y} = \mathbf{y} \mid \theta) = \prod_{i=1}^{m} h(y_i) \exp\left\{ \eta(\theta)^\top g(y_i) - \psi(\eta(\theta)) \right\}, \qquad (2)

where Y = (Y_1, ..., Y_m) is a vector of random graphs with realizations y = (y_1, ..., y_m). Although most work focuses on the single-graph case, our emphasis here is on the case where m > 1 (either because of multiple graph observations, or, in the case of conjugate prior inference, because of "effective" prior observations that are equivalent to an increased m).
As exponential families, ERGMs have a number of convenient properties of which we will make use [35]. Subject to mild regularity conditions, we may define an invertible function μ(η) = E_η g(Y) that provides the mean value parameterization of an ERGM on random graph Y. From Eq. 2 it is evident that the corresponding function μ_m(η) = E_η g(\mathbf{Y}) = m μ(η), where g(\mathbf{Y}) = \sum_{i=1}^{m} g(Y_i) is the pooled sufficient statistic, is simply a constant multiple of the base mean value function for a single graph (foreshadowing a property that we employ below). Likewise, the Fisher information matrix of Y is given by

I(\eta) = \mathrm{Var}_\eta\, g(Y) = \nabla^2 \psi(\eta),

with the pooled equivalent being I_m(η) = m I(η).
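These scaling identities are easy to verify numerically. The sketch below uses a deliberately trivial one-parameter "ERGM", a homogeneous Bernoulli graph with g(y) equal to the edge count, so that μ(η) and I(η) are available in closed form; the model, sample sizes, and parameter values here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, eta = 10, 50, -1.0           # nodes, graphs per pooled draw, natural parameter
n_dyads = n * (n - 1) // 2
p = 1 / (1 + np.exp(-eta))         # Bernoulli edge probability = logistic(eta)

# Analytic single-graph mean value and Fisher information for g(y) = edge count
mu = n_dyads * p
fisher = n_dyads * p * (1 - p)

# Monte Carlo: the pooled statistic g(Y) is the total edge count over m IID
# graphs, i.e. a Binomial(m * n_dyads, p) random variable.
draws = rng.binomial(m * n_dyads, p, size=200_000)
print(draws.mean(), m * mu)        # pooled mean value is approximately m * mu(eta)
print(draws.var(), m * fisher)     # pooled variance is approximately m * I(eta)
```

Since the pooled sufficient statistic is a sum of IID single-graph statistics, both its mean and variance scale linearly in m, exactly as the identities μ_m(η) = mμ(η) and I_m(η) = mI(η) state.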

Mean Value Inference for Pooled ERGMs
Although a number of variants exist, standard approaches to inference for pooled ERGMs share the basic approach of computing likelihoods (or in some cases pseudo-likelihoods) for all observed graphs, and using the resulting joint likelihood for inference. Computationally, this may involve (as, e.g., in [58]) combining the observed graphs into a single large synthetic network of order |V| = mn (with support constraints prohibiting cross-graph ties), and then performing MCMC MLE or comparable Bayesian analyses on the synthetic graph; for pseudo-likelihood methods (e.g., [30,66]), edge variables may simply be combined across networks, possibly with resampling over networks (as with bootstrap [53,67] or Bayesian bootstrap [40] strategies). These strategies lead to computational and storage costs that increase at least linearly in the number of graphs, which can become prohibitive for large systems or when the number of graphs is substantial. Here, we observe that a much faster strategy based on the mean values of the sufficient statistics becomes available in the IID case, and that this same strategy can also be leveraged for conjugate Bayesian inference. To our knowledge, this very simple but powerful trick has not previously been exploited in the ERGM context.

Maximum Likelihood Inference for Pooled ERGMs
We begin with the simple case of maximum likelihood inference. Given IID ERGM observations y_obs = (y_1, y_2, ..., y_m), the joint log-likelihood follows immediately from Eq. 2:

\ell(\theta; y_{\mathrm{obs}}) = \sum_{i=1}^{m} \left[ \eta(\theta)^\top g(y_i) - \psi(\eta(\theta)) + \log h(y_i) \right], \qquad (3)

the maximizer of which (\hat{\theta} = \arg\max_\theta \ell(\theta; y_{\mathrm{obs}})) is the maximum likelihood estimator (MLE). As observed, the primary challenge in finding the MLE is in dealing with the log normalizing factor, ψ. Running a Markov chain over the states of each of the m graphs in the set can be used to accomplish this, or equivalently (as is done in, e.g., [58]) running a single Markov chain on a combined graph of order nm containing the union of all individual graphs, but the form of Eq. 3 shows that this is superfluous in the IID case. Specifically, observe that any maximizer of \ell is also a maximizer of any positive constant multiple of \ell, and thus

\hat{\theta} = \arg\max_\theta \frac{1}{m} \ell(\theta; y_{\mathrm{obs}}) = \arg\max_\theta \left[ \eta(\theta)^\top \bar{g}(y_{\mathrm{obs}}) - \psi(\eta(\theta)) + \frac{1}{m} \sum_{i=1}^{m} \log h(y_i) \right],

where \bar{g}(y_{\mathrm{obs}}) = m^{-1} \sum_{i=1}^{m} g(y_i) denotes the mean statistic vector.

Since the latter term does not depend on θ, we may further simplify the above to

\hat{\theta} = \arg\max_\theta \left[ \eta(\theta)^\top \bar{g}(y_{\mathrm{obs}}) - \psi(\eta(\theta)) \right], \qquad (4)

which is immediately recognizable as the MLE for a hypothetical single "pseudo-graph" of order n whose statistics are the means of the observed statistics. It is thus possible to find the MLE for a pooled model on m graphs by fitting a single-graph model (a considerable simplification).
To see the corresponding implications for the sampling distribution of the MLE, we note that inference for θ benefits from standard asymptotics in m (see, e.g., [35]), including the consistency and asymptotic normality of the MLE under suitable regularity conditions. In particular, if \hat{\theta}_m is the MLE for \mathbf{Y} with m observations drawn from a pooled ERGM with parameter θ_0, then it follows from standard exponential family theory [68] that

\sqrt{m} \left( \hat{\theta}_m - \theta_0 \right) \xrightarrow{d} N\left( 0, I(\theta_0)^{-1} \right), \qquad (5)

where I(θ) can be obtained from I(η) via the chain rule, i.e., I(\theta) = \nabla \eta(\theta)^\top I(\eta) \nabla \eta(\theta). It thus follows that the asymptotic variance-covariance matrix of the MLE in the m-graph case is equal to that of the single-graph case, divided by m; i.e.,

\mathrm{Var}\left( \hat{\theta}_m \right) \approx \frac{1}{m} I(\theta_0)^{-1}. \qquad (6)

It follows, then, that we may perform maximum-likelihood inference for y_obs with arbitrarily large m at no greater cost than fitting to a single network (and without the use of customized software tools): we simply find the MLE for a single (imaginary) graph with statistics equal to the mean of the observed statistics using any standard method (e.g., MCMC MLE [30] or stochastic approximation [65]), and then divide the associated variance-covariance matrix by m to correct for sample size. This procedure is summarized in Algorithm 1. When m is large, this can result in considerable computational savings; although the trick is quite trivial to implement, it has not to our knowledge been employed for ERGM inference in prior work.

Algorithm 1 Maximum Likelihood Inference for a Pooled ERGM Using Mean Values
Input: Observed graphs y_obs = (y_1, ..., y_m)
1: Compute the mean statistic vector \bar{g}(y_{\mathrm{obs}}) = m^{-1} \sum_{i=1}^{m} g(y_i)
2: Obtain \hat{\theta}_m by fitting a single-graph ERGM with target statistics \bar{g}(y_{\mathrm{obs}})
3: Set \mathrm{Var}(\hat{\theta}_m) = m^{-1} \hat{I}(\hat{\theta}_m)^{-1}
Output: \hat{\theta}_m, \mathrm{Var}(\hat{\theta}_m)
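As a concrete illustration of Algorithm 1, the sketch below applies the mean-value recipe to the same toy one-parameter family (a Bernoulli graph with edge-count statistic), for which the single-graph "fit" in step 2 reduces to a closed-form logit rather than an MCMC MLE run; all names and parameter values are hypothetical.

```python
import numpy as np

def pooled_ergm_mle(stats, n_dyads):
    """Algorithm 1 sketch for a toy one-parameter ERGM (g = edge count),
    where the single-graph MLE is available in closed form."""
    m = len(stats)
    g_bar = np.mean(stats)                   # step 1: mean statistic vector
    p_hat = g_bar / n_dyads
    eta_hat = np.log(p_hat / (1 - p_hat))    # step 2: single-graph "fit" to g_bar
    fisher = n_dyads * p_hat * (1 - p_hat)   # single-graph Fisher information
    var_hat = 1.0 / (m * fisher)             # step 3: rescale by sample size m
    return eta_hat, var_hat

rng = np.random.default_rng(1)
n, m, eta0 = 20, 200, -2.0
n_dyads = n * (n - 1) // 2
p0 = 1 / (1 + np.exp(-eta0))
stats = rng.binomial(n_dyads, p0, size=m)    # edge counts of m IID Bernoulli graphs
eta_hat, var_hat = pooled_ergm_mle(stats, n_dyads)
print(eta_hat, eta0)                         # pooled MLE close to the truth
print(eta_hat - 1.96 * np.sqrt(var_hat), eta_hat + 1.96 * np.sqrt(var_hat))
```

In a real application, step 2 would be an MCMC MLE or stochastic approximation fit to the pseudo-graph's target statistics, but the bookkeeping (average the statistics once, fit once, divide the covariance by m) is identical.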

Conjugate Maximum a Posteriori Inference for ERGMs
We now consider the problem of IID pooled ERGM inference in the Bayesian setting. Given prior π(θ), we are interested in the posterior distribution of θ,

\pi(\theta \mid y_{\mathrm{obs}}) \propto \pi(\theta) \prod_{i=1}^{m} \Pr(Y_i = y_i \mid \theta). \qquad (7)

Our focus here is on conjugate priors, in the canonical exponential family context for which η(θ) = θ.
In addition to their mathematical convenience, conjugate priors are attractive in the context of exponential families due to their interpretability (being able to be expressed in terms of prior "pseudo-data," consisting of a prior "mean" and effective "sample size" expressed in the same units as the observed data), the fact that they admit natural non-informative limits, and their status as maximum entropy distributions [69]. To our knowledge, conjugate priors for ERGMs were first examined in the unpublished work of [70], who considered them along with a number of other ERGM prior specifications, but to date they have not been extensively studied. As we show, ERGM conjugate priors allow for extremely computationally efficient inference via their mean value representation. Moreover, there are particularly natural choices of weakly informative conjugate priors that are well-suited to regularization; we consider these in Section 2.2.1.
For a canonical ERGM family with η(θ) = θ, conjugate priors take the following form [54]:

\pi(\theta \mid \tau, n_0) = H(\tau, n_0) \exp\left\{ n_0 \left[ \theta^\top \tau - \psi(\theta) \right] \right\}. \qquad (8)

Here, τ is the prior expected value of the vector of sufficient statistics, and n_0 is a positive number that measures the confidence in those prior expectations, which can be viewed as the number of pseudo-observations' worth of information (in units of observed graphs) contained in the prior; H(τ, n_0) denotes the normalizing factor that makes π(θ | τ, n_0) a legitimate probability density function of θ. The existence of such a distribution is ensured by [71], who showed that (8) is normalizable provided that n_0 > 0 and τ lies in the interior of the convex hull of the support of the sufficient statistic, {g(y) : y ∈ Y_n}. Substituting (8) for π(θ) in (7), we have

\pi(\theta \mid y_{\mathrm{obs}}, \tau, n_0) \propto \exp\left\{ (n_0 + m) \left[ \theta^\top \left( \delta \tau + (1 - \delta) \bar{g}(y_{\mathrm{obs}}) \right) - \psi(\theta) \right] \right\}, \qquad (9)

where δ ≡ n_0 / (n_0 + m), taking values in (0, 1). With (9), we note that an analytical form for the prior π(θ | τ, n_0) is not necessary for Bayesian inference, because the prior can be fully characterized by δ (or n_0) and τ. Standard Bayesian theory tells us that the posterior expectation of ∇ψ(θ) (i.e., of the mean value parameter μ(θ)) is the Bayes estimate of μ(θ) with respect to quadratic loss [72], and is a weighted average of \bar{g}(y_{\mathrm{obs}}) and τ,

E\left[ \nabla \psi(\theta) \mid y_{\mathrm{obs}} \right] = \delta \tau + (1 - \delta) \bar{g}(y_{\mathrm{obs}}), \qquad (10)

with δ controlling the relative weight of the contribution of the prior information. For any given prior hyperparameters (τ, n_0), as the sample size m becomes large, δ, the relative weight of the prior, approaches zero, and hence the sample-based information dominates the posterior.
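The convex-combination structure of (9)-(10) is simple enough to express directly. The snippet below computes δ and the posterior expected statistic for some hypothetical statistic vectors; the numbers are illustrative only, loosely echoing the edges/homophily/GWESP statistics used later in the paper.

```python
import numpy as np

def posterior_mean_stats(g_bar, tau, m, n0):
    """Posterior expectation of g(Y) under the conjugate prior:
    delta * tau + (1 - delta) * g_bar, with delta = n0 / (n0 + m)."""
    delta = n0 / (n0 + m)
    return delta * np.asarray(tau) + (1 - delta) * np.asarray(g_bar), delta

# Hypothetical mean observed and prior expected statistics, for illustration
g_bar = np.array([203.0, 110.0, 120.0])    # mean observed statistics
tau   = np.array([201.64, 99.89, 3.62])    # prior expected statistics
post, delta = posterior_mean_stats(g_bar, tau, m=10, n0=1)
print(delta)    # 1/11: the prior contributes about 9% of the posterior mean
print(post)
```

As m grows with n_0 fixed, δ shrinks toward zero and the posterior expected statistic converges to the observed mean, recovering the MLE target.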
Given a prior specified by δ and τ, the maximum a posteriori (MAP) estimate \hat{\theta}^{\mathrm{MAP}}_{m,\delta} is also the Bayes estimate under a different choice of loss function (the 0-1 loss; see, e.g., [72]). Note that since the MAP estimate is the maximizer of the kernel of the posterior density (9), we can employ the same arguments as in the derivation of (4) to obtain

\hat{\theta}^{\mathrm{MAP}}_{m,\delta} = \arg\max_\theta \left[ \theta^\top \left( \delta \tau + (1 - \delta) \bar{g}(y_{\mathrm{obs}}) \right) - \psi(\theta) \right].

It follows, then, that the pooled ERGM MAP estimator \hat{\theta}^{\mathrm{MAP}}_{m,\delta} is equal to the MLE \hat{\theta} that would be obtained for a single pseudo-observation with sufficient statistics δτ + (1 − δ)\bar{g}(y_{\mathrm{obs}}).
Under standard regularity conditions, the posterior distribution π(θ | y_obs, τ, n_0) becomes asymptotically Gaussian as m → ∞, according to the classical Bernstein-von Mises theorem [73]. Following the same basic "mean value" procedure used in Algorithm 1 for obtaining the pooled ERGM MLE \hat{\theta}_m, we are able to compute the MAP estimate \hat{\theta}^{\mathrm{MAP}}_{m,\delta} by fitting an ERGM to a single "pseudo"-graph whose node set is the same as that of the observed networks, but whose network statistics are taken to be equal to δτ + (1 − δ)\bar{g}(y_{\mathrm{obs}}). In addition to the MAP estimate, we can also obtain an estimate of the observed Fisher information \hat{I}(\hat{\theta}^{\mathrm{MAP}}_{m,\delta}), which is approximately the negative Hessian of the log-posterior generated by the product of the prior and the likelihood of a single "pseudo"-graph. However, the Laplace approximation of the posterior distribution requires the Hessian of the true log-posterior, which is generated by the product of the prior and the likelihood of all actual observations. Note that the negative Hessian matrix Q_{m,δ}(θ) of the true log-posterior (9) can be approximated by

Q_{m,\delta}(\theta) \approx (n_0 + m)\, \hat{I}(\theta).

Laplace's approximation of the posterior then yields [74]

\pi(\theta \mid y_{\mathrm{obs}}, \tau, n_0) \approx N\left( \hat{\theta}^{\mathrm{MAP}}_{m,\delta},\; \left[ (n_0 + m)\, \hat{I}(\hat{\theta}^{\mathrm{MAP}}_{m,\delta}) \right]^{-1} \right).

We complete the approximation by noting that \hat{I}(\hat{\theta}^{\mathrm{MAP}}_{m,\delta}) = \mathrm{Var}_{\hat{\theta}^{\mathrm{MAP}}_{m,\delta}}\, g(Y), which can be obtained, e.g., by Markov chain Monte Carlo simulation [30].
Putting the pieces together, Algorithm 2 provides a simple procedure for performing MAP inference for pooled ERGMs under conjugate priors. We begin by specifying the prior parameters τ and n_0, and computing the mean data vector \bar{g}(y_{\mathrm{obs}}) and relative prior weight δ. The key steps are lines 3-4, which obtain the MAP estimate and associated approximate posterior variance-covariance matrix by performing the same calculations as are required for obtaining a single-graph MLE and its sample variance-covariance matrix: we simply fit to the posterior expectation δτ + (1 − δ)\bar{g}(y_{\mathrm{obs}}) instead of to an observed data value, and then adjust the information matrix to reflect the total posterior weight (prior pseudo-observations plus m). Not only does this allow us to perform inference for large-m data sets at no additional cost (as we did for the MLE), but it also allows us to perform Bayesian inference using algorithms and/or software implementations that were designed for maximum likelihood inference (or for first-order method-of-moments, which corresponds to maximum likelihood in this case) without additional modification.

Algorithm 2 MAP Inference for a Pooled ERGM Using Mean Values
Input: Observed data y_obs, prior data expectation τ, and prior sample size n_0
1: Compute the mean statistic vector \bar{g}(y_{\mathrm{obs}}) = m^{-1} \sum_{i=1}^{m} g(y_i)
2: Compute the relative prior weight δ = n_0/(n_0 + m)
3: Obtain \hat{\theta}^{\mathrm{MAP}}_{m,\delta} by fitting a single-graph ERGM with target statistics δτ + (1 − δ)\bar{g}(y_{\mathrm{obs}})
4: Set the approximate posterior variance-covariance matrix to [(n_0 + m) \hat{I}(\hat{\theta}^{\mathrm{MAP}}_{m,\delta})]^{-1}
Output: \hat{\theta}^{\mathrm{MAP}}_{m,\delta} and its approximate posterior variance-covariance matrix

In addition to its computational convenience, we note that the posterior expected statistic δτ + (1 − δ)\bar{g}(y_{\mathrm{obs}}) has an intuitive geometric interpretation as a convex combination of the prior information and the observed information, with the respective weights being determined by the relative size of the prior weight n_0 versus m. In particular, note that as δ → 0, we approach the MLE, while the prior is left unchanged by the observed data in the limit as δ → 1. We examine this behavior in greater detail below. We also observe that so long as τ lies in the relative interior of the convex hull of {g(y) : y ∈ Y_n}, then \hat{\theta}^{\mathrm{MAP}}_{m,\delta} exists (and is unique). This suggests the use of conjugate MAP estimation to address a common practical problem in ERGM inference, namely the non-existence of the MLE when the observed statistics g(y_obs) lie on a face of the convex hull of possible statistics. In such cases, there is a direction of recession within the parameter space, with respect to which the MLE diverges; often, however, such divergent parameter values arise from very minimal information, as, e.g., when a small subset of vertices in a sparse graph have no ties to each other (leading to a divergence in the corresponding homophily term). Use of MAP inference with a small δ can improve performance in these cases by acting as a regularizer, shrinking extreme parameter estimates that have little support from the likelihood without otherwise greatly altering the solution. We examine this further in Section 2.2.1.
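Algorithm 2 can be sketched in the same way as Algorithm 1, again substituting a closed-form fit on a toy one-parameter Bernoulli family for the MCMC step so that the example is self-contained; the prior mean and all parameter values below are hypothetical.

```python
import numpy as np

def pooled_ergm_map(stats, n_dyads, tau, n0):
    """Algorithm 2 sketch for the toy one-parameter ERGM (g = edge count):
    fit a single pseudo-graph to delta*tau + (1-delta)*g_bar, then rescale
    the information by the total posterior weight n0 + m."""
    m = len(stats)
    delta = n0 / (n0 + m)
    target = delta * tau + (1 - delta) * np.mean(stats)
    p_hat = target / n_dyads
    theta_map = np.log(p_hat / (1 - p_hat))   # closed-form "fit" to the target
    fisher = n_dyads * p_hat * (1 - p_hat)    # single-graph Fisher information
    post_var = 1.0 / ((n0 + m) * fisher)      # Laplace posterior variance
    return theta_map, post_var

rng = np.random.default_rng(2)
n, m, theta0 = 20, 50, -2.0
n_dyads = n * (n - 1) // 2
p0 = 1 / (1 + np.exp(-theta0))
stats = rng.binomial(n_dyads, p0, size=m)     # m IID toy-graph edge counts
tau = 0.5 * n_dyads                           # hypothetical prior mean statistic
theta_map, post_var = pooled_ergm_map(stats, n_dyads, tau, n0=1)
print(theta_map, post_var)                    # shrunk slightly toward logit(0.5) = 0
```

With n_0 = 1 and m = 50, δ ≈ 0.02, so the MAP estimate is pulled only slightly away from the MLE and toward the (here deliberately extreme) prior, illustrating the regularizing behavior described above.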

Conjugate Prior Specification
Conventional research on Bayesian analysis of ERGMs focuses on priors assigned on the natural parameter space (see, e.g., [28,29,33]), whereas the ERGM conjugate prior here is actually specified in the mean-value parameter space. This has the potential advantage that prior parameters are specified in terms of hypothetical observables (i.e., graph statistics), which are both concrete and generalizable from previously observed data; for instance, it may be easier for the analyst to specify an expected mean degree for a hypothetical network belonging to a well-studied class (e.g., friendship nominations within high schools) than to specify prior mean parameter values per se. Conversely, given an intuition regarding plausible parameter values, it is straightforward to obtain corresponding values of τ by simulation. Here, we discuss some basic strategies for selecting reasonable prior parameters in practice, with the impact of prior choices being examined further in Section 3.2.
As discussed in Section 2.2, the specification of the ERGM conjugate prior consists of two components: the a priori expected sufficient statistics, τ, and the corresponding prior weight, n_0. As with other exponential families, we may imagine this prior as arising from a situation in which we initially have no information regarding θ (in the sense of a limiting "flat" prior with n_0 → 0), and then observe n_0 IID graph draws with mean statistics τ; our resulting state of knowledge is then summarized by the corresponding conjugate prior. This "prior pseudo-data" interpretation makes the conjugate prior particularly easy to understand and communicate, and it can greatly facilitate sanity checking: for instance, if we observe that a proposed τ value implies a mean degree far in excess of any value that could plausibly be observed in practice, then we are immediately aware of the need for refinement.
While prior specification is by nature problem-specific, we here suggest several reasonable strategies for the selection of τ. Where the analyst has access to a sample of networks, y_comp, that are similar to the network of interest (i.e., that are believed to have been produced by a similar generative process), setting τ = \bar{g}(y_{\mathrm{comp}}) is a natural informative choice; in this case, the posterior expectation of g(Y) is shrunk towards the prior population mean. In other cases, however, the analyst may lack such a sample, or may wish to posit a minimally informative prior that regularizes inference without strongly influencing the final estimate (a long-established tradition in Bayesian analysis, per, e.g., [75][76][77][78], etc.). In this context, it is useful to consider the homogeneous Bernoulli graphs (in which each edge is an IID Bernoulli trial) as a basis for the prior distribution; proposed as early as [79], then later described independently by [80] and [81] as the Gilbert-Erdős-Rényi model in graph theoretic research, the Bernoulli graphs also arise for typical (counting measure) ERGMs as the base case where all parameters other than that associated with the edge count are equal to 0. Given a prior expected degree \bar{d} (chosen, e.g., on the basis of observations of similar networks, or from prior domain knowledge), we may then set τ by simulating draws from the homogeneous Bernoulli graph distribution with tie probability p = \bar{d}/(n − 1) and taking the average of their statistics. (In some cases, it may also be feasible to derive the expected statistics analytically from p, in which case these values may be used directly; however, exact sampling of Bernoulli graphs is extremely efficient, and a Monte Carlo approach may be easier to implement in practice.) As the Bernoulli graphs coincide with the de facto null model against which estimated parameters are typically assessed, setting τ to the Bernoulli graph expectation effectively shrinks estimates towards the null model (analogously to the use of a zero-centered Gaussian or other prior in the natural parameter space), making it a reasonable default choice when more refined information is not available.
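A minimal Monte Carlo sketch of this prior calibration is given below, computing only the edge-count component of τ for a Bernoulli graph with the FMHS dimensions; the other statistics (homophily, GWESP) would be computed analogously from the same draws, and the helper function name is ours.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_bar = 205, 1.974                  # FMHS node count, prior mean degree
p = d_bar / (n - 1)                    # Bernoulli tie probability

def edge_count(rng, n, p):
    """Draw one undirected Bernoulli graph and return its edge count.
    (A full application would also compute homophily, GWESP, etc.)"""
    upper = rng.random((n, n)) < p     # independent Bernoulli(p) dyad draws
    return int(np.triu(upper, k=1).sum())

draws = [edge_count(rng, n, p) for _ in range(500)]
tau_edges = np.mean(draws)
print(tau_edges, n * d_bar / 2)        # Monte Carlo vs. analytic expected edges
```

For the edge count the analytic expectation n\bar{d}/2 is of course available directly, but the Monte Carlo route generalizes immediately to statistics (such as GWESP) whose Bernoulli-graph expectations lack convenient closed forms.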
We now turn to the prior weight ("pseudo-sample size"), n_0. It is convenient to consider n_0 via the relative prior weight, δ = n_0/(n_0 + m), which quantifies the contribution of the prior to the posterior mean statistics: the prior will dominate the data in determining the posterior when δ → 1 (i.e., n_0 → ∞), whereas a more "objective" analysis that lets the data "speak for themselves" can be obtained by letting δ → 0 (i.e., n_0 → 0, which as noted converges to the MLE). As noted above, a small-δ prior can also be viewed as a tool to regularize the model, avoiding the extreme inferences that result from data on or near a face of the convex hull of the sufficient statistics. While the impact of δ on the posterior mean of the sufficient statistics is self-evident from, e.g., Eq. 10, its effect in the natural parameter space is less obvious. We examine this numerically via simulation in Section 3.2.

Simulation Studies
In this section, we conduct simulation studies to assess the behavior of the pooled MLE as m becomes large, and to examine how prior specifications affect conjugate MAP inference. To provide a realistic basis for evaluation, we base our simulated networks on Goodreau's Faux Mesa High School (FMHS) data [64], a synthetic network based on proprietary data on attributions of friendship among students in a high school in the southwestern United States [82]. The FMHS network represents simulated in-school friendships among the 205 students in the school, along with their individual attributes, and was constructed to preserve the structural properties of the underlying data set. For our study, we first fit an ERGM to the FMHS network with the following three statistics, as implemented in [67]: number of edges; uniform homophily by gender; and geometrically weighted edgewise shared partners (GWESP), a common term for inducing triad closure, with the decay parameter λ fixed at 0.25. A detailed definition of the above-mentioned network statistics is given in Appendix A. Given the specified model, we first compute the MCMC MLE and treat the estimated coefficients as the networks' "true" parameter values θ_0 = (−5.88, 0.52, 1.86); we henceforth refer to this model (i.e., ERGM distribution) as M, from which we draw random samples for our simulation studies (i.e., Y ∼ M). All computations in this paper were carried out in the statistical environment R [83], using the statnet libraries [64, 84-86]. ergm version 4.1.2 was used for all ERGM-specific computation, using default simulation and estimation settings except as otherwise noted.

Behavior of the MLE in Pooled-Likelihood Inference
For our first study, we vary the sample size m and examine the observed coverage rates of nominal 95% confidence intervals for model parameters as a function of sample size. Specifically, for each value of m, we generate K = 1000 datasets of size m from M, performing pooled likelihood-based inference for each sample as discussed in Section 2.1. Respective burn-in and thinning intervals of 1 × 10^6 and 2 × 10^5 were employed for each simulated sample (for both data simulation and MCMC MLE inference), with MCMC MLE termination based on the ergm Hotelling criterion (an autocorrelation-adjusted T^2 test of expected versus target statistics obtaining p > 0.5). Table 1 shows the observed bias of the pooled MLE, as well as its standard error and the observed coverage rates of its nominal 95% CIs (based on the asymptotic distribution of the MLE, as estimated from the size-corrected Fisher information obtained from a pooled, single-graph estimate) for all three model parameters under sample sizes ranging from 1 to 100 graphs. Bias is negligible even for a single graph, declining to the level of numerical noise once m is greater than 5-10. Likewise, efficiency (as measured by the standard error of the estimator) is high, with the standard error declining as 1/√m in the manner expected from asymptotic theory. It is also evident that CIs based on the asymptotic distribution of Eq. 5 perform well, maintaining approximately nominal coverage rates over a wide range of sample sizes. As a practical observation, we note that the construction of such CIs is based on statistical uncertainty, and does not take into account numerical sources of error (arising, e.g., from imperfect optimization, Monte Carlo error, etc.). As m → ∞, the statistical error becomes arbitrarily small, thus increasing the proportion of de facto error arising from numerical approximation; put another way, it is possible to enter a regime in which our inferential precision is limited by our ability to compute the MLE (and \hat{I}) rather than by the limits of our data. To ensure accurate coverage in such extreme-m scenarios, it may be necessary to adopt more stringent MCMC burn-in and thinning settings than are typically necessary for single-graph inference (where statistical uncertainty dominates), or to devise improved error estimates that better account for approximation error. That said, we find excellent performance for the sample sizes considered here, suggesting that the problem may be limited in practice. Further, we observe that the excellent coverage obtained for small m (even m = 1) provides practical validation of the traditional practice of using asymptotic confidence intervals in the m = 1 case; for a review of different types of ERGM asymptotics (and their relationship to classical results) see, e.g., [35].
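A coverage experiment of this kind can be mimicked on the toy one-parameter family used earlier, where the pooled MLE and its size-corrected standard error are available in closed form; the parameter values below are illustrative, not those of model M.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, K = 20, 10, 2000                 # nodes, graphs per dataset, replications
n_dyads = n * (n - 1) // 2
eta0 = -2.0                            # "true" natural parameter
p0 = 1 / (1 + np.exp(-eta0))

covered = 0
for _ in range(K):
    stats = rng.binomial(n_dyads, p0, size=m)    # m IID toy-graph statistics
    p_hat = stats.mean() / n_dyads
    eta_hat = np.log(p_hat / (1 - p_hat))        # pooled MLE (closed form)
    # Size-corrected standard error: [m * I(eta_hat)]^(-1/2)
    se = np.sqrt(1.0 / (m * n_dyads * p_hat * (1 - p_hat)))
    covered += (eta_hat - 1.96 * se <= eta0 <= eta_hat + 1.96 * se)

print(covered / K)   # should be close to the nominal 0.95
```

Because the fit here is exact, any departure from 95% coverage reflects only the asymptotic approximation itself, isolating the statistical component of error from the numerical component discussed above.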

Prior Weights and MAP Inference
As discussed in Section 2.2.1, choosing the relative prior weight (δ) is an important aspect of the prior specification; while the choice of τ can often be made based on either prior data or domain knowledge, the impact of n_0 (and hence δ) is less obvious. Here, we examine the impact of δ on the MAP estimate, with a particular interest in identifying prior parameter values that are likely to serve as reasonable starting points for use in regularization. Our analysis looks first at the impact of δ on the MAP estimate itself (i.e., the extent of interpolation between the implicit prior natural parameter and the MLE), and then considers the effect of δ on the frequentist properties of the MAP estimate (bias, and the frequentist coverage of the posterior credible intervals).
To specify a prior, we first simulate homogeneous Bernoulli random graphs on the node set of the FMHS network, with expected mean degree fixed at the average degree of all nodes in three comparable networks (i.e., Goodreau's Faux Magnolia High School, Faux Dixon High School, and Faux Desert High School data [64]). The observed average degree across these data sets is 1.974, leading to an edge coefficient of log(p/(1 − p)) (with p the corresponding tie probability) for the Bernoulli family; the parameters for the other two terms are set to 0. (We note in passing that calibration of this kind should generally be done using mean degree rather than density, as mean degree is often close to size-invariant for comparable relations while density is not; see e.g. [26, 87].) We then calculate the average network statistics of 500 draws from this distribution, giving us the prior expected statistics τ = (201.64, 99.89, 3.62). Since our focus is on δ, we fix our sample size at m = 1 and vary n_0 to obtain the posterior inference under different relative prior weights. We perform MAP estimation on 1000 independent realizations of Y ∼ M for each choice of δ, comparing the resulting parameter estimates to their true values (θ_0) to assess the bias of the MAP estimate and the frequentist coverage probability of the 95% posterior credible intervals arising from the Laplace approximation to the posterior distribution.

We begin by examining the impact of δ on the MAP estimate. As noted above, the MAP estimate must interpolate between the natural parameter equivalent of τ at δ = 1 and the MLE at δ = 0; equivalently, we may think of the conjugate prior as shrinking the estimate towards the (natural parameter equivalent of the) prior expectation. The detailed pattern of shrinkage is depicted in Fig. 1 (average MAP estimates for Y ∼ M, by δ), which shows that parameters change roughly linearly over most of the unit interval, with the most extreme changes occurring near δ = 1. Importantly, shrinkage is approximately linear near the non-informative limit (δ = 0), suggesting that small differences in the choice of δ do not have a large impact on the posterior mode (a convenient property when selecting minimally informative priors for regularization purposes). A more quantitative picture emerges from Table 2, which shows the mean MAP estimates for each parameter as a function of δ. We observe that choosing δ ≤ 0.02 yields estimates that are extremely close to the MLE (agreeing to 2-3 decimal places), while still placing sufficient weight on the prior to be useful for regularization (i.e., to ensure that the mean value parameter lies in the relative interior of the convex hull of possible statistics). We now turn to the frequentist properties of the MAP estimate as a function of δ. Here we compare the MAP estimate (and the 95% posterior intervals arising from the Laplace approximation) to the coefficients of the true model M, θ_0 = (−5.89, 0.53, 1.87); at the other extreme, we have the natural parameter equivalent of the location of the conjugate prior, (−4.63, 0, 0). Table 3 shows the estimated bias and frequentist coverage probability for our simulation sample, as a function of δ. As can be seen, bias is minimal until δ ≈ 0.02, becoming substantial for δ > 0.1. Likewise, the 95% posterior intervals maintain good frequentist calibration until roughly δ ≈ 0.02, though coverage degrades rapidly thereafter. For regularizing/minimally informative applications, a choice of n_0 ≈ 0.01 (giving the prior approximately 1% of the weight of a single graph observation) would seem to be a reasonable starting point.
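The interpolation behavior described above follows from the conjugate-MAP mean-value equation: the posterior mode solves the likelihood equations at the pseudo-target δτ + (1 − δ)t_obs. A minimal sketch for a one-statistic Bernoulli toy model (the prior and observed edge counts below are illustrative, not those of model M):

```python
import numpy as np

def map_estimate(delta, t_obs, tau, n_pairs=4005):
    """Conjugate-MAP for a Bernoulli graph model: the posterior mode solves
    the mean-value equation at the convex combination delta*tau + (1-delta)*t_obs."""
    t_star = delta * tau + (1 - delta) * t_obs   # pseudo-target statistic
    p = t_star / n_pairs                         # implied mean-value parameter
    return np.log(p / (1 - p))                   # back to natural parameters

tau, t_obs = 250.0, 180.0   # prior expected vs. observed edge count (illustrative)
for delta in (0.0, 0.01, 0.02, 0.5, 1.0):
    print(delta, round(map_estimate(delta, t_obs, tau), 4))
```

At δ = 0 this recovers the MLE, at δ = 1 the natural-parameter equivalent of τ; the near-linearity of the shrinkage for small δ is visible directly in the printed values.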

Applications
To demonstrate the pooled ERGM/conjugate prior analysis in practice, we provide two illustrative applications. The first is to the analysis of brain functional connectivity networks, where we seek a common model for brain structure across individuals. The second considers the use of ERGMs to model variation in protein structures obtained by X-ray crystallography, in this case using hen egg-white lysozyme (a widely studied reference protein). In each case, we show how the approach used here facilitates the simultaneous analysis of multiple networks, and provides a fast and simple means of performing Bayesian inference.

Pooled ERGM Analysis for Brain Functional Connectivity Networks
The study of group-based brain functional connectivity networks has become a topic of increasing interest in neuroscience, due to the need to characterize both central tendencies and patterns of variation in interactions among brain regions. Importantly, it is of interest not only to measure specific or mean interactions, but to characterize the distributions of interaction patterns arising under particular conditions and/or within particular subpopulations. ERGMs have been identified as a promising tool for this purpose, due to their ability to assess how local brain network features give rise to global structure, and due to their capacity to account for both heterogeneity and dependence among interactions [41, 88].
Brain functional connectivity networks often exhibit both functional segregation and integration [89]: functional segregation is the ability for specialized processing to occur within densely interconnected groups of brain regions, while functional integration corresponds to the ability to rapidly amalgamate specialized information from scattered brain regions. In an attempt to produce a model with network sufficient statistics able to capture these two concurrent, opposing driving forces, [42] proposed to first select the "best" metrics from a broader set of potential candidates identified in the literature (using model selection techniques for ERGMs), and then to refit the networks of all subjects with those "best" metrics. They then employed the mean (respectively, median) of the resulting individual estimates as the estimate of a global, group-level "representative" whole-brain connectivity network model (which they refer to as a "mean" (respectively, "median") ERGM). This method of amalgamating models in the natural parameter space is straightforward and intuitive, but has several disadvantages: as shown in Eq. 4, the appropriate pooling for a joint ERGM occurs in the mean value parameter space rather than the natural parameter space; separate estimation of an ERGM for each individual is computationally expensive (and, for the MLE, may encounter problems if some individuals' networks have statistics that lie on a face of the convex hull of potential statistics); the statistical properties of the amalgamated model are unclear (especially in the median case); and model selection by this approach does not exploit the joint likelihood (which may lead to an inferior pooled model).
By contrast, a pooled-ERGM approach provides a more principled and computationally efficient alternative to the mean/median ERGM approach. For large n, the properties of the pooled estimates and their confidence intervals are ensured by the large-sample theory of exponential families, and as shown in Section 3.1, good results can be obtained with even modest numbers of graphs. Moreover, instead of having to fit each observed network separately, as proposed in [42] (with the risk that the MLE will not exist in one or more cases), exactly one ERGM fit is required (and the target statistics for that fit lie on a face of the convex hull only if all input networks do as well). Additionally, the ability to use conjugate-MAP inference for pooled ERGMs provides an inexpensive way of obtaining approximate Bayesian answers where desired, or (when viewing the prior as a regularizer) of obtaining regularized likelihood estimates. Here, we demonstrate all three approaches in the context of brain functional connectivity networks, building on prior work by [41, 42]. Due to the large number of model fits required for cross-validation, we use ergm's stochastic approximation method for estimation in this section, with all Markov chains having a thinning interval of 5 × 10^4 following 2 × 10^5 burn-in iterations.
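The distinction between pooling in the mean-value space (Eq. 4) and averaging per-graph natural-parameter estimates can be seen even in a dyad-independent toy example, where the two procedures give different answers because the logit is nonlinear (Jensen's inequality). A sketch under illustrative values, not the brain-network model itself:

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs, theta0 = 4005, -3.0                 # 4005 dyads ~ a 90-node undirected graph
p0 = 1 / (1 + np.exp(-theta0))
edges = rng.binomial(n_pairs, p0, size=10)   # edge counts for 10 "subjects"

def logit(p):
    return np.log(p / (1 - p))

pooled = logit(edges.mean() / n_pairs)       # one fit: pool the sufficient statistics
mean_of_mles = logit(edges / n_pairs).mean() # fit each graph, then average the thetas

print(pooled, mean_of_mles)                  # close here, but not equal
```

In this near-linear regime the discrepancy is small, but only the pooled estimate is the joint MLE; the "mean ERGM" quantity has no such guarantee, and the gap grows as the per-graph estimates become more variable.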

Data
We consider the data reported in [41, 42], which include brain functional connectivity networks for 10 normal subjects (5 female; average age 27.7 years, standard deviation 4.7 years) who were part (Subjects No. 002, 003, 005, 008, 009, 010, 012, 013, 016, 021) of a larger functional MRI study of age-related changes in cross-modal deactivations [90]. Fig. 2 depicts the brain connectivity networks of subjects 002 and 003, illustrating both common properties (e.g., clustering, increased probability of ties within brain regions) and heterogeneity across networks; here, we are interested in capturing this distribution via an ERGM form. Note that brain connectivity networks are defined on equivalent sets of nodes, which here correspond to 90 prespecified brain regions of interest (ROIs), according to the Automated Anatomical Labeling (AAL) atlas [91]. Each of these 10 brain connectivity networks is represented by a binary adjacency matrix, in which element (i, j) denotes the presence or absence of a functional connection between node i and node j. Binary functional connections were established by thresholding the temporal correlation coefficient adjusted for motion and physiological noise (see [41, 92] for further details), and hence these brain networks are undirected by construction. The thresholds were selected by the original authors at the subject level so that each network satisfies log(n)/log(d̄) ≈ 2.8, or equivalently d̄ ≈ n^(1/2.8), where n is the total number of nodes (here, n = 90) and d̄ the mean degree.
Covariate information associated with these networks includes not only the nodal covariates Hemisphere and Area, but also an edge-level covariate giving the spatial distance matrix among the ROIs (mean: 76.28 mm, SD: 28.93 mm). The 90 regions are divided symmetrically across the left and right hemispheres, with each hemisphere consisting of 28 areas, as presented in Table 4. Prior analyses of these data [41, 42] employed models based on three network statistics: edge count (Edges), GWESP, and geometrically weighted null shared partners (GWNSP), with the latter two proposed as proxies for functional segregation and functional integration, respectively. Such a model specification yields a homogeneous ERGM that is permutation invariant [16, 35], which leaves covariate information underutilized, and in turn makes estimation difficult and unstable due to multimodality of the distribution [65]. As with multicollinearity in regression, it can be problematic to include two closely correlated network statistics in an ERGM, and the joint presence of GWESP and GWNSP in previous models is associated with convergence issues in the present case. Here, we thus modify and extend the homogeneous model used in prior work by incorporating node-level heterogeneity and distance effects associated with the spatial structure of the brain, along with a less collinear combination of GWESP and graphletCount(1) terms to capture dependence. Specifically, we include as covariates: a homophily effect for hemisphere (hemisphere-nodematch), as introduced in [43]; a mixing effect for brain regions (Area-nodemix), measuring the strength of interaction between brain regions belonging to different areas; and a dyadic covariate controlling for spatial proximity (log.spatial.dist-edgecov), implemented as an effect for the log of the distance between regions (a common choice for modeling geographical effects, e.g., [94]). In addition to
providing substantive insight into the drivers of connectivity, such covariate effects also improve model performance by separating clustering and bridging due to physical brain structure from emergent network properties arising from dependence effects. The latter are captured by two terms. First, a GWESP term with decay fixed at 0.5 (GWESP, φ = 0.5) is included to capture residual tendencies towards endogenous local clustering net of controls; second, a graphletCount(1) term [95] helps capture open two-path structure (i.e., local bridging) like that previously examined using GWNSP in the models of [41, 42].

Results
Table 5 presents maximum likelihood estimates of model coefficients and associated standard errors for the group-based brain connectivity network model under pooling, enabling us to infer the extent to which each of the proposed effects shapes the overall distribution of networks across test subjects. We see a positive and statistically significant parameter estimate for the GWESP statistic, indicating high levels of triadic closure net of spatial and anatomical features; this is compatible with the theory of functional segregation proposed in prior work. Likewise, we see that bridging is significantly disfavored (i.e., a negative effect for graphlet 1), suggesting that open triads tend not to persist (net of other factors). In estimating mixing effects, we aggregate all areas other than Frontal and Temporal into a single "Others" level, due to the small sizes of these regions, yielding a tripartite mixing structure; we see inhibition of ties between different areas, and null or positive tendencies towards formation of within-area ties, which provides additional evidence for functional segregation. We note that these effects persist net of the overall inhibition of ties between more distant regions, with tie probability declining (ceteris paribus) approximately in inverse proportion to the distance between nodes. An important exception is the case of cross-hemispheric interactions, which are actually favored (the negative nodematch coefficient indicating that within-hemisphere interactions are disfavored relative to those that cross hemispheres). This can be viewed as an indicator of functional integration, with the need for coordination across hemispheres working against the general tendency against long-range ties. Care is required in the quantitative interpretation of the positive Edges coefficient, given the presence of log.spatial.dist-edgecov. Specifically, note that the mean pairwise distance between ROIs is 76.28 mm, and hence at the mean, log(76.28) ≈ 4.334, we have Pr(Y_ij = 1 | Y^c_ij = y^c_ij) = exp(1.193 − 1.081 × 4.334) / (1 + exp(1.193 − 1.081 × 4.334)) ≈ 0.029, conditional on the rest of the graph and with all other effects held at zero. Thus, the baseline conditional probability of observing an edge (not involved in the creation of other network statistics included in the model) between a pair of regions at the average distance is still very low, as expected for sparse graphs.
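The conditional probability above can be checked directly from the quoted Edges and log-spatial-distance coefficients:

```python
import math

# Conditional log-odds of an edge at the mean inter-ROI distance (76.28 mm),
# using the Edges (1.193) and log.spatial.dist (-1.081) coefficients quoted above.
eta = 1.193 - 1.081 * math.log(76.28)
p = math.exp(eta) / (1 + math.exp(eta))   # inverse-logit
print(round(p, 4))   # ≈ 0.0295, i.e. the 0.029 reported in the text
```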

Approximate Conjugate Bayesian Analysis of Brain Functional Connectivity Networks
In this subsection, we demonstrate how one can conduct approximate conjugate Bayesian analysis, as introduced in Section 2.2, for the dual purposes of approximating full Bayesian analysis and regularization. The construction of the prior is crucial regardless of the ultimate purpose. We adopt the simulation-based approach of Section 2.2.1 to specify the prior, noting that d̄ ≈ n^(1/2.8) by construction (i.e., by choice of correlation threshold) for all brain functional connectivity networks in this dataset; we thus set τ equal to the mean of the network sufficient statistics under a Bernoulli graph with p = d̄/(n − 1) ≈ n^(1/2.8)/(n − 1) ≈ 0.056 (n = 90). The choice of relative weight δ varies depending upon the purpose, as explored and discussed in detail below.
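The simulation-based choice of τ can be sketched for the Edges statistic alone; the actual prior also requires the GWESP and other model statistics, which must be computed from full simulated graphs (e.g., via ergm), so this illustrative sketch covers only the dyad-independent component:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 90
p = n ** (1 / 2.8) / (n - 1)      # calibrate tie probability to mean degree ~ n^(1/2.8)

def edge_count(p, n, rng):
    """One Bernoulli graph draw; total edge count as the sufficient statistic."""
    return rng.binomial(n * (n - 1) // 2, p)

draws = [edge_count(p, n, rng) for _ in range(1000)]
tau_edges = np.mean(draws)        # prior expected statistic for the Edges term
print(round(p, 3), round(tau_edges, 1))
```

The calibrated p reproduces the 0.056 quoted above; the averaged edge count is the Edges component of τ.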

MAP Estimation for the Pooled Model
In the absence of strong a priori information regarding nearly all aspects of the brain functional connectivity networks except the mean degree, it is advisable to incorporate only weak prior information; we do this by assigning a small value to the hyperparameter δ, in this case setting δ = 0.02. Given the specified prior, we conduct Bayesian inference based on Algorithm 2, with the resulting parameter estimates shown in Table 6. The parameter estimates from the Bayesian analysis are very similar to those of the pooled MLE, supporting the same qualitative conclusions. However, imposing a prior on the parameter vector permits interpretation of the results in terms of Bayesian answers, which may be useful in some settings; we may also use the Laplace approximation to sample from the approximate posterior, enabling us to obtain, e.g., posterior predictive distributions for network properties that take into account uncertainty in the model parameters.

Regularizing ERGMs with MAP
As noted above, the MLE for the natural parameter of an exponential family distribution does not exist when the observed sufficient statistics lie on the relative boundary of C, the convex hull of the set of possible values of the sufficient statistics. A common case of this type in ERGM modeling arises when mixing or differential nodematch parameters are specified for networks containing many small subgroups; if any of the associated statistics are equal to 0 (e.g., there are no observed ties between two groups), then the likelihood has no finite maximizer with respect to the corresponding directions in the natural parameter space. In the context of the brain connectivity networks, there are many small areas containing few nodes, potentially leading to such a circumstance. For instance, consider an extension of our previous model intended to quantify the mixing pattern between nodes in the Occipital and Cingulum areas; we may do so by augmenting M_1 with nodemix terms involving Occipital and Cingulum, with all other terms in the model unchanged. We denote this model M_2. It happens, however, that there are no edges observed between Occipital and Cingulum in any of the networks in the dataset, and hence the vector of mean observed network sufficient statistics is no longer located in rint(C) (as the Occipital.Cingulum-nodemix value of 0 is the smallest value that can be obtained). From an optimization perspective, we are unable to obtain a finite estimate of the model coefficients for this augmented model, because the likelihood can always be further improved by moving the vector of candidate coefficients in the direction of recession. Statistically, this reflects the non-existence of the MLE. We now show how such issues can be resolved by incorporating an appropriate conjugate prior into the inference, regularizing the model and thus avoiding extreme inferences on model parameters.
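A minimal illustration of the failure, and of the conjugate-MAP fix: for a single-statistic block of dyads, the (penalized) likelihood equations set the expected statistic equal to its target, which has no finite solution when the target sits on the hull face (here, zero observed ties), but a finite one once a small prior weight pulls the target into the relative interior. The block size and prior values below are illustrative:

```python
import math

def mode(target, n_dyads=20):
    """Natural-parameter mode for a single-statistic dyad block: solves the
    mean-value equation n_dyads * p = target for the block's coefficient."""
    p = target / n_dyads
    if p <= 0 or p >= 1:
        return None                       # target on the hull face: no finite mode
    return math.log(p / (1 - p))

tau = 20 * 0.056                          # prior expected ties in the block (Bernoulli, p=0.056)
print(mode(0.0))                          # MLE target is 0: None (the MLE does not exist)
print(mode(0.01 * tau + 0.99 * 0.0))      # MAP target with delta = 0.01: finite, strongly negative
```

This mirrors the Occipital/Cingulum case: the regularized estimate is significantly negative but finite, rather than diverging along the direction of recession.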
We construct a conjugate prior of the form (8), where the hyperparameter τ is determined by calculating the mean of the network sufficient statistics observed over 1000 independent realizations of Bernoulli random graphs with p = 0.056. As our goal here is regularization, we view the prior as a convenient penalty function (rather than as a formal statement of prior knowledge), and treat δ as a hyperparameter subject to optimization. Given our pooled setting, it is natural to evaluate model performance by cross-validation (CV); specifically, we vary δ (or, equivalently, the prior sample size n_0), computing the expected squared Hamming error for each graph under leave-one-out CV based on 1000 draws from each fitted model, and select the value that minimizes the expected loss on the held-out networks.
The results of the hyperparameter tuning process are shown in Table 7. As expected, the unregularized MLE (n_0 = δ = 0) yields suboptimal performance, with improvements obtained up to n_0 = 0.004 (δ = 0.0004). Further increases in prior weight (interpreted here as the strength of the penalty function) result in diminished performance, as the fitted model is drawn towards the prior mean. We thus select δ = 0.0004 for subsequent analysis.
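The tuning loop can be sketched for a dyad-independent toy model; for simplicity this sketch scores the expected (unsquared) Hamming distance, which is available in closed form in the dyad-independent case, whereas the analysis above uses squared Hamming error estimated from 1000 simulated draws per fit. All names and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n_pairs, m = 4005, 10
graphs = rng.binomial(1, 0.05, size=(m, n_pairs))   # toy dyad-independent "networks"
tau = 0.056 * n_pairs                               # prior expected edge count

def loo_hamming(n0):
    """Leave-one-out CV: MAP-fit on m-1 graphs, then score the expected
    Hamming distance between the held-out graph and a draw from the fit."""
    loss = 0.0
    for i in range(m):
        train = np.delete(graphs, i, axis=0)
        k = train.shape[0]
        t_bar = train.sum(axis=1).mean()            # pooled sufficient statistic
        p_hat = (n0 * tau + k * t_bar) / ((n0 + k) * n_pairs)  # conjugate-MAP edge prob.
        held = graphs[i]
        # E[Hamming(Y, y_held)] for dyad-independent Y with edge probability p_hat
        loss += (held * (1 - p_hat) + (1 - held) * p_hat).sum()
    return loss / m

for n0 in (0.0, 0.004, 0.1, 1.0):
    print(n0, round(loo_hamming(n0), 1))
```

The same grid-search-over-n_0 structure applies to the real analysis, with ergm fits and simulated Hamming losses in place of the closed-form expressions.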
We may now perform penalized maximum likelihood inference, using the tuned conjugate prior as a regularizer. Table 8 shows the corresponding parameter estimates, standard errors, and significance levels for model M_2. As expected, the results for the shared effects (triangulation, spatial interaction, bridging, and hemispheric interaction) after breaking out the additional brain areas remain very similar to those seen in the unregularized MLE for the collapsed model, though we now have a more complete description of the mixing pattern among localized areas. Importantly, we also observe that the Occipital/Cingulum parameter (for which the MLE does not exist) is now well-characterized. As we would expect from the fact that none of the observed networks had Occipital/Cingulum ties, the estimated coefficient is significantly negative; however, its magnitude is now plausible (and in line with the other observed effects).

Analysis of Lysozyme Structure Networks via Pooled ERGMs
The functions of proteins and other macromolecules are heavily influenced by their three-dimensional structure. With the increasing sophistication of both experimental techniques and molecular modeling, new methods for analyzing the growing body of protein structure data are of increasing interest. Network analytic methods have emerged as particularly useful tools for this purpose, providing a rich representation of topological complexity while still offering substantial coarsening relative to atomistic structure. Among other applications, network representations of protein structure have been used to identify functionally important residues [96], summarize protein dynamics [97], identify functionally significant sub-units [98], distinguish active site conformations [7], and characterize structural differences between protein families [99].
One potential application of ERGMs in the context of protein structure is the characterization of variation within structures of the same protein (either in equilibrium, or in different functional or measurement contexts). ERGMs were first applied to protein structure networks by [95], who used them to control for intrinsic molecular features (e.g., chain membership) while testing hypotheses regarding fold-specific structure. In more recent work, ERGMs have been employed to characterize transient structure in intrinsically disordered proteins [40], and to model protein aggregation [8, 100]. Here, we consider the problem of characterizing variation in measured protein structures obtained via X-ray crystallography (the primary workhorse technique of modern structural biology). While it is common to treat globular proteins as having a native fold associated with a single three-dimensional structure obtained via crystallographic methods (or, more rarely, nuclear magnetic resonance, neutron scattering, or cryo-EM), proteins in solution are extremely dynamic; even in a crystallographic context, repeated crystallization of the same protein will often yield slightly different structures.
Currently, this variation is not well-characterized, and is often ignored (with a single conformation selected as "the" structure of the protein). Statistically, it is natural to think of the observed structures as being drawn from a broader distribution of low-energy conformations, and to attempt to model this distribution using the measured conformations. Here, we apply this notion to observed variation in crystal structures of hen egg-white lysozyme (a widely used reference protein in biophysical research). Lysozyme (N-acetylmuramide glycanhydrolase) is an enzymatic antimicrobial agent produced as part of the innate immune system. A glycoside hydrolase, lysozyme attacks polysaccharides within bacterial cell walls, compromising their integrity and ultimately causing cell lysis; as such, it is produced in large quantities in settings where bacterial growth must be discouraged (e.g., eggs, tears, milk). Our data consist of network representations of 66 independently solved lysozyme structures, each of which is formed from the 129 residues (i.e., amino acids) constituting the main chain of wild-type hen egg-white lysozyme (residues 19-147 of UniProt B8YK79). Atomistic protein structures were obtained from the Protein Data Bank (PDB; https://www.rcsb.org/pdb/home/home.do), with the search query limited to X-ray crystallography structures containing only the 129-residue main chain with no modified or substituted residues, missing residues, ligands, or other complexes. Where more than one distinct conformation appeared in the asymmetric unit, each was isolated and treated as a separate conformation for purposes of analysis. Each isolated protein structure was protonated using REDUCE [101], with the resulting coordinates employed to generate a residue-level protein structure network (i.e.,
an undirected adjacency matrix of size 129 × 129) according to the pairwise distances among residues: any pair of residues is considered adjacent if they contain respective atoms that are closer together than 1.2 times the sum of their van der Waals radii. Two representative lysozyme structure networks are displayed in Fig. 3; while the conformations are very similar, they do show subtle differences (compare, e.g., the residues in the top right). A 3D molecular structure of lysozyme is shown in Fig. 4, together with the equivalent protein structure network (PSN).

Model Specification
Model terms: Our model specification includes three categories of effects: covariates relating to residue properties that enhance or inhibit interaction; "contextual" covariates relating to the overall fold of the protein; and dependencies among contacts arising from steric and other effects. Beginning with the first group, we add a Coulomb-like term for interactions based on nominal residue charge (ChargeMatch-edgecov), coded as 1 for pairs with complementary charges, -1 for pairs with non-complementary charges (i.e., positive/positive or negative/negative), and 0 otherwise. We include two terms for Polar/Polar (PolPol-edgecov) and Nonpolar/Nonpolar (NPolNPol-edgecov) residue pairs, respectively, accounting for the fact that the two affinities are non-identical. To account for the distinctive interaction patterns of aromatic residues, we include an overall effect for interaction by aromatics (Aromatic-nodecov), as well as an effect for pairwise interactions among aromatic residues per se (referred to mnemonically as PiStack-edgecov). Finally, we account for the greater contact potential of larger residues by incorporating a term for residue surface area (SurfaceArea-nodecov).
With respect to the second class of effects, we first observe that distance along the protein backbone is an important predictor of interaction, and we hence include the logged backbone distance as an edgewise covariate (logBBDist-edgecov); separately, we also incorporate adjacency along the backbone as a support constraint (reflecting the fact that each residue is covalently bound to its backbone neighbors). Because our objective is to model variation in folded lysozyme structures (rather than to predict the fold de novo), we incorporate an effect for the average distances among residues. Specifically, we encode the log of the mean distance between alpha carbons (Cα) for every residue pair (taken over all structures) as an edge covariate (logMeanDist-edgecov), expressing the intuition that residues that are on average spatially proximate in folded lysozyme are more likely to be adjacent in any particular structure. To account for the fact that surface residues have solvent and/or crystal contacts that are not captured by the structure (resulting in a lower mean degree within the PSN), we also include the mean Cα distance from the coordinate center as a nodal covariate (meanCADist-nodecov). To adjust for differences in the ability of larger or bulkier residues to form contacts at longer Cα distances, we also add respective product terms (i.e., interaction effects in a statistical rather than relational sense) between the aromatic and surface area statistics and the log Cα distances (logMeanDistAro-edgecov and logMeanDistSurf-edgecov).
Finally, we consider terms relating to the interdependence among contacts. To model the fact that each of a residue's existing contacts increases the difficulty of forming new contacts, we include a 2-star term (2-stars); likewise, we include a triangle term (Triangles) to account for the increasing difficulty of forming large cliques. (While both such terms are rarely used in, e.g., social network settings, due to their propensity to produce degenerate models when their associated coefficients are positive, they can be important for capturing geometric constraints in physical systems; since the associated coefficients are generally negative in such cases, they do not lead to runaway clique formation.) Although large cliques are strongly suppressed by packing constraints, PSNs are nonetheless highly triangulated. We thus combine the (hypothesized negative) triangle term with a GWESP term (here using a decay parameter of 0.8, identified by a pilot fit to a single graph).
Prior specification: To specify the prior for conjugate MAP, we begin with the approximation that the mean degree of a fully buried core residue will be approximately 12 (based on a standard sphere-packing approximation; see e.g. [104]). In practice, however, many potential contacts are "lost" due to residues not being completely surrounded by other residues (i.e., lying on the surface). To approximate the fraction of possible contacts that are "lost" in this way, we begin by approximating the expected surface area of the protein that would be available for residue-residue contacts if all residues were buried; paradoxically, this is the surface area of the fully unfolded protein. [105] show that the empirical model A_u = 1.48 M provides an excellent approximation to the unfolded surface area of monomeric proteins (where A_u is the surface area in square Angstroms, and M is the molecular mass in Daltons). For the surface area of a folded protein, they likewise report the model A_f = 6.3 M^0.73 (with the same units as above). We may approximate the fraction of possible contacts "lost" to solvent in the folded protein as A_f/A_u, and thus approximate the expected degree by d̄ ≈ 12(1 − A_f/A_u). For lysozyme, we have M = 14.3 kDa, giving us A_f ≈ 6803.554 Å², A_u ≈ 21185 Å², and d̄ ≈ 8.15 (i.e., about 32% of potential residue contacts are predicted to be lost). Although obtained entirely via a priori considerations, this expected degree is quite close to the observed mean degree for the lysozyme structures in our sample (8.32), suggesting that it is indeed a reasonable choice.
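The degree calculation can be reproduced from the reported quantities; the surface-area models below (A_u = 1.48 M and A_f = 6.3 M^0.73) are reconstructed to match the numerical values quoted in the text, and small discrepancies reflect the rounding of M to 14.3 kDa:

```python
# Expected contact degree for folded lysozyme via the empirical surface-area
# models of [105] (coefficients reconstructed from the quoted values).
M = 14300.0                     # hen egg-white lysozyme mass in Daltons (~14.3 kDa)
A_u = 1.48 * M                  # unfolded accessible surface area (A^2)
A_f = 6.3 * M ** 0.73           # folded accessible surface area (A^2)
lost = A_f / A_u                # fraction of potential contacts lost to solvent (~0.32)
d_bar = 12 * (1 - lost)         # sphere-packing degree, discounted for surface loss
print(round(A_u), round(A_f), round(d_bar, 2))   # d_bar ~ 8.15 as reported
```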
To obtain τ, we simulate 1000 conditional Bernoulli graph draws with mean degree d̄, subject to the constraint that all backbone-adjacent residues are tied, and take τ equal to the means of the sufficient statistics over the sample.
To set the prior weight (n_0, and hence δ), we observe that our prior information is fairly vague, and we would want the data to outweigh the prior even for a single graph observation. We thus set n_0 = 0.1, making the prior weight equivalent to one tenth of a single graph observation. For our data set, with m = 66, this implies a net prior weight of δ ≈ 0.0015.
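The prior-weight arithmetic can be checked directly; the formula δ = n_0/(n_0 + m) is our reading of the relationship between n_0, m, and δ implied by the quoted value:

```python
# Relative prior weight of n0 graph-equivalents of prior information
# against m observed graphs (assumed form, consistent with the text).
def prior_weight(n0, m):
    return n0 / (n0 + m)

delta = prior_weight(0.1, 66)  # approximately 0.0015, as quoted
```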

Results
We perform conjugate MAP inference for the pooled ERGM on the 66 lysozyme PSNs, using the above-specified model; estimation was performed using ergm under default settings, incorporating the backbone-adjacency support constraint. The resulting parameter estimates are provided in Table 9. The model parameters can be interpreted in terms of the conditional log-odds of an edge between two nodes i and j, bearing in mind that many effects necessarily act simultaneously. The mean non-covalent nearest neighbor distance is approximately 3.8 Å, and the second-nearest is approximately 5.1 Å, so bulk is a positive predictor of interaction for the vast majority of potential interactions. The minor inhibitory effect at very small distances, like that of aromaticity, may reflect steric hindrance.
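The conditional log-odds interpretation used above follows the standard ERGM change-statistic form, sketched here in our own notation:

```latex
\operatorname{logit}\,\Pr\!\left(Y_{ij}=1 \mid Y^{c}_{ij}=y^{c}_{ij}\right)
  = \theta^{\top}\,\Delta_{ij} g(y),
```

where Δ_ij g(y) is the change in the vector of sufficient statistics when edge {i, j} is toggled from absent to present, holding the rest of the graph fixed; each coefficient thus shifts the log-odds of a tie by its associated change statistic.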
Similar subtlety is seen in the mildly positive effect of backbone distance (net of spatial distance), likely reflecting the tendency of the backbone to fold back on itself (creating strong bridges between parts of the protein that are distant in sequence space). Note that, marginally, we find that contact probabilities fall off as roughly BBDist^(-5/4), so this softening effect should not be confused with a net tendency for tie probability to increase with backbone distance. Rather, we find that when sequence-distant residues happen to be spatially proximate, they are particularly likely to be in contact. Less nuance is needed to interpret the effect of distance from the origin, or of the mean Cα distance between residues: both inhibit contact. The latter effect is, as has been observed, very large, in keeping with the constraints of a folded protein. Finally, we observe that, net of everything else, existing contacts have an inhibitory effect on new ones (the negative 2-star parameter), cliques are strongly suppressed (the negative triangle parameter), and there is an overall tendency towards triangulation net of clique suppression (the positive GWESP parameter).
As a model adequacy check, we take 1000 draws from the posterior predictive distribution (based on the Laplace approximation), comparing the distributions of several standard structural properties (degree distribution, ESP distribution, geodesic distance distribution, and triad census) with the observed data means. The result is shown in Fig. 5. As can be seen, the model is able to recapitulate all of the above features, indicating that it does a reasonable job of capturing the basic structural properties of the lysozyme networks.

Reproducing Structural Variability
As noted above, one potential use for ERGM analysis of protein structures is to characterize variability, and to identify dimensions of structural variation that may be imperfectly constrained by available data. Here, we simulate draws from the fitted lysozyme model and examine their range of variation with respect to four basic graph-level indices (GLIs) found by [99] to distinguish protein structures. These are:

• Transitivity [106] - a standard measure of triadic closure in network analysis; transitivity reflects the compactness of a PSN, in the sense that higher levels of transitivity are associated with structures that are closely and uniformly packed.
• Standard deviation of degree distribution -a measure of the level of heterogeneity in local packing around chemical groups.
• Standard deviation of the core number [107] - an indicator of the degree of heterogeneity in structural cohesion, distinguishing highly organized structures from structures that combine rigidly and loosely bound regions.
• Standard deviation of M-eccentricity - the idea of M-eccentricity stems from eccentricity [108], and was introduced in the context of PSN analysis by [99]. The M-eccentricity of a vertex is the mean distance from that vertex to all other vertices; vertices with low M-eccentricity are more centrally located, while those with high M-eccentricity are peripheral to the graph structure. The standard deviation of M-eccentricity thus distinguishes between uniformly globular structures and structures with deformations or other elongations.

Fig. 5 shows the distribution of the above GLI values for the observed lysozyme networks and for posterior predictive draws from the pooled ERGM; GLIs were calculated using the sna library [86]. For each of the GLI distributions, we can see that the posterior predictives cover the observed distributions, while being somewhat wider (reflecting posterior uncertainty). Such distributions have potential uses for, e.g., statistical comparison of protein families or variants from pooled crystallographic data, where accounting for uncertainty in the distribution of structural properties is an important consideration.
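For concreteness, three of these GLIs can be sketched directly from a binary adjacency matrix (the analysis itself used the sna library; core-number dispersion is omitted for brevity, and function names are ours):

```python
import numpy as np

# Minimal sketches of transitivity, degree-distribution SD, and
# M-eccentricity SD for a symmetric binary adjacency matrix.
def transitivity(a):
    tri = np.trace(a @ a @ a) / 6.0                  # triangle count
    deg = a.sum(axis=1)
    two_stars = (deg * (deg - 1)).sum() / 2.0        # 2-star count
    return 3.0 * tri / two_stars if two_stars else 0.0

def degree_sd(a):
    return float(a.sum(axis=1).std())

def m_eccentricity_sd(a):
    n = len(a)
    d = np.where(a > 0, 1.0, np.inf)                 # geodesic distances
    np.fill_diagonal(d, 0.0)
    for k in range(n):                               # Floyd-Warshall
        d = np.minimum(d, d[:, [k]] + d[[k], :])
    m_ecc = d.sum(axis=1) / (n - 1)                  # mean distance to others
    return float(m_ecc.std())
```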

Conclusion
We have presented a highly scalable approach for modeling multiple network observations with ERGMs, under both frequentist and Bayesian paradigms, utilizing basic exponential family properties to perform pooling and/or Bayesian updating entirely within the mean value space. This allows us to perform inference on arbitrary numbers of graphs at no additional cost, and to perform Bayesian inference at the same cost as maximum likelihood estimation. Moreover, by mapping the inferential problem to a problem involving a single network, it is possible to perform both pooled and Bayesian inference with standard software packages designed for single-network applications, without resorting to techniques like graph aggregation with structural zeros that add complexity and computational cost. Simulation experiments suggest that the frequentist properties of the pooling procedure are quite good (with minimal bias and good calibration even at small sample sizes), and conjugate-prior MAP inference yields well-behaved interpolation between the prior parameters and the MLE. Conjugate-prior MAP estimates with a simple default prior were also found to have good frequentist properties for a range of diffuse prior weights, suggesting its value as a simple tool for regularized inference (with the most important use case being settings where the MLE does not exist due to the convex hull problem). Although this work focused on a specific choice of default prior that is analogous to a zero vector in the natural parameter space (with the exception of the edge parameter, which is corrected for prior density) - a natural analog to the zero-centered priors used in existing strategies for Bayesian ERGM inference - the fact that the conjugate prior is specified in the mean value space (i.e., the space of graph statistics) makes it particularly easy to specify informative alternatives based on e.g. prior data sets.
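The heart of the mean-value-space scheme can be sketched in a few lines: the pooled MLE targets the average sufficient statistic, and the conjugate MAP targets a convex combination of the prior mean τ and the data mean, after which any single-network fitter that matches a target statistic vector can be reused unchanged. Function names and the δ formula are illustrative of the scheme, not the authors' code.

```python
import numpy as np

# Sketch of pooling and conjugate updating in the mean value space.
def pooled_target(stats):
    """stats: (m, k) array of per-graph sufficient statistic vectors."""
    return np.asarray(stats, dtype=float).mean(axis=0)

def conjugate_map_target(stats, tau, n0):
    """Convex combination of prior mean tau and data mean g_bar."""
    m = len(stats)
    g_bar = pooled_target(stats)
    delta = n0 / (n0 + m)                    # relative prior weight
    return delta * np.asarray(tau, dtype=float) + (1.0 - delta) * g_bar
```

With n0 = 0 this reduces to the pooled MLE target; as n0 grows, the target statistic vector shrinks toward τ.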
We demonstrated the applicability of our inferential scheme with two applications, to brain functional connectivity networks and to protein structure networks. In both cases, the ability to quickly and easily pool network data without additional computational cost, and to use either Bayesian or frequentist inference, facilitates analysis. We also show how the regularization offered by the use of prior structure makes it possible to include theoretically interesting mixing terms that (because their statistics lie on the convex hull) are problematic under MLE, and how prior substantive information (here, simple empirical models of the properties of monomeric proteins) can be used to create reasonable prior specifications even in the absence of existing network data.
The results shown here were produced using the MCMC MLE estimation strategy of the ergm package, but the idea can easily be adapted to any other ERGM estimation scheme based on fitting to the sufficient statistics (e.g., contrastive divergence, stochastic approximation, the log partition function scheme of [109], or other forms of gradient descent). It is not compatible with approximate likelihood methods such as maximum pseudo-likelihood estimation (MPLE) that operate directly on edge variables, although we observe that it is still possible to initialize estimation by MPLE on a single graph from the set and then proceed with methods based on statistics (as was in fact done here), or otherwise to use methods such as contrastive divergence that are similar in speed and accuracy. We do note that one side effect of the high level of statistical precision obtainable from pooled network models is that de facto accuracy eventually begins to depend more on numerical error than on statistical uncertainty. While we find that calibration remains good for the range of data sizes considered in our simulation study, precise inference for very large collections of networks may require greater attention to numerical stability than is necessary for conventional ERGM inference. Efficient high-precision algorithms for pooled models in the large-m regime would seem to be an important problem for future work.

Appendix A
For convenience, we here provide the definitions of a number of common families of ERGM terms used in this paper.

Edges
The edge term counts the number of edges within the graph; since the graphs used here are all undirected, we show the undirected version: t_E(y) = Σ_{i<j} y_ij. The edge term acts as a de facto intercept for the model, setting the baseline tie probability, and is central in fixing the expected density of the network.
Geometrically weighted edgewise shared partners (GWESP)
t_GWESP(y; φ) = e^φ Σ_{k=1}^{n-2} [1 - (1 - e^{-φ})^k] EP_k(y), where EP_k(y) is the number of connected pairs that have exactly k common neighbors; the statistic is a measure of local clustering in network research. The decay parameter φ controls the relative contribution of EP_k(y) to the GWESP statistic. A large, positive coefficient on the GWESP term tends to concentrate probability mass on graphs with more local clustering.
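Given the edgewise-shared-partner counts, the GWESP statistic is a simple geometrically weighted sum; a sketch (function name is ours):

```python
import math

# GWESP from edgewise shared-partner counts: ep_counts[k-1] is the number
# of tied pairs with exactly k shared partners; phi is the decay parameter.
def gwesp(ep_counts, phi):
    return math.exp(phi) * sum(
        (1.0 - (1.0 - math.exp(-phi)) ** k) * ep
        for k, ep in enumerate(ep_counts, start=1)
    )
```

At φ = 0 the statistic reduces to the number of tied pairs with at least one shared partner; larger φ gives higher-order shared partners more weight.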

Nodematch
Given a categorical covariate x defined for each node in the graph, a nodematch term counts the total number of edges between nodes sharing the same value of x: t_H(y; x) = Σ_{i<j} y_ij 1{x_i = x_j}.

Nodemix
Given a categorical covariate x defined for each node in the graph, a nodemix term counts the total number of edges between nodes of level k and level l on x: t_M(y; x) = Σ_{i<j} y_ij 1{(x_i, x_j) = (k, l) or (x_i, x_j) = (l, k)}.
The parameter associated with a (k, l) nodemix statistic inhibits or enhances the rate at which nodes with respective values k and l on x interact with one another; negative values hinder interaction, while positive values enhance it.
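Both counts above can be sketched directly from an adjacency matrix and a node covariate (function names are ours):

```python
import numpy as np

# Nodematch and nodemix counts for a symmetric binary adjacency matrix a
# and a categorical node covariate x.
def nodematch(a, x):
    x = np.asarray(x)
    same = x[:, None] == x[None, :]          # pairs with matching values
    return int(np.triu(a * same, 1).sum())   # count each edge once

def nodemix(a, x, k, l):
    x = np.asarray(x)
    pair = ((x[:, None] == k) & (x[None, :] == l)) | \
           ((x[:, None] == l) & (x[None, :] == k))
    return int(np.triu(a * pair, 1).sum())
```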

Nodecov
Given a covariate x defined for each node in the graph, a nodecov term is defined as t_C(y; x) = Σ_{i<j} y_ij (x_i + x_j). This statistic captures the degree of association between nodes' values of x and their net tendency to send or receive ties.
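A sketch of the nodecov count (function name is ours):

```python
import numpy as np

# Nodecov statistic: sum of x_i + x_j over the edges of the graph,
# for a symmetric binary adjacency matrix a and node covariate x.
def nodecov(a, x):
    x = np.asarray(x, dtype=float)
    pair_sum = x[:, None] + x[None, :]           # x_i + x_j for every dyad
    return float(np.triu(a * pair_sum, 1).sum()) # count each edge once
```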

Edgecov
Given a covariate x defined for each pair of nodes in the graph, an edgecov term is defined as t_X(y; x) = Σ_{i<j} y_ij x_ij. This statistic can be viewed as the product-moment of the elements of the adjacency matrix with the elements of x, and as such captures the degree of association between dyads' values on x and the presence or absence of edges.

2-stars
t_S(y) = Σ_i (d_i choose 2), where d_i is the degree of node i. This statistic counts the number of 2-star configurations within the graph (i.e., subgraphs consisting of a node with two simultaneous partners). The associated parameter can be viewed as indicating the effect of each of a node's current edges on the conditional log odds of a specific other edge involving that node being present; when negative, in particular, it can be viewed as a first-order model for hindrance (in which each existing edge makes it more difficult to acquire new edges).
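Sketches of both counts (function names are ours):

```python
import numpy as np

# Edgecov and 2-star statistics for a symmetric binary adjacency matrix a;
# the edge covariate x is a symmetric matrix of dyadic values.
def edgecov(a, x):
    return float(np.triu(np.asarray(a) * np.asarray(x), 1).sum())

def two_stars(a):
    deg = np.asarray(a).sum(axis=1)
    return int((deg * (deg - 1) // 2).sum())  # sum over nodes of C(deg, 2)
```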

Triangles
t_T(y) = Σ_{i<j<k} y_ij y_jk y_ik. This statistic counts the number of triangles within the graph (i.e., complete subgraphs of order three). The associated parameter can be viewed as indicating the effect of each of a node pair's shared partners on the conditional log odds of that pair being tied. Although prone to producing runaway clique formation when associated with a positive parameter (the famous "density explosion" path to degeneracy), negative triangle parameters can serve to inhibit the formation of large cliques (since every clique of order k contains on the order of k^3 triangles), and can thus serve as a model for packing constraints.
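The triangle count has a compact matrix form, since each triangle contributes six closed walks of length three (a sketch; the function name is ours):

```python
import numpy as np

# Triangle count via the trace of A^3: each triangle is traversed six
# times as a closed three-walk (two directions from each of three nodes).
def triangles(a):
    a = np.asarray(a, dtype=float)
    return int(round(np.trace(a @ a @ a) / 6.0))
```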

graphletCount
This statistic counts the number of times that a specific graphlet appears in a network, and can be added to an ERGM via the graphletCount command (see [95] for more details regarding its definition and implementation).
Fig 1. Average MAP estimates for Y ∼ M, by δ. Solutions interpolate between the natural parameters corresponding to the prior (δ = 1) and the MLE (δ = 0); shrinkage is nearly linear in δ near the non-informative limit.

Fig 5. Model adequacy checks for the pooled lysozyme model; shaded areas/boxes show posterior predictive intervals, while red points indicate observed mean values. The lysozyme model successfully recapitulates a range of structural features.

Table 1. Pooled MLE bias, standard error, and coverage rates of 95% Wald confidence intervals as a function of sample size.
* CP: coverage probability, approximated by coverage rates over the simulated data.

Table 2. Mean MAP estimates of model parameters under different relative prior weights when

Table 3. Frequentist properties of the MAP estimate and Laplace approximation for model parameters under different relative prior weights, with sample size fixed at m = 1.

Table 6. MAP estimates and posterior standard deviations, conjugate Bayesian analysis of brain functional connectivity networks.

Table 7. Leave-one-out cross-validation error for regularized inference on M_2 as a function of n_0.

Table 8. Model parameter estimates and standard errors for regularized inference on M_2.