Stochastic step-wise feature selection for Exponential Random Graph Models (ERGMs)

Helal El-Zaatari; Fei Yu; Michael R. Kosorok

doi:10.1371/journal.pone.0314557

Abstract

This study introduces a novel methodology for endogenous variable selection in Exponential Random Graph Models (ERGMs) to enhance the analysis of social networks across various scientific disciplines. Addressing critical challenges such as ERGM degeneracy and computational complexity, our method integrates a systematic step-wise feature selection process. This approach effectively manages the intractable normalizing constants characteristic of ERGMs, ensuring the generation of accurate and non-degenerate network models. An empirical application to nine real-life binary networks demonstrates the method’s effectiveness in accommodating network dependencies and providing meaningful insights into complex network interactions. Particularly notable is the adaptability of this methodology to both directed and undirected networks, overcoming the limitations of traditional ERGMs in capturing realistic network structures. The findings contribute to network analysis, offering a robust framework for modeling and interpreting social networks and laying a foundation for future advancements in statistical network analysis techniques.

Citation: El-Zaatari H, Yu F, Kosorok MR (2024) Stochastic step-wise feature selection for Exponential Random Graph Models (ERGMs). PLoS ONE 19(12): e0314557. https://doi.org/10.1371/journal.pone.0314557

Editor: Pablo Martin Rodriguez, Federal University of Pernambuco: Universidade Federal de Pernambuco, BRAZIL

Received: March 18, 2024; Accepted: November 12, 2024; Published: December 17, 2024

Copyright: © 2024 El-Zaatari et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting information files.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

Statistical analysis of social networks plays a pivotal role in various scientific disciplines, offering valuable insights into complex network interactions. Accurate modeling is particularly crucial when working with moderately sized networks, typically comprising a few thousand nodes, as it enables the explanation, analysis, replication, and prediction of network phenomena observed in nature. In the field of health sciences, social network analysis contributes to reducing health disparities [1] and fostering collaboration and research efficiency, leading to scientific innovations and discoveries [2]. By uncovering patterns in collaboration networks, network analysis facilitates the prediction of future connections among individuals or organizations, which holds significant value for multiple stakeholders including health policy researchers, administrators, and research sponsors [3–5].

The advancement in computational power of personal computers in the 21st century has empowered researchers to conduct sophisticated statistical modeling without relying on supercomputers [6]. One powerful technique widely used in social network research is Exponential Random Graph Models (ERGMs). ERGMs are particularly adept at capturing network dependencies by incorporating endogenous variables. However, a challenge arises when the chosen endogenous variables do not accurately capture the observed network structures, leading to ERGM degeneracy [7], a state where networks become unrealistic and uninterpretable [7–9].

Addressing the weaknesses of ERGMs presents multiple challenges that require careful consideration. First, the dependency among network observations, similar to that in longitudinal studies, invalidates models that assume independence [10]. Second, accurately modeling this dependency is crucial but complicated. While stochastic block models treat this dependency as a nuisance parameter, ERGMs aim to explicitly model and quantify it through endogenous variables. The complexity lies in selecting appropriate endogenous variables, given the vast array of choices and the risk of degeneracy through inappropriate selections. Researchers have reported at least five distinct types of ERGMs in their scholarly work, including the standard ERGM [11, 12], Bayesian ERGM [13], Temporal ERGM [14], Separable Temporal ERGM [15, 16], and Multi-level ERGM [17, 18]. They face the daunting task of choosing from thousands of potential endogenous variables [19], a process lacking systematic guidance or tools.

Therefore, this study proposes and tests a novel methodology for endogenous variable selection in ERGMs, targeting the critical challenges in the field. Our approach encompasses key aspects of variable selection, degeneracy screening, and model fitting, providing a comprehensive solution to enhance the effectiveness and reliability of ERGM modeling, particularly in collaboration networks [20]. We conduct empirical testing and rigorous analysis for the proposed statistical algorithms, aiming to facilitate more accurate and meaningful interpretations of network phenomena in various scientific disciplines.

The paper is structured as follows: Section 2 offers a mathematical definition of an ERGM, setting the foundational understanding of this class of models. Section 3 delves into the details of the endogenous variable selection procedure, which includes establishing the initial set of variables, a novel step-wise variable selection process, and an innovative degeneracy screening method based on edge count. Section 4 applies our proposed algorithms to nine real-life binary networks, presenting numerical results that include the selection of endogenous variables, the count of potential pairwise ERGMs, average counts of edges, 2-stars, and triangles, and the efficacy of our degeneracy screening approach. Section 5 discusses the significance of our methodology and testing results, potential future research directions, and limitations, underscoring the implications of our findings for advancing the field of network analysis.

2 Definition of ERGMs

A network or graph G = (V, E) consists of a set of nodes V and a set of edges E. The node set is assumed to be finite, with V = {1, …, N}. The edges represent ties between two different nodes i and j. Modeling networks is centered around the edges E of a graph. The outcome of interest Y_i,j is defined for two distinct nodes i ∈ V and j ∈ V. Depending on the type of network, this outcome Y_i,j can be binary, discrete, or real-valued. For example, with a binary outcome, Y_i,j = 1 indicates an edge between nodes i and j, while Y_i,j = 0 indicates no edge. Additionally, nodes i ∈ V can possess a collection of attributes situated in Euclidean space.

Statistical modeling of a network involves defining a probability distribution over graph G. The model space comprises a set of these probability distributions, each indexed by a parameter in the parameter space Θ. The selected probability distribution determines the complexity and details of the network model. Inspired by generalized linear models, exponential random graph models describe the probability of a tie formation Y_i,j = 1 given the nodal attributes X. For a binary outcome, ERGMs are akin to logistic regression in network data analysis [21]. Analogous relationships exist between ERGMs with discrete and continuous-valued ties and their generalized linear model counterparts, such as Poisson regression and Gamma regression, respectively. The formulation for an ERGM with a binary outcome is given in Eq (1).

P_θ(Y = y|X) = ψ(θ₁, θ₂)exp{θ₁ᵀs(y) + θ₂ᵀg(y, x)} (1)

The network with N nodes is represented via an adjacency matrix Y with realization y [22]. The vector of endogenous variables is represented via s(y) and the vector of exogenous variables is represented via g(y, x). The dimensions of these two vectors depend on the number of endogenous and exogenous variables chosen. The nodal attributes are represented by x. The vector of regression coefficients, θ, is partitioned into two sub-vectors, θ₁ and θ₂, which are the regression coefficients associated with s(y) and g(y, x) respectively. The computationally intensive normalizing constant is ψ(θ₁, θ₂).
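To make the role of the normalizing constant concrete, the following minimal sketch enumerates every undirected graph on a handful of nodes and normalizes the exponential weights directly. It is an illustration only, not the authors' implementation: the function names and the choice of an edge term plus a triangle term are assumptions made for this example. The sum runs over 2^(N(N−1)/2) graphs, which is why exact normalization is only feasible for very small N.

```python
from itertools import combinations, product
from math import exp


def triangle_count(edges, n):
    """Count triangles among the given edges; each edge is a pair (a, b) with a < b."""
    present = set(edges)
    return sum(
        1
        for a, b, c in combinations(range(n), 3)
        if (a, b) in present and (a, c) in present and (b, c) in present
    )


def ergm_distribution(n, theta_edges, theta_triangle):
    """Return {edge tuple: probability} for every undirected graph on n nodes.

    The statistics (edge count and triangle count) are illustrative choices.
    """
    dyads = list(combinations(range(n), 2))
    weights = {}
    for mask in product([0, 1], repeat=len(dyads)):
        edges = tuple(d for d, bit in zip(dyads, mask) if bit)
        linear = theta_edges * len(edges) + theta_triangle * triangle_count(edges, n)
        weights[edges] = exp(linear)
    total = sum(weights.values())  # brute-force sum over all 2^(n choose 2) graphs
    return {graph: w / total for graph, w in weights.items()}


dist = ergm_distribution(4, theta_edges=-1.0, theta_triangle=0.5)
print(len(dist))  # 64 graphs on 4 nodes
```

With 4 nodes there are only 64 graphs, but with 16 nodes there are already 2^120, so practical fitting relies on simulation-based approximations rather than this enumeration.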

3 Endogenous variable selection for ERGMs

In this section, we introduce a novel step-wise feature selection methodology for ERGMs, which adapts the classical forward selection technique [23] to address the complexity of selecting endogenous variables for ERGMs. Our approach starts by focusing on ERGMs with two predictors and employs the Akaike Information Criterion (AIC) for model assessment. This methodological framework guides the initial selection of endogenous variables and their subsequent evaluation and categorization based on AIC impacts, and concludes with degeneracy screening.

3.1 Obtaining an initial set of endogenous variables

The selection of endogenous variables for an ERGM presents a significant challenge due to the vast array of available options, including hundreds of pre-defined or user-customized variables. These variables are integral for modeling different network structures [24]. In this study, we start with thirteen commonly used endogenous variables, as identified from the ergm package in R [21]. These variables, selected for their relevance to binary undirected networks, include kstar, dyad-wise shared partners (dsp), non-edgewise shared partners (nsp), edgewise shared partners (esp), triangle, isolates, sociality, degree cross product, degree popularity, geometrically weighted edgewise shared partners (gwesp), geometrically weighted non-edgewise shared partners (gwnsp), geometrically weighted dyad-wise shared partners (gwdsp) and geometrically weighted degree. The kstar, esp, dsp and nsp endogenous variables require an upper bound, whereas the other nine endogenous variables do not. These endogenous variables represent characteristics of social networks such as reciprocity, transitivity and centrality. For example, the triangle term measures the transitivity of a network. If a triangle term is included in an ERGM with a positive regression coefficient, it means that transitivity is a key feature of the observed network [25].

We employ a systematic method to select endogenous variables by establishing an informed upper bound, thus providing a structured way to refine choices, particularly for variables requiring a natural number input. For example, the dsp variable is a network statistic equal to the number of dyads with exactly k shared partners [19], and demands the selection of an input k, a natural number. Given the vast range of possibilities, it is necessary to set an upper limit for k. Similarly, variables such as kstar, esp, and nsp also require a natural number to be well-defined [21]. Different natural numbers k lead to distinct network structures, as illustrated by the difference between dsp(2) and dsp(3) (Fig 1).

Fig 1. Illustration of the dyadwise shared partner endogenous variable with k = 1, k = 2 and k = 3 respectively.

(Source: The original graph appeared as Fig 11 in [26]).

https://doi.org/10.1371/journal.pone.0314557.g001
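The dependence of dsp on its input k can be illustrated with a short sketch that counts dyads by their number of shared partners. This is a plain re-implementation of the definition given above for illustration, not code from the ergm package; the adjacency-dictionary representation and the toy 4-cycle are assumptions chosen for this example.

```python
from itertools import combinations


def dsp(adjacency, k):
    """Count dyads (i, j), i < j, that have exactly k shared partners.

    adjacency maps each node to the set of its neighbours (undirected graph).
    """
    nodes = sorted(adjacency)
    return sum(
        1
        for i, j in combinations(nodes, 2)
        if len(adjacency[i] & adjacency[j]) == k
    )


# Toy undirected graph (hypothetical data): the 4-cycle 0-1-2-3-0.
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
for k in range(3):
    print(k, dsp(cycle, k))
```

In the 4-cycle the two pairs of opposite corners share two partners each and every adjacent pair shares none, so the statistic changes sharply with k even on a tiny graph.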

Because considering a countably infinite number of endogenous variables is impractical, our approach sequentially fits uni-variate ERGMs, starting from k = 1 and progressing until we reach a specific cutoff point, k = N_k. This cutoff, N_k, is determined when three or more consecutive parameter estimates reach negative infinity. Beyond N_k, further consideration of endogenous variables becomes impractical, as they lead to coefficient estimates of negative infinity. The identification of N_k plays a pivotal role in dictating the size of the initial set of endogenous variables. Applying this method to an observed network results in a set of M candidate endogenous variables, {s₁(y), …, s_M(y)}, with the set’s size determined by the upper bounds of variables like dsp, esp, nsp and kstar, together with the remaining nine endogenous variables.

3.2 Stochastic forward selection

This section delves into the Stochastic Forward Selection process, the core of our proposed methodology. We start with a basic ERGM featuring only an edge term, akin to the intercept in a linear model. This baseline model, also known as a Bernoulli Random Graph, assumes that the probability of a tie formation between two nodes follows a Bernoulli distribution independent of other ties within the network [10]. This assumption does not account for observation dependence, an important factor in network data.
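For the edge-only baseline, the maximum likelihood estimate and the AIC are available in closed form because ties are independent Bernoulli draws. The sketch below computes both for an undirected network; the function name and the example sizes (16 nodes, 20 edges) are illustrative assumptions, and the formula requires a density strictly between 0 and 1.

```python
from math import log


def null_model_fit(n_nodes, n_edges):
    """Closed-form fit of the edge-only (Bernoulli) model for an undirected graph.

    Returns (theta0, aic0), where theta0 is the log-odds of a tie and
    aic0 = 2 * (number of parameters) - 2 * log-likelihood.
    """
    dyads = n_nodes * (n_nodes - 1) // 2
    p = n_edges / dyads  # observed density, must satisfy 0 < p < 1
    theta0 = log(p / (1 - p))
    loglik = n_edges * log(p) + (dyads - n_edges) * log(1 - p)
    aic0 = 2 * 1 - 2 * loglik
    return theta0, aic0


theta0, aic0 = null_model_fit(n_nodes=16, n_edges=20)
print(round(theta0, 3), round(aic0, 1))
```

This value plays the part of AIC₀ in the comparison that follows; models with additional endogenous terms have no such closed form and must be fitted by simulation.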

The process involves the following steps, detailed in Algorithm 1 (Bounding the input Parameter k) and Algorithm 2 (Stochastic Forward Selection for Endogenous Variables).

Algorithm 1 Bounding the Input Parameter k

1. Endogenous Variable Requirement: Start with an endogenous variable s_i(y, l) with an input parameter l, a natural number.

2. Sequential Fitting: Fit uni-variate ERGMs with an edge term and endogenous variable s_i(y, l) beginning with l = 2.

3. Upper Bound Determination: Identify the upper bound number N_k when the parameter estimates for s_i(y, N_k+2), s_i(y, N_k+1), s_i(y, N_k) all yield negative infinity.

4. Variable Iteration: Repeat 1–3 for the following endogenous variables: dsp, esp, nsp and kstar, marking their respective upper bounds by N₁, N₂, N₃ and N₄.

5. Final Set Formation: Obtain a final set of endogenous variables {s₁(y), …, s_M(y)}.
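The stopping rule of Algorithm 1 can be sketched as a scan for the first run of three consecutive estimates equal to negative infinity. In this sketch, fit_coefficient is a hypothetical callback standing in for fitting a uni-variate ERGM with an edge term and s_i(y, l) and returning the estimated coefficient; the max_k safeguard is likewise an assumption added so the loop always terminates.

```python
import math


def upper_bound_k(fit_coefficient, start=2, max_k=1000, run_length=3):
    """Return the first l at which run_length consecutive estimates are -inf.

    fit_coefficient(l) is a hypothetical stand-in for an ERGM fit.
    Returns None if no such run is found up to max_k.
    """
    run_start = None
    run = 0
    for l in range(start, max_k + 1):
        if fit_coefficient(l) == -math.inf:
            if run == 0:
                run_start = l
            run += 1
            if run == run_length:
                return run_start
        else:
            run = 0  # an isolated -inf estimate does not count as the bound
    return None


def fake_fit(l):
    """Toy stand-in: one isolated -inf at l = 3, then -inf from l = 5 onward."""
    return -math.inf if l == 3 or l >= 5 else -0.5 * l


print(upper_bound_k(fake_fit))
```

Requiring a run rather than a single infinite estimate guards against stopping early when one intermediate value of l happens not to occur in the observed network.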

Building upon Algorithm 1, we categorize the endogenous variables based on their observed relative AIC changes during the stochastic forward selection process. This categorization is essential in distinguishing variables that significantly enhance the model from those that may lead to degenerate models or yield ambiguous results.

Category 1: Endogenous variables that consistently lower the AIC compared to the null ERGM, indicating a positive contribution and suggesting their inclusion in the model.
Category 2: Variables that lead to degenerate ERGMs, marked by a very negative relative AIC change or a consistently negative mean relative AIC change, suggesting their exclusion.
Category 3: Variables with ambiguous impact on the AIC, possibly due to poor initial parameter estimates or lack of predictive power. Here, the 10th percentile of the relative AIC change is crucial. A positive 10th percentile indicates potential predictive power, while a negative value suggests exclusion.

Therefore, the null model serves as a baseline for comparing candidate ERGMs. We systematically fit uni-variate ERGMs for each candidate endogenous variable s_i(y), i ∈ {1, …, M}, as shown in Eq (2).

P_θ(Y = y) = ψ(θ₀, θ_i)exp{θ₀s₀(y) + θ_is_i(y)} (2)

The edge term and its associated regression coefficient are represented by s₀(y) and θ₀, respectively. The endogenous variable under consideration is s_i(y), with its regression coefficient θ_i. The null ERGM’s AIC is denoted by AIC₀, and the AIC for each uni-variate ERGM is denoted by AIC_i for i ∈ {1, …, M}. The selection of the endogenous variable s_i(y) is contingent upon the relative AIC change b_i shown in Eq (3), where a positive b_i corresponds to an AIC lower than that of the null model.

b_i = (AIC₀ − AIC_i) / AIC₀ (3)

Algorithm 2 Stochastic Forward Selection for Endogenous Variables

1. Initial Set Formation: Start with a set consisting of M candidate endogenous variables.

2. Null Model Fitting: Fit a null model ERGM with only an edge term, denoting the AIC value by AIC₀.

3. Variable Assessment: For i ∈ {1, …, M}:

• Sequentially fit uni-variate ERGMs with one endogenous variable at a time; P_θ(Y = y) = ψ(θ₀, θ_i)exp{θ₀s₀(y) + θ_is_i(y)}.

• Record the estimate of the AIC for each uni-variate ERGM by AIC_i and compute the relative AIC change b_i.

• Refit the uni-variate ERGM M times and compute the 10th percentile of the relative AIC change.

4. Variable Exclusion: If the 10th percentile of the relative AIC change is not positive, then remove s_i(y) from the set of candidate endogenous variables.

This forward selection strategy effectively categorizes endogenous variables based on their influence on the AIC. Variables that consistently elevate the AIC are considered less informative and excluded. The 10th percentile of the relative AIC change plays a crucial role in this process. If it is positive, it suggests that this endogenous variable can predict network structure; otherwise, the variable should not be included. The AIC is computed via MCMC with starting values obtained via contrastive divergence [8].
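The percentile-based exclusion rule can be sketched as follows. Two points are assumptions made for this illustration: the relative AIC change is taken to be (AIC₀ − AIC_i) / AIC₀, so that positive values mean an improvement over the null model, and the percentile uses simple linear interpolation between order statistics. The AIC values in the example are invented and are not results from the paper.

```python
def relative_aic_change(aic_null, aic_model):
    """Assumed form of the relative change: positive when the model lowers the AIC."""
    return (aic_null - aic_model) / aic_null


def percentile(values, q):
    """Percentile with linear interpolation between order statistics, q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence")
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def keep_variable(aic_null, refit_aics, q=10.0):
    """Keep the variable when the q-th percentile of the relative change is positive."""
    changes = [relative_aic_change(aic_null, a) for a in refit_aics]
    return percentile(changes, q) > 0


# Hypothetical refits: 85 runs improve on the null AIC of 100, 5 runs are far worse.
refits = [95.0] * 85 + [140.0] * 5
print(keep_variable(100.0, refits))
```

Because the decision uses a low percentile rather than a single fit, a handful of poorly initialized refits does not by itself remove a variable, while a variable that rarely improves on the null model is still excluded.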

3.3 Degeneracy screening

After categorizing and selecting endogenous variables for inclusion, we now address a critical aspect of model refinement in ERGMs: degeneracy screening (Algorithm 3). This step is vital to ensure that our model remains robust and accurate, free from the distortions of multicollinearity often observed in ERGMs with multiple endogenous variables [7]. Degenerate networks, characterized by unrealistic network structures, emerge as a significant challenge in ERGMs when multicollinearity occurs due to the inclusion of numerous endogenous variables. To counteract this, our approach analyzes network motifs, which are small, statistically significant graph patterns typically comprising up to 6 nodes [26]. These motifs serve as a barometer for assessing the realism and practicality of the networks generated by our ERGM. To discard degenerate ERGMs, model selection needs to be based on network motif counts.

3.3.1 Network motifs used for model selection.

The use of network motifs for model selection is an extension of the degeneracy screening, which enables us to delve deeper into the structural analysis of the networks. To recommend non-degenerate ERGMs we count the number of edges for an observed network and compare that count to the average number of edges a candidate ERGM produces. We then compute the relative error of these counts in the last step. While the edge network motif played a central role in screening for degenerate models, our model selection process employs additional network motifs to refine the selection criteria further. Specifically, we focus on the counts of 2-stars and triangles alongside the edge counts. The rationale behind incorporating these motifs is to compare the distribution of these specific patterns in the candidate ERGMs against their occurrence in the observed network.

The selection of models is based on the alignment of the mean number of edges, 2-stars, and triangles in the ERGMs with their corresponding counts in the observed network. The closer these averages are to the observed counts, the more representative and accurate the model is considered. This approach ensures that the selected ERGM not only avoids degeneracy but also closely mirrors the actual network structure in terms of these key motifs.
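The three motif counts used here are simple to compute for an undirected network. The sketch below is an illustrative implementation under the usual definitions (a 2-star is a pair of edges sharing a node, so a node of degree d contributes d(d−1)/2 of them); the adjacency-dictionary representation and function name are assumptions for this example.

```python
from itertools import combinations


def motif_counts(adjacency):
    """Return (edges, two_stars, triangles) for an undirected simple graph.

    adjacency maps each node to the set of its neighbours.
    """
    degrees = [len(neighbours) for neighbours in adjacency.values()]
    edges = sum(degrees) // 2
    two_stars = sum(d * (d - 1) // 2 for d in degrees)
    triangles = sum(
        1
        for a, b, c in combinations(sorted(adjacency), 3)
        if b in adjacency[a] and c in adjacency[a] and c in adjacency[b]
    )
    return edges, two_stars, triangles


triangle_graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(motif_counts(triangle_graph))  # (3, 3, 1)
```

Averaging these counts over networks simulated from a candidate ERGM and comparing them with the observed counts gives the alignment criterion described above.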

Algorithm 3 Degeneracy Screening Algorithm

1. Edge Count Assessment: For an observed network, compute the observed edge count, denoted by H_O.

2. Model Comparison: For a proposed ERGM M, compute the average edge count, denoted by H_M.

3. Discrepancy Calculation: Compute the difference |H_M − H_O|.

4. Model Exclusion: If the relative error of the edge count exceeds the chosen threshold, then discard M from the set of possible models.

The last step of the degeneracy screening algorithm is critical: if a candidate ERGM’s average edge count significantly deviates from the observed count (either overestimates or underestimates), it is deemed degenerate and thus excluded from our set of potential models. This step ensures the models we select not only statistically represent the observed network structure but also adhere to realistic network formations.
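A minimal sketch of this screening step is given below. The relative error is taken as |H_M − H_O| / H_O, and the tolerance value is a hypothetical placeholder because the cutoff used in the study is not stated in this text; the simulated edge counts are invented for illustration.

```python
def relative_edge_error(observed_edges, simulated_edge_counts):
    """Relative error between the mean simulated edge count H_M and the observed H_O."""
    mean_edges = sum(simulated_edge_counts) / len(simulated_edge_counts)
    return abs(mean_edges - observed_edges) / observed_edges


def screen_models(observed_edges, simulated_by_model, tolerance):
    """Keep the models whose relative edge error is within the tolerance.

    simulated_by_model maps a model label to edge counts of networks simulated from it.
    tolerance is a hypothetical cutoff chosen by the analyst.
    """
    return [
        label
        for label, counts in simulated_by_model.items()
        if relative_edge_error(observed_edges, counts) <= tolerance
    ]


simulated = {
    "edges + kstar(2)": [19, 21, 22],        # close to the observed count
    "edges + triangle": [118, 120, 119],     # nearly complete graphs, degenerate
}
print(screen_models(20, simulated, tolerance=0.25))
```

A degenerate model typically places most of its mass on nearly empty or nearly complete graphs, so its average edge count lands far from the observed value and the model is dropped regardless of the exact tolerance.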

By integrating degeneracy screening into our methodology, we enhance the model’s reliability, ensuring that it reflects the true nature of the network data and remains free from the distortions of multicollinearity. This process, combined with our earlier steps of variable selection and categorization, culminates in an ERGM that is both robust and representative of the complex dynamics inherent in network structures.

4 Numerical studies

This section presents the application of our algorithms to nine real-life networks [21, 27] that vary in complexity and size, with node counts ranging from 16 to 418.

4.1 Types of networks

These networks are grouped into three categories based on their structure and size, allowing us to comprehensively evaluate the potential and limitations of our method across varied network types. (1) Small Networks: This category includes networks with up to 20 nodes and a maximum of 30 edges. Representative networks in this group are the Florentine marriage, Florentine business and Molecule networks [28]. (2) Moderately Complex Networks: Networks in this category are slightly larger and more complex, with node counts ranging from 20 to 80 and edge counts between 50 and 200. Examples include the Lazega lawyer network, Kapferer tailor shop networks 1 and 2, and the Zach karate networks [29, 30]. (3) Highly Complex Networks: The final category encompasses the largest and most complicated networks, with node counts ranging from 80 to 418 and edge counts from 200 to 556 [31]. A notable network in this category is the challenging Ecoli network [32], known for its complexity.

4.2 Stochastic forward selection (Algorithms 1 and 2)

In applying Algorithm 1, we obtained an initial set of endogenous variables, establishing a model space for each of the 9 networks (Table 1). The upper bounds of kstar, nsp, dsp and esp are found in Table 2.

Table 1. Characteristics of 9 real-life undirected networks obtained from the literature.

https://doi.org/10.1371/journal.pone.0314557.t001

Table 2. The upper bounds of kstar, esp, dsp and nsp for the 9 real life networks.

https://doi.org/10.1371/journal.pone.0314557.t002

The model space consists of ERGMs with either one or two distinct endogenous variables alongside the edge term. Aiming to select endogenous variables that accurately reflect the observed network structure, we applied Algorithm 2, leading to a significant reduction in the model space for all networks. An example of such an ERGM, utilizing the kstar(2) and kstar(3) terms, is visualized in Fig 2. This visualization shows three networks simulated from this ERGM, showcasing Algorithm 2’s effectiveness in capturing network dynamics.

Fig 2. Leftmost network denotes the observed Florentine marriage network.

The remaining three networks are draws from the ERGM in equation P_θ(Y = y|X) = ψ (θ)exp{θ₀ × edges + θ₁ × kstar(2) + θ₂ × kstar(3) + θ₃g₁(x) + θ₄g₂ (x) + θ₅g₃ (x)}. g₁(x) denotes the exogenous variable of familial wealth, g₂(x) denotes the number of seats on the civic council and g₃(x) denotes the total number of business and marriage ties.

https://doi.org/10.1371/journal.pone.0314557.g002
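Since the model space contains every ERGM with one or two distinct endogenous variables next to the edge term, its size for M retained variables is M + M(M − 1)/2. The sketch below enumerates that space; the function name and the four example term labels are illustrative assumptions, not the variables retained for any particular network in the study.

```python
from itertools import combinations
from math import comb


def candidate_models(variables):
    """List every model with one or two distinct endogenous variables.

    Each model is a tuple of term labels; the edge term is implied in all of them.
    """
    singles = [(v,) for v in variables]
    pairs = list(combinations(variables, 2))
    return singles + pairs


retained = ["kstar(2)", "kstar(3)", "triangle", "isolates"]  # hypothetical selection
models = candidate_models(retained)
print(len(models), len(retained) + comb(len(retained), 2))
```

The quadratic growth in M is the reason the variable exclusion step matters: every variable removed by Algorithm 2 also removes all pairwise models that contain it.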

Transitioning from the broader application of our proposed stochastic forward selection approach above, the following is a specific example highlighting the importance of the relative AIC percentile in our algorithm. We applied Algorithm 2 to a transcription regulation network for Ecoli [32–34], a case that exemplifies the nuances of variable selection in Category 3. In this instance, the relative AIC change between the null ERGM and the uni-variate ERGM was recorded 90 times. The focal endogenous variable was the degree cross product. The results, depicted in Fig 3, show that in 5 out of 90 instances, there was a substantial increase in AIC compared to the null model. Crucially, the 10th percentile of the relative AIC change for the degree cross product is positive, with a median value of 0.0265. This observation underscores the potential risk of incorrectly excluding significant variables based on a single AIC calculation, thus validating the need for a multi-faceted evaluation approach as embodied in our methodology.

Fig 3. Relative AIC change frequency.

Frequency of the relative AIC change for the uni-variate ERGM with degree cross product as the main predictor. The histogram on the left is an example of relative AIC fluctuation due to poor initial starting points. The histogram on the right exhibits relative AIC fluctuation that mimics random noise.

https://doi.org/10.1371/journal.pone.0314557.g003

4.3 Degeneracy screening and model selection (Algorithm 3)

Our focus of model selection was primarily on the counts of edges, 2-stars and triangles as determined by Algorithm 3. This approach allowed us to systematically identify and exclude degenerate ERGMs, which are characterized by edge counts that significantly diverge from those observed in the actual networks.

For this reduced model space, the average number of edges, 2-stars and triangles generally aligned with the observed counts. However, the number of edges was overestimated for all networks, as shown in Table 3. This bias is inherent to exponential family models [35] and can be corrected if more endogenous variables are used. The average number of 2-stars and triangles, in contrast, was above the observed count for some networks and below it for others. A remedy for this discrepancy is to allow the candidate ERGM to possess more than 3 endogenous variables, thereby enhancing model accuracy. These variations in the counts of 2-stars and triangles further illustrate the complex dynamics within these networks and the necessity of a flexible and robust modeling approach.

Table 3. Results from applying our algorithms to the 9 real-life networks.

https://doi.org/10.1371/journal.pone.0314557.t003

5 Discussion

The results of the numerical studies have affirmed the effectiveness of our proposed methodology in generating well-fitting and non-degenerate ERGMs. Our approach has been successfully applied to a diverse range of networks, from small-scale networks with fewer than 20 nodes to complex networks like the Ecoli network, encompassing both directed and undirected structures. This versatility in application demonstrates the robustness of this approach.

The novelty of our method lies in its adaptive use of a step-wise feature selection process tailored specifically for ERGMs. This approach, which meticulously evaluates the impact of each endogenous variable using the AIC, marks a significant departure from traditional methods. It addresses the inherent complexity of ERGMs, especially the challenges posed by their intractable normalizing constants, and offers a robust framework that enhances the model’s accuracy and interpretability.

Our methodological framework opens new avenues in network analysis, particularly in handling directed networks, which traditionally pose a challenge due to their complex structures. The ability to incorporate directed endogenous variables, though resulting in an expanded model space, paves the way for more nuanced analyses of directed networks. This capability is crucial for future studies to understand the directional dynamics within networks, offering potential for groundbreaking discoveries in fields such as social network analysis, epidemiology, and beyond. One notable limitation of our approach is the inflated model space that emerges when incorporating directed endogenous variables. This expansion complicates the model selection process, particularly when models include more than three variables. Addressing this limitation necessitates the development of sophisticated model selection procedures capable of navigating this increased complexity.

6 Conclusion

In conclusion, the method we introduced in this study represents a significant advancement in network analysis. By providing a robust and adaptable framework for ERGMs, our methodology not only ensures the generation of accurate and non-degenerate models but also enhances the potential for their application in more complex network types. While the challenge of an expanded model space presents an opportunity for further research, it also highlights the fertile ground for further advancements in the analytical capabilities and applications of ERGMs. As we build on this work, the potential for new insights and understandings of complex network structures becomes increasingly attainable.

Supporting information

S1 Data. Contains the edge lists and vertex attributes for the nine networks used in this study as Excel files.

Contains the R code for the three algorithms and their results in a .RData file.

https://doi.org/10.1371/journal.pone.0314557.s001

(ZIP)

References

1. Okamoto J; Centers for Population Health and Health Disparities Evaluation Working Group. Scientific collaboration and team science: a social network analysis of the centers for population health and health disparities. Translational behavioral medicine. 2015;5(1):12–23. pmid:25729449
2. Bennett LM, Gadlin H, Marchand C. Collaboration team science: Field guide. US Department of Health & Human Services, National Institutes of Health …; 2018.
3. Yu F, Van AA, Patel T, Mani N, Carnegie A, Corbie-Smith GM, et al. Bibliometrics approach to evaluating the research impact of CTSAs: a pilot study. Journal of clinical and translational science. 2020;4(4):336–344. pmid:33244415
4. Provan KG, Harvey J, De Zapien JG. Network structure and attitudes toward collaboration in a community partnership for diabetes control on the US-Mexican border. Journal of Health Organization and Management. 2005;19(6):504–518. pmid:16375071
5. Luke DA, Wald LM, Carothers BJ, Bach LE, Harris JK. Network influences on dissemination of evidence-based guidelines in state tobacco control programs. Health education & behavior. 2013;40(1 suppl):33S–42S. pmid:24084398
6. Nordhaus WD. The progress of computing. Available at SSRN 285168. 2001.
7. Li K. Degeneracy, duration, and co-evolution: extending exponential random graph models (ERGM) for social network analysis; 2015.
8. Krivitsky PN. Using contrastive divergence to seed Monte Carlo MLE for exponential-family random graph models. Computational Statistics & Data Analysis. 2017;107:149–161.
9. Bang-Jensen J, Gutin G. Basic terminology, notation and results. Classes of Directed Graphs. 2018; p. 1–34.
10. Kolaczyk ED, Csárdi G. Statistical Analysis of Network Data with R. vol. 65. Springer; 2014.
11. Uddin S, Hossain L, Hamra J, Alam A. A study of physician collaborations through social network and exponential random graph. BMC health services research. 2013;13(1):1–14. pmid:23803165
12. Zappa P, Mariani P. The interplay of social interaction, individual characteristics and external influence in diffusion of innovation processes: An empirical test in medical settings. Procedia-Social and Behavioral Sciences. 2011;10:140–147.
13. Caimo A, Pallotti F, Lomi A. Bayesian exponential random graph modelling of interhospital patient referral networks. Statistics in medicine. 2017;36(18):2902–2920. pmid:28421624
14. Azondekon R. Modeling the Complexity and Dynamics of the Malaria Research Collaboration Network in Benin, West Africa: papers indexed in the Web Of Science (1996–2016). In: AMIA Annual Symposium Proceedings. vol. 2018. American Medical Informatics Association; 2018. p. 195.
15. Ho E, Jeon M, Lee M, Luo J, Pfammatter AF, Shetty V, et al. Fostering interdisciplinary collaboration: A longitudinal social network analysis of the NIH mHealth Training Institutes. Journal of clinical and translational science. 2021;5(1).
16. Broekel T, Bednarz M. Disentangling link formation and dissolution in spatial networks: an application of a two-mode STERGM to a project-based R&D network in the German biotechnology industry. Networks and Spatial Economics. 2018;18:677–704.
17. Wang P, Robins G, Pattison P, Lazega E. Exponential random graph models for multilevel networks. Social Networks. 2013;35(1):96–115.
18. McGlashan J, de la Haye K, Wang P, Allender S. Collaboration in complex systems: Multilevel network analysis for community-based obesity prevention interventions. Scientific Reports. 2019;9(1):12599. pmid:31467328
19. Hunter DR, Handcock MS, Butts CT, Goodreau SM, Morris M. ergm: A package to fit, simulate and diagnose exponential-family models for networks. Journal of statistical software. 2008;24(3):nihpa54860. pmid:19756229
20. Yu F, El-Zaatari HM, Kosorok MR, Carnegie A, Dave G. The application of exponential random graph models to collaboration networks in biomedical and health sciences: a review. Network Modeling Analysis in Health Informatics and Bioinformatics. 2024;13(1):5.
21. Handcock MS, Hunter DR, Butts CT, Goodreau SM, Krivitsky PN, Morris M, et al. Package ‘ergm’; 2015.
22. Yin F, Butts CT. Highly scalable maximum likelihood and conjugate Bayesian inference for ERGMs on graph sets with equivalent vertices. Plos one. 2022;17(8):e0273039. pmid:36018834
23. Efroymson MA. Multiple regression analysis. In: Ralston A, Wilf HS, editors. Mathematical Methods for Digital Computers; 1960.
24. Hunter DR, Goodreau SM, Handcock MS. ergm.userterms: A Template Package for Extending statnet. Journal of statistical software. 2013;52(2):i02. pmid:24307887
25. Morris M, Handcock MS, Hunter DR. Specification of exponential-family random graph models: terms and computational aspects. Journal of statistical software. 2008;24(4):1548. pmid:18650964
26. Masoudi-Nejad A, Schreiber F, Kashani ZRM. Building blocks of biological networks: a review on major network motif discovery algorithms. IET systems biology. 2012;6(5):164–174. pmid:23101871
27. Caimo A, Friel N. Bergm: Bayesian exponential random graphs in R. arXiv preprint arXiv:1201.2770. 2012.
28. Padgett JF. Marriage and elite structure in Renaissance Florence, 1282–1500. Redes: revista hispana para el análisis de redes sociales. 2011;21:0071–97.
29. Kapferer B. Strategy and transaction in an African factory: African workers and Indian management in a Zambian town. Manchester University Press; 1972.
30. Lazega E. The collegial phenomenon: The social mechanisms of cooperation among peers in a corporate law partnership. Oxford University Press, USA; 2001.
31. Resnick MD, Bearman PS, Blum RW, Bauman KE, Harris KM, Jones J, et al. Protecting adolescents from harm: findings from the National Longitudinal Study on Adolescent Health. Jama. 1997;278(10):823–832. pmid:9293990
32. Shen-Orr SS, Milo R, Mangan S, Alon U. Network motifs in the transcriptional regulation network of Escherichia coli. Nature genetics. 2002;31(1):64–68. pmid:11967538
33. Hummel RM, Hunter DR, Handcock MS. A steplength algorithm for fitting ERGMs. Tech. Rep. 10-03, Pennsylvania State University; 2010.
34. Salgado H, Santos-Zavaleta A, Gama-Castro S, Millán-Zárate D, Díaz-Peredo E, Sánchez-Solano F, et al. RegulonDB (version 3.2): transcriptional regulation and operon organization in Escherichia coli K-12. Nucleic acids research. 2001;29(1):72–74. pmid:11125053
35. Efron B. The geometry of exponential families. The Annals of Statistics. 1978; p. 362–376.



Supporting information

S1 Data. Contains the edge lists and vertex attributes for the nine networks used in this study as Excel files.
