Scalable parallel and distributed simulation of an epidemic on a graph

Guohao Dou

doi:10.1371/journal.pone.0291871

Abstract

We propose an algorithm to simulate Markovian SIS epidemics with homogeneous rates and pairwise interactions on a fixed undirected graph, assuming a distributed memory model of parallel programming and limited bandwidth. This setup can represent a broad class of simulation tasks with compartmental models. Existing solutions for such tasks are sequential by nature. We provide an innovative solution that makes trade-offs between statistical faithfulness and parallelism possible. We offer an implementation of the algorithm in the form of pseudocode in the Appendix. Also, we analyze its algorithmic complexity and its induced dynamical system. Finally, we design experiments to show its scalability and faithfulness. In our experiments, we discover that graph structures that admit good partitioning schemes, such as the ones with clear community structures, together with the correct application of a graph partitioning method, can lead to better scalability and faithfulness. We believe this algorithm offers a way of scaling out, allowing researchers to run simulation tasks at a scale that was not accessible before. Furthermore, we believe this algorithm lays a solid foundation for extensions to more advanced epidemic simulations and graph dynamics in other fields.

Citation: Dou G (2023) Scalable parallel and distributed simulation of an epidemic on a graph. PLoS ONE 18(9): e0291871. https://doi.org/10.1371/journal.pone.0291871

Editor: Wei Ju, Peking University, CHINA

Received: June 26, 2023; Accepted: September 7, 2023; Published: September 29, 2023

Copyright: © 2023 Guohao Dou. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: We release the code and simulation results on Zenodo. Link: https://doi.org/10.5281/zenodo.7750420.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

Large-scale epidemic simulations have become indispensable for high-stakes policymaking in public health. In 2020, the Imperial College report [1] predicted serious consequences in the absence of government actions to control the COVID-19 pandemic, convincing the UK government to issue necessary restrictions. However, large-scale simulations require immense computing power. Just as the authors of [2] put it, building a model on the global scale is a “daunting undertaking”. Thanks to the increasing availability of commercial cloud platforms and supercomputers, we can run simulation algorithms on many server nodes. When the task scale is moderate, each computing node runs its instance of the simulation algorithm, and the task is embarrassingly parallel. We only consider the scale at which multiple computing units need to cooperate to finish one instance of the simulation algorithm, meaning that parallelism cannot be attained without communication, and nontrivial algorithm design is needed.

The importance of epidemic simulations and the availability of computing resources call for algorithmic solutions. We henceforth focus on a highly useful special case to build a solid foundation for future extensions, and we define the scope of the problem as follows.

By “Markovian SIS epidemics”, we consider epidemics with an SIS (susceptible-infected-susceptible) compartmental model where all times to infection and times to recovery follow exponential distributions. In an SIS model, each individual, represented as a vertex in the graph, can either be in an infected (INF) state or a susceptible (SUS) state. Susceptible vertices neighboring infected vertices may transition into the infected state through an infection event. Infected vertices may transition into the susceptible state through a recovery event. Non-Markovian epidemic models exist and have become a rich field of study. See [3, 4].
By “homogeneous rates”, we mean that all exponentially distributed times to infection/recovery have the same infection/recovery rates, as opposed to heterogeneous rates, studied in [5, 6].
By “pairwise interactions”, we mean that all infection events through edges happen independently, in contrast to the cooperative infection model in [4]. Dynamics beyond pairwise interactions have also become an emerging field [7].
By “a fixed graph”, we mean that the graph structure on which the epidemic dynamics occur is fixed throughout the simulation. A body of literature exists for the scenario where the graph structure is allowed to vary, sometimes adaptively. See [8–11].
By “distributed memory model of parallel programming”, we mean that computing units can communicate only by sending messages, in contrast to the shared memory model of parallelism. We ensure that the algorithm lends itself well to extreme-scale simulation tasks where researchers have to scale out in a loosely connected cluster of machines.
By “limited bandwidth”, we put an upper bound on the bandwidth usage, which tends to become the performance bottleneck.

We pick the simplest epidemic model (SIS) as an example of our algorithm. It will become clear that our algorithm can be easily generalized to other compartmental models. These limitations are as restrictive as they are instructive for future extensions. For now, they simplify the presentation and analysis of the algorithm.

Efforts have been made to simulate spreading dynamics on graphs. By now, there are mainly two schools of thought: time-driven simulations (TDS) [1, 2, 8, 12] and event-driven simulations (EDS) [13–16].

TDS proceeds in epochs, each a time step of fixed size Δ. Let the recovery rate be γ and the infection rate β. In each epoch, a TDS algorithm scans through all vertices of the graph. If a vertex v is infected at the beginning of the epoch, it recovers with probability 1 − exp(−γΔ) by the end of the epoch. If a vertex v with n_v infected neighbors is susceptible at the beginning of the epoch, it gets infected with probability 1 − exp(−n_vβΔ) by the end of the epoch. TDS is easy to implement and parallelize and can be fairly efficient if a large Δ is chosen, which explains why it is popular among researchers and practitioners [1, 2, 8, 12]. However, arguments can also be made against TDS. Firstly, the system’s state can only be observed at multiples of Δ. Secondly, as pointed out in [17], discretization introduces a significant bias relative to continuous formulations such as the master equation, a bias that grows with Δ.
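The TDS epoch described above can be sketched in a few lines of Python. This is a minimal illustration rather than any of the cited implementations: the function name `tds_epoch`, the adjacency-dictionary representation, and the toy parameter values are assumptions of this sketch.

```python
import math
import random

def tds_epoch(adj, infected, beta, gamma, delta, rng):
    """Advance an SIS state by one epoch of length delta (time-driven step)."""
    p_rec = 1.0 - math.exp(-gamma * delta)
    new_infected = set()
    for v, neighbors in adj.items():
        if v in infected:
            # An infected vertex recovers with probability 1 - exp(-gamma * delta).
            if rng.random() >= p_rec:
                new_infected.add(v)
        else:
            # A susceptible vertex with n_v infected neighbors gets infected
            # with probability 1 - exp(-n_v * beta * delta).
            n_v = sum(1 for u in neighbors if u in infected)
            if rng.random() < 1.0 - math.exp(-n_v * beta * delta):
                new_infected.add(v)
    return new_infected

# Toy usage: a 4-cycle with one initially infected vertex (illustrative values).
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
state = {0}
rng = random.Random(0)
for _ in range(10):
    state = tds_epoch(adj, state, beta=0.5, gamma=0.2, delta=0.1, rng=rng)
```

All decisions in an epoch depend only on the state at the beginning of that epoch, which is why the scan over vertices parallelizes trivially and also why the state is only defined at multiples of Δ.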

EDS proceeds by hopping from one event to another, usually by maintaining a queue of events ordered by their scheduled times of occurrence and processing these events in this order. Unlike TDS, EDS allows events to occur at any time, not just multiples of Δ. The occurrence of an event may schedule new events or cancel existing events, with the time between the scheduling and occurrence of an event, which we call the “pending time”, following a certain distribution (exponential distribution in our case). When the infection of a vertex v takes place at time t, a recovery event of vertex v gets scheduled at time t + T_rec where T_rec follows an exponential distribution with rate γ. When the recovery of a vertex v with n_v infected neighbors takes place at time t, an infection event of vertex v gets scheduled at time t + T_inf where T_inf follows an exponential distribution with rate n_vβ. EDS is faithful to the continuous formulation, and the system’s state can be observed at any time. However, EDS is sequential by design since the events must be processed one by one. Tremendous efforts have gone into parallelizing EDS, giving rise to the PDES (Parallel Discrete Event Simulation) community [18]. Two main methods exist in the PDES community, proposing two different ways of preventing events from occurring out of order. The first, often referred to as the CMB algorithm (Chandy, Misra, and Bryant) [19], relies on the crucial assumption that pending times have a positive lower bound and the algorithm blocks execution if out-of-order event processing is possible. The second, proposed in [20], allows out-of-order errors but employs a rollback mechanism to undo the errors. People have also adopted PDES techniques in epidemic simulations before [21], even though the authors of [21] treat each location as a logical process, not each individual, making their algorithm less generic than simulating epidemics on graphs. 
We believe that methods from the PDES community are not well-suited for simulating dynamical processes on graphs. On the one hand, pending times do not have a lower bound, making the CMB algorithm inapplicable and the algorithm from [20] arbitrarily inefficient. On the other hand, each vertex needs to be its own logical process, which brings about significant overhead from various sources, such as communication and operating system scheduling.
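For comparison, a sequential EDS for the SIS model can be sketched as follows. The sketch assumes per-edge infection events (the minimum of n_v independent exponential times with rate β is exponential with rate n_vβ, so this matches the per-vertex description) and cancels events lazily with per-vertex version counters. The function name `eds_sis`, the tuple layout, and the version-counter trick are choices of this illustration, not the implementation of any cited work.

```python
import heapq
import random

def eds_sis(adj, infected0, beta, gamma, horizon, rng):
    """Sequential event-driven SIS; returns a list of (time, kind, vertex).

    Assumes beta > 0, gamma > 0, and integer vertex labels.
    """
    infected = set(infected0)
    version = {v: 0 for v in adj}  # bumped on every state change of v
    queue = []                     # min-heap ordered by scheduled time
    trajectory = []

    def push(t, rate, kind, u, v):
        # Exponential pending time; stored versions enable lazy cancellation.
        heapq.heappush(
            queue, (t + rng.expovariate(rate), kind, u, v, version[u], version[v])
        )

    def on_infected(v, t):
        push(t, gamma, "rec", v, v)
        for w in adj[v]:
            if w not in infected:
                push(t, beta, "inf", v, w)

    for v in sorted(infected):
        on_infected(v, 0.0)

    while queue:
        t, kind, u, v, ver_u, ver_v = heapq.heappop(queue)
        if t > horizon:
            break
        if version[u] != ver_u or version[v] != ver_v:
            continue  # stale event: u or v changed state after scheduling
        version[v] += 1
        if kind == "inf":
            infected.add(v)
            on_infected(v, t)
        else:
            infected.discard(v)
            for w in adj[v]:
                if w in infected:
                    push(t, beta, "inf", w, v)
        trajectory.append((t, kind, v))
    return trajectory

# Toy usage on a 4-cycle (illustrative values).
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
trajectory = eds_sis(adj, {0}, beta=1.0, gamma=0.5, horizon=5.0, rng=random.Random(42))
```

The single heap is what makes the scheme sequential: every event must be popped in time order, and each pop may invalidate or create other events.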

We propose in this article an algorithm that strikes a balance between TDS and EDS and enables a trade-off between parallelism and faithfulness (faithfulness to the continuous formulation of the dynamical system). TDS and EDS can be viewed as two extremes of our algorithm. In fact, when the number of partitions is one (M = 1 as we will see later), our algorithm reduces to classic sequential EDS, which we use as a baseline in our experiments. Our algorithm introduces parallelism through two techniques: partitioning and summary. We partition the graph and allow each part to run its own copy of EDS. However, partitioning without communication causes each part to observe a stale version of its neighbors’ states, leading to highly biased simulation results. To solve this problem, we take inspiration from TDS and force all parts to synchronize periodically, with each part exchanging only a summary of its local information, which saves precious bandwidth.

We study our algorithm’s complexity, scalability, and faithfulness through theoretical and experimental means. We discover that, under mild assumptions, our algorithm scales well with the number of computing units and preserves distinctive features of the dynamical system despite a highly adversarial parameter setup.

The paper is organized as follows. First, we introduce basic notations, definitions, and assumptions in Section 2. Second, we describe the algorithm in Section 3 and offer an implementation of our algorithm in the form of pseudocode in Section 7. Third, we study the algorithm theoretically in Section 4 and experimentally in Section 5. Finally, Section 6 sums up the article and discusses the limitations and extensions of the algorithm.

1.1 Related work

Modeling epidemic processes has been an important subject throughout human history. As early as 1766, Daniel Bernoulli published one of the earliest compartmental models, about smallpox [22]. In 1927, Kermack et al. [23] published what is commonly known as the Kermack-McKendrick theory, a seminal contribution that models epidemic dynamics in a homogeneous population with a system of differential equations. Their work gave rise to a rich body of literature on population dynamics that discusses complex compartmental transitions [24] and applications on real-world epidemic data [25].

In recent years, epidemic modeling on graph structures has attracted intellectual attention. Instead of dynamics in a homogeneous well-mixed population, which can usually be described by a handful of differential equations, researchers in this field focus on compartmental models on graphs, where each vertex in the graph is in one of the predefined compartments/states, and state transitions take place as a result of interactions among neighboring vertices. In this setting, an exact description of the dynamics requires an enormous number of differential equations, a number that typically scales exponentially with the number of vertices in the graph. Understanding the mathematical intractability, researchers explore various methods of simplification such as the mean-field approximation, resulting in a wealth of literature. See [26] for a comprehensive overview.

Because of the mathematical difficulty, many researchers and practitioners resort to numerical simulations. We have discussed copiously the two main streams of methods: EDS and TDS. There are also ad-hoc or agent-based methods of epidemic simulation, such as [1, 21, 27–29]. Many of these ad-hoc methods do employ parallel computation. However, their model adequacy comes at the cost of model explainability and generality.

Our algorithm can be viewed as a part of the broader effort in graph data processing. Many existing systems for parallel graph processing can be repurposed to implement our algorithm, such as [30, 31]. Even though we simply use METIS [32] for graph partitioning, a community detection algorithm [33, 34] can also be a reasonable choice. Moreover, because the ultimate goal of epidemic simulations is to predict the future, it is plausible that we can feed simulation trajectories to a graph neural network [35, 36] as training data and use the graph neural network for fast predictions later on.

2 Problem setup

We have as input

a fixed undirected graph G = (V, E) with initial vertex states (INF or SUS); INF for infected and SUS for susceptible;
a partition of V = V₁ ∪ V₂ ∪ V₃ ∪ ⋯ ∪V_M, ∀i ≠ j, V_i ∩ V_j = ∅, ∀i, V_i ≠ ∅;
an SIS epidemic model with recovery rate γ and per-edge infection rate β;
epoch length Δ; simulation task horizon H.

The epoch length Δ plays a similar role as the fixed time step in TDS. The simulation task horizon H specifies how much time in the simulated world should have elapsed by the time the simulation algorithm terminates.

We make the following assumptions on the input:

M² is . If we treat M as a function of |V|, M(|V|) satisfies
H, Δ, β, γ are constant functions of |V| and M. .
The mean degree 〈D〉 of G exists and is a constant function of |V| and M.

We simulate the spread of Markovian SIS epidemics with homogeneous rates and pairwise interactions. A simulation algorithm produces a trajectory of the form (e₀, t₀), (e₁, t₁), …, (e_L, t_L), where e₀, e₁, …, e_L are events that take place at times t₀, t₁, …, t_L in the simulated world. We refer to the production of one such trajectory as one simulation instance.

For this specific application, events are either infections or recoveries. A recovery event Rec(v) is scheduled when vertex v changes its state from susceptible (SUS) to infected (INF). When Rec(v) occurs, the state of v becomes susceptible. An infection event Inf(u → v) is scheduled when the state change (u susceptible XOR v infected) → (u infected AND v susceptible) occurs. When Inf(u → v) occurs, the state of v becomes infected. We keep track of the origin of infection because we want to differentiate between these infecting events, and only the event with the smallest scheduled time may occur, upon which all others will be canceled.

Suppose we have M processes P₁, …, P_M interconnected by a network, each with sufficient and fast local memory and a dedicated computing unit. We map each V_i to P_i, where we keep in memory

∀v_i ∈ V_i, the state of v_i and the list of its neighbors;
data structures used for computation local to P_i.

If a vertex v, its state, and the list of its neighbors reside in the memory of process P, we say v is a vertex on process P (or, local to process P). For the rest of this article, we abuse the notation and use the symbol of process P to denote the set of vertices on it in the context of set operations.

Note that the processes P₁, …, P_M also form a directed graph , where We denote the neighbors of vertex v as . We denote the neighbors of process P as .

Each of P₁, …, P_M can only efficiently access its memory. Therefore, they have to communicate via message passing. A channel exists for every , where P_i is the sender of messages and P_j the receiver (note that (P_i, P_j) and (P_j, P_i) are two different channels). Suppose that each channel can hold at most one 64-bit integer at a time, and every integer message communicated in the algorithm can be expressed as a 64-bit integer. That is, the system is supposed to crash if P_i puts an integer in the (P_i, P_j) channel before P_j takes the previous integer message out. We refer to a matched pair of sending and receiving as one message. We measure communication costs by the number of messages sent from all processes throughout one simulation instance.

We need an algorithm that can produce the aforementioned trajectory while sending messages.

3 Algorithm

We present and discuss the algorithm in this section. We begin by describing the algorithm on a high level. We provide the pseudocode of the algorithm in Section 7. We also compare it to other existing implementations for parallel processing.

Each process deserializes its assigned part of the adjacency list and initializes its data structures, including but not limited to the event queue essential to event-driven simulations. After initialization, the algorithm proceeds in iterations, which we refer to as “epochs”. Given simulation horizon H and epoch length Δ, the loop runs for ⌈H/Δ⌉ times, after which the algorithm terminates.

In each epoch, a process computes, based on its local memory, what will be needed by other processes for their local computation, after which it sends these messages asynchronously. It then waits for the messages from other processes before beginning its own local computation. After finishing its local computation, it waits for the acknowledgements from other processes that the messages it sent earlier have been received, and then the cycle repeats.
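The order of the phases within an epoch can be illustrated with a single-threaded emulation. The real algorithm runs the processes concurrently and sends messages asynchronously, so the sketch below only shows the data dependencies, not the concurrency. The function `run_epochs` and its callbacks `summarize` and `local_compute` are hypothetical names introduced for this illustration.

```python
import math

def run_epochs(process_neighbors, horizon, delta, summarize, local_compute):
    """Emulate the epoch loop of all processes in a single thread.

    process_neighbors: dict mapping each process id to the ids it sends to.
    summarize(p, q): integer summary that p sends to q (hypothetical callback).
    local_compute(p, inbox, t_start, t_end): local work of p for one epoch.
    Returns the total number of messages sent.
    """
    n_epochs = math.ceil(horizon / delta)
    messages = 0
    for k in range(n_epochs):
        # Phase 1: every process computes and sends its summaries.
        inboxes = {p: {} for p in process_neighbors}
        for p, receivers in process_neighbors.items():
            for q in receivers:
                inboxes[q][p] = summarize(p, q)
                messages += 1
        # Phase 2: every process runs its local computation for this epoch,
        # using only the summaries it has just received.
        for p in process_neighbors:
            local_compute(p, inboxes[p], k * delta, min((k + 1) * delta, horizon))
    return messages
```

With M processes that all neighbor each other, each epoch costs M(M − 1) messages in this emulation, so the total over ⌈H/Δ⌉ epochs is M(M − 1)⌈H/Δ⌉.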

Two key questions arise. First, what information needs to be sent to other processes? Second, what local computation needs to take place?

The first question determines bandwidth usage. We choose to send merely a summary statistic from process P to process Q, which acts as a proxy of P on Q. If a vertex v on process P has at least one neighbor on process Q, we say v is bordering Q. Our choice of the summary statistic for this article is the number of infected vertices on P bordering Q, denoted as n_P→Q. Let the number of vertices on P bordering Q be N_P→Q. Because we assume a fixed network, N_P→Q only needs to be computed once during initialization. We then compute the probability of finding an infected vertex on the (P → Q) border as n_P→Q/N_P→Q.
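As an illustration, the summary statistic can be computed as below. The function name `border_summary`, the `owner` dictionary that maps each vertex to its process, and the toy graph are assumptions of this sketch. In an actual distributed run, each process would only scan its own vertices, and N_P→Q would be cached after initialization.

```python
def border_summary(adj, owner, infected, p, q):
    """Return (n_pq, N_pq): infected and total vertices on p that border q."""
    border = [
        v for v in adj
        if owner[v] == p and any(owner[u] == q for u in adj[v])
    ]
    n_pq = sum(1 for v in border if v in infected)
    return n_pq, len(border)

# Toy usage: vertices 0, 1, 2 live on P; vertices 3, 4 live on Q.
adj = {0: [1], 1: [0, 3], 2: [3, 4], 3: [1, 2, 4], 4: [2, 3]}
owner = {0: "P", 1: "P", 2: "P", 3: "Q", 4: "Q"}
n_pq, big_n_pq = border_summary(adj, owner, {0, 1}, "P", "Q")
border_infection_probability = n_pq / big_n_pq  # 1 of the 2 border vertices
```

Vertex 0 is infected but has no neighbor on Q, so it does not contribute to n_P→Q. Only a single integer per ordered pair of neighboring processes crosses the network in each epoch.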

Let be the l-th integer that process P sends to Q. We have the IO automaton in Fig 1, which illustrates the messaging pattern and the iterative nature of the algorithm.

Fig 1. The IO automaton of process P.

We assume that process P has neighboring processes Q₁, Q₂, Q₃.

https://doi.org/10.1371/journal.pone.0291871.g001

The second question is partially answered by classic EDS. On top of the event-driven scheme that we have reviewed in the Introduction, we incorporate the information from n_Q→P by introducing a new type of event, Inf(Q → v), for susceptible vertex v ∈ P bordering Q. Let n_Q→v be the number of v’s neighbors on Q. The rate of event Inf(Q → v) is computed as β · n_Q→v · n_Q→P / N_Q→P, that is, each of the n_Q→v neighbors of v on Q is treated as infected with probability n_Q→P/N_Q→P. We effectively conduct a mean-field approximation on the border. We summarize this alternating pattern of communication and computation in Fig 2.

Fig 2. An example with 3 interconnected processes.

Each circle stands for a process. The blue rectangle in the red circle stands for the subset of vertices on Q that border P. n_Q→P is computed based on this subset and sent to P for P’s local computation of the next epoch. Each process maintains its own priority queue, which sorts events by their scheduled times of occurrence and keeps picking the most imminent event.

https://doi.org/10.1371/journal.pone.0291871.g002

Notably, upon receiving a new value for n_Q→P, any event on P of the form Inf(Q → v) needs to be updated with the new rate by generating a new scheduled time of occurrence.
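The update can be sketched as follows, assuming that the rate of Inf(Q → v) is the product of β, the number of neighbors of v on Q, and the border infection probability n_Q→P/N_Q→P, which is how we read the mean-field approximation on the border. The function names and argument names are hypothetical. Because pending times are exponential and therefore memoryless, drawing a fresh scheduled time from the current simulated time is statistically valid.

```python
import random

def remote_infection_rate(beta, neighbors_on_q, border_infected, border_size):
    """Rate of Inf(Q -> v) under the assumed mean-field form:
    beta * (neighbors of v on Q) * (infected fraction of the Q -> P border)."""
    if border_size == 0:
        return 0.0
    return beta * neighbors_on_q * border_infected / border_size

def reschedule(now, rate, rng):
    """Draw a fresh scheduled time for an event with the given exponential rate."""
    if rate <= 0.0:
        return float("inf")  # the event cannot occur during this epoch
    return now + rng.expovariate(rate)

# Toy usage: v has 2 neighbors on Q, and 1 of the 4 border vertices is infected.
rate = remote_infection_rate(beta=0.5, neighbors_on_q=2, border_infected=1, border_size=4)
new_time = reschedule(now=1.0, rate=rate, rng=random.Random(0))
```

In a priority queue, this corresponds to a "fix" operation: the key of the existing Inf(Q → v) entry is replaced by `new_time` and the heap order is restored.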

We offer our implementation in the form of pseudocode in Section 7.

This pattern of communication and computation bears a great resemblance to the “ghost cell” method commonly employed in numerical method and high-performance computing communities [37–41]. The key difference is that instead of sending the whole border area (shaded rectangles in Fig 2) over the network, we exploit the nature of event-driven simulations and send only a summary statistic of the border, vastly reducing bandwidth usage. In addition to the bandwidth reduction, in practice, small messages also allow MPI to use the eager protocol, which enables better overlap of communication and computation [42], compared to the expensive rendezvous protocol designed for large messages.

Besides our implementation in Section 7, many other systems for parallel processing can be used to implement our algorithm. One of the earliest attempts to model this pattern is Bulk Synchronous Parallel (BSP) [43], in which a computation consists of many supersteps. Within each superstep, processors engage in local computation, communicate with each other, and wait on a global synchronizing barrier, after which the next superstep may begin. These supersteps play the same role as our epochs, except that our implementation does not rely on a global barrier, which is expensive in practice. Instead, synchronization in our implementation happens through data dependency stipulated by the process graph. BSP also uses remote memory access while our implementation uses message passing, which is more commonly supported in terms of hardware.

Another system inspired by BSP is Pregel [30], which is a vertex-centric framework for large-scale graph processing. A processor in BSP is mapped to a vertex in Pregel, which sends/receives messages to/from other vertices and modifies its state accordingly. Like BSP, Pregel also has a global barrier, implemented as the master in the master-worker architecture, which additionally takes care of fault tolerance. Our algorithm can be readily implemented under the Pregel framework, with the graph being the process graph.

[31] offers a good review of Pregel-like parallel processing systems. All systems examined in [31] except GraphLab [44] adopt the BSP model, with global barriers for synchronization. GraphLab [44] has an asynchronous mode with no global barrier. However, this asynchronous mode only ensures serializability, which precludes race conditions but does not guarantee the iterative semantics of our implementation and other BSP-based systems. Moreover, GraphLab uses shared memory for locking to ensure serializability, which hampers performance due to intensive lock contention according to [31].

4 Theoretical results

4.1 Complexity

We discuss in this section the complexity of algorithm 5 in S1 File. We make some simplifying assumptions on the independence of random variables to make the problem mathematically tractable.

The state of a vertex v can be modeled by a stochastic process, S_v(t), where We assume that We also assume that S_v(t) is independent of its neighborhood. More precisely, let v be a vertex with d neighbors, v₁, v₂, …, v_d. We assume that the factorization is legitimate. Let be a function defined on its neighborhood. This assumption implies that the factorization must also be legitimate.
The degree of a vertex v, D_v, can be modeled by a random variable taking values from . We assume that {D_v}_v∈V are i.i.d. random variables with p_d = Pr{D_v = d} and .
The process membership of a vertex v, Π_v, can be modeled by a random variable taking values from {1, 2, …, M}. We assume that {Π_v}_v∈V are i.i.d. random variables with
∀t ∈ [0, H], ∀v ∈ V, the collection of random variables {S_v(t)} ∪ {D_v}_v∈V ∪ {Π_v}_v∈V consists of independent random variables.

We now review and justify these assumptions. We present the scenarios where these assumptions hold and point out directions where they may be loosened.

The assumption on vertex state independence echoes the mean-field assumption in [26, 45, 46]. It is common in the literature to assume independence among all vertex states, which is a stronger assumption than our analysis requires. Also, because infected vertices are more likely to be neighboring each other in actual epidemic processes, assuming vertex state independence leads to an overestimated count of (INF, SUS) edges, which suits our purpose of showing an upper bound. This assumption works well for dynamic contact graphs where the neighborhood keeps changing and complete graphs where the neighborhood is the entire graph. In both cases, the neighborhood is a good representative of all vertices. In [26, 47], analyses of refined models beyond the one-vertex mean-field approximation are provided.
The assumption on vertex degree independence comes from the configuration model (without degree correlation) where a random number of “stubs” are generated on each vertex and then randomly connected [48]. The configuration model is a powerful graph generation method because it allows one to specify arbitrary degree distribution. It works well when degree correlation is shown to be negligible. We are aware that degree correlation does exist in real-world graph structures [49], and people have studied epidemic dynamics on graphs with degree correlation [50].
This assumption on process membership is a consequence of partitioning the graph uniformly at random, which is a sensible partitioning scheme without prior knowledge about the graph structure. In practice, one may choose a graph partitioning software that reduces the number of interprocess edges, such as [51].
The assumption that {D_v}_v∈V ∪ {Π_v}_v∈V are independent makes sense as long as graph partitioning is done without considering graph topology. The assumption that S_v(t) and {Π_v}_v∈V are independent may not hold for highly irregular graphs, where some processes are significantly more infected than others. However, because we have assumed uniform graph partitioning, the chances of one part being highly different from the others are slim. The assumption that S_v(t) and {D_v}_v∈V are independent works well when the degree distribution is highly centralized around its mean and is trivially true if we are considering a regular graph. As pointed out in [52], the assumption that {S_v(t), D_v} are independent is unrealistic because intuitively, vertices with more neighbors ought to have a higher chance of catching the disease.

Any deviation of the input from these assumptions will inevitably hurt the validity of our analysis. Nevertheless, we decide to stick to this classical set of assumptions to attain an early result on complexity analysis and lay a foundation for future refinement.

Take an arbitrary process P. We denote the stochastic process N(t) as the number of events in the queue of process P at time t in the simulated world. We denote the total rate Λ(t), also a stochastic process, as the sum of rates of all infection events and recovery events in the queue of process P at time t in the simulated world.

Let μ_v be the number of distinct processes that vertex v on P is bordering. If we pick uniformly at random from the alphabet {1, 2, …, M} with replacement, then . By construction, μ_v ≤ min{D_v, M − 1}. Finding the distribution of μ_v is nontrivial, but its expectation can be easily shown as where is the generating function of the degree distribution.
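The expectation of μ_v can be checked numerically. Under the assumption that each neighbor lands on one of the M processes independently and uniformly at random, a given process other than P contains none of the d neighbors of v with probability (1 − 1/M)^d, so a direct calculation gives (M − 1)(1 − G(1 − 1/M)), where G(x) = Σ_d p_d x^d. The symbol G and the function names below are notation of this sketch, which compares the closed form with a Monte Carlo estimate.

```python
import random

def expected_mu(degree_pmf, m):
    """(M - 1) * (1 - G(1 - 1/M)), with G the degree generating function."""
    x = 1.0 - 1.0 / m
    g = sum(p * x ** d for d, p in degree_pmf.items())
    return (m - 1) * (1.0 - g)

def simulate_mu(degree_pmf, m, trials, rng):
    """Monte Carlo estimate of E[mu_v] for a vertex placed on process 0."""
    degrees = list(degree_pmf)
    weights = [degree_pmf[d] for d in degrees]
    total = 0
    for _ in range(trials):
        d = rng.choices(degrees, weights)[0]
        # Each neighbor is assigned to a process uniformly at random.
        hit = {rng.randrange(m) for _ in range(d)}
        hit.discard(0)  # neighbors on the vertex's own process do not count
        total += len(hit)
    return total / trials

# Toy usage: degrees 1 and 3 with equal probability, M = 4 processes.
pmf = {1: 0.5, 3: 0.5}
exact = expected_mu(pmf, 4)
estimate = simulate_mu(pmf, 4, 20000, random.Random(0))
```

The closed form respects the bound μ_v ≤ min{D_v, M − 1} in expectation: it never exceeds M − 1, and for large M it approaches the mean degree.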

Queue operations dominate the computational costs. Therefore, we state the number of queue operations in proposition 1.

Proposition 1. The number of operations (up to constant factors) incurred by the queue data structure on process P in one simulation instance, denoted as random variable , can be expressed as where we assume without loss of generality that H/Δ yields an integer.

is the number of queue operations within epochs. In each infinitesimal dt, an event occurs with probability dtΛ(t), upon which it flips the state of vertex v and examines its neighbors, causing at most D_v queue push or remove operations, each with complexity log N(t). is the number of queue operations between epochs. Every susceptible vertex v on P bordering some other process Q causes a queue fix operation of event Inf(Q → v) upon receiving an updated border infection count n_Q→P. A queue fix at the beginning of the (k + 1)-th epoch has complexity log N(kΔ). Each susceptible vertex v causes μ_v queue fixes.

Where proposition 1 gains precision, it loses mathematical tractability. The key issue lies in the fact that Λ(t), D_v, and log N(t) are not independent random variables, nor are log N(kΔ), Π_v, S_v(kΔ), and μ_v. We resort to more approximations to counter this issue and attain a good qualitative description of the scaling behavior. We define as a function of |V| and M that satisfies meaning that is an upper bound of queue operation complexity almost surely when M → ∞. We also assume that D_v is concentrated around its mean, allowing us to use 〈D〉 in its stead. For , we simply upper-bound it by 1.

After these, we obtain an expression for the approximate complexity .

Definition 1.

We will focus on providing an upper bound for .

Theorem 1. (1) (2) Proof. Let v be an arbitrary vertex on process P. The number of infected neighbors on P of vertex v, η_v(t), can be expressed as where v_d is the d-th neighbor of vertex v.

The number of events of the form Inf(u → v), u, v ∈ P, denoted as stochastic process N_inf,local(t), is which makes

The number of events of the form Inf(Q → v), Q ≠ P, v ∈ P, denoted as stochastic process N_inf,remote(t), is (3) (4) (5) The total rate from other processes, Λ_inf,remote(t), satisfies The number of events of the form Rec(v), v ∈ P, denoted as the stochastic process N_rec(t), is which makes Then we have which takes maximum when , and the maximum is And we have which takes maximum when , and the maximum Λ* is

Note that theorem 1 does not rely on the assumption that M² is .

Lemma 1. almost surely as M → ∞.

Proof. By assuming that M² is , we have where we treat |V| as the inverse function of M(|V|).

Using Markov inequality, we have ∀a, With a change of variable b = log a, Take b = 4 log(|V|/M), and we obtain which implies that

Lemma 1 offers us a candidate for , namely,

Theorem 2. Proof. First we have Then we have Because we know Then

Theorem 2 exhibits good scalability in terms of M. If we consider the case where M = 1, the result in theorem 2 is correct if we remove the term.

Definition 2. Let random variable be the total number of messages sent from all processes in one simulation instance.

Theorem 3. Proof. The proof is trivial since each process sends at most M−1 messages in an epoch.

4.2 Analysis of the dynamical system

In this section, we study the dynamical system that algorithm 5 in S1 File gives rise to. We study two extreme cases of this algorithm: M = 1, where all vertices belong to the same process, and M = |V|, where each vertex belongs to its own process.

Again we list our assumptions first:

The state of a vertex v can be modeled by a stochastic process, S_v(t). For any given t ∈ [0, H], v ∈ V, S_v(t) is a Bernoulli random variable with Again, we assume that S_v(t) is independent of its neighborhood.
The graph G = (V, E) is a given, fixed graph represented by adjacency matrix . A_uv = 1 if (u, v) ∈ E and 0 otherwise.

Theorem 4. When M = 1, the dynamics of algorithm 5 in S1 File obey the master equation (6)

Proof. When M = 1, the algorithm reduces to the Next Reaction Method, which is faithful according to [14]. Note that the exact master equation would involve 2^|V| equations, which is intractable. By invoking the assumption on v’s independence of its neighborhood, we factorize the exact but cumbersome master equation into |V| equations in the form of Eq (6), a technique also introduced in [47, 53, 54]. This technique is widely known in the field as “one-vertex quenched mean-field theory”.
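As a numerical illustration, the sketch below integrates what we take to be the standard one-vertex quenched mean-field SIS equation, dP_v/dt = −γ P_v + β (1 − P_v) Σ_u A_uv P_u, with explicit Euler steps. This form, the step size, and the function names are assumptions of the sketch rather than a transcription of Eq (6).

```python
def qmf_step(adj, p, beta, gamma, dt):
    """One explicit Euler step of the assumed quenched mean-field equation
    dP_v/dt = -gamma * P_v + beta * (1 - P_v) * sum of P_u over neighbors u."""
    return {
        v: p[v] + dt * (-gamma * p[v] + beta * (1.0 - p[v]) * sum(p[u] for u in adj[v]))
        for v in adj
    }

def integrate_qmf(adj, p0, beta, gamma, horizon, dt=1e-3):
    """Integrate from time 0 to horizon and return the infection probabilities."""
    p = dict(p0)
    for _ in range(int(round(horizon / dt))):
        p = qmf_step(adj, p, beta, gamma, dt)
    return p

# Toy usage: complete graph on 4 vertices, every vertex starts at probability 0.5.
k4 = {v: [u for u in range(4) if u != v] for v in range(4)}
p_end = integrate_qmf(k4, {v: 0.5 for v in k4}, beta=1.0, gamma=1.0, horizon=20.0)
```

On a d-regular graph with a symmetric initial condition, the nonzero fixed point of this equation is 1 − γ/(βd), which gives 2/3 for the toy example.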

Consider M = |V|. During the l-th epoch [(l − 1)Δ, lΔ], a vertex v can only rely on a snapshot of its neighbors’ states taken at time (l − 1)Δ. We define the total infection rate from its neighbors in the l-th epoch as The state of vertex v during the l-th epoch follows the continuous-time Markov chain (CTMC) in Fig 3.

Fig 3. The CTMC that the state of vertex v follows during the l-th epoch.

Transition rates γ and are constant during the l-th epoch.

https://doi.org/10.1371/journal.pone.0291871.g003

The CTMC in Fig 3 has rate matrix which gives the transition matrix (7)
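A minimal sketch of the closed-form transition matrix of a two-state CTMC such as the one in Fig 3. It assumes that states are ordered (susceptible, infected), that the recovery rate is γ, and that the total infection rate is a constant written here as `alpha` (a placeholder name, since the symbol is not reproduced in this text). The formula is the standard matrix exponential of a 2×2 rate matrix and is meant to illustrate Eq (7), not to quote it.

```python
import math


def two_state_transition_matrix(alpha: float, gamma: float, delta: float):
    """Return exp(Q * delta) for Q = [[-alpha, alpha], [gamma, -gamma]].

    Row/column 0 is 'susceptible' and row/column 1 is 'infected'.
    alpha: total infection rate acting on the vertex (hypothetical name).
    gamma: recovery rate.
    delta: elapsed time, e.g. the epoch length.
    """
    total = alpha + gamma
    if total == 0.0:
        return [[1.0, 0.0], [0.0, 1.0]]  # no transitions are possible
    decay = math.exp(-total * delta)
    pi_s, pi_i = gamma / total, alpha / total  # stationary distribution
    return [
        [pi_s + pi_i * decay, pi_i - pi_i * decay],
        [pi_s - pi_s * decay, pi_i + pi_s * decay],
    ]


if __name__ == "__main__":
    for delta in (0.1, 1.0, 100.0):
        print(delta, two_state_transition_matrix(0.1, 0.25, delta))
```

As `delta` grows, both rows approach the stationary distribution (γ, α)/(α + γ), so the state at the end of a long epoch no longer depends on the state at its start.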

To distinguish the dynamics between M = 1 and M = |V|, we use different symbols to denote the (marginal) probability of vertex v being infected at time t. We use P_v(t) for M = 1 and Q_v(t) for M = |V|. Let us also adopt the following shorthand notation:

We show in theorem 5 that the difference between the two dynamics is of the same order of magnitude as Δ, meaning that choosing a smaller Δ leads to a smaller loss of faithfulness.

Let ϕ(Δ) be any function that satisfies while we do not necessarily have It is useful to think of ϕ(Δ) as a set of functions instead of a particular function. It is the set of functions whose Taylor expansion around 0 has the form Obviously, this set of functions is closed under addition and scalar product, which is why we can abuse the notation by writing

Similarly, let be any function of Δ that satisfies Note that any function that is is ϕ(Δ), but not the other way around. It is the set of functions whose Taylor expansion around 0 has the form

Theorem 5. Suppose we are given the same initial condition, that is, ∀v ∈ V, P_v(0) = Q_v(0). For any T ∈ (0, H) s.t. , for any v ∈ V, P_v(T) − Q_v(T) is ϕ(Δ).

Proof. For M = 1, the dynamics can be described by the differential equation with its integral form being We can rewrite the integral as the sum of a residual term and the Riemann sum,

For M = |V|, the transition matrix between epochs is

Let the probability mass of vertex v be

And similarly

Thanks to the CTMC, which implies that (8) A simple Taylor expansion of Eq (8) gives us A telescoping sum gives us Here, yields ϕ(Δ) because which does go to zero when Δ → 0.

In fact, for any λ = 0, 1, …, L−1, L, (9) (10) using the same reasoning. Eqs (9) and (10) describe a recurrence relation, which lends itself to a proof by induction. We want to show that Let 0 < L′ < L. As the induction hypothesis, let us assume that is ϕ(Δ) for any λ = 1, 2, …, L′−1, L′. The base case is straightforward with being zero (the same initial condition). For the inductive case, we need to show that

To show this, note that

A subtraction gives us which is exactly what we need. Note that is also ϕ(Δ) because Also Δϕ(Δ) yields because

We know . In particular, .

Definition 3. Let the state vector be This definition makes sense for both M = 1 and M = |V|. We say β* is the epidemic threshold if, all other input parameters being fixed,

when β < β*, is a locally asymptotically stable fixed point;
when β > β*, is an unstable fixed point.

We let be the largest eigenvalue of the adjacency matrix A. Because we only discuss the dynamics of M = |V|, we simply use P_v(t) for M = |V|. We use the shorthand

Theorem 6. When M = |V|, as Δ → ∞, the dynamics of the algorithm exhibit the same epidemic threshold as the dynamics in Eq (6).

Proof. First, we have Indeed, as Δ → ∞, we recover the stationary distribution, and vertex v essentially forgets about its state at the beginning of epoch l, which allows us to write (11) We have obtained a discrete autonomous system about . We see that is a fixed point. Let us linearize the discrete system in Eq (11) by computing the Jacobian, meaning that if we let , and let be the largest eigenvalue of J₀, which indicates that ρ = 0 is locally asymptotically stable if and only if and unstable if and only if This result is in line with the results in [55], proving the claim. A simple linear expansion on the right-hand side of Eq (6) yields the same conclusion.

Theorem 7. The system in Eq (6) and the system in Eq (11) have the same fixed points.

Proof. Simply set the right-hand side of Eq (6) to zero, yielding (12) Set in Eq (11), yielding (13) Eqs (12) and (13) are equivalent and must result in the same set of fixed points.

Theorem 5 is reassuring in the sense that even if the graph is partitioned to the utmost with M = |V|, a smaller Δ does lead to smaller loss of faithfulness.

Meanwhile, theorems 6 and 7 deal with a highly adversarial scenario: M = |V| and Δ → ∞. Intuitively, the longer Δ is, the more stale the snapshots of neighbor states will be. Also, the more parts there are in the partition, the more often we need to rely on stale neighbor states. Theorems 6 and 7 tell us that even under this challenging scenario, algorithm 5 in S1 File can still recover some of the most distinctive features of the dynamical system that it is supposed to simulate.

Note that even though algorithm 5 in S1 File is equivalent to the Next Reaction Method [14] commonly used in EDS when M = 1, it is not equivalent to TDS when M = |V|, at least not with the popular method of rejection sampling used among practitioners [1, 2]. The intuition is simple: in a scheme of rejection sampling and synchronous update, the state of a vertex can change at most once in an epoch, ruling out any possibility of re-infection, which may have a nontrivial impact on the trajectory, especially with large Δ. Indeed, rejection sampling is simply a linear approximation of the transition matrix in Eq (7). If we have a variant of TDS that employs the exact transition matrix in Eq (7), then an equivalence can be established between this variant of TDS and algorithm 5 in S1 File.
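To see why rejection sampling amounts to a linear approximation, one can compare the exact probability that a susceptible vertex is infected at the end of an epoch with its first-order expansion in Δ. The sketch assumes a constant total infection rate during the epoch, written as `alpha` (a placeholder name), and a recovery rate γ. Both function names are illustrative.

```python
import math


def exact_infection_prob(alpha: float, gamma: float, delta: float) -> float:
    """P(infected at end of epoch | susceptible at start) for the two-state CTMC.

    This accounts for any number of infection/recovery flips within the epoch.
    """
    total = alpha + gamma
    if total == 0.0:
        return 0.0
    return alpha / total * (1.0 - math.exp(-total * delta))


def linear_infection_prob(alpha: float, delta: float) -> float:
    """First-order approximation alpha * delta, capped at 1.

    This corresponds to allowing at most one state change per epoch.
    """
    return min(1.0, alpha * delta)


if __name__ == "__main__":
    for delta in (0.1, 0.4, 0.7, 1.0):
        exact = exact_infection_prob(0.5, 0.25, delta)
        approx = linear_infection_prob(0.5, delta)
        print(f"delta={delta:.1f} exact={exact:.4f} linear={approx:.4f}")
```

The two quantities agree to first order in Δ, and the gap between them widens as Δ grows, which is consistent with the remark that the discrepancy matters most for large Δ.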

5 Experimental results

In this section, we design experiments to demonstrate the faithfulness and scalability (time consumption and performance breakdown) of algorithm 5 in S1 File. The MPI implementation we choose is MPICH [56].

For all experiments in this section, we collect the turnaround time on process P_i in the r-th repeated experiment for all i ∈ {1, 2, …, M} and r ∈ {1, 2, …, R}. We compute the total time consumption as The time spent blocking is computed in the same way.

As for average trajectories, we collect the infection count at time lΔ in the r-th repeated experiment, for r ∈ {1, 2, …, R}. We do not collect the entire trajectory, which would involve merging event logs from all processes and visualizing data at the scale of gigabytes. Instead, we sample the trajectory at integer multiples of Δ and average the infection counts over the R repeated experiments.

5.1 An SIS epidemic on an Erdos-Renyi graph

In this section, we run our experiments on one single machine, using the multiprocessing feature of MPI [38]. This setup limits the scale of experiments we can handle (memory limits), but it ensures low-latency communication among processes.

We synthesize a standard G(N, p) Erdos-Renyi graph [57] with N = 100000 vertices and edge probability p = 10⁻⁴, which we refer to as er100k. We select 1% of the vertices uniformly at random to be initially infected. This graph is used for all repeated runs. We set γ = 0.25, β = 0.05, and H = 60.

We run experiments on the following grid of parameter settings: M = 1, 2, 4, 8 and Δ = 0.1, 0.2, …, 0.9, 1.0. For each choice of parameter M, the graph is evenly partitioned into M parts. The partitioning is independent of vertex states and graph connectivity. The partitioning is fixed for all experiments with the same M. We run 20 repeated experiments to take averages for each point in the parameter grid.
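One way to obtain such a partition is sketched below. The text does not say how the even split is produced, so the seeded shuffle followed by round-robin assignment is an assumption, as is the function name `uniform_partition`.

```python
import random


def uniform_partition(num_vertices: int, M: int, seed: int = 0):
    """Assign vertices 0..num_vertices-1 to M parts of (nearly) equal size.

    The assignment ignores vertex states and graph connectivity. Vertices are
    shuffled with a fixed seed and dealt out round-robin, so the same M always
    yields the same partition across repeated runs.
    """
    if M < 1:
        raise ValueError("M must be at least 1")
    order = list(range(num_vertices))
    random.Random(seed).shuffle(order)
    part = [0] * num_vertices
    for rank, v in enumerate(order):
        part[v] = rank % M
    return part


if __name__ == "__main__":
    part = uniform_partition(100000, 8)
    print([part.count(k) for k in range(8)])
```

Fixing the seed keeps the partition identical for all experiments with the same M, matching the setup described above.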

Besides M = 1, 2, 4, 8, we also implement TDS with simple rejection sampling.

We plot the average trajectories in Fig 4 to assess the faithfulness of the algorithm under various parameter settings. We also plot in Fig 5 the relative divergence from the setup with M = 1, where the algorithm is reduced to a standard EDS.
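The post-processing behind these two kinds of plots can be sketched as follows. The averaging is a plain mean over repeated runs sampled at integer multiples of Δ. The definition of relative divergence as (average − baseline)/baseline is our assumption, since the exact expression is not reproduced in this text, and both function names are illustrative.

```python
def average_trajectory(runs):
    """Average infection counts over repeated runs.

    runs[r][l] is the infection count at time l * delta in the r-th run.
    Returns a list whose l-th entry is the mean over all runs.
    """
    if not runs:
        raise ValueError("need at least one run")
    length = len(runs[0])
    if any(len(run) != length for run in runs):
        raise ValueError("all runs must have the same number of samples")
    return [sum(run[l] for run in runs) / len(runs) for l in range(length)]


def relative_divergence(avg, baseline):
    """Pointwise (avg - baseline) / baseline; None where the baseline is zero."""
    if len(avg) != len(baseline):
        raise ValueError("trajectories must have the same length")
    return [(a - b) / b if b != 0 else None for a, b in zip(avg, baseline)]


if __name__ == "__main__":
    baseline = average_trajectory([[100, 200], [100, 400]])  # e.g. M = 1
    other = average_trajectory([[110, 270]])                 # e.g. M > 1
    print(relative_divergence(other, baseline))
```

A positive entry means the setup overestimates the infection count relative to the baseline at that sample time, and a negative entry means it underestimates it.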

Fig 4. Uniform partitioning of er100k: Average trajectories with different M and Δ.

https://doi.org/10.1371/journal.pone.0291871.g004

Fig 5. Uniform partitioning of er100k: Difference from M = 1 with different M and Δ.

The horizontal line associated with M = 1 represents the ground truth. All other curves show some lack of faithfulness. When Δ is small (Δ = 0.1, 0.4), M > 1 tends to overestimate the spread of the epidemic since by making a mean-field approximation on the border, we permit interprocess infection events like Inf(Q → v) to occur even if the border vertex v has no infected neighbor on Q at all. When Δ is large (Δ = 0.7, 1.0), M > 1 tends to underestimate the spread of the epidemic since the more outdated the information about other processes is, the more this process underestimates the border infection probability throughout the epoch. The main divergence happens during the exponential growth phase, where a small delay in time causes significant differences in case count. Once the epidemic reaches equilibrium, the differences are diminished (< 5%) for all algorithms.

https://doi.org/10.1371/journal.pone.0291871.g005

We collect timing information in this experiment, shown in Fig 6.

Fig 6. Uniform partitioning of er100k: Average time consumption with different M and Δ.

For total time consumption, the break-even point is around Δ = 0.2, since a smaller Δ causes more updates of interprocess infection events of the form Inf(Q → v). This is aligned with our result from theorem 2, since for M = 1, the overhead term 〈D〉/Δ is absent. Generally, total time consumption decreases as M increases and as Δ increases. We have also shown that the time spent blocking grows as M increases and shrinks as Δ increases, even though it accounts for less than 5% of total time consumption.

https://doi.org/10.1371/journal.pone.0291871.g006

5.2 Stochastic block model

In this section, we study the relationship between graph connectivity and algorithmic scalability. We prepare N = 100000 vertices with 40% of them being initially infected (selected uniformly at random). We partition these vertices evenly into eight blocks and assign them to M = 8 processes accordingly. The partitioning is independent of vertex states. The size of each block is n = N/M = 12500.

The graph connectivity obeys the stochastic block model (SBM) [58]. We fix the mean degree of all synthesized SBMs to be 〈D〉 = 10. We refer to the probability of finding an edge between two vertices from the same block as p_i and the probability of finding an edge between two vertices from different blocks as p_o. We have Once we fix p_o, p_i can be computed as
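The relationship between the two edge probabilities can be sketched from the expected degree of a vertex. The code assumes that a vertex has n − 1 potential neighbors inside its block and N − n outside, so that 〈D〉 = (n − 1) p_i + (N − n) p_o. The expression used in the paper may differ slightly (for instance, n in place of n − 1), and the function name is illustrative.

```python
def intra_block_probability(mean_degree: float, N: int, M: int, p_o: float) -> float:
    """Solve <D> = (n - 1) * p_i + (N - n) * p_o for p_i, with n = N / M."""
    if N % M != 0:
        raise ValueError("N must be divisible by M for equal-sized blocks")
    n = N // M
    if n < 2:
        raise ValueError("each block needs at least two vertices")
    p_i = (mean_degree - (N - n) * p_o) / (n - 1)
    if not 0.0 <= p_i <= 1.0:
        raise ValueError("no valid p_i exists for this choice of p_o")
    return p_i


if __name__ == "__main__":
    for p_o in (0.0, 2e-5, 5e-5, 1e-4):
        print(p_o, intra_block_probability(10, 100000, 8, p_o))
```

With p_o = 0 all edges fall inside blocks and the graph splits into eight disconnected components. As p_o grows toward 〈D〉/(N − n), p_i shrinks toward zero.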

We synthesize 20 SBMs with parameters shown in Table 1.

Table 1. Parameters for SBMs.

https://doi.org/10.1371/journal.pone.0291871.t001

For each (p_i, p_o) pair in Table 1, we generate a stochastic block model, on which we conduct 20 repeated runs of algorithm 5 in S1 File with M = 1 as the baseline and 20 repeated runs with M = 8 to compute the speed-up factor. The results are shown in Fig 7.

Fig 7. Speed-up factor of M = 8 over M = 1.

The speed-up factor is almost 8 when p_o = 0, where the SBM reduces to 8 disconnected components and the task becomes embarrassingly parallel. In general, the speed-up factor increases as Δ increases and as p_o decreases, since both changes reduce the amount of communication required.

https://doi.org/10.1371/journal.pone.0291871.g007

5.3 Experiments with a real graph on an MPI cluster

In this section, we run our experiments on a cluster of four machines using the same MPI implementation as before. Each of the four machines has the following hardware setup: 2x12cores@2.5 GHz, 256GB RAM, 2x240GB SSD, 2x2TB@7200RPM.

We choose a graph from a real-world blogging community called LiveJournal, where people declare friendship with each other, forming an undirected graph with one connected component of 3997962 vertices and 34681189 edges [59]. We refer to this graph as lj4m.

We use METIS [32, 51, 60] for graph partitioning. METIS partitions a graph so that all parts have roughly the same size and few edges run between different parts. We partition the LiveJournal graph into 12, 24, 48, and 96 parts, using the default parameter setting of METIS [32]. We assign the parts evenly to the four machines, which are connected by LAN (Ethernet), so that each machine hosts 3, 6, 12, or 24 parts, respectively. We cap the number of parts per machine at 24 because each machine has 24 cores.

We initialize 0.5% of the vertices as infected, chosen uniformly at random and independently of the graph partitioning. We set β = 0.02, γ = 0.3, and H = 60. To take averages, we run 10 experiments for each (M, Δ) pair.

We collect average trajectories, shown in Fig 8. We also plot the relative divergence from the setup with M = 1 in Fig 9.

Fig 8. METIS partitioning of lj4m: Average trajectories for an SIS epidemic on a graph of around 4 million vertices.

https://doi.org/10.1371/journal.pone.0291871.g008

Fig 9. METIS partitioning of lj4m: Difference from M = 1 with different M and Δ.

We see that, unlike Fig 5, all experiments with M > 1 underestimate the infection count. There are two potential causes. Firstly, the choice of Δ is much higher than the typical interevent time in the simulated system, which tends to shrink as the size of the system grows. Secondly, our mean-field approximation on the border tends to inflate the number of infected vertices, as discussed in the caption of Fig 5, but this inflating effect is counterbalanced by our use of the METIS graph partitioning library, which vastly reduces the number of interprocess edges and gives rise to smaller borders on all processes. Again, as the epidemic reaches equilibrium, the divergence shrinks to less than 5%.

https://doi.org/10.1371/journal.pone.0291871.g009

We also collect timing information, shown in Fig 10.

Fig 10. METIS partitioning of lj4m: Average time consumption with different M and Δ.

Like Fig 6, we see a decrease in total time consumption as M increases and as Δ increases. Unlike Fig 6, time spent blocking goes down as M increases. Also unlike Fig 6, the fraction of time spent blocking goes up as Δ increases. We hypothesize that this is caused by poor load balancing because with a large Δ and a small M, the computational workload per epoch increases, meaning that a straggler process can lag behind more, keeping all other processes blocked. Meanwhile, simulations on er100k enjoy good load balancing because the graph and the partitioning scheme in Fig 6 are highly homogeneous.

https://doi.org/10.1371/journal.pone.0291871.g010

5.4 Ghost-cell implementation

In this section, we discuss a variant of algorithm 5 in S1 File, referred to as the “ghost-cell implementation”, which demystifies the phenomenon observed in Figs 5 and 9 where M > 1 leads to a different equilibrium from that of M = 1.

First and foremost, the lack of faithfulness takes two different forms in Fig 5 (for large Δ) and Fig 9. The first type concerns growth: M > 1 underestimates the infection count during the exponential growth phase of the epidemic. The second type concerns the equilibrium: M = 1 and M > 1 eventually reach different equilibria. We are more concerned about the second type of error because:

the second type is harder to explain intuitively, while the first type can be explained by stale border states causing the lag;
the second type is more detrimental to practitioners since many are only interested in the epidemic size at equilibrium;
the existence of the second type contradicts theorem 7.

Theorem 7 assumes that M = |V|, meaning that every vertex is on its own process and no averaging is needed. We therefore hypothesize that the border averaging trick causes the incorrect equilibrium. To test this hypothesis, we modify algorithm 5 in S1 File to send the state of the entire border instead of a summary statistic. We refer to this variant of algorithm 5 in S1 File as the “ghost-cell implementation” (GCI).

We also use a new graph dataset: the DBLP collaboration network [59]. The dataset comes with ground-truth communities. These communities are not disjoint, nor do they cover all of V. We prune the graph by taking its subgraph induced by yielding a subgraph with around 260k vertices, which we call dblp260k. We conduct this preprocessing to strengthen the community structure of the graph.

We run GCI on er100k and dblp260k, with the same parameters as in Fig 4. The results are shown in Figs 11 to 13.

Fig 11. Uniform partitioning of er100k (GCI): Difference from M = 1 with different M and Δ.

Compared to Fig 5, the lag (type-1) is still around but the difference in equilibria (type-2) has been reduced to less than 1%.

https://doi.org/10.1371/journal.pone.0291871.g011

Fig 12. Uniform partitioning of dblp260k: Difference from M = 1 with different M and Δ.

A uniform partitioning of dblp260k yields highly unsatisfactory results with a type-2 error of more than 10%, indicating that while a naive uniform partitioning scheme works well for naive Erdos-Renyi graphs, sophisticated graph structures demand sophisticated partitioning schemes.

https://doi.org/10.1371/journal.pone.0291871.g012

Fig 13. Uniform partitioning of dblp260k (GCI): Difference from M = 1 with different M and Δ.

The GCI brings significant improvements compared to Fig 12 by bringing the type-2 error down to less than 1%.

https://doi.org/10.1371/journal.pone.0291871.g013

6 Discussions

We have proposed in Section 2 and Section 3 an algorithm to simulate epidemics at a large scale with the help of parallel and distributed computing hardware. We have offered a well-tailored implementation in Section 7. We have analyzed its complexity in Section 4.1 and established that it scales well with the number of cores used while sending a modest number of messages. We have also studied its induced dynamical system in Section 4.2, finding that when each vertex is on its own process, the bias our algorithm introduces tends to zero as Δ tends to zero, and that our algorithm preserves features of interest to epidemiologists, such as the epidemic threshold, even as Δ tends to +∞. We find in Section 5.1 and Section 5.3 that the total time consumption goes down as M and Δ increase, albeit at the cost of faithfulness, as shown in Fig 4. We find in Section 5.2 that reducing graph connectivity between parts (processes) yields better scalability, which motivated our decision to employ graph partitioning software in Section 5.3. In Section 5.3, however, we see that a graph partitioning scheme can be a double-edged sword in that it may worsen load balancing, as shown in Fig 16 in S1 File. Although we have shown that poor load balancing can be the culprit, we will investigate the true cause of the difference between Figs 6 and 10 in future work.

As for the limitations and extensions of our algorithm, let us review the scope of the problem.

“Markovian”: Instead of rates of exponential distributions, we compute instantaneous rates as ψ(t)/(1 − Ψ(t)), where ψ(t) is the PDF and Ψ(t) is the CDF. Instead of border infection probability, we compute the average infection rate on the border by taking the average of instantaneous rates at the turn of epochs.
“homogeneous rates”: Similar solution as above.
“SIS epidemics”: Consider a different compartmental model with k compartments. Then instead of a number, we can send a PMF, summarizing the probability of a border vertex being in different compartments. Then instead of one integer, we send k − 1 integers.
“a fixed graph”: We can allow the graph to change at the end of each epoch, an idea also adopted in [10].
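For the non-Markovian extension, the instantaneous rate of an inter-event time distribution is its hazard rate, ψ(t)/(1 − Ψ(t)). The sketch below evaluates it for an arbitrary PDF/CDF pair. The Weibull distribution is used only as a convenient example of a non-exponential inter-event time, and all function names are illustrative.

```python
import math


def hazard_rate(pdf, cdf, t: float) -> float:
    """Instantaneous rate psi(t) / (1 - Psi(t)) at elapsed time t."""
    survival = 1.0 - cdf(t)
    if survival <= 0.0:
        return math.inf  # the event has occurred with probability one
    return pdf(t) / survival


def weibull_pdf(t: float, shape: float, scale: float) -> float:
    z = t / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))


def weibull_cdf(t: float, shape: float, scale: float) -> float:
    return 1.0 - math.exp(-((t / scale) ** shape))


if __name__ == "__main__":
    for t in (0.5, 1.0, 2.0):
        rate = hazard_rate(lambda s: weibull_pdf(s, 2.0, 1.0),
                           lambda s: weibull_cdf(s, 2.0, 1.0), t)
        print(t, rate)
```

For an exponential distribution the hazard rate is constant and equal to the usual rate, so the Markovian case is recovered. For other distributions the rate depends on the time elapsed since the last event.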

Moreover, our approach of sending the border infection probability is just one of the many ways of sending a summary of one process to another. This approach has two main downsides. We discuss them below and offer some possible remedies.

It allows infection events to occur where there should be none. Consider a vertex v on P and a vertex u on Q, each being the only neighbor of the other. Suppose v and u are susceptible at the beginning of the epoch. As long as some infected vertices exist on the (P, Q) border or the (Q, P) border, u or v can turn infected by the end of the epoch, even though neither has any infected neighbors. A solution to this problem is to send the state of the whole border over the network, although this will inevitably increase bandwidth usage.
Whether we send the state of the whole border or just a summary statistic, the information is stale. One expedient is to use a smaller Δ, limiting how stale the information can be. Another solution is to “predict” the state dynamics of the remote process, even though this is more of an art than a science and not easily generalizable to applications beyond epidemiology.
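The first downside can be illustrated with a toy comparison of the infection rate that a border vertex v on P experiences from process Q under the two schemes. This is a simplified model rather than algorithm 5 itself: the summary scheme is assumed to treat each remote neighbor of v as infected with probability equal to the infected fraction of the whole border, while the ghost-cell scheme uses the actual states of v's remote neighbors. All names are hypothetical.

```python
def summary_rate(beta, v_neighbors_on_q, q_border, infected):
    """Mean-field rate: each remote neighbor is infected with the border-wide fraction."""
    if not q_border:
        return 0.0
    fraction = sum(1 for u in q_border if u in infected) / len(q_border)
    return beta * len(v_neighbors_on_q) * fraction


def ghost_cell_rate(beta, v_neighbors_on_q, infected):
    """Exact rate from the (possibly stale) states of v's actual remote neighbors."""
    return beta * sum(1 for u in v_neighbors_on_q if u in infected)


if __name__ == "__main__":
    beta = 0.05
    q_border = ["u", "w"]      # vertices on Q adjacent to some vertex on P
    v_neighbors_on_q = ["u"]   # v's only neighbor, which is susceptible
    infected = {"w"}           # w is infected but is not adjacent to v
    print(summary_rate(beta, v_neighbors_on_q, q_border, infected))
    print(ghost_cell_rate(beta, v_neighbors_on_q, infected))
```

In this example the summary scheme assigns v a positive infection rate although its only neighbor is susceptible, whereas the ghost-cell scheme assigns a rate of zero. The price is that the state of every border vertex has to be transmitted.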

Supporting information

S1 File.

https://doi.org/10.1371/journal.pone.0291871.s001

(PDF)

References

1. Ferguson NM, Laydon D, Nedjati-Gilani G, Imai N, Ainslie K, Baguelin M, et al. Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand. Imperial College COVID-19 Response Team; 2020. doi:10.25561/77482.
2. Ferguson NM, Cummings DA, Fraser C, Cajka JC, Cooley PC, Burke DS. Strategies for mitigating an influenza pandemic. Nature. 2006;442(7101):448–452. pmid:16642006
3. Van Mieghem P, Van de Bovenkamp R. Non-Markovian infection spread dramatically alters the susceptible-infected-susceptible epidemic threshold in networks. Physical review letters. 2013;110(10):108701. pmid:23521310
4. Boguñá M, Lafuerza LF, Toral R, Serrano MÁ. Simulating non-Markovian stochastic processes. Physical Review E. 2014;90(4):042108. pmid:25375439
5. Buono C, Vazquez F, Macri PA, Braunstein L. Slow epidemic extinction in populations with heterogeneous infection rates. Physical Review E. 2013;88(2):022813. pmid:24032889
6. Qu B, Wang H. SIS epidemic spreading with heterogeneous infection rates. IEEE Transactions on Network Science and Engineering. 2017;4(3):177–186.
7. Battiston F, Cencetti G, Iacopini I, Latora V, Lucas M, Patania A, et al. Networks beyond pairwise interactions: structure and dynamics. Physics Reports. 2020;874:1–92.
8. Gross T, D’Lima CJD, Blasius B. Epidemic dynamics on an adaptive network. Physical review letters. 2006;96(20):208701. pmid:16803215
9. Gross T, Blasius B. Adaptive coevolutionary networks: a review. Journal of the Royal Society Interface. 2008;5(20):259–271. pmid:17971320
10. Vestergaard CL, Génois M. Temporal Gillespie algorithm: fast simulation of contagion processes on time-varying networks. PLoS computational biology. 2015;11(10):e1004579. pmid:26517860
11. Wang W, Liu QH, Liang J, Hu Y, Zhou T. Coevolution spreading in complex networks. Physics Reports. 2019;820:1–51. pmid:32308252
12. Ferreira SC, Castellano C, Pastor-Satorras R. Epidemic thresholds of the susceptible-infected-susceptible model on networks: A comparison of numerical and theoretical results. Physical Review E. 2012;86(4):041125. pmid:23214547
13. Gillespie DT. A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. Journal of computational physics. 1976;22(4):403–434.
14. Gibson MA, Bruck J. Efficient exact stochastic simulation of chemical systems with many species and many channels. The journal of physical chemistry A. 2000;104(9):1876–1889.
15. Li C, van de Bovenkamp R, Van Mieghem P. Susceptible-infected-susceptible model: A comparison of N-intertwined and heterogeneous mean-field approximations. Physical Review E. 2012;86(2):026116. pmid:23005834
16. Röst G, Vizi Z, Kiss IZ. Pairwise approximation for SIR-type network epidemics with non-Markovian recovery. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2018;474(2210):20170695. pmid:29507514
17. Fennell PG, Melnik S, Gleeson JP. Limitations of discrete-time approaches to continuous-time contagion dynamics. Physical Review E. 2016;94(5):052125. pmid:27967171
18. Fujimoto R. Parallel and distributed simulation. In: 2015 Winter Simulation Conference (WSC). IEEE; 2015. p. 45–59.
19. Bryant RE. Simulation of Packet Communication Architecture Computer Systems. MASSACHUSETTS INST OF TECH CAMBRIDGE LAB FOR COMPUTER SCIENCE; 1977.
20. Jefferson DR. Virtual time. ACM Transactions on Programming Languages and Systems (TOPLAS). 1985;7(3):404–425.
21. Perumalla KS, Seal SK. Discrete event modeling and massively parallel execution of epidemic outbreak phenomena. Simulation. 2012;88(7):768–783.
22. Dietz K, Heesterbeek J. Daniel Bernoulli’s epidemiological model revisited. Mathematical biosciences. 2002;180(1-2):1–21. pmid:12387913
23. Kermack WO, McKendrick AG. A contribution to the mathematical theory of epidemics. Proceedings of the royal society of london Series A, Containing papers of a mathematical and physical character. 1927;115(772):700–721.
24. Brauer F. Compartmental models in epidemiology. Mathematical epidemiology. 2008; p. 19–79.
25. Demongeot J, Griette Q, Magal P. SI epidemic model applied to COVID-19 data in mainland China. Royal Society Open Science. 2020;7(12):201878. pmid:33489297
26. Kiss IZ, Miller JC, Simon PL, et al. Mathematics of epidemics on networks. Cham: Springer. 2017;598:31.
27. Soto-Ferrari M, Holvenstot P, Prieto D, de Doncker E, Kapenga J. Parallel programming approaches for an agent-based simulation of concurrent pandemic and seasonal influenza outbreaks. Procedia computer science. 2013;18:2187–2192.
28. Bhatele A, Yeom JS, Jain N, Kuhlman CJ, Livnat Y, Bisset KR, et al. Massively parallel simulations of spread of infectious diseases over realistic social networks. In: 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). IEEE; 2017. p. 689–694.
29. Tian Z, Lindner P, Nissl M, Koch C, Tannen V. Generalizing Bulk-Synchronous Parallel Processing for Data Science: From Data to Threads and Agent-Based Simulations. Proc ACM Manag Data. 2023;1(2).
30. Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, et al. Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data; 2010. p. 135–146.
31. Han M, Daudjee K, Ammar K, Özsu MT, Wang X, Jin T. An experimental comparison of pregel-like graph processing systems. Proceedings of the VLDB Endowment. 2014;7(12):1047–1058.
32. Karypis G, Kumar V. METIS—Serial Graph Partitioning and Fill-reducing Matrix Ordering;. Available from: https://github.com/KarypisLab/METIS.
33. Fortunato S. Community Detection in Graphs. Physics Reports. 2009;486(3-5).
34. Said A, Abbasi RA, Maqbool O, Daud A, Aljohani NR. CC-GA: A Clustering Coefficient based Genetic Algorithm for Detecting Communities in Social Networks. Applied Soft Computing. 2017; p. S1568494617306774.
35. Zhou J, Cui G, Hu S, Zhang Z, Yang C, Liu Z, et al. Graph neural networks: A review of methods and applications. AI Open. 2020;1:57–81.
36. Ju W, Fang Z, Gu Y, Liu Z, Long Q, Qiao Z, et al. A Comprehensive Survey on Deep Graph Representation Learning; 2023.
37. Kjolstad FB, Snir M. Ghost cell pattern. In: Proceedings of the 2010 Workshop on Parallel Programming Patterns; 2010. p. 1–9.
38. Gropp W, Lusk E, Skjellum A. Using MPI: portable parallel programming with the message-passing interface. vol. 1. MIT press; 1999.
39. Kulikovskii AG, Pogorelov NV, Semenov AY. Mathematical aspects of numerical solution of hyperbolic systems. Chapman and Hall/CRC; 2000.
40. Abgrall R, Shu CW. Handbook of numerical methods for hyperbolic problems: applied and modern issues. vol. 18. Elsevier; 2017.
41. Chi C, Lee BJ, Im HG. An improved ghost-cell immersed boundary method for compressible flow simulations. International Journal for Numerical Methods in Fluids. 2017;83(2):132–148.
42. Hoefler T, Lumsdaine A. Message progression in parallel computing-to thread or not to thread? In: 2008 IEEE International Conference on Cluster Computing. IEEE; 2008. p. 213–222.
43. Valiant LG. A bridging model for parallel computation. Communications of the ACM. 1990;33(8):103–111.
44. Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein JM. Distributed graphlab: A framework for machine learning in the cloud. arXiv preprint arXiv:1204.6078. 2012.
45. Gómez S, Arenas A, Borge-Holthoefer J, Meloni S, Moreno Y. Discrete-time Markov chain approach to contact-based disease spreading in complex networks. EPL (Europhysics Letters). 2010;89(3):38009.
46. Pastor-Satorras R, Vespignani A, et al. Epidemics and immunization in scale-free networks. Handbook of Graphs and Networks, Wiley-VCH, Berlin. 2003;.
47. Mata AS, Ferreira SC. Pair quenched mean-field theory for the susceptible-infected-susceptible model on complex networks. EPL (Europhysics Letters). 2013;103(4):48003.
48. Newman M. Networks. Oxford university press; 2018.
49. Mondragón R. Estimating degree–degree correlation and network cores from the connectivity of high–degree nodes in complex networks. Scientific reports. 2020;10(1):1–24. pmid:32221346
50. Boguñá M, Pastor-Satorras R, Vespignani A. Epidemic spreading in complex networks with degree correlations. In: Statistical mechanics of complex networks. Springer; 2003. p. 127–147.
51. Karypis G, Kumar V. METIS: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices. University of Minnesota Conservancy. 1997;.
52. Pastor-Satorras R, Vespignani A. Epidemic spreading in scale-free networks. Physical review letters. 2001;86(14):3200. pmid:11290142
53. Ortega E, Machado D, Lage-Castellanos A. Dynamics of epidemics from cavity master equations: Susceptible-infectious-susceptible models. Physical Review E. 2022;105(2):024308. pmid:35291082
54. Wang W, Tang M, Stanley HE, Braunstein LA. Unification of theoretical approaches for epidemic spreading on complex networks. Reports on Progress in Physics. 2017;80(3):036603. pmid:28176679
55. Wang Y, Chakrabarti D, Wang C, Faloutsos C. Epidemic spreading in real networks: An eigenvalue viewpoint. In: 22nd International Symposium on Reliable Distributed Systems, 2003. Proceedings. IEEE; 2003. p. 25–34.
56. Team M. MPICH—a high performance and widely portable implementation of the Message Passing Interface (MPI) standard;. Available from: https://www.mpich.org/.
57. Bollobás B. Random graphs. In: Modern graph theory. Springer; 1998. p. 215–252.
58. Holland PW, Laskey KB, Leinhardt S. Stochastic blockmodels: First steps. Social networks. 1983;5(2):109–137.
59. Leskovec J, Krevl A. SNAP Datasets: Stanford Large Network Dataset Collection; 2014. http://snap.stanford.edu/data.
60. Karypis G, Kumar V. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on scientific Computing. 1998;20(1):359–392.

