Exact Hybrid Particle/Population Simulation of Rule-Based Models of Biochemical Systems

Detailed modeling and simulation of biochemical systems is complicated by the problem of combinatorial complexity, an explosion in the number of species and reactions due to myriad protein-protein interactions and post-translational modifications. Rule-based modeling overcomes this problem by representing molecules as structured objects and encoding their interactions as pattern-based rules. This greatly simplifies the process of model specification, avoiding the tedious and error prone task of manually enumerating all species and reactions that can potentially exist in a system. From a simulation perspective, rule-based models can be expanded algorithmically into fully-enumerated reaction networks and simulated using a variety of network-based simulation methods, such as ordinary differential equations or Gillespie's algorithm, provided that the network is not exceedingly large. Alternatively, rule-based models can be simulated directly using particle-based kinetic Monte Carlo methods. This “network-free” approach produces exact stochastic trajectories with a computational cost that is independent of network size. However, memory and run time costs increase with the number of particles, limiting the size of system that can be feasibly simulated. Here, we present a hybrid particle/population simulation method that combines the best attributes of both the network-based and network-free approaches. The method takes as input a rule-based model and a user-specified subset of species to treat as population variables rather than as particles. The model is then transformed by a process of “partial network expansion” into a dynamically equivalent form that can be simulated using a population-adapted network-free simulator. The transformation method has been implemented within the open-source rule-based modeling platform BioNetGen, and resulting hybrid models can be simulated using the particle-based simulator NFsim. Performance tests show that significant memory savings can be achieved using the new approach and a monetary cost analysis provides a practical measure of its utility.


S1 Monetary cost analysis
We assume that each simulation is run on one virtual core (with possibly more than one simulation running per core) and there is no memory sharing among simulations. Let I be an Amazon EC2 instance type, m I the total instance memory, v I the number of virtual cores in the instance, u v I the number of compute units per virtual core, C I the cost of the instance per unit time, and N the number of simultaneous simulations running on an instance. For each configuration pair (I, N ), with N ∈ {1, 2, . . .}, we will compute the cost (per unit time) per "effective" compute unit (ECU). We will then obtain the minimum cost by finding that (those) configuration(s) (multiple instances may have the same cost) with the lowest cost under the constraint that the instance(s) has (have) sufficient memory to run N simulations. Under our assumption that a simulation is run on one virtual core, the average number of compute units available per simulation is the minimum of the number of compute units per core and the total number of compute units divided by the number of running simulations, Note that the number of ECUs is equal to the total number of compute units if N ≥ v I and less if N < v I .
The intuition here is that if the memory required per simulation limits the number of simulations that can be run simultaneously on an instance to fewer than the number of virtual cores, then we must treat the instance as if it had fewer virtual cores since some cannot be utilized. The cost (per unit time) per ECU for a given configuration is then the cost of the instance divided by the number of ECUs, For this instance type, we show in Fig. S1 the minimum effective cost, C I min (m s ), versus memory per simulation, m s . Up to m s = 8.6 GB it is possible to run four simultaneous simulations at a cost of ∼ $0.08/ECU-hr. We can run three simulations at ∼ $0.10/ECU-hr for 8.6 < m s ≤ 11.4 GB, two simulations at ∼ $0.15/ECU-hr for 11.4 < m s ≤ 17.1 GB, and one simulation at ∼ $0.31/ECU-hr for 17.1 < m s ≤ 34.2 GB. Above m s = 34.2 GB we obviously cannot run any simulations because the memory requirement exceeds the total memory available in the instance.

S2.1 Network-based methods
In network-based stochastic simulation methods, the system state is represented by a species population vector and the reaction network by a list of reactions [stoichiometry vector, rate law, and current propensity (rate)] along with a reaction dependency tree for efficient propensity updates [2]. Keeping species concentrations fixed, when the reactor volume is increased the population counters and reaction propensities all increase, but the space required to store the state information remains constant. This is one of the beneficial properties of network-based methods. However, in practice it is possible for this constant cost to be prohibitively large if the reaction network is combinatorially large (e.g., infinite), such as arises naturally in models of multimeric protein complex assembly and branching polymers [3,4].

S2.2 Network-free methods
In contrast to network-based methods, the memory cost of network-free (NF) methods scales approximately linearly with reactor volume (with fixed concentrations). The memory required for a NF simulation consists of two quantities: the fixed simulator memory and the particle-dependent memory. The former is the memory required to initialize the NF simulator with zero particles. This is just a constant value, a nf , which may depend on the software implementation. The latter can be calculated by summing over the memory cost of populating each species. Assuming stochastic fluctuations are small compared to mean concentrations, the population of a species is approximately the product of a volume-independent concentration and the reactor volume. Hence, the cost of representing a species is the product of a species-specific constant, b nf (s), the species concentration, [s], and the reactor volume, V . The species-specific constant accounts for the cost of representing the molecular structure and all reactant pattern matches for one instance of the species. The total memory cost of the particles is obtained by summing over the cost of all species in the set of reachable species S. Since each term in the sum is linearly dependent on volume, the sum is also linear in volume. At large volumes the constant cost of the NF simulator becomes negligible. Thus, we obtain a linear memory cost in terms of volume, i.e.,

S2.3 Hybrid particle/population
The memory cost of HPP is also linear with respect to reactor volume and we can derive it in the same way as for NF. The cost of initializing the simulator with zero particles, a hpp (P ), includes the constant cost of initializing both the simulator and the partial reaction network for a specific set of population species P ⊂ S. The particle cost term has the same form as for NF except that we only sum over the non-population species, S\P . Thus, we have Note that the species-specific memory constant, b hpp (s, P ), may be different than that for NF, b nf (s). This is because of the memory required in HPP to match the particle-based species to the mixed particlepopulation and population-mapping rules. In general, b nf (s) ≤ b hpp (s, P ). It is also typically the case that the fixed simulator cost of HPP, a hpp (P ), is greater than that for NF, a nf . Thus, in order to reduce memory use the set of population species, P , must be chosen such that B hpp (P ) < B nf . It is difficult to calculate these quantities explicitly, but we can see the factors that contribute to a "good" set of population species, i.e., high species concentrations and participation in a small number of mixed particle-population rules.

S2.4 Optimal HPP
With a formula for HPP memory cost, we can now imagine a hypothetical algorithm where an optimal set of population species is chosen "on the fly" to minimize memory cost and the hybrid model is adapted accordingly. Let us call this method "automated HPP" (aHPP). How would the memory use of aHPP compare to HPP, NF, and SSA? We will begin to answer this question by considering the behavior at small and large volumes. We will then scrutinize the intermediate region between these two extremes.

S2.4.1 Small volumes
For fixed species concentrations, as the reactor volume approaches zero the populations of all species also approach zero. There is little benefit, therefore, to lumping populations in this case since the memory cost associated with representing the partial reaction network will outweigh any savings due to lumping. Thus, at very small volumes aHPP will treat all species as particles (as in NF methods) and, consequently, the memory scaling will be linear with volume.

S2.4.2 Large volumes
Memory scaling of aHPP at large volumes will depend on whether the model encodes a finite or countablyinfinite network. For a finite network, the population of every reachable species will become large as the volume increases towards infinity. At some point, the cost of representing species as particles (linear cost) will exceed that of representing the entire reaction network (constant cost). Thus, at very large volumes aHPP will treat every species as a population and, consequently, the memory scaling will be constant with volume.
The case of countably-infinite networks is more difficult. In general, the memory scaling with volume will depend on the specifics of the model, with the best-case scenario being nearly constant and the worst-case being linear. An example of the worst case is the TLBR model in the gel phase [3,4], where a "super aggregate" forms that subsumes the majority of the molecules in the system. Increasing the volume only increases the size of the aggregate, so little to no lumping is possible. aHPP will operate essentially as an NF method in this case and, consequently, memory use with scale linearly with volume.
The situation will not be so dire in all cases, however. For example, consider a system with N particles, where each particle may take on a countably infinite number of states s ∈ {1, 2, 3, . . .} and has independent probability p(s) = 0.5 s of being in state s. The probability that none of the particles are in state s is thus p(N s = 0) = (1−0.5 s ) N , where N s is the number of particles in state s. If we consider each state to be an independent species, then we can estimate the expected number of populated species, , by summing over the probability that each state is populated: (1) We see in Fig. S2 that, in the limit of large N , this sum grows logarithmically with N . The intuition here is that an increase in the number of particles (increasing the volume while holding concentrations fixed) results in larger populations of previously populated species (those with small s) and new instances of rare species that were not previously populated (those with large s). However, the rate of accumulation of novel species diminishes as the volume gets large, leading to logarithmic growth. This is clearly an artificial example in that we are not considering intermolecular binding or the mechanisms of species formation. However, it does provide a simple example of sub-linear memory scaling for an infinite-network model. Obviously, it is difficult to say how the memory cost will scale for more realistic infinite-network models, but if the bulk of the population resides in relatively few of the species, with the remainder of the species being increasingly rare, then we expect that "something like" logarithmic scaling will be achievable in many cases.

S2.4.3 Intermediate volumes
Finally, we consider volumes between "very large" and "very small." Let us denote the minimal memory use function by m * : V → (0, ∞). We will say that P V ⊆ S is the optimal lumping for volume V if it achieves the minimum memory use. We will argue that m * is monotonic increasing and concave with respect to V . This yields a qualitative picture for aHPP where memory use scales linearly for very small volumes, is monotonic increasing and concave for intermediate volumes, and is somewhere between constant (finite networks and best case for infinite networks) and linear (worst case for infinite networks) at very large volumes. This behavior, along with those for NF, HPP, and network-based SSA, is sketched in Fig. 9 of the main text.
First, suppose 0 < V 0 < V 1 and let P 1 be the optimal lumping for V 1 . Since B hpp (P 1 ) ≥ 0, we have Thus, m * is monotonic increasing with volume. Now, suppose 0 < V 0 < V 1 and 0 ≤ t ≤ 1. Let the set of population species, P t , be the optimal lumping for volume as the minimum memory we must have Together these inequalities imply that After a bit of algebra, we obtain Recalling

S3 Rule-based modeling with BioNetGen
BioNetGen is a collection of open-source software tools for specifying, simulating, and analyzing rule-based models of complex biochemical systems [5][6][7][8]. Models are specified in the BioNetGen language (BNGL), a formal text-based modeling language designed specifically to represent proteins and protein-protein interactions. BNGL is similar to other rule-based languages, notably Kappa [9,10]. In this section, we first provide a summary of the BNGL syntax. We then discuss the structure of a BNGL input file, including the model specification and a summary of actions that can be applied to a model. We follow with instructions on how to generate an HPP model within BioNetGen and simulate it using NFsim. For further information regarding the syntax and semantics of BNGL, we refer the reader to Refs. [6,7,11].

S3.1.1 Macromolecules as structured objects
The fundamental object in BNGL is the molecule. Molecules are structured objects that contain components.
Components have state values (possibly null) and serve as sites for bond formation. Typically, molecules correspond to biological macromolecules such as proteins, while components represent protein functional domains. Component states may describe post-translational modifications, such as phosphorylation, or other properties of the macromolecule, such as conformational state.
Molecules are defined through molecule type templates that specify the set of components, and the possible state values of those components, that comprise a molecule. For example, in Blinov et al. [12] the molecule type definition for the epidermal growth factor receptor (EGFR) is This specifies that Egfr molecules have four components: l, r, Y1068, and Y1148. The first two represent a ligand binding domain and a receptor dimerization domain, respectively. The other two represent tyrosine residues 1068 and 1148, which are sites of phosphorylation [13,14]. Both Y1068 and Y1148 can exist in two possible states, Y (tyrosine) and pY (phosphorylated tyrosine). Names of molecule types, components, and states may be assigned any arbitrary string of alphanumeric characters [6,7], but conventionally are chosen to correspond to biological entities.
Molecules can be thought of as instances of molecule types, the difference being that in molecules a state for each molecular component must be explicitly specified. For example, Egfr(l,r,Y1068∼pY,Y1148∼Y) specifies an unligated Egfr monomer with a phosphorylated tyrosine at residue 1068 and an unphosphorylated tyrosine at residue 1148.

S3.1.2 Molecular complexes as hierarchical graphs
BNGL represents molecular complexes within a hierarchical graph formalism: components are represented by nodes, molecules are groupings of nodes, and bonds are edges between components [15,16]. Molecules that are connected by bonds between their components are said to belong to the same complex . Bound components are designated by a shared !LABEL token following each component in the bond. For example, Egf(r!0).Egfr(l!0,r!1,Y1068∼pY,Y1148∼Y).Egfr(l,r!1,Y1068∼pY,Y1148∼Y) specifies a complex comprised of an Egfr dimer and a bound Egf ligand. The Egf ligand is bound to the first Egfr molecule through a bond, labeled '0', between its r component and the l component of the Egfr molecule. The two Egfr molecules are bound to each other through a second bond, labeled '1', between their r components. The '.' symbol specifies that two molecules are part of the same complex. Molecules that are bound to each other are necessarily in the same complex but molecules in the same complex need not be bound to each other (e.g., the Egf ligand and the second Egfr molecule).
Two complexes that have the same molecular composition, with the same configuration of bonds and states, are said to be instances of the same species. More formally, two complexes are of the same species if their graphs are identical up to a labeled isomorphism [15].

S3.1.3 Patterns as subgraphs
Patterns describe structural motifs that may be found within complexes. In BNGL, patterns are written just like complexes except that components, states, bonds, and molecules that are not part of the motif are omitted. A pattern is said to match a complex (or species) if the motif is found within the complex. Note that it is possible for a pattern to match at multiple places within the same complex. In terms of graphs, a pattern is a subgraph and a match is a subgraph isomorphism from the pattern into the complex graph [5,15]. As an example, the pattern Egfr(l,r) matches each instance of Egfr with unbound l and r components, including Egfr(l,r,Y1068∼Y,Y1148∼pY), Egfr(l,r,Y1068∼pY,Y1148∼pY), and where the matches have been underlined. Since the components Y1068 and Y1148 are omitted from the pattern they can be in any of their allowed states and either bound or unbound.
A pattern may also require that a component be bound without specifying its binding partner by substituting the bond label with the '+' wildcard (e.g., Egfr(l!+,r)). Similarly, the bonding state can be specified as irrelevant (either bound or unbound) by using the '?' wildcard (e.g., Egfr(l!?,r)). Finally, if a component is listed but no state is indicated then the pattern will match the component regardless of state.
Patterns have two primary purposes in BNGL [6,7]: (i) they specify parts of a complex that are involved in a reaction; (ii) they define the observable outputs (or simply observables) of a model. The second function is critical for facilitating comparisons between simulation predictions and experimental observations.

S3.1.4 Reaction rules as graph transformations
Reaction rules (or simply rules) describe classes of reactions that transform the states of complexes in a system. A rule consists of a set of reactant patterns, a set of product patterns, a directionality arrow (unidirectional '->' or bidirectional '<->'), and a rate law. The reactant patterns match reaction sites (motifs) within complexes in the system. The product patterns indicate how the reactant patterns are modified when the rule fires. The rate law determines the firing rate per set of reactant matches. In terms of graphs, a rule is a graph transformation with a dynamic rate that determines how frequently the transformation is applied to each matching subgraph [15].
There are six basic transformations that can be encoded in a BNGL rule: (i) bond addition, (ii) bond deletion, (iii) state change, (iv) molecule synthesis, (v) molecule deletion, and (vi) complex deletion. 1 The transformations are implied by the differences between the reactant and product patterns in a rule. For example, the rule Egf(r) + Egfr(l,r) -> Egf(r!0).Egfr(l!0,r) k matches Egf molecules with unbound r components and Egfr molecules with unbound l and r components. The rate law is of the mass-action type (by default [6,7]) with a rate constant k. When the rule fires a bond is formed between the r component of Egf and the l component of Egfr. We refer the reader to Refs. [6,7] for further details regarding the specifics of the BNGL syntax.

S3.2 Anatomy of a BNGL file
BioNetGen models are specified within text files containing the .bngl extension. Model elements are specified in blocks delimited by begin and end tags. The entire model specification is enclosed within a begin model / end model block and the model specification itself is contained within six additional blocks: parameters, molecule types, seed species, observables, functions, and reaction rules. Though not strictly required, it is suggested that these blocks appear in this order. Molecule types, reaction rules, and observables have been discussed above. Values of rate parameters, initial populations of species, etc., are listed in the parameters block and can be referenced in any subsequent block. All initially-populated species and their population levels are specified in the seed species block. These act as the starting point for network generation in BioNetGen and define the initial system state for any subsequent simulations. Arbitrary functions of observables can be defined in the functions block and used in place of rate constants in the reaction rules block. Two types of functions can be defined: global and local [17,18]. Local functions take as an argument a reference to a complex or molecule and evaluate observables over that complex/molecule only. In contrast, global functions evaluate observables over the entire domain.
Any actions to be performed on the model must be listed after the model specification (i.e., the end model directive). Common actions include generating a network by iterative application of the rule set to the set of initial seed species, simulating a network (either deterministically or stochastically), performing a "network-free" simulation, changing or resetting the values of species populations or rate parameters, reading in a previously-generated network, and exporting the model and/or network in various file formats (e.g., SBML [19], M-file [20]). If multiple actions are listed in sequence then each is performed using the final system state from the previous action as the initial state for the current action (unless the system is explicitly reset to the initial conditions). For example, one might run an initial equilibration simulation in the absence of ligand, then add a dose of ligand and run a second simulation to investigate the dose-response behavior of the system. We refer the reader to Refs. [6][7][8] for further details regarding model specification and action syntax in BioNetGen.

S3.3 Generating and simulating an HPP model with BioNetGen/NFsim
From a BioNetGen user's perspective, once a set of species has been selected for lumping, converting a standard BNGL model into an HPP model requires only two additional steps. First, the population-mapping rules (see "Population species and population-mapping rules" in the main text) must be specified within a population maps block, either within the same BNGL file (for convenience) or a separate configuration file (to preserve the integrity of the original model). Note that this is the only place where the population species should be specified, they should not be included in the molecule types or seed species blocks (Sec. S3.2). The lumping rate constant should be set to a large value (NFsim does not currently support infinite rates) and can be specified either as a number or as a reference to a parameter in the parameters block. See Text S3, S6, S9, S12, and S16 for examples of HPP configuration files.
Next, the generate hybrid model() action must be invoked after the model specification. Arguments are passed to generate hybrid model() in a comma-separated list enclosed by curly braces and each argument/value pair has the form argument=>value. Allowed arguments are: • overwrite=>0/1: Disable/enable automatic overwriting of any existing hybrid model (default: 0).
• actions=>["action1","action2",. . .]: Define a list of actions (as strings) to be included at the end of the HPP model file. In generating the HPP model, BioNetGen automatically appends all of the population species defined in the population maps block to the original molecule types block, each followed by a population keyword to distinguish it as a population type. The seed species block is modified by replacing any structured species selected for lumping with their population species counterparts. The model observables are also modified to account for the introduction of the population species, which is straightforward. Observables are collections of one or more patterns [6,7]. Therefore, each structured species selected for lumping is matched against each pattern in each observable and if a match is found then the associated population species is added to the observable list. Finally, PNE is performed to obtain the expanded rule set, to which the population-mapping rules are appended in the reaction rules block. A before-and-after illustration of this process can be seen by comparing the original (Text S2) and partially-expanded (Text S3) versions of the simple receptor activation model considered in "Partial network expansion" and Figs. 2 and 3 of the main text.
If execute=>1 is passed to generate hybrid model(), BioNetGen will additionally perform any actions on the HPP model that are listed in the actions argument. For example, a NFsim simulation can be run through BioNetGen by calling the simulate nf() action [7,8], e.g., This allows NFsim simulations to be run in standalone mode (see Refs. [17,18]). Note that NFsim version 1.11 or later is required to run HPP simulations.

S4 BioNetGen graph formalism and the partial network expansion algorithm
This section presents the graph-theoretic formalism of rule-based modeling in BNGL. This material is adapted from a previous graph-based description of BNGL by Blinov et al. [15] and also draws inspiration from the κ-calculus graph formalism of Winskel and co-workers [21]. The formalism of Blinov et al. is oriented towards a population-based approach. Thus, the terminology of the present work diverges to some degree to support a particle-based perspective. We begin in Sec. S4.1 by summarizing the graph-theoretic formalism of BNGL. We then build upon this material in Sec. S4.2 to provide a formal basis for the partial network expansion (PNE) algorithm, complete with pseudocode.

S4.1.1 Pattern graphs
Pattern graphs are objects that describe substructures, or "motifs," within molecular complexes. Similar to the concept of a functional group in organic chemistry, a pattern graph specifies a molecular motif, or moiety, that participates in a characteristic class of reactions. Any instance of a pattern graph within a complex will participate in its characteristic reaction without regard to the composition and structure of the surrounding complex.
Definition 1 (Pattern Graph). Given a countable index set I, a set of type labels Ω T , and a set of state labels Ω S , a pattern graph is defined by the tuple where M is the set of molecules, C is the set of components, B is the set of component bonds (edges), C ± = C ∪ {+, −} is the set of components augmented with the non-specific bond wildcard '+' and the null bond '−', type is a function that assigns a type label to each molecule and component, and state is a function that assigns a state label to each component ('∅' is the null state and '?' is the unspecified state wildcard ). Furthermore, P satisfies the following: (i) at most one bond per component; (ii) bonds are undirected; (iii) no self bonds. The set of all pattern graphs is denoted by P.
Each component in C is contained by a unique molecule in M . If c = (m, n) ∈ C, then component c is contained by molecule m. Components can be thought of as the nodes of the pattern graph, bonds as edges between nodes, and molecules as groups of components.

Example: Representation of a BNGL pattern as a pattern graph
Consider the BNGL pattern L(r!1).R(l!1,a∼P).
Let us represent this pattern formally as a pattern graph. We begin by assigning a unique index to each molecule (from left to right) and a unique ordered pair of indices to each component (also from left to right), where the first element of the pair is the index of the molecule containing the component: Note that the bond from component r to component l appears in B twice, once for each ordering of the components. This is simply a convention for indicating that the bond is undirected. It is also clear that P is a connected pattern graph because molecules L and R are connected through this bond.

S4.1.2 Complex graphs and ensembles of complexes
A complex graph, or simply a complex , is a pattern graph that defines an assembly of connected molecules with a precise configuration of bonds and states. While a general pattern graph describes parts of a molecular complex, a complex graph describes a complete instance of a molecular complex. In other words, there is no ambiguity, such as unspecified state ('?') or non-specific bond ('+') wildcards.

Definition 3 (Complex Graph).
A complex graph is a connected pattern graph where: (i) every component state is defined (no state wildcards); (ii) every component has a null bond or a proper bond (no bond wildcards).
The state of a model system is described by an ensemble of complexes.

Definition 4 (Ensemble of Complexes). An ensemble of complexes is a finite set of disjoint complexes.
Intuitively, an ensemble of complexes is a "particle-based" representation of a system, where each particle corresponds to a distinct molecule in one of the complexes.

S4.1.3 Pattern graph isomorphism and species
It is useful to define the notion of equivalence of pattern graphs. We will say that two pattern graphs are equivalent, or isomorphic, if they have the same composition and structure.
Definition 5 (Pattern Graph Isomorphism). Pattern graphs x and y are isomorphic, written x ∼ = y, if there exists a one-to-one and onto map φ of molecules and components in x to those in y that preserves the following properties: (i) graph structure; (ii) component containership; (iii) type; (iv) states. A mapping with these properties is called an isomorphism.
Since isomorphism is an equivalence relation on the set of complexes, we can partition any ensemble into classes of isomorphic complexes, which we call species. This partition is useful because all complexes within a class have the same composition and structure and, therefore, each member has the same reaction motifs and propensities to participate in reaction rules.
Definition 6 (Species). Given a complex graph y, the species [y] is the set of all complexes that are isomorphic to y. The set of all species is denoted by S.
Though [y] is properly a set, it can be represented by any complex y ∈ [y].
If X is an ensemble of complexes, then a similar but distinct concept is that of the species [y] with respect to X, written [y|X]. This is the subset of complexes in X that are isomorphic to y, i.e., {x ∈ X | x ∼ = y}. The population of a species with respect to an ensemble X is the number of complexes in [y|X], i.e., ρ([y|X]) = |{x ∈ X | x ∼ = y}|. Thus, a species with respect to an ensemble can be represented by a complex and a population attribute: [y|X] ≡ (y, ρ([y|X])). The set of all species with a population attribute is denoted S † = S × N ≥0 . Under this notation, an ensemble of complexes X can be represented by a set of species with population counters, where X/ ∼ = is the quotient set of X by ∼ =.
We may choose to dispense with the reference to X entirely and directly represent a system as a species ensemble.
Definition 7 (Species Ensemble). A species ensemble S is a set of species with population counters. More precisely, If (s, N s ) ∈ S, then we will abuse notation slightly and write s ∈ S and define ρ(s) = N s . In contrast to an ensemble of complexes, a species ensemble is a "population-based" representation of a system since complexes belonging to the same species are lumped together into a single object.

S4.1.4 Embeddings of pattern graphs
An embedding, or "pattern match," is a map from a pattern to an instance of that pattern within a complex or, more generally, another pattern.
Definition 8 (Embedding). Given patterns p and q, an embedding of p in q is a one-to-one mapping φ from the molecules and components of p into those of q that preserves the following properties: (i) graph structure compatibility; (ii) component containership; (iii) type; (iv) state compatibility. The nonspecific bond '+' is compatible with any bond except the null bond '-'. The unspecified state wildcard '?' is compatible with any state.
An embedding of pattern p in pattern q is denoted by φ p,q . If X is an ensemble of complexes, then emb(p, X) is the set of all embeddings of pattern p into the complexes of X. Similarly, the set of embeddings from pattern p into any other pattern graph is denoted by emb(p, P). Embeddings of a pattern graph into a species, e.g., φ ∈ emb(p, s), where s ∈ S, should be interpreted as embeddings into the pattern graph representing the species.
It is often useful to refer to the pattern graph or species that contains the specific instance of an embedding. We call this the target of the embedding.
Definition 9 (Target). Given an embedding φ, the target of φ, written targ(φ), is the pattern graph or species that contains the image of φ.

S4.1.5 Reaction rules
A reaction rule represents a class of potential transformations on an ensemble of complexes.

Definition 10 (Reaction Rule). A reaction rule is defined by the tuple
where V is the number of reactant patterns, W is the number of product patterns, (R v ) are the reactant pattern graphs, (P w ) are the product pattern graphs, ψ is the component correspondence map, and k is the propensity, 2 which may be given by a non-negative real number or function.
The reactant pattern graphs describe motifs in complexes where a reaction event may occur. The product patterns describe the structure at the site of the motif subsequent to transformation under the action of the rule. The component correspondence map is a one-to-one, partial mapping of molecules and components in (R v ) to those of (P w ) that satisfies several properties (see below). The propensity k is defined such that k·dt is the probability for each rule instance (described next) to undergo transformation in the next infinitesimal time interval dt [22,23]. In addition to a simple non-negative constant, the propensity may be a function of the ensemble of complexes (a "global function"), the target of a reactant pattern embedding (a "local function"), or both (a "composite function") [24]. The component correspondence map, ψ, defines how molecules and components in the reactant patterns correspond to those in the product patterns. The map is partial because a subset of molecules and components may be deleted or synthesized by the action of the rule. If a reactant molecule is not in the domain of ψ, dom(ψ), then the molecule and its components are deleted. Similarly, if a product molecule is not in the image of ψ, im(ψ), then the molecule and its components are synthesized. The differences in the bond and state configurations of corresponding components imply a set of pattern graph transformations. For example, if two components in the reactant patterns do not share a bond edge but the corresponding components in the product patterns do, then a "bond addition" transformation is implied by the rule.
Definition 11 (Component Correspondence Map). Given reactant patterns (R v ) V v=1 and product patterns (P w ) W w=1 , a component correspondence map ψ is a one-to-one, partial mapping from the molecules and components of (R v ) to those of (P w ), that satisfies the following properties: (i) molecules map to molecules and components map to components; (ii) types are preserved; (iii) molecule substructures are preserved; (iv) no new wildcards appear in the product graphs.
It is useful to define the notion of the "reaction center" of a rule, which is the subset of molecules and components in the reactant patterns that are modified by the action of the rule. Conversely, molecules and components in the reactant patterns of a rule that are not within the reaction center are called the "reaction context." In BNGL, the reaction context represents additional conditions necessary for a transformation encoded in a rule to take place. By convention, the reaction context does not contribute to the propensity of a rule, i.e., multiple matches to the context does not increase the value of the propensity (multiple matches to the center does). As such, it is necessary to equate embeddings where the images of the reaction center are the same even if the context differs.
Definition 13 (Embedding Equivalence). Given a rule r, with reaction center RC(r), the embeddings φ, φ ∈ emb(R v , P) are equivalent with respect to RC(r) if φ| RC(r) = φ | RC(r) , where φ| RC(r) is the restriction of the embedding map φ to the molecules and components in RC(r).
The equivalence class of embedding φ with respect to RC(r) is written [φ]. When the context is clear, we will usually omit reference to RC(r). Furthermore, since the embedding class is represented by any of its members, we will usually omit the square braces. Note that all embeddings in an equivalence class have the same target. The set of all equivalence classes of embeddings of pattern R is denoted by emb(R, P)/ RC(r).
In order to apply the transformation(s) encoded by a reaction rule, it is necessary to have an embedding of each reactant pattern into a target. A rule instance is a collection of embedding classes, one for each reactant pattern, whose targets can be transformed by the action of a rule.

Definition 14 (Rule Instance). Given reactant patterns
The targets of a rule instance are denoted by the tuple targ( Note that the targets are not necessarily distinct.
The action of a rule is defined by an operator that maps rule instances to products. The products are constructed by transforming the embedding targets according to the graph operations implied by the differences between the reactant patterns and the product patterns via the component correspondence map.
Definition 15 (Action of a Rule on a Rule Instance). Given a rule r with reactant patterns (R v ) V v=1 , the action of r is given by an operator that maps rule instances to tuples of products, where (P) <∞ is the set of finite sequences of pattern graphs (although sometimes the pattern graph is the representative of a species).
The details of this operator are beyond the scope of the present work (see Sec. 3.1 of Blinov et al. [15] for a discussion within the context of network generation). It is sufficient for our purposes to simply state that we have a unique operator that maps rule instances to products.
It may seem awkward that the action of a rule maps embeddings to products rather than targets to products. However, this distinction is important because the same targets may be transformed into different products depending on the specific embedding into those targets. Thus, the action of a rule depends on both the targets and the embeddings, but since the targets are implicit in the choice of embeddings it is unnecessary to explicitly mention them in the definition. Also note that the number of products produced by the action of a rule is not necessarily equal to the number of product patterns in the rule. This is generally rectified by a post-action rejection step. For example, during network generation BioNetGen rejects all reactions generated from a rule that violate the product-side molecularity of that rule [6] (see below for further discussion).
If the targets of a rule instance belong to an ensemble of complexes, then we can naturally define the action of the rule on the ensemble. The instance targets are removed from the ensemble and replaced by the products generated by the action of the rule on the instance.
Definition 16 (Action of a Rule on an Ensemble of Complexes). Given an ensemble of complexes X, a rule r, and a rule instance where r[(φ v )] is the action of r on (φ v ).
The action of a rule on a species ensemble S is defined similarly except that targets and products belonging to the species in S are handled by modification of the population counter rather than set deletion and addition.
Definition 17 (Action of a Rule on a Species Ensemble). Given a species ensemble S, a rule r, and a rule instance where (p w ) are the products of the action r[(φ v )].
Given an ensemble of complexes X, the set of all rule instances for rule r in X is defined to be This set, along with the rule action operator r[ ], encapsulates every possible transformation on an ensemble of complexes by a reaction rule, and forms the basis for the NF simulation approach [3,17,24]. The set of all rule instances for a species ensemble S is defined analogously.
A few notes about rule instance propensities. The propensity of a rule instance is the probability of the rule acting on the instance in the infinitesimal time interval dt. There are several factors that determine this probability: (i) k r , the propensity of the rule; (ii) the symmetry factor of the rule; (iii) the targets of the rule instance; (iv) the products of the action. The symmetry factor of a rule is a positive constant associated with the intrinsic symmetry of the reactant and product patterns of the rule. Its computation is the beyond the scope of this document (see Blinov et al. [15] for a discussion). The targets of the rule instance influence the propensity in two ways. First, the targets of the embeddings must be distinct complexes, i.e., the set of targets must have the same molecularity as the rule. If the targets are not distinct then the propensity is zero. Second, if the rule propensity is a local function (see Sneddon et al. [17]), then the local structure of the target can influence the propensity. The products can also influence the propensity, but only as a post-action rejection. While it can be shown that the products, in an aggregate sense, always contain embeddings of the product patterns, there is no guarantee that the products will have the same molecularity as the product patterns. After the products are constructed by the action of the rule, the molecularity of the products can be compared to that of the rule. If the molecularity is different, then the products are rejected and the targets restored. Since compliance with product molecularity is handled as a rejection step, the propensity of a rule instance is actually an upper bound on the true propensity.
Definition 18 (Propensity of a Rule Instance for an Ensemble of Complexes). Given a rule r with propensity k r and symmetry constant s r , an ensemble of complexes X, and a rule instance (φ v ), the propensity of the rule instance is

Similarly, for a species ensemble,
Definition 19 (Propensity of a Rule Instance for a Species Ensemble). Given a rule r with propensity k r and symmetry constant s r , a species ensemble S, and a rule instance (φ v ), the propensity of the rule instance is Note that the propensity of a rule instance for an ensemble of complexes is just a special case of this definition [i.e., ρ(targ(φ v )) = 1 if targ(φ v ) is a complex x ∈ X].

S4.1.6 BNGL models
An observable defines a measurable output of a model system. Observables are composed of a set of pattern graphs, each corresponding to a molecular motif that is detectable by an experimental assay [6].
Definition 20 (Observable). An observable is defined by the tuple where name is a label for the observable, type is a property that determines how the observable is calculated, and (P v ) is a tuple of pattern graphs.
The population of an observable is a measure of the number of embeddings of the pattern graphs within the system. The type of the observable, "Molecule" or "Species", determines the method of computing the population (see Faeder et al. [6]). For an ensemble of complexes X, the population of an observable is where H(z) = 1 if z ≥ 1 and 0 otherwise. Similarly, for a species ensemble S, A BNGL model completely specifies a dynamic system, including the initial system configuration, the reaction rules that govern the dynamics, and a set of observables that define the outputs of the system. where seed species is a subset of S † that specifies the initial state of the model system, observables is a set of observables, and reaction rules is a set of reaction rules that govern the dynamics of the system.
In practice, a BNGL model may also include parameters, molecule types (component-level type graphs), and function definitions [6], but we will omit these here for simplicity. Note that while seed species are always specified by a species ensemble in a model, the representation within a simulator may be as an ensemble of complexes (particle representation), a species ensemble (population representation), or a hybrid ensemble (see below).

S4.2.1 Hybrid ensembles
HPP simulation requires the notion of a hybrid ensemble. A hybrid ensemble is a particle/population representation of a system, since part of the system is represented as individual particles and the remainder as lumped populations.

Definition 22 (Hybrid Ensemble).
A hybrid ensemble is the pair (X, S), where X is an ensemble of complexes and S is a species ensemble.
For convenience, if (X, S) is a hybrid ensemble and x ∈ X, then we define ρ(x) = 1. The action of a rule on a hybrid ensemble is defined by combining Defs. 16 and 17.
Definition 23 (Action of a Rule on a Hybrid Ensemble). Given a hybrid ensemble (X, S), a rule r, and a rule instance (φ v ) ∈ v emb(R v , X ∪ S)/ RC(r), the action of r on (X, S) through (φ v ) is where (p w ) are the products of the action r[(φ v )].
The set of all rule instances for a hybrid ensemble (X, S) is defined analogously to that for an ensemble of complexes, The propensity of a rule instance for a hybrid ensemble is the same as defined in Def. 19. Finally, the population of an observable for a hybrid ensemble (X, S) is just the sum of the observable counts for the complexes and the species, i.e.,

S4.2.2 Child rules and rule restriction
A reaction rule r is said to be a child of reaction rule r if the rule instances of r are a subset of those of r and the action of r is equivalent to r on that subset. In formal terms, there must be a rule instance and a tuple of embeddings (φ w ) ∈ w emb(P w , P w ). Furthermore, it must follow that the component correspondence maps commute, i.e. φ • ψ(x) = ψ • φ(x) for all x ∈ dom(ψ), where φ and φ are the unions of the respective embeddings. A child rule can be constructed from the pairing of a rule r with a rule instance (φ v ). The child rule has reactant patterns given by the targets of (φ v ) and product patterns constructed by the action of the rule on the instance.
Definition 24 (Restriction of a Rule by a Rule Instance). The child rule induced by the restriction of rule r by a rule instance (φ v ) is defined to be where r[(φ v )] are the products of the action of r on (φ v ) and ψ is obtained from a single pushout of φ and ψ (see Refs. [15,25]). In general, the targets may be species or pattern graphs.
Given a parent rule p, child rules r and r are equivalent if the reactant patterns are isomorphic, the product patterns are isomorphic, and the component correspondence maps commute with the isomorphisms.
Definition 25 (Child Rule Equivalence). Child rules r and r are equivalent, written r ∼ = r , if the reactant patterns are isomorphic via mappings (φ v ), the product patterns are isomorphic via mappings (φ w ), and the following hold: where φ and φ are the unions of the embedding maps.
The class of all child rules equivalent to r is denoted by [r]. The equivalence relation partitions the set of child rules into classes. If R is a set of child rules, then the set of all equivalence classes in R is denoted R/ ∼ =. Child rules in the same class can be treated as a single entity with a multiplicity factor that ensures that the propensity of the class is equal to the sum of the individual propensities. A proof is beyond the scope of this document.

S4.2.3 Population-mapping rules
A species is "unstructured" if it is represented by a complex consisting of a single molecule that contains no components. Given a set of structured species S and a set of unstructured species U , where the types in U do not overlap with those in S, an "unstructured representation" of the species in S is a one-to-one mapping λ : S → U . In the population-adapted NF simulator, complexes belonging to an unstructured species can be treated as a population (a species ensemble) rather than as individual particles (an ensemble of complexes). Thus, structured species can be treated as populations if an unstructured representation is provided and the reaction rules are partially expanded so that the dynamics of the unstructured species are equivalent to those of the structured counterpart. This process is accomplished by the PNE algorithm.
Prior to performing PNE, the modeler defines a set of population-mapping rules. Each mapping in an unstructured representation implies a population-mapping rule, and conversely a set of population-mapping rules implies an unstructured representation. Formally, a population-mapping rule is a reaction rule that matches complexes belonging to a single species and transforms them into the unstructured representation.

S4.2.4 The PNE algorithm
Network expansion is the process of enumerating all of the specific reactions (at the resolution of species) implied by a set of reaction rules. The rule set produced by the PNE algorithm is a partial network expansion in the sense that the reactions are enumerated only with respect to the population species. Interactions among particles and between particles and population species are encoded by a set of child rules derived from the original rule set. Thus, there are three types of reaction rules among the set of children: (i) rules that govern the interaction of particles; (ii) mixed particle-population rules that govern the interaction of particles and population species; (iii) population reactions that govern the interactions among population species. Rules of type (i) are just duplicates of the original reaction rules. Rules of type (iii) are proper reactions since each reactant pattern matches exactly one species. Rules of type (ii) and (iii) together encapsulate every possible interaction among the population species and between population species and reactant motifs within particles. In this sense, PNE is a hybrid of network-based and network-free methods: particles interact within an NF framework, populations interact within a network-based framework, and particle-population interactions are handled in a hybrid manner. Pseudocode for the PNE algorithm is presented as Algorithm 1. The method takes as input a BNGL model and a set of population-mapping rules and returns a model with a partially expanded rule set (an HPP model) that is dynamically equivalent to the original model for sufficiently large k lump (or for any k lump if the "exact" option is selected; see line 21 of Algorithm 1). When the HPP model is loaded into the population-adapted NF simulator, all of the unstructured species in U are represented as a species ensemble (i.e., as populations). The structured species, however, are represented as an ensemble of complexes X (i.e., as particles). Therefore, the system representation is a hybrid ensemble (X, U † ). The propensities for rule instances are computed according to Def. 19. However, rule instances in which two or more reactant patterns have the same target are given the same propensity as other instances and then rejected in a post-sampling step. Furthermore, at simulation runtime, populationmapping rules serve the practical purpose of matching unlumped particles belonging to a population species, deleting the particle, and incrementing the population counter of the corresponding unstructured representation. Runtime lumping is required since the PNE algorithm is not guaranteed to identify every possible way that particle-particle or particle-population events can yield a product belonging to the set of population species (e.g., through complex dissociation). However, PNE does enumerate every way that a population species can participate in a reaction as a reactant. if ∃s ∈ S such that s = R v then // equality is required! 28 substitute λ(s) for R v 29 foreach P w ∈ (P w ) do 30 if ∃s ∈ S such that s ∼ = P w then // isomorphism is sufficient here 31 substitute λ(s) for P w 32 add (R v ), (P w ), ψ , k to child rules // Compute equivalence classes of child rules (Def. 25) and add to hpp model 33 add (child rules/ ∼ =) to hpp model.reaction rules // Add population-mapping rules (Def. 26) 34 foreach s ∈ S do 35 add (s), (λ(s)), ∅, k lump to hpp model.reaction rules