VGsim: Scalable viral genealogy simulator for global pandemic

Accurate simulation of complex biological processes is an essential component of developing and validating new technologies and inference approaches. As an effort to help contain the COVID-19 pandemic, large numbers of SARS-CoV-2 genomes have been sequenced from most regions in the world. More than 5.5 million viral sequences are publicly available as of November 2021. Many studies estimate viral genealogies from these sequences, as these can provide valuable information about the spread of the pandemic across time and space. Additionally such data are a rich source of information about molecular evolutionary processes including natural selection, for example allowing the identification of new variants with transmissibility and immunity evasion advantages. To our knowledge, there is no framework that is both efficient and flexible enough to simulate the pandemic to approximate world-scale scenarios and generate viral genealogies of millions of samples. Here, we introduce a new fast simulator VGsim which addresses the problem of simulation genealogies under epidemiological models. The simulation process is split into two phases. During the forward run the algorithm generates a chain of population-level events reflecting the dynamics of the pandemic using an hierarchical version of the Gillespie algorithm. During the backward run a coalescent-like approach generates a tree genealogy of samples conditioning on the population-level events chain generated during the forward run. Our software can model complex population structure, epistasis and immunity escape.


Introduction
The unprecedented world-wide effort to produce and share viral genomic data for the ongoing SARS-CoV-2 pandemic allows us to trace the spread and the evolution of the virus in real time, and has made apparent the need for improved computational methods to study viral evolution [1]. These data yield important insights into the effects of population structure [2][3][4][5], public health measures [6,7], immunity escape [8,9], and complex fitness effects [10,11]. It is essential that we also have tools to accurately simulate viral evolutionary processes so that the research community can validate inference methods and develop novel insights into the effects of such complexities. However, there are no software packages capable of simulating the scale and apparent complexity of viral evolutionary dynamics during the SARS-CoV-2 pandemic.
Pandemic-scale datasets impose technical problems associated with the scalability and memory usage of computational methods. There is already substantial progress in building scalable simulators and data analysis methods for human genome data. The current state-ofthe-art human genome simulator msprime [12] is capable of simulating millions of sequences with length comparable with human chromosomes. Methods such as the Positional Burrows-Wheeler Transform (PBWT) [13], its ARG-based extension tree consistent PBWT [14], and tsinfer [15] can be used to efficiently process and store genomic sequences, but all of these approaches are designed for actively recombining organisms. Moreover, the primary population models underlying these methods are the Kingman coalescent [16], the Wright-Fisher model [17,18] and the Li-Stephens model [19]. We recently developed approaches for compressing and accessing viral genealogies that dramatically reduce space and memory requirements [20,21], but there are no viral genealogy simulation methods that can efficiently produce pandemic-scale datasets.
Coalescent models [22] are powerful tools for studying humans, many other eukaryotes, and pathogen populations (e.g. [23][24][25]). Structured coalescent models have been successfully applied for epidemioloigical inference, in particular to detect and understand barriers to transmission in human populations [26,27]. However, the assumptions of coalescent frameworks are often violated in other epidemiological settings. First, the effective population size is usually modeled either as piece-wise constant or as exponential growth. However, the coalescent with exponential growth and birth-death models do not result in equivalent genealogies [28]. Second, it's not simple to use coalescent models to describe the effects of selection. If we consider the pandemic on a longer time period, basic birth-death models (e.g. [29]) are not an appropriate choice, since the reproductive rate usually decreases with time as collective immunity builds up or as the susceptible population is exhausted. These limitations are often addressed in epidemiology using compartmental models, such as SI, SIS and SIR [30], or their stochastic realisations, which are also birth-death processes. Backward in time genealogical models conditioned on epidemiological process have also been used to simulate pathogen genealogies, for example in PhyDyn [31] and TiPS [32].
Simulating realistic selection in backward-time models is a well-known challenging problem. A common workaround is to assume a single deterministic frequency trajectory or to generate a stochastic frequency trajectory in forward time, and then to simulate the ancestry of the samples around the selected site in a coalescent framework (e.g., [33,34]). However, more complex models of selection, including e.g., gene-gene interactions, or epistasis, are often beyond the scope of such coalescent models. Nonetheless, epistasis is thought to be an important component of viral evolutionary processes [35,36], and incorporating the effects of such complex evolutionary dynamics is essential for accurate simulations of evolution.
We introduce a novel simulation method that can rapidly generate pandemic-scale viral genealogies. Our approach is a forward-backward algorithm where we generate a series of stochastic events forward in time, then traverse backwards through this event series to generate the realized viral genealogy for a sample taken from the full population. Our framework includes the accumulation of immunity within host populations and of viral mutations that affect the fitness of descendant lineages. Our method is extremely fast, and can produce a phylogeny with 50 million total samples in just 88.5 seconds. The genealogies output from our simulation are compatible with phastSim [37], making it possible to generate realistic genome data for the simulated samples. This framework empowers efficient and realistic simulation of pandemic-scale viral datasets.

Design and implementation
Our epidemiological model is a compartmental model [38] (see S1 File-section 1 for a brief introduction to compartmental models), and the realisations of the stochastic processes are drawn using the Gillespie algorithm [39]. The different compartments in our model are defined based on several interacting real-world complexities: (1) host population structure with corresponding population-specific viral frequencies and contact rates, (2) separate host infectious groups resulting from different viral haplotypes, and (3) different host susceptibility groups.
We break the simulation into two phases. In the first one (the forward pass), we generate a population-level epidemiological process which is represented as the series of events (Fig 1) resulting from the "reactions" (Table 1). These events then influence the properties of the viral genealogy, which is sampled in the second phase (the backward pass). The specific viral genealogy is sampled conditioned of the population-level epidemiological process using a coalescent framework. Table 2 lists all the features which determine the simulation. In the beginning, the user should specify the number U of mutable sites (see section Fitness landscape),the number T of susceptibility types (section Epidemiological model), and the number K of populations (section Population model).

Fitness landscape
Because this simulation framework focuses on generating the viral genealogy, and not genomes, we track only mutations at genome sites that have a large positive effect on viral fitness. That is, these mutations enhance the transmissibility of the virus or lead to immunity escape. We expect this will typically be a relatively small number of mutations relative to the size of the viral genome, simplifying the problem. To efficiently model neutral genetic variation we suggest using phastSim [37] on a tree generated by our algorithm; the output produced by our method can be directly imported into phastSim for downstream processing.
To define the intended model of selection on new mutations, the user specifies the number U of mutable sites and their specific fitness effects (i.e., their effect on the birth rate). The  Table 1. Black squares correspond to the consecutive steps, where the subfamilies are chosen with the weights given by their propensities. The propensities for each step are cached and updated only if they change due to an event. For migration propensities, the rejection approach is used instead (S1 File-section 3).   Table 2. List of features which determine the simulation scenario. All the rates are normalized by the number of individuals in a particular group (i.e. the number of individuals infected with a particular haplotype or individuals of a certain susceptibility type). The rates are measured in terms of events per time unit.

Model
Feature Description Value Substitution probabilities The probabilities of particular nucleotide substitution at haplotype h given that the mutation occurred at the site u.
The multiplier which allows to change the relative susceptibility to haplotype h of an individual with susceptibility type i.

Susceptibility transition rate
The rate at which susceptible individuals move from one susceptibility type to another without being infected. This allows to model the loss of immunity with time or vaccination efforts.

Population model Population size
Total number of individuals in population s. N s 2 ð0; 1Þ

Contact density
The multiplicative modifier of transmission rate corresponding to the relative number of contacts in population s. It is used to describe differences in social behaviour of the host population (e.g. population density, NPIs).

Non-pharmaceutical interventions (NPI)
Conditions to impose and lift NPI in population k (determined by the proportion of infectious individuals in the population) and the contact density during the NPI.
Sampling effort This modifier increases or decreases the sampling rate in population s.

Migration probability
The probability that an individual from population s is temporarily visiting population r.

PLOS COMPUTATIONAL BIOLOGY
VGsim: Scalable viral genealogy simulator for global pandemic 3U substitution probabilities p u h 1 , p u h 2 , p u h 3 , one for each site u and for each of the 3 possible new nucleotides at site u. Transmission, recovery, and sampling rates, as well as mutation rates, susceptibility, and triggered susceptibility types can be defined individually for every haplotype. Of particular interest, gene-gene, or epistatic, interactions can be flexibly modelled using this approach.
We refer to sequences carrying particular sets of variants as "haplotypes", because two identical sequences can appear as a results of different mutation events, so they might not belong to the same clade in the final tree.

Epidemiological model
To model the host immunity process, we use a generalised SI-model. The compartments within each population represent different types of susceptible individuals or infectious individuals.
Different susceptible compartments in the same host population are used to model different types of immunity. These can correspond for example to host individuals that have recovered from previous exposure to different viral haplotypes. Susceptible compartments can also be used to represent different vaccination statuses. For each susceptible compartment S i , and for each viral haplotype h we consider a susceptibility coefficient σ ih which multiplicatively changes the transmission (birth) rate of the corresponding haplotype. In particular, σ ih = 0 corresponds to absolute resistance, similar to the R-compartment in SIR-model, but specific to individuals who have immunity type i and are exposed to haplotype h. 0 < σ ih < 1 would correspond to partial immunity, while σ ih > 1 corresponds to increased susceptibility.
Different infectious compartments within a host population correspond to individuals infected by a haplotype and can potentially infect susceptible hosts. As we mentioned in the section Fitness landscape, the transmission rate λ h , recovery rate ρ h and mutation rates can be set independently for each haplotype h. After recovery, a host individual that was infected with haplotype h, and therefore was in compartment I s h for some population s, is moved to the corresponding susceptibility (immunity) compartment S s iðhÞ . Different haplotypes might however lead to the same types of immunity.
NB: The evolution of individual immunity is modeled as Markovian-it is determined only by the latest infection, and does not have memory about previous infections. Whether this assumption provides an accurate approximation of the immunity dynamics within the host population is an important consideration and may depend on the specific pathogen biology. Different haplotypes can lead to the same immunity. Some immunity types can be specific, e. g., to vaccination immunity without being associated with any haplotypes at all.
The rate of transmission of viral lineages within a population also depends on how frequently two host individuals come in contact with each other. To flexibly accommodate such differences, each population s is assigned a contact density δ s parameter. This parameter can be used to simulate differences in the local population density, social behaviours etc. In the abscence of migration, the rate for an individual from susceptibility class S s i (the susceptible compartment i within population s) to be infected with haplotype h by another individual within population s is where jS s i j is the number of individuals in S s i , I s h is the class of individuals infected with haplotype h in population s, λ h is the baseline transmission rate of individuals infected with haplotype h, and N s is the total population size of deme s. If there is migration, the equation should be modified by taking into account the probability that two interacting individuals are in the same population (not necessarily in their home population). We give more details on accomodating migration in the section Migration and S1 File-section 2.
Direct transitions between susceptible compartments are possible, for which users can specify a transition matrix for susceptible compartments. A transition between susceptible compartment can be used for example to model a vaccination event, or the loss of immunity with time.

Population model
Demes. The population model is based on an island (demic) model. Each population is described at each point in time by its total size N s , number jI s h j of infectious individuals with each viral haplotype h, number jS s i j of susceptible host individuals for each susceptibility type i, relative contact density δ s , and a population-specific non-pharmaceutical interventions strategy and effectiveness. All individuals within each deme have the same contact patterns.
Non-pharmaceutical interventions. Several governments have imposed non-pharmaceutical intervention (NPI) measures such as lockdowns, mandatory mask wearing in public or social disctancing, during the COVID-19 pandemic as an effort to control the spread of SARS-CoV-2. Understanding the effects of non-pharmaceutical interventions is a crucial concern for designing effective public health strategies. We implement these measures as follows. When the total number of simultaneously infectious individuals in population s surpasses a certain user-defined population-specific percentage (e.g. 1%) of the population size N s , the non-pharmaceutical interventions are imposed and the contact density δ s is changed to a new (typically lower) population-specific value. When the percentage of the infectious individuals drops below a user-specified value (e.g. 0.1%) the measures are lifted and the contact density in population s reverts to its initial value δ s . Migration. Migration is described by a matrix μ sr which defines the probabilities at which an individual from source population s can be found in target population r. In our model, new infections occur due to contact between infectious individuals from one population and susceptible individuals from the second population when they contact within the same population. This can be due to the travel of a susceptible individual to a source population, where it contracts an infection, after which it returns back to the home population (q = r in Eq 1); or, to the travel of an infectious individual into a target population where it transmits the infection to a susceptible individual (q = s in Eq 1); or, to the travel of both infectious and susceptible individuals to a third population. The derivation of each term is explained in S1 File-section 2 (see S1 File-equation S.1). This model corresponds to short-term travel such as tourist or business trips, where an individual returns soon back to the home population. The proposed process is different from the traditional migration modelling in population genetics, when an individual moves permanently to a new population.
More in detail, transmissions occur when a susceptible individual with immunity i from population s meets an infected individual with haplotype h from population r while both of them are being in population q. The rate of such a transmission is The total rate of transmission between compartments S r i and I s h is then X q M 0 ðr; i; s; h; qÞ ¼ jS where μ ss = 1 − ∑ q6 ¼s μ sq is the probability that an individual originally from population s is currently not in a different population. If r = s, this equation describes within-population transmission. Notice that the sum in the last equation does not depend on the actual counts of individuals in the compartments, and hence it can be precomputed in advance. r, s and q are not necessarily different, in particular for r = s we obtain the within-population transmission rate between S r i and I s h . Since it is computationally demanding to keep track of how each migration rate between each pair of compartments is affected by each simulated event, instead we keep track of cumulative upper bounds on such migration rates (see S1 File-section 3 for details). In the case a potential migration event is sampled according to these upper bounds, we then proceed to calculate the precise migration rates and only sample a specific migration event according to its own exact rate. This is efficient when cross-population transmissions (migrations) are rare compared to within-population transmissions. This algorithmic implementation might perform suboptimally if population structure is extremely weak.

Sampling
Sampling is modelled using a continuous sampling scheme. In this scheme every individual infected with haplotype h in population s is sampled at rate c s z h , the product of the haplotypespecific sampling rate z h and the population modifier c s . Sampled individuals then instantly recover and are moved to susceptible group S s iðhÞ , effectively increasing the recovery rate ρ h for I s h by c s z h . Alternatively, one can think about this sampling scheme as setting the recovery rate for I s h to ρ h + c s z h and sampling an individual in I s h upon its recovery with probability c s z h /(ρ h + c s z h ). More details can be found in S1 File-section 7.

Algorithm
The simulation process is split into two parts, forward and backward. In the forward run, a chain of events (including sampled cases) describing the dynamics of the epidemiological process at the population level is generated with the Gillespie algorithm [39]. In the backward run, our method simulates the genealogy of the samples in a coalescent-like manner while conditioning on the events generated during the forward run.
Forward run. The forward run generates a chain of events which reflects the dynamics of the pandemic. Our implementation of the Gillespie algorithm is based on three algorithmic ideas: logarithmic direct method [39] (the events, or "reactions", are organised in nested families, Fig 1), rejection-based approach [40] for migrations (see S1 File-section 3 for details), and organising propensity dependencies to avoid updating those propensities which are not affected by events [41]. Details are given in S1 File-section 4.
Backward run. The backward run randomly builds a genealogical tree of the samples while conditioning on this chain of events generated in the forward run.
All of the ancestral lineages of the samples generated in the forward run belong to one of the infectious compartment corresponding to a haplotype h in a specific population s. Lineages are exchangeable within each compartment. Conditional on any event generated in the forward run, it is straightforward to calculate the probability that the event affected zero, one or two sample ancestral lineages in the backward run (see S1 File-section 5 for details).
Implementation details. VGsim provides a convenient Python user interface. Performance-critical parts are implemented in C++ via Cython [42]. The dependencies are kept to a minimum: NumPy [43] and mc_lib -a small wrapper of the NumPy C API for generation of pseudorandom numbers in Cython [44].

Correctness of the software implementation
To assess the correctness of our implementation, we compared simulated epidemiological trajectories and distributions of coalescent times produced by VGsim with those produced by MASTER [45]. In all cases, the results are consistent, and described in detail in S1 Filesection 8.

Forward run performance
To test the scalability of the population model, we performed simulations with K = 2, 5, 10, 20, 50 and 100 total host populations with 2 � 10 9 /K individuals in each and generated 10 million events (see Fig 1) in each run (see Table 3). There are 16 haplotypes resulting from two segregating sites with mutation rates 0.01 in each of them (this is unrealistically high, but it ensures that all the haplotypes appear in the simulation), and three susceptibility group with the first group corresponding to the absence of immunity, the second group corresponding to partial immunity and the last one corresponding to resistance to all strains. The transmission rate is λ = 2.5 for all haplotypes except one, and λ = 4.0 for this last haplotype. The recovery rate is ρ = 0.9, the sampling rate is z = 0.1 (so, the effective reproductive number is 2.5 which approximately correspond to SARS-CoV-2 [46] if the time unit is interpreted as one week). All the migration probabilities were set to μ = M/(K − 1), where M is the cumulative migration rate from a population. The runtime of the forward algorithm does not depend only on the cumulative migration rate M, but also on the percentage of potential migrations rejected by the algorithm (see section Migration for details), which appears to grow with M. However, the effect on runtime is relatively modest (in contrast to the naive algorithm which is quadratic in the number of populations, see Table 1) indicating that this approach scales well to pandemicscale simulations.

Backward run and overall performance
Our implementation of the backward run algorithm relies on an efficient and compact tree representation, a Prüfer-like code [47]. Each node is associated with an index in an array, and the corresponding entry in the array is the index of the parent node. The time needed to Table 3. Run time to generate 10 million events. The second number is the percentage of discarded events (due to migration acceptance/rejection). There are 16 = 4 2 haplotypes and 3 susceptible compartments. The sampling rate is set to z = 0.1, recovery rate is ρ = 0.9, transmission rate is λ = 2.5. The tests were run on a server node with Intel Xeon Gold 6152 2.1-3.7 GHz processor and 1536GB of memory.

PLOS COMPUTATIONAL BIOLOGY
VGsim: Scalable viral genealogy simulator for global pandemic generate a tree mainly depends on two factors: the total number of events generated in the forward run, and the total number of samples in a tree. We report the execution time of the backward run in Table 4. The combined approach is sufficiently fast that it can be used to generate many replicate simulations as is often required to validate empirical methods and to train model parameters. Table 4 also shows the forward time, the total number of generated events and the total number of infected individuals over the simulation for various sampling rate (where the sampling rate z = 0.01 is 1 in 100 cases is sampled, z = 0.1 corresponds to 1 in 10 cases is sampled, and z = 1.0 means that every case is sampled), and various sample sizes. The simulation assumes the absence of immunity after infection (SIS-model), which allows to run the simulation sufficiently long to collect enough samples (instead, with an SIR-model susceptible individuals can be exhausted before the desired number of samples is simulated).
To showcase the limit of applicability of our simulator, we also show in Table 4 the computational demand for the simulation of an unrealistically large (for now) genealogy of 150 million samples (with 1 in 100 cases sampled), for which we almost reached the memory limit available on our supercomputer node (1536GB) [48]. The total number of infections in the population is more than 15 billion cases, with the total number of events being more than 30 billions. The forward run time was approximately 9.5 hours and the backward run time was 13.5 minutes. Table 4. Run time in seconds to generate a random genealogy for a sample of a certain size for different sampling rates. The execution time is shown split into the time demand for the forward run and the one for the backward run only. We simulated 16 = 4 2 haplotypes and no host immunity. The recovery rate is ρ = 1.0 − z, with z the sampling rate, while the transmission rate is λ = 2.5 for all 16 haplotypes. The tests were run on a server node with an Intel Xeon Gold 6152 2.1-3.7 GHz processor and 1536GB of memory.  [32]) implement Gillespie algorithm for compartmental models, but they currently lack a simple user interface, instead requiring users to specify the full set of reaction equations, and they might be not specifically optimised for epidemiological purposes. On the other hand, both MASTER and TiPS implement approximate methods (tau-leaping and hybrid approaches), which decrease the time of forward simulation by orders of magnitude and hence might outperform our simulator depending on simulation scenario. VGsim is optimised to scale for large epidemics and genealogies, though approximate approaches are not available in the current implementation. It also has a simple and flexible user interface which helps merge together several complexities (epidemiology, evolution, population structure and cross-immunity). The detailed discussion of different simulating frameworks and detailed comparisons with them can be found in S1 File-section 9.

Simulating realistic nucleotide mutations
Our simulation framework generates a phylogenetic tree, and if the user specifies a scenario with strongly selected mutations, these are included in the output; we, however, do not include a method for simulating many neutral variants. To facilitate studies that require full viral genome sequences we have made the output of our approach compatible with that of phast-Sim [37]. Briefly, a user can easily load the output of our software into phastSim, and phastSim will generate neutral mutations, while leaving previously determined selected mutations unaffected.

Availability and future directions
VGsim is freely available from https://github.com/Genomics-HSE/VGsim under GPL-3.0 License. It is tested for Python 3.7 and later under Ubuntu and macOS. The documentation and tutorials are published at https://vg-sim.readthedocs.io/. The future development of VGsim will include the following updates. We will consider improving performance by adding the τ-leaping algorithm and optimizing memory usage to handle larger numbers of genetic sites. We will also extend the available models by adding super-spreading events, life-cycle compartments, and new sampling schemes. We will also add recombination events, though they seem to be relatively rare [51] and so far are not a major driver of SARS-CoV-2 genetic diversity and evolution.
VGsim is particularly useful for simulating large datasets, in particular, in those cases when agent-based simulators become inefficient (see S1 File-section 9 for more detailed discussion). It is primarily optimised for the studies of world-wide pandemic scenarios, and it is motivated by the features of the ongoing SARS-CoV-2 pandemic. We plan for the future to add more features which would generalise its applicability to different pathogens (e.g. with complex life-cycle). Further possible optimisations of our algorithm will also be investigated.
Our implementation allows simulations of scenarios with a few loci with strong phenotypic effects. However, we cannot simulate the effect of many loci with mild fitness effects. While mild and widespread fitness effects can be simulated by phastSim, they are simulated in a typical phylogenetic way (using a substitution codon matrix with specifiable nonsynonymous/synonymous ratios) and so their impact on the tree shape and epidemiological dynamics are neglected.

Conclusion
We developed a fast simulator VGsim which can be used to produce genealogies of millions of samples from world-scale pandemic scenarios. Our method models many major aspects of epidemiological dynamics: viral molecular evolution, host population structure, host immunity, vaccinations and NPIs. We expect that VGsim will be a useful tool in method validation and in simulation-based statistical inference.
The performance of our simulator should meet the performance requirements of most studies. The flexible Python API, combination of epidemiological (including cross-immunity), population and evolutionary models make it a timely tool for the modern and future research.
Supporting information S1 File. Supporting information. Supporting information file with the introduction to epidemiological compartmental models, details of the migration model, presentation of the simulation procedure and algorithms, example of the software Python API, details on the implemented sampling strategy, validation of the software implementation and detailed discussion of other epidemiological simulators. (PDF)