Skip to main content
Advertisement
  • Loading metrics

VGsim: Scalable viral genealogy simulator for global pandemic

Abstract

Accurate simulation of complex biological processes is an essential component of developing and validating new technologies and inference approaches. As an effort to help contain the COVID-19 pandemic, large numbers of SARS-CoV-2 genomes have been sequenced from most regions in the world. More than 5.5 million viral sequences are publicly available as of November 2021. Many studies estimate viral genealogies from these sequences, as these can provide valuable information about the spread of the pandemic across time and space. Additionally such data are a rich source of information about molecular evolutionary processes including natural selection, for example allowing the identification of new variants with transmissibility and immunity evasion advantages. To our knowledge, there is no framework that is both efficient and flexible enough to simulate the pandemic to approximate world-scale scenarios and generate viral genealogies of millions of samples. Here, we introduce a new fast simulator VGsim which addresses the problem of simulation genealogies under epidemiological models. The simulation process is split into two phases. During the forward run the algorithm generates a chain of population-level events reflecting the dynamics of the pandemic using an hierarchical version of the Gillespie algorithm. During the backward run a coalescent-like approach generates a tree genealogy of samples conditioning on the population-level events chain generated during the forward run. Our software can model complex population structure, epistasis and immunity escape.

Author summary

We develop a fast and flexible simulation software package VGsim for modeling epidemiological processes and generating genealogies of large pathogen samples. The software takes into account host population structure, pathogen evolution, host immunity and some other epidemiological aspects. The computational efficiency of the package allows to simulate genealogies of tens of millions of samples, which is important, e.g., for SARS-CoV-2 genome studies.

This is a PLOS Computational Biology Software paper.

Introduction

The unprecedented world-wide effort to produce and share viral genomic data for the ongoing SARS-CoV-2 pandemic allows us to trace the spread and the evolution of the virus in real time, and has made apparent the need for improved computational methods to study viral evolution [1]. These data yield important insights into the effects of population structure [25], public health measures [6, 7], immunity escape [8, 9], and complex fitness effects [10, 11]. It is essential that we also have tools to accurately simulate viral evolutionary processes so that the research community can validate inference methods and develop novel insights into the effects of such complexities. However, there are no software packages capable of simulating the scale and apparent complexity of viral evolutionary dynamics during the SARS-CoV-2 pandemic.

Pandemic-scale datasets impose technical problems associated with the scalability and memory usage of computational methods. There is already substantial progress in building scalable simulators and data analysis methods for human genome data. The current state-of-the-art human genome simulator msprime [12] is capable of simulating millions of sequences with length comparable with human chromosomes. Methods such as the Positional Burrows-Wheeler Transform (PBWT) [13], its ARG-based extension tree consistent PBWT [14], and tsinfer [15] can be used to efficiently process and store genomic sequences, but all of these approaches are designed for actively recombining organisms. Moreover, the primary population models underlying these methods are the Kingman coalescent [16], the Wright-Fisher model [17, 18] and the Li-Stephens model [19]. We recently developed approaches for compressing and accessing viral genealogies that dramatically reduce space and memory requirements [20, 21], but there are no viral genealogy simulation methods that can efficiently produce pandemic-scale datasets.

Coalescent models [22] are powerful tools for studying humans, many other eukaryotes, and pathogen populations (e.g. [2325]). Structured coalescent models have been successfully applied for epidemioloigical inference, in particular to detect and understand barriers to transmission in human populations [26, 27]. However, the assumptions of coalescent frameworks are often violated in other epidemiological settings. First, the effective population size is usually modeled either as piece-wise constant or as exponential growth. However, the coalescent with exponential growth and birth-death models do not result in equivalent genealogies [28]. Second, it’s not simple to use coalescent models to describe the effects of selection. If we consider the pandemic on a longer time period, basic birth-death models (e.g. [29]) are not an appropriate choice, since the reproductive rate usually decreases with time as collective immunity builds up or as the susceptible population is exhausted. These limitations are often addressed in epidemiology using compartmental models, such as SI, SIS and SIR [30], or their stochastic realisations, which are also birth-death processes. Backward in time genealogical models conditioned on epidemiological process have also been used to simulate pathogen genealogies, for example in PhyDyn [31] and TiPS [32].

Simulating realistic selection in backward-time models is a well-known challenging problem. A common workaround is to assume a single deterministic frequency trajectory or to generate a stochastic frequency trajectory in forward time, and then to simulate the ancestry of the samples around the selected site in a coalescent framework (e.g., [33, 34]). However, more complex models of selection, including e.g., gene-gene interactions, or epistasis, are often beyond the scope of such coalescent models. Nonetheless, epistasis is thought to be an important component of viral evolutionary processes [35, 36], and incorporating the effects of such complex evolutionary dynamics is essential for accurate simulations of evolution.

We introduce a novel simulation method that can rapidly generate pandemic-scale viral genealogies. Our approach is a forward-backward algorithm where we generate a series of stochastic events forward in time, then traverse backwards through this event series to generate the realized viral genealogy for a sample taken from the full population. Our framework includes the accumulation of immunity within host populations and of viral mutations that affect the fitness of descendant lineages. Our method is extremely fast, and can produce a phylogeny with 50 million total samples in just 88.5 seconds. The genealogies output from our simulation are compatible with phastSim [37], making it possible to generate realistic genome data for the simulated samples. This framework empowers efficient and realistic simulation of pandemic-scale viral datasets.

Design and implementation

Our epidemiological model is a compartmental model [38] (see S1 File—section 1 for a brief introduction to compartmental models), and the realisations of the stochastic processes are drawn using the Gillespie algorithm [39]. The different compartments in our model are defined based on several interacting real-world complexities: (1) host population structure with corresponding population-specific viral frequencies and contact rates, (2) separate host infectious groups resulting from different viral haplotypes, and (3) different host susceptibility groups.

We break the simulation into two phases. In the first one (the forward pass), we generate a population-level epidemiological process which is represented as the series of events (Fig 1) resulting from the “reactions” (Table 1). These events then influence the properties of the viral genealogy, which is sampled in the second phase (the backward pass). The specific viral genealogy is sampled conditioned of the population-level epidemiological process using a coalescent framework.

thumbnail
Fig 1. The scheme of the nested family Gillespie algorithm used to generate an event in the forward run.

The corresponding reactions are listed in Table 1. Black squares correspond to the consecutive steps, where the subfamilies are chosen with the weights given by their propensities. The propensities for each step are cached and updated only if they change due to an event. For migration propensities, the rejection approach is used instead (S1 File—section 3).

https://doi.org/10.1371/journal.pcbi.1010409.g001

thumbnail
Table 1. The list of reactions and corresponding epidemiological events simulated by the Gillespie algorithm in our model, and the number of reactions in each category in function of the number of mutable sites U, number of susceptible individuals T, and the number of populations K.

https://doi.org/10.1371/journal.pcbi.1010409.t001

Table 2 lists all the features which determine the simulation. In the beginning, the user should specify the number U of mutable sites (see section Fitness landscape),the number T of susceptibility types (section Epidemiological model), and the number K of populations (section Population model).

thumbnail
Table 2. List of features which determine the simulation scenario.

All the rates are normalized by the number of individuals in a particular group (i.e. the number of individuals infected with a particular haplotype or individuals of a certain susceptibility type). The rates are measured in terms of events per time unit.

https://doi.org/10.1371/journal.pcbi.1010409.t002

Fitness landscape

Because this simulation framework focuses on generating the viral genealogy, and not genomes, we track only mutations at genome sites that have a large positive effect on viral fitness. That is, these mutations enhance the transmissibility of the virus or lead to immunity escape. We expect this will typically be a relatively small number of mutations relative to the size of the viral genome, simplifying the problem. To efficiently model neutral genetic variation we suggest using phastSim [37] on a tree generated by our algorithm; the output produced by our method can be directly imported into phastSim for downstream processing.

To define the intended model of selection on new mutations, the user specifies the number U of mutable sites and their specific fitness effects (i.e., their effect on the birth rate). The mutations are modelled as single nucleotide substitutions, so each site has four possible variants (A, T, C and G). Mutations lead to the appearance of different haplotypes with different transmission and immunological properties. Up to H = 4U different haplotypes can appear in the simulation. Each haplotype h can be assigned its own specific U mutation rates mu(h) and 3U substitution probabilities , , , one for each site u and for each of the 3 possible new nucleotides at site u. Transmission, recovery, and sampling rates, as well as mutation rates, susceptibility, and triggered susceptibility types can be defined individually for every haplotype. Of particular interest, gene-gene, or epistatic, interactions can be flexibly modelled using this approach.

We refer to sequences carrying particular sets of variants as “haplotypes”, because two identical sequences can appear as a results of different mutation events, so they might not belong to the same clade in the final tree.

Epidemiological model

To model the host immunity process, we use a generalised SI-model. The compartments within each population represent different types of susceptible individuals or infectious individuals.

Different susceptible compartments in the same host population are used to model different types of immunity. These can correspond for example to host individuals that have recovered from previous exposure to different viral haplotypes. Susceptible compartments can also be used to represent different vaccination statuses. For each susceptible compartment Si, and for each viral haplotype h we consider a susceptibility coefficient σih which multiplicatively changes the transmission (birth) rate of the corresponding haplotype. In particular, σih = 0 corresponds to absolute resistance, similar to the R-compartment in SIR-model, but specific to individuals who have immunity type i and are exposed to haplotype h. 0 < σih < 1 would correspond to partial immunity, while σih > 1 corresponds to increased susceptibility.

Different infectious compartments within a host population correspond to individuals infected by a haplotype and can potentially infect susceptible hosts. As we mentioned in the section Fitness landscape, the transmission rate λh, recovery rate ρh and mutation rates can be set independently for each haplotype h. After recovery, a host individual that was infected with haplotype h, and therefore was in compartment for some population s, is moved to the corresponding susceptibility (immunity) compartment . Different haplotypes might however lead to the same types of immunity.

NB: The evolution of individual immunity is modeled as Markovian—it is determined only by the latest infection, and does not have memory about previous infections. Whether this assumption provides an accurate approximation of the immunity dynamics within the host population is an important consideration and may depend on the specific pathogen biology. Different haplotypes can lead to the same immunity. Some immunity types can be specific, e.g., to vaccination immunity without being associated with any haplotypes at all.

The rate of transmission of viral lineages within a population also depends on how frequently two host individuals come in contact with each other. To flexibly accommodate such differences, each population s is assigned a contact density δs parameter. This parameter can be used to simulate differences in the local population density, social behaviours etc. In the abscence of migration, the rate for an individual from susceptibility class (the susceptible compartment i within population s) to be infected with haplotype h by another individual within population s is where is the number of individuals in , is the class of individuals infected with haplotype h in population s, λh is the baseline transmission rate of individuals infected with haplotype h, and Ns is the total population size of deme s. If there is migration, the equation should be modified by taking into account the probability that two interacting individuals are in the same population (not necessarily in their home population). We give more details on accomodating migration in the section Migration and S1 File—section 2.

Direct transitions between susceptible compartments are possible, for which users can specify a transition matrix for susceptible compartments. A transition between susceptible compartment can be used for example to model a vaccination event, or the loss of immunity with time.

Population model

Demes.

The population model is based on an island (demic) model. Each population is described at each point in time by its total size Ns, number of infectious individuals with each viral haplotype h, number of susceptible host individuals for each susceptibility type i, relative contact density δs, and a population-specific non-pharmaceutical interventions strategy and effectiveness. All individuals within each deme have the same contact patterns.

Non-pharmaceutical interventions.

Several governments have imposed non-pharmaceutical intervention (NPI) measures such as lockdowns, mandatory mask wearing in public or social disctancing, during the COVID-19 pandemic as an effort to control the spread of SARS-CoV-2. Understanding the effects of non-pharmaceutical interventions is a crucial concern for designing effective public health strategies. We implement these measures as follows. When the total number of simultaneously infectious individuals in population s surpasses a certain user-defined population-specific percentage (e.g. 1%) of the population size Ns, the non-pharmaceutical interventions are imposed and the contact density δs is changed to a new (typically lower) population-specific value. When the percentage of the infectious individuals drops below a user-specified value (e.g. 0.1%) the measures are lifted and the contact density in population s reverts to its initial value δs.

Migration.

Migration is described by a matrix μsr which defines the probabilities at which an individual from source population s can be found in target population r. In our model, new infections occur due to contact between infectious individuals from one population and susceptible individuals from the second population when they contact within the same population. This can be due to the travel of a susceptible individual to a source population, where it contracts an infection, after which it returns back to the home population (q = r in Eq 1); or, to the travel of an infectious individual into a target population where it transmits the infection to a susceptible individual (q = s in Eq 1); or, to the travel of both infectious and susceptible individuals to a third population. The derivation of each term is explained in S1 File—section 2 (see S1 File—equation S.1). This model corresponds to short-term travel such as tourist or business trips, where an individual returns soon back to the home population. The proposed process is different from the traditional migration modelling in population genetics, when an individual moves permanently to a new population.

More in detail, transmissions occur when a susceptible individual with immunity i from population s meets an infected individual with haplotype h from population r while both of them are being in population q. The rate of such a transmission is

The total rate of transmission between compartments and is then (1) where μss = 1 − ∑qs μsq is the probability that an individual originally from population s is currently not in a different population. If r = s, this equation describes within-population transmission. Notice that the sum in the last equation does not depend on the actual counts of individuals in the compartments, and hence it can be precomputed in advance. r, s and q are not necessarily different, in particular for r = s we obtain the within-population transmission rate between and .

Since it is computationally demanding to keep track of how each migration rate between each pair of compartments is affected by each simulated event, instead we keep track of cumulative upper bounds on such migration rates (see S1 File—section 3 for details). In the case a potential migration event is sampled according to these upper bounds, we then proceed to calculate the precise migration rates and only sample a specific migration event according to its own exact rate. This is efficient when cross-population transmissions (migrations) are rare compared to within-population transmissions. This algorithmic implementation might perform suboptimally if population structure is extremely weak.

Sampling

Sampling is modelled using a continuous sampling scheme. In this scheme every individual infected with haplotype h in population s is sampled at rate csζh, the product of the haplotype-specific sampling rate ζh and the population modifier cs. Sampled individuals then instantly recover and are moved to susceptible group , effectively increasing the recovery rate ρh for by csζh. Alternatively, one can think about this sampling scheme as setting the recovery rate for to ρh + csζh and sampling an individual in upon its recovery with probability csζh/(ρh + csζh). More details can be found in S1 File—section 7.

Algorithm

The simulation process is split into two parts, forward and backward. In the forward run, a chain of events (including sampled cases) describing the dynamics of the epidemiological process at the population level is generated with the Gillespie algorithm [39]. In the backward run, our method simulates the genealogy of the samples in a coalescent-like manner while conditioning on the events generated during the forward run.

Forward run.

The forward run generates a chain of events which reflects the dynamics of the pandemic. Our implementation of the Gillespie algorithm is based on three algorithmic ideas: logarithmic direct method [39] (the events, or “reactions”, are organised in nested families, Fig 1), rejection-based approach [40] for migrations (see S1 File—section 3 for details), and organising propensity dependencies to avoid updating those propensities which are not affected by events [41]. Details are given in S1 File—section 4.

Backward run.

The backward run randomly builds a genealogical tree of the samples while conditioning on this chain of events generated in the forward run.

All of the ancestral lineages of the samples generated in the forward run belong to one of the infectious compartment corresponding to a haplotype h in a specific population s. Lineages are exchangeable within each compartment. Conditional on any event generated in the forward run, it is straightforward to calculate the probability that the event affected zero, one or two sample ancestral lineages in the backward run (see S1 File—section 5 for details).

Implementation details.

VGsim provides a convenient Python user interface. Performance-critical parts are implemented in C++ via Cython [42]. The dependencies are kept to a minimum: NumPy [43] and mc_lib —a small wrapper of the NumPy C API for generation of pseudorandom numbers in Cython [44].

Results

Correctness of the software implementation

To assess the correctness of our implementation, we compared simulated epidemiological trajectories and distributions of coalescent times produced by VGsim with those produced by MASTER [45]. In all cases, the results are consistent, and described in detail in S1 File—section 8.

Forward run performance

To test the scalability of the population model, we performed simulations with K = 2, 5, 10, 20, 50 and 100 total host populations with 2 ⋅ 109/K individuals in each and generated 10 million events (see Fig 1) in each run (see Table 3). There are 16 haplotypes resulting from two segregating sites with mutation rates 0.01 in each of them (this is unrealistically high, but it ensures that all the haplotypes appear in the simulation), and three susceptibility group with the first group corresponding to the absence of immunity, the second group corresponding to partial immunity and the last one corresponding to resistance to all strains. The transmission rate is λ = 2.5 for all haplotypes except one, and λ = 4.0 for this last haplotype. The recovery rate is ρ = 0.9, the sampling rate is ζ = 0.1 (so, the effective reproductive number is 2.5 which approximately correspond to SARS-CoV-2 [46] if the time unit is interpreted as one week). All the migration probabilities were set to μ = M/(K − 1), where M is the cumulative migration rate from a population. The runtime of the forward algorithm does not depend only on the cumulative migration rate M, but also on the percentage of potential migrations rejected by the algorithm (see section Migration for details), which appears to grow with M. However, the effect on runtime is relatively modest (in contrast to the naive algorithm which is quadratic in the number of populations, see Table 1) indicating that this approach scales well to pandemic-scale simulations.

thumbnail
Table 3. Run time to generate 10 million events.

The second number is the percentage of discarded events (due to migration acceptance/rejection). There are 16 = 42 haplotypes and 3 susceptible compartments. The sampling rate is set to ζ = 0.1, recovery rate is ρ = 0.9, transmission rate is λ = 2.5. The tests were run on a server node with Intel Xeon Gold 6152 2.1–3.7 GHz processor and 1536GB of memory.

https://doi.org/10.1371/journal.pcbi.1010409.t003

Backward run and overall performance

Our implementation of the backward run algorithm relies on an efficient and compact tree representation, a Prüfer-like code [47]. Each node is associated with an index in an array, and the corresponding entry in the array is the index of the parent node. The time needed to generate a tree mainly depends on two factors: the total number of events generated in the forward run, and the total number of samples in a tree. We report the execution time of the backward run in Table 4. The combined approach is sufficiently fast that it can be used to generate many replicate simulations as is often required to validate empirical methods and to train model parameters. Table 4 also shows the forward time, the total number of generated events and the total number of infected individuals over the simulation for various sampling rate (where the sampling rate ζ = 0.01 is 1 in 100 cases is sampled, ζ = 0.1 corresponds to 1 in 10 cases is sampled, and ζ = 1.0 means that every case is sampled), and various sample sizes. The simulation assumes the absence of immunity after infection (SIS-model), which allows to run the simulation sufficiently long to collect enough samples (instead, with an SIR-model susceptible individuals can be exhausted before the desired number of samples is simulated).

thumbnail
Table 4. Run time in seconds to generate a random genealogy for a sample of a certain size for different sampling rates.

The execution time is shown split into the time demand for the forward run and the one for the backward run only. We simulated 16 = 42 haplotypes and no host immunity. The recovery rate is ρ = 1.0 − ζ, with ζ the sampling rate, while the transmission rate is λ = 2.5 for all 16 haplotypes. The tests were run on a server node with an Intel Xeon Gold 6152 2.1–3.7 GHz processor and 1536GB of memory.

https://doi.org/10.1371/journal.pcbi.1010409.t004

To showcase the limit of applicability of our simulator, we also show in Table 4 the computational demand for the simulation of an unrealistically large (for now) genealogy of 150 million samples (with 1 in 100 cases sampled), for which we almost reached the memory limit available on our supercomputer node (1536GB) [48]. The total number of infections in the population is more than 15 billion cases, with the total number of events being more than 30 billions. The forward run time was approximately 9.5 hours and the backward run time was 13.5 minutes.

Comparison with other simulators

There are many epidemiological simulators which are capable of producing viral genealogies. Agent-based simulators (e.g. nosoi [49], FAVITES [50]) allow to create very detailed models, because every agent’s parameters can be set individually. The trade-off is they are computationally demanding, so only relatively small scenarios can be modelled. Other simulators (e.g. MASTER [45] and TiPS [32]) implement Gillespie algorithm for compartmental models, but they currently lack a simple user interface, instead requiring users to specify the full set of reaction equations, and they might be not specifically optimised for epidemiological purposes. On the other hand, both MASTER and TiPS implement approximate methods (tau-leaping and hybrid approaches), which decrease the time of forward simulation by orders of magnitude and hence might outperform our simulator depending on simulation scenario. VGsim is optimised to scale for large epidemics and genealogies, though approximate approaches are not available in the current implementation. It also has a simple and flexible user interface which helps merge together several complexities (epidemiology, evolution, population structure and cross-immunity). The detailed discussion of different simulating frameworks and detailed comparisons with them can be found in S1 File—section 9.

Simulating realistic nucleotide mutations

Our simulation framework generates a phylogenetic tree, and if the user specifies a scenario with strongly selected mutations, these are included in the output; we, however, do not include a method for simulating many neutral variants. To facilitate studies that require full viral genome sequences we have made the output of our approach compatible with that of phastSim [37]. Briefly, a user can easily load the output of our software into phastSim, and phastSim will generate neutral mutations, while leaving previously determined selected mutations unaffected.

Availability and future directions

VGsim is freely available from https://github.com/Genomics-HSE/VGsim under GPL-3.0 License. It is tested for Python 3.7 and later under Ubuntu and macOS. The documentation and tutorials are published at https://vg-sim.readthedocs.io/.

The future development of VGsim will include the following updates. We will consider improving performance by adding the τ-leaping algorithm and optimizing memory usage to handle larger numbers of genetic sites. We will also extend the available models by adding super-spreading events, life-cycle compartments, and new sampling schemes. We will also add recombination events, though they seem to be relatively rare [51] and so far are not a major driver of SARS-CoV-2 genetic diversity and evolution.

VGsim is particularly useful for simulating large datasets, in particular, in those cases when agent-based simulators become inefficient (see S1 File—section 9 for more detailed discussion). It is primarily optimised for the studies of world-wide pandemic scenarios, and it is motivated by the features of the ongoing SARS-CoV-2 pandemic. We plan for the future to add more features which would generalise its applicability to different pathogens (e.g. with complex life-cycle). Further possible optimisations of our algorithm will also be investigated.

Our implementation allows simulations of scenarios with a few loci with strong phenotypic effects. However, we cannot simulate the effect of many loci with mild fitness effects. While mild and widespread fitness effects can be simulated by phastSim, they are simulated in a typical phylogenetic way (using a substitution codon matrix with specifiable nonsynonymous/synonymous ratios) and so their impact on the tree shape and epidemiological dynamics are neglected.

Conclusion

We developed a fast simulator VGsim which can be used to produce genealogies of millions of samples from world-scale pandemic scenarios. Our method models many major aspects of epidemiological dynamics: viral molecular evolution, host population structure, host immunity, vaccinations and NPIs. We expect that VGsim will be a useful tool in method validation and in simulation-based statistical inference.

The performance of our simulator should meet the performance requirements of most studies. The flexible Python API, combination of epidemiological (including cross-immunity), population and evolutionary models make it a timely tool for the modern and future research.

Supporting information

S1 File. Supporting information.

Supporting information file with the introduction to epidemiological compartmental models, details of the migration model, presentation of the simulation procedure and algorithms, example of the software Python API, details on the implemented sampling strategy, validation of the software implementation and detailed discussion of other epidemiological simulators.

https://doi.org/10.1371/journal.pcbi.1010409.s001

(PDF)

Acknowledgments

We are thankful to Mikhail Shishkin for testing VGsim on the Apple Silicon M1 processor. This research was supported in part through computational resources of HPC facilities at HSE University [48].

References

  1. 1. Hodcroft EB, De Maio N, Lanfear R, MacCannell DR, Minh BQ, Schmidt HA, et al. Want to track pandemic variants faster? Fix the bioinformatics bottleneck. Nature Publishing Group; 2021. pmid:33649511
  2. 2. Gonzalez-Reiche AS, Hernandez MM, Sullivan MJ, Ciferri B, Alshammary H, Obla A, et al. Introductions and early spread of SARS-CoV-2 in the New York City area. Science. 2020;369(6501):297–301. Available from: https://science.sciencemag.org/content/369/6501/297 pmid:32471856
  3. 3. Nadeau SA, Vaughan TG, Scire J, Huisman JS, Stadler T. The origin and early spread of SARS-CoV-2 in Europe. Proceedings of the National Academy of Sciences. 2021;118(9). Available from: https://www.pnas.org/content/118/9/e2012008118 pmid:33571105
  4. 4. Ladner JT, Larsen BB, Bowers JR, Hepp CM, Bolyen E, Folkerts M, et al. An Early Pandemic Analysis of SARS-CoV-2 Population Structure and Dynamics in Arizona. mBio. 2020;11(5). Available from: https://mbio.asm.org/content/11/5/e02107-20 pmid:32887735
  5. 5. Komissarov AB, Safina KR, Garushyants SK, Fadeev AV, Sergeeva MV, Ivanova AA, et al. Genomic epidemiology of the early stages of the SARS-CoV-2 outbreak in Russia. Nature Communications. 2021 Jan;12(1):649. pmid:33510171
  6. 6. Lycett SJ, Hughes J, McHugh MP, da Silva Filipe A, Dewar R, Lu L, et al. Epidemic waves of COVID-19 in Scotland: a genomic perspective on the impact of the introduction and relaxation of lockdown on SARS-CoV-2. medRxiv. 2021. Available from: https://www.medrxiv.org/content/early/2021/01/20/2021.01.08.20248677.
  7. 7. Tegally H, Wilkinson E, Lessells RJ, Giandhari J, Pillay S, Msomi N, et al. Sixteen novel lineages of SARS-CoV-2 in South Africa. Nature Medicine. 2021 Mar;27(3):440–6. pmid:33531709
  8. 8. Garcia-Beltran WF, Lam EC, St Denis K, Nitido AD, Garcia ZH, Hauser BM, et al. Multiple SARS-CoV-2 variants escape neutralization by vaccine-induced humoral immunity. Cell. 2021. Available from: https://www.sciencedirect.com/science/article/pii/S0092867421002981
  9. 9. Burioni R, Topol EJ. Assessing the human immune response to SARS-CoV-2 variants. Nature Medicine. 2021 Apr;27(4):571–2. pmid:33649495
  10. 10. Zeng HL, Dichio V, Rodríguez Horta E, Thorell K, Aurell E. Global analysis of more than 50,000 SARS-CoV-2 genomes reveals epistasis between eight viral genes. Proceedings of the National Academy of Sciences. 2020;117(49):31519–26. Available from: https://www.pnas.org/content/117/49/31519 pmid:33203681
  11. 11. Rochman ND, Wolf YI, Faure G, Mutz P, Zhang F, Koonin EV. Ongoing global and regional adaptive evolution of SARS-CoV-2. Proceedings of the National Academy of Sciences. 2021;118(29). Available from: https://www.pnas.org/content/118/29/e2104241118
  12. 12. Kelleher J, Etheridge AM, McVean G. Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLOS Computational Biology. 2016 05;12(5):1–22. pmid:27145223
  13. 13. Durbin R. Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics. 2014 01;30(9):1266–72. pmid:24413527
  14. 14. Shchur V, Ziganurova L, Durbin R. Fast and scalable genome-wide inference of local tree topologies from large number of haplotypes based on tree consistent PBWT data structure. bioRxiv. 2019. Available from: https://www.biorxiv.org/content/early/2019/02/06/542035.
  15. 15. Kelleher J, Wong Y, Wohns AW, Fadil C, Albers PK, McVean G. Inferring whole-genome histories in large population datasets. Nature Genetics. 2019 Sep;51(9):1330–8. pmid:31477934
  16. 16. Kingman JFC. On the genealogy of large populations. Journal of Applied Probability. 1982;19(A):27–43.
  17. 17. Fisher RA, Russell EJ. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London Series A, Containing Papers of a Mathematical or Physical Character. 1922;222(594-604):309–68. Available from: https://royalsocietypublishing.org/doi/abs/10.1098/rsta.1922.0009.
  18. 18. Wright S. EVOLUTION IN MENDELIAN POPULATIONS. Genetics. 1931;16(2):97–159. Available from: https://www.genetics.org/content/16/2/97 pmid:17246615
  19. 19. Li N, Stephens M. Modeling Linkage Disequilibrium and Identifying Recombination Hotspots Using Single-Nucleotide Polymorphism Data. Genetics. 2003;165(4):2213–33. Available from: https://www.genetics.org/content/165/4/2213 pmid:14704198
  20. 20. Turakhia Y, Thornlow B, Hinrichs AS, De Maio N, Gozashti L, Lanfear R, et al. Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic. Nature Genetics. 2021 Jun;53(6):809–16. pmid:33972780
  21. 21. McBroome J, Thornlow B, Hinrichs AS, De Maio N, Goldman N, Haussler D, et al. matUtils: Tools to Interpret and Manipulate Mutation Annotated Trees. bioRxiv. 2021. Available from: https://www.biorxiv.org/content/early/2021/04/04/2021.04.03.438321.
  22. 22. Kingman JFC. The coalescent. Stochastic Processes and their Applications. 1982;13(3):235–48. Available from: https://www.sciencedirect.com/science/article/pii/0304414982900114.
  23. 23. Rosenberg NA, Nordborg M. Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nature Reviews Genetics. 2002 May;3(5):380–90. pmid:11988763
  24. 24. Drummond AJ, Rambaut A, Shapiro B, Pybus OG. Bayesian Coalescent Inference of Past Population Dynamics from Molecular Sequences. Molecular Biology and Evolution. 2005 02;22(5):1185–92. pmid:15703244
  25. 25. De Maio N, Wilson DJ. The Bacterial Sequential Markov Coalescent. Genetics. 2017 05;206(1):333–43. pmid:28258183
  26. 26. Volz EM, Kosakovsky Pond SL, Ward MJ, Leigh Brown AJ, Frost SDW. Phylodynamics of Infectious Disease Epidemics. Genetics. 2009 12;183(4):1421–30. pmid:19797047
  27. 27. Volz EM, Koelle K, Bedford T. Viral Phylodynamics. PLOS Computational Biology. 2013 03;9(3):1–12.
  28. 28. Lambert A, Stadler T. Birth–death models and coalescent point processes: The shape and probability of reconstructed phylogenies. Theoretical Population Biology. 2013;90:113–28. Available from: https://www.sciencedirect.com/science/article/pii/S0040580913001056 pmid:24157567
  29. 29. Stadler T. On incomplete sampling under birth–death models and connections to the sampling-based coalescent. Journal of Theoretical Biology. 2009;261(1):58–66. Available from: https://www.sciencedirect.com/science/article/pii/S0022519309003300 pmid:19631666
  30. 30. Brauer F. Compartmental models in epidemiology. In: Mathematical epidemiology. Springer; 2008. p. 19–79.
  31. 31. Volz EM, Siveroni I. Bayesian phylodynamic inference with complex models. PLOS Computational Biology. 2018 11;14(11). pmid:30422979
  32. 32. Danesh G, Saulnier E, Gascuel O, Choisy M, Alizon S. Simulating trajectories and phylogenies from population dynamics models with TiPS. bioRxiv. 2020. Available from: https://www.biorxiv.org/content/early/2020/11/09/2020.11.09.373795.
  33. 33. Kern AD, Schrider DR. Discoal: flexible coalescent simulations with selection. Bioinformatics. 2016 08;32(24):3839–41. pmid:27559153
  34. 34. Ewing G, Hermisson J. MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics. 2010 06;26(16):2064–5. pmid:20591904
  35. 35. Kryazhimskiy S, Dushoff J, Bazykin GA, Plotkin JB. Prevalence of Epistasis in the Evolution of Influenza A Surface Proteins. PLOS Genetics. 2011 02;7(2):1–11. pmid:21390205
  36. 36. Sanjuán R, Moya A, Elena SF. The contribution of epistasis to the architecture of fitness in an RNA virus. Proceedings of the National Academy of Sciences. 2004;101(43):15376–9. Available from: https://www.pnas.org/content/101/43/15376
  37. 37. De Maio N, Weilguny L, Walker CR, Turakhia Y, Corbett-Detig R, Goldman N. phastSim: efficient simulation of sequence evolution for pandemic-scale datasets. bioRxiv. 2021. Available from: https://www.biorxiv.org/content/early/2021/03/16/2021.03.15.435416 pmid:33758852
  38. 38. Kermack William Ogilvy MAG, Thomas WG. Thomas A contribution to the mathematical theory of epidemics. Proceedings of Royal Society A. 1927;115:700–721.
  39. 39. Gillespie DT. Stochastic Simulation of Chemical Kinetics. Annual Review of Physical Chemistry. 2007;58(1):35–55. pmid:17037977
  40. 40. Thanh VH, Priami C, Zunino R. Efficient rejection-based simulation of biochemical reactions with stochastic noise and delays. The Journal of Chemical Physics. 2014;141(13):134116. pmid:25296793
  41. 41. Cao Y, Li H, Petzold L. Efficient formulation of the stochastic simulation algorithm for chemically reacting systems. The Journal of Chemical Physics. 2004;121(9):4059–67. pmid:15332951
  42. 42. Behnel S, Bradshaw R, Citro C, Dalcin L, Seljebotn DS, Smith K. Cython: The best of both worlds. Computing in Science & Engineering. 2011;13(2):31–9.
  43. 43. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020 Sep;585(7825):357–62. pmid:32939066
  44. 44. Burovski E, Godyaev D, Gorbunova V. mc_lib: Assorted small utilities for MC simulations with Cython;.
  45. 45. Vaughan TG, Drummond AJ. A Stochastic Simulator of Birth–Death Master Equations with Application to Phylodynamics. Molecular Biology and Evolution. 2013 03;30(6):1480–93. pmid:23505043
  46. 46. Billah MA, Miah MM, Khan MN. Reproductive number of coronavirus: A systematic review and meta-analysis based on global level evidence. PLOS ONE. 2020 11;15(11):1–17. pmid:33175914
  47. 47. Prüfer H. Neuer Beweis eines Satzes über Permutationen. Arch Math Phys. 1918.
  48. 48. Kostenetskiy PS, Chulkevich RA, Kozyrev VI. HPC Resources of the Higher School of Economics. Journal of Physics: Conference Series. 2021 jan;1740:012050.
  49. 49. Lequime S, Bastide P, Dellicour S, Lemey P, Baele G. nosoi: A stochastic agent-based transmission chain simulation framework in R. Methods in Ecology and Evolution. 2020;11(8):1002–7. Available from: https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13422. pmid:32983401
  50. 50. Moshiri N, Ragonnet-Cronin M, Wertheim JO, Mirarab S. FAVITES: simultaneous simulation of transmission networks, phylogenetic trees and sequences. Bioinformatics. 2018 11;35(11):1852–61.
  51. 51. Turkahia Y, Thornlow B, Hinrichs A, McBroome J, Ayala N, Ye C, et al. Pandemic-Scale Phylogenomics Reveals Elevated Recombination Rates in the SARS-CoV-2 Spike Region. bioRxiv. 2021. Available from: https://www.biorxiv.org/content/early/2021/08/05/2021.08.04.455157.