phastSim: Efficient simulation of sequence evolution for pandemic-scale datasets

Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, and are an essential component of some inference methods. The ongoing surge in available sequence data is however testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here, we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100, 000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and it implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutability models that we developed to more realistically represent SARS-CoV-2 genome evolution.


Comparison of memory demand
On the Y axis we show the maximum memory demand in kB to perform simulations using different software. On the X axis is the number of tips simulated. Each point represents the mean of ten replicates. In red is the memory demand of phastSim with a concise output, and in orange of phastSim with additionally generating a FASTA format output (these values largely overlap). In green is the demand of pyvolve, and in purple of Seq-Gen. In yellow and brown are respectively the memory demand of INDELible with method 1 (matrix exponentiation) and method 2 (Gillespie approach).  On the X axis is the model used for simulations: "nucleotide" is a nucleotide substitution model without variation; "nuc+10cat" is a nucleotide model with 10 rate categories; "nuc+alpha" is a nucleotide model with continuous variation in rate (each site has a distinct rate sampled from a Gamma distribution); "codon" represents a codon substitution model; "codon+10cat" represents a codon substitution model with On the Y axis we show the maximum memory demand in kB to perform simulations using different software. On the X axis is the rescaling factor we use to make the phylogenetic tree branch lengths longer or shorter. Here we used alignments of 5000 tips.
Testing the correctness of phastSim simulations 3 Tree-likeness of simulated alignments and correctness of the 4 simulated substitution process 5 We ran a set of systematic simulations to assess if the genomes simulated by phastSim 6 adhered to the evolutionary history represented by the input phylogenetic tree, and if 7 the substitution process simulated by phastSim well represents the one specified by the 8 user. To do this, we simulated 100 replicates each with a random trees with 8 tips and 9 random branch lengths between 0.0005 and 0.0015, simulated with ETE3 [1]. For each 10 replicate we simulated genome evolution with phastSim under a GTR model with 11 random substitution rates (uniformly sampled between 0 and 1) but constant nucleotide 12 frequencies (0.1, 0.2, 0.3, and 0.4 respectively for A, C, G and T); the substitution 13 matrix was normalized as usually done in phylogenetics before simulating sequence 14 evolution. We did not simulate indels. A random root genome sequence of length 10 6 15 bases was sampled for each replicate according to the equilibrium nucleotide frequencies. 16 Then, for each of the 100 replicates, we ran RAxML v8.2.11 (raxmlHPC) [2] with a 17 GTR model and no rate variation. The trees inferred by RaxML were always identical  In a second set of simulations of rate variation we used the same setting as above but 32 simulated continuous rate variation across sites in phastSim under a gamma model with 33 α = 1.5. We then ran inference of tree and α with RAxML-NG v1.0.2 [3]

37
In the third set of simulations of rate variation we used again the same setting as . C Estimated α parameter representing variation in substitution rates across the genome (the simulated value was α = 1.5); part of the discordance between simulated and estimated value in C is likely due to the fact that we simulated continuous variation in substitution rates, which was approximated by RAxML-NG with 20 discretized rate categories.  branch of length 10 −4 substitutions per site (newick tree format "(S1:0.0001,S2:0.0);"). 50 In the base scenario, insertion and deletion rates were both equal to the substitution 51 rate, and indel lengths were geometrically distributed with parameter 0.3 (mean length 52 10/3). In addition to the base scenario, we consider 5 further modified scenarios:

53
• "Longer branch", same as the base scenario but with a 3 times longer branch 54 separating the two samples.

55
• "More insertions", same as the base scenario but with 3 times higher insertion 56 rate.

58
• "Longer insertions", same as the base scenario but with 3 times longer (on 59 average) insertions.

61
For each scenario we ran 50 replicates for each of the two software, always using a