Modeling neutral viral mutations in the spread of SARS-CoV-2 epidemics

Although traditional models of epidemic spreading focus on the number of infected, susceptible and recovered individuals, a lot of attention has been devoted to integrate epidemic models with population genetics. Here we develop an individual-based model for epidemic spreading on networks in which viruses are explicitly represented by finite chains of nucleotides that can mutate inside the host. Under the hypothesis of neutral evolution we compute analytically the average pairwise genetic distance between all infecting viruses over time. We also derive a mean-field version of this equation that can be added directly to compartmental models such as SIR or SEIR to estimate the genetic evolution. We compare our results with the inferred genetic evolution of SARS-CoV-2 at the beginning of the epidemic in China and found good agreement with the analytical solution of our model. Finally, using genetic distance as a proxy for different strains, we use numerical simulations to show that the lower the connectivity between communities, e.g., cities, the higher the probability of reinfection.

1 Appendix A: Simulation parameters 1 The network simulations follow the model proposed in reference [1], and the parameters 2 are displayed in Table 1 Table 1. Simulation Parameters. The number of nodes in each simulation is described properly. *This is the input average degree for the networks construction, but the actual value for each realization fluctuates. For the communities simulations, this is the parameter for constructing each isolated network, as also for the control case p = 0.
For the numerical solution of mean-field approaches, following the SEIR model we have used the following parameters: R 0 = 2.4, γ = 1/14 day −1 ; β = R 0 γ and 5 σ = 1/ t i , where t i is the mean period of incubation, averaged over the distribution 6 from Table 1 [1]. 7 2 Appendix B: Analytical calculations 8 Our goal is to derive a recurrence equation for the average genetic distance, i.e., given 9 the distance d t at time t, we aim to calculate the distance d t+1 at time t + 1. The idea 10 is to calculate d t+1 as a weighted average, where the weights are the number of pairs 11 that are distanced by a certain amount. In a SEIR model, every iteration starts with a 12 given number of recovered (R t ), infected (I t ) and exposed (E t ) individuals. When an 13 individual recovers, its infecting virus stops to spread and to evolve, and we call it a offspring (x t ) increase the number of viruses in Exposed individuals in the next 20 iteration, when they will be allowed to evolve.

21
At the beginning of iteration t + 1, there are (R t + E t + I t )(R t + E t + I t − 1)/2 pairs 22 of viruses sharing an average distance equal to d t , but along the iteration some of the 23 distances may increase by a certain amount to be calculated, as also new viruses may 24 arise. Therefore, where Z t is a normalization factor, which counts the total number of pairs at the end of 26 iteration t + 1, If the mutation rate is zero and no new infections occur (x t = 0) the "Increases" term 28 and the "Offspring" term are equal to zero, and d t+1 = d t , as expected.

29
In the following two subsections, we shall calculate the "Increases" term and the 30 "Offspring" term, which accounts for the evolution and for the spread, respectively.

32
Genetic distances between evolving viruses increase over time. In order to calculate how 33 much these distances increase we first consider that mutations occurring in the same 34 locus of different genomes are unlikely, as well as more than one mutation per locus on a 35 single genome. This approximation holds as long as the epidemic duration T remains 36 sufficiently small, µT 1. Thus, after one time step, an evolving genome acquires, on 37 average, Bµ mutations. The distance between two evolving genomes will increase, on 38 average, by 2Bµ nucleotides after one time step. The distance between viruses in 39 exposed individuals, for example, increases by 2Bµ and because there are E t (E t − 1)/2 40 pairs of exposed individuals, their evolution along the iteration t + 1 contributes 41 2BµE t (E t − 1)/2 to the Increases term. On the other hand, the distance between 42 viruses in an exposed and a recovered individual, or an infected individual that recovers, 43 is only Bµ, because the latter two do not evolve. There are E t (R t + r t ) pairs among 44 these viruses, and thus their contribution to Increases is E t (R t + r t )Bµ. We recall that 45 the updates in our model occur in the order "Transmission", "Attempt to Recovery" 46 and lastly, "Genome Evolution". Thus, if an infected individual recovers its virus does 47 not have the chance to mutate.

48
Therefore, in order to compute the Increases term, we must calculate the average 49 increase in distance between all pairs of viruses and how many pairs of these viruses 50 exist. Table 2 summarizes this information. We obtain 51 The contribution of the new infections to the average distance d t+1 , the Offspring term, 53 is more tricky. To simplify matters we will assume that an infected individual infects 54 only one susceptible per time step, which is a good assumption if the basic reproduction 55 number R 0 is small compared to the average duration of symptoms. Thus, x t is also the 56 number of individuals who infected a susceptible within the time step t + 1, which will 57 be called ancestors from now on. Let D 1 be the average distance between ancestors and 58 the other viruses at time t, and D 2 , the distance between the exposed and the other 59 viruses. Note that an ancestor may recover and, therefore, not mutate in this time step. 60 The Offspring term is a sum of different contributions between offspring and the other 61 viruses in the population, as explained in detail below. 62 1. Genetic distance between offspring and recovered. The number of pairs is x t R t .

63
Because offspring do not evolve in the time step they appear, their average 64 distance is D 1 . Then, its contribution to the Offspring term is x t R t D 1 .

65
2. Genetic distance between offspring and exposed. The number of pairs is x t E t .

66
Because the exposed evolve, these pairs contribute with x t E t (D 2 + Bµ) to the 67 Offspring term.

77
(c) The distance between the offspring and individuals that recover is D 1 , 78 because neither of these viruses evolve in this time step. There are 79 x t ((I t − r t )/I t )r t pairs of these viruses, adding x t ((I t − r t )/I t )r t D 1 to the 80 Offspring term.

86
There are x t r t /I t new infections of this type, contributing 87 (x t r t /I t )(r t − 1)D 1 to the Offspring term.

(c) The distance between offspring and the other infected individuals is
89 Bµ), since the other infected viruses evolve..
Putting all these terms together and defining Z t ≡ 2Z t we obtain The reason for assigning the distance D 1 between infected and other viruses, instead 94 of d t , is that infected individuals represent only a fraction of the viruses in the 95 population, and the distance between them and other viruses grows over time, therefore 96 being above the average d t . The same holds for the exposed individuals.

97
Although we were not able to analytically find an expression for D 1 and D 2 , we can 98 approximate them as follows. First we assume that D 2 ≈ D 1 . When the epidemic 99 begins, all viruses are infected, so that D 1 = d t . However, the ratio between infected 100 and recovered individuals decreases to zero along the epidemic, making D 1 larger than 101 d t . Thus, to first order, it is possible to approximate D 1 ≈ d t (1 + ), with a function 102 of the number of recovered individuals, R t /(I t + E t + R t ) and the average number of 103 mutations Bµ. Our simulations showed that the linear function 104 D 1 = d t (1 + 2BµR t /(I t + E t + R t )) works well (considering the parameters in Appendix 105 A), leading to the theoretical result expressed by Eq.(2) from the main text. 106

107
To achieve the continuum limit we start by substituting r t = R t+1 − R t and (2) from the main text and subtracting d t from both sides of this equation: Then, we consider the first order approximations and once Bµ in the last line of Eq.(A7) is the number of mutations per time step, we replace it by Bµ∆ṫ Finally, by taking the limit ∆t → 0 we obtain the continuous time equation. 108

109
The average distance d (i) root,t between viruses from a lineage and its root is calculated using the same technique discussed above, however it is much simpler, once we only need to calculate the average distance from a kind of virus and the root (a single virus which does not evolve). Using the same notation, but now with a super-index to denote the lineage, we obtain 1,root being the average distance between 110 infected and the root, which is given (similarly to D 1 ) by The factor 4 is a fit from numerical investigations. The continuum limit is obtained by 112 subtracting d In order to estimate the real (from genetic data) genetic evolution, we used 55 complete 116 genome sequences collected in China [8]. First, these sequences were ordered and 117 July 12, 2021 5/10 numbered by its collection date and a matrix of genetic distances d ij between genomes i 118 and j has been constructed. Each pair of sequences were alligned with the 119 Needleman-Wunsch algorithm, with score +1 for match and −1 for mismatch [9]. Then, 120 the distance between two genomes was computed counting the number of substitutions 121 between the sequences, neglecting indels. 122 We defined a time window τ W = 14 = τ 0 days. Thus, every genome collected within 123 τ W are considered infected, and the genomes collected before this time window are 124 considered recovered. Now, we calculate the average distance among the infected d I,t , 125 recovered d R,t and among infected and recovered d IR,t at the time t. Fig.1 shows an 126 example of a distance matrix with a specific time window. Finally, the average distance 127 at time t can be computed as where I t and R t are respectively given by I(t) and R(t) described evaluated in the 129 supplemental material.

130
With this algorithm, we obtained 20 non-overlapping sets of infected genomes. One 131 of these sets contained only one sequence and was not usable; a second set was too far 132 from all other data and was also discarded. Thus, we were able to calculate 18 points 133 (that appear in Fig.4 from the main text) with error bars given by the standard 134 deviation of each set of distances (between infected, recovered and between infected and 135 recovered) at each time t .  Fig 1. Example of distance matrix to illustrate the algorithm to infer the genetic evolution. Every genome collected within a time window τ W is considered to belong to an infected individual. The red block shows distances between these viruses. The blue block shows viruses that appeared before the present time window, whose individuals are considered to have recovered. Green blocks are distances between infected and recovered individuals. The remaining entries are distances from viruses that have not appeared yet at that considered time, i.e., they appeared after the considered time window.

137
We got the Chinese epidemic data from the dataset "Epidemic Data for Novel 138 Coronavirus COVID-19" from Wolfram data repository [10]. Unfortunately, this dataset 139 starts on 22 January (going up to 18 August by the date of our analysis), lacking the 140 previous data. Another concern is about the change in the notification protocols 141 adopted by the Chinese government. On 13 February, the Hubei province started to 142 report not only the positive laboratory tests, but also the clinically diagnosed cases as 143 infected too, appearing a sudden increase in infected curve [11]. We also need to correct 144 the data by including undetected cases. 145 Firstly, in order to correct the notification problem, we smoothly distribute the 146 sudden increase number of cases among the previous dates. Following reference [11], the 147 Now, the undetected cases in China were estimated in reference [12], and also 152 following reference [11], we get and the infected curve is now given as 158 Fig.2 shows the curves after these corrections. Once we do no have directly access to 159 exposed data, we did not consider exposed individuals, meaning, at this point, that we 160 are dealing with a SIR model without any prejudice to the present theory. However, 161 bad data is an important source of error. Finally, we fit an exponential curve to a few initial data points of I(t) and R(t) and 163 extrapolate it to previous dates. For the I-curve, we have adjusted the exponential 164 e a(t−t0) , with fit parameters a and t 0 , on the first n I = 10 data points and extrapolated 165 it up to the first case t 0 days before. With this approach, we found t 0 = 11 Dec, which 166 is close to the first case reported by WHO, 08 Dec [13]. For the R(t)-curve, we have 167 used the first n R = 13 data points. The numbers n I and n R were chosen in order to 168 make the exponential extrapolation makes sense according to WHO estimates of the 169 first case, as also to make R(t) < I(t) in a plausible way. 170 Now, the curves R(t) and I(t) can be implemented in the recurrence equation and 171 the distance evolution can be estimated, with the first distance d 0 equalling zero.