Bayesian Monitoring of Emerging Infectious Diseases

We define data analyses to monitor a change in R, the average number of secondary cases caused by a typical infected individual. The input dataset consists of incident cases partitioned into outbreaks, each initiated from a single index case. We split the input dataset into two successive subsets, to evaluate two successive R values, according to the Bayesian paradigm. We used the Bayes factor between the model with two different R values and that with a single R value to justify that the change in R is statistically significant. We validated our approach using simulated data, generated using known R. In particular, we found that claiming two distinct R values may depend significantly on the number of outbreaks. We then reanalyzed data previously studied by Jansen et al. [Jansen et al. Science 301 (5634), 804], concerning the effective reproduction number for measles in the UK, during 1995–2002. Our analyses showed that the 1995–2002 dataset should be divided into two separate subsets for the periods 1995–1998 and 1999–2002. In contrast, Jansen et al. take this splitting point as input of their analysis. Our estimated effective reproduction numbers R are in good agreement with those found by Jansen et al. In conclusion, our methodology for detecting temporal changes in R using outbreak-size data worked satisfactorily with both simulated and real-world data. The methodology may be used for updating R in real time, as surveillance outbreak data become available.


Introduction
Incidence and prevalence are standard epidemiological indicators, monitored to understand disease dynamic within society [1,2]. In the case of infectious diseases, it is customary to measure how far an epidemic is from eradication by calculating yet another epidemiological parameter, the basic reproduction number, denoted by R 0 (e.g., [3], pp. [4][5]. By definition, R 0 represents the average number of secondary cases caused by a typical infectious individual in a fully susceptible population. If R 0 is larger than 1, then an outbreak becomes an epidemic; otherwise, it goes extinct. If the population is not fully susceptible, then one calculates the effective reproduction number, denoted by R [4,5]. However, neither R 0 nor R is regularly monitored by public-health authorities (e.g., [6], pp. 39-65).
R 0 depends on a number of factors pertaining to the pathogen-host biology, such as pathogen transmissibility and the natural history of disease. In addition, R 0 may depend on factors pertaining to host sociology, such as population density and social awareness about epidemics. All these factors may change with time. Changes in R 0 may signal that the pathogen has become more transmissible, virulent or persistent in the population. They may also signal societal re-organization in response to the epidemic dynamic. Particularly important are changes in R, which, in addition to changes in R 0 , may also reflect changes in the susceptibility of the population. Monitoring changes in R may thus be instrumental in determining the success of public-health interventions such as mass vaccination [4,7,8].
An epidemiological situation that would benefit from R 0 and/or R monitoring is that of a zoonotic pathogen repeatedly introduced in a population where it undergoes subcritical transmission. As the pathogen explores more and more hosts, the opportunity for mutations increases, with increasing chance for pandemic strains to occur [9]. Possible applications of this scenario may be the cases of the pre-pandemic severe acute respiratory syndrome [10] and Middle East respiratory syndrome [11].
Another application of R 0 and/or R monitoring is to assessing the success of mass vaccination by surveying disease outbreaks before and after a major epidemiological event, such as implementation of mass vaccination [5] and loss of confidence in vaccination programs [7]. Surveillance and contact tracing provides a means of monitoring outbreaks by permanent registration of new cases. For example, monkeypox [12] and non-zoonotic measles [4] are potential candidates for eradication through vaccination. Monkeypox remains worrisome for the possibility that repeated introductions in the human population may yield novel strains of human poxes [13]. Following vaccine licensing in 1963, measles incidence in the United States decreased by more than 95% [14]. Nevertheless, recent R analysis [8] shows the potential for measles to re-emerge and emphasizes the importance of continued surveillance. In this work, we discuss monitoring R using outbreak-size surveillance data; only minor modifications are needed to apply our methodology to monitoring R 0 of emerging infectious diseases.
There exist two major approaches to estimate R from outbreak-size data. In the first approach, R is estimated using the maximum likelihood method. Blumberg et al. [8,15] used Galton-Watson branching processes to construct the likelihood of observed data consisting of outbreak sizes. By maximizing the likelihood function, they calculated R for monkeypox in the Democratic Republic of Congo (1980)(1981)(1982)(1983)(1984) and measles in the United States (2001-2011). Jansen et al. [7] used continuous-time Markov chains to construct the likelihood of observed data. By maximizing the likelihood function, they calculated R for measles outbreaks in the United Kingdom (1995Kingdom ( -2002, advocating for an increase in R due to loss of confidence in the vaccination program. The second approach is based on Bayesian inference, which requires a prior distribution describing the current knowledge on the parameter of interest (e.g., R) and the likelihood of observing the data set according to a model of choice [16]. As a result, one computes a posterior distribution, representing an upgrade of the prior, according to the data. Hence Bayesian inference follows closely the principle of surveillance and learning processes.
Farrington et al. [4] proposed two alternative ways to constructing the posterior probability of R based on (1) outbreak size and (2) outbreak duration. They estimated R for measles (1997)(1998)(1999) outbreaks in the United States using data from 41 outbreaks caused by a single introduction. Angelov et al. [17] estimated R from a set of clusters sizes, where each cluster consisted from a known number of outbreaks. Following a Bayesian approach, they calculated R for mumps for three different regions of Bulgaria (2005)(2006)(2007)(2008). Yanev et al. [18] described different Bayesian estimators under two different families of loss functions to study multiple outbreaks. They calculated R to describe transmission of smallpox in Europe (1960)(1961)(1962)(1963)(1964)(1965)(1966)(1967)(1968)(1969)(1970). Prior knowledge on R was obtained from past data covering the period 1951-1960.
Previous literature focused on extracting a single R value from a given epidemiological dataset. In this work, we describe a methodology to estimate two time-ordered R values from the same set of surveillance data, in the regime of subcritical transmission (R<1); however, the generalization to multiple values is straightforward.

Monitoring Temporal Changes in R
A patient of an outbreak directly infects a number of individuals, so-called secondary cases. In successive generations, each secondary case continues to spread the pathogen, causing secondary cases by themselves. The network of who has infected whom has a tree structure, which, for emerging or re-emerging diseases, may be described using the theory of branching processes [19]. We assume that the data set, denoted by T, is an array of N sizes of trees (i.e., outbreak sizes) ordered by the infection time of the index patient. We analyze the transmission process in terms of the effective reproduction number, R, which may change with time due to biological (e.g., pathogen transmissibility and virulence) and/or sociological (e.g., frequency of contact between individuals) factors.

General theory
We first briefly present how to compute a single value of R from an epidemiological dataset, in the Bayesian paradigm. To express lack of knowledge about R, we use a non-informative improper prior π(R) = 1, where R is uniformly distributed from zero to infinity. We construct the likelihood L(T|R) that the dataset T is observed, assuming that the effective reproduction number is R. Given that the detected trees have sizes n 1 ,. . .,n N , ordered according to the date of infection of index patients, we have [15,19] where p(n i ,R) is the probability that the single transmission tree i has size n i [19]. The posterior distributionpðRjΤÞ of R for the observed dataset T is calculated according to Bayes' rule [4] pðRjΤÞ / LðΤjRÞpðRÞ ð 2Þ which yieldsp The posterior distributionpðRjΤÞ was used to calculate the average effective reproduction number hRi and its 95% credible interval (CI).
The quality of R statistics depends not only on the number of trees but also on their sizes. In short, we summarize the amount of data using the concept of information. According to the definition (e.g., [16], pp. 32), information is given by the logarithm of the probability to observe the given dataset T. In our case, the information, denoted by I T , is given by the natural logarithm of the likelihood L(T|R) (c.f. Eq 1) so that information is measured in natural units (i.e., nat). Hence, information is presented as the sum of N contributions, one for each tree in T (i.e., an extensive quantity over the number of trees). For a numerical estimate of I T corresponding to the dataset, we used the average of R (i.e., hRi) obtained from the Bayesian framework.
Monitoring the pandemic risk of emerging infectious diseases, one extracts at least two time-ordered R values from the same dataset T. Hence, the dataset T is divided into two timeordered subsets T a and T b = T\T a . The Bayesian approach described above (c.f. Eq 3) is then applied independently to each subset T a,b to estimate two effective reproduction numbers R a,b . Obviously, the estimates R a,b depend on the selection of T a,b . For example, if T a is nearly all T then R b is badly estimated.
To justify the choice of extracting two R values from T, we proceed as follows. We denote by H T the model with a single R estimate and by H a,b the model with two R estimates. Given R a,b , the likelihood for the dataset T a [T b is given by We evaluate whether H a,b is more plausible than H T by calculating the Bayes factor [20], using the corresponding likelihoods. When our initial beliefs are a priori equally probable, pr(H a,b ) = pr (H T ), the Bayes factor expresses how well the observed data were predicted by H a,b , compared to H T ; i.e., the higher the value of B(T a ), the more is justified to extract two R values rather than one from T. Kaas et al. [20] provided an interpretation of the strength of the second model H a,b in terms of four categories according to the gradation of 2lnB(T a ). They suggested a very strong preference for H a,b if 2lnB(T a )>10. In our model, we make the same decision. We extract two R a,b values when 2lnB(T a )>10; otherwise, we extract a single value of R from T. Our R analysis is performed according to the following steps. For every i = 1,. . .,(N−1): 1. Let T a consist of the first i trees in T and T b = T\T a ; 2. Estimate the pair of posterior distributionspðR a;b jΤ a;b Þ using Eq 3; Numerical integration of the normalization constant was performed using the trapezoidal rule.
3. Use the average hRi a,b as an estimate of the parameters R a,b .
Finally, we denote by Τ Ã a;b the sets T a,b which yield the largest value of 2lnB(T a ). If 2lnBðΤ Ã a Þ > 10, we accept hRi a,b as the best estimates of effective reproduction numbers on Τ Ã a;b and we denote them by hRi Ãa;b ; otherwise, we calculate a single R value on T using Eq 3.

Numerical tests using synthetic data
Synthetic data consisting of arrays of sizes of transmission trees were simulated, assuming that the number of cases caused by each infected individual is a Poisson deviate with average R. Tree sizes are random integers given by a distribution p(n i ,R), obtained from the probabilitygeneration function for the Galton-Watson branching processes according to Pitman [19]. For the Poisson offspring distribution, we obtain The synthetic dataset, consisting of sizes of N transmission trees with known R, was used to validate our methodology of estimating R a,b . The analytical approach according to Eq 3 yields hRi ¼ 1 À 1= n þ 1=ð nNÞ ð 7Þ where n is the average size of the trees in T. The corresponding standard deviation becomes negligible when N)1. Eq 7 gives R = 1 if T consists of a single tree. For large tree number in T, the third term goes to zero and Eq 7 becomes identical to the formula derived from the maximum likelihood L(T|R) method. Using Eq 7, we estimate the change of R in real time, as new epidemiological data become available. Suppose we add a new tree of size n to T. The corresponding change in R, shows that, for a sufficiently large tree number in the dataset N)1, the sign of the change in R is determined by the difference between the size n of the newly added tree and n. We now proceed with the discussion of our simulations. In the first example, we generated a homogeneous dataset T, consisting of 100 trees with R equal to 0.6. Fig 1(a) shows 2lnB(T a ) as a function of information (c.f. Eq 4) in the first subset T a . The observed maximum is~3, indicating weak justification for calculating two R values. Hence we concluded that this dataset should be assigned a single R estimate.
In the second example, we assumed a stepwise increase of R, modeling adaptation of the pathogen to human-to-human transmission. We generated a non-homogeneous synthetic dataset where 50 trees were generated with R a = 0.6, while the remaining 50 trees were generated with R b = 0.85. The parameter 2lnB(T a ) (c.f. Fig 1(b)) reaches its maximum at the 55 th tree. Contrary to the previous example, the observed maximum of 2lnB(T a ) is~14, suggesting strong justification for calculating R a,b . The best R a,b estimates, denoted by R Ãa;b , have nonoverlapping error bars (95% CI) and are in satisfactory agreement with the numerical choices for the parameters when 2lnB(T a )>10 (c.f. Fig 1(c)).
In order to clarify the minimal size of the initial dataset required to get reliable estimations of R a,b , we performed evaluations of R Ãa;b for statistically independent synthetic datasets T, containing from 2 to 200 trees. For each size of T, we averaged 120 independent realizations to alleviate stochastic effects . Fig 2(a) shows the average 2lnBðΤ Ã a Þ and its 95% confidence interval as a function of the total number of trees in the homogeneous datasets, when R equal to 0.6. The values are below 10 and the featured dependence is not particularly sensitive to the number of trees in T.
The situation is quite different for non-homogeneous datasets. As an example, we generated a set of trees T with R equal to 0.6 and 0.85 for the first and second halves, respectively. The results are presented in Fig 2(b), where the average 2lnBðΤ Ã a Þ increases with the number of trees in T. For a low number of trees (up to~100 trees in T), the average of 2lnBðΤ Ã a Þ is less than 10, indicating no preference to estimate two R values. For a high number of trees, the average of 2lnBðΤ Ã a Þ is larger than 10, demonstrating that the model with two R values is superior. For small sizes of T (c.f . Fig 2(c)), average estimates of R Ãa;b display strong fluctuations. Starting with~50 trees per dataset T, R estimates are neater. With further increasing the size of T, R estimates are more grouped around the exact values and the error bars (95% CI) are smaller.

Application to epidemiological data
We applied our method of evaluating two R values from an epidemiological dataset. We aimed to reproduce recent results obtained by Jansen et al. [7], concerning the transmission dynamics of measles in the UK during 1995-2002. Although measles-elimination programs were set worldwide, measles eradication has not yet been achieved. In the late nineties, the safety of a combined measles-mumps-rubella (MMR) vaccine became controversial, which resulted in decreased uptake of the MMR vaccine after 1998. As a consequence, measles outbreaks increased in sizes.
Jansen et al. [7] calculated two effective reproduction numbers R for the UK, regarding the periods 1995-1998 and 1999-2002, respectively. They found that R increased significantly, from 0.47 to 0.82. The epidemiological data consisted of measles cases grouped into outbreaks of size 2 or more. We accounted for the left censoring of the outbreak-size data by renormalizing the probability of observing outbreaks p(n i ,R) as p(n i ,R)/(1−p(1,R)), where Eq 6 provided the probability model. The results are presented in Fig 3. The parameter 2lnB(T a ) shows a marked maximum (c.f. , Fig 3(a)) at the 35 th outbreak, the latest one of 1998. The magnitude of  The impact of amount of data on R evaluation. We considered 120 independent realizations of synthetic datasets with number of outbreaks between 2 and 200. Panels (a) and (b) show the average of 2lnBðΤ Ã a Þ and its 95% confidence interval as a function of tree number for homogeneous and nonhomogeneous T, respectively. In particular, the panels are as follows. (a) All trees in T are generated with R = 0.6. The average 2lnBðΤ Ã a Þ < 10 recommends one R evaluation from T. (b) The first half of the number of outbreaks in T has R a = 0.6, while the second half has R b = 0.85. The average 2lnBðΤ Ã a Þ increases with the number of trees in T. Once above 10 (i.e., starting from datasets of 100 outbreaks or more), it is justified estimating R a,b . The corresponding hRi is not significant (p a = 0.92 and p b = 0.96 for R a,b , respectively) and is explained by the choice of the offspring distribution (i.e. the Poisson or the geometric offspring distribution) rather than by the difference in the methods of data analysis.

Discussion
We propose a method to monitor the effective reproduction number of infectious diseases with sub-threshold transmission. The method may apply to alert and surveillance systems of diseases emerging and/or re-emerging from natural reservoirs. In this case, monitoring an increase in R 0 and/or R may be used to determine the implementation of public-health intervention. The method may also be used to assess the effectiveness of public-health programs designed for disease elimination. In this case, evaluating R before and after implementing intervention may confirm the performance of the public-health program.
We validated our method using synthetic data, of which we presented a few simulations. We also reanalyzed data previously studied by Jansen et al. [7,21], concerning the transmission of measles in the UK during 1995-2002. While our R findings are similar, we also extracted the time when rumors on the MMR vaccine started to affect measles vaccination from the epidemiological data. Of note, in a previous publication, Blumberg et al. analyzed the transmissibility of measles in the US (1997)(1998)(1999) and Canada (1998Canada ( -2001, estimating two values of R from two distinct datasets. However, further epidemiological assumptions are required for a direct comparison of R resulting from the two analyses. Our model has several limitations. Epidemiological data are often meant for estimating incidence or prevalence (e.g., [1,2], [6] pp. 39-65). However, our study requires a particular type of longitudinal data, consisting of outbreak sizes where the outbreaks are ordered according to the date of infection of index patients. In addition, our model implies the appearance of every new index case only after the termination of an outbreak caused by the previous index case (i.e., we assume a sequence of non-overlapping in time transmission trees). These data may result from close surveillance of emerging or re-emerging infections. However, outbreaks may be clustered in time and space and difficult to tell apart. If the number of index patients in each cluster is determined, the model could be amended by using the Borel-Tanner distribution [22] for p(n i ,R) in Eq 6.
Our model provides a scheme to monitor R in the regime of subcritical transmission (i.e., R<1); other methods may be used for estimating R in the regime of supercritical transmission [23]. The quality of R estimation in our model depends on the number of outbreaks; see Fig 3c. Therefore the number of changes in R that can be estimated from a dataset depends on the number of outbreaks. We addressed the case of detecting a single change in R by solving a onedimensional maximization problem for the Bayes factor B(T a ) (c.f., Eq 5). It is no coincidence that (global) maximization problems are divided between one-and multi-dimensional, the first class of problems being significantly easier. However, a number of numerical algorithms are readily available to address maximization in several dimensions [24]. In our case, such algorithms would have to face additional difficulties, inherent to the stochastic nature of the data modeling.
In conclusion, our work proposes a novel method to monitor changes in the effective reproductive number from an epidemiological dataset consisting solely of outbreak sizes.