Browse Subject Areas

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Bayesian Monitoring of Emerging Infectious Diseases

Bayesian Monitoring of Emerging Infectious Diseases

  • Pavel Polyakov, 
  • Romulus Breban


We define data analyses to monitor a change in R, the average number of secondary cases caused by a typical infected individual. The input dataset consists of incident cases partitioned into outbreaks, each initiated from a single index case. We split the input dataset into two successive subsets, to evaluate two successive R values, according to the Bayesian paradigm. We used the Bayes factor between the model with two different R values and that with a single R value to justify that the change in R is statistically significant. We validated our approach using simulated data, generated using known R. In particular, we found that claiming two distinct R values may depend significantly on the number of outbreaks. We then reanalyzed data previously studied by Jansen et al. [Jansen et al. Science 301 (5634), 804], concerning the effective reproduction number for measles in the UK, during 1995–2002. Our analyses showed that the 1995–2002 dataset should be divided into two separate subsets for the periods 1995–1998 and 1999–2002. In contrast, Jansen et al. take this splitting point as input of their analysis. Our estimated effective reproduction numbers R are in good agreement with those found by Jansen et al. In conclusion, our methodology for detecting temporal changes in R using outbreak-size data worked satisfactorily with both simulated and real-world data. The methodology may be used for updating R in real time, as surveillance outbreak data become available.


Incidence and prevalence are standard epidemiological indicators, monitored to understand disease dynamic within society [1, 2]. In the case of infectious diseases, it is customary to measure how far an epidemic is from eradication by calculating yet another epidemiological parameter, the basic reproduction number, denoted by R0 (e.g., [3], pp. 4–5). By definition, R0 represents the average number of secondary cases caused by a typical infectious individual in a fully susceptible population. If R0 is larger than 1, then an outbreak becomes an epidemic; otherwise, it goes extinct. If the population is not fully susceptible, then one calculates the effective reproduction number, denoted by R [4, 5]. However, neither R0 nor R is regularly monitored by public-health authorities (e.g., [6], pp. 39–65).

R0 depends on a number of factors pertaining to the pathogen–host biology, such as pathogen transmissibility and the natural history of disease. In addition, R0 may depend on factors pertaining to host sociology, such as population density and social awareness about epidemics. All these factors may change with time. Changes in R0 may signal that the pathogen has become more transmissible, virulent or persistent in the population. They may also signal societal re-organization in response to the epidemic dynamic. Particularly important are changes in R, which, in addition to changes in R0, may also reflect changes in the susceptibility of the population. Monitoring changes in R may thus be instrumental in determining the success of public-health interventions such as mass vaccination [4, 7, 8].

An epidemiological situation that would benefit from R0 and/or R monitoring is that of a zoonotic pathogen repeatedly introduced in a population where it undergoes subcritical transmission. As the pathogen explores more and more hosts, the opportunity for mutations increases, with increasing chance for pandemic strains to occur [9]. Possible applications of this scenario may be the cases of the pre-pandemic severe acute respiratory syndrome [10] and Middle East respiratory syndrome [11].

Another application of R0 and/or R monitoring is to assessing the success of mass vaccination by surveying disease outbreaks before and after a major epidemiological event, such as implementation of mass vaccination [5] and loss of confidence in vaccination programs [7]. Surveillance and contact tracing provides a means of monitoring outbreaks by permanent registration of new cases. For example, monkeypox [12] and non-zoonotic measles [4] are potential candidates for eradication through vaccination. Monkeypox remains worrisome for the possibility that repeated introductions in the human population may yield novel strains of human poxes [13]. Following vaccine licensing in 1963, measles incidence in the United States decreased by more than 95% [14]. Nevertheless, recent R analysis [8] shows the potential for measles to re-emerge and emphasizes the importance of continued surveillance. In this work, we discuss monitoring R using outbreak-size surveillance data; only minor modifications are needed to apply our methodology to monitoring R0 of emerging infectious diseases.

There exist two major approaches to estimate R from outbreak-size data. In the first approach, R is estimated using the maximum likelihood method. Blumberg et al. [8, 15] used Galton–Watson branching processes to construct the likelihood of observed data consisting of outbreak sizes. By maximizing the likelihood function, they calculated R for monkeypox in the Democratic Republic of Congo (1980–1984) and measles in the United States (2001–2011). Jansen et al. [7] used continuous-time Markov chains to construct the likelihood of observed data. By maximizing the likelihood function, they calculated R for measles outbreaks in the United Kingdom (1995–2002), advocating for an increase in R due to loss of confidence in the vaccination program.

The second approach is based on Bayesian inference, which requires a prior distribution describing the current knowledge on the parameter of interest (e.g., R) and the likelihood of observing the data set according to a model of choice [16]. As a result, one computes a posterior distribution, representing an upgrade of the prior, according to the data. Hence Bayesian inference follows closely the principle of surveillance and learning processes.

Farrington et al. [4] proposed two alternative ways to constructing the posterior probability of R based on (1) outbreak size and (2) outbreak duration. They estimated R for measles (1997–1999) outbreaks in the United States using data from 41 outbreaks caused by a single introduction. Angelov et al. [17] estimated R from a set of clusters sizes, where each cluster consisted from a known number of outbreaks. Following a Bayesian approach, they calculated R for mumps for three different regions of Bulgaria (2005–2008). Yanev et al. [18] described different Bayesian estimators under two different families of loss functions to study multiple outbreaks. They calculated R to describe transmission of smallpox in Europe (1960–1970). Prior knowledge on R was obtained from past data covering the period 1951–1960.

Previous literature focused on extracting a single R value from a given epidemiological dataset. In this work, we describe a methodology to estimate two time-ordered R values from the same set of surveillance data, in the regime of subcritical transmission (R<1); however, the generalization to multiple values is straightforward.

Monitoring Temporal Changes in R

A patient of an outbreak directly infects a number of individuals, so-called secondary cases. In successive generations, each secondary case continues to spread the pathogen, causing secondary cases by themselves. The network of who has infected whom has a tree structure, which, for emerging or re-emerging diseases, may be described using the theory of branching processes [19]. We assume that the data set, denoted by T, is an array of N sizes of trees (i.e., outbreak sizes) ordered by the infection time of the index patient. We analyze the transmission process in terms of the effective reproduction number, R, which may change with time due to biological (e.g., pathogen transmissibility and virulence) and/or sociological (e.g., frequency of contact between individuals) factors.

General theory

We first briefly present how to compute a single value of R from an epidemiological dataset, in the Bayesian paradigm. To express lack of knowledge about R, we use a non-informative improper prior π(R) = 1, where R is uniformly distributed from zero to infinity. We construct the likelihood L(T|R) that the dataset T is observed, assuming that the effective reproduction number is R. Given that the detected trees have sizes n1,…,nN, ordered according to the date of infection of index patients, we have [15, 19] (1) where p(ni,R) is the probability that the single transmission tree i has size ni [19]. The posterior distribution of R for the observed dataset T is calculated according to Bayes’ rule [4] (2) which yields (3)

The posterior distribution was used to calculate the average effective reproduction number 〈R〉 and its 95% credible interval (CI).

The quality of R statistics depends not only on the number of trees but also on their sizes. In short, we summarize the amount of data using the concept of information. According to the definition (e.g., [16], pp. 32), information is given by the logarithm of the probability to observe the given dataset T. In our case, the information, denoted by IT, is given by the natural logarithm of the likelihood L(T|R) (c.f. Eq 1) (4) so that information is measured in natural units (i.e., nat). Hence, information is presented as the sum of N contributions, one for each tree in T (i.e., an extensive quantity over the number of trees). For a numerical estimate of IT corresponding to the dataset, we used the average of R (i.e., 〈R〉) obtained from the Bayesian framework.

Monitoring the pandemic risk of emerging infectious diseases, one extracts at least two time-ordered R values from the same dataset T. Hence, the dataset T is divided into two time-ordered subsets Ta and Tb = T\Ta. The Bayesian approach described above (c.f. Eq 3) is then applied independently to each subset Ta,b to estimate two effective reproduction numbers Ra,b. Obviously, the estimates Ra,b depend on the selection of Ta,b. For example, if Ta is nearly all T then Rb is badly estimated.

To justify the choice of extracting two R values from T, we proceed as follows. We denote by HT the model with a single R estimate and by Ha,b the model with two R estimates. Given Ra,b, the likelihood for the dataset Ta∪Tb is given by L(Ta|Ra)L(Tb|Rb) [c.f., Eq 1 for L(T|R)]. We evaluate whether Ha,b is more plausible than HT by calculating the Bayes factor [20], using the corresponding likelihoods. When our initial beliefs are a priori equally probable, pr(Ha,b) = pr(HT), the Bayes factor (5) expresses how well the observed data were predicted by Ha,b, compared to HT; i.e., the higher the value of B(Ta), the more is justified to extract two R values rather than one from T. Kaas et al. [20] provided an interpretation of the strength of the second model Ha,b in terms of four categories according to the gradation of 2lnB(Ta). They suggested a very strong preference for Ha,b if 2lnB(Ta)>10. In our model, we make the same decision. We extract two Ra,b values when 2lnB(Ta)>10; otherwise, we extract a single value of R from T.

Our R analysis is performed according to the following steps. For every i = 1,…,(N−1):

  1. Let Ta consist of the first i trees in T and Tb = T\Ta;
  2. Estimate the pair of posterior distributions using Eq 3; Numerical integration of the normalization constant was performed using the trapezoidal rule.
  3. Use the average 〈Ra,b as an estimate of the parameters Ra,b.
  4. Calculate B(Ta) using Eq 5.

Finally, we denote by the sets Ta,b which yield the largest value of 2lnB(Ta). If , we accept 〈Ra,b as the best estimates of effective reproduction numbers on and we denote them by ; otherwise, we calculate a single R value on T using Eq 3.

Numerical tests using synthetic data

Synthetic data consisting of arrays of sizes of transmission trees were simulated, assuming that the number of cases caused by each infected individual is a Poisson deviate with average R. Tree sizes are random integers given by a distribution p(ni,R), obtained from the probability-generation function for the Galton–Watson branching processes according to Pitman [19]. For the Poisson offspring distribution, we obtain (6)

The synthetic dataset, consisting of sizes of N transmission trees with known R, was used to validate our methodology of estimating Ra,b. The analytical approach according to Eq 3 yields (7) where is the average size of the trees in T. The corresponding standard deviation (8) becomes negligible when N≫1. Eq 7 gives R = 1 if T consists of a single tree. For large tree number in T, the third term goes to zero and Eq 7 becomes identical to the formula derived from the maximum likelihood L(T|R) method.

Using Eq 7, we estimate the change of R in real time, as new epidemiological data become available. Suppose we add a new tree of size n to T. The corresponding change in R, (9) shows that, for a sufficiently large tree number in the dataset N≫1, the sign of the change in R is determined by the difference between the size n of the newly added tree and .

We now proceed with the discussion of our simulations. In the first example, we generated a homogeneous dataset T, consisting of 100 trees with R equal to 0.6. Fig 1(a) shows 2lnB(Ta) as a function of information (c.f. Eq 4) in the first subset Ta. The observed maximum is ~3, indicating weak justification for calculating two R values. Hence we concluded that this dataset should be assigned a single R estimate.

Fig 1. Ra,b values for synthetic datasets T.

We considered simulated datasets consisting of 100 outbreak sizes; the vertical arrows show the position of the 50th tree. Panel (a) corresponds to R = 0.6 and shows 2lnB(Ta) as a function of information in Ta, as Ta increases from including the first tree only to including the first N−1 trees. The maximum of 2lnB(Ta) is below 10, suggesting preference for a single R estimate over the whole dataset. Panel (b) is similar to panel (a), except T consisted of 50 trees with Ra = 0.6 and 50 trees with Rb = 0.85. Values of 2lnB(Ta) larger than 10 show where the dataset T can be split into Ta,b with justification for evaluating Ra,b. The corresponding Ra,b values are shown in the panel (c) as a function of 2lnB(Ta).

In the second example, we assumed a stepwise increase of R, modeling adaptation of the pathogen to human-to-human transmission. We generated a non-homogeneous synthetic dataset where 50 trees were generated with Ra = 0.6, while the remaining 50 trees were generated with Rb = 0.85. The parameter 2lnB(Ta) (c.f. Fig 1(b)) reaches its maximum at the 55th tree. Contrary to the previous example, the observed maximum of 2lnB(Ta) is ~14, suggesting strong justification for calculating Ra,b. The best Ra,b estimates, denoted by , have non-overlapping error bars (95% CI) and are in satisfactory agreement with the numerical choices for the parameters when 2lnB(Ta)>10 (c.f. Fig 1(c)).

In order to clarify the minimal size of the initial dataset required to get reliable estimations of Ra,b, we performed evaluations of for statistically independent synthetic datasets T, containing from 2 to 200 trees. For each size of T, we averaged 120 independent realizations to alleviate stochastic effects. Fig 2(a) shows the average and its 95% confidence interval as a function of the total number of trees in the homogeneous datasets, when R equal to 0.6. The values are below 10 and the featured dependence is not particularly sensitive to the number of trees in T.

Fig 2. The impact of amount of data on R evaluation.

We considered 120 independent realizations of synthetic datasets with number of outbreaks between 2 and 200. Panels (a) and (b) show the average of and its 95% confidence interval as a function of tree number for homogeneous and non-homogeneous T, respectively. In particular, the panels are as follows. (a) All trees in T are generated with R = 0.6. The average recommends one R evaluation from T. (b) The first half of the number of outbreaks in T has Ra = 0.6, while the second half has Rb = 0.85. The average increases with the number of trees in T. Once above 10 (i.e., starting from datasets of 100 outbreaks or more), it is justified estimating Ra,b. The corresponding values, calculated for each dataset T, are shown in the panel (c), as a function of the number of outbreaks in T.

The situation is quite different for non-homogeneous datasets. As an example, we generated a set of trees T with R equal to 0.6 and 0.85 for the first and second halves, respectively. The results are presented in Fig 2(b), where the average increases with the number of trees in T. For a low number of trees (up to ~100 trees in T), the average of is less than 10, indicating no preference to estimate two R values. For a high number of trees, the average of is larger than 10, demonstrating that the model with two R values is superior. For small sizes of T (c.f. Fig 2(c)), average estimates of display strong fluctuations. Starting with ~50 trees per dataset T, R estimates are neater. With further increasing the size of T, R estimates are more grouped around the exact values and the error bars (95% CI) are smaller.

Application to epidemiological data

We applied our method of evaluating two R values from an epidemiological dataset. We aimed to reproduce recent results obtained by Jansen et al. [7], concerning the transmission dynamics of measles in the UK during 1995–2002. Although measles-elimination programs were set worldwide, measles eradication has not yet been achieved. In the late nineties, the safety of a combined measles-mumps-rubella (MMR) vaccine became controversial, which resulted in decreased uptake of the MMR vaccine after 1998. As a consequence, measles outbreaks increased in sizes.

Jansen et al. [7] calculated two effective reproduction numbers R for the UK, regarding the periods 1995–1998 and 1999–2002, respectively. They found that R increased significantly, from 0.47 to 0.82. The epidemiological data consisted of measles cases grouped into outbreaks of size 2 or more. We accounted for the left censoring of the outbreak-size data by renormalizing the probability of observing outbreaks p(ni,R) as p(ni,R)/(1−p(1,R)), where Eq 6 provided the probability model. The results are presented in Fig 3. The parameter 2lnB(Ta) shows a marked maximum (c.f., Fig 3(a)) at the 35th outbreak, the latest one of 1998. The magnitude of 2lnB(Ta) is ~14, justifying the split of the 1995–2002 dataset into two separate subsets for the periods 1995–1998 and 1999–2002. Jansen et al. [7] used the same split of the dataset, based on information about measles-vaccination coverage and the MMR controversy. Our analysis yielded the same conclusion, based on the measles outbreak data alone.

Fig 3. Ra,b values for measles in the UK during 1995–2002.

The dataset consists of 78 trees “Jansen et al. [7]”. (a) 2lnB(Ta) versus information in Ta. Since 2lnB(Ta)>10, we evaluate Ra,b. The vertical arrow indicates the best splitting point of T, determined by the maximum of 2lnB(Ta). (b) 〈Ra,b values (95% CI), plotted as a function of 2lnB(Ta)>, were calculated for all frames Ta,b where 2lnB(Ta)>10.

Fig 3(b) shows Ra,b (95% CI) as a function of 2lnB(Ta). Both Ra,b have nearly constant values for 2lnB(Ta)>10. For the maximum value of 2lnB(Ta)>10, we evaluated as 0.54 (90%CI 0.43–0.66) and 0.86 (90%CI 0.79–0.93), respectively. The quantitative difference with the results by Jansen et al. [7, 21] (Ra = 0.47 (90%CI 0.36–0.55) and Rb = 0.82 (90%CI 0.71–0.87)) is not significant (pa = 0.92 and pb = 0.96 for Ra,b, respectively) and is explained by the choice of the offspring distribution (i.e. the Poisson or the geometric offspring distribution) rather than by the difference in the methods of data analysis.


We propose a method to monitor the effective reproduction number of infectious diseases with sub-threshold transmission. The method may apply to alert and surveillance systems of diseases emerging and/or re-emerging from natural reservoirs. In this case, monitoring an increase in R0 and/or R may be used to determine the implementation of public-health intervention. The method may also be used to assess the effectiveness of public-health programs designed for disease elimination. In this case, evaluating R before and after implementing intervention may confirm the performance of the public-health program.

We validated our method using synthetic data, of which we presented a few simulations. We also reanalyzed data previously studied by Jansen et al. [7, 21], concerning the transmission of measles in the UK during 1995–2002. While our R findings are similar, we also extracted the time when rumors on the MMR vaccine started to affect measles vaccination from the epidemiological data. Of note, in a previous publication, Blumberg et al. analyzed the transmissibility of measles in the US (1997–1999) and Canada (1998–2001), estimating two values of R from two distinct datasets. However, further epidemiological assumptions are required for a direct comparison of R resulting from the two analyses.

Our model has several limitations. Epidemiological data are often meant for estimating incidence or prevalence (e.g., [1, 2], [6] pp. 39–65). However, our study requires a particular type of longitudinal data, consisting of outbreak sizes where the outbreaks are ordered according to the date of infection of index patients. In addition, our model implies the appearance of every new index case only after the termination of an outbreak caused by the previous index case (i.e., we assume a sequence of non-overlapping in time transmission trees). These data may result from close surveillance of emerging or re-emerging infections. However, outbreaks may be clustered in time and space and difficult to tell apart. If the number of index patients in each cluster is determined, the model could be amended by using the Borel–Tanner distribution [22] for p(ni,R) in Eq 6.

Our model provides a scheme to monitor R in the regime of subcritical transmission (i.e., R<1); other methods may be used for estimating R in the regime of supercritical transmission [23]. The quality of R estimation in our model depends on the number of outbreaks; see Fig 3c. Therefore the number of changes in R that can be estimated from a dataset depends on the number of outbreaks. We addressed the case of detecting a single change in R by solving a one-dimensional maximization problem for the Bayes factor B(Ta) (c.f., Eq 5). It is no coincidence that (global) maximization problems are divided between one- and multi-dimensional, the first class of problems being significantly easier. However, a number of numerical algorithms are readily available to address maximization in several dimensions [24]. In our case, such algorithms would have to face additional difficulties, inherent to the stochastic nature of the data modeling.

In conclusion, our work proposes a novel method to monitor changes in the effective reproductive number from an epidemiological dataset consisting solely of outbreak sizes.


We thank Prof V.A.A. Jansen (Royal Holloway University of London) for the epidemiological data. We thank the reviewers for their helpful comments. P.P. is grateful for financing in the form of a postdoctoral scholarship from Agence Nationale de la Recherche (Labex Integrative Biology of Emerging Infectious Diseases).

Author Contributions

Conceived and designed the experiments: RB PP. Performed the experiments: RB PP. Analyzed the data: PP. Wrote the paper: RB PP.


  1. 1. World Health Organization (WHO), Emerging diseases. Available:
  2. 2. World Health Organization (WHO), Disease incidence, prevalence and disability. Available:
  3. 3. Diekmann O, Heesterbeek H, Britton T. Mathematical tools for understanding infectious disease dynamics. Princeton: Princeton University Press; 2013.
  4. 4. Farrington CP, Kanaan MN, Gay NJ. Branching process models for surveillance of infectious diseases controlled by mass vaccination. Biostatistics. 2003;4(2):279–95. pmid:12925522
  5. 5. Farrington CP, Whitaker HJ. Estimation of effective reproduction numbers for infectious diseases using serological survey data. Biostatistics. 2003;4(4):621–32. pmid:14557115
  6. 6. Anderson RM, May RM, Microparasites. In: Infectious diseases of humans. New York: Oxford University Press Inc.; 2004. pp. 27–154.
  7. 7. Jansen VAA, Stollenwerk N, Jensen HJ, Ramsay ME, Edmunds WJ, Rhodes CJ. Measles outbreak in a population with declining vaccine uptake. Science. 2003;301(5634):804. pmid:12907792
  8. 8. Blumberg S, Enanoria WT, Lloyd-Smith JO, Lietman TM, Porco TC. Identifying postelimination trends for the introduction and transmissibility of measles in the United States. Am J Epidemiol. 2014;179(11):1375–82. pmid:24786800
  9. 9. Antia R, Regoes RR, Koella JC, Bergstrom CT. The role of evolution in the emergence of infectious diseases. Nature. 2003;426(6967):658–61. pmid:14668863
  10. 10. Lau SK, Woo PC, Li KS, Huang Y, Tsoi HW, Wong BH, et al. Severe acute respiratory syndrome coronavirus-like virus in Chinese horseshoe bats. PNAS. 2005;102(39):14040–45. pmid:16169905
  11. 11. Cotten M, Watson SJ, Zumla AI, Makhdoom HQ, Palser AL, Ong SH, et al. Spread, circulation, and evolution of the Middle East respiratory syndrome coronavirus. MBio. 2014;5(1):e01062–13. pmid:24549846
  12. 12. Fine PE, Jezek Z, Grab B, Dixon H. The transmission potential of monkeypox virus in human populations. Int J Epidemiol. 1988;17(3):643–50. pmid:2850277
  13. 13. Rimoin AW, Mulembakani PM, Johnston SC, Lloyd Smith JO, Kisalu NK, Kinkela TL, et al. Major increase in human monkeypox incidence 30 years after smallpox vaccination campaigns cease in the Democratic Republic of Congo. PNAS. 2010;107(37):16262–7. pmid:20805472
  14. 14. Hamborsky J, Kroger A, Wolfe S, Measles. In: Epidemiology and Prevention of Vaccine-Preventable Diseases. 13th ed. 1600 Clifton Rd Atlanta: Washington D.C. Public Health Foundation, Centers for Disease Control and Prevention; 2015. pp. 209–28.
  15. 15. Blumberg S, Lloyd-Smith JO. Inference of R0 and transmission heterogeneity from the size distribution of shuttering chains. PLoS Comput Biol. 2013;9(5):e1002993. pmid:23658504
  16. 16. MacKay DJC. Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press; 2003.
  17. 17. Angelov AG, Slavtchova-Bojkova M. Bayesian estimation of the offspring mean in branching processes: Application to infectious disease data. Comput Math Appl. 2012;64(3):229–35.
  18. 18. Yanev GP, Tsokov CP. Decision—theoretic estimation of the offspring mean in mortal branching processes. Communications in Statistics Stochastic Models. 1999;15(5):889–902.
  19. 19. Pitman J. Enumerations of trees and forests related to branching processes and random walks. DIMACS Ser Discret M. 1998;41:163–80.
  20. 20. Kaas RE, Raftery AE. Bayes factors. J Am Statist Assoc. 1995;90(430):773–95.
  21. 21. Jansen VAA, Stollenwerk N, Modeling measles outbreaks. In: Haccou P, Jagers P, Vatutin VA. Branching Processes. Variation, Growth, and Extinction of Polualtions. New York: Cambridge University Press; 2005. pp. 236–49.
  22. 22. Haigh FA, Breuer MA. The Borel–Tanner distribution. Biometrika. 1960;47:143–50.
  23. 23. Heffernan JM, Smith RJ, Wahl LM. Perspectives on the basic reproductive ratio. J R Soc Interface. 2005;2:281–93. pmid:16849186
  24. 24. Press WH, Teukolsky SA, Vetterling WT, Flannery BP, Minimization or maximization of functions. In: Numerical recipes. The art of scientific computing. 3rd ed. Hong Kong: Cambridge University Press; 2007. pp. 487–555.