Nested Sampling for Bayesian Model Comparison in the Context of Salmonella Disease Dynamics

Understanding the mechanisms underlying the observed dynamics of complex biological systems requires the statistical assessment and comparison of multiple alternative models. Although this has traditionally been done using maximum likelihood-based methods such as Akaike's Information Criterion (AIC), Bayesian methods have gained in popularity because they provide more informative output in the form of posterior probability distributions. However, comparison between multiple models in a Bayesian framework is made difficult by the computational cost of numerical integration over large parameter spaces. A new, efficient method for the computation of posterior probabilities has recently been proposed and applied to complex problems from the physical sciences. Here we demonstrate how nested sampling can be used for inference and model comparison in biological sciences. We present a reanalysis of data from experimental infection of mice with Salmonella enterica showing the distribution of bacteria in liver cells. In addition to confirming the main finding of the original analysis, which relied on AIC, our approach provides: (a) integration across the parameter space, (b) estimation of the posterior parameter distributions (with visualisations of parameter correlations), and (c) estimation of the posterior predictive distributions for goodness-of-fit assessments of the models. The goodness-of-fit results suggest that alternative mechanistic models and a relaxation of the quasi-stationary assumption should be considered.


Model comparison
Model-based inference is widely used in the life sciences to assess the plausibility of hypothesised biological mechanisms based on data from observations or experiments. One of the most common approaches to comparing competing models representing alternative hypotheses relies on Akaike's Information Criterion (AIC) [1]. For a given data set D, the plausibility of the candidate models M_i is assessed by calculating their respective AIC values,

AIC_i = −2 ln L(θ_MLE,i) + 2 n_df,  (1)

where θ_MLE,i is the maximum likelihood estimate of the set of parameters associated with model M_i, and n_df is the corresponding number of degrees of freedom. If AIC_1 < AIC_2 then M_1 is more plausible than M_2, with respect to D, in the sense that the Kullback-Leibler divergence of M_1 from the true model is smaller [2]. An important drawback of this classic approach to model choice is that it is based on a single point estimate θ_MLE,i of θ, the uncertainty in θ being ignored. In contrast, the Bayesian approach considers a probability distribution for θ, with p(θ^(i) | D, M_i) expressing the uncertainty in θ^(i) given D (for a model M_i).
Suppose that we wish to select a model from a set of candidate models {M_1, ..., M_m} given our observation of data D. We can express this goal probabilistically by stating that the aim is to determine the most probable model: arg max_{M_i} p(M_i | D).
From Bayes' theorem, we have

p(M_i | D) = p(D | M_i) p(M_i) / p(D);  (2)

therefore, if p(M_i) is known, or considered to be equal for all M_i, the focus is on the model evidence p(D | M_i).
If θ^(i) is the set of parameters associated with model M_i, the Bayesian approach to p(D | M_i) is to integrate over all possible values of θ^(i):

p(D | M_i) = ∫ L(θ^(i)) p(θ^(i) | M_i) dθ^(i),  (3)

where L(θ^(i)) = p(D | θ^(i), M_i). In addition to allowing for parameter uncertainty, (3) intrinsically penalises models that fit the observed data better only by virtue of their greater complexity [3], thereby removing the need for an explicit complexity-penalisation term.
The integral in (3) can be estimated analytically or numerically. In analytical approaches, the integral is approximated by adopting simplifying assumptions, as in the derivation of the Bayesian Information Criterion [4]. Numerical approaches are based on some form of Monte Carlo sampling, such as Gibbs sampling [5].
One approach to estimating the integral Z = ∫ L(θ) p(θ) dθ numerically is to sample θ randomly from its prior; however, the prior p(θ) is often concentrated in regions where the likelihood L(θ) is relatively low. This problem becomes more severe in high-dimensional parameter spaces, or in problems where the likelihood function L(θ) is concentrated in a very small region.
To overcome this problem, Skilling [6,7] proposed a means of estimating ∫_Θ L(θ) p(θ) dθ that, by design, samples θ sparsely from the regions of θ-space where the likelihood L(θ) is low, and densely where L(θ) is high, by means of 'nested sampling', which is the focus of this paper. A recent addition to the Bayesian arsenal, nested sampling has been used in cosmology to compare alternative models of the universe against observed data [8]. Outside of physics, it has so far received little attention [9,10].

Within-host dynamics of a bacterial infection
Quantitative research on infectious disease dynamics has undergone rapid development over the last two decades, motivated by concerns about emerging infections that can spread globally and about the evolution of pathogens resistant to existing control measures such as antimicrobials and vaccines. Bayesian computation has become the method of choice to fit stochastic dynamic models to epidemiological [11] or experimental datasets [12]. This is in large part due to the appeal of being able to produce measures of uncertainty and correlation for the model parameters based on their posterior probability distributions. Similarly, models for within-host dynamics of infection have more recently started to benefit from Bayesian inference approaches [13].
Salmonella enterica causes systemic diseases (typhoid and paratyphoid fever) [14], food-borne gastroenteritis and non-typhoidal septicaemia (NTS) [15] in humans and in many other animal species worldwide, and is also a very serious problem for the food industry. The global burden of typhoid fever is estimated at ca. 22 million cases and ca. 200,000 deaths per year [14,16]. Paratyphoid causes an estimated 5.4 million illnesses worldwide [16]. The high incidence of these diseases, which affect both travellers to and residents of endemic areas, and threaten infants, children and immunodeficient patients, dictates the urgent need for more efficacious preventive and therapeutic measures.
In the mouse model of systemic infection, Salmonella reside and proliferate mainly within phagocytic cells of the spleen, liver, bone marrow and lymph nodes [17][18][19]. Observation of Salmonella by fluorescence microscopy in the tissues of mice has revealed that a key feature of systemic infection with wild-type bacteria is the presence, on average, of low bacterial numbers within individual phagocytes, irrespective of net bacterial growth rate and time since infection [20][21][22][23].
In an effort to understand the dynamics that underpin the intracellular numerical distributions of Salmonella within the host cells, and to capture the essential traits of the cell-to-cell spread of the bacteria, we have used mathematical model frameworks for the intensity of intracellular infection that link the quasi-stationary distribution of bacteria to bacterial and cellular demography. An example of this is the work of Brown et al. [24], who compared the observed distribution {C_n}, where C_n is the number of cells with n bacteria, across 14 candidate infection models: (a) one homogeneous model, in which, for every cell, burst occurred only when the number of bacteria n in a cell reached a single burst threshold N_max; (b) five heterogeneous models having a probability distribution of burst thresholds; and (c) eight stochastic models for which there is a probability that a given cell will undergo burst. Two datasets were analysed, one for a virulent strain of bacteria and the other for an attenuated strain. Brown et al. [24] computed the maximum likelihood estimates of the parameters of each model, and selected the 'best' model based on the corresponding AIC values.
In order to overcome the issues raised by AIC discussed above, we decided to re-analyse the datasets and re-assess the models within a Bayesian framework.

Methods
What follows is an elaboration of the description of nested sampling given by Skilling [6,7].

Nested sampling
The expected value of a function g of a random variable X is given by

E[g(X)] = ∫ g(x) f_X(x) dx,  (4)

where f_X is the pdf of X. Comparing this expression with the target integral

Z = ∫ L(θ) p(θ) dθ  (5)

shows that Z is the expected value of the likelihood under the prior. The cumulative distribution function of a random variable X is defined by F_X(x) = P(X ≤ x) and, for a non-negative X, is related to the expectation E(X) by E(X) = ∫_0^∞ (1 − F_X(x)) dx [25]; consequently, from (5), we obtain the important relationship

Z = ∫_0^∞ (1 − F_L(l)) dl,  (6)

where l denotes a likelihood value and L is the random variable L(θ). The reason why (6) is important is that the multivariate integral on the left-hand side has been equated to a univariate integral.
Since θ has the distribution defined by the prior p, and L = L(θ), it follows that L has a probability distribution and thus a cumulative distribution function, which appears in the integrand on the right-hand side of (6). We can replace ∫_0^∞ (1 − F_L(l)) dl in (6) with a more accessible integral by the following steps. First, since the pdf of L is connected to the pdf of θ via L(θ), we can write

1 − F_L(l) = P(L(θ) > l) = ∫_{L(θ)>l} p(θ) dθ;  (7, 8)

thus, from (6), (7) and (8), we can write

Z = ∫_0^∞ ( ∫_{L(θ)>l} p(θ) dθ ) dl.  (9)

It will be convenient to rewrite the inner integral of (9) as w(l), to give

Z = ∫_0^∞ w(l) dl,  (10)

where

w(l) = ∫_{L(θ)>l} p(θ) dθ  (11)

is the probability of selecting θ from the prior p such that L(θ) > l. Introducing ξ = w(l), hence l = w⁻¹(ξ), we can rewrite the previous integral as

Z = ∫_0^1 w⁻¹(ξ) dξ,  (12)

where w⁻¹(ξ) is that likelihood l such that P(L(θ) > l) = ξ (cf. Equation (11)); for example, if w⁻¹(0.9) = 0.0042 then 90% of θ drawn from the prior p(θ) will have likelihoods greater than 0.0042.
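The change of variables leading to (12) can be checked numerically on a toy problem. The sketch below (our illustration, not part of the original analysis) uses a uniform prior on [0, 1] and the likelihood L(θ) = θ², for which Z = 1/3 and w⁻¹(ξ) = (1 − ξ)²; the multivariate and univariate forms of the integral agree.

```python
def midpoint(f, n=100_000):
    """Midpoint-rule quadrature of f over [0, 1]."""
    h = 1.0 / n
    return sum(f((i + 0.5) * h) for i in range(n)) * h

# Direct evidence: uniform prior on [0, 1], likelihood L(theta) = theta^2,
# so Z = integral of theta^2 over [0, 1] = 1/3.
z_direct = midpoint(lambda t: t * t)

# Univariate form of Eqn (12): here w(l) = P(theta^2 > l) = 1 - sqrt(l),
# so w^{-1}(xi) = (1 - xi)^2, and its integral over xi in [0, 1] must match.
z_xi = midpoint(lambda x: (1.0 - x) ** 2)

print(z_direct, z_xi)  # both ≈ 1/3
```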

The algorithm
The main steps of the nested sampling technique are as follows. First, n points θ (i.e., parameter vectors) are sampled from the prior p, and their corresponding likelihoods L(θ) are determined. The point θ_min,1 having the smallest likelihood is identified and its likelihood l_min,1 is recorded. Furthermore, the probability ξ_1 that L(θ) > l_min,1 is also recorded. Point θ_min,1 is then replaced by a new θ drawn from the prior p but restricted to those θ for which L(θ) > l_min,1. In other words, a restricted prior is used: p | (L(θ) > l_min,1). If Ω is the set of all possible θ then the set Ω_1 = {θ : L(θ) > l_min,1} is a subset of Ω.
The above sequence of determining θ_min and the corresponding ξ is performed on the new set of points, giving rise to l_min,2 and ξ_2. Point θ_min,2 is replaced by a θ drawn from the new restricted prior p | (L(θ) > l_min,2). In other words, θ is sampled from Ω_2 = {θ : L(θ) > l_min,2}, for which Ω_2 ⊂ Ω_1.
This cycle is repeated until some stopping criterion has been reached. If termination occurs at the J-th iteration then the resulting values of l_min,i and ξ_i will satisfy l_min,J > l_min,J−1 > ... > l_min,1, and the resulting sequence of subsets is Ω_J ⊂ Ω_J−1 ⊂ ... ⊂ Ω_1; hence the term nested sampling.
Model evidence Z = ∫_Θ L(θ) p(θ) dθ can be estimated from the recorded l_min,i and ξ_i values by means of the approximation

Ẑ = Σ_{i=1}^{J} l_min,i (ξ_{i−1} − ξ_i),  with ξ_0 = 1,  (13)

where J is the number of iterations used, and each term ΔẐ_i = l_min,i (ξ_{i−1} − ξ_i) is a vertical rectangular segment under the curve of Figure 1.
Algorithm 1 (Table 1) describes the above process in pseudocode.
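As a concrete illustration of the loop above, the following minimal Python sketch runs nested sampling on a one-dimensional toy problem: a uniform prior on [0, 1] with a Gaussian likelihood bump of width 0.1 centred at 0.5, so the true evidence is Z ≈ 0.1·√(2π) ≈ 0.251. The deterministic shrinkage ξ_i = e^{−i/n}, the simple rejection step, and the stopping rule are illustrative choices for this sketch, not the exact scheme used in the paper.

```python
import math
import random

def log_likelihood(theta):
    # Gaussian bump centred at 0.5 with width 0.1; true Z ≈ 0.1 * sqrt(2*pi)
    return -0.5 * ((theta - 0.5) / 0.1) ** 2

def sample_constrained(logl_min):
    # Draw from the restricted prior p | (L(theta) > l_min) by simple
    # rejection; adequate in one dimension, far too slow in many dimensions.
    while True:
        theta = random.random()
        if log_likelihood(theta) > logl_min:
            return theta

def nested_sampling(n_live=100, f=1e-3):
    live = [random.random() for _ in range(n_live)]
    logl = [log_likelihood(t) for t in live]
    z, xi_prev, i = 0.0, 1.0, 0
    # stop once the best active point can no longer contribute more than f*Z
    while z == 0.0 or math.exp(max(logl)) * xi_prev >= f * z:
        i += 1
        k = min(range(n_live), key=logl.__getitem__)   # worst active point
        logl_min = logl[k]
        xi = math.exp(-i / n_live)                     # expected shrinkage
        z += math.exp(logl_min) * (xi_prev - xi)       # one rectangle of (13)
        xi_prev = xi
        live[k] = sample_constrained(logl_min)         # replace worst point
        logl[k] = log_likelihood(live[k])
    # add the residual contribution of the surviving active points
    z += xi_prev * sum(math.exp(v) for v in logl) / n_live
    return z
```

For this problem the estimate typically lands within roughly 15% of the true value; the statistical error shrinks as the number of active points n grows.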

Practical adjustments to the algorithm
We will now consider how some aspects of Algorithm 1 can be implemented. The sum in (13) could be evaluated by the trapezoidal approach, but Sivia and Skilling [26] have found the simple rectangular weights Δξ_i = ξ_{i−1} − ξ_i to be adequate (line 9 in Algorithm 1).
Line 7 in Algorithm 1 used the assignment ξ_i ← P(L(θ) > l_min,i), but an alternative approach is to replace this assignment with ξ_i ← E[ξ_i]. An approximation of E[ξ_i] is derived as follows. Let t_i denote the ratio ξ_i / ξ_{i−1}, with ξ_0 = 1. At the k-th iteration we have

ξ_k = Π_{i=1}^{k} t_i.  (14)

Now, each t_i is distributed as the largest of n draws from Uniform(0, 1), so E[t_i] = n/(n + 1); therefore, from (14), E[ξ_k] = (n/(n + 1))^k. As regards the termination of Algorithm 1, there is no rigorous criterion as to when the algorithm should be stopped, but Skilling [7] and Feroz and Hobson [28] have found it effective to stop once the largest possible remaining contribution, the maximum active-point likelihood multiplied by ξ_k, falls below f Ẑ, where f is the fraction of Z that will not significantly contribute to the estimate of Z (according to a user-defined value). Chopin and Robert [29] have shown that the asymptotic variance of the nested sampling approximation typically grows linearly with the parameter dimension. The inputs to the algorithm are: (a) the likelihood function L(θ); (b) the prior p(θ); (c) the number n of active parameter vectors in use during nested sampling; (d) a procedure for determining a region R(S) of parameter space that encloses a set of parameter vectors S; and (e) the fraction f of Z to be estimated.
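The result E[t_i] = n/(n + 1) holds because, within the current constrained region, each of the n active points is uniformly distributed in prior mass, so t_i is the largest of n independent Uniform(0, 1) variates. A quick simulation (our illustration) confirms this:

```python
import random

random.seed(0)
n, reps = 50, 50_000
# t is the largest of n independent Uniform(0, 1) draws;
# its theoretical mean is n / (n + 1).
mean_t = sum(max(random.random() for _ in range(n)) for _ in range(reps)) / reps
print(mean_t, n / (n + 1))  # both ≈ 0.9804
```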
Finally, there is the structure of the restricted priors. Each new point θ_new for a set S of active points is sampled from the prior p conditioned on the restriction that L(θ_new) > l_min. Rather than searching across the entire θ-space for such a point, it is more computationally efficient to restrict the search to a region R(S) that contains S. We have used rectangular cuboids for R(S).
Incorporating the above points into Algorithm 1 leads to Algorithm 2 (Table 2). Before applying the algorithm to our experimental datasets, we tested it on a simple two-parameter likelihood function L(a, b) ∝ a^30 (1 − a)^30 b^30 (1 − b)^30. The analyses and results are presented in Methods S1.

The Salmonella models
The evidence p(D | M) was estimated by nested sampling for two groups of models of within-host S. enterica infection, where each model M provides an expression for the probability q(n | θ, M) that a host cell contains n bacteria.
In the first group of models, infected cells are assumed to burst when the number of bacteria they contain reaches a fixed threshold N. The probability distributions considered for N are shown in Table 3.
For the second group of models, the assumption is that, instead of pre-programmed burst thresholds N, there is a burst rate m that is a function of the number of bacteria n in a cell. For these models, the general relationship is

m(n; m_0, m_1, m_2) = m_0 + m_1 n + m_2 n²,  (15)

where m_0, m_1, m_2 ∈ [0, ∞). Furthermore, the rate of bacterial replication a_n is assumed to depend on n through the parameters a_0 and a_e ∈ [0, ∞) (Eqn (16)). As explained in Brown et al. [24], in the dynamic model time can be re-scaled by the baseline replication rate a_0, so this parameter cannot be estimated from the quasi-stationary distribution. For convenience, we set a_0 = 1, so that the values of the other parameters are relative to the baseline replication rate. The parameters of the eight stochastic models considered are shown in Table 4. Under the assumption that the number of host cells infected by n bacteria reaches a quasi-stationary distribution, the probability q(n | θ, M) that a cell contains n bacteria can be derived for the 14 models [30]; for Model 1 the relationship is given by Eqn (17), and for Models 2 to 6 by Eqn (18). When bacterial replication is not dependent on n, a_e = 0, in which case c = 1; but when replication is density dependent, (19) and (20) need to be solved self-consistently. This can be done by assuming an initial value for c, computing q(n | θ, M) from (19), updating c using (20), and repeating iteratively until c no longer changes significantly. This process is shown in Algorithm 3 (Table 5).
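The self-consistency loop of Algorithm 3 is an ordinary fixed-point iteration. A generic sketch follows; the callables q_of_c and c_update are placeholders for the model-specific Eqns (19) and (20), which are not reproduced here.

```python
def solve_self_consistently(q_of_c, c_update, c0=1.0, tol=1e-10, max_iter=1000):
    """Generic fixed-point iteration mirroring Algorithm 3: q_of_c maps the
    current c to the distribution q(.|theta, M) (standing in for Eqn (19)),
    and c_update maps that distribution back to a new c (standing in for
    Eqn (20)).  Iterate until c stops changing."""
    c = c0
    for _ in range(max_iter):
        q = q_of_c(c)
        c_new = c_update(q)
        if abs(c_new - c) < tol:
            return c_new, q
        c = c_new
    raise RuntimeError("self-consistency loop did not converge")
```

If the plain iteration oscillates, damping the update (averaging c_new with the previous c) is a common remedy.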

Likelihood function
With expressions for q(n | θ, M) established for all the models, we can now determine the likelihood L(θ) required for Algorithm 2. Following Brown et al. [30], if the observations are assumed to be independent, we can express the likelihood function as a multinomial distribution:

L(θ) = p({C_n} | {q(n | θ, M)}) = U! Π_n [ q(n | θ, M)^{C_n} / C_n! ],  (23)

where {C_n} is the observed distribution of C_n (the number of cells with n bacteria), and U = Σ_n C_n. García-Pérez [31] provides an algorithm for the accurate computation of multinomial probabilities. As regards the prior p(θ) for a model M, it is assumed to be uniform across the parameter space of interest for that model; consequently, the prior is set equal to the reciprocal of the volume of that parameter space.
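A log-domain evaluation of (23) can be sketched as follows (our illustration; for counts of the size reported in Table 6, working with logarithms avoids underflow):

```python
import math

def multinomial_loglik(counts, probs):
    """Log of the multinomial likelihood p({C_n} | {q(n|theta, M)}) of
    Eqn (23), computed in the log domain via lgamma."""
    u = sum(counts)                      # U = total number of cells observed
    logl = math.lgamma(u + 1)            # log U!
    for c, q in zip(counts, probs):
        if c > 0:
            logl += c * math.log(q) - math.lgamma(c + 1)
    return logl

# Example: 4 cells split 3/1 across two categories with probabilities 0.75/0.25
print(math.exp(multinomial_loglik([3, 1], [0.75, 0.25])))  # ≈ 0.421875
```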

A continuation approach
The theory underlying nested sampling assumes that all the parameters of a model take continuous values; however, this will not necessarily be the case in practice. For example, the binomial model (Model 3) has a discrete parameter n and a continuous parameter p.
It is possible to formulate a theory of nested sampling for discrete parameters by replacing integrals with summations, but modifications to Algorithm 2 would be required to take account of the fact that, if θ is discrete, several points could occupy the same location in parameter space.
An alternative response to the presence of discrete parameters is to use a type of continuation approach [32]; in other words, if f(x) is a function defined only for integer values of x, replace it with another function g(x) that takes real arguments but for which g(x) = f(x) at the integers. For those models using a factorial of a parameter (i.e., Models 4 and 5), we can replace x! with Γ(x + 1), since the gamma function Γ is defined for real arguments.
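For example, using the gamma function from the standard library:

```python
import math

def factorial_cont(x):
    """Continuation of the factorial to real x via x! = Gamma(x + 1)."""
    return math.gamma(x + 1)

print(factorial_cont(5.0))  # 120.0, agreeing with 5!
print(factorial_cont(4.5))  # ≈ 52.343, defined between the integers
```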

The data
The data D consisted of the number C_n of mouse cells observed (via fluorescence microscopy) to contain n S. enterica bacteria: D = {C_n}_{n=1}^{29}. One dataset was used for a virulent bacterial strain (SL5560); another for an attenuated strain (SL3261). The infected cells were taken randomly from various locations in the liver. The observed C_n values are shown in Table 6.
The data were pooled across observation days. If C_n^[t] denotes the number of cells having n bacteria on day t then, for the virulent strain, Brown et al. [24] used C_n = C_n^[3] + C_n^[4]; the counts for the attenuated strain were pooled in the same way.

Posterior model probabilities
If we assume that the set of candidate models is exhaustive, we can apply (2) to estimate the posterior probability p(M | D) for each model. Furthermore, if p(M_i) is assumed to be equal for all models, we can use

p̂(M_i | D) = Ẑ_i / Σ_j Ẑ_j.  (25)

There are 14 models, with 10 estimates of Ẑ obtained for each, but it is impractical to systematically apply each of the 10^14 possible combinations of Ẑ values to (25); therefore, representative Ẑ values were used for each model.
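In practice, evidence estimates are carried in log form, and the conversion to posterior model probabilities under equal priors is done stably with the log-sum-exp trick. A sketch with hypothetical log-evidence values:

```python
import math

def posterior_model_probs(log_z):
    """Convert log-evidence estimates ln Z_i into posterior model
    probabilities under equal model priors, p(M_i|D) = Z_i / sum_j Z_j,
    subtracting the maximum first for numerical stability."""
    m = max(log_z)
    w = [math.exp(v - m) for v in log_z]
    s = sum(w)
    return [v / s for v in w]

# Three hypothetical log-evidences, two units apart dominate strongly:
print(posterior_model_probs([-100.0, -102.0, -110.0]))
```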

Results
The estimated model-evidence values Ẑ obtained by nested sampling for each model are shown in Tables 7 and 8. The ranges are shown in Table 9.
With respect to the data from the attenuated strain, the most probable model was Model 7 (m_0 only), followed by Model 11 (m_0 and a_e). With respect to the data from the virulent strain, the most probable model was Model 3 (Poisson burst threshold), followed by Model 5 (negative binomial burst threshold).

Parameter distributions
Having estimated the most probable model, M*, it is of interest to estimate the posterior joint probability of the parameters θ with respect to D and M*: p(θ | D, M*).
From Bayes' theorem, we can write

p(θ | D, M) = L(θ) p(θ) / p(D | M),  (26)

and the denominator of Eqn (26) can be estimated by nested sampling:

p̂(θ | D, M) = L(θ) p(θ) / Ẑ.  (27)

Parameter estimation via reject sampling
The distribution p(θ | D, M) can be estimated using reject sampling with approximation (27). As part of this process, the maximum of p̂(θ | D, M) can be determined by performing Nelder-Mead simplex optimisation of this distribution over the parameter space.
The estimated parameter distributions obtained by reject sampling for Models 3, 5, 7 and 11 are shown in Figures 3, 4, 5 and 6, respectively. In each case, the sample size was 10000. The samples obtained by reject sampling were also used to construct density scatter plots (Figures 7 and 8), which provide a visualisation of the correlations between the parameters.
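The reject-sampling step can be sketched as follows; here log_post_unnorm is ln(L(θ)p(θ)) of (27), sample_prior draws from the uniform prior, and log_post_max is an upper bound on the log posterior density, such as the maximum located by the Nelder-Mead optimisation mentioned above (all names are ours):

```python
import math
import random

def reject_sample_posterior(log_post_unnorm, sample_prior, log_post_max, size):
    """Rejection sampler for a posterior known up to a constant, as in
    Eqn (27) (p(theta|D,M) proportional to L(theta) p(theta)).  log_post_max
    must be an upper bound on log_post_unnorm over the prior's support."""
    out = []
    while len(out) < size:
        theta = sample_prior()
        # accept with probability post(theta) / post_max
        if random.random() < math.exp(log_post_unnorm(theta) - log_post_max):
            out.append(theta)
    return out
```

The acceptance rate falls as the posterior becomes more peaked relative to the prior, which is why a good bound log_post_max matters in practice.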

Parameter estimation directly from nested sampling
The parameter sequence {θ_min,1, θ_min,2, ..., θ_min,J} is produced during nested sampling. Can this set of parameters be regarded as a random sample from p(θ | D, M)? Sivia and Skilling [26] proposed using {θ_min,k}_{k=1}^{J} for this purpose so long as each point is weighted by w_k = l_min,k (ξ_{k−1} − ξ_k) / Ẑ, on the basis that w_k ≈ p(θ_min,k | D, M). A theoretical justification for this is given by Chopin and Robert [29].
The appropriateness of regarding {θ_min,k}_{k=1}^{J} as a random sample from p(θ | D, M) was ascertained empirically using the Kolmogorov-Smirnov test, as follows.
The Kolmogorov-Smirnov statistic D_n is given by

D_n = sup_x | F(x) − F_0(x) |,  (28)

where F_0(x) is the cdf of the null-hypothesis pdf, and F(x) is the empirical cdf obtained from a sample {X_i}. This definition can be generalized to a weighted Kolmogorov-Smirnov statistic by replacing the empirical cdf in (28) with a weighted cdf, which allows us to take account of the weights {w_k}_{k=1}^{J} on {θ_min,k}_{k=1}^{J}. Applying this method to the toy model M_toy presented in Methods S1, a sample {θ_min,k}_{k=1}^{J}, with J = 10586, was obtained by nested sampling, and its weighted Kolmogorov-Smirnov statistic D_J was found to equal 0.01298. In order to obtain a frequentist p-value for this statistic, an empirical probability distribution for D_J was constructed by randomly selecting a set {ã_k}_{k=1}^{J} of a values from p(a | ·, M_toy) and determining D_J for the set, this being done 10000 times. On comparing 0.01298 with this empirical distribution, the p-value for {a_min,k}_{k=1}^{J} was found to be 0.0276. In contrast, a sample of size J obtained by reject sampling did not yield a significant statistic. As a result of this experiment, it was decided not to use {θ_min,k}_{k=1}^{J} for estimating parameter distributions.
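A weighted empirical cdf and the corresponding statistic can be computed as follows (our sketch; the supremum is attained immediately below or above one of the sample points, so both sides of each step are checked):

```python
def weighted_ks(sample, weights, cdf0):
    """Weighted Kolmogorov-Smirnov statistic sup_x |F_w(x) - F_0(x)|, where
    F_w is the empirical cdf built from the sample with normalised weights."""
    pairs = sorted(zip(sample, weights))
    total = sum(w for _, w in pairs)
    d = cum = 0.0
    for x, w in pairs:
        d = max(d, abs(cum - cdf0(x)))   # just below the step at x
        cum += w / total
        d = max(d, abs(cum - cdf0(x)))   # just above the step at x
    return d

# Equal weights against the Uniform(0, 1) cdf:
print(weighted_ks([0.25, 0.5, 0.75], [1, 1, 1], lambda x: x))  # 0.25
```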

Model checking
It does not follow that the most probable model from a set of candidate models is necessarily an acceptable model: the most probable model may merely be the least bad of a set of poor models. What is required is an assessment of the fit of the most probable models to the observed data. A common approach to assessing the fit of a model to data is to use a p-value with respect to some statistic T(y), where y is the observed data. More formally, the classical p-value is given by

p = P(T(y′) ≥ T(y) | θ),  (29)

where y′ is a possible future value, and the probability is taken over the distribution of y′ given θ, a single parameter estimate.
A drawback of (29) is that it does not take account of the uncertainty in θ expressed by the posterior distribution p(θ | y, M). In contrast, the Bayesian posterior predictive p-value [33,34] averages over this posterior:

p_B = P(T(y′) ≥ T(y) | y) = ∫ P(T(y′) ≥ T(y) | θ, M) p(θ | y, M) dθ.  (30)

In the context of the Salmonella study, the posterior draws θ̃ were provided by the parameter estimates obtained for p(θ | D, M), m was set to 10000, and p(y′ | θ̃, M) was modelled as a multinomial distribution with U total counts (cf. (23)).
In order to obtain m values of T(y′) drawn from p(y′ | y, M), each y′ drawn from p(y′ | θ̃, M) is mapped to T(y′, θ̃). We used the G-statistic for the test statistic T [35]. The G-statistic is proportional to the Kullback-Leibler measure of distribution divergence, and is given by

G = 2 Σ_n O_n ln( O_n / E_n(θ, M) ),

where O_n = C_n and E_n(θ, M) is the expected number of cells containing n bacteria under the model.
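The two ingredients of this check can be sketched as follows (our illustration):

```python
import math

def g_statistic(observed, expected):
    """G = 2 * sum_n O_n * ln(O_n / E_n); cells with O_n = 0 contribute
    zero to the sum."""
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

def posterior_predictive_pvalue(t_obs, t_sim):
    """Fraction of simulated test statistics at least as extreme as the
    observed statistic."""
    return sum(1 for t in t_sim if t >= t_obs) / len(t_sim)

print(g_statistic([10, 20], [15, 15]))  # ≈ 3.398
```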

Discussion
The AIC is a common maximum-likelihood approach to model comparison, but nested sampling enables a Bayesian approximation of the model evidence p(D | M) to be computed, bringing with it the advantages of the Bayesian approach. These include integration across the parameter space; estimation of the posterior parameter distributions (with visualisation of parameter correlations); and estimation of the posterior predictive distributions for goodness-of-fit assessments of the models.
Under the assumptions used, the most probable models with respect to the virulent and attenuated strains of S. enterica were burst-threshold Model 3 (Poisson) and burst-rate Model 7 (m 0 only), respectively. The next two most probable models were burst-threshold Model 5 (negative binomial) and burst-rate Model 11 (m 0 plus a e ), respectively. However, the Bayesian posterior predictive p-values indicate that alternative models and/or a relaxation of the quasi-stationary assumption adopted by Brown et al. [24] should be considered. It may be the case that one of the candidate models is correct but the use of pooled data was detrimental.
Other assumptions of the underlying mechanistic model may also be wrong; in particular, the absence of bacterial death and the assumption that each released bacterium infects a new macrophage.
For both the attenuated and virulent strains, the data D were recorded over a number of days following infection and then pooled, with D = {C_n}_{n=1}^{29}. If time-dependent data are to be retained and nested sampling is to be applied, then a method is required to estimate the likelihood function p({D^[t]} | θ, M), where D^[t] = {C_n^[t]} and C_n^[t] is the number of cells containing n bacteria on the t-th day. Branching processes have been used to model a variety of biological systems [36], and we will investigate the potential of estimating p({D^[t]} | θ, M) through the use of Bellman-Harris processes to model within-host infection dynamics.
We have demonstrated that a visualisation of the marginal and joint posterior parameter distributions p(θ | D, M) is readily obtainable once the model evidence Z has been estimated by nested sampling. The estimated joint posterior distributions provided a visualisation of the correlations between the parameters. Through the use of a weighted Kolmogorov-Smirnov test, we also found that the parameter sequence {θ_min,k}_{k=1}^{J} resulting from nested sampling could not be regarded as a random sample from the posterior parameter distribution p(θ | D, M).
One drawback of Algorithm 2 is that the restricted priors will converge to a single mode when a likelihood is multi-modal, and this will cause the evidence Z to be underestimated. This issue can be resolved by implementing a multi-modal version of nested sampling, such as that proposed by Feroz et al. [37] for comparing cosmological models.

Supporting Information
Methods S1 Toy example of nested sampling. (PDF)