Use of Approximate Bayesian Computation to Assess and Fit Models of Mycobacterium leprae to Predict Outcomes of the Brazilian Control Program

Hansen’s disease (leprosy) elimination has proven difficult in several countries, including Brazil, and there is a need for a mathematical model that can predict control program efficacy. This study applied the Approximate Bayesian Computation algorithm to fit 6 different proposed models to each of the 5 regions of Brazil, then fitted hierarchical models based on the best-fit regional models to the entire country. The best model proposed for most regions was a simple model. Posterior checks found that the model results were more similar to the observed incidence after fitting than before, and that parameters varied slightly by region. Current control programs were predicted to require additional measures to eliminate Hansen’s Disease as a public health problem in Brazil.


Introduction
Hansen's disease (HD, or leprosy) is caused by chronic infection with M. leprae. After initial infection, most likely through respiratory droplet spread or direct contact [1], infected individuals enter a latent period of varying length, generally thought to be 3-5 years [1]. Upon becoming symptomatic, individuals will become either paucibacillary (PB, having 5 or fewer skin lesions) or multibacillary (MB, more than 5 skin lesions), possibly due to genetic factors in host immune response [1]. The 2 forms of the disease result in differing levels of peripheral nerve damage caused by the immune response to infection, with MB individuals suffering from higher grades of disability and more social stigma [2]. In contrast, PB individuals may heal spontaneously [1]. Diagnosis and classification are generally based on clinical signs, as diagnostic tests are limited.
The World Health Organization has set a goal of eliminating HD as a public health problem, defined as an annual incidence rate of <1/10,000 [3]. Despite global success at reaching this goal, clusters of high prevalence remain in at least 9 countries [4], including Brazil [5][6][7]. In PLOS ONE | DOI: 10 these countries, models may assist in identifying the most effective control points. Generalized mathematical modeling of HD has been limited to a series of discrete-time Reed-Frost models [8,9] and deterministic compartmental models that have not been fitted to data [10][11][12]. An individual-based model has been produced [9], but is specific to the demographic situation in Bangladesh. Model predictions of the contribution of any control program to changes in incidence has been found to be highly dependent on the assumptions of the model, many of which are the result of persistent gaps in knowledge [13]. There is a need for a model that will predict the effects of control strategies accurately [14], as the role of interventions such as vaccines is debated [15]. In particular, the transmission rate is difficult to determine, as screening tests are not available and case detection is known to vary by region and over time [16]. The rate of transition from latency to symptomatic disease, as well, is difficult to determine, as direct estimation would rely on knowing the time of infection, which is especially unlikely in household or community acquired infections. Approximate Bayesian Computation (ABC) has been applied to several infectious disease models [17][18][19][20] with great success, as the likelihood-free approach allows for application to complicated models where the explicit likelihood function is intractable [21]. The ABC algorithm also allows for direct comparison of different models. The purpose of this study is to use the ABC algorithm to better fit existing models of HD to incidence data from Brazil and to select the model with the best fit to the observed data. This model can then be used to predict the effect of control strategies.

Existing Models and Assumptions
Deterministic compartment models for M. leprae (Fig 1) were previously developed and analyzed mathematically [8,[10][11][12]. Regional model parameters are listed in Table 1 and modelspecific parameters are listed in Table 2; all parameters are based on the parameters used by the developers of the models. State names (i.e., S or E) are used to refer to the number of individuals in that state.
All models assume that individuals enter the population as susceptible (S) and, after infection at rate λ, become latently infected (E). In Model 1 [10], latent individuals may spontaneously enter the recovered state (R), or they may become clinically diseased in the multibacillary state (M), where they are subject to an extra mortality rate ν M , or paucibacillary (P) state. Model 1 assumes that all infected individuals can recover and will not relapse. In Model 2 [11], latent individuals may become detected (E D ), at which point they may recover (R) due to treatment. Both undetected and detected individuals may become multibacillary (M) or paucibacillary (P), from which they may recover or return to undetected latency. Model 2 assumes that recovered individuals (R) may relapse to either multibacillary or paucibacillary disease. Model 3 [12] is identical to Model 1 except for the assumption that multibacillary individuals (M) do not recover. Model 4 [12] is identical to Model 2 except for the assumption that latent cases (E) cannot be detected and cannot recover. The transmission rate, λ, is calculated in Models 1, 2, 3, and 4 as l ¼ b P Pþb M M N , where β P is the transmission coefficient for paucibacillary individuals and β M is the transmission coefficient for multibacillary individuals. Model 5 [12] divides the multibacillary and paucibacillary compartments into untreated (M N , P N ) and treated (M T , P T ) states, with the assumption that treated multibacillary cases do not suffer from extra mortality. Model 5 [12] assumes that treated individuals may recover (R) or relapse to the untreated state. The transmission rate, λ, in Model 5 as l ¼ b P ðP N þy 1 P T Þþb M ðM N þy 2 M T Þ N , where θ i is the proportional decrease in infectiousness associated with treatment of disease type i. Model 6 [8] also divides the diseased compartments into untreated and treated states, and assumes that untreated paucibacillary disease may develop into multibacillary disease. Model 6 also assumes: treated individuals may enter a recovered category (M R , P R ) or a dormant disease category (M A , P A ), with either category able to relapse to the treated category; dormant paucibacillary disease (P A ) may   The initial number of individuals in each category was determined empirically (see S1 File) in relation to the initial prevalence and the assumed proportion of MB disease in existing cases, P(MB). To calculate P(MB), the average proportion of MB cases among existing cases was calculated across each region for all years. The calculations used in each model are shown in Table 3.

Approximate Bayesian Computation
The ABC algorithm used to fit each model was identical, following the Sequential Monte Carlo algorithm proposed by Toni et al. [22] The distance function was calculated as where inc PB (t) is the observed incidence of paucibacillary HD at time t, inc PB fit ðtÞ is the fitted incidence of paucibacillary HD at time t, inc MB (t) is the observed incidence of multibacillary HD at time t, and inc MB fit ðtÞ is the fitted incidence of multibacillary HD at time t. The model-specific calculations for inc PB fit ðtÞ and inc MB fit ðtÞ are shown in Table 3. The first set was run without a rejection step, in order to empirically determine the tolerance. In each following set, the tolerance was set to the 60 th quantile of the distance function from the previous set [23] and the perturbation kernel for each parameter was set equal to a uniform distribution with a range of Table 3. Assumptions made about the initial number of individuals per category and the calculation of incidence for 6 models of Hansen's Disease.

Category
Description M/M N MB (all or untreated) C*P(MB)*P m π e *C*P(MB) π e *C*P(MB) prev: reported regional prevalence in 2000 P(MB): observed regional probability that a new case is multibacillary π i : relationship between C and the initial number of individuals in the i compartment (see S1 File) doi:10.1371/journal.pone.0129535.t003 twice the variance of that parameter's values from the previous set. Unknown parameters were the transmission parameters (β M and β P ) and transition parameters (γ M and γ P ) for MB and PB cases, respectively. Uniform prior distributions for the unknown parameters were based on a range that included the biological minimum (0) and at least twice the maximum value used by the developers of the models being fitted ( Table 2). The algorithm was implemented for 10 sets of 100,000 iterations each. Validation of the model is presented in S2.

Model Selection and Parameterization
The models were fit and model selection was performed using incidence data from Brazil, which has endemic HD in all regions, from 2000 to 2012 [6]. These data include the population size, growth and death rates (Table 1), and the annual number of new PB and MB cases by region for the period of 2000 to 2012 (Table 4). Over this period, the number of cases declined in all regions (Fig 2). The regions represented in these data are shown in Fig 3. We used the ABC algorithm described above to estimate the parameters for each region and each model. Model selection was performed for each region as described above, with a Bayes Factor >1 for each model comparison indicating the preferred model. Distance values from the final set were summed for each model fitted, and each Bayes Factor was calculated as the ratio of the summed distance of each model to the summed distance of the comparative model. This provides the pairwise strength of evidence to prefer a particular model to the comparative model. Posterior distributions were determined for each of the 6 models in each of the 5 regions.

Hierarchical Model Selection and Parameterization
After the model selection and parameterization process selected the best-fit model for each region, several hierarchical models were considered to allow for parameter distributions to be fit across regions. In version 1, all 4 parameters were shared across models and regions. In version 2, only transmission parameters were shared across regions. In version 3, only transition parameters were shared across regions. Version 4 consisted of the regionally-fit model. The where r represents the region. The ABC algorithm was implemented with sets of 10,000 iterations, but the number of sets was varied by the number of free parameters, 5 sets per parameter estimated. The priors used were the same as for the regional model selection and parameterization. The posterior distributions of each model (3 hierarchical models and the full regional model) were used to fit 100 iterations for each region. Model fit was determined by choosing the hierarchical model with the smallest summed d Ã over all regions in the posterior sample fits. For each region, a random weighted selection of 1,000 iterations from the best-fit model's posterior distributions was used to calculate the effective reproduction rate (R e ), the predicted prevalence in the region ihn 2050 (p 2050 ), and the time necessary to eliminate HD from the region (t elim ), which was calculated by solving the ODE model over 100 years and finding the earliest time at which the apparent prevalence fell below the WHO threshold of 0.0001, or 1 in 10,000 [24].
In order to check the consistency of the model results, data were simulated for each region using the median of the best fitted value from the hierarchical model, using Model 3. These data were then used to repeat the full model selection and parameterization process, including hierarchical model selection and parameterization. Results were compared to the simulated input values.

Results
The validation trials presented in the supplement found that the ABC algorithm used was able to reproduce many of the simulated values if moderate values were simulated, especially if Model 3 or Model 4 was used to simulate the data, but tended to produce posterior distributions closer to the center of the prior distribution if extreme values were simulated. The model selection process was able to identify the simulated model in most cases.
The Bayes Factor ratios from applying the ABC algorithm with each model to individual regions of Brazil are shown in Table 5, with the columns being the comparative models. For all but the Southern region, Model 3 provided the best fit, and no model was strongly preferred over Model 3 in the Southern region (although Model 4 was weakly preferred); therefore, all hierarchical modeling was performed on Model 3. The posterior distributions of the best fit models for each region are shown in Fig 4, compared to the prior distribution, and Table 6. The fitted transmission rate for PB cases, β P , is similar across regions, with overlapping posterior distributions. The fitted transmission rate for MB cases, β M , is more varied, with higher fitted values in the Northeast and lower fitted values in the Midwest and Southeast. Transition rates (γ M and γ P ), in contrast, show 2 distinct groupings of regionally fit parameters. The posterior distributions of the North and Northeast are significantly lower than that of other regions, especially for MB cases.
The posterior distributions of the hierarchical models are shown in Table 6. As shown by the relative weight value in Table 6, Version 1, in which all parameters were shared across regions, was preferred to all other versions, with the regional model being the second-best fit. As would be expected, the hierarchical model posterior distributions fell in the midst of the regional posterior distributions, with no evidence that any one region was overly influential. Versions 2 and 3, in which some parameters were shared across regions, proved to have difficulty in fitting the regional parameters; posterior distributions were similar to prior distributions.
The posterior estimations for incidence of PB and MB HD in all regions of Brazil, for both the best-fit regional model and the best-fit hierarchical model, are compared in Figs 5 and 6 to the incidence estimations of the unfitted model (using the parameters provided by the developers of the model) and the observed values. In most regions, the hierarchical model captured the observed incidence dynamics better than the unfitted model; the hierarchical model was especially preferred for its ability to capture the decline in incidence observed in all regions over time. The unfitted and regionally fitted parameters showed a tendency to increase incidence during the later part of the observation period. Both regional and hierarchical fits were best in the North and Midwest. No set of parameters was able to capture the high peak in PB incidence in the Northeast and Southeast.
All regional estimates of R e had a median value of 1.3, with ranges of 0.97 to 1.7. No region was predicted to eliminate HD as a public health risk within 100 years. Median values of predicted prevalence in the year 2050 ranged from 0.0007 in the South to 0.0045 in the Midwest; this represents a slight increase from the observed prevalence in 2000.
The simulation study, in which fitted values were used to simulate data for each region and the fitting process was repeated, produced almost exactly the same results as the original fitting. Model 3, the simulated model, was preferred for all regions except the Southern region.    Parameter values were similar to the input parameters across all regions. The hierarchical model in which all parameters were shared (version 1) was preferred, and the medians (range) of the best-fit parameter distributions were 0.45 (0.24-0.95) for β M , 0.27 (0-0.95) for β P , 0.13 (0.046-0.29) for γ M , and 0.09 (0.03-0.20) for γ P . The transmission values were higher than the input parameters, and the transition values were lower, but the distributions were similar.

Discussion
This study reports the best fit of the previously published mathematical models and parameters for fitting incidence of Hansen's Disease, in both multibacillary and paucibacillary form, in Brazil. This is the first study to directly compare the different suggested models of HD to field data. By identifying and parameterizing the best-fit model (the model for which the distance between the estimated and observed incidence is smallest), this study provides a guideline for studying control and prevention of HD in an endemically infected country and suggests further improvements to mathematical models of HD. The ABC algorithm was found to be effective for selecting the appropriate model and marginally adequate for improving the fit for most of the models simulated. This algorithm has been applied to several infectious disease models [17][18][19][20], but only one with 2 outcome variables to track [25] (in this case, incidence of PB and MB cases). Use of ABC's non-likelihoodbased approach allowed for easier use of more data, which improved the model fit. However, the algorithm experienced difficulty with the interconnected nature of the parameters; the high negative correlation between transmission and transition parameters lead to a tendency to overestimate one and underestimate the other. This could be corrected if better field estimates of at least one set of parameters were available, to decrease the range of the current, somewhat uninformative, priors. Validation with extreme values resulted in poor fitting, which could be a result of insufficient particles to capture the extremes of the prior distribution. The simulation study, however, found that the algorithm was able to consistently reproduce the simulated values if fitted parameter values were used to simulate the data. This indicates that the algorithm is sufficient to fit realistic parameter values, and is consistent in identifying the best-fit model.
There have been questions with regard to the use of ABC for model selection, especially when applied to insufficient summary statistics [26,27]. Although many of these caveats are related to the joint estimation of models and parameters, as opposed to the separate estimation of parameters for each model applied here, we were aware of the potential for model selection to fail. For this reason, we chose to perform validation tests (see S2 File) with simulated data. This validation found that our summary statistic was able to select the appropriate (simulated) model in most cases. We also considered using a combination of summary statistics, in which the distance between simulated and observed incidence was calculated separately for MB and PB cases and the rejection step required both distance functions to be below their individual thresholds. However, this did not improve model fit or change model selection results during validation trials (data not shown), so the more efficient joint summary statistic was used.
One further difficulty to the fitting of these models was in selecting the initial values for each compartment. As prevalence estimates were available, these were at least able to set the known numbers of PB and MB cases. However, the numbers of latent, dormant, recovered, and undetected cases could not be determined from the available data. This led to the empirical approach applied here (S1 File). The initial values found using this approach were sensitive to the assumed parameter values, which could bias the model fitting process. In order to minimize this, we applied the 2-step approach, with a preliminary model fitting using initial values based on the unfitted parameters. The preliminary fitted parameters were used to determine a new set of initial values, which were then used for the final model fitting. This empirical 2-step approach should decrease the bias of the unfitted parameters while providing reasonable estimates to the true initial values.
In the initial validation test, the algorithm resulted in a preference for the simulated model in all but Model 3, which was the model found to be preferred with the estimation based on observed data. This could be a result of the assumptions necessary in model fitting. In the validation test, all assumptions about model parameters are known to be true, and so will not bias the results of the fitting, allowing more complicated models to reproduce the results of a simple model. In the fitting based on observed data, all unfitted parameters are assumed to be known, but there is a possibility that some may be wrong. As Model 3 has the fewest unfitted parameters, it is the least biased by this uncertainty.
For most regions, except the South, Model 3 was found to have the best fit. The Southern region preferred Model 4 slightly. Model 3 was the simplest, which may explain the preference; there are fewer unfitted parameters to influence the results of the fitting. However, this model does not allow for cure of MB disease, which is known to occur and which all other models allow. It could be that other models underestimate the amount of time that MB individuals are infectious, and that this value (represented by α M in Models 1, 2, and 4 and by φ M in Models 5 and 6) should also be fitted. The current value, however, assumes that the average MB case will become non-infectious in 5 years (in Models 1, 2, and 4) or 2 years (in Model 6), which are reasonably in line with current assumptions; Model 5 assumes a very low case-finding rate, and so is unlikely to be underestimating the length of infectiousness in MB cases. As the amount of time spent infectious would implicitly include the case-finding rate in Models 1-4, and that value would change with differing control programs, it could be appropriate to find better estimates of the recovery rate in future studies. In the present study, the number of observations is insufficient to fit many correlated parameters.
The best-fit parameters, regionally and by hierarchical model, show useful patterns. The transmission parameter for MB cases (β M ) was almost always higher than that for PB cases (β P ). This was expected, and reinforces the existing belief that MB cases are more infectious than PB cases, but the difference in the parameter values is smaller than expected. In addition, the transition rate to MB (γ M ) is slightly higher than the transition rate to PB (γ P ). As these models are formulated, that indicates that slightly more infections result in MB cases than PB cases, which was observed in the data from Brazil (Fig 2). However, it is known that PB cases can be self-healing and may be under-reported, and so this model (based on reported cases) may be underestimating the true γ P , which may also bias estimates of β M and β P . It may be useful to configure a model that would take into account the prevalence of genetic susceptibility for MB, allowing for a more accurate representation of the 2-path model that is becoming more accepted for mycobacterial diseases [9].
The hierarchical model in which all parameters were shared was found to provide the best overall fit. The best-fit models were able to dramatically improve the estimated incidence of PB and MB cases in several regions (Figs 5 and 6). In some regions, the regional model showed a better posterior estimation for MB cases; this indicates that there is regional variation in HD dynamics, likely due to regional variations in health and socioeconomic factors [28] as well as regional differences in case detection rates. The North and Northeast regions, for instance, showed much lower transition rates than the other regions, possibly related to lower true case detection rates in these regions. This may also have resulted in the fitting of much higher transmission rates for the Northeast, as cases took longer to become infectious.
The model was overall a poor fit for PB cases the Southeast and Northeast. These regions had later peak incidence, indicating that control programs may have changed during the time period under study. That would cause a poor fit for all parameters, as the transmission rates depend on hygiene, prophylaxis, and vaccination rates and the transition rates depend on case finding and treatment rates.
Based on the results of this study, the 5 regions of Brazil are not progressing towards elimination; the mean and range of the posterior distribution of R e is 1.3 (0.97-1.7), barely touching 1. While an increase in prevalence is predicted, this is possibly a discrepancy between the model structure, which does not allow for recovery from MB cases, and the public health authority, which assumes all cases to be recovered at the end of treatment. The observations being fitted and the criterion for elimination, incidence, would not be affected by this discrepancy. With the current programs remaining unchanged, the model predicts that most regions of Brazil cannot eliminate HD as a public health problem within 100 years. Improved control programs will be needed if the goal of elimination is to be reached.