A spatial-mechanistic model to estimate subnational tuberculosis burden with routinely collected data: An application in Brazilian municipalities

Reliable subnational estimates of TB incidence would allow national policy makers to focus disease control resources in areas of highest need. We developed an approach for generating small area estimates of TB incidence, and the fraction of individuals missed by routine case detection, based on available notification and mortality data. We demonstrate the feasibility of this approach by creating municipality-level burden estimates for Brazil. We developed a mathematical model describing the relationship between TB incidence and TB case notifications and deaths, allowing for known biases in each of these data sources. We embedded this model in a regression framework with spatial dependencies between local areas, and fitted the model to municipality-level case notifications and death records for Brazil during 2016–2018. We estimated outcomes for 5568 municipalities. Incidence rate ranged from 8.6 to 57.2 per 100,000 persons/year for 90% of municipalities, compared to 44.8 (95% UI: 43.3, 46.8) per 100,000 persons/year nationally. Incidence was concentrated geographically, with 1% of municipalities accounting for 50% of incident TB. The estimated fraction of incident TB cases receiving diagnosis and treatment ranged from 0.73 to 0.95 across municipalities (compared to 0.86 (0.82, 0.89) nationally), and the rate of untreated TB ranged from 0.8 to 72 cases per 100,000 persons/year (compared to 6.3 (4.8, 8.3) per 100,000 persons/year nationally). Granular disease burden estimates can be generated using routine data. These results reveal substantial subnational differences in disease burden and other metrics useful for designing high-impact TB control strategies.


Introduction
Tuberculosis (TB) is a leading infectious cause of death globally [1] despite the availability of effective combination drug regimens. A major contributor to ongoing TB mortality is the large fraction of cases that are not detected, and therefore not treated. In 2019, the World Health Organization (WHO) estimated that 29% (21-36%) of incident TB cases did not get diagnosed or receive treatment [1]. More effective approaches for case finding, diagnosis and linkage to care are needed to reduce the fraction of TB case that do not receive treatment [2], which contributes to TB morbidity and onward TB transmission. While there is only limited evidence supporting untargeted active TB case finding activities as an effective strategy to reduce TB transmission in communities [3], trials [4] and modeling studies [5] suggest that active TB case finding interventions targeted to highest risk sub-populations and spatial areas of disease concentration could have a sizeable epidemiologic impact. Identifying spatial areas of TB disease concentration, and especially areas where rates of undiagnosed TB may be high, remains a substantial challenge. In most settings, it is difficult to determine whether spatial heterogeneity in notification rates can be attributed to actual differences in TB incidence or to differences in surveillance system quality. This variability in the quality and completeness of surveillance systems make it difficult to reliably estimate subnational TB disease incidence and the local rates of undiagnosed TB.
Several approaches have been developed to describe subnational patterns of TB disease burden. These include methods that apply spatial smoothing methods to TB case notification data directly (not addressing the problem of under-detection) [6], use expert input and programmatic data to decompose national incidence esimates to local areas [7,8], or implicity assume that case detection rates are constant across modeled locations [9]. In this paper, we describe extensions to a published method for estimating subnational TB burden [10], which allows stable estimates to be generated at a fine geographic scale. Under this approach, local area estimates of TB incidence and the completeness of case detection are generated using a spatialmechanistic model fit to routine data on TB case notifications and reported deaths.
We demonstrate the feasibility of this approach using data from Brazil. The WHO estimates that Brazil had a national TB incidence of 45 (39-52) per 100,000 in 2018, with a case detection rate of 87% (75% -100%) [1]. Routine reporting for Brazil demonstrate substantial variation in notification rates across districts, and past studies employing simpler versions of the methods used in this study have suggested considerable subnational variation in incidence and case detection rates [10,11]. In this study, we report estimates of TB incidence, the fraction of cases treated, and the rate of untreated TB disease for each of 5568 contiguous municipalities in the country over the period 2016-2018.

Overview
We developed a novel method for generating small-area TB incidence estimates based on a previous mathematical model of TB incidence and case detection [10]. We made three specific changes to this previous work in order to produce small-area estimates: 1) we incorporated the spatial structure of municipalities to allow smoothing of noisy estimates across neighboring areas; 2) we aggregated notifications over a three-year period to smooth year-to-year stochastic variation at the municipal-level; and 3) we made adjustments to the previous method used to estimate the probability of death during TB treatment.
A small set of sociodemographic predictors were also included in the model to allow for correlations between municipalities with similar characteristics. The model estimates incidence as the sum of individuals with TB who initiate treatment, die without initiating treatment, or 'self-cure' without initiating treatment. Accounting for these pathways, we fit the model to the observed number of individuals with TB who receive treatment and the observed number of TB deaths in each of Brazil's 5568 contiguous municipalities.

Data
We accessed tuberculosis case notifications from 2016 to 2018 (n = 276,915) from Brazil's Notifiable Diseases Information System (SINAN; Sistema de Informação de Agravos de Notificação) [12]. Because TB treatment is exclusively accessed via Brazil's public healthcare system, TB case notifications are considered an accurate proxy for treatment initiation [10]. Case notifications were spatially referenced based on the recorded municipality of residence. In cases where municipality of residence was missing or coded incorrectly, municipality of treatment was used as a proxy (n = 202, 0.07%). We excluded cases that did not represent new diagnoses among living individuals (individuals who had received a misdiagnosis of TB (n = 6,063, 2.2%), were continuing care after transferring from another clinic (n = 7,762, 2.8%), were continuing care after previously being considered lost to follow-up (n = 22,242, 8.0%), or who had received a diagnosis of TB postmortem (n = 1,974, 0.7%)). In addition, we excluded cases where the municipality of residence had no neighboring municipalities (the island municipalities Ilhabela, SP and Fernando de Noronha, PE; n = 24, 0.009%). All case notifications had a recorded treatment outcome. Duplicate notifications are systematically removed from the database, and we did not detect any remaining duplicates in the analyzed dataset.
We accessed tuberculosis mortality data from 2016 to 2018 from the Brazilian Mortality Information System (SIM; Sistem de Informação de Mortalidade) [13]. We considered an individual to have died while actively infected with TB if at least one International Classification of Disease (ICD-10) code related to tuberculosis was listed as a primary or secondary cause of death [14]. We considered ICD-10 codes A15.0-A19.9, B20.0, K67.3, K93.0, M49.0, N74.0-N74.1, P37.0, U84.3; this list includes the code for TB/HIV co-infection. Deaths were spatially referenced based on municipality of residence. In addition, we used a linkage of the SIM and SINAN systems [15] to estimate the probability of death for case notifications with "death" or "loss to follow-up" as their reported treatment outcome.
Finally, we collated municipal-level data on sociodemographic characteristics (Table 1). These variables were chosen based on their expected relationship to TB burden [16] or the completeness of case detection.

Model description
We specified Poisson likelihood functions for total SINAN case notification data and SIM mortality data over the period 2016-2018.
For municipality i, where δ i represents population size, α i represents the modeled TB incidence rate, β i represents the modeled fraction of cases treated, δ i represents the probability of death during treatment, μ represents the probability of surviving the disease episode without treatment, b h represents the fraction to total deaths recorded in SIM (calculated for the state (h) in which the municipality is located), and ε i represents an adjustment for misreporting of TB deaths in the SIM database.
We followed the implementation of the modified Besag-York-Mollie (BYM2) model described by Morris et al. [21][22][23] to incorporate the spatial structure of municipalities in our model. We selected the BYM2 model because it does not presume spatial autocorrelation; the model includes a mixing parameter which distributes random variance in model estimates across spatial and non-spatial effects, making it an appropriate choice when the presence of spatial autocorrelation is uncertain [24]. We specified exponential and inverse logit functions for incidence (α i ) and fraction treated (β i ), respectively: For municipality i, where π 0 and ω 0 are constants; X i and Z i are vectors of municipallevel covariates; and π and ω are the associated vectors of regression coefficients. The vector X i includes the covariates percentage of crowded households, percentage of municipality classified as a favela (slum) or other subnormal housing, percentage of individuals in poverty, whether there is a prison in the municipality, number of public hospital beds per capita, primary healthcare access, and average GDP per capita. The vector Z i includes the covariates percentage of individuals in poverty, whether there is a prison in the municipality, number of public hospital beds per capita, primary healthcare access, and average GDP per capita. Apart from individuals recorded as having been misdiagnosed, we assumed that all notified cases represent individuals who truly had TB. Similarly, we did not consider false-positive diagnosis among SIM TB death data. Finally, π 1i and ω 1i are the combined random and spatial

TB Mortality
Number of deaths with a TB-related ICD-10 code as a primary or secondary cause SIM [13] 2016-18

TB Deaths after Notification
The fraction of cases in SINAN-TB with "Death' as an outcome that can be linked to a death in SIM occurring within 365 days of the case notification.

Treatment Outcome is "Death"
Fraction of notified TB cases with a known outcome where "death" is the outcome SINAN-TB [12] 2016-18

Treatment Outcome is "Loss to Follow Up"
Fraction of notified TB cases with a known outcome where "loss to follow-up" is the outcome

Poorly Defined Cause of Death
Percentage of deaths with a primary cause of death listed as an ICD-10 code considered "poorly defined" (ICD-10 codes R0-R99) Health Informatics, Brazilian Ministry of Health (DATASUS) [13] 2016-18

SIM Coverage
Correction factor for under-reporting of all deaths in the mortality system (SIM) Brazilian Institute of Geography and Statistics (IBGE) [17] 2018 effects of the BYM2 model: For municipality i, where φ i represents an intrinsic conditional auto-regressive spatial component, θ i is a non-spatial random component, ρ models the proportion of variance from the spatially structured effect, and S is a scaling factor computed from the spatial adjacency matrix. The scaling factor is chosen such that the Var(φ i ) � 1; additionally, the prior on θ is fixed such that Var(θ i ) � 1, making σ the overall standard deviation of the combined random effects component [21][22][23].
In addition, we estimated the probability of death among individuals who initiated treatment: For municipality i, where z i is the probability that death is listed as the treatment outcome, ι i is the probability that loss to follow-up is listed as the treatment outcome, λ is that probability that an individual with treatment outcome of death will appear in the death record, and η is the probability that an individual whose treatment outcome was lost to follow up will appear in the death record. We determined the means of the prior distributions for z i and ι i from state-level data, dividing the number of outcomes listed as death and lost to follow-up, respectively, by the number of treatment notifications with a definitive treatment outcome (cure, death, loss to follow-up, or treatment failure). We calculated the means of the prior distributions for λ and η based on a linkage of SINAN and SIM [15]. We calculated the fraction of individuals with a treatment outcome of "death" or "loss to follow-up" (respectively) linked to a death record in SIM within 365 days of their case notification. Because 2016 was the last year for which death data were available in the linked dataset, we restricted this analysis to case notifications from 2015.
Finally, we estimated the systematic underreporting of TB as a cause of death: For municipality i, where κ 1 is a constant, κ 2 is a municipal-level random effect, σ κ is the standard deviation, x i is the percentage of deaths in SIM attributed to a poorly-defined cause, and κ 3 is the associated regression coefficient (based on values elicited through an expert opinion survey [10]). We assumed that underreporting of TB as a cause of death is not a spatially dependent process.

Estimation and Implementation
There is substantial uncertainty around true values for several model parameters. We used a Bayesian approach to represent and propagate this uncertainty through the analysis, utilizing prior probability distributions to describe plausible ranges for model parameters (S1 Table). When fitting the model likelihood functions to data, we used a three years of input municipality data. Summing cases and deaths over a three-year period decreased the stochastic uncertainty associated with low case counts and reduced the number of municipalities with no reported TB deaths in the study period, improving inference for these municipalities.
The model was implemented in stan [25] using the rstan [26] package. We ran 6000 iterations on four chains, keeping 500 iterations after warm-up and thinning by a factor of two for a total of 1000 posterior draws. Running on four-chains in parallel using the Yale Center for Research Computing's high-performance computing cluster took approximately 8.5 hours, based on the reported elapsed time. The meanR was 1.005 and there were no diagnostic warnings. We calculated point estimates as the mean of the posterior draws. We calculated equaltailed 95% posterior intervals using the 2.5 th and 97.5 th percentiles of these distributions. To calculate state-and national-level results we summed estimates across municipalities for each posterior draw before calculating point estimates and intervals. Parameter posterior distributions can be found in S2 Table. Input data and model code are available at github.com/mel-hc/ TB_saie.

Results
Over the period 2016-2018, there were 239,369 individuals who initiated treatment for a newly diagnosed case of TB. Nationally, we observed 38.5 notified TB cases per 100,000 person-years over the three-year study period. In this same period, there were 59,183 deaths where TB was listed on the death certificate (as the underlying or contributory cause), corresponding to 3.5 deaths per 100,000 person-years.

TB incidence
We estimated a national incidence rate of 44.8 (95% credible interval: 43.3, 46.8) per 100,000 person-years, for the three-year study period. The TB incidence rate varied substantially across and within states (Fig 1). Average state-level incidence rate ranged from 13.8 (12.7, 15.0) per 100,000 person years in the Federal District to 93.4 (88.8, 98.9) in Amazonas.
The median municipal incidence rate was 19.3 per 100,000 and the estimated incident TB rate ranged from 4.6 to 1172 per 100,000 per year (S1A Fig). The majority of incident TB cases were concentrated in a small number of municipalities. Over half of TB incidence in each state was attributed to 1.1% to 6.7% of municipalities within that state. In Amazonas, 72.5% (71.2%, 73.7%) of incident TB occurred among individuals living in Manaus, and in Rio de Janeiro State, 52.2% (51.2%, 53.3%) of incident TB occurred among individuals living in Rio de Janeiro municipality. Nationally, 1% of municipalities accounted for 50.0% (49.7%, 50.4%) of incident TB. Because Brazil's population is highly concentrated in urban municipalities, we expect a small number of municipalities to account for a large share of incident TB. While these larger municipalities share several features that promote TB transmission (high population density, populations living in subnormal agglomerations), we found a relatively weak correlation between municipal population size and TB incidence rate (Pearson correlation coefficient (ρ) = 0.09, p < 0.001). A map of incident TB rates, by municipality, can be found in Fig 2A. A small number of municipalities (n = 39) had a mean estimated incident TB rate in excess of 150 per 100,000 inhabitants per year. Most of these municipalities had a large number of notified cases (average notified cases = 217) relative to their small population (average population = 28,855). Additionally, 95% of these municipalities (n = 37) contained a prison, and for these municipalities the incarcerated population was large relative to total population (median 175 incarcerated individuals to 1000 inhabitants) (S3 Table). Across all municipalities, the presence of a prison was a significant predictor of elevated TB incidence rate, corresponding to a 1.52 (1.46, 1.59) increase in the rate ratio of TB incidence, conditional on socioeconomic factors and healthcare quality. Collectively, municipalities with prisons had a higher TB incidence rate (51.9 [50.2, 54.2]) than municipalities without prisons (33.5 [32.2, 35.3]), p < 0.001). In municipalities with prisons, TB incidence rates correlate strongly with the number of incarcerated individuals per 1000 residents (Pearson correlation coefficient (ρ) = 0.87, p < 0.001). Municipalities with more than 10 incarcerated individuals to 1000 inhabitants had a median TB incidence rate of 67.7 (range: 9.2, 1172) per 100,000.

Fraction of incident TB receiving treatment
We estimated that the average fraction of incident TB cases receiving treatment in Brazil was 0.86 (0.82, 0.89). As with estimates of incident TB, the fraction of treated cases varied by state, from 0.81 (0.75, 0.85) in Alagoas, 0.89 (0.86, 0.92) in São Paulo. Across municipalities, the estimated fraction treated ranged from 0.73 to 0.95 (S1B Fig). Areas where the fraction of TB cases that receive treatment is low are generally concentrated in the Northeast region of the country (Fig 2B).

Rate of untreated TB
We estimate the rate of untreated tuberculosis as the product of the incident TB rate and 1 minus the fraction treated. Nationally, there were an average 13,049 (9941, 17247) individuals with incident TB that did not receive treatment each year (6.3 [4.8, 8.3] per 100,000 person-  (8.9, 18.8) in Amazonas (Fig 3).
We estimated high rates of untreated TB throughout Brazil, but found that these areas were generally concentrated in urban coastal municipalities and in the North region (Fig 4). Rio de Janeiro municipality had the largest number of untreated TB of any municipality, with an average 1031 (664, 1469) untreated TB cases each year. Collectively, the 65 municipalities with the highest number of untreated individuals accounted for 50.2% (47.8%, 52.7%) of untreated incident TB in Brazil (S4 Table). A 1% increase in the fraction treated across these 65 municipalities would result in 477 (462, 501) fewer untreated individuals each year; a 1% increase in fraction treated across all municipalities would result in 926 (897, 970) fewer untreated individuals each year.

Model sensitivity to prior distributions
We conducted sensitivity analyses to understand the impact of the choice of prior distribution for the probability of survival without treatment (an input estimated based on expert opinion). First, we re-fit the model with weakly informative priors and found that the posterior distribution was sensitive to the choice of model prior. We also found that the posterior probability of survival without treatment correlates strongly with national estimates of the TB incidence rate (Pearson correlation coefficient (ρ) = 0.89, p < 0.001) and fraction treated (Pearson correlation coefficient (ρ) = -0.90, p <0.001) (S2 Fig). However, the relative rank of municipal incidence rate and treatment coverage estimates when the model was run with strong priors around extreme values (0.3, 0.7) of survival without treatment were very strongly correlated (Spearman's rank correlation coefficient (ρ) = 0.99, p < 0.001 and (ρ) = 0.98, p < 0.001, respectively). Shapefiles were downloaded using the geobr package for the R programming language, which sources shapefiles from IBGE: https://www.ibge.gov.br/en/ geosciences/territorial-organization/territorial-meshes/. https://doi.org/10.1371/journal.pgph.0000725.g002

Discussion
We present a spatially-explicit Bayesian model to estimate rates of TB incidence and treatment initiation. We demonstrate the feasibility of this approach for generating small area disease burden estimates with routinely-collected data on case notifications and deaths. Applying this approach to data from Brazil, we found substantial subnational variation in TB epidemiology. Consequently, a small number of municipalities were responsible for the majority of incident TB in Brazil. Additionally, we found that the presence of prison in a municipality was associated with higher TB rates, a finding which is supported by previous work on TB incidence in Brazilian prisons [27] and their surrounding communities [28]. We present a metric of untreated TB which can be used to identify municipalities where a large number of incident TB cases did not initiate treatment (Fig 3, S4 Table). Improving case detection and treatment initiation rates in these municipalities could be an effective strategy for reducing the burden of TB in Brazil. Finally, we aggregated municipal-level results to the state-level and found that our point estimates for incidence and fraction treated generally fell within the uncertainty This approach is not the first that has been proposed to generate subnational TB burden estimates. One approach that has been used in other large high TB burden countries is the SUBsET method [7,8]. As applied in Indonesia, this method starts from WHO-estimated TB incidence for the country, then decomposes this total into district-level values based on predictors defined through expert elicitation (population size, urbanization, socio-economic indicators). Case detection rates are then calculated as the ratio of notifications to estimated incidence. While this method leverages local knowledge on epidemiological drivers, it excludes TB notifications and death data from the incidence estimation, which may provide valuable additional signals of disease burden. In another published study, province-level mathematical models were constructed for South Africa to estimate both disease burden estimates and investigate policy scenarios [9]. While the use of fully-developed policy models allowed this study to directly compare control options in each province, the inference approach was not designed to identify differences in case detection between provinces, and instead a common case detection rate was applied in all locations based on the national WHO estimate. A recent study in Ethiopia employed a Hidden Markov Model and spatiotemporal smoothing to link case notifications to unobserved TB incidence. This approach allowed for local differences in case detection rates, with these differences identified primarily through spatiotemporal random effects and the presence or absence of a health facility [29]. Other studies that have provided high granularity burden estimates have smoothed case notifications directly, implicitly assuming perfect case detection rates [6]. Finally, several approaches have been developed to leverage TB prevalence survey data to produce subnational TB burden estimates (applied in Pakistan [30] and Bangladesh [31]). While prevalence survey data represent a valuable additional source of information on the distribution of TB, these surveys are expensive to undertake and are not available for all countries.
Although the approach described here potentially resolves many of these limitations, this study has several limitations. First, the model is designed to correct for the misattribution of TB deaths to other causes, but this adjustment may be insufficient in some cases, leading to underestimates of incident TB rates in municipalities with exceptionally poor death records. Secondly, we found that the posterior distribution of the probability of survival without treatment was dependent on the choice of prior distribution, and model outcomes were sensitive to this parameter. The probability of death without treatment in the antibiotic era is not easily observable; our choice of prior distribution represents our best possible estimate. However, if the probability of survival without treatment were lower than our estimate, the model would underestimate the incident TB rate and overestimate the fraction treated (S2 Fig). Thirdly, our analytic strategy did not consider the possibility of overdiagnosis of TB. These false-positive diagnoses are an acknowledged risk of TB diagnostic algorithms, particularly when TB is diagnosed based on clinical criteria [32]. This could produce biased estimates of the fraction treated by overestimating treatment initiations and treatment fatality rates. Over the study period, we found that 87% of notified cases in Brazil were tested with highly-specific bacteriological tests (microscopy, culture, or molecular (GeneXpert) tests). Among notified cases without one of these tests, 72% had x-ray findings indicative of TB. These figures suggest that overdiagnosis is a relatively minor feature of TB diagnosis in Brazil, and unlikely to meaningfully affect our modelled estimates. Our analysis also did not include primary abandonment (diagnosed individuals who do not initiate treatment or who default within the first 30 days), however programmatic data suggest these individuals represent less than 1% of all notifications.
Finally, this analysis did not consider some risk factors for TB mortality that may have improved inference. The most important of these is HIV, which substantially increases mortality rates for individuals with TB disease [33,34]. If individuals who are co-infected with TB and HIV are more likely to die from TB in the absence of treatment, the municipalities with high TB-HIV coinfection rates could have upwardly biased incidence estimates. Conversely, if TB-HIV deaths are more likely to be misclassified as other categories of HIV/AIDS related deaths, this would mean that incidence estimates for high TB-HIV municipalities would be biased downward. The net effect of these two potential biases is unclear. Although HIV surveillance in Brazil is believed to be reasonably complete [35], HIV testing is not consistently performed with TB diagnosis (17.5% not performed or missing), making it difficult to infer rates of TB-HIV co-infection for this analysis. It may be necessary to extend this analytic approach to explicitly account for TB-HIV in order to apply it in high HIV prevalence settings.
The approach presented here can be a valuable tool enabling countries to better understand TB burden on a fine spatial scale. This approach uses routinely-collected data on case notifications and deaths, making it applicable to settings where the number of individuals receiving TB treatment is known and quality cause-specific death records are available. In settings where a large fraction of areas has no reported cases or deaths in a given year, aggregating over multiple years can reduce stochastic uncertainty in model results. The modified Besag-York-Mollié spatial and random effect terms can also be extended to account for non-contiguous areas. Furthermore, because the modified Besag-York-Mollié approach has been shown to be unbiased in applications with no spatial autocorrelation [24], the model does not need to be reparametrized for settings where the spatial autocorrelation in TB burden may be small or negligible.
Small area incidence estimates generated from this approach can help local health agencies tailor programmatic improvements to their specific epidemiological situations. Targeting improvements in TB case finding in municipalities with high rates of untreated TB could result in large increases in the number of infected individuals initiating treatment and ultimately decrease TB incidence and mortality.