Information differences across spatial resolutions and scales for disease surveillance and analysis: The case of Visceral Leishmaniasis in Brazil

Nationwide disease surveillance at a high spatial resolution is desired for many infectious diseases, including Visceral Leishmaniasis. Statistical and mathematical models using data collected from surveillance activities often use a spatial resolution and scale either constrained by data availability or chosen arbitrarily. Sensitivity of model results to the choice of spatial resolution and scale is not, however, frequently evaluated. This study aims to determine if the choice of spatial resolution and scale is likely to impact statistical and mathematical analyses. Visceral Leishmaniasis in Brazil is used as a case study. Probabilistic characteristics of disease incidence, representing a likely outcome in a model, are compared across spatial resolutions and scales. Best fitting distributions were fit to annual incidence from 2004 to 2014 by municipality and by state. Best fits were defined as the distribution family and parameterization minimizing the sum of absolute error, evaluated through a simulated annealing algorithm. Gamma and Poisson distributions provided best fits for incidence, both among individual states and nationwide. Comparisons of distributions using Kullback-Leibler divergence show that incidence by state and by municipality do not follow distributions that provide equivalent information. Few states with Gamma distributed incidence follow a distribution closely resembling that for national incidence. These results demonstrate empirically how choice of spatial resolution and scale can impact mathematical and statistical models.


Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Data availability: All relevant data are within the manuscript and its Supporting Information files. They can be found in S3_File.zip.

Introduction
Infectious disease research often relies on data generated through passive or active surveillance activities, which can suffer from important limitations due to variation in methods and capacities for data collection [1,2]. Typically, researchers aim to collect data at a high spatial resolution, that is, in the form of small surveillance units such as counties or municipalities rather than states or nations [3,4], though this may not always be seen as beneficial [5]. Conducting surveillance at a high spatial resolution, however, is often unrealistic when considering large areas and constrained resources [6,7].

Data collected from infectious disease surveillance activities are often used in research involving mathematical or statistical models. In such analyses, matters related to data quality are of concern. The choice of spatial resolution is often based on data availability or made arbitrarily, with little attention given to whether this decision may impact model results. Aggregating data into larger spatial units can aid computational efficiency, but creates the risk of introducing ecological fallacy [8] and masking heterogeneity within those larger units if conclusions are drawn inappropriately [9-11]. This would be particularly problematic when seeking disease etiologies. Previous studies using mathematical or statistical models have investigated the importance of high resolution data by repeating analyses using data at different resolutions and then comparing results [10-12].

An additional challenge to high quality surveillance is the need for surveillance over a large spatial scale, referring to the entire area where surveillance is being conducted. Here, spatial scale and spatial resolution are distinguished as follows: spatial scale refers to the total spatial area being examined, while spatial resolution refers to the size of the individual spatial units within that area. Large-scale surveillance can be particularly challenging for nations with large land areas and populations. In these circumstances, there is potential benefit in identifying a smaller area, such as a state or group of states, where surveillance can adequately estimate the national disease burden.

This study compares spatial resolutions and scales using probability distributions rather than by fitting models with assumptions and conducting a sensitivity analysis. Best fitting distributions from multiple candidate distributional families were fit for (1) annual incidence for each individual state using the municipality as the unit of surveillance; (2) annual incidence nationwide using the municipality as the unit of surveillance; and (3) annual incidence nationwide using the state as the unit of surveillance (Figure 1). All 11 years of observation were included.

Common candidate distributions were selected based on exploratory analyses and suitability for discrete incidence data, and wide ranges of parameters for each distribution were tested.

Each distribution was evaluated for the optimal parameter set that minimizes the sum of absolute error, where $P(x)$ represents the probability of observing an incidence of $x$ cases per 100,000 person-years based on the proposed distribution indicated in Table 1.

The first aim of comparing distributions is to determine if different inferential conclusions may be reached using the municipality or the state as the resolution of nationwide surveillance. This was done by comparing the fitted state-resolution distribution to an expected state-resolution distribution for the nation based on the fitted municipality-resolution distribution for the nation. This expected distribution was generated empirically by drawing Monte Carlo samples from the fitted municipality-resolution distribution.

If incidence is denoted by $X$ as a random variable following the fitted municipality-resolution distribution for the nation; states are denoted by $s$; state $s$ has $n_s$ municipalities, denoted by $m$; and municipality $m$ has a population of $p_{my}$ in year $y$, this empirical distribution was generated by drawing 1,000 samples in which a value of $X$ drawn for each municipality is combined with $p_{my}$ to give a case count for the municipality. The sum of these counts is divided by the total state population to produce an expected state-resolution incidence for a year based on the municipality-resolution distribution for incidence.
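The minimization of the sum of absolute error described above can be illustrated with a minimal sketch. It assumes a single-parameter Poisson family, a Gaussian proposal step, and a linear cooling schedule; these settings and the function names are hypothetical and are not taken from the study, whose annealing configuration is not reproduced here. Summing the error over incidence values from zero to the largest observed value is likewise an assumption.

```python
import math
import random
from collections import Counter


def poisson_pmf(x, lam):
    # Poisson probability of observing x, computed in log space for stability.
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))


def sum_absolute_error(lam, observed):
    # observed maps each incidence value x to its observed proportion.
    return sum(abs(poisson_pmf(x, lam) - prop) for x, prop in observed.items())


def fit_poisson_by_annealing(values, n_iter=5000, temp0=0.1, step=0.3, seed=0):
    """Search for the Poisson rate minimizing the sum of absolute error.

    Assumed settings (not from the study): Gaussian proposals with standard
    deviation `step`, and a temperature decreasing linearly from `temp0`.
    """
    rng = random.Random(seed)
    counts = Counter(values)
    n = len(values)
    observed = {x: counts.get(x, 0) / n for x in range(max(values) + 1)}

    lam = max(sum(values) / n, 1e-3)  # start from the sample mean
    err = sum_absolute_error(lam, observed)
    best_lam, best_err = lam, err

    for i in range(n_iter):
        temp = temp0 * (1.0 - i / n_iter) + 1e-9
        cand = abs(lam + rng.gauss(0.0, step)) + 1e-9  # keep the rate positive
        cand_err = sum_absolute_error(cand, observed)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature falls.
        if cand_err < err or rng.random() < math.exp((err - cand_err) / temp):
            lam, err = cand, cand_err
            if err < best_err:
                best_lam, best_err = lam, err
    return best_lam, best_err
```

Repeating the same search for each candidate family and keeping the family with the smallest error would mirror the model selection step described in the text; the multi-parameter families would need a proposal for each parameter.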

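The Monte Carlo aggregation from municipality to state resolution can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: the municipality-resolution distribution is taken to be a rounded Gamma, expected case counts are left unrounded, the final state incidence is rounded to whole cases per 100,000, and the function name, parameter values, and populations are hypothetical.

```python
import numpy as np


def simulate_state_incidence(populations, shape, scale, n_samples=1000, seed=0):
    """Draw simulated state-resolution incidence values for one state-year.

    populations: municipality populations within the state for a given year.
    shape, scale: parameters of the (assumed) rounded Gamma distribution for
    municipality-resolution incidence per 100,000 population.
    """
    rng = np.random.default_rng(seed)
    pops = np.asarray(populations, dtype=float)

    # One incidence draw per sample and municipality, rounded to be discrete.
    incidence = np.rint(rng.gamma(shape, scale, size=(n_samples, pops.size)))

    # Convert incidence per 100,000 into a case count for each municipality.
    cases = incidence * pops / 100_000

    # Sum the municipality counts and divide by the total state population.
    state_incidence = cases.sum(axis=1) / pops.sum() * 100_000
    return np.rint(state_incidence).astype(int)
```

The relative frequencies of the returned values form the empirical state-resolution distribution that is then compared with the fitted state-resolution distribution.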
Relative proportions of incidence values in these simulated values were compared to the probabilities of each incidence value from the fitted state-resolution distribution through Kullback-Leibler (KL) divergence [37]. KL divergence represents the additional information needed when using one distribution to describe data from another distribution. By measuring dissimilarity, it has an inverse relationship to [...]

RRIG (Equation 3) shows the needed increase in information for the distribution of B to describe data from the distribution of A [37]. A value of 1 represents an information increase of 100%, or a doubling of information, though this is not an upper bound. A large RRIG value is indicative of distinct differences between distributions, indicating that characterizations of VL incidence are sensitive to the resolution of surveillance. An RRIG above 5% was selected as a threshold for having a distinct difference in distribution.

Of the 26 states in Brazil, 22 were included in analyses since they all observed more than five nonzero unique annual municipality-resolution incidence values over the study period (Table S1). The remaining four states were excluded because their incidences over the 11 years did not provide enough unique incidence values.

The uniform distribution for nationwide municipality-resolution incidence did not converge even after the iteration count was increased to 200,000. All other distributions converged to optimal values. The best fitting municipality-resolution distributions for individual states varied. Annual incidence values from ten states were best fit by the rounded Gamma distribution, incidences from seven states were best fit by the Poisson distribution, incidences from three states were best fit by the Zero Inflated Poisson distribution, and incidences from two states were best fit by the Zero One Inflated Poisson distribution. Specific parameters are shown by state in Table 2.
Plots of the probability mass functions of each state's municipality-resolution distribution are shown in Figure 3. Nationwide, the best fitting distribution for municipality-resolution incidence was the Gamma distribution, and the best fitting distribution for state-resolution incidence was the Zero One Inflated Poisson distribution (Table 2).

Comparisons between individual states' municipality-resolution incidence and national municipality-resolution incidence are shown in Table 2 using RRIG from Equation 3. The nationwide, municipality-resolution distribution is used as the reference for comparisons. The results show that six of the 22 states had incidence following a distribution close to that of the nation (RRIG<0.05) (Table 2, Figure 4). Any of these states could individually describe municipality-resolution incidence of the nation. Because not all states adequately characterize national burden, true scale invariance was not seen, though self-similarity was seen in the selected states. The states that exhibited some self-similar behaviors all followed a Gamma distribution with generally similar parameters, particularly low values of shape parameters. Many of these states were located near the center of the nation (Figure 4).

This study aimed to assess the importance of the spatial scale and resolution used for VL surveillance and subsequent quantitative analyses. This is also reflective of the dynamics of VL at different scales, as determined by the distributions of incidence. Probability distributions were fit to incidences at different spatial resolutions and scales and then compared to determine if distributional fit was sensitive to the choice of scale and resolution.
Aggregating municipality-resolution incidences into state-resolution incidences led to notably different probabilistic characteristics of disease burden, suggesting the existence of different processes driving disease occurrence at the two resolutions. When continuing surveillance at the municipality resolution, six states' incidences follow distributions that adequately describe those of each other as well as that of the nation of Brazil. While our results provide evidence against true invariance to resolution and scale, some self-similarity is seen in both distributional parameters and moments. This happens for states whose incidence follows a Gamma distribution, which implies medium-long range dispersal of cases and a potential tendency toward a power-law distribution for small scale and shape parameters.

The self-similarity seen in six states does not indicate that significant resources can be saved in Brazil by concentrating surveillance in a smaller area, because these states are not representative of the other states. The remaining states still need to undergo surveillance in order for their VL burden to be adequately characterized. Furthermore, it is of interest for public health to know where all VL cases occur in order to intervene in an outbreak. If greater self-similarity were seen, it would largely be of interest to researchers who could potentially generalize results from a smaller area to the nation of Brazil by conducting more intensive data collection in that smaller area. However, because scale invariance was not seen and self-similarity was seen in only a small number of states, it is unlikely that descriptions of VL burden in a smaller region of Brazil are generalizable to the entire nation. These considerations apply to the current observed state and could differ, for instance, in the case of widespread long-range propagation of the disease.
The results from this study do not necessarily suggest that one spatial resolution is more "correct" than another or favor a particular resolution for analysis. The resolution for statistical analyses should rely on the research question being posed and the desired interpretation of results. However, the resolution dependence implies that, assuming accuracy and precision in assigning municipalities to observed cases, aggregating incidence to the state resolution likely introduces ecological fallacy. Thus, high resolution is likely beneficial to capture disease dynamics accurately. For high-resolution incidence, the most likely VL dynamics are represented by the Gamma distribution. These considerations should always be taken into account when collecting and analyzing data because they indicate that the choice of resolution will impact model results and their interpretation. Data characterizations and analyses at one resolution are not interchangeable with characterizations and analyses at the other resolution. A related point is that diligent surveillance is important at a finer spatial resolution to ensure that cases are matched to the correct municipalities.

This study is, to the authors' knowledge, the first to examine VL incidence for sensitivity to scale and resolution of surveillance data by finding best-fitting distributions to characterize incidence. Other studies have analyzed the fractality of other diseases, such as cholera, and how that is important for a simple estimation of disease spread in terms of geography and magnitude [47]. Similar distribution fitting processes are used in veterinary epidemiology [48], but less frequently in human disease epidemiology. This analysis is important for informing future disease burden by providing location-specific estimates of expected annual incidence.

The findings of this study can benefit surveillance, healthcare infrastructure, digital epidemiology, and public health research focused on disease ecology. Care for an individual VL patient in Brazil, including diagnosis, treatment, and medical care, is estimated to cost approximately $500 (US) (plus an additional $1470 (US) for secondary prophylaxis among VL patients with HIV) and lasts between seven and 20 days [49]. This is a high individual healthcare cost, so designing optimal surveillance that allows public health practitioners to understand and prevent VL is a valuable task socially and economically. These results and methods (applicable to any disease) can optimize disease data analysis and surveillance for the reduction of the systemic disease burden.

Using only VL incidence data and not introducing other data sources provides focus on what would be the outcome variable of a typical statistical analysis, independently of any other predictors that may be introduced. Refitting models at multiple resolutions or scales assumes that the outcome, in this instance VL incidence, follows the same distribution at each scale and/or resolution. For example, using a lognormal regression model with two resolutions assumes that incidence at both resolutions follows a lognormal distribution, which may not be correct. When analyzing municipality-resolution cases, not all states have distributions in the same family, and distributions in the same family have different parameterizations because of the likely differential importance of the underlying socio-environmental drivers. The latter point further motivates the use of Bayesian hierarchical models or other models, for instance statistical physics and/or information theoretic models, which are able to handle information on the factors controlling scale and resolution.

We show that the information theoretic RRIG can determine the amount of information needed to describe the data using different resolutions or scales. It can be used as an information theoretic tool for scaling (downscaling or upscaling, depending on the purpose) epidemiological data considering their value and underlying distributions.

An additional point of novelty is the use of the ZOIP and Gamma distributions to characterize VL incidence. Both distributions are uncommonly used for infectious disease incidence, despite closely fitting observed data. The ZOIP distribution offers the advantage of specifically fitting high frequencies of counts of one, describing single spurious cases. The Gamma distribution is advantageous for placing high probability on low values. More specifically, in terms of the statistical physics of disease ecology, the Gamma distribution has similarities to heavy-tailed distributions (for small shape and scale parameters), and ZOIP represents Poisson distributions highlighting local/random and medium-range disease dynamics.

Fitting distributions across all years of observation treats VL dynamics as being at a stationary state. Increases have been seen in VL cases over time [50], though case counts between 2000 and 2014 have remained more consistent compared to previous decades [51,52], indicating that these results are not likely to be sensitive to this assumption. Populations over this time period by municipality generally showed small changes. The mean change in population by municipality was an increase of approximately 11% between 2004 and 2014, and the middle 90% of changes were between a 12% decrease and a 41% increase [33]. These considerations motivate extensions of this study to define the relationship between space and time for scale dependent processes.

Another assumption made in this study is the ability to fit a single probability distribution for VL incidence for the entire nation of Brazil. Since not all of the included states are considered endemic for VL [31], fitting a single distribution for incidence nationwide assumes that the same distribution can represent incidence in both endemic and non-endemic states. This assumption should be taken into account in the quantitative analyses that would follow from the results of this study. Other heterogeneities across the nation that may impact VL incidence, such as affluence, urbanization, or climate, are similarly not considered for distribution fitting but should be accounted for during subsequent analyses.

The results of this study rely on the data collected. VL case data were collected through passive surveillance and notification to the Ministries of Health. It is commonly known that reported cases of infectious diseases only represent a portion of the total cases [53-55], commonly the most severe cases. The accuracy of the data, and therefore of distribution fitting, is thus limited by the ability to report cases as well as by the potentially heterogeneous severity of VL cases. It is also likely that the extent of underreporting differs across locations in Brazil. The results of this study rely on the assumption that reported cases provide an adequate representation of disease burden. Furthermore, inclusion of both endemic and non-endemic states in the analyses may lead to the inclusion of case data representing both typical and atypical VL incidence. This could potentially affect distribution fitting if the underlying processes leading to typical and atypical incidence differ.

A limitation of this study is the reliance on the criterion for determining differences when comparing distributions and on the algorithm used for determining best fitting distribution families and parameters. There are numerous methods for performing both tasks, and different methods may lead to slightly different conclusions. The methods of this study do, however, use assumption-free criteria in order to generate the results. A sensitivity analysis was conducted to determine if the number of samples drawn to generate the empirical state-resolution distribution described in section 2.2.1 using Equation 2 might impact RRIG values. Using 1,000; 2,000; 5,000; and 10,000 samples did not yield distinct differences in RRIG values or any differences in interpretations and conclusions.
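The divergence underlying these comparisons can be sketched for two discrete distributions defined on the same set of incidence values. The sketch computes only the standard Kullback-Leibler divergence; the normalization that turns it into RRIG (Equation 3) is not reproduced here, and the function name and the choice of base-2 logarithms are assumptions made for illustration.

```python
import numpy as np


def kl_divergence(p, q, base=2.0):
    """Kullback-Leibler divergence D(P || Q) for discrete distributions.

    p, q: non-negative weights over the same ordered incidence values; each
    is normalized to sum to one before the comparison.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()

    support = p > 0  # terms with p(x) = 0 contribute nothing
    if np.any(q[support] == 0):
        # Q assigns zero probability to a value that P can produce.
        return float("inf")
    ratio = p[support] / q[support]
    return float(np.sum(p[support] * np.log(ratio)) / np.log(base))
```

In a check like the sensitivity analysis above, `p` could hold the relative frequencies of simulated state-resolution incidence values for a given number of samples and `q` the probabilities from the fitted state-resolution distribution, with the calculation repeated for each sample size.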

Another important note is that this study used surveillance units of different sizes, examining aggregation of municipalities of differing land areas and populations and comparisons among states with different areas and populations. This results from using administrative districts, which remain useful because they are the units recorded in infectious disease surveillance. However, diseases know no political boundaries, and an ecosystem-based discretization to define homogeneous high resolution units would be preferable for surveillance, such as one based on Digital Elevation Models from which to derive physical ecosystem boundaries that are relevant to disease spread. This would also help in assigning disease control to different political entities. A related topic of research is the existence of spatial autocorrelation in the data. Values of Moran's I using municipality-resolution incidence nationwide showed strong evidence of spatial clustering. This implies that disease dynamics are non-local, as already highlighted by differing fitted distributions across states, which is consistent with previous works [56,57]. Subsequent analyses on VL in Brazil would benefit from the use of methods that account for spatial autocorrelation. For the purposes of distribution fitting, finding distributional families that most accurately characterize incidence is of greater importance than determining a covariance structure that most accurately reflects autocorrelation. Determining clusters and covariance structures is an important component of analysis that follows the results of this study.
65 66 An additional challenge to high quality surveillance is the need for surveillance over a large spatial scale, 67 referring to the entire area where surveillance is being conducted. Here, spatial scale differs from spatial 68 resolution in their definitions as follows: spatial scale refers to the total spatial area being examined, while 69 spatial resolution refers to the size of the individual spatial units within that area. Large-scale surveillance 70 can be particularly challenging for nations with large land areas and populations. In these circumstances, 71 there is potential benefit in identifying a smaller area, such as a state or group of states, where 72 surveillance can adequately estimate the national disease burden. The characteristic of having smaller 4 areas representative of the whole for a large range of sizes is known as scale invariance or fractality [13]. 74 Scale invariance is ubiquitous in many socio-ecological patterns such as finance [14], ecology [15], 75 biochemical processes [16], and biology across time scales [17]. 76 77 Scale invariance is an infrequently examined concept in infectious disease surveillance and epidemiology 78 in general, though it has relevance to many forms of data analysis or modeling. Scale invariance in 79 infectious disease research is more frequently used to describe scale-free networks, typically applied to 80 human communicable diseases [18] [24,26,27]. Areas of Brazil that previously had accounted for only 15% of all cases reported 100 nationally now can see nearly half of the nation's cases [27]. The disease has also become more common 101 in urban areas in recent years [25,27,28]decades [26,28,29], making it a major public health concern 102 and an important target for surveillance programs. 
As of 2015, based on the data used for this study, the 103 This study aims to assess the potential impact of using different spatial resolutions and scales on 109 statistical and mathematical models using surveillance data applied to VL cases in Brazil. In order to do 110 so, two objectives are pursued: (1) to determine if surveillance using incidence by state or municipality 111 leads to different conclusions regarding disease distribution; and (2) to determine if conducting VL 112 surveillance on a region within Brazil would adequately characterize the nation's VL incidence. This is 113 done by using best fitting probability distributions to describe disease data without incorporating outside 114 information. A conceptual visualization of the study aims are presented in Figure 1. Prior to conducting 115 statistical analyses or models, researchers may need to decide whether to consider data using different 116 spatial resolutions as well as the scale of analysis; the results of this study will provide insight into 117 whether the subsequent results may be sensitive to this decision. 118 Fig 1. Graphical overview of the study objectives: (a) fit distribution to annual incidence of Visceral 120 Leishmaniasis (VL) by state, (b) fit distribution to annual incidence of VL by municipality, (c) fit 121 distributions to annual incidence of VL by municipality within each state. Comparisons of these fitted 122 distributions indicate whether characterizing VL incidence by state or municipality are equivalent, 123 impacting statistical analyses using these data. [33] to calculate annual incidence, discretized to represent cases per 100,000 population per year. This study compares spatial resolutions and scales using probability distributions rather than by fitting 145 models with assumptions and conducting a sensitivity analysis. 
Best fitting distributions from multiple candidate distributional families were fit for (1) annual incidence for each individual state using the municipality as the unit of surveillance; (2) annual incidence nationwide using the municipality as the unit of surveillance; and (3) annual incidence nationwide using the state as the unit of surveillance (Figure 1). All 11 years of observation were included.

Common candidate distributions were selected based on exploratory analyses and suitability for discrete incidence data, and wide ranges of parameters for each distribution were tested. The Poisson, Zero Inflated Poisson (ZIP), and Zero One Inflated Poisson (ZOIP) [34] distributions were selected as candidate distributions, along with the Gamma, Exponential, Power Law, and Uniform distributions rounded to fit discrete data. These are described in Table 1.

Each distribution was evaluated for the optimal parameter set that minimizes the sum of absolute error (SAE), defined as

$$\mathrm{SAE} = \sum_{x} \left| p(x) - \hat{p}(x) \right|$$

where $p(x)$ represents the probability of observing an incidence of $x$ cases per 100,000 person-years based on the proposed distribution indicated in Table 1 and $\hat{p}(x)$ represents the observed proportion of incidence values equaling $x$. This measure compares similarities between the proposed distributions and observed data and is less sensitive to outliers than other measures [35].

The first aim of comparing distributions is to determine if different inferential conclusions may be reached using the municipality or the state as the resolution of nationwide surveillance. This was done by comparing the fitted state-resolution distribution to an expected state-resolution distribution for the nation based on the fitted municipality-resolution distribution for the nation. This expected distribution was generated empirically by drawing Monte Carlo samples from the fitted municipality-resolution distribution.
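As a minimal sketch of the SAE criterion described above, the following Python code evaluates a Poisson and a ZOIP candidate against observed proportions. Several parts are assumptions made for illustration only: the incidence values are hypothetical, the sum is truncated at the largest observed value, the function names are invented here, and a simple grid search stands in for the simulated annealing algorithm used in the study. The ZOIP form shown (a Poisson mixed with extra point masses at zero and one) is the standard definition and is assumed to match Table 1.

```python
import math
from collections import Counter

def poisson_pmf(x, lam):
    """Poisson probability of observing x cases."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def zoip_pmf(x, lam, p0, p1):
    """Zero One Inflated Poisson: extra mass p0 at zero and p1 at one."""
    base = (1.0 - p0 - p1) * poisson_pmf(x, lam)
    if x == 0:
        return p0 + base
    if x == 1:
        return p1 + base
    return base

def sae(pmf, observed, x_max):
    """Sum over x = 0..x_max of |p(x) - observed proportion of x|.
    Truncating the support at x_max is an assumption of this sketch."""
    counts = Counter(observed)
    n = len(observed)
    return sum(abs(pmf(x) - counts.get(x, 0) / n) for x in range(x_max + 1))

def fit_poisson_by_grid(observed, lam_grid):
    """Return the rate in lam_grid with the smallest SAE (grid search,
    used here in place of simulated annealing)."""
    x_max = max(observed)
    return min(
        lam_grid,
        key=lambda lam: sae(lambda x: poisson_pmf(x, lam), observed, x_max),
    )

# Hypothetical incidence values (cases per 100,000 per year), illustration only.
incidence = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 4, 7]
lam_grid = [0.1 * k for k in range(1, 101)]
best_lam = fit_poisson_by_grid(incidence, lam_grid)
best_sae = sae(lambda x: poisson_pmf(x, best_lam), incidence, max(incidence))
print(best_lam, best_sae)
```

The same `sae` function can score any candidate family by passing a different probability mass function, for example `lambda x: zoip_pmf(x, 2.0, 0.3, 0.2)`, so that families and parameterizations are compared on a common criterion.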
If incidence is denoted by X as a random variable following the fitted municipality-resolution distribution

Of the 26 states in Brazil, 22 were included in analyses because each observed more than five unique nonzero annual municipality-resolution incidence values over the study period (Table S1).

The uniform distribution for nationwide municipality-resolution incidence did not converge even after the iteration count was increased to 200,000. All other distributions converged to optimal values. The best fitting municipality-resolution distributions for individual states varied. Annual incidence values from ten states were best fit by the rounded Gamma distribution, incidences from seven states were best fit by the Poisson distribution, incidences from three states were best fit by the Zero Inflated Poisson distribution, and incidences from two states were best fit by the Zero One Inflated Poisson distribution. Specific parameters are shown by state in Table 2. Plots of the probability mass functions of each state's municipality-resolution distribution are shown in Figure 3. Nationwide, the best fitting distribution for municipality-resolution incidence was the Gamma distribution, and the best fitting distribution for state-resolution incidence was the Zero One Inflated Poisson distribution (Table 2). No notable differences were seen in distributional fit between VL endemic and non-endemic states.

Comparisons between individual states' municipality-resolution incidence and national municipality-resolution incidence are shown in Table 2 using RRIG from Equation 3. The nationwide, municipality-resolution distribution is used as the reference for comparisons. The results show that six of the 22 states had incidence following a distribution close to that of the nation (RRIG<0.05) (Table 2, Figure 4). Any of these states could individually describe municipality-resolution incidence of the nation.
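To illustrate the kind of comparison involved, the sketch below computes the Kullback-Leibler divergence between a state's probability mass function and a national reference. It is not the RRIG of Equation 3, whose definition is not reproduced in this section, so the 0.05 threshold does not apply to its output. The two distributions are hypothetical, and the small floor `eps` used to avoid division by zero is an assumption of this sketch.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) in nats for two pmfs given as equal-length lists
    over the same support. Q is floored at eps (an assumption) so that
    zero reference probabilities do not cause a division by zero."""
    if len(p) != len(q):
        raise ValueError("p and q must cover the same support")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:  # terms with p(x) = 0 contribute nothing
            total += pi * math.log(pi / max(qi, eps))
    return total

# Hypothetical pmfs over incidence values 0..4 (illustration only).
national = [0.50, 0.25, 0.12, 0.08, 0.05]
state = [0.40, 0.30, 0.15, 0.10, 0.05]
print(kl_divergence(state, national))
```

Because the divergence is asymmetric, the choice of reference matters; the text above uses the nationwide municipality-resolution distribution as the reference, which corresponds to the second argument here.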
Because not all states adequately characterize national burden, true scale invariance was not seen, though self-similarity was seen in the selected states. The states that exhibited some self-similar behaviors all followed a Gamma distribution with generally similar parameters, particularly low values of shape parameters. Many of these states were located near the center of the nation (Figure 4).

This study aimed to assess the importance of the spatial scale and resolution used for VL surveillance and subsequent quantitative analyses. This also reflects the dynamics of VL at different scales, as determined by the distributions of incidence. Probability distributions were fit to incidences at different spatial resolutions and scales and then compared to determine if distributional fit was sensitive to the choice of scale and resolution. Aggregating municipality-resolution incidences into state-resolution incidences led to notably different probabilistic characteristics of disease burden, suggesting the existence of different processes driving disease occurrence at the two resolutions. When continuing surveillance at the municipality resolution, six states' incidences follow distributions that adequately describe those of each other as well as that of the nation of Brazil. While our results provide evidence against true invariance to resolution and scale, some self-similarity is seen in both distributional parameters and moments. This happens for states whose incidence follows a Gamma distribution, which implies medium-to-long range dispersal of cases and a potential tendency toward a power-law distribution for small scale and shape parameters.

The self-similarity seen in six states does not indicate that significant resources can be saved in Brazil by concentrating surveillance in a smaller area, because these states are not representative of the other states.
The remaining states still need to undergo surveillance in order for their VL burden to be adequately characterized. Furthermore, it is of interest for public health to know where all VL cases occur in order to intervene in an outbreak. If greater self-similarity were seen, it would largely be of interest to researchers, who could potentially generalize results from a smaller area to the nation of Brazil by conducting more intensive data collection in that smaller area. However, because scale invariance was not seen and self-similarity was seen in only a small number of states, it is unlikely that descriptions of VL burden in a smaller region of Brazil are generalizable to the entire nation. These considerations reflect the currently observed state and could change, for instance, in the case of widespread long-range propagation of the disease.

The results from this study do not necessarily suggest that one spatial resolution is more "correct" than another or favor a particular resolution for analysis. The resolution for statistical analyses should depend on the research question being posed and the desired interpretation of results. However, the resolution dependence implies that, assuming accuracy and precision in assigning municipalities to observed cases, aggregating incidence to the state resolution likely introduces an ecological fallacy. Thus, high resolution is likely beneficial for capturing disease dynamics accurately. For high-resolution incidence, the most likely VL dynamics are represented by the Gamma distribution. These considerations should always be taken into account when collecting and analyzing data because they indicate that the choice of resolution will impact model results and their interpretation. Data characterizations and analyses at one resolution are not interchangeable with characterizations and analyses at the other resolution.
A related point to note is that diligent surveillance is important when conducted at a finer spatial resolution to ensure that cases are matched to the correct municipalities.

This study is, to the authors' knowledge, the first to examine VL incidence for sensitivity to scale and resolution of surveillance data by finding best-fitting distributions to characterize incidence. Other studies have analyzed the fractality of other diseases, such as cholera, and how it is important for a simple estimation of disease spread in terms of geography and magnitude [47]. Similar distribution fitting processes are used in veterinary epidemiology [48], but less frequently in human disease epidemiology. This analysis is important for informing future disease burden by providing location-specific estimates of expected annual incidence.

The findings of this study can benefit surveillance, healthcare infrastructure, digital epidemiology, and public health research focused on disease ecology. Care for an individual VL patient in Brazil, including diagnosis, treatment, and medical care, is estimated to cost approximately $500 (US) (plus an additional $1470 (US) for secondary prophylaxis among VL patients with HIV) and lasts between seven and 20 days [49]. This is a high individual healthcare cost, so designing optimal surveillance that allows public health practitioners to understand and prevent VL is a valuable task both socially and economically. These results and methods (applicable to any disease) can help optimize disease data analysis and surveillance for the reduction of the systemic disease burden.

Using only VL incidence data and not introducing other data sources provides focus on what would be the outcome variable of a typical statistical analysis, independently of any other predictors that may be introduced. Refitting models at multiple resolutions or scales assumes that the outcome, in this instance VL incidence, follows the same distribution at each scale and/or resolution. For example, using a lognormal regression model with two resolutions assumes that incidence at both resolutions follows a lognormal distribution. However, not all states have distributions in the same family, and distributions following the same family have different parameterizations because of the likely differential importance of the underlying socio-environmental drivers. The latter point further motivates the use of Bayesian hierarchical models or other models, for instance statistical physics and/or information theoretic models, which are able to handle the information of scale and resolution controlling factors.

We show that the information theoretic RRIG can determine the amount of information needed to describe the data using different resolutions or scales. It can be used as an information theoretic tool for scaling (downscaling or upscaling, depending on the purpose) epidemiological data considering their value and underlying distributions.

An additional point of novelty is the use of the ZOIP and Gamma distributions to characterize VL incidence. Both distributions are uncommonly used for infectious disease incidence, despite closely fitting observed data. The ZOIP distribution offers the advantage of specifically fitting high frequencies of counts of one, describing single spurious cases. The Gamma distribution is advantageous for placing high probability on low values. More specifically, in terms of the statistical physics of disease ecology, the Gamma distribution has similarities to heavy tail distributions (for small shape and scale parameters), and ZOIP represents Poisson distributions highlighting local/random and medium-range disease dynamics. The higher statistical complexity (e.g. related to the number of parameters) of ZOIP reflects the random Poissonian nature of the disease with other factors, while the lower complexity of the Gamma reflects its simpler nature.

These analyses do not consider dependence on temporal resolution and scale, although time and space for stochastic processes relate to each other. The data in this study include yearly case counts; having smaller time units such as months would allow for such consideration. Additionally, distributions are assumed to remain constant over the 11 years of observation, given the very minor variations in the inferred distributions, which suggest that VL dynamics are at a stationary state.
Increases have been seen in VL cases over time [50], though case counts between 2000 and 2014 have remained more consistent compared to previous decades [51,52], indicating that these results are not likely to be sensitive to this assumption. Populations over this time period by municipality generally showed small changes. The mean change in population by municipality was an increase of approximately 11% between 2004 and 2014, and the middle 90% of changes were between a 12% decrease and a 41% increase [33]. These considerations motivate extensions of this study to define the relationship between space and time for scale dependent processes.

Another assumption made in this study is the ability to fit a single probability distribution for VL incidence for the entire nation of Brazil. Since not all of the included states are considered endemic for VL [31], fitting a single distribution for incidence nationwide assumes that the same distribution can represent incidence in both endemic and non-endemic states. This distinction should, however, be taken into account in the quantitative analyses that would follow from the results of this study. Other heterogeneities across the nation that may impact VL incidence, such as affluence, urbanization, or climate, are similarly not considered for distribution fitting but should be accounted for during subsequent analyses.

The results of this study rely on the data collected. VL case data were collected through passive surveillance and notification to the Ministries of Health. It is commonly known that reported cases of infectious diseases only represent a portion of the total cases [53-55], commonly representing the most severe cases. The accuracy of the data, and therefore of distribution fitting, is thus limited by the ability to report cases as well as by the potentially heterogeneous severity of VL cases. It is also likely that the amount of underreporting differs across locations in Brazil. The results of this study rely on the assumption that reported cases provide an adequate representation of disease burden. Furthermore, inclusion of both typical VL incidence as well as atypical VL incidence. This could potentially affect distribution fitting if underlying processes leading to typical and atypical incidence differ.

A limitation of this study is the reliance on the criterion for determining differences when comparing distributions and on the algorithm used for determining best fitting distribution families and parameters. There are numerous methods for performing both tasks, and different methods may lead to slightly different conclusions. The methods of this study do, however, use assumption-free criteria in order to generate the results. A sensitivity analysis was conducted to determine if the number of samples drawn to generate the empirical state-resolution distribution described in section 2.2.1 using Equation 2 might impact RRIG values, and it was found that using 1,000; 2,000; 5,000; and 10,000 samples yielded no distinct differences in RRIG values and no differences in interpretations and conclusions.
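The sample-size check described above can be sketched as follows. Every specific here is an assumption made for illustration: the Gamma shape and scale are hypothetical, the aggregation rule (a state's incidence taken as the rounded mean of equally weighted municipality draws) stands in for Equation 2, which is not reproduced in this section, and total variation distance stands in for RRIG.

```python
import random
from collections import Counter

def expected_state_pmf(n_samples, shape, scale, n_municipalities, rng):
    """Empirical pmf of a simulated state-resolution incidence.
    Each sample averages rounded Gamma draws over n_municipalities,
    assuming equal populations (an assumption of this sketch)."""
    counts = Counter()
    for _ in range(n_samples):
        draws = [round(rng.gammavariate(shape, scale))
                 for _ in range(n_municipalities)]
        # round() returns the nearest integer, with ties going to even
        counts[round(sum(draws) / n_municipalities)] += 1
    return {x: c / n_samples for x, c in sorted(counts.items())}

def total_variation(p, q):
    """Half the L1 distance between two pmfs stored as dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
SHAPE, SCALE, N_MUNI = 0.5, 4.0, 20  # hypothetical values
reference = expected_state_pmf(10_000, SHAPE, SCALE, N_MUNI, rng)
distances = {}
for n in (1_000, 2_000, 5_000):
    pmf = expected_state_pmf(n, SHAPE, SCALE, N_MUNI, rng)
    distances[n] = total_variation(pmf, reference)
    print(n, round(distances[n], 3))
```

Small distances across the sample counts would be the analogue, in this simplified setting, of the finding that 1,000 to 10,000 samples gave no distinct differences.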

Another important note is that this study used surveillance units of different sizes, examining aggregation of municipalities of differing land areas and populations and comparisons among states with different areas and populations. This results from using administrative districts, which remain useful because they are the units recorded in infectious disease surveillance. However, diseases know no political boundaries; an ecosystem-based discretization that defines homogeneous high-resolution units would be preferable for surveillance, such as one based on Digital Elevation Models from which physical ecosystem boundaries relevant to disease spread can be derived. This would also help in assigning responsibility for disease control to different political entities. A related topic of research is the existence of spatial autocorrelation in the data. Values of Moran's I using municipality-resolution incidence nationwide showed strong evidence of spatial clustering. This implies that disease dynamics are non-local, as already highlighted by differing fitted distributions across states, which is consistent with previous works [56,57]. Subsequent analyses on VL in Brazil would benefit from the use of methods that account for spatial autocorrelation. For the purposes of distribution fitting, finding distributional families that most accurately characterize incidence is of greater importance than determining a covariance structure that most accurately reflects autocorrelation. Determining clusters and covariance structures is an important component of analysis that follows the results of this study.

and spatial scale of VL surveillance data is of interest to both researchers and government officials for preparedness. Analyses using VL data should consider the findings of this study when planning analyses and controls related to disease processes or population incidence trajectories.
Surveillance agencies should note that accurate surveillance by municipality is important because measuring incidence by state alone does not offer an equivalent characterization, and while there do exist small areas with incidences that can describe those of the others, nationwide surveillance at high resolution remains important to account for the likely heterogeneity of processes contributing to VL burden. This applies to other diseases with incidences that depend on the scale and resolution of surveillance, which should be examined to determine whether this dependence exists.

Acknowledgements: The authors thank the Brazilian Ministries of Health for allowing the use of the Visceral Leishmaniasis data for this study. The authors also acknowledge the resources of the Minnesota Supercomputing Institute for computational aid.