Towards a simulation framework for optimizing infectious disease surveillance: An information theoretic approach for surveillance system design

Infectious disease surveillance systems provide vital data for guiding disease prevention and control policies, yet the formalization of methods to optimize surveillance networks has largely been overlooked. Decisions surrounding surveillance design parameters, such as the number and placement of surveillance sites, target populations, and case definitions, are often determined by expert opinion or deference to operational considerations, without formal analysis of the influence of design parameters on surveillance objectives. Here we propose a simulation framework to guide evidence-based surveillance network design to better achieve specific surveillance goals with limited resources. We define evidence-based surveillance design as a constrained, multi-dimensional, multi-objective, dynamic optimization problem, acknowledging the many operational constraints under which surveillance systems operate, the many dimensions of surveillance system design, the multiple and competing goals of surveillance, and the complex and dynamic nature of disease systems. We describe an analytical framework for the identification of optimal designs through mathematical representations of disease and surveillance processes, definition of objective functions, and the approach to numerical optimization. We then apply the framework to the problem of selecting candidate sites to expand an existing surveillance network under alternative objectives of: (1) improving spatial prediction of disease prevalence at unmonitored sites; or (2) estimating the observed effect of a risk factor on disease. Results of this demonstration illustrate how optimal designs are sensitive to both surveillance goals and the underlying spatial pattern of the target disease. The findings affirm the value of designing surveillance systems through quantitative and adaptive analysis of network characteristics and performance. The framework can be applied to the design of surveillance systems tailored to setting-specific disease transmission dynamics and surveillance needs, and can yield improved understanding of tradeoffs between network architectures.

control policies, yet the formalization of methods to optimize surveillance networks has largely 31 been overlooked. Decisions surrounding surveillance design parameters-such as the number 32 and placement of surveillance sites, target populations, and case definitions-are often 33 determined by expert opinion or deference to operational considerations, without formal analysis 34 of the influence of design parameters on surveillance objectives. Here we propose a simulation 35 framework to guide evidence-based surveillance network design to better achieve specific 36 surveillance goals with limited resources. We define evidence-based surveillance design as a 37 constrained, multi-dimensional, multi-objective, dynamic optimization problem, acknowledging 38 the many operational constraints under which surveillance systems operate, the many 39 dimensions of surveillance system design, the multiple and competing goals of surveillance, and 40 the complex and dynamic nature of disease systems. We describe an analytical framework for 41 the identification of optimal designs through mathematical representations of disease and 42 surveillance processes, definition of objective functions, and the approach to numerical 43 optimization. We then apply the framework to the problem of selecting candidate sites to expand 44 an existing surveillance network under alternative objectives of: (1) improving spatial prediction 45 of disease prevalence at unmonitored sites; or (2) estimating the observed effect of a risk factor 46 on disease. Results of this demonstration illustrate how optimal designs are sensitive to both 47 surveillance goals and the underlying spatial pattern of the target disease. The findings affirm 48 the value of designing surveillance systems through quantitative and adaptive analysis of 49 network characteristics and performance. The framework can be applied to the design of 50 surveillance systems tailored to setting-specific disease transmission dynamics and surveillance 51 needs, and can yield improved understanding of tradeoffs between network architectures. 52 53 Author summary 54 Disease surveillance systems are essential for understanding the epidemiology of 55 infectious diseases and improving population health. A well-designed surveillance system can 56 achieve a high level of fidelity in estimates of interest (e.g., disease trends, risk factors) within its 57 operational constraints. Currently, design parameters that define surveillance systems (e.g., 58 number and placement of the surveillance sites, target populations, case definitions) are 59 selected largely by expert opinion and practical considerations. Such an informal approach is 60 less tenable when multiple aspects of surveillance design-or multiple surveillance objectives-61 need to be considered simultaneously, and are subject to resource or logistical constraints. 62 Here we propose a framework to optimize surveillance system design given a set of defined 63 surveillance objectives and a dynamical model of the disease system under study. The 64 framework provides a platform to conduct in silico surveillance system design, and allows the 65 formulation of surveillance guidelines based on quantitative evidence, tailored to local realities 66 and priorities. The approach facilitates greater collaboration between health planners and 67 computational and data scientists to advance surveillance science and strengthen the 68 architecture of surveillance networks. 69 1 Introduction 70 Infectious disease surveillance systems provide vital information on patterns of disease 71 occurrence across space, time, and populations of interest, and ultimately provide the basis for 72 evidence-based disease control policy decisions [1]. Considerable progress has been made 73 supporting infectious disease control decision-making with computational approaches to 74 evaluate the outcomes of alternative decisions [2]. Examples include optimizing when, where, 75 and among which populations to allocate public health resources [3,4], determining the optimal 76 balance between multiple intervention approaches (e.g., case detection, treatment, vaccination, 77 and sanitation improvement) [5][6][7][8], and optimizing the start time, duration, and dose of drug 78 treatment programs [9,10]. In contrast, little attention has been paid to the development of tools 79 for improving infectious disease surveillance system designs, and formalization of methods to 80 optimize surveillance networks has largely been overlooked. 81 The 'design parameters', which are the high-level characteristics that define infectious 82 disease surveillance networks-such as the number and locations of surveillance sites, 83 sampling frequency for laboratory testing or community-based surveys, and selection of 84 diagnostic techniques-can greatly influence the degree to which the resulting surveillance data 85 serves public health objectives, including early detection of outbreaks (e.g., the coronavirus 86 disease outbreak in 2020) [ Furthermore, a generalized framework can inspire the application of quantitative surveillance 132 optimization across broader settings, resulting in system designs better aligned with local 133 realities and public health priorities. 134 2 Surveillance design as a multi-objective, multi-dimensional, 135 constrained and dynamic optimization problem 136 The search for optimal disease surveillance designs is a highly complex problem. This 137 results from the multiple, often competing goals of surveillance data collection, idiosyncratic 138 surveillance network design, the need to represent operational constraints that govern 139 surveillance systems, and the complexity and dynamic nature of diseases under surveillance. 140 Simple optimization problems involving a single design parameter and objective for a given 141 target disease-such as the optimal placement of a new surveillance site to maximize the 142 proportion of influenza cases detected-may be solved in relatively straightforward fashion by 143 testing all possible designs and choosing the design that generates optimal network 144 performance (e.g., the new site location that results in the highest proportion of cases detected 145 overall). However, surveillance network optimization quickly becomes non-trivial when the 146 design space increases (e.g., selecting 10 sites out of 200 alternative sites), when multiple 147 objectives (such as increasing case detection, improving spatial and temporal trend coverage, 148 and risk factor identification) are subject to simultaneous analysis and optimization, or when 149 optimization is subject to constraints regarding resource limitations and operational plausibility. 150 Uncertainty regarding the functioning of the epidemiologic system and shifts in patterns of 151 diseases further complicate matters. Hence, our optimization goals are multidimensional, 152 dynamic, and stochastic. In this section, we describe the relevance of surveillance objectives, 153 . CC-BY-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint sampling efforts may also be entered into the analysis, greatly expanding the dimensionality of 175 the problem. Moreover, the set of design parameters to optimize depends on the surveillance 176 goals. For example, when the surveillance goal is accurate estimation of the temporal trend of a 177 disease, it may be that the placement of sites is less important than sampling frequency. Table  178 1 shows examples of design parameters from select real world infectious diseases surveillance 179 systems, and their potential impacts on surveillance system performance. 180 181 . CC-BY-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10. 1101/2020

Target population
Population to be monitored for outcomes of interest.
Target populations representative of a general population provide a means of tracking overall disease incidence and trends in the population as a whole. Target populations informed by demographic differences in disease risk, transmission potential, or detection probability may provide advantages for monitoring outcomes in vulnerable populations, anticipating outbreaks, or tracking rare diseases.

Sampling strategy
Type of sampling used to identify cases among the target population.
Sampling strategies influence the representativeness of surveillance data, as well as the ability of surveillance systems to detect rare or underreported conditions. Strategies that adequately characterize a general population may be biased with respect to critical subpopulations, especially those facing stigma. Diagnostic tests and other related factors such as specimen types, the quality of the specimen, and the time from onset to specimen collection can influence the sensitivity and specificity of the surveillance system. Operational constraints. Operational restrictions on surveillance system designs-due 185 to budgetary, logistical, political and cultural considerations-add critical constraints to the 186 optimization problem. Absent constraints, the optimal design may be self-evident, e.g., sampling 187 at maximal frequency and intensity. Yet when there is a fixed budget for samples, the optimal 188 balance between design parameters-say, number of samples and sampling frequency-189 depends on the relative value of precise cross-sectional estimates of disease prevalence versus 190 characterizing disease incidence over time, which in turn depends on the specific objectives of 191 surveillance and the dynamics of the underlying disease system. 192

Isolation of
Dynamic and imperfectly understood disease systems. Surveillance systems must 193 respond to shifts in the epidemiology of target infections. Optimal designs will likely shift in 194 response to the evolution of underlying epidemiology and available knowledge. For instance, as 195 infections emerge, become endemic, or approach elimination within populations or 196 subpopulations, the goals of surveillance, and the resulting optimal designs, can (and must) 197 evolve alongside them. The dynamic nature of optimal surveillance design may be especially 198 important in emerging economies that are undergoing epidemiologic transitions. For instance, 199 as a region or nation approaches elimination of a particular infectious disease, surveillance 200 goals generally shift from enumeration of endemic cases occurring in the general population to 201 detection of nexuses of sporadic transmission. This may require new designs (e.g., shifting to a 202 more sensitive diagnostic test within a limited area, or increasing the coverage of 203 subpopulations involved in ongoing transmission), and adjustment of system objectives (e.g., 204 maximize detection of the few remaining cases instead of minimizing false positive detections). 205 Additionally, as cases caused by novel pandemics (e.g., the 2020 coronavirus disease 206 pandemic, or 2009 H1N1 pandemic) start to increase exponentially, surveillance systems may 207 need to switch from tracking individual cases to population-based surveillance (e.g., pathogen 208 testing for a proportion of patients with a non-specific syndrome) in order to monitor the 209 progression of the outbreak and develop mitigation strategies without depleting public health 210 resources. 211

212
The aforementioned challenges of surveillance optimization-multiple objectives, 213 combinatorial complexity of relevant design parameters, operational constraints, and dynamic 214 and uncertain epidemiology of target diseases-suggest the need for a generalized framework 215 for surveillance network optimization. Advances in computation for simulation-based studies 216 have benefitted many related fields, including optimal disease control [36-39], yet applications of 217 simulation optimization to the design of disease surveillance networks have scarcely been 218 pursued. In the following sections, we detail a simulation and optimization framework for 219 designing infectious disease surveillance networks, and demonstrate its application in a site 220 selection context. Our framework facilitates a quantitative approach to designing surveillance 221 systems tailored to local disease transmission dynamics and surveillance needs, as well as a 222 more general study of optimal network design principles under varying objectives and 223 epidemiological circumstances. 224 . CC-BY-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10. 1101/2020 Broadly, our framework ( Figure 1) allows for evaluation of surveillance system 225 performance across a predefined design space under different epidemiologic scenarios 226 (disease system model) and network characteristics (surveillance model). Numerical 227 optimization algorithms are applied to efficiently identify the region(s) of design space that yield 228 superior network performance based on one or more specific surveillance goals (simulation 229 optimization search). The optimization procedure (Figure 1 and Box 1) yields a set of network 230 designs (i.e., optimal design parameter values) that maximize performance with respect to the 231 specified public health goal(s), according to the specified data and models. 232 233 . CC-BY-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.04.06.20048231 doi: medRxiv preprint 234 235 . CC-BY-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.04.06.20048231 doi: medRxiv preprint 236 Figure 1. Schematic of surveillance system optimization. The surveillance system optimization 237 procedure uses data and knowledge about disease transmission and case ascertainment to identify 238 optimal surveillance designs with regard to predefined surveillance goals. First, a disease system model 239 is defined, using observed epidemiologic data and/or theory, and taking into account relevant factors 240 influencing disease dynamics or distribution. Multiple realizations of disease data ( ) may be generated to 241 explore optimal designs under uncertainty or variability of the underlying system (Section 3.1).

242
Furthermore, an ensemble of disease models can be combined to reduce the chance of model 243 misspecification. Next, a surveillance model is defined to represent how information on the state of the 244 disease system is captured as a function of design parameters and any other relevant variables (e.g., 245 factors known to affect the sensitivity and specificity of a diagnostic test, or estimated underreporting 246 rates for an area; Section 3.2). To initiate the optimization process, an initial design parameter set, 1 , is 247 drawn from the design space subject to operational constraints ( ) ≤ 0, ℎ( ) = 0 and, along with 248 underlying disease data , input to the surveillance model to generate a realization of surveillance 249 information, ℐ 1 = ℐ( 1 , ). The objective function, , is evaluated based on the disease data , and 250 surveillance information ℐ 1 (Section 3.3). If a stopping criterion (e.g., reaching a large number of iterations; 251 de minimis improvement in objective function) is not met, a new design parameter set, , is proposed 252 from the design space using metaheuristic search algorithms (e.g., simulated annealing, genetic 253 algorithm, particle swarm algorithm) when the design space is large, or enumeration when the design 254 space is small. This new design parameter set is then used to generate a new realization of surveillance 255 information and evaluation of the objective functions (Section 3.4). After a stopping criterion is met, 256 design parameter sets with the best objective function values are output as optimal surveillance designs.

258
Box 1. Surveillance System Optimization Procedure Input: Epidemiologic data and/or theory, surveillance performance data and/or theory, and other auxiliary data (e.g., disease risk factors) Output: the design parameter set with the highest/lowest (i.e., optimal) objective function value

Initialization:
Define a disease system model to represent the underlying dynamics of the target disease system in the spatial, temporal, and demographic context of interest Generate disease distributions as realization(s) of the system Sample initial design parameter set, 1 , within the design space subject to constraints ( ) ≤ 0, ℎ( ) = 0 Generate realization(s) of surveillance information, ℐ 1 , given and 1 Evaluate objective function(s) given ℐ 1 and while stop criterion is not met do Propose a new design parameter set, , within the design space using metaheuristic search algorithms or enumeration Generate realization(s) of surveillance information, ℐ , given and Evaluate objective function(s), , given ℐ and end while return the best design parameter set, ̂ (i.e., with the optimal objective function value) 259 . CC-BY-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.04.06.20048231 doi: medRxiv preprint 3.1 Specify and parameterize disease system model 260 An accurate representation of epidemiologic characteristics of the target disease(s) is 261 essential for a successful optimization. This representation can be generated using 262 observational data, outputs of mechanistic transmission models, or other approaches, and 263 represents the best estimate of the disease's epidemiology that is used to evaluate surveillance 264 network performance using objective functions (defined in section 3.3 Define objective 265 function(s), below). To avoid potential model-misspecification, an ensemble of disease models 266 and multiple realizations of disease models (i.e., with varying epidemiologic parameter values) 267 can be utilized in the framework. The structure of the disease system model output-such as 268 spatial and temporal resolution-should be tailored to the surveillance objectives and design 269 parameters. For instance, if a surveillance objective is to better estimate the spatial distribution 270 of a disease, the target disease data must include geographical information about cases. 271

272
In order to identify optimal network designs, a model representing key aspects of the 273 sampling of and extraction of information from underlying disease processes by the surveillance 274 system is needed. The surveillance model must represent the mechanisms through which 275 variation in network design parameters is expected to impact the epidemiologic information 276 obtained and thus governs optimization with respect to system objectives. Surveillance models 277 generally comprise a set of probability distributions relating target estimands to the underlying 278 disease distribution, conditional on network design and other relevant considerations. For 279 example, to optimize the diagnostic protocol for minimal bias in reporting, a surveillance model 280 may be constructed for the distribution of reported cases conditioned on diagnostic method, 281 background prevalence of the target disease and conditions with similar clinical presentation, 282 and the distribution across subpopulations of factors that impact diagnostic sensitivity and 283 specificity. When random errors contributed by surveillance processes are not explicitly taken 284 into account, as may be the case when seeking to maximize the size of the population covered 285 by a surveillance network, the surveillance model becomes a set of conditional Dirac delta 286 distributions, and is deterministic. During the process of surveillance model specification, 287 aspects of surveillance design that will be allowed to vary during optimization (i.e., the 288 parameters to be optimized), and those that will be fixed (i.e., design aspects that are relevant 289 to performance, but which it is not feasible or desirable to change) must be decided upon. 290 Surveillance models may be as granular (e.g., modeling the full sequence of events necessary 291 for each individual case to be reported) or abstract (e.g., modeling the overall proportion of 292 cases detected in a population) as is deemed necessary for the optimization procedure, 293 recognizing, however, that computational complexity may limit the feasibility of certain 294 representations. 295

296
Changes to design parameters can be analyzed in relation to their influence on network 297 performance in the context of specific surveillance system objectives. That is, performance is 298 evaluated with respect to achieving a specific goal or goals. This evaluation is formalized by 299 . CC-BY-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.04.06.20048231 doi: medRxiv preprint defining objective functions, which define the specific minimization or maximization problem to 300 be solved, based on the design parameters and surveillance goals of interest. Thus, network 301 performance is estimated through the iterative evaluation of objective functions, which are 302 minimized (or maximized) as the design parameter space is searched. To better characterize geographic, temporal, or demographic distribution of disease, the objective function may be expressed as: If bias in surveillance sampling and estimation is not a concern (e.g. for asymptotically unbiased estimators), then minimizing uncertainty may be the primary goal. Uncertainty can be represented by standard error, standard deviation, inter-quantile range, or other expressions.
To determine the effect of a risk factor on infection more precisely when assuming a linear relationship between the risk factor and disease rate, the objective function may be expressed as: = var(̂) ̂estimated regression coefficient of the effect of the risk factor on the disease rate from the ascertained data To forecast the peak case count more precisely, the objective function may be expressed as: = var( ) forecasted peak case count based on ascertained data overall or for a specific area Maximize loglikelihood If a probability distribution ~( , … ) can be expressed by the surveillance model, then maximizing the likelihood of true data D under the estimated distribution can be used to simultaneously address bias and variance.
To better estimate the effect of a risk factor on infection rates when assuming a linear relationship between the risk factor and disease rate, the objective function may be expressed as: if a normal distribution with a variance of σ 2 is assumed for the true effect of a risk factor . CC-BY-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10. 1101/2020 To improve estimation of outbreak probabilities, the objective function may be expressed as: , if outbreak probabilities in subsequent weeks are assumed to be conditionally independent.
̂estimated outbreak probability in time period indicator (0 or 1) for actual occurrence of an outbreak in time period

Maximize classification performance
When QI and QD are categorical, the performance of the surveillance system can be measured by classification evaluation metrics, such as sensitivity, specificity, positive predictive value, F1 scores, area under the receiver operating characteristic curve, etc.
To improve our ability to discriminate outbreaks from false alarms, the objective function may be expressed as the area under the ROC curve: πtpproportion of true outbreaks correctly identified πfpproportion of non-outbreak time periods falsely identified as outbreaks To improve our ability to detect a rare disease, the objective function may be expressed as the maximum of the average 1 score: proportion of true cases reported |proportion of reported cases that are true threshold condition for reporting a case, assumed in this example to represent a probability 311

312
The goal of the optimization process (while block in Box 1; the loop in Simulation 313 optimization search component of Figure 1) is to thoroughly explore the response surface of the 314 objective function(s) over the design space so as to identify designs likely to yield optimal or 315 near-optimal surveillance performance. Candidate surveillance designs are drawn from the 316 design space, and the expectations of resulting objective function values across realizations are 317 evaluated with respect to the simulated true and ascertained disease data; this process is 318 repeated iteratively until a stopping criterion is reached, e.g., the convergence on estimated 319 optimum(a); exhaustive sampling of the design space; or the exceedance of a computational 320 budget. When the design parameter space is small, exhaustive evaluation of objective function 321 values across the entire design parameter space is possible. Sufficient and efficient searching 322 of large design parameter spaces, by contrast, requires heuristic or metaheuristic optimization 323 algorithms (e.g., simulated annealing, genetic algorithms, particle swarm optimization, or 324 Bayesian model based optimization). 325 Multiple surveillance objectives can be optimized simultaneously through multi-objective 326 optimization approaches, such as through weighted sums of objective functions or Pareto 327 optimization [40]. Generating weighted sums of objective function values allows for the 328 specification of relative importance of different objectives. If one objective is less important, it 329 would be assigned a smaller weight when compared with other objectives. In this way, optimal 330 designs are not overly influenced by less important objectives. Pareto optimization outputs a set 331 of optimal solutions (Pareto optimal set) for which no other solutions can perform better under 332 all objectives. That is, improving the performance on one objective leads to worsening at least 333 one of the other objectives. Decision makers are then tasked with choosing the "best" design 334 . CC-BY-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint Here, we demonstrate an application of our surveillance system optimization framework 343 in the context of selecting candidate sites to add to an existing cross-sectional survey network. 344 We consider two surveillance design objectives-optimal prediction of the geographical 345 distribution of the disease (hereafter referred to as spatial prediction) and optimal estimation of 346 the effect of a risk factor (hereafter referred to as effect estimation). We demonstrate how 347 optimal designs can vary in relation to epidemiological characteristics of the target disease; in 348 this case, the rate of decrease in correlation of disease prevalence rates over distance, which 349 determines whether prevalence changes abruptly or smoothly over the spatial domain. 350 We first describe the demonstration setting, the data available for design optimization, 351 the specification and parameterization of the disease and surveillance system models, and the 352 resulting formalized objective functions for optimizing spatial predictions and effect estimation. 353 We demonstrate the use of an exhaustive search strategy to find the single most optimal site to 354 add to the existing network for both goals, as well as the Pareto-optimal set of single sites to 355 add when considering both objectives simultaneously. We simulate the addition of an arbitrary 356 number of sites, acknowledging that in real-world applications of the framework, the number of 357 sites might be determined by budgetary constraints and/or the marginal informational gains per 358 added site. We conclude our demonstration by considering the best set of three sites to add, 359 which introduces substantial combinatorial complexity, motivating the use of a metaheuristic 360 algorithm to efficiently search for optimal regions of design space. 361 362 363 We generated a set of 100 potential surveillance sites scattered uniformly at random 364 across a unit grid, and randomly selected 30 sites to represent a virtual existing surveillance 365 network. We seeded two point sources for a risk factor influencing expected disease prevalence 366 rates (Figure 2A), then simulated disease prevalence under two scenarios of spatial auto-367 correlation by adjusting the scale parameter ( ) of a log-Gaussian spatial process centered on a 368 linear function of the risk factor. We refer to these as spatially patchy ( = 0.1; Figure 2B) and 369 spatially smooth ( = 0.3; Figure 2C) disease scenarios. Additional details of data generation 370 are provided in Text S1. 371 372 . CC-BY-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Demonstration setting
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.04.06.20048231 doi: medRxiv preprint 373 Figure 2. Simulated data used for surveillance system optimization. Spatial variation of (A) 374 the risk factor X and (B) log prevalence when ρ = 0.1 and (C) ρ = 0.3. Triangles represent the 375 existing 30 surveyed sites; dots represent the 70 candidate sites; crosses represent two point 376 sources of the risk factor of interest (e.g. locations of mass gatherings); background color in 377 Panel A and contour lines in panels B and C represent the levels of risk factor X. Three 378 realizations of the log prevalence surface when ρ = 0.1 or 0.3 are shown in Figure S1. 379 380

381
Available epidemiologic data to characterize the relevant aspects of the disease system 382 include simulated prevalence rates observed at the 30 sites enrolled in the surveillance network. 383 Data characterizing the surveillance system and design space include the coordinates of the 30 384 enrolled and 70 candidate sites. Additional data to support optimization include levels of risk 385 factor X at each sampling location. Theoretical inputs include the assumption of a log-linear 386 relationship between X and disease prevalence, and that spatial disease prevalence residuals 387 follow a Gaussian process with exponential covariance function. 388 389

Set up and initialization 390
Disease system model specification and simulation. In this demonstration, relevant 391 aspects of the disease system include the correlation of disease outcomes over space, as well 392 as the association of disease outcomes with risk factor X. Based on the observed disease 393 prevalence at participating sites, we assume the log of the prevalence value Y is generated from 394 an underlying random spatial process with an i.i.d mean-zero normally distributed noise ε with a 395 variance of σd 2 , and can be modelled by: 396 Y = exp( 0 + 1 + + ), 397 where β0 represents log of the overall mean prevalence rate, β1 represents the effect of a unit 398 increase in risk factor , and η represents a mean-zero Gaussian process accounting for spatial 399 correlation induced by additional dependence not captured by X. The spatially correlated error 400 term η follows a multivariate normal distribution with a variance-covariance matrix C, in which 401 each entry cij represents the covariance between the residuals at the ith and the jth location 402 when i ≠ j, and the variance of the residuals at the ith location when i = j. Covariance between 403 . CC-BY-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10. 1101/2020 sites i and j is specified c ij = σs 2 e -d ij /ρ , where dij is the distance between sites i and j, and ρ is the 404 scale parameter as before; and the variance at site i is σd 2 + σs 2 . The correlation of the residuals 405 between two sites as a function of the distance between them is shown in Figure S1. 406 Parameters β0, β1, σs, σd, and ρ were estimated based on the prevalence rates and risk factor 407 levels at the 30 in-network sites, after which 1000 realizations of log-prevalence rates at the 70 408 candidate sites were drawn according to the fitted parameters, observed prevalence at in-409 network sites, risk factor levels at candidate sites, and distance matrix between in-network and 410 candidate sites. 411 Surveillance model specification.

425
Objective functions: Spatial interpolation. The first surveillance function we wish to 426 optimize is prediction of the geographical distribution of the disease. Therefore, we define the 427 objective function as the mean squared error ( where , represents the log prevalence rate at the jth unenrolled site in the kth disease 431 system model realization, while ̂, ( ) represents the predicted log prevalence rate at the jth 432 site after augmenting the existing network with sθ in the kth realization. We denote the objective 433 function value for this objective as OFV1. Other objective functions, such as the MSE of log 434 predicted prevalence rate at the existing 30 sites or across all 100 sites, can also be used. 435 Existing literature on optimal spatial design provides more options for relevant objective 436 functions [44][45][46]. 437 Objective functions: Effect estimation. Our second surveillance goal is precise 438 estimation of the effect of the risk factor X on the disease outcome, so the objective function is 439 formalized as: . CC-BY-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https: //doi.org/10.1101//doi.org/10. /2020 Search algorithms. When a single site is to be added to the network, the design space 446 is small enough to allow for enumeration of objective function values at all possible designs. 447 Therefore, the algorithm for proposing new designs simply steps sequentially through sites 448 { 31 … 100 }. However, when the optimization question is shifted to the best three sites to add, 449 the design space expands to 54,740 combinations, making sequential enumeration a 450 prohibitively expensive search strategy. Under these conditions, heuristic (greedy) or 451 metaheuristic algorithms play an important role in finding the optimal or near-optimal solution 452 within a reasonable amount of time [47]. Moreover, the evaluation of objective function values 453 across realizations can be paralleled to further reduce computational time. 454 We illustrate the use of a simulated annealing (SA) meta-heuristic algorithm popular in 455 spatial sampling network design [48,49] to more efficiently explore the design space when three 456 sites are to be added. In SA, a random initial design is proposed, after which, at each iteration, a 457 new design is sampled from the neighborhood of the current design and the objective function 458 value (OFV) for the new design is evaluated. Here, the neighborhood of a set of n sites to enroll 459 is defined as designs sharing n-1 sites with the current design. If the new OFV is superior to the 460 current OFV, the new design is accepted as the next design; otherwise, it is accepted with a 461 probability of − ∆ , where ∆ is the change in the OFV and T is a tuning parameter 462 analogous to temperature [50]. T decreases at a rate α after each iteration, causing SA to 463 accept deterioration in the OFV more frequently at the beginning of the run and rarely towards 464 the end. Probabilistically accepting worse solutions early in the search enables the algorithm to 465 escape local optima. For our demonstration, we set the initial temperature T0 and cooling rate α 466 separately for each objective and epidemiologic scenario, following suggestions by Sait and 467 Youssef [50], and set the stopping criteria is to be T≤10 -6 . We repeat the SA process 3 times to 468 examine the convergence of the result. 469 470

Optimal surveillance designs 471
Selecting one additional site to optimize spatial prediction. The mean squared error 472 of spatial predictions across unenrolled sites (OFV1) is minimized by enrolling sites that are in 473 close proximity to multiple out-of-network sitesespecially clusters of unmeasured sites at 474 long distances from existing network locations (Figure 3, panels A and B). These optimal 475 placements address informational gaps by enrolling sites that increase the average covariance 476 between measured and unmeasured locations, and tend to fall in areas close to several 477 unenrolled sites but away from the initially enrolled locations. Furthermore, the amount of 478 information that can be inferred from the same set of neighboring sites increases with the scale 479 parameter ρ. Thus, in the spatially patchy disease scenario, where the scale of spatial 480 autocorrelation is small, optimal placement occurs in the center of a tight cluster of unenrolled 481 sites (Figure 3, panel A). Under the spatially smooth scenario, the same cluster is correlated 482 with initially enrolled sites, and optimal site placement falls in the center of a loose cluster of 483 unmeasured sites located quite far from the initial network (Figure 3, panel B). Under the 484 parameter values used to generate demonstration data, there is no clear influence of risk factor 485 level X on site selection to optimize spatial prediction. 486 Selecting one additional site to optimize effect estimation. The variance of the effect 487 of risk factor X on log disease prevalence (OFV2) is lower when the value of X at the cell to be 488 . CC-BY-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10. 1101/2020 added lies towards an extreme of X's observed range and when the site to be added is relatively 489 uncorrelated with (i.e., distant from) initially enrolled sites (Figure 3, panels C and D). In the 490 spatially patchy disease scenario, where the scale of spatial autocorrelation is limited, optimal 491 site placement is dominated by the level of risk factor X, and the available site with highest X is 492 chosen ( Figure 3C). In the spatially smooth scenario, with an extended scale of spatial 493 autocorrelation , the correlation of outcomes between the site with the highest X level and 494 nearby initially enrolled sites results in selection of an alternative location where the value of X is 495 less extreme, but prevalence is expected to be more independent of previously observed 496 outcomes ( Figure 3D). 497 498 . CC-BY-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint .
499 Figure 3. Optimal site placement to augment a surveillance network for spatial prediction 500 or effect estimation under scenarios of spatially patchy or smooth disease distributions. 501 Black triangles represent initially enrolled sites, gray circles represent unselected candidate 502 sites, and the cyan circle indicates the optimal site to add to the network. White crosses 503 represent point sources for risk factor X. Raster colors represent objective function values for 504 hypothetical sites added across a regular 41*41 grid in order to visualize the response surface 505 in relation to initial network locations and the underlying risk factor. 506 507 . CC-BY-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10. 1101/2020 Single site selection based on multiple objectives. When simultaneously optimizing site 508 enrollment for spatial prediction and effect estimation, the output is a Pareto optimal set 509 containing designs that are considered equally optimal because no objective function value can 510 be improved without impairing the other objective function values. A set of six candidate sites 511 emerges for the spatially smooth disease scenario, including four alternative selections to the 512 optimal locations for each single objective (Figure 4). The Pareto optimal set for the spatially 513 patchy scenario includes only one non-dominated site in addition to the optimal locations for 514 either objective individually ( Figure S2). Since the solution given by Pareto optimization is not 515 unique, some way of reconciling the objective criteria, such as a weighted sum or expression of 516 total cost may be required to choose the optimal design. Notably, we did not incorporate cost 517 associated with adding sites in our analysis, but this could be accomplished by including a third 518 objective function representing the marginal information gain per added site. In this case, the 519 spatial prediction OFV, effect estimation OFV, and the cost-effectiveness OFV would be jointly 520 optimized. Black triangles represent initially enrolled sites, and gray dots represent unchosen candidate 529 sites. Background color in Panel B represents log prevalence when ρ = 0.3 using the same 530 color scheme as in Figure 2C, while contour lines represent levels of risk factor X. 531 532 Selecting three additional sites to optimize spatial prediction. As a final example, we 533 demonstrate the use of metaheuristic algorithms to search larger design spaces, applying 534 simulated annealing to select three additional sites out of seventy candidate sites 535 . CC-BY-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10. 1101/2020 simultaneously. Simulated annealing optimizations seeded with different initial designs 536 converged to the same best set of three additional sites to enroll for enhanced spatial prediction 537 under the spatially smooth disease scenario ( Figure 5). All three SA runs ( Figure 5A, colored 538 lines) converged to the same optimal design within 6,000 iterations. Given the parameters and 539 the stopping criteria we used, each run terminated after 8,630 iterations. Even with three runs, 540 the total number of objective function evaluations was 25,890, less than half of what would be 541 required if using enumeration. Figure 5B shows the location of the optimal three-site set. The 542 results effect estimation, as well as for the spatially patchy outcome scenario are shown in 543 Figures  triangles represent existing sites, blue triangles represent the optimal additional sites, and gray 550 dots represent unchosen alternative sites. Background color in Panel B represents log 551 prevalence when ρ = 0.3 using the same color scheme as in Figure 2C, while contour lines 552 represent levels of risk factor X. 553 554

555
Surveillance system designs that provide reliable, timely estimates of the spatial-556 temporal distributions of endemic and epidemic diseases, are critical to the efficient allocation of 557 resources for public health responses. However, opportunities to apply numerical optimization to 558 surveillance system design have heretofore been overlooked in the literature. In this paper, we 559 have presented a framework for surveillance optimization via simulation to enhance design 560 decision making and facilitate research into optimal design principles under uncertain or 561 . CC-BY-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.04.06.20048231 doi: medRxiv preprint changing epidemiological conditions. While we focus on surveillance of human disease, the 562 framework could also be applied to the optimization of vector or environmental surveillance. 563 The framework presented can arrive at improved surveillance system designs through 564 the incorporation of data and models of local disease transmission status, diverse surveillance 565 goals, resource and operational constraints, and by stimulating collaboration between health 566 planners, researchers, and software developers. However, it should also be recognized that the 567 rationality of the output optimal design will be highly dependent on the accuracy and relevance 568 of data or models used to represent disease and surveillance processes during optimization, as 569 well as the performance of the optimization search algorithm. There is much future work to be 570 done to develop and validate simulation models that can represent relevant case generating 571 and measurement processes accurately; to analyze the sensitivity of optimal design to the 572 specification of disease system models and changes in disease epidemiology; and to adopt 573 optimization approaches from related fields-such as environmental monitoring network design 574 and signal processing [51-53]-to disease surveillance design applications. 575 576