Disease control as an optimization problem

In the context of epidemiology, policies for disease control are often devised through a mixture of intuition and brute-force, whereby the set of logically conceivable policies is narrowed down to a small family described by a few parameters, following which linearization or grid search is used to identify the optimal policy within the set. This scheme runs the risk of leaving out more complex (and perhaps counter-intuitive) policies for disease control that could tackle the disease more efficiently. In this article, we use techniques from convex optimization theory and machine learning to conduct optimizations over disease policies described by hundreds of parameters. In contrast to past approaches for policy optimization based on control theory, our framework can deal with arbitrary uncertainties on the initial conditions and model parameters controlling the spread of the disease, and stochastic models. In addition, our methods allow for optimization over policies which remain constant over weekly periods, specified by either continuous or discrete (e.g.: lockdown on/off) government measures. We illustrate our approach by minimizing the total time required to eradicate COVID-19 within the Susceptible-Exposed-Infected-Recovered (SEIR) model proposed by Kissler et al. (March, 2020).


Introduction
The COVID-19 pandemic has already caused over four million deaths worldwide. The effects of the virus have been widespread and substantial, from the collapse of healthcare systems [1][2][3] to the enforcement of isolation and quarantine. In the case of Nepal, the national lockdown lasted for 120 days uninterrupted [4].
In these circumstances, identifying reliable and effective disease control policies is of utmost importance. Here by "policy" we mean a deliberate intervention intended to mitigate the effects of a disease as it runs its course. In much of the mathematical literature on epidemiology, the process of generating a policy is as follows [5]: (1) based on expert intuition, a number of suitable policies to control the disease are proposed; (2) the impact on the population of each of the considered policies is assessed through dynamical models of disease spread; (3) the outcomes of all policies are compared and a decision is taken as to which one is deemed to be the best.
The main advantage of this three-step process is that the final recommended policy is comprehensible, i.e., it can be interpreted and explained. Moreover, for simple on/off policies, one a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 can sometimes derive analytic results [6]. The process has, nonetheless, two disadvantages. First of all, the class of policies devised by an human could well be suboptimal, since the optimal policy (under some figure of merit) could be extremely complicated and counter-intuitive. Second, the method generally requires one to numerically simulate each policy, and so is inapplicable when the considered class of policies depends on many control parameters: due to the exponentially large number of conceivable policies, by the time one finds the optimal disease control policy, it would be too late to enforce it.
Other approaches for disease control rely on optimal control theory to identify a suitable policy (see, e.g., [7][8][9][10]). The starting point of all these works is that both the initial conditions (namely, the number of individuals infected, exposed, etc.) and the model parameters (such as the disease's reproduction number) specifying the spread of the disease are known with high precision. This requirement is never met in a real-life epidemic, especially close to the outbreak, when the uncertainty in the disease's reproduction number can be very high [11,12]. Policies derived through optimal control theory are thus not guaranteed to have the desired effects in practice.
An additional disadvantage of optimal control theory is that government measures in the model cannot be constrained to be discrete. It is true that some optimal control problems admit a discrete solution, a so-called bang-bang control, but one cannot enforce this property on the solution a priori. This makes optimal control theory all the more impractical, for discrete measures, such as a lockdown that is either on or off, have so far dominated global efforts to control the COVID-19 epidemic [13]. Last but not least, optimal control theory can only handle scenarios where the policies vary over time continuously, thus not making it compatible with observed government measures to control COVID-19, which for the most part have been applied on a discrete, weekly basis.
In this paper we introduce a general framework that maps any disease control scenario to an optimization problem. Contrary to the optimal control approach, our framework can accommodate constraints on the disease dynamics which must hold for whole regions of the initial conditions and the disease's model parameters. Our framework can enforce policies to be weekly and/or discrete. Invoking tools from optimization theory and machine learning [14], we propose efficient heuristics to solve the optimization problem and hence identify the government policy that best controls the disease.
To illustrate the power of our approach, we use these optimization techniques to generate long-term plans to fight COVID-19, under the assumption that the disease's dynamics are accurately captured by a variant of the Susceptible-Exposed-Infected-Recovered (SEIR) compartmental model [5] proposed in [15]. Our results confirm that optimal policies tend to be too complicated to be devised by a human.
In reality, most epidemiological models only provide short-term approximations to the spread of the disease, with long term projections becoming less and less reliable [16]. In addition, notwithstanding the enormous knowledge gathered since the initial COVID-19 outbreak, many questions remain to be answered regarding the correctness and accuracy of compartmental models such as SEIR: their basic assumptions (e.g., are recovered patients temporarily or permanently immune to the disease?); the actual value of the model's parameters (e.g., the basic reproduction number R 0 ); and the role of variables not modeled (e.g., age, geographic distribution, contact tracing policies, role of superspreaders).
In this regard, the goal of this work is not to propose a concrete government policy, but rather to present an efficient method to obtain an optimal one, given all the available information. To estimate the effect of our methods in a realistic scenario, we conduct a numerical simulation where we re-calculate the optimal policy plans every month, based on new, incoming data. The very final policy plan that we present requires less stringent physical distancing measures compared with the one which is not re-calculated every month. All the code used in our simulations is freely available at [17].

The framework in a nutshell
Our starting point is an epidemic that affects a closed population. This assumption is not limiting, since a large ensemble of population centres where individuals are free to commute can also be modeled as a closed system [18]. To gain an understanding of how the disease spreads, it is standard to divide the population into different sectors or compartments [5] (see Fig 1). One can define, e.g., the compartment of all those individuals who are currently infected. This compartment can, in turn, be sub-divided into different compartments, such as symptomatic/ asymptomatic. Once the number of relevant compartments is fixed, one can estimate the occupation of each of them and arrange the resulting numbers in a vector x. The disease is subsequently analysed by looking at how x changes with time.
In order to control or even extinguish an epidemic, governments can enforce a number of different measures: mass vaccination, physical (or social [19]) distancing measures, or even a full lockdown, are common examples of interventions aimed to fight the disease. When and to which degree such measures are applied is determined by the disease control policy. Consider a disease control policy based on vaccination campaigns, where the intervention consists of vaccinating a number of individuals per day, across all compartments, i.e., without distinguishing between susceptible, exposed, recovered, and so on. Let v(t) be the fraction of the total population vaccinated on day t. This function, between the initial and final times t 0 and t f (i.e. on the interval [t 0 , t f ]), determines the government's vaccination policy. Similarly, let s(t) take the value 1 if the country is in lockdown on day t and 0 if it is not. Then the government's  [15]). The main compartments are: susceptible; exposed; infected; hospitalized; critical and recovered. This compartmental splitting captures different possible evolutions as well as time delays between transitions. The "infected" compartment, for instance, contains those who will recover without hospitalization (I R ); those who will be hospitalized but won't need critical care (I H ); and those who will end up receiving critical care (I C ). The "exposed" compartment is introduced here to model the time delay between the exposure to the disease and the development of symptoms (incubation period), in particular, the possibility of infecting others, which is what is relevant for the model. In this model, the compartment "recovered" includes both dead and alive individuals; in principle, it could be sub-divided further.
If the government is intervening both through vaccination and lockdown, then its disease control policy will be identified by both functions v(t) and s(t). Note that, whereas v(t) can take a continuum of values, s(t) can only have finitely many. In the following, we will call the first class of interventions continuous; and the second, discrete. In general, a policy for disease control will combine both kinds of measures, but, as we will see, optimizing over one class or the other requires very different techniques. The above are instances of non-adaptive policies for disease control, because the functions s, v only depend on the time t, and not, e.g., on the current death rate. A general adaptive policy for disease control would take into account the whole past history of data gathered by the government before deciding what to do at each step. Although in the following all our proposed policies are non-adaptive, the formalism we introduce allows one to optimize over adaptive policies as well.
In conclusion, a disease control policy can always be associated to a time-dependent vector function α and perhaps some other observed variables o, where each vector entry represents a type of government intervention at time t. In turn, we can use a variable vector μ 2 R n to parametrize the class of considered policies, that is, α(t, o) = α(t, o;μ). Since μ completely determines the policy α, we can also regard the parameters μ as the disease policy. We will do so from now on.
The applied policy μ is assumed to influence the compartment occupation within the time interval [t 0 , t f ]. That is, x is both a function of t and μ. In this work we are interested in devising policies for disease control which guarantee that the spread of the disease evolves under certain conditions. For example, any country has a fixed number of critical care beds, which we denote by B c . At each time t 2 [t 0 , t f ], it is desirable that the number of individuals admitted to critical care in hospitals, CðtÞ, does not exceed that capacity. That is, we require that for t 2 ½t 0 ; t f �: ð2Þ We will call any such condition on the policy, or on its effect on the evolution of the disease, a constraint. Further examples of relevant constraints are the requirement that the number of planned vaccinations does not exceed the government's total supply, or that the death rate does not surpass a given threshold. Finally, among all policies satisfying the desired constraints, we typically wish to identify the one that minimizes a certain quantity. For example, for many diseases, a simple physical distancing policy that satisfies the constraint (2) consists of declaring a lockdown throughout the whole time interval [t 0 , t f ], i.e., s(t) = 1 for t 2 [t 0 , t f ]. This policy is arguably impractical, difficult to enforce and harmful to its citizens' psychological health, as well as to the national economy. More rationally, one is interested in finding alternative disease policies which, while respecting the critical care occupation constraint, minimize the number of days of lockdown. Alternatively, one may wish to minimize the total number of deaths during the interval [t 0 , t f ], or the number of infected people. In general, the figure of merit or quantity that we wish to minimize will be a complicated functional of the considered policy α and x. We will call this functional the objective function. A sufficiently complex objective function, together with the appropriate optimization constraints, can meet any conceivable set of societal demands.
The optimization problem sketched above is mathematically ill-defined, unless we specify how x varies with the parameters μ determining the policy. In order to predict the natural course of a disease or how a given policy might affect its spread, experts make use of mathematical models. In this paper, we will mainly be concerned with deterministic compartmental models, but we will show also how our method extends to stochastic models. In these models, the whole population is divided into a number of basic compartments and the interactions between those compartments are modeled, in the deterministic case, through a system of ordinary differential equations. Given the occupation x 0 of the compartments at time t 0 , these models allow us to compute the value of x at any instant t 2 [t 0 , t f ] as a function of the policy μ.
That is, each model provides an implicit functional relation of the form Past literature on disease control has made extensive use of compartmental models to recommend specific strategies for policy-makers in an effort to control the spread of a disease. In many cases, suitable policies are devised through a mixture of intuition and grid search, see [15,[20][21][22]. The starting point is a family of disease control policies with one or two unknowns. For instance, in pulse vaccination [21], those unknowns are the time intervals between two mass vaccinations and the vaccination rate. In this paper, we propose a scheme that allows for the optimization over policies specified by thousands of parameters in the space of a few hours. Our scheme, detailed in the following sections, is based on a standard tool in optimization theory and machine learning known as gradient descent [23,24]. Starting with a rough guess for the optimal policy μ (0) , gradient descent methods generate a sequence of policies μ (1) , μ (2) , . . . which typically exhibit increasingly better performance. Although the gradient method is not guaranteed to converge to the optimal policy, after many iterations it generates solutions that are good enough for many practical purposes. In fact, gradient descent is the method most commonly used to train deep neural networks [14] and support vector machines [25].
Importantly, the computational resources required to carry out the gradient descent method are comparable to the cost of running a full simulation between times t 0 and t f with the considered disease model. Furthermore, the necessary computations can be parallelized for policies depending on many parameters. Although the focus of this paper is on compartmental disease models, our main ideas can also be used to understand ecological systems undergoing more complex dynamics [26], see Appendix D. Even in such complicated scenarios, the former scaling laws hold: provided that we can run the considered disease model, we can apply the gradient method to optimize over policies of disease control.

The gradient method
We next introduce the gradient method and show how one can use it to tackle an abstract optimization problem. Given functions f ; g i : R n ! R, for i = 1, . . ., K, consider the following task: Each of the conditions g i (μ) � h i is called a constraint.
If f is sub-differentiable, a simple heuristic to solve this problem is the projected gradient method [23]. Call μ ? the solution of the problem. Starting from an initial guess μ (0) , the gradient method generates a sequence of values (μ (k) ) k with the property that lim k!1 μ (k) = μ ? , provided that M, f are, respectively, a convex set and a convex function [23]. The sequence (μ (k) ) k is generated recursively via the relation where, for any set A 2 R n , π A (z) denotes the point y 2 A that minimizes the Euclidean distance, i.e., min y2A ky − xk 2 . Unfortunately, in the problems we encounter in this paper, f is not convex and, sometimes, neither is M. This implies that the sequence output by the projected gradient method is not guaranteed to converge to the solution of the problem, but to a local minimum thereof.
In machine learning, optimization problems with non-convex objective function f and M ¼ R n are legion. To solve them, deep learning practitioners typically use variants of the gradient method sketched above. One of these variants, Adam [27], is extensively used to train neural networks.
Adam works as follows. Starting with the null vectors m ð0Þ ; v ð0Þ 2 R n , vector sequences are generated according to the following iteration rule: where G (k) � G (k) denotes the vector of the element-wise product; similarly, the fraction and square root in the definition of μ (k) are defined element-wise. Recommended values for the free parameters �, b 1 , b 2 , δ are � = 0.001, b 1 = 0.9, b 2 = 0.999 and δ = 10 −8 [27]. Like the projected gradient method, Adam is not guaranteed to converge to the optimal solution of the problem. However, provided that the initial conditions and the learning rate � are chosen with care, Adam has been observed to typically output a local minimum that is "good enough".
Note that, by taking M ¼ R n , it is not clear how to enforce that the solution satisfies constraints of the form g i (μ) � h i . The answer is to include those constraints as penalties in the objective function. That is, rather than minimizing f, we apply Adam to minimize the function where ν i � 1 and z + denotes the positive part of z, i.e., z + = z for z > 0; otherwise, z + = 0. For high enough values of {ν i } i , the solution of the problem will just violate the constraints slightly, i.e., h i − g(μ ? ) 2 [−δ, 1), for δ � 1. If no violation whatsoever is desired then one can instead optimize over a function of the form with δ 0 > 0. In some situations, the objective function f will be complicated to the point that computing its exact gradient is an intractable problem. It might be possible, though, to generate a random vectorr μ f 2 R n with the property In such a predicament, we can solve the original optimization problem (4) through stochastic gradient descent methods [24]. Stochastic gradient descent consists of applying the considered gradient method, with the difference that, every time that the method requires the gradient of f, we input the random variabler μ f 2 R n instead. Namely, it suffices to replace r μ f(μ (k−1) ) byr μ f ðμ ðkÀ 1Þ Þ in the iterative Eqs (5) and (6). As before, if both M and f are convex, stochastic gradient descent methods are guaranteed to converge to the optimal solution of problem (4) [24].

Overview of techniques
Our novel contribution is to show how to apply the gradient method to any disease control scenario and optimize over discrete or continuous response policies. In both cases the task is first to identify a suitable functional A, that is to be minimised, and then to show how to compute the gradient of it, which appears in the first line of the Adam algorithm in Eq (6), such that the algorithm can step to the next iteration.
In a discrete policy, the parameters describing the government intervention can only take a finite number of values. Eq (1) is an example of a discrete policy, because on day t the government can either declare a lockdown (s(t) = 1) or declare no lockdown (s(t) = 0): we allow for no values of s(t) inbetween.
On the other hand, a continuous policy is one in which the parameters that define the policy, which in general we denote by the vector μ, are allowed to vary over a continuum. For example, the fraction v(t) of the population vaccinated on day t can take any value between 0 and 1.
Regardless of whether we chose to optimise over discrete or continuous policies, our starting point is an ordinary differential equation of the form where the entries of the vector x represent the occupations of the different compartments of a disease model and μ represents a continuous or discrete parametrization of the effects of a given policy. Let xðt; � μ; x 0 Þ be the solution of Eq (10) with initial conditions x(0) = x 0 and μ ¼ � μ, where the initial policy condition is typically taken to be 'lockdown off'. We consider the problem of finding the parameters μ ? such that x(t;μ ? , x 0 ) minimizes a given functional A. This functional defines how we wish to control the disease and what for: it might represent the number and duration of lockdown periods, etc. If the initial conditions x 0 are known, then it makes sense to consider functionals of the form where, as discussed in Section 3, constraints which we wish the solution x(t;μ ? , x 0 ) to satisfy are incorporated into Lðt; μ; xðt; μ; x 0 ÞÞ. For instance, if we want that the number of patients in critical care does not exceed the number of beds, i.e. Eq (2) to hold, then one of the terms in L would be with ρ(t) � 1. Similarly, to model a constraint on the vaccine supply of the form with λ � 1.
As it turns out, minimizing functional (11) (or more complicated ones, see Appendix B) requires different techniques depending on whether the desired policy is continuous or discrete.

Continuous policies
If all the parameters μ defining the policy are allowed to take arbitrary values in R n and their effect on the disease's equations of motion (26) is differentiable, then one can simply apply gradient descent methods to minimize A(μ, x 0 ). The only difficulty stems in computing r μ A (i.e., the first line of the Adam algorithm in Eq (6)). For functionals of the form (11), we have In order to evaluate this quantity to step the algorithm, the challenge is to compute the derivatives @x i @m j . In Appendix B we show that these derivatives arise as the solution of a system of ordinary differential equations, of complexity comparable to that of Eq (10). We also explain how to use stochastic gradient descent to deal with scenarios where the model parameters (e.g., the basic reproduction number R 0 ) and/or the initial conditions x 0 are only known to lie within some bounds, or when the evolution is stochastic. In either case, once a method to compute (14) is established (or, in the case of uncertainty in the initial conditions, some unbiased statistical estimator thereof (see Appendix B)), one can simply run the vanilla gradient descent (5) or the Adam algorithm (6) to arrive at a quasi-optimal policy.

Discrete policies
When some or all the entries of μ are restricted to take values on a finite, fixed set, it is clear that one cannot apply gradient descent methods straightforwardly in order to minimize the objective function. In fact, optimizations over discrete variables are, in general, a very difficult endeavor: it can be argued that problems which appear simple do not have an efficient solution [28]. Nonetheless, in Appendix C we present two heuristics to tackle such optimizations in the context of policies for disease control.
The first heuristic consists of mapping the original minimization problem to an optimization over probabilistic policies, whereby the government decides how to intervene by sampling a discrete probability distribution, which is continuously dependent on a number of auxiliary (continuous) parameters. By applying stochastic gradient descent methods over such auxiliary parameters, we can find the probabilistic policy that minimizes the average value of the objective function. As shown in Appendix C, independently of the initial conditions, the gradient method is guaranteed to converge to a deterministic policy.
The second heuristic is tailor-made for optimizations over lockdown (discrete) policies, although it can be easily extended to deal with arbitrary discrete policies. Consider a scenario in which the government is allowed to declare or lift a lockdown at any time. In this case, we can parametrize the policy by the duration of each lockdown phase and the pauses in between, and use gradient descent methods to arrive at the optimal times. In Appendix C we show that estimating the correspondent gradients only requires a minor modification of the methods developed for continuous policies.

Application: COVID-19
To illustrate the use of gradient descent for policy design, we consider a variant of the extended "susceptible-exposed-infected-recovered" (SEIR) disease model proposed in [15] to predict the impact of COVID-19 in the USA, a schematic of which is shown in Fig 1 and the details of which appear in Appendix A. Each state variable X in the diagram corresponds to the fraction of the whole population N pop in the corresponding diagram. From now on, we denote intensive quantities like X with a normal mathematical font, while extensive quantities X ¼ X � N pop are to be represented with calligraphic letters. The clinical parameters of the model in Fig 1, such as recovery and hospitalization rates, were estimated in [29,30], based on early reports from COVID-19 cases in the UK, China and Italy. Following the model in [15], the disease transmission is assumed to be seasonal by analogy with the known behavior of betacoronaviruses such as HCoV-OC43 [30], with a baseline reproduction number between 2 and 2.5, following fits of the early growth-rate of the epidemic in Wuhan [11,12].
A relevant compartment in this model is CðtÞ � C C ðtÞ � N pop , the population occupying a critical care bed at time t. The patients sent to critical care cannot breathe unassisted, and thus it is fundamental to ensure that such capacity is not surpassed, namely, that the constraint (2) holds. For our simulations, we chose a population size of N pop ¼ 47 million and B c ¼ 9:5 � 10 À 5 � N pop . That is, we assumed that the healthcare system provides 95 critical care beds per million inhabitants. This is a good approximation to the healthcare capacity of many European countries, as well as the USA.
According to the chosen model, without intervention, the number of citizens requiring a bed in a critical care unit evolves according to Fig 2 (see Appendix A for the exact initial conditions of our numerical simulations). As the reader can appreciate, between the third and seventh months, the number of people in need of critical care exceeds the capacity of the considered healthcare system by 18 times.
Vaccines against COVID-19 were not available during the first year of the pandemic. Hence initially most governments opted to control the disease via distancing measures and/or lockdowns. The effect of implementing a policy s(t) is to multiply the disease's basic reproduction number R 0 by a factor of r [15,30] Depending on the discrete or continuous nature of the intervention, we shall consider two kinds of such non-pharmaceutical policies. On one hand, we will speak of lockdown policies when s is only allowed to take the values {0, 1}; those correspond to situations in which a lockdown is either on or off. For discrete s(t) as in Eq (1), the effect of a lockdown (s = 1) results in the reduction of the transmission rate r ¼ � r in Eq (15). On the contrary, when the population is free to interact (s = 0) then r = 1 and there is no change in the disease's basic reproduction number.
On the other hand, we will speak of physical-distancing policies when s is allowed to take any value in the interval [0, 1]. Continuous values of s(t) correspond to intermediate measures (for example mandatory face masks, suspension of sport events, remote working, school closures), the effect of which can be tentatively estimated from available data [29].
If distancing measures are the only type of intervention that a government uses, then a non-adaptive continuous policy is fully determined by the function s(t;μ) 2 [0, 1]. To begin with, we will assume that the government can only declare new measures at the beginning of each week, i.e., that s(t) does not vary within the weekly intervals t 2 [7k, 7(k + 1)] ≕ I k , for k 2 Z. With these conditions, s(t) can be expressed as where w I k ðtÞ is the characteristic function of week number k, i.e., w I k ðtÞ equals 1 if t is in the kth week; and 0, otherwise. The characteristic function, thus, ensures that there is only one policy s(t) per week. sðyÞ ¼ 1 1þexpðyÞ denotes the sigmoid function, which guarantees that s(t) 2 [0, 1] by continuously mapping the variables fs k g k , such that, for t 2 [t 0 , t f ], s(t;{s k }) is everywhere differentiable, making it amenable to the gradient method. The parameters to optimize over are μ � fs k g k , since they fully define the government's disease policy.
In 2021, several vaccines against COVID-19 successfully passed clinical trials, and nowadays many governments are fighting the disease through national vaccination campaigns. We model the effect of a COVID-19 vaccination campaign by assuming that: a) the government vaccinates the population across all compartments, i.e., without distinguishing between susceptible, exposed, recovered, etc.; b) just susceptible individuals benefit from the vaccine; c) all such vaccinated individuals become immune to the disease. The reader can find the explicit model in Appendix A. Obviously, this model is a gross simplification of the effect of COVID-19 vaccines currently in the market, which on one hand varies depending on the particular vaccine and the prior exposure to COVID-19, and on the other hand does not seem to always confer full immunity to the disease [31]. More complicated vaccination models can be built to properly assess a government on vaccination policies, but, for the sake of illustration of our techniques, this simplified model will suffice. We place a cap of Λ = 0.0011 days −1 on the fraction of the population that can be vaccinated per day. We also assume that changes in the vaccination rate can only be made weekly. This brings us to parametrize the vaccination rate via the function Finally, we assume that the government holds a supply of vaccines for just a third of the total population. This means that, together with (2), we need to enforce the extra constraint We achieve this by adding the term (13) to the objective function, with V ¼ 1 3 , λ = 10 4 . Without additional constraints, this model represent the case of a vaccine with 100% efficacy. A reduced efficacy, let us denote it by e < 1, can be straightforwardly modeled simply by multiplying the vaccination rate v(t) and the vaccine supply V by e.
In all our examples, we never allow the number of people in the critical care compartment to exceed the maximal occupancy, i.e., we impose the constraint in Eq (2). To enforce it, we add the term (12) to the objective function, with rðtÞ ¼ 100 B c . Finally, in order to solve the differential (29), we use the Euler explicit method (30) with step size δ = 1.0 to generate all our plots, apart from in Figs 6 and 7, where we use δ = 0.1.
Next, we use the gradient method to derive the optimal interventions to control COVID-19 in a number of disease scenarios.

Fighting COVID-19 via physical distancing measures
We first consider policies exclusively based on non-pharmaceutical interventions; more concretely, continuous weekly policies s(t) of the form (16). For starters, we tackle the problem of quickly steering the population towards herd immunity, all the while respecting the constraint (2) on the critical care capacity. Our goal can be captured by a functional of the form (11), witĥ AðμÞ ¼ 0 and Lðt; μ; xðt; μ; x 0 ÞÞ ¼ jSðtÞ À S h j; ð19Þ where S (as depicted in Fig 1) is the component of x that denotes the proportion of individuals in the population who are susceptible to the disease. S h ¼ 1 R 0 is the proportion of susceptible individuals required for herd immunity to be guaranteed; thus, once S(t)<S h , although people will continue to become infected, the natural evolution of the disease will be such that the rate and number of infected quickly dies out. Fig 3 illustrates the results of optimizing the objective function in Eq (19) for continuous physical-distancing measures. The plot shows both the critical care occupancy CðtÞ and the total number SðtÞ of susceptible individuals between times t 0 and t f = t 0 + 3 × 365. The number of susceptible individuals reaches S h on day 864, after which the disease can be considered extinct. As the reader can appreciate, the critical care occupancy curve never surpasses the critical value B c .
The results in Fig 3 are achieved at the cost of enforcing constant physical-distancing measures throughout the considered time frame. Naturally, the next problem we consider is that of minimizing the aggregate economic cost E associated with the physical-distancing measures implemented by the government over the first two years of the disease (while respecting the critical care capacity constraint (2)). Thus we take an objective function of the form (11) This plot shows the critical care occupancy CðtÞ, and also the physical-distancing measure s(t) between times t 0 and t f = t 0 + 2 × 365. The aggregate cost of the optimal policy is equal to the economic cost of sustaining a full lockdown for 294 days. As anticipated, the optimal policy found by the computer is very complicated. Notice that the critical care occupancy grows quickly towards the end of the plot. The reason for this is that we asked the computer to minimize the total time in lockdown over a fixed period of two years, which is exactly what it did: it minimized the physical-distancing measures in the first two years, with complete disregard for what could happen next. To overcome such a pathological case, one can introduce additional terms into Eq (20) to make the final slope of the curve CðtÞ less steep, or even decreasing.

Fighting COVID-19 via lockdowns
In some circumstances, the only distancing measures considered by governments are discrete: lockdown on or off, as in Eq (1). Let the objective function be given by Eq (20), with EðsÞ ¼ s, i.e., we wish to minimize the total time under lockdown. As stated earlier, optimizations over discrete government interventions cannot be carried out directly with the gradient method. We conduct them instead with the two heuristics proposed in Appendix C.
The first heuristic allows optimizing over weekly on/off confinements, and its results are shown in Fig 5. This time the critical care occupancy curve touches the critical care capacity just once, after the end of year 1. The reason for this is that we demanded lockdowns to last exactly one week: had we allowed the government to declare a lockdown on any day of the week, the computer would have found a tighter solution, with every peak of the red curve touching the dashed line. Even under this discrete weekly simplification, the solution found by the computer is non-trivial: it requires the government to declare a lockdown 27 times (as we will soon see, one can limit the total number of lockdowns in the final policy by adding

PLOS ONE
Disease control as an optimization problem constraints to the optimization problem). The total length of the lockdown in the course of two years is 371 days.
In the second discretization method, the disease control policy is continuously parametrized through the vector μ = (t 1 , t 2 , . . ., t 2N ), and lockdown is assumed to take place within the time intervals [t 1 , t 2 ], [t 3 , t 4 ], . . .. In this parametrization, lockdowns can be declared or lifted at arbitrary times within [t 0 , t f ], and not only on Mondays, like in the first heuristic. This second discretization method has the advantage of allowing one to set the maximum number N of lockdowns throughout the period [t 0 , t f ]. For N = 9, the corresponding critical care occupation and lockdown graphs are shown in Fig 6. The total length of the lockdown is 338 days.

Fighting COVID-19 via limited lockdowns and vaccination.
We next study the effect of vaccination campaigns in reducing the total time spent in lockdown. To ensure that the final recommended policies are easy to implement, we place the limit N = 5 on the maximum number of lockdowns.
If we allow the government to declare or lift lockdowns at any point in time, the second heuristic outputs the policy depicted in Fig 7. Note that the optimal policy does not require 5 lockdowns, but 4. The policy involves vaccinating the population at maximum rate since the very beginning of the government intervention until about day 200, after which the vaccination rate gradually drops to zero. The total span of the required lockdown is 186 days. This must be compared with the 338-days lockdown required when we allow for N = 10 lockdowns, but no vaccines are available. Even at such a slow vaccination pace, and under a shortage of supplies, the effects of a vaccination campaign prove to be very impressive.
Let us now see how the picture changes when the government is not allowed to enforce an intervention in the middle of the week. Enforcing a maximum number N of lockdowns is not automatic when we deal with the first heuristic; it requires us to add an extra constraint to the

Fig 6. Occupation of critical care beds (red) and lockdown measures (blue) for a period of two years.
Optimization over deterministic policies with arbitrary initial and final times for each lockdown period. A total of 9 lockdown periods has been fixed prior to the optimization. Notice that the first lockdown period, around day 60, has been basically removed by the optimization procedure.  Fig 5). As the reader can appreciate, the time spent in lockdown is considerably higher when we require lockdowns to be enforced or lifted on Mondays than when we allow the government to intervene at arbitrary times.

Dealing with uncertainty
In practice, the predictions of any mathematical model for a physical system will not be perfect for a number of reasons. First, basic parameters of the model, such as the transmission rate or the initial occupation x 0 , are only known up to approximations. Even if reality were exactly described by a particular mathematical model, small errors in such parameters would accumulate in the long run, making long-term predictions unreliable. Second, reality is never exactly described by mathematical models: on the contrary, any tractable disease model is, at best, a rough approximation to reality. Consequently, even the most successful disease models in the market cease to deliver solid predictions beyond 4 weeks [16].

Uncertainty in the parameters.
These considerations make us question how practical a two-year disease control policy really is. Consider the physical-distancing policy depicted in Fig 4, which was obtained by applying the gradient method to the SEIR model in [15]. Here,  Table 1 of Appendix A. In reality of course, the values of the parameters are never all equal to their averages, so we proceed to generate sets of parameters with some fluctuations. Let Δν be the vector with entries given by the difference between the upper and lower bounds of all the entries of the table, and suppose that the actual parameters of "reality" are unknown and uniformly distributed in the region of values N a ¼ n : À a Dν 2 � ν À � ν � a Dν 2 � � , where a can be interpreted as the amount of noise or uncertainty. How robust is the afore-mentioned policy to uncertainty in the initial parameters? Fig 9 shows the result of generating 1000 independent parameter samples from the region N 0:05 , corresponding to a 5% uncertainty, with respect to the given interval of values, and running the corresponding models for the optimal physical distancing policy in Fig 4. As one can see, for some sampled values of parameters, the critical care capacity of the healthcare system is exceeded. This is not surprising, since the policy depicted in Fig 4 was devised to perform well under the assumption that ν ¼ � ν and not ν 2 N 0:05 . In order to tame the behavior in the plot in Fig 9, we use stochastic gradient descent to minimize the average value of the objective function A assuming a uniform distribution of ν over N a , as explained at the end of Appendix B. By adding to A sufficiently strong penalties for the violation of each optimization constraint, we make sure that such constraints will approximately hold for most of the points in N a . Fig 10 shows a lockdown policy minimizing the physical distancing measures under the condition that constraint (2) holds for different values of n 2 N 0:05 , i.e., the critical care capacity is not exceeded. This time, the violation of condition (2) is neither so extreme nor so frequent. This comes, however, at the cost of enforcing physical distancing measures with a cost equivalent to 331 days of lockdown. Repeating the optimization for N 0:25 , Fig 11, we see that the critical care capacity is rarely surpassed. However, this time the total cost is equivalent to 414 days of lockdown.

PLOS ONE
This result is what one would have expected. As time goes by, the predictions of the disease model for different values of ν diverge: any policy that aims to satisfy constraint (2) for large ranges of these parameters will necessarily require extensive physical distancing measures.
In practical policy-making, graphs such as Figs 10 and 11 should not be understood to represent the actual physical distancing policy, but rather to provide a provisional policy plan. A policy plan gives a recommendation for action for the immediate future, given the current knowledge of the disease. In Fig 10, the policy plan is advising not to declare physical distancing measures in the first weeks. That is the measure that the government should adopt then. After a first time period, say four weeks, more data will have been gathered: this will allow us to obtain a better estimate of the parameters ν, and then re-run the models for another two years ahead. The measure to enforce should then be whatever the new policy plan recommends for the following four-week time period. The process is then repeated.
To test how this idea would perform in practice, we consider a scenario where the parameters defining the disease model are unknown, but the region in parameter space in which they live shrinks every month (to be precise, we used a 28-day period, corresponding to four weeks). That is, at month k, the government is informed that the parameters ν satisfy n 2 N 0:25= ffi ffi k p . Every four weeks, the policy is recalculated to minimize the physical distancing measures for the rest of the two years ahead, using the range n 2 N 0:25= ffi ffi k p . The final curves for the critical care occupation and the physical distancing measures are shown in Fig 12 for ν, in a sequence of shrinking regions N 0:25= ffi ffi k p , for each month k (red region) and in the case of a fixed uncertainty region corresponding to the last month, i.e., N 0:049 (inner dark blue region). The total cost of the physical distancing measures is equivalent to 358 lockdown days. This has to be compared with the cost of 414 days predicted by the initial policy plan under the assumption n 2 N 0:25 . In principle, one could further decrease the total planned cost by devising adaptive policy plans, where the measure to be taken at each moment depends not only on the current time t, but also on the past history of physical distancing measures and their observed effects. In fact, we tried optimizing over generic adaptive policies described by a continuous version of a neural network architecture known as Long Short-Term Memory (LSTM) [32]. In all our numerical experiments, such simple LSTM architectures could not improve the performance of nonadaptive strategies, but this could be due to ineffective training on our side.

Nondeterministic models.
So far, all models considered in the optimization are deterministic, namely, given the model parameters and initial conditions, the compartments evolve along a unique trajectory. As a more general case, we can consider the one in which the disease's dynamics is governed, rather than by Eq (23), by a stochastic differential equation. More concretely, consider the model that results when we replace the first and second lines of Eq (23) by where ξ represents a Gaussian noise term reflecting stochastic fluctuations on the virus' transmissivity [5]. Solving this equation by discretization, each occurrence of ξ at time t k = t 0 + kδ is , where η denotes the magnitude of the noise term. For our simulation, we consider the noise η = 0.05. As before, the method of stochastic gradient descent is used for obtaining a noise-robust policy. At each time step, we average the gradient over 1024 × η different stochastic evolutions. In contrast to the case of noisy parameters, for a stochastic evolution the sampling has to be repeated at each time step, see Appendix B for more details. The results of our simulations are shown in Fig 13. Notice that, in contrast to Fig  9, the noise is not uniformly distributed in a limited interval, but is given by a Gaussian, which allows for (rare) events far away from the mean. This justifies the different representations of uncertainty in these plots. Despite the presence of relatively high noise in the evolution, as shown by the wide fluctuations in Fig 13, the algorithm is able to devise strong policies that maintain the critical care occupation almost always, i.e., with a high probability, below its capacity.

Conclusion
In this paper, we have applied standard tools from optimization theory and machine learning to identify optimal disease control policies, given an epidemiological model. This is in stark contrast to standard practice in mathematical epidemiology, where human intuition is used to narrow down the considered set of policies to a uni-parametric family. We saw that the optimal solutions found by our algorithms are highly counter-intuitive, and thus unlikely to be identified by a human. This supports the idea that policies for disease control should be based on a combination of both human expertise and machine learning.
Compared to previous approaches that tried to identify suitable disease control policies through optimal control theory, our framework allows one to devise policies that satisfy arbitrary constraints under arbitrary uncertainties in the initial conditions and model parameters that determine the disease's dynamics, and stochastic evolution. Our methods, in addition, allow one to optimize over discrete policies, as well as policies that can just vary at certain fixed times.
To illustrate our ideas, we studied a scenario in which a computer is tasked with outputting the minimal amount of non-pharmaceutical interventions for an epidemic in a hypothetical country, in such a way as to never exceed the critical care bed capacity. We looked at situations in which these measures were continuous (recommendations on the interval [0, 1]) as well as discrete (either 0 or 1)-a lockdown that is off or on, respectively for periods of 2 years. We experimented with measures that are just allowed to change weekly, as well as those in which there is a maximum number of lockdowns that is allowed to be declared. We also showed how vaccine supplies, even when meager and poorly distributed, can dramatically shorten the total confinement required to keep the disease under control.
Admittedly, some of our computer-generated policies are too complicated to be implemented in practice. Our formalism allows, however, to limit the complexity of the found solutions by adding extra constraints to the original optimization problem. To highlight this feature, we generated a near-optimal weekly policy plan of vaccinations and on/off weekly confinement measures with a cap on the total number of lockdowns.
We examined the problems that one may encounter when applying our techniques to scenarios where the model parameters are not known with high accuracy, or the model evolves stochastically. This led us to propose practical policy plans which must be continually revisited, to account for our ever-changing and ever-growing knowledge in an epidemic. We tested the viability of this approach by simulating a scenario where the uncertainty on the disease model parameters decreases with time. As expected, the final policy implemented was safe for the final range of parameters and required considerably less physical distancing measures than the initial policy plan hinted.
In this last regard, an interesting problem for future research is devising a gradient-friendly ansatz for adaptive policy plans for disease control, where the actual measure at each time depends on the whole history of disease indicators accessible to the government. In theory, such plans should predict lower values of the average objective function in scenarios where the model parameters are unknown. In our experience, though, gradient descent applied to the standard LSTM architecture seems to be unable to beat the non-adaptive score.
Finally, we would like to remark once more that, since the optimization problems we dealt with in this paper are non-convex, the gradient method is not guaranteed to converge to the minimum of the (average) objective function. While conducting this research, in order to convince ourselves that the solutions found by our numerical methods were close to optimal, we had to repeat our optimizations several times, with different initial policies μ (0) and learning rates �. Such a redundant use of computational resources would have been entirely avoidable if we had had some rough approximation to the exact solution of the problem. Hence we conclude this paper with a challenge for the operations research community: develop mathematical tools which allow one to lower bound the solution of minimization problems involving ordinary differential equations.

A Models for the spread of COVID-19
In all our numerical simulations, we assume that the dynamics of the COVID-19 are well approximated by a compartmental model of the SEIR type. When the government policy reduces to enforcing physical distance measures, we adopt a simplified version of the model used in [15]. This model divides those infected with COVID-19 into three different compartments: I R , or those who recover by themselves from the disease; I H , those who require hospitalization but do not enter a critical care unit; and I C , those who are both hospitalized and visit a critical care unit before recovery. The dynamics of the model are governed by the system of ordinary differential equations below, see the diagram in Fig 14: Here denotes the virus' transmitivity, that is subject to seasonal variability. rðtÞ ¼ ð� r À 1ÞsðtÞ þ 1 models the effect of a government-mandated lockdown s(t) 2 {0, 1} on the virus' transmission rate. The values of the remaining parameters are taken from [15], and appear in Table 1. If the government has a vaccine available, the model changes. In that case, the first and last lines of Eq (23) shall be replaced by where v(t) denotes the vaccination rate. In all our numerical simulations, we take the total population to be 47 million; the critical care bed capacity per inhabitant C c is also taken to be 9.5 × 10 −5 . We assume that the government starts its intervention on day t 0 = 60, 30 days after the outbreak of the disease. We model the disease outbreak by assuming that, at time t out = 30, there are 10 individuals in compartment E. By default, the values of the disease parameters are taken to be the arithmetic means of the intervals shown in Table 1. We assume that the government can vaccinate at most 50, 000 individuals per day. This means that, at any given time t, vðtÞ � L ≔ 50;000 47;000;000 � 0:00106. In addition, we assume that the total supply of vaccines can just cover a third of the total population. In other words, R dtvðtÞ � 1 3 .

B Optimization over continuous measures for disease control
In this section, we explain how to apply the gradient method to optimize over continuous classes of government interventions. Our starting point is an ordinary differential equation of the form The entries of vector x represent the occupations of the different compartments of a disease model. In the case of adaptive policies, some of such entries might also represent the components of the cell state θ [32], i.e., the internal variables used by the government to keep track of the evolution of the disease and guide future government interventions. μ 2 R n represents a parametrization of the effects of a given policy. Let xðt; � μ; x 0 Þ be the solution of Eq (26) with initial conditions x(0) = x 0 and μ ¼ � μ. Given the set M � R n , we consider the problem of finding the parameters μ ? 2 M such that x(t;μ ? , x 0 ) minimizes a given functional A. This functional defines the means by which we wish to control the disease: it might represent the number and duration of lockdown, etc. For the time being, let us assume this functional to be of the form From the discussion in Section 4, the functional in Eq (27) might also contain constraints which we wish the solution x(t;μ ? , x 0 ) to satisfy, such as (12) and (13). Note that we are assuming to know the initial conditions x 0 with precision. We will relax this requirement by the end of the section.
To minimize A(μ, x 0 ) via gradient descent, we need to compute r μ A. For functionals of the form (27), we have that The next question is thus how to compute the derivatives @x i @m j . To this aim, define the variables y i j ðt; μ; x 0 Þ � @x i ðt;μ;x 0 Þ @m j . Differentiating Eq (26) by μ j , we have that x; μÞ @x l y l j ; i ¼ 1; :::; m; j ¼ 1; :::; n: ð29Þ In order to obtain fy i j ðtÞg for each time t, it hence suffices to solve the system of coupled differential equations given by (26) and (29) with initial conditions x(t 0 ) = x 0 , y i j ðt 0 Þ ¼ 0. This can be achieved numerically through several different methods, depending on the desired accuracy. The simplest such method is called Euler explicit [33]: for some δ > 0, it consists of regarding time as a discrete variable of the form t k = t 0 + δk, for k ¼ 0; :::; d t f À t 0 d e. The quantities fx i ðt k Þ; y i j ðt k Þ : kg are then obtained by recursively applying the relations We will also encounter situations where our functional A is more complicated than (27). Some parameters z (not policy parameters) regulating the evolution (26), such as the basic reproduction number of the disease, might be unknown, or perhaps the initial conditions x 0 are just known within some bounds. In such cases, the problem's objective function A might adopt the form The prescription above also allows optimizing over control policies in scenarios where the disease's equations of motion are not deterministic, but probabilistic. Suppose, for instance, that the disease's dynamics are governed by a stochastic differential equation where the entries of ξ 2 R s represent Gaussian white noise. When we solve this equation by discretization, each occurrence of ξ at time t k = t 0 + kδ is to be replaced by an s-dimensional vector ξ k of independent Gaussian variables of 0 mean and standard deviation If we aim to minimize the expectation value of the objective function, we can estimate the gradient of the average through Eq (32), with z = (ξ 1 , . . ., ξ M ) and and use stochastic gradient descent to find the minimum. In the specific case we considered in our simulations, the noise simply affected the "susceptible" and "exposed" compartments, where the usual infection rate is multiplied by a factor (1 + ξ), thus modelling a stochastic fluctuation of the infection rate. This is arguably the most interesting term to study stochastic fluctuations, since the product SI appearing in the differential equation is at the origin of the nonlinearity of the SEIR model. In the next section, we will use the same idea to carry out policy optimizations in scenarios where the disease's evolution is influenced by a finite number of discrete random variables.

C Optimization over discrete policies of disease control
The section above explains how to conduct optimizations over disease control policies, as long as the parameters μ defining the policy are allowed to vary all over R n . Some policies, though, are by their very nature, discrete. For instance, on day t we can either declare a lockdown (s(t) = 1) or not declare a lockdown (s(t) = 0). As we discuss in section 5 in the main text, the effect of a lockdown policy in the evolution of the disease can be modeled by introducing a term in Eq (26) that is proportional to s(t). Namely, G i ðt; x; μÞ ¼Ĝ i ðt; x; μÞ þ sðtÞF i ðxÞ.
To optimize over such discrete policies via gradient descent methods, one could think of introducing a continuous variable l 2 R and writing its effect on the disease's equations of motion by means of a piece-wise continuous function of λ, e.g.: Here Θ(z) denotes the Heaviside function (i.e., Θ(z) equals 1 for z � 0, or 0, otherwise). In this case, however, the gradient method would not work, since the Heaviside function has zero derivative everywhere except at 0. In every iteration of Adam, r λ A would be null, and so λ (k) = λ (0) for all k.
Applying the gradient method to optimize over policies for disease control involving discrete government interventions is therefore not straightforward. In the following, we propose two heuristics to tackle this problem. C.1 Optimization over deterministic discrete policies through non-deterministic discrete policies. Suppose, for the time being, that our lockdown policy were probabilistic, i.e., at each week k, we declare a lockdown with probability p k ð1Þ ¼ sðs k Þ; otherwise, with probability p k ð0Þ ¼ 1 À sðs k Þ, we let the population roam freely. We wish to minimize our average objective function, that is, the expression � Aðs; μÞ ¼ where c is the whole vector of weekly lockdowns, and μ corresponds to the continuous parameters of the policy, e.g.: vaccination rates. In principle, we could apply gradient descent to minimize (35). Estimating the exact gradient of the above expression is, however, unrealistic, as it involves summing a number of terms exponential in the number of weeks q. Instead, we will produce a random unbiased estimate of the gradient and invoke stochastic gradient descent methods, see Section 3.
Let us first differentiate Eq (35) Since both vectors can be sampled efficiently, we can use them (and their averages) to optimize over � Aðs; μÞ via stochastic gradient descent. At this point, the reader might object that our original goal was to minimize (27) over policies with deterministic lockdown. Very conveniently, independently of the initial values of s, μ, the stochastic gradient method will converge to a policy p ? , μ ? such that the deterministic before to estimate the average of the objective function. The only difference is that, at time t 0 + 7(k − l), instead of sampling c k , we set it equal to a. Now, let us assume that, for y 1 6 ¼ y 2 , the family of functions p k (a|θ k ;ν k ) available is rich enough to regard p k (a|θ k = y 1 ;ν k ), p k (a|θ k = y 2 ;ν k ) as independent. Then one can argue as for non-adaptive policies and conclude that the gradient method will converge to a deterministic policy.
C.2 Optimization over discrete policies with continuous lockdown times. Our second heuristic to devise discrete policies for disease control requires considering lock-down policies of the following form: for some i; 1 otherwise: Here t 0 is fixed and the variables ft k g n k¼1 are assumed to be non-negative and to add up to t f − t 0 ; this policy can hence be parametrized by a vector μ 2 R n , with τ = (t f − t 0 )softmax(μ), where softmax(μ) denotes the vector ν with components n i ¼ expðm i Þ P j expðm j Þ . Intuitively, ft k g n i¼1 divide the interval [t 0 , t f ] into n different parts. In each part, lockdown is alternatively declared (s = 1) or suspended (s = 0), see Fig 15. At time t, the disease's basic reproduction number is given by (15), with s(t) defined as above. To find out y i j � @x i ðt;μÞ @μ j , we invoke Eq (29). In computing the term we have the problem that, due to (48), G is not continuous or differentiable. To work our way out, we approximate s(t) by a piece-wise continuous function with bounded derivative that transitions from 0 to 1 (or viceversa) linearly and in time δ � 1, see Fig 16; later we will take the limit δ ! 0.
The new functionsðtÞ has zero derivative with respect to μ i , except for t satisfying In that case, the derivative ofŝ with respect to τ j , with j � u, will (approximately) be @ @t jŝ ðtÞ ¼ ðÀ 1Þ The derivative with respect to any of the variables {τ j : j > u} is zero. The dominant term on the 3 × 3 matrices that, in principle, might depend on some controllable parameters μ. This equation is to be solved under the initial conditions u(0, X, Y) = u 0 (X, Y) and the homogeneous von Neumann boundary conditions ruðt; X; YÞ � nðX; YÞ ¼ 0; for ðX; YÞ 2 @O; ð56Þ where nðX; YÞ 2 R 2 denotes the vector normal to the contour @O at location (X, Y). The authors of [26] solve this equation numerically via the Finite Element Method (FEM) [34]. Suppose that we wished to optimize the policy parameters μ 2 R n over some functional A depending on u(t, X, Y;μ, u 0 ) (instead of x(t;μ, x 0 )) via the gradient method. Then at some point we would need to compute the quantities v i j ðt; X; Y; μ; u 0 Þ � @u i ðt;X;Y;μ;u 0 Þ @m j . Let v j 2 R 3 be the vector with components v i j and differentiate both (55) and (56) with respect to μ i . This results in the equation rv j ðt; X; YÞ � n ¼ 0; for ðX; YÞ 2 @O: Since u(0, X, Y;μ, u 0 ) does not depend on μ, this new diffusion equation must be solved for the initial conditions v j (X, Y, 0) = 0. This can be achieved numerically in the same way that the authors of [26] solved Eq (55), that is, via the FEM.