Predicting COVID-19 in very large countries: The case of Brazil

This work presents a practical proposal for estimating health system utilization for COVID-19 cases. The novel methodology developed is based on the dynamic model known as Susceptible, Infected, Removed and Dead (SIRD). The model was modified to focus on the healthcare system dynamics, rather than modeling all cases of the disease. It was tuned using data available for each Brazilian state and updated with daily figures. A figure of merit that assesses the quality of the model fit to the data was defined and used to optimize the free parameters. The parameters of an epidemiological model for the whole of Brazil, comprising a linear combination of the models for each state, were estimated considering the data available for the 26 Brazilian states. The model was validated, and strong adherence was demonstrated in most cases.


Introduction
In December 2019, the new coronavirus severe acute respiratory syndrome coronavirus 2 (SARS-Cov-2) was first identified in Wuhan, China.On March 11, the World Health Organization designated COVID-19 as a pandemic.As of August 2020, more than 23 million COVID-19 cases and 80,000 deaths had been reported worldwide [1].In Brazil, the virus was first identified in São Paulo city on February 26, and the first death occurred in March in Rio de Janeiro.The new cases detected at the beginning of the pandemic largely coincided with Brazilian cities with airport access, with approximately 2 million Brazilians exposed in approximately 20 weeks.As of the 7th week after the first case, the virus had spread to cities without airports, probably via road transport, increasing the population at risk.Within 5 weeks, according to Wesley Cotta/Ministry of Health data [2], all Brazilian states had registered active cases of the disease.In addition to the different characteristics of the states, which have HDI index values ranging from 0.631 for the state of Alagoas to 0.824 for the state of Distrito Federal, different measures were taken to achieve social isolation, implying different courses of the pandemic.
Brazil, with a population of approximately 200 million inhabitants, is composed of 26 states and the federal district.Approximately 30 years ago, the country implemented a universal and decentralized health system (Sistema U ´nico de Sau ´de) [3].There was instability in the federal management of the health crisis caused by the pandemic, with a number of changes in the Ministry of Health during a short period of time.States independently made several important decisions for controlling the epidemic, leading to high heterogeneity in the non-pharmacological measures taken to mitigate the pandemic.The utilization of the health system and decisions about isolation guidelines served as a guide for most of the official communications from states (26) and municipalities (5570) in the Brazilian press [4].
Internationally, machine learning has been widely used to predict disease behaviour, to forecast demand for health services, to plan and evaluate measures to reopen quarantined sites [5][6][7], and also for medical diagnoses [8].The choice of the best model to forecast demand has been discussed in the literature and remains controversial.COVID-19 is a new disease, and its transmission dynamics and natural history are not yet completely clear; in addition, there is a variable proportion of asymptomatic and mildly symptomatic cases.Those are not notified to health authorities, and although individuals affected with milder cases do not need treatment, they may transmit the disease [9].The modeling challenge is even greater in large countries with significant inequality and a heterogenous evolution of the pandemic, such as Brazil.This country has high-quality data on COVID-19, which originate from the epidemiological surveillance of acute influenza and respiratory syndromes, and are available electronically.
We propose a modification of the Susceptible, Infected, Removed, and Dead (SIRD) [10][11][12][13] model to describe the dynamics of health system usage based on reported cases only, and not on the epidemic as a whole [14].A comparison between models applicable to CoViD19 can be found in [15], where the authors test eight empirical functions, four methods of statistical inference, and five dynamic models built from variations of the SIR model, all of them with data from the epidemic in China.In their work, the models are compared using the Akaike information criterion (AIC), mean square error (MSE) and robustness index, allowing assessment of overestimation and underestimation.In the specific case of dynamic models, they establish a cost-benefit relationship between model complexity and predictive capacity.
The simplicity of the SIRD model compared to more complete and sophisticated models [16,17] makes it easier to tune its parameters and simplifies its use by public agents in management and public communication.Only a portion of those infected by COVID-19 will use the health system.It is known that the peak of the SIRD model is dependent on the population considered.This led us to consider a weighting of the total population to estimate the utilization of the public health system.The proposal was validated by applying the model to each Brazilian state.Tuning the modified SIRD model for each state permits a comparative assessment of its main parameters, infection rate and removal rate, in addition to an assessment of the basic reproduction rate R 0 and the effective reproduction rate R t , all of them relevant parameters for public health management and decision making [18].The model can be used together with solutions for tracking individuals with the purpose of monitoring the epidemic in contagion and geographic location [19].
The global model for all of Brazil was obtained from a linear combination of the estimated active cases for each state.The main contribution of the novel model developed is the demonstration that the data from Brazil as a whole does not follow a simple SIRD model, and the prediction that the epidemic would intensify in the second half of the year due to the natural risk associated with its presence in different states and locations.We have proposed an algorithm that describes this behaviour.Additional important contributions are the possibility of using the model estimates to predict the infection rate and the reproducibility index for the whole country.These indices, which can be reliably estimated from our model, are important for public management and can be easily communicated to society.

The model
In this section, the machine learning problem is introduced based upon the so-called SIRD model.With this particular model structure, an optimization algorithm using a heuristic search is introduced into the learning algorithm.Additionally, a data-driven optimization technique is introduced as a solution to the susceptible data unavailability problem by introducing a degree of freedom to the algorithm.To start the optimization problem structure, consider the SIRD differential model described by equations Eqs (1)-( 4): where β, γ and μ are the average number of contacts per person per period of time, the inverse of the number of days required for a person to pass from the infected to the recovered state, and the average number of deaths per period of time.The continuous time series of the susceptible, infected, removed and death are represented respectively by S(t), I(t), R(t) and D(t).Considering that the existing data sets are usually sampled uniformly, there is an advantage of using the discrete representation of the SIRD model, which can be achieved by Eqs ( 5)-( 8), where Δt is the sample time of the data-sets, P is the total population that should be considered, and the discrete time series of the susceptible, infected, removed and death are represented respectively by S(k), I(k), R(k), and From model Eqs ( 5)-( 8), determining the mean squared error e(k + 1) of the model from the provided data is straightforward.Therefore it is possible to consider the error equation with an aggregated value for each component, such as the maximum value of each component, resulting in the weighted error given by Eq (9).The complete set of data for each model variable is represented by the vectors S, I, R, and D: where the components with the upper tilde are the output components of the differential model ( 5)-( 8) for a particular set of parameters β, γ and μ, e.g., the component ĨðkÞ is the simulation of Eq (6) for a particular set of parameters at k time instants from the initial sample.From that, it is possible to compute the mean squared error using Eq (10), where N is the number of data samples.

MSE ¼ P N k¼0 eðkÞ N ð10Þ
The MSE defined could be employed as the cost function for the data-driven problem, but due to high amplitude difference of the model components mean values, the cost function must take into consideration a weighting parameter.This parameter is used to attribute the same importance to the error of each component of the model.This equalizes the importance of all components on the cost function, and enhances the backward and forward stability of the optimization search space.
Notice that to simulate the components, SðkÞ, ĨðkÞ, RðkÞ, and DðkÞ, the learning algorithm must solve Eqs ( 5)-( 8) for a particular set of parameters.This is straightforward provided that the initial conditions I(0), R(0), and D(0) are known, as they are related to the size of the population P by ð11Þ

Definition of susceptible component
The fraction of susceptible individuals is usually not available on data-sets.It is usually computed from Eq (11), using the components I(k), R(k), D(k), and the estimated population size P.But this assumption is not quite accurate, as the entire population, P, cannot be considered susceptible, specially in case of COVID-19.The model we propose will fit the data-set when the information provided by is actually the number of people who visited a health care facility, and are then tracked by the data.In the next section, we will discuss an algorithm capable of computing the influence of the susceptible component into the cost function.
The susceptible component is subject to imprecision because it depends on environmental aspects such as isolation, disease health impact, targeted people, and even climate conditions.Several studies suggest that it should not be used in the optimization problem.This is problematic: to predict the behavior of the epidemic, it is necessary to consider the initial value of the susceptible components in (11), the value of the considered population size.For example, we could determine the susceptible component value at the time where (2) is zero.So for that can write (2) when t = t p resulting in where t p is the instant of t where the peak occurs.This leads to the condition which shows that the peak moment t p depends on the correct selection of the population size, P. Given a susceptible component value computed from (11) and a population size P, the parameters β, μ and γ in ( 13) are bounded by this particular representation.Now consider that the correct initial value of the susceptible component, S(0), is not the total population, but actually a proportion of it, λP.This happens in scenarios where part of the population is immune or are not impacted by the disease symptoms, and therefore are only carriers.In this particular case it is possible to rewrite (13) as Any distortion of the susceptible component, or of the population value, will be acknowledged by the new degree of freedom of the model, λ.The previous parameters are guided by their particular components, ( 2)-( 4), and the existent data-set.The susceptible component can be computed by considering λ from (15).Considering the characteristics of CoViD 19, where not all infectious people seek the health care system, and that our interest is to model the dynamics of the people who use to the health care system, and specially estimating the peak S(t p ), the weighting of the total population λP allows the SIRD model to represent this dynamic:

Optimization problem
The optimization problem can be structured in the form arg fb;g;m;lg2O min with e p (k) representing the weighted error at sample k, given by ( 9), and the component data reference of S(k) being obtained by (15).In the usual formulation of heuristic optimization algorithms, the arguments must be bounded by the search space O.
The search argument boundaries determination is straightforward as each parameter presents a physical interpretation in the model.
• β is the amount of people one contagious individual infects per time unit; • γ −1 is the amount of time that an infected individual takes to recover; • μ is the proportion of infected people that dies per time unit; • λ is the proportion of the population considered as initially susceptible.
Limits for each of these parameters are given by common sense.Even better is to obtain them numerically, considering the influence of the basic reproduction number, R 0 .For this model, R 0 can be obtained from the relation The basic reproduction number measures the average number of people one contagious person will infect during the contamination period.When R 0 > 1, one person will infect more than one other, and therefore the disease will be capable of self-sustaining growth.Conversely, when R 0 < 1, the disease by itself will not become epidemic.There is a vast literature concerning the value of R 0 for the most common epidemics.We propose the following approach for solving the optimization problem (16): instead of searching for the set of arguments {β, γ, μ, λ}, we search for {R 0 , D, μ, λ}, where D = γ −1 .The search for R 0 and D is better conditioned then the search for β and γ, as R 0 directly define the existence of the epidemic.

Validation for Brazilian states
Brazil is a large country, and there are several cultural and environmental aspects that make its states diverse.Each state can be treated as an isolated epidemic environment, and we can fit a model for each individual state.In Fig 2, the R 2 values for all states are presented for each scenario: searching for {R 0 , D, μ, λ} and searching for {R 0 , D, μ} with the previous setting λ = 1.For each state, R 2 is smaller when λ is selected by the model, i.e., the extra degree of freedom is considered in the population size.Without the reduction of the population considered, there is no set of values for the argument {R 0 , D, μ} that can properly fit the data set analysed.Note that this fitness performance of the algorithm is only possible due to the new degree of freedom introduced represented by the parameter λ.

Results from the model
The global model for Brazil is determined via the linear combination of state models.The epidemic curve of active cases, estimated on June 21, 2020, can be analysed in Fig 3 .The peak observed in June 2020 is strongly influenced by the peak in the state of São Paulo.The decay of the curve followed by support indicates that Brazil is expected to have a stable number of active cases until September or October 2020 and an increasing number of active cases until the first quarter of 2021.An important finding is that application of the SIRD model to Brazil as a whole (dotted in the figure) results in a different prediction for the case dynamics, indicating control of the epidemic in October 2020.The simulation was carried out on 06/27/2020 and the result clearly demonstrates that the model for Brazil does not follow the SIR model as well.The proposed model shows more realistic behaviour about the duration of the epidemic.Some models mistakenly predicted the end of the epidemic at the end of August 2020 [20].
In Fig 4, the predicted value for use of the Brazilian health system is presented, based on the average proportion of the population that will attend hospitals and health centres throughout the epidemic.In Fig 4, this value is shown as more data were provided to the model, i.e., as the epidemic progressed in the country.The value becomes close to the current (most recent) value of 1.0% of the population on 5/17/20, approximately 1.5 months before the first peak, when the predicted use was 0.8% of the country's population.
It can also be seen in Fig 4 that the growth of the number of individuals seeking health care and its future predictions have approximately linear behaviour, implying a constant growth rate, which indicates that the number of new cases is stable.This behavior is consistent when

Results by state
One of the challenges of public health management was described as "flattening" the epidemic curve, a way to openly communicate the strategies chosen for this purpose.suffer from the acute phase of the epidemic being Mato Grosso and Mato Grosso do Sul, which are located in the central west, and Parana ´, which is located in the south.In Fig 6, the estimated parameters are presented for the ten best adjusted states for the first scenario.From the table, it is possible to see that most values of R 0 stay in this range, except for SC (Santa Catarina), indicating that the inclusion of the λ parameter helps with the parameter bias usually observed in direct optimisations, where it is required that λ = 1.
Considering the SIRD model for each state, it is observed that the recovery rate D for individuals who accessed the health system is approximately 17 days and that the basic reproduction rate R 0 is approximately 2.9.Two other relevant parameters are the average mortality rate μ, which is close to 0.8%, which implies an estimated number of deaths of 128, 000 in August 2020.The rate of use of the average health system λ is on the order of 0.6% and may reach 1.0%, which implies an expectation that 2 million people will seek care in the health system.From Fig 6, it is possible to verify some other relevant features.The first is related to the comparison analysis of the parameter D. In the data, a recovered person is not a person considered to no longer be contagious, but rather a person who has been cleared by the hospital as recovered from the disease.Therefore, the state transfer dynamics, from infected to recovered, maps the time in which a person needs to receive health care until they are considered no longer affected by the disease and thus cured.That is why the estimated parameters for D have an average value of 17.97, when it is well known that a person is only contagious during their first week of symptoms.From the data, it was found that COVID-19 has a consistent value for the daily death rate, which has an average value of 0.6%.Compared with the model results, the state that is most off is RJ, with an estimated value of 2.0%.Notice however that this state that has the worst value for R 2 in Fig 2 .This particular problem was caused by a lack of rigorous data collection.The data contain many outliers, and during a long period of time, the data-set was not updated.The synthesis of the results, based on the models by state, can be analysed in Fig 7.

Space-time analysis
To calculate the R t values for each Brazilian state, we used the predicted number of infected individuals, considering a window of w days.The infection series can be predicted from the likelihood function considering a Poisson process [21,22].This procedure can be applied to both raw data and data predicted by the model.The results are illustrated in Fig 8 using a window w = 5 days displaced by 1 day, normalised by the z-score method.A second method, based on the extraction of the β transmission rate, was applied for each state, resulting in a third estimate for control of the epidemic [23].The results are illustrated in Fig 8.This last method, although probably affected by the fluctuations in the updated data in addition to social isolation factors, is shown to be correlated with the other two.The estimate of the R t rate from the model works as a filter when compared to the rate obtained from the data window, where the estimate tends to the mean the data and seems suitable for use in the prediction of contagion behavior.
As noted in Figs 1 and 8, there is a rapid recovery of the model from variations in the data.In the specific case, the variation was caused by the variation of the data sources and their consolidation, but it is expected that the model recovery will be equally rapid if the change comes from real cases.In this sense, the rates estimated from data and from the model can be used in a combined way for decision making, since they are interpreted on a 1-or 2-week horizon to observe their effects.When we analyse the temporal and geographical progression of the virus, representing the effective reproductive index R t for each moment and each state, as shown in Fig 9, a considerable portion of the states are still observed to have indices greater than 1.Other states show a decline but are in the opening process.According to the graphs, it is visible that the epidemic started in the northern and southeastern regions.Then it spread progressively throughout the country but did not progress equally in each state.The southeastern region reaches its peak weeks later than the northern region.Compared to the northern region, the southeastern region maintains high R t values at present, showing less control of the epidemic.Note that at the moment, there are still three states that have R t values greater than 1.0, i.e., the epidemic is still growing.An example of this behaviour is the state of Parana ´.As previously mentioned, according to state models, Brazil will experience a second smaller peak in September 2020.This second peak has an amplitude mostly determined by infection numbers from the state of Parana ´, which has the potential to reach values equivalent to those of São Paulo (predominant state in the amplitude of the first peak).

Conclusion and future directions
In conclusion, our modified SIRD model allowed the estimation of the COVID-19 epidemic model for the whole Brazil and may be used in other very large countries, such as the USA, India and Russia.The results obtained and observations during the training and tuning process are compatible with other works [25].When predicting the future of the pandemic in those countries, it is important that local variation in epidemic stage is accounted in the model to provide accurate results.The use of the composite model to understand the epidemic in Brazil allowed for a more realistic modeling, regarding the predictions of the use of the health system, as well as the average control parameters of the epidemic.We also found that COVID-19 peaked in Brazilian states during periods in which the peak of respiratory diseases also used to occur.At the time of the writing of this paper, June 2020, the values of R 0 and R t higher than 1 found for Brazilian States and the high values predicted until the last quarter of 2020 suggests that non-pharmacological measures would be needed for months, what turned out to be true.Another aspect that the model brought as evidence of its predictive capacity was the stable level between the months of June and October, with the beginning of a decline in cases after a considerable period of stable number of cases.
As a management element for a country of continental dimensions such as Brazil, the model proved to be effective in providing information to support decision making in the public and private spheres.However, its predictive capacity can be expanded considering aspects of closure and opening of economically active segments of society, such as the proposal studied in [26].One feature of the model is that it needs to be updated as new data become available daily to improve its estimates.Limitations to be addressed refer to the characterisation of the initial state of the epidemic, prediction of the possibility of new waves, and the inclusion of vaccination processes.In order to improve its capacity, increasing the complexity, we can analyse the impact of the quarantine as in the model tuned in Dynamic-Susceptible-Exposed-Infective-Quarantined (D-SEIQ) [27].
As part of the dissemination strategy for this work, we created a repository that allows interested parties to access all the code described in python as well as the preliminary tuning results.We intend to improve communication with the inclusion of information about vaccination, risk analysis as proposed in [28,29], and also modifications of the model in the case of incorporating the latency aspect in the transfer, which has gained greater understanding in the light of new observations [30].
The optimization problem can thus be rewritten in the formarg fR 0 ;D;m;lg2 � O min P N k¼0 e p ðkÞ Nð18Þwhere � O is the new search space, considering R 0 and D.
Fig 1 shows the data and the model predicted for two distinct Brazilian states, Maranhão and São Paulo.Using all data to fit each model, it is possible to see that in scenarios where the data were collected rigorously and strict isolation was implemented, such as in Maranhão, the algorithm was able to fit the data pattern with high fidelity.Even in scenarios where the data were not collected properly, such as São Paulo, the model was able to visualize the main pattern of the data.The modified SIRD model exhibited strong adherence to the data for most states with R2 values between 0.99 and 0.82.Fig 1 illustrates the adjustment of the model for the date 7/21/20 for the state of Maranhão, which has a population of approximately 7 million inhabitants (approximately 20 inhabitants per km 2 ) and declared 12 days of lockdown in May 2020; and the state of São Paulo, which has a population of approximately 46 million inhabitants (approximately 166 inhabitants per km 2 ).The basic reproducibility index estimated for Maranhão was R 0 = 3.52 and the basic reproducibility index estimated for São Paulo was R 0 = 4.78.

Fig 2 .
Fig 2. R 2 value comparing the predictions with the real data of each state and the reliability metric that shows the proportion of the data that is reliable to use on the learning algorithm.https://doi.org/10.1371/journal.pone.0253146.g002 Fig 5 illustrates the peaks in Brazil, with the first state to reach the peak being Pernambuco and the last states to