## Abstract

Fitting Susceptible-Infected-Recovered (SIR) models to incidence data is problematic when not all infected individuals are reported. Assuming an underlying SIR model with a general but known distribution for the time to recovery, this paper derives the implied differential-integral equations for observed incidence data when a fixed fraction of newly infected individuals is not observed. The parameters of the resulting system of differential equations are identifiable. Using these differential equations, we develop a stochastic model for the conditional distribution of current disease incidence given the entire past history of reported cases. We estimate the model parameters using Bayesian Markov chain Monte Carlo sampling of the posterior distribution. We use our model to estimate the transmission rate and the fraction of asymptomatic individuals for the current coronavirus disease 2019 (COVID-19) outbreak in eight countries in the Americas: the United States of America, Brazil, Mexico, Argentina, Chile, Colombia, Peru, and Panama, from January 2020 to May 2021. Our analysis reveals that the fraction of reported cases varies across countries. For example, the reported incidence fraction for the United States of America varies from 0.3 to 0.6, while for Brazil it varies from 0.2 to 0.4.

**Citation:** Trejo I, Hengartner NW (2022) A modified Susceptible-Infected-Recovered model for observed under-reported incidence data. PLoS ONE 17(2): e0263047. https://doi.org/10.1371/journal.pone.0263047

**Editor:** Alberto d’Onofrio, International Prevention Research Institute, FRANCE

**Received:** December 7, 2020; **Accepted:** January 11, 2022; **Published:** February 9, 2022

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

**Data Availability:** The data underlying the results presented in the study are available from WHO COVID-19 Explorer. Geneva: World Health Organization, 2020. Available online: https://covid19.who.int/ (last cited: November 10th, 2020).

**Funding:** Research presented in this article was supported by the Laboratory Directed Research and Development program of Los Alamos National Laboratory, projects 20210709ER and 20210043DR. The content is solely the responsibility of the authors and does not necessarily represent the official views of the sponsors. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Los Alamos National Laboratory is operated by Triad National Security, LLC, for the National Nuclear Security Administration of U.S. Department of Energy (Contract No. 89233218CNA000001).

**Competing interests:** The authors have declared that no competing interests exist.

## Introduction

Susceptible-Infected-Recovered (SIR) models, introduced by Kermack and McKendrick and further developed by Wilson and Worcester [1, 2], have been extensively used to describe the temporal dynamics of infectious disease outbreaks [3–5]. They have also been widely used to estimate the disease transmission rate by fitting the models to observed incidence data [6–8], such as time series of the daily or weekly reported number of new cases provided by [9–12], for example. Implicit in all these model fittings is the assumption that all the infected individuals have been observed. Yet that assumption is problematic when disease incidences are under-reported. Under-reporting of incidence is prevalent in health surveillance of emerging diseases [13, 14], and also occurs when a disease presents a large fraction of asymptomatic carriers, e.g., Typhoid fever, Hepatitis B, Epstein-Barr virus [15] and Zika [16]. Lack of systematic testing and the presence of sub-clinical patients, which are prevalent in both Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), the causative agent of the coronavirus disease (COVID-19) pandemic [17–20], and Influenza [21, 22], also lead to under-counting of incidence and deaths. Directly fitting an SIR model to raw under-reported incidence will underestimate the transmission rate (see the Under-estimation of the transmission rate section). Therefore, failing to account for under-reporting will under-estimate the severity of the outbreak, possibly leading decision makers to call the epidemic under control prematurely.

To account for under-reporting in an SIR-type model, Shutt *et al.* [23] propose to split the infected individuals into two categories: an observed category and an unobserved category. This is a special case of the Distributed Infection (DI) models introduced in [24]. However, fitting this model to data is problematic since there are no data from the unobserved category. Furthermore, making inferences about DI model parameters is difficult as there are no adequate stochastic model extensions for the DI models, which implies that there is no analytic expression for the likelihood. A partial solution to this problem is to use Approximate Bayesian Computations as in [23] or to rely on particle filtering [25]. Finally, we mention two recent approaches to model asymptomatic individuals in SIR-type models. First, Lopman *et al.* [26] model Norovirus outbreaks using an SEIR model, with E standing for “exposed”, where the infected would progress from symptomatic to asymptomatic to immune. Once immune, individuals could cycle between immune and asymptomatic infection. Second, Kalajdzievska *et al.* [15] propose an SIcIR model, with Ic standing for “infectious carrier”, where infected individuals are separated into asymptomatic and symptomatic groups by a given probability as they progress from the susceptible group.

The aim of this paper is to present a novel approach to estimate the extent of under-reporting from reported incidence data and to apply this methodology to COVID-19 incidences. The COVID-19 pandemic is a particular example of an infectious disease that poses many challenges in quantifying the under-reported incidence, and hence in estimating its infectiousness [19, 27, 28], as under-reporting arises from the presence of sub-clinical infections [20, 29], asymptomatic individuals [30, 31], and lack of systematic testing [17, 18]. According to [30], asymptomatic individuals account for 20–70% of all infections. Additionally, early in the outbreak in China, before travel restrictions, 86% of all infections were not documented [19].

In the development of our methodology, we present two innovations. First, we introduce an alternative to the DI models that directly describes the dynamics of the observed under-reported incidences. Specifically, assuming that a constant fraction of the newly infected individuals is observed, we derive a set of integral-differential equations describing the local temporal dynamics of the observed incidence. Second, we use the local dynamics of the observed incidence to propose a model for the conditional expectation of new cases, given the observed past history. Making additional distributional assumptions, we obtain a likelihood for the epidemic model parameters: the transmission rate *β*, and the fraction *p* of observed incidence. We show that as the epidemic progresses, both of these parameters become identifiable. We refer to Bettencourt and Ribeiro [32] for an interesting alternative framework that leads to a likelihood for the basic reproduction number $\mathcal{R}_0$.

## Materials and methods

### Data source

The time series of the daily number of confirmed COVID-19 cases and the total population, *N*, of the eight analyzed countries were obtained from the World Health Organization (WHO) reports. Both data sets can be freely downloaded online: https://covid19.who.int/WHO-COVID-19-global-data.csv and https://worldhealthorg.shinyapps.io/covid/, respectively [12, 33]. We used all incidence reports available at the time of the study, which correspond to the reports from January 03, 2020 to May 18, 2021.

### Model development

Our epidemic model is developed in three steps. First, we extend a generalized SIR model to describe the dynamics of the observed (under-counted) infections. Second, we introduce a local version of that SIR model to describe the evolution of the epidemic in a series of observational time windows given the past time series of observed incidences. This more flexible model is used to compute the conditional expectation of current observed incidence given the past history. Third, we develop a computationally tractable approximation for the conditional expectation to speed up Markov chain Monte Carlo (MCMC) inferences of our model parameters.

#### Generalized SIR model.

Classical mass-action epidemic models, such as the SIR models, are simple yet useful mathematical descriptions of the temporal dynamics of disease outbreak [3–5]. These models describe the temporal evolution of the number of susceptible *S*(*t*), infected *I*(*t*) and recovered *R*(*t*) individuals in a population of fixed size *N* = *S*(*t*) + *I*(*t*) + *R*(*t*). We model their dynamics through the set of integral-differential equations [34]:
(1)
(2)
(3)
with initial conditions *S*(0), *I*(0), *R*(0). The parameter *β* measures the transmission rate (also called infection rate [35, 36]) and the function *F*(*t*) is the cumulative distribution of the time from infection to recovery. When *F*(*t*) = 1 − *e*^{−γt}, the exponential distribution with mean *γ*^{−1}, our model reduces to the standard SIR model (see Murray [37] for example). For completeness, the proof of existence and uniqueness of the solution of System (1)–(3) is provided in the appendix. An alternative proof can be found in [34].
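To make the role of the recovery distribution *F*(*t*) concrete, the following Python sketch simulates an SIR-type model with a general time-to-recovery law. It is only an illustration under stated assumptions: the renewal form written in the docstring is a common way to express such models and is not copied from Eqs (1)–(3), and the function name `simulate_generalized_sir`, the step size `dt`, and the explicit time stepping are all hypothetical choices.

```python
import math

def simulate_generalized_sir(beta, survival, n_pop, i0, t_end, dt=0.1):
    """Discretized SIR-type model with a general recovery law.

    Assumed form (a common renewal formulation, not taken from the paper):
      incidence(t) = beta * S(t) * I(t) / N
      S'(t)        = -incidence(t)
      I(t)         = I(0) * G(t) + integral_0^t incidence(u) * G(t - u) du
    where G(t) = 1 - F(t) is the probability of still being infected after t.
    """
    steps = int(round(t_end / dt))
    surv = [survival(j * dt) for j in range(steps + 1)]
    s, i = n_pop - i0, float(i0)
    created = []                      # infections generated in each time step
    s_path, i_path = [s], [i]
    for n in range(1, steps + 1):
        new = min(s, beta * s * i / n_pop * dt)
        created.append(new)
        s -= new
        # Initial cases still infected, plus later cases weighted by survival.
        i = i0 * surv[n] + sum(c * surv[n - 1 - m] for m, c in enumerate(created))
        s_path.append(s)
        i_path.append(i)
    return s_path, i_path
```

Passing `survival=lambda t: math.exp(-gamma * t)` corresponds to the exponential case mentioned in the text, so the output can be compared with a standard SIR simulation; any other survival function, such as one built from a lognormal or Weibull law, can be supplied in the same way.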

The model parameters *β* and *F*(*t*) are epidemiologically relevant and provide insights into the outbreak. An example is the basic reproductive number as defined by Lotka [38, 39]:
$$\mathcal{R}_0 = \beta\,\frac{S(t)}{N}\int_0^{\infty}\bigl(1 - F(u)\bigr)\,du. \tag{4}$$
The term *S*(*t*)/*N* in Eq (4) is the fraction of susceptible individuals in the population that can be infected, and $\int_0^{\infty}(1 - F(u))\,du$ is the average recovery time. $\mathcal{R}_0$ is arguably the most widely used measure of the severity of an outbreak [40, 41], at least in the absence of interventions to control it. It measures the expected number of secondary infections attributed to the index case in a naïve population. Other quantities of interest, such as the maximum number of infected individuals and the total number of infections, can be expressed in terms of the reproductive number $\mathcal{R}_0$; see, e.g., Weiss [36].

For many diseases, it is reasonable to assume that the disease progression from infection to recovery is known, either because the disease is well characterized, or because the date of onset of symptoms, hospital admissions, and discharge data are available [42]. Thus we will assume throughout this paper that we know the distribution of the recovery period *F*(*t*) and we will focus on estimating the transmission rate *β*.

#### Modeling the observed disease incidence.

Let , and denote the observed number of susceptible, infected and recovered individuals as a function of time. We make the following modeling assumptions:

- (A1). The true underlying dynamics follows the SIR dynamics described by Eqs (1)–(3) with known fixed population size *N* and time-to-recovery distribution *F*(*t*).
- (A2). A constant fraction *p* of newly infected individuals is observed, that is , with 0 < *p* < 1. The same fraction *p* of initial cases is observed, i.e., , , and .
- (A3). The recovery distribution is the same for observed and unobserved infected individuals.

Under these assumptions, the observed number of infected individuals at time *t* is
(5)
and similarly, . The number of observed susceptible individuals is
(6)
Eq (6) follows by solving the differential equation and using the identity , which results from (A2) and *N* = *S*(0) + *I*(0) + *R*(0).

These equations capture the intuitive idea that under-reported incidence results in a larger number of observed susceptible individuals and fewer infected and recovered individuals throughout the epidemic. Consider the ratio *v*(*t*), which follows from Assumption (A2) and Eqs (5) and (6):
(7)
For a standard SIR model with *p* = 1, the ratio *v*(*t*) is unity. However, for the observed process, the ratio *v*(*t*) starts at one and then monotonically decreases over time. It follows that fitting an SIR model to observed incidence data, neglecting the under-reporting, will produce a nearly unbiased, but possibly noisy estimate for *β* early in the outbreak when *v*(*t*) ≈ 1. As more data becomes available and *v*(*t*) decreases, the estimated transmission rate *β* will under-estimate the true value. As a consequence, one might at later times in an outbreak underestimate the severity of the outbreak and call the epidemic under control prematurely.

The following theorem describes the dynamics of the observed number of susceptible, infected and recovered individuals when only a fraction *p* of the infected individuals are observed.

**Theorem 1** *Under assumptions (A1), (A2), and (A3), the process of the observed individuals evolves according to the following set of integral-differential equations*:
(8)
(9)
(10)

The conclusion of the theorem follows from algebraic manipulations of Eqs (1) to (6). The addition of the positive term to implies a slower depletion rate of the observed susceptible population than would be expected under the standard SIR model. Note that this positive term must be small enough such that , for all *t* ≥ 0 and all *p* > 0, condition imposed from Assumption (A2). Assumption (A2) of observing the same fraction *p* of initial infected and recovered individuals was established only for the technical mathematical proofs of Eq (5) and . This mathematical assumption will be relaxed in the following local dynamics definition.

#### A stochastic model for the observed incidence.

Observed incidences of disease are typically reported at regular time intervals. Precisely, let 0 = *t*_{0} < *t*_{1} < *t*_{2} < … < *t*_{n} denote the boundaries of the observation windows. For simplicity, we assume that *t*_{k} = *k*Δ, and we denote by *Y*_{k} the number of new cases of the disease observed in the interval (*t*_{k−1}, *t*_{k}], *k* = 1, 2, …, *n*. We also assume that the new cases *Y*_{k} depend on the actual observed past history of incidences . As a result, our model takes into account the impact of fluctuations in the reports. Indeed, imagine that the reported cases *Y*_{k} are much larger than what is predicted by Model (8)–(10). That excess of cases will alter the observed dynamics of the outbreak, making it progress faster. Similarly, smaller numbers of incidences will slow down the outbreak. The following model takes into account past fluctuations in the incidence to model locally the dynamics of the process at each time interval given the past history.

**Definition 1** *Let Y*_{1}, *Y*_{2}, …, *Y*_{k} *be the sequence of observed incidences and assume that the cumulative probability distribution F for the time to recovery is continuous. We model the local dynamics of the observed number of susceptible* *and infected* *individuals at time t in the interval* (*t*_{k−1}, *t*_{k}] *through the set of differential-integral equations*
(11)
(12)
*with initial conditions* *and with the convention that* , where both *and* *for all t* ≥ 0 *and p* > 0. *For this model, the conditional expectation of incidence given the past history is*
(13)
*for all k* = 1, 2, …, *n*.

**Remark 1** *Continuity of the cumulative distribution of the time to recovery F implies that* *is left continuous. Furthermore, if F has a probability density, then* *admits a right-hand derivative at t*_{k−1}.

The local model described in Definition 1 has the same infection dynamics, Eq (11), as the global model. What differs is the evolution of the number of infected individuals, and how it relates to the history of past incidences. The following heuristic serves to motivate Eq (12) in Definition 1. Decompose the integral in Eq (9) for the number of infected individuals into a sum over each observed window (*t*_{j−1}, *t*_{j}] to write
To get , replace by its local instantiation on (*t*_{k−1}, *t*_{k}] and by *Y*_{j}/Δ, the empirical rate of new infections, on the interval (*t*_{j−1}, *t*_{j}]. This is interpreted as assuming that the *Y*_{j} new infections in the interval (*t*_{j−1}, *t*_{j}] occur uniformly in that interval. This allows us to take into account the actual number of observed incidences in each time interval instead of using model-derived quantities, which provides the flexibility needed for our local epidemic model to track more complex epidemic dynamics than is possible using a global generalized SIR model.
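The heuristic of spreading each *Y*_{j} uniformly over its window can be sketched numerically. The Python function below computes the expected number of previously reported cases that are still infected at the end of the last completed window. It is a simplified illustration, not a transcription of Eq (12): the function name `still_infected`, the midpoint-rule integration, and the omission of any contribution from the current window or from initial conditions are assumptions made for this example.

```python
import math

def still_infected(past_counts, survival, delta=1.0, sub=20):
    """Expected number of past cases still infected at the end of the last window.

    past_counts[j-1] holds Y_j, the cases reported in (t_{j-1}, t_j], t_j = j * delta.
    Cases are spread uniformly over their window, and survival(a) = 1 - F(a) is
    the probability of not having recovered a time units after infection.
    """
    t_now = len(past_counts) * delta
    h = delta / sub
    total = 0.0
    for j, y in enumerate(past_counts, start=1):
        # Midpoint rule for (y / delta) * integral over window j of survival(t_now - u) du
        window = sum(survival(t_now - ((j - 1) * delta + (m + 0.5) * h))
                     for m in range(sub))
        total += y / delta * window * h
    return total
```

With a survival function that is identically one, nobody recovers and the result is simply the sum of the past counts, which gives a quick sanity check of the weighting.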

We use the expression for the conditional expectation of incidences in the interval (*t*_{k−1}, *t*_{k}] given the time series of past observed incidences in Definition 1 to model the conditional distribution of *Y*_{k} given *Y*_{1}, *Y*_{2}, …, *Y*_{k−1}. Specifically, we assume that the conditional distribution of *Y*_{k}|*Y*_{1}, *Y*_{2}, …, *Y*_{k−1} is negative binomial
$$Y_k \mid Y_1, Y_2, \ldots, Y_{k-1} \sim \mathrm{NegBin}(\mu_k, r). \tag{14}$$
A negative binomial counts the number of successes in a sequence of independent and identically distributed Bernoulli trials, each with probability of success *p* = *μ*_{k}/(*μ*_{k} + *r*), before *r* failures (each with probability 1 − *p*) occur. Here *p* denotes the Bernoulli success probability, not the observed fraction, and *μ*_{k} is the conditional expectation defined in Eq (13). With this parametrization, the conditional expectation and variance are
$$\mathbb{E}[Y_k \mid Y_1, \ldots, Y_{k-1}] = \mu_k \quad\text{and}\quad \mathrm{Var}[Y_k \mid Y_1, \ldots, Y_{k-1}] = \mu_k\left(1 + \frac{\mu_k}{r}\right), \tag{15}$$
respectively. The shape parameter *r* controls the amount of over-dispersion when compared to a Poisson distribution, for which the variance equals the mean. In particular, as the shape parameter *r* grows to infinity, the negative binomial model converges to a Poisson distribution with rate *μ*_{k}. Thus, the negative binomial distribution allows us to account for the extra-Poisson variability that arises in the data. Other distributions are possible, such as the beta negative binomial distribution [43] or the Conway-Maxwell-Poisson distribution [44].
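The mean and shape parametrization can be checked numerically. The sketch below evaluates the negative binomial probability mass function with success probability *μ*/(*μ* + *r*) and sums it to recover the standard moments of this parametrization, namely mean *μ* and variance *μ* + *μ*²/*r*. The helper names `nb_logpmf` and `nb_moments` and the truncation point `y_max` are illustrative choices, not part of the paper.

```python
import math

def nb_logpmf(y, mu, r):
    """log P(Y = y) for a negative binomial with mean mu and shape r."""
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))

def nb_moments(mu, r, y_max=5000):
    """Mean and variance obtained by summing the pmf up to y_max (numerical check)."""
    probs = [math.exp(nb_logpmf(y, mu, r)) for y in range(y_max)]
    mean = sum(y * q for y, q in enumerate(probs))
    var = sum((y - mean) ** 2 * q for y, q in enumerate(probs))
    return mean, var

mean, var = nb_moments(mu=50.0, r=25.0)
print(mean, var)   # close to mu and to mu + mu**2 / r
```

Increasing `r` while holding `mu` fixed drives the variance toward the mean, which is the Poisson limit described in the text.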

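With repeated application of the chain rule, we combine the set of conditional distributions for *Y*_{k}|*Y*_{1}, *Y*_{2}, …, *Y*_{k−1} into a joint likelihood for the model parameters
$$L(\beta, p, r) = \prod_{k=2}^{n} \mathbb{P}\bigl(Y_k \mid Y_1, Y_2, \ldots, Y_{k-1}\bigr) \tag{16}$$
$$\phantom{L(\beta, p, r)} = \prod_{k=2}^{n} \frac{\Gamma(Y_k + r)}{\Gamma(r)\,\Gamma(Y_k + 1)} \left(\frac{r}{r + \mu_k}\right)^{r} \left(\frac{\mu_k}{r + \mu_k}\right)^{Y_k}, \tag{17}$$
where Γ denotes the gamma function and *μ*_{k} depends only on *β* and *p*. Since, in the model formulation, the distribution of *Y*_{1} does not contain any information about the transmission rate and the fraction of observed cases, the term for *Y*_{1} is dropped from the likelihood.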
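As a rough illustration of how such a likelihood can be evaluated, the Python sketch below sums negative binomial log-probabilities over *k* = 2, …, *n*. The function name `log_likelihood` and its inputs are hypothetical: in the paper the means *μ*_{k} are functions of *β*, *p*, and the earlier counts, whereas here they are simply passed in as a list so that the example stays self-contained.

```python
import math

def log_likelihood(counts, mus, r):
    """Sum of negative binomial log-probabilities for k = 2, ..., n.

    counts[k-1] is Y_k and mus[k-1] is the model mean mu_k (assumed positive).
    The first observation is skipped, mirroring the dropped Y_1 term.
    """
    total = 0.0
    for y, mu in zip(counts[1:], mus[1:]):
        total += (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
                  + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))
    return total
```

Working on the log scale with `math.lgamma` avoids overflow in the gamma functions when daily counts are in the thousands, and it also accepts non-integer counts such as those produced by smoothing.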

#### Approximation of the conditional expectation.

To reduce the computational burden required to numerically solve the set of differential-integral Eqs (11) and (12), and the ensuing integration in Eq (13) to evaluate the conditional expectation, we propose to approximate the conditional expectation *μ*_{k} by linearizing both and around *t*_{k−1} in Eq (13), and integrate the result explicitly. The following lemma encapsulates the resulting approximation.

**Lemma 1** *Assume that the cumulative probability distribution F for the time to recovery has a probability density f*. *The conditional expectation μ*_{k} *can be approximated by*
(18)
*when* *and* , *and μ*_{k} = 0 *otherwise. Here*,
(19)
(20)
(21)
(22)
*for all k* = 1, 2, …, *n*.

#### Proof of Lemma 1.

Eqs (19)–(22) follow directly from the definition of and the evaluation of Eqs (11) and (12) at *t*_{k−1}. To prove Eq (22), we first take the derivative of , Eq (12), with respect to *t* and simplify it as follows:
Then, we evaluate at *t*_{k−1} and simplify the resulting equation, using the definition of each *t*_{k} = Δ*k*:
From the definition, both and are non-negative quantities, and the hypothesis implies that , for all *p* > 0. Therefore, all these equations are well defined. In the proof of Eq (18), the linear approximation of both and around *t*_{k−1} are:
(23)
(24)
Substituting these equations in the integrand of Eq (13) and solving it yields:
When , from Eqs (20) and (22), and . Then *μ*_{k} ≈ 0. When , using the definition of in the previous equation and simplifying it yields:
where the conclusion of Eq (18) follows.
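The integration step in this proof rests on a generic identity for the integral of a product of two linearized functions. The LaTeX sketch below states that identity. The symbols $g$, $h$, and the constant $c$ are placeholders introduced here for the two linearized quantities and any constant factor; treating the integrand of Eq (13) as $c\,g(t)\,h(t)$ is an assumption of this sketch, and the authors' Eq (18) may be arranged or simplified differently.

```latex
% Sketch: g and h stand for the two linearized quantities, c for a constant factor.
% Assumption: the integrand of Eq (13) is treated as c * g(t) * h(t).
\begin{align*}
g(t) &\approx g_0 + g_1\,(t - t_{k-1}), \qquad
h(t) \approx h_0 + h_1\,(t - t_{k-1}), \\
\int_{t_{k-1}}^{t_k} c\, g(t)\, h(t)\, dt
 &\approx c \int_0^{\Delta} (g_0 + g_1 s)(h_0 + h_1 s)\, ds \\
 &= c \left[ g_0 h_0\, \Delta + \tfrac{1}{2}\,(g_0 h_1 + g_1 h_0)\, \Delta^2
      + \tfrac{1}{3}\, g_1 h_1\, \Delta^3 \right].
\end{align*}
```

Here $g_0, h_0$ are the values and $g_1, h_1$ the right-hand derivatives at $t_{k-1}$, so only quantities available at the start of the window are needed.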

**Remark 2** *Better approximations for μ*_{k} *can be obtained using higher order Taylor expansions for* *and* . *This requires the distribution F of time to recovery to have higher order derivatives*.

#### Identifiability.

It is known that the measured growth rates in early SIR outbreaks are insensitive to under-reporting. Indeed, in early outbreaks, *S*(*t*) ≈ *N* and hence *I*′(*t*) ≈ (*β* − *γ*)*I*(*t*). Under Assumption (A2), we have that and , which imply that
It follows that the disease incidence grows exponentially with rate *β* − *γ*, irrespective of the fraction *p* of observed incidence. Hence the transmission rate *β* can be estimated if the recovery rate *γ* is known, but the fraction *p* cannot be estimated at that early stage of the outbreak.
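A small simulation illustrates why the early growth rate carries no information about *p*. The sketch below integrates a standard SIR model with an explicit Euler scheme, reports only a fraction `p` of each day's new infections, and returns the slope of the log of reported incidence. The function name `early_log_growth` and all numerical settings (population size, initial cases, step size) are assumptions chosen for illustration.

```python
import math

def early_log_growth(beta, gamma, p, n_pop=1_000_000, i0=10.0, days=15, dt=0.01):
    """Slope of log(reported daily incidence) early in a standard SIR outbreak."""
    s, i = n_pop - i0, i0
    per_day = int(round(1 / dt))
    daily = []
    for _ in range(days):
        new_cases = 0.0
        for _ in range(per_day):
            inf = beta * s * i / n_pop * dt
            rec = gamma * i * dt
            s, i = s - inf, i + inf - rec
            new_cases += inf
        daily.append(p * new_cases)   # only a fraction p is reported
    return (math.log(daily[-1]) - math.log(daily[0])) / (days - 1)

print(early_log_growth(0.5, 0.2, p=1.0), early_log_growth(0.5, 0.2, p=0.3))
```

Scaling the incidence by `p` shifts its logarithm by a constant, so both calls return essentially the same slope, close to *β* − *γ*.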

As the outbreak matures and moves away from its early exponential growth phase, it becomes possible to estimate both the transmission rate *β* and the fraction *p* of observed cases. The following theorem provides verifiable conditions for both these parameters to be identifiable.

**Theorem 2** *Set*
(25)
(26)
*If the vectors* (*U*_{1}, *U*_{2}, …, *U*_{m}) *and* (*V*_{1}, *V*_{2}, …, *V*_{m}) *are linearly independent, then β and p are identifiable*.

The proof of Theorem 2 is found in the appendix.

**Remark 3** *As noted earlier, β and p are not identifiable in the early stages of an outbreak. This is also evident in Theorem 2: In the early stages, we have that* , *so that the vectors* (*U*_{1}, …, *U*_{k}) *and* (*V*_{1}, …, *V*_{k}) *are essentially co-linear. Later in the outbreak, as S*_{k}(*u*) *is no longer close to N, both parameters become identifiable*.

### Bayesian parameter estimation

We use the Metropolis-Hastings algorithm to draw Markov chain Monte Carlo (MCMC) [45] samples from the posterior distribution of the model parameters given the epidemic outbreak data. Our implementation transforms the original parameters Θ = (*β*, *p*, *r*) into (*ξ*, *η*, *ρ*), where *ξ* = log(*β*), *η* = log(*p*/(1−*p*)) and *ρ* = log(*r*), and selects proposals from a multivariate Gaussian distribution centered at the current transformed parameter values, with diagonal covariance matrix *Σ* with entries 0.001, 0.01, and 0.01. The results presented in the next section are from 40,000 MCMC samples gathered after 40,000 burn-in iterations when starting from Θ_{0} = (0.5, 0.5, 25). Our implementation used the approximation for *μ*_{k} presented in Lemma 1.
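The sampling scheme described here can be sketched as a random-walk Metropolis algorithm on the transformed scale. The code below is a minimal illustration, not the authors' implementation: `log_post` is a hypothetical user-supplied log-posterior of (*ξ*, *η*, *ρ*), the function names are invented for this example, and centering the Gaussian proposal at the current state is an assumption.

```python
import math
import random

def to_natural(theta):
    """Map (xi, eta, rho) back to (beta, p, r)."""
    xi, eta, rho = theta
    return math.exp(xi), 1.0 / (1.0 + math.exp(-eta)), math.exp(rho)

def metropolis_hastings(log_post, theta0, n_iter,
                        variances=(0.001, 0.01, 0.01), seed=0):
    """Random-walk Metropolis with a diagonal Gaussian proposal.

    log_post is a hypothetical log-posterior density of the transformed
    parameters; any prior and Jacobian terms belong inside it.
    """
    rng = random.Random(seed)
    sds = [math.sqrt(v) for v in variances]
    theta, lp = list(theta0), log_post(theta0)
    chain, accepted = [], 0
    for _ in range(n_iter):
        proposal = [t + rng.gauss(0.0, sd) for t, sd in zip(theta, sds)]
        lp_new = log_post(proposal)
        # Symmetric proposal: accept with probability min(1, exp(lp_new - lp)).
        if math.log(1.0 - rng.random()) < lp_new - lp:
            theta, lp = proposal, lp_new
            accepted += 1
        chain.append(tuple(theta))
    return chain, accepted / n_iter
```

A run would start from the transformed version of Θ_{0}, for example `(math.log(0.5), 0.0, math.log(25.0))`, discard the burn-in portion of `chain`, and map the remaining draws back with `to_natural`. Sampling on the log and logit scales keeps *β* and *r* positive and *p* inside (0, 1) without any rejection at the boundaries.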

Following [46, 47], we model the distribution of time to recovery from COVID-19 as the convolution of a lognormal distribution (with mean = 5.2 and sdlog = 0.662) with a Weibull distribution (with mean = 5 and sd = 1.9). The mean and standard deviation of the resulting recovery time distribution are 10.27 and 4.32, respectively. We refer the interested reader to [30] for a detailed description of additional disease progression parameters of SARS-CoV-2 infection.
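A convolution of two distributions is the distribution of the sum of two independent draws, so the recovery time can be explored by simulation. The sketch below does this with the Python standard library. It rests on an interpretation that should be treated as an assumption: 5.2 is read as the mean of the lognormal component (so that meanlog = log 5.2 − sdlog²/2), and the Weibull scale and shape are obtained by matching the stated mean and standard deviation. Under this reading the sum has mean 10.2 and standard deviation about 4.3, close to the reported values. The function names are hypothetical.

```python
import math
import random

def weibull_from_mean_sd(mean, sd):
    """Find Weibull (scale, shape) matching a mean and standard deviation."""
    target_cv2 = (sd / mean) ** 2
    def cv2(k):
        return math.gamma(1 + 2 / k) / math.gamma(1 + 1 / k) ** 2 - 1
    lo, hi = 0.2, 50.0                 # cv2 decreases as the shape k grows
    for _ in range(100):               # bisection on the shape parameter
        mid = 0.5 * (lo + hi)
        if cv2(mid) > target_cv2:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    return mean / math.gamma(1 + 1 / shape), shape

def sample_recovery_times(n, seed=0):
    """Draw n recovery times as lognormal + Weibull (independent components)."""
    rng = random.Random(seed)
    sdlog = 0.662
    meanlog = math.log(5.2) - 0.5 * sdlog ** 2   # assumption: 5.2 is the lognormal mean
    scale, shape = weibull_from_mean_sd(5.0, 1.9)
    return [rng.lognormvariate(meanlog, sdlog) + rng.weibullvariate(scale, shape)
            for _ in range(n)]
```

An empirical survival function built from these draws could serve as the 1 − *F*(*t*) needed elsewhere in the model, although a numerical convolution of the two densities would give a smoother result.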

Separate chains were run for the time series of incidence data from each country, using all the data from the date of the first confirmed COVID-19 cases to May 18th, 2021 (see Table 1). The assumption of a constant transmission rate does not hold, as each country implemented various mitigation and control strategies, from national lockdown orders to closing of public meeting places (see Table 1, which shows the date of first implementation of mitigation as reported in [48]). To avoid having to model the change in the transmission rate resulting from the implementation of mitigations, our parameter estimation starts on the first day of intervention as reported in Table 1. We still use the whole time series from the time of first confirmed incidence to estimate the number of infected individuals as defined by Eq (12).

To reduce the impact of weekly reporting patterns (e.g. fewer cases are reported over the weekend) we apply a moving average of seven days to the raw incidence counts before executing the MCMC algorithm. Finally, the initial conditions , *R*(0) = 0, are set using the reported national population counts and number of initial cases as reported in Table 1.
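The smoothing step can be illustrated with a short function. The text does not say whether the seven-day window is trailing or centered, nor how the first days of the series are handled, so the sketch below assumes a trailing window that averages over however many days are available at the start. The function name `moving_average` is a hypothetical choice.

```python
def moving_average(counts, window=7):
    """Trailing moving average; early entries average over the days seen so far."""
    smoothed = []
    running = 0.0
    for k, y in enumerate(counts):
        running += y
        if k >= window:
            running -= counts[k - window]   # drop the day leaving the window
        smoothed.append(running / min(k + 1, window))
    return smoothed

# Weekday reports of 14 and weekend reports of 0 average out to 10 per day.
print(moving_average([14, 14, 14, 14, 14, 0, 0] * 3)[6:])
```

Because the smoothed values are generally not integers, a likelihood evaluated on them needs a formulation that accepts real-valued counts, for instance one written with gamma functions instead of factorials.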

## Results and discussion

### Analysis of COVID-19 incidence data

We performed separate Bayesian inferences for eight countries in the Americas: the United States of America (USA), Brazil, Mexico, Argentina, Chile, Colombia, Peru, and Panama. Fig 1 shows histograms of the marginal posterior distribution of the transmission rate *β* after the start of mitigation, the fraction observed *p*, and the negative binomial shape parameter *r* for each country. The median and 95% credible intervals of these posterior distributions are presented in Table 2.
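Posterior medians and credible intervals of the kind reported here are typically read off the MCMC draws as empirical quantiles. The sketch below shows one way to do this. Whether the authors used equal-tailed or highest-density intervals is not stated, so the equal-tailed interval and the function name `median_and_interval` are assumptions.

```python
def median_and_interval(samples, level=0.95):
    """Posterior median and equal-tailed credible interval from MCMC draws."""
    xs = sorted(samples)
    def quantile(q):
        # Linear interpolation between neighbouring order statistics.
        pos = q * (len(xs) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])
    tail = (1.0 - level) / 2.0
    return quantile(0.5), (quantile(tail), quantile(1.0 - tail))
```

Applied to each column of a chain after burn-in and back-transformation, this yields one median and one interval per parameter and per country.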

In Fig 1, the *x*-axis corresponds to the estimated values, and the *y*-axis is each bin’s relative frequency.

Even though each country used different mitigation strategies, with various levels of enforcement, the credible intervals for the transmission parameter of each of the eight countries overlap, with the exception of Peru. There are several hypotheses for why this may be the case: the effectiveness of the various mitigation strategies is compromised by having a small fraction of non-compliant individuals, or most of the benefits of the mitigation strategy are achieved by wearing face masks and moderate social distancing. A third hypothesis is that the estimated transmission rate in our model is a time average of the instantaneous transmission rates, and that averaging lessens the differences in transmission rates.

Similarly, the posterior distributions for the fraction of observed incidence are similar across most of the analyzed countries. The two exceptions are Peru and Mexico, with the under-reporting in Mexico being particularly acute. This is consistent with the observation that Mexico has one of the lowest numbers of tests performed per reported case [49]. While an under-reporting factor of about 15 is very large, we believe this effect is real because of how well the model fits the data (see the appendix) and the narrowness of the posterior distribution.

Related analyses of COVID-19 data in Mexico have used values for the fraction of reported cases of *p* = 0.2 or *p* = 0.4 to analyze and forecast the evolution of the COVID-19 pandemic and hospital demands [50, 51]. These values are closer to the values that we found for the other Latin American countries. However, these values were not derived from the data. It would be interesting to use our model to investigate the under-reporting in Mexico at a county level to see how the results would differ from local to national levels.

Excess deaths [52] provide an alternative measure of the true impact of COVID-19. Using that measure, [53] reports that COVID-19 deaths in Mexico are under-reported by a factor of 3, whereas we show a factor of 15 for under-reported incidence. This difference may be due to differential testing rates of deceased and infected individuals, which may arise from the standard of care for severely ill patients admitted to intensive care units, which requires COVID-19 testing [50, 54].

Our analysis flags Peru as being different from the other countries both in terms of having a higher transmission rate and a lower reported fraction. Our analysis does not reveal why this is the case, and further analysis incorporating country-level explanatory variables to predict transmission rates and under-reporting is needed to uncover the reasons why Peru is different from the other countries in the Americas that we studied.

Finally, the estimate of the shape parameter *r* of the negative binomial distribution shows that the relative inflation of the Poisson variance ranges from 2% to 5%. That effect is statistically significant. Again, the distributions across the eight countries are commensurate, with the United States and Peru exhibiting more extra-Poisson variability than the other countries.

### Under-estimation of the transmission rate

In light of Eq (7), we suggested in the introduction that failing to account for under-reporting leads to underestimating the transmission rate *β*. Here, we numerically demonstrate this effect by fitting an SIR-type model directly to raw incidence data, deliberately neglecting to model under-reporting. The median and 95% credible intervals of the posterior distribution for the transmission rate, with and without modeling under-reporting, are displayed in Table 3.

The parameter *β*_{p} coincides with the values from Table 2, while *β*_{1} refers to estimates when the fraction observed is fixed at *p* = 1. The posterior distributions for the shape parameter *r* were similar for *p* unknown and *p* fixed.

Observe that in all cases, the 95% credible intervals for the transmission rate do not overlap. This shows that knowledge of the fraction of reported incidence is statistically important.

### Variation on the fraction of reported cases

In this section, we consider modeling and estimating a time-dependent fraction *p*(*t*) of reported incidence, which can arise from uneven availability of COVID-19 tests [31, 50, 55]. To this end, we model the reported fraction *p*(*t*) with a piece-wise constant function:
$$p(t) = \sum_{k=1}^{M} p_k\, \mathbb{1}_{[\xi_{k-1},\,\xi_k)}(t) \tag{27}$$
for all 0 < *t* ≤ *t*_{n} ≤ *ξ*_{M}, where $\mathbb{1}_{[\xi_{k-1},\,\xi_k)}$ denotes the indicator function of the interval [*ξ*_{k−1}, *ξ*_{k}). We regularize the sequence of reported fractions *p*_{1}, *p*_{2}, …, *p*_{M} by adding the penalty
(28)
to the log-likelihood. We assume that the variations between consecutive reported fractions, *p*_{k} − *p*_{k−1}, are independent and identically normally distributed with mean zero and variance 1/λ.
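The two ingredients of this section, a piece-wise constant *p*(*t*) and a smoothness penalty on its increments, can be sketched in a few lines of Python. The penalty below is the log-density of independent normal increments with mean zero and variance 1/λ, which matches the stated distributional assumption. The exact form of Eq (28), for example whether it keeps the normalizing term in λ, is not reproduced here, so this should be read as one plausible version. The function names and the 90-day default piece length are illustrative.

```python
import math

def reported_fraction(t, fractions, piece_length=90.0):
    """Piece-wise constant p(t): fractions[k-1] applies on [(k-1)*L, k*L)."""
    k = min(int(t // piece_length), len(fractions) - 1)
    return fractions[k]

def smoothness_log_penalty(fractions, lam):
    """Log-density of independent N(0, 1/lam) increments p_k - p_{k-1}."""
    diffs = [b - a for a, b in zip(fractions, fractions[1:])]
    return sum(0.5 * math.log(lam / (2 * math.pi)) - 0.5 * lam * d * d
               for d in diffs)
```

Large values of `lam` pull neighbouring fractions toward each other, so a nearly constant *p*(*t*) is favoured unless the data provide evidence of change.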

As in the previous section, we performed separate Bayesian inferences to estimate the posterior distributions of *β*, *p*_{1}, *p*_{2}, …, *p*_{M}, *r* and λ for each analyzed country: the United States of America, Brazil, Mexico, Argentina, Chile, Colombia, Peru, and Panama. We transform *p*_{k} and λ into *η*_{k} = log(*p*_{k}/(1 − *p*_{k})) and *l* = log(λ) and use uniform improper priors on the transformed parameters. We defined *p*(*t*) with constant pieces of length modulo 90 days and we use the equations from Lemma 2 to compute the expected incidence *μ*_{k}. These equations generalize the equations from Lemma 1 when *p* = *p*_{k} for all *k* = 1, 2, …, *M*; see the appendix for further details. The median and 95% credible intervals of the posterior distributions of *β*, *r*, and λ are presented in Table 4. For clarity, the reported values of λ are rounded to the nearest integer. The analogous results for the posterior distributions for each reported fraction, *p*_{1}, *p*_{2}, *p*_{3}, *p*_{4} and *p*_{5}, are plotted in the second panel of Figs 2–4 for the United States of America, Brazil, and Peru. The corresponding results for Mexico, Argentina, Chile, Colombia, and Panama are shown in the second panel of S1–S5 Figs. In all cases, the 95% credible intervals for the *p*_{k} values are displayed in the blue-shadow areas, while their median values are plotted in blue-dashed-dotted lines.

Additionally, in the first panel of Figs 2–4 and S1–S5 Figs we show the credibility of the model estimates for the daily COVID-19 incidence for each country, where the expected median of reported cases, *μ*_{k}, are plotted in red lines, the upper and lower credible intervals are plotted in blue lines, while the expected incidences lie in the blue-shadow area with probability of 95%. The negative binomial distribution function, Eq (14), was used to build the credible intervals. To estimate the expected cases, the parameter values for *β* and *r* were set equal to the values provided in Table 4 and the *p*_{k} values were set to the estimated median values of *p*(*t*) as shown in the second panel of each figure and for each country, respectively.
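Bands of this kind can be obtained from quantiles of the negative binomial distribution at each fitted mean. The sketch below finds the 2.5% and 97.5% quantiles by accumulating the probability mass function. It is an illustration only: the function names are hypothetical, and the authors' exact procedure for constructing the intervals from Eq (14) is not described in enough detail to reproduce.

```python
import math

def nb_logpmf(y, mu, r):
    """log P(Y = y) for a negative binomial with mean mu and shape r."""
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))

def nb_quantile(q, mu, r):
    """Smallest y with P(Y <= y) >= q, for 0 < q < 1."""
    cdf, y = 0.0, 0
    while True:
        cdf += math.exp(nb_logpmf(y, mu, r))
        if cdf >= q:
            return y
        y += 1

def nb_band(mu, r, level=0.95):
    """Equal-tailed interval for a count with mean mu and shape r."""
    tail = (1.0 - level) / 2.0
    return nb_quantile(tail, mu, r), nb_quantile(1.0 - tail, mu, r)
```

Calling `nb_band(mu_k, r)` for every fitted mean and plotting the two resulting curves gives a band around the expected incidence. The linear search costs time proportional to the quantile, which is acceptable for daily counts but could be replaced by a library routine for very large means.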

From Table 4, the marginal posterior distributions for the parameter λ overlap for all analyzed countries. All these marginal posterior distributions are skewed to the right, with large values. For most countries, the credible intervals for *p*(*t*) include a constant function. That is, statistically, we do not have enough evidence to reject the hypothesis that the reported fraction *p*(*t*) for each country is constant over the entire analyzed period. For countries that have a small variance λ^{−1} for the increment *p*_{k} − *p*_{k−1}, we have further evidence that *p*(*t*) is nearly constant. The one country for which a constant *p*(*t*) is not retained is Mexico (see S1 Fig).

The second panels of Figs 2–4 show some variation across the *p*_{k} credible intervals for the United States of America, Brazil, and Peru. Interestingly, for each country the credible intervals of the *p*_{k} together span a wider range than the credible interval obtained when assuming a constant fraction *p* of observed cases, as reported in Table 2. Comparing Tables 2 and 4, we see no significant changes in the posterior distributions of *β* and *r* between the constant-*p* and variable-*p* models. In general, we observe more variation in the observed proportion *p* across countries than within a country. This is not surprising, as countries implemented different testing policies, which may affect how the incidence data were reported [49, 50, 55].

### Strengths and weaknesses of the proposed local SIR model

Our model locally exploits the SIR dynamics, using past observations to set the initial conditions. This results in a flexible model that can fit complex patterns, such as multiple waves that typically require a time-varying transmission rate, with a single parameter. This flexibility comes at a cost: our single estimated transmission rate is a time average of the true time-varying one. While we show that our model empirically fits the data well within the credible intervals, we over-estimate the expected incidence in the valleys and under-estimate it near the peaks. It follows that the derived estimates of the reproductive number near a local minimum of an outbreak will have a positive bias, leading to a more conservative view of the effect of mitigation.

Our formulation can be generalized to build epidemic models with non-parametric transmission rates. Such models would alleviate the weakness discussed above and could be used to identify model-based uncertainties. These extensions will be presented in a forthcoming paper. We also plan to extend the model by incorporating an exposed class, which would provide a more realistic model of the COVID-19 pandemic, since COVID-19 disease progression depends on the length of time an individual remains in both the exposed and the infectious classes [30]. This extension would help us analyze the effect of different infectious period distributions, which could change early in the outbreak due to interventions such as testing, isolation, or contact tracing.

Finally, our model has a limited ability to estimate time-varying under-count fractions. Numerical experiments have shown that adding more flexibility to how the latter varies over time degrades our ability to estimate the transmission rate.

## Conclusion

We present a new extension of the standard SIR epidemiological model to study under-reported incidence of infectious diseases. The new model reveals that fitting an SIR-type model directly to raw incidence data, while neglecting under-reported cases, will under-estimate the true infectious rate. Using the epidemic model, we also present a Bayesian methodology to estimate the transmission rate and the fraction of under-reported incidence, with credible intervals, directly from incidence data. We also argue that our statistical model can properly track and estimate complex incidence reports, with the resulting estimates updating as more data are incorporated.

Applying our methodology to the COVID-19 example, we found that the credible intervals for the transmission rates overlap across the eight analyzed American countries: the United States of America, Brazil, Mexico, Argentina, Chile, Colombia, Peru, and Panama. In all cases, the median transmission rates are above 0.105 and below 0.122 (see Tables 2–4). For most countries, the credible intervals for the time-dependent fraction of reported cases *p*(*t*) include a constant function, and they also provide a range of values for the fraction of reported cases in each country. On average, from January 03, 2020 to May 18, 2021, the reported incidence fraction varies from 0.3 to 0.6 for the United States of America and Panama; from 0.2 to 0.5 for Brazil, Chile, Colombia, and Argentina; from 0.15 to 0.35 for Peru; and from 0.05 to 0.1 for Mexico (see Figs 2–4 and S1–S5 Figs).

## Appendix

### Proof of existence and uniqueness of solutions of the generalized SIR model

To prove existence and uniqueness of solutions of System (1)–(3), it is further assumed that the fraction of recovered individuals is defined through a probability distribution function, *F*: [0, ∞) → [0, 1], with the following properties.

**Property 1** *There exists an integrable function f*: [0, ∞) → [0, ∞) *such that*
$$F(t) = \int_0^t f(s)\,ds$$
*for all t* ∈ [0, ∞).

**Property 2** *The average recovery time is finite, i.e*.,
$$\int_0^\infty t\,f(t)\,dt < \infty.$$

**Theorem 3** *Let U be an open set of* [0, *N*] × [0, *N*] × [0, *N*] × [0, ∞) *and K a compact subset of U containing* (*S*(0), *I*(0), *R*(0), *t*_{0}), *the initial condition of System* (1)–(3), *with f*(*t*) *continuously differentiable with respect to t*, *t* ≥ 0 *in U*. *Then there exists a unique solution of System* (1)–(3) *through the point* (*S*(0), *I*(0), *R*(0)) *at t* = 0, *denoted X*(*S*(*t*), *I*(*t*), *R*(*t*), *t*), *with X*(*S*(0), *I*(0), *R*(0), 0) = (*S*(0), *I*(0), *R*(0)), *for all t such that X*(*S*(*t*), *I*(*t*), *R*(*t*), *t*) ∈ *K*.

**Proof of Theorem 3** From Property 2, System (1)–(3) is well defined and it is equivalent to
(29)
(30)
(31)
which is obtained by taking the derivative with respect to *t* of Eqs (2) and (3) and using Property 1. Therefore, it is enough to prove existence and uniqueness of solutions of System (29)–(31). It follows that the function *G*: *U* → **R**^{3} defined by
(32)
is continuously differentiable in *U*; see, for example, [56, p. 32]. Indeed, since the partial derivatives $\partial G/\partial S$, $\partial G/\partial I$, $\partial G/\partial R$, and $\partial G/\partial t$ exist and are continuous in *U*, *G* is continuously differentiable in *U*. Therefore, the solution of System (29)–(31) exists for the initial condition (*S*(0), *I*(0), *R*(0)) and is unique in *K*.

**Proof of Theorem 2** Set *α*_{1} = *β*/*p* and *α*_{2} = *β*(1 − *p*)/*p*. Since
$$\beta = \alpha_1 - \alpha_2 \quad\text{and}\quad p = \frac{\alpha_1 - \alpha_2}{\alpha_1},$$
identifiability of *α*_{1} and *α*_{2} implies identifiability of *β* and *p*. We can estimate *α*_{1} and *α*_{2} by minimizing the sum of squares
(33)
The two parameters are identifiable if and only if the vectors (*U*_{1}, …, *U*_{m}) and (*V*_{1}, …, *V*_{m}) are not collinear.
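The two ingredients of this argument can be illustrated numerically. The sketch below, with hypothetical function names and made-up numbers, inverts the definitions *α*_{1} = *β*/*p* and *α*_{2} = *β*(1 − *p*)/*p* (which give *β* = *α*_{1} − *α*_{2} and *p* = *β*/*α*_{1}) and tests two vectors for collinearity through their Gram determinant. It does not reproduce the sum of squares in Eq (33).

```python
def recover_beta_p(alpha1, alpha2):
    """Invert alpha1 = beta / p and alpha2 = beta * (1 - p) / p."""
    beta = alpha1 - alpha2        # beta/p - beta*(1 - p)/p = beta
    p = beta / alpha1             # beta / (beta / p) = p
    return beta, p


def are_collinear(u, v, rel_tol=1e-12):
    """True when u and v are (numerically) collinear.

    Uses the Gram determinant <u,u><v,v> - <u,v>^2, which is zero exactly
    when one vector is a scalar multiple of the other (Cauchy-Schwarz).
    """
    uu = sum(a * a for a in u)
    vv = sum(b * b for b in v)
    uv = sum(a * b for a, b in zip(u, v))
    return uu * vv - uv * uv <= rel_tol * max(uu * vv, 1.0)


# Hypothetical true values, used only to check the round trip.
beta_true, p_true = 0.11, 0.4
alpha1 = beta_true / p_true
alpha2 = beta_true * (1.0 - p_true) / p_true
beta_hat, p_hat = recover_beta_p(alpha1, alpha2)
```

When the two regressor vectors are collinear, the normal equations of a two-parameter least-squares fit become singular, so the pair (*α*_{1}, *α*_{2}) cannot be separated; this is the situation the condition above rules out.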

### Modeling the time-dependent fraction of reported incidence

The following definition describes the dynamics of the observed susceptible and infected individuals when a constant fraction *p*_{k} of infected individuals is observed on each interval (*t*_{k−1}, *t*_{k}], i.e., *p*(*t*) = *p*_{k} for all *t* in that interval. This hypothesis allows us to study the case in which the parameter *p* is a piecewise-constant function of time, as defined in Eq (27).

**Definition 2** *Let Y*_{1}, *Y*_{2}, …, *Y*_{k} *be the sequence of observed incidences and assume that the cumulative probability distribution F for the time to recovery is continuous. We model the local dynamics of the observed number of susceptible* *and infected* *individuals at time t in the interval* (*t*_{k−1}, *t*_{k}] *through the set of differential-integral equations*:
(34)
(35)
*with initial conditions for the observed susceptible individuals*
(36)
*and under the hypothesis* 1 ≥ *p*_{k} > 0, , *and* *for all t and k* = 1, 2, …, *n*. *For this model, the conditional expectation of incidence given the past history is*
(37)
*for all k* = 1, 2, …, *n*.

Note that Definition 1 and Definition 2 coincide when *p* = *p*_{k} for all *k* = 1, …, *n*. In the following, we provide the mathematical motivation for Definition 2, using ideas similar to those in the derivation of Definition 1.

First, from the definition of *I*(*t*), we re-write Eq (2) as follows:
Similarly for *S*(*t*),
where in the second equation we used the hypothesis *N* = *S*(0) + *I*(0). Then, from the above two equations, we estimate *I*(*t*) and *S*(*t*) with the equations:
(38)
(39)
which follow by estimating *p*_{j} *S*′(*u*) with and then setting for all *u* ∈ (*t*_{j−1}, *t*_{j}] and all *j* = 1, 2, …, *k*−1. The last equality follows by assuming that the total cases *Y*_{j} occur uniformly in the observed interval. Now, solving the integral of Eq (39), with the initial conditions defined by Eq (36) and simplifying it, yields:
The above equation implies that . Therefore, from the estimates and , Eqs (38) and (39), and the true transmission dynamics process, Eq (1), we have:
where . Therefore, and satisfy Definition 2.

The next lemma provides a recursive formula to approximate the conditional expectation *μ*_{k} defined by Eq (37). The equation results directly from solving the integral of Eq (37) with the linear approximation of both and around *t*_{k−1}. Its proof is similar to the proof of Lemma 1.

**Lemma 2** *Assume that the cumulative probability distribution F for the time to recovery has a probability density f*. *The conditional expectation μ*_{k} *can be approximated by*
(40)
*when* *and* , *and μ*_{k} = 0 *otherwise. Here*,
(41)
(42)
(43)
(44)
*for all k* = 1, 2, …, *n*.

## Supporting information

### S1 Fig. Daily COVID-19 incidence and fraction of reported cases for Mexico from February 28, 2020 to May 18, 2021.

https://doi.org/10.1371/journal.pone.0263047.s001

(TIF)

### S2 Fig. Daily COVID-19 incidence and fraction of reported cases for Argentina from March 03, 2020 to May 18, 2021.

https://doi.org/10.1371/journal.pone.0263047.s002

(TIF)

### S3 Fig. Daily COVID-19 incidence and fraction of reported cases for Chile from March 03, 2020 to May 18, 2021.

https://doi.org/10.1371/journal.pone.0263047.s003

(TIF)

### S4 Fig. Daily COVID-19 incidence and fraction of reported cases for Colombia from March 06, 2020 to May 18, 2021.

https://doi.org/10.1371/journal.pone.0263047.s004

(TIF)

### S5 Fig. Daily COVID-19 incidence and fraction of reported cases for Panama from March 10, 2020 to May 18, 2021.

https://doi.org/10.1371/journal.pone.0263047.s005

(TIF)

## References

- 1. Kermack WO, McKendrick AG. A contribution to the mathematical theory of epidemics. Proceedings of the royal society of london Series A, Containing papers of a mathematical and physical character. 1927;115(772):700–721.
- 2. Wilson EB, Worcester J. The law of mass action in epidemiology. Proceedings of the National Academy of Sciences of the United States of America. 1945;31(1):24. pmid:16588678
- 3. Anderson RM, May RM. Infectious diseases of humans: dynamics and control. Oxford university press; 1992.
- 4. Brauer F, Castillo-Chavez C. Mathematical models in population biology and epidemiology. vol. 2. Springer; 2012.
- 5. Diekmann O, Heesterbeek JAP. Mathematical epidemiology of infectious diseases: model building, analysis and interpretation. vol. 5. John Wiley & Sons; 2000.
- 6. Chowell G, Hengartner NW, Castillo-Chavez C, Fenimore PW, Hyman JM. The basic reproductive number of Ebola and the effects of public health measures: the cases of Congo and Uganda. Journal of theoretical biology. 2004;229(1):119–126. pmid:15178190
- 7. Chowell G, Rivas A, Hengartner N, Hyman J, Castillo-Chavez C. The role of spatial mixing in the spread of foot-and-mouth disease. Preventive Veterinary Medicine. 2006;73(4):297–314. pmid:16290298
- 8. Chowell G, Ammon CE, Hengartner NW, Hyman JM. Estimating the reproduction number from the initial phase of the Spanish flu pandemic waves in Geneva, Switzerland. Mathematical Biosciences & Engineering. 2007;4(3):457. pmid:17658935
- 9. CDC. Centers for Disease Control and Prevention. 2020 [cited 10 November 2020]. Available from: https://www.cdc.gov.
- 10. Johns Hopkins University. COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. 2020 [cited 10 November 2020]. Available from: https://github.com/CSSEGISandData/COVID-19.
- 11. PAHO. Pan American Health Organization. 2020 [cited 10 November 2020]. Available from: https://www.paho.org/en.
- 12. WHO. COVID-19 Global Data, Geneva: World Health Organization. 2020 [cited 10 November 2020]. Available from: https://covid19.who.int/WHO-COVID-19-global-data.csv.
- 13. Del Valle SY, McMahon BH, Asher J, Hatchett R, Lega JC, Brown HE, et al. Summary results of the 2014-2015 DARPA Chikungunya challenge. BMC infectious diseases. 2018;18(1):245. pmid:29843621
- 14. Lai CC, Liu YH, Wang CY, Wang YH, Hsueh SC, Yen MY, et al. Asymptomatic carrier state, acute respiratory disease, and pneumonia due to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2): Facts and myths. Journal of Microbiology, Immunology and Infection. 2020;53(3):404–412. pmid:32173241
- 15. Kalajdzievska D, Li MY. Modeling the effects of carriers on transmission dynamics of infectious diseases. Mathematical Biosciences & Engineering. 2011;8(3):711. pmid:21675806
- 16. Duffy MR, Chen TH, Hancock WT, Powers AM, Kool JL, Lanciotti RS, et al. Zika virus outbreak on Yap Island, federated states of Micronesia. New England Journal of Medicine. 2009;360(24):2536–2543. pmid:19516034
- 17. Doll M, Pryor R, Mackey D, Doern C, Bryson A, Bailey P, et al. Utility of Re-testing for Diagnosis of SARS-CoV-2/COVID-19 in Hospitalized Patients: Impact of the Interval between Tests. Infection Control & Hospital Epidemiology. 2020; p. 1–6.
- 18. Esbin MN, Whitney ON, Chong S, Maurer A, Darzacq X, Tjian R. Overcoming the bottleneck to widespread testing: A rapid review of nucleic acid testing approaches for COVID-19 detection. RNA. 2020; p. rna–076232. pmid:32358057
- 19. Li R, Pei S, Chen B, Song Y, Zhang T, Yang W, et al. Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2). Science. 2020;368(6490):489–493. pmid:32179701
- 20. Team TNCPERE. The epidemiological characteristics of an outbreak of 2019 novel coronavirus disease (COVID-19)—China. China CDC Weekly. 2020;2(8):113–122.
- 21. Furuya-Kanamori L, Cox M, Milinovich GJ, Magalhaes RJS, Mackay IM, Yakob L. Heterogeneous and dynamic prevalence of asymptomatic influenza virus infections. Emerging infectious diseases. 2016;22(6):1052. pmid:27191967
- 22. Reed C, Angulo FJ, Swerdlow DL, Lipsitch M, Meltzer MI, Jernigan D, et al. Estimates of the prevalence of pandemic (H1N1) 2009, United States, April–July 2009. Emerging infectious diseases. 2009;15(12):2004. pmid:19961687
- 23. Shutt DP, Manore CA, Pankavich S, Porter AT, Del Valle SY. Estimating the reproductive number, total outbreak size, and reporting rates for Zika epidemics in South and Central America. Epidemics. 2017;21:63–79. pmid:28803069
- 24. Hyman JM, Li J, Stanley EA. The differential infectivity and staged progression models for the transmission of HIV. Mathematical biosciences. 1999;155(2):77–109. pmid:10067074
- 25. Romero-Severson E, Hengartner N, Meadors G, Ke R. Decline in global COVID-19 transmission. medRxiv. 2020.
- 26. Lopman B, Simmons K, Gambhir M, Vinjé J, Parashar U. Epidemiologic implications of asymptomatic reinfection: a mathematical modeling study of norovirus. American journal of epidemiology. 2014;179(4):507–512. pmid:24305574
- 27. Ke R, Sanche S, Romero-Severson E, Hengartner N. Fast spread of COVID-19 in Europe and the US suggests the necessity of early, strong and comprehensive interventions. medRxiv. 2020. pmid:32511619
- 28. Sanche S, Lin YT, Xu C, Romero-Severson E, Hengartner N, Ke R. High contagiousness and rapid spread of severe acute respiratory syndrome coronavirus 2. Emerg Infect Dis. 2020;26(7):1470. pmid:32255761
- 29. Bai Y, Yao L, Wei T, Tian F, Jin DY, Chen L, et al. Presumed asymptomatic carrier transmission of COVID-19. J Am Med Assoc. 2020;323(14):1406–1407. pmid:32083643
- 30. Bar-On YM, Sender R, Flamholz AI, Phillips R, Milo R. A quantitative compendium of COVID-19 epidemiology. arXiv:2006.01283v3 [Preprint]. 2021 [cited 2021 August 10]. Available from: https://arxiv.org/abs/2006.01283.
- 31. Rothe C, Schunk M, Sothmann P, Bretzel G, Froeschl G, Wallrauch C, et al. Transmission of 2019-nCoV infection from an asymptomatic contact in Germany. New England Journal of Medicine. 2020;382(10):970–971. pmid:32003551
- 32. Bettencourt LM, Ribeiro RM. Real time bayesian estimation of the epidemic potential of emerging infectious diseases. PLoS One. 2008;3(5). pmid:18478118
- 33. WHO. COVID-19 Explorer. Geneva: World Health Organization. 2020 [cited 10 November 2020]. Available from: https://worldhealthorg.shinyapps.io/covid/.
- 34. Hethcote HW, Tudor DW. Integral equation models for endemic infectious diseases. Journal of mathematical biology. 1980;9(1):37–47. pmid:7365328
- 35. Kirkeby C, Halasa T, Gussmann M, Toft N, Græsbøll K. Methods for estimating disease transmission rates: Evaluating the precision of Poisson regression and two novel methods. Scientific reports. 2017;7(1):1–11. pmid:28842576
- 36. Weiss HH. The SIR model and the foundations of public health. Materials matematics. 2013; p. 0001–17.
- 37. Murray J. Mathematical Biology, 2nd edition. Berlin: Springer-Verlag; 1993.
- 38. Heesterbeek JAP. A brief history of *R*_{0} and a recipe for its calculation. Acta Biotheoretica. 2002;50:189–204. pmid:12211331
- 39. Sharpe FR, Lotka AJ. L. A problem in age-distribution. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science. 1911;21(124):435–438.
- 40. Heesterbeek J, Dietz K. The concept of *R*_{0} in epidemic theory. Statistica Neerlandica. 1996;50(1):89–110.
- 41. Heffernan JM, Smith RJ, Wahl LM. Perspectives on the basic reproductive ratio. Journal of the Royal Society Interface. 2005;2(4):281–293.
- 42. Team WER. Ebola virus disease in West Africa—the first 9 months of the epidemic and forward projections. New England Journal of Medicine. 2014;371(16):1481–1495.
- 43. Wang Z. One mixed negative binomial distribution with application. Journal of Statistical Planning and Inference. 2011;141(3):1153–1160.
- 44. Shmueli G, Minka TP, Kadane JB, Borle S, Boatwright P. A useful distribution for fitting discrete data: revival of the Conway–Maxwell–Poisson distribution. Journal of the Royal Statistical Society: Series C (Applied Statistics). 2005;54(1):127–142.
- 45. Makowski D, Wallach D, Tremblay M. Using a Bayesian approach to parameter estimation; comparison of the GLUE and MCMC methods. Agronomie. 2002;22(2):191–203.
- 46. Ferretti L, Wymant C, Kendall M, Zhao L, Nurtay A, Abeler-Dörner L, et al. Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing. Science. 2020;368 (6491). pmid:32234805
- 47. Li Q, Guan X, Wu P, Wang X, Zhou L, Tong Y, et al. Early transmission dynamics in Wuhan, China, of novel coronavirus–infected pneumonia. New England Journal of Medicine. 2020;382(13):1199–1207. pmid:31995857
- 48. Wikipedia. COVID-19 pandemic lockdowns. 2020 [cited 10 November 2020]. Available from: https://en.wikipedia.org/wiki/COVID-19_pandemic_lockdowns.
- 49. Hasell J, Mathieu E, Beltekian D, Macdonald B, Giattino C, Ortiz-Ospina E, et al. COVID-19: Daily tests vs. Daily new confirmed cases per million. 2021 [cited 15 June 2021]. Available from: https://ourworldindata.org/grapher/covid-19-daily-tests-vs-daily-new-confirmed-cases-per-million?country=~MEX.
- 50. Capistran MA, Capella A, Christen JA. Forecasting hospital demand in metropolitan areas during the current COVID-19 pandemic and estimates of lockdown-induced 2nd waves. PloS one. 2021;16(1):e0245669. pmid:33481925
- 51. Saldaña F, Flores-Arguedas H, Camacho-Gutiérrez JA, Barradas I. Modeling the transmission dynamics and the impact of the control interventions for the COVID-19 epidemic outbreak. Math Biosci Eng. 2020;17(4):4165–4183. pmid:32987574
- 52. Cuéllar L, Torres I, Romero-Severson E, Mahesh R, Ortega N, Pungitore S, et al. Excess deaths reveal the true spatial, temporal, and demographic impact of COVID-19 on mortality in Ecuador. Cold Spring Harbor Laboratory Press. 2021. https://doi.org/10.1101/2021.02.25.21252481
- 53. Dahal S, Banda JM, Bento AI, Mizumoto K, Chowell G. Characterizing all-cause excess mortality patterns during COVID-19 pandemic in Mexico. BMC Infectious Diseases. 2021;21(1):1–10. pmid:33962563
- 54. Murthy S, Gomersall CD, Fowler RA. Care for Critically Ill Patients With COVID-19. JAMA. 2020;323(15):1499–1500. pmid:32159735
- 55. Wu SL, Mertens AN, Crider YS, Nguyen A, Pokpongkiat NN, Djajadi S, et al. Substantial underestimation of SARS-CoV-2 infection in the United States. Nature communications. 2020;11(1):1–10. pmid:32908126
- 56. Wiggins S. Introduction to applied nonlinear dynamical systems and chaos. vol. 2. Springer Science & Business Media; 2003.