Addressing delayed case reporting in infectious disease forecast modeling

Infectious disease forecasting is of great interest to the public health community and policymakers, since forecasts can provide insight into disease dynamics in the near future and inform interventions. Due to delays in case reporting, however, forecasting models may often underestimate the current and future disease burden. In this paper, we propose a general framework for addressing reporting delay in disease forecasting efforts with the goal of improving forecasts. We propose strategies for leveraging either historical data on case reporting or external internet-based data to estimate the amount of reporting error. We then describe several approaches for adapting general forecasting pipelines to account for under- or over-reporting of cases. We apply these methods to address reporting delay in data on dengue fever cases in Puerto Rico from 1990 to 2009 and to reports of influenza-like illness (ILI) in the United States between 2010 and 2019. Through a simulation study, we compare method performance and evaluate robustness to assumption violations. Our results show that forecasting accuracy and prediction coverage almost always increase when correction methods are implemented to address reporting delay. Some of these methods required knowledge about the reporting error or high quality external data, which may not always be available. Provided alternatives include excluding recently-reported data and performing sensitivity analysis. This work provides intuition and guidance for handling delay in disease case reporting and may serve as a useful resource to inform practical infectious disease forecasting efforts.

This is a PLOS Computational Biology Methods paper.

Introduction
Accurate forecasts of future disease rates are useful for informing public health policy and response to infectious disease outbreaks [1]. Strategies for improving forecasts are of great interest to the public health and epidemic modeling communities, and the most up-to-date reports of case diagnoses can often provide useful insight into disease dynamics in the near future. For many diseases, however, surveillance and case identification take time, and there is often a delay between when people become infected with an infectious disease and when that infection is officially reported and made publicly available [2].
Often, reporting at the national level requires many intermediate reporting steps; diagnostic labs and health care systems report suspected and/or confirmed cases to local public health systems, which then report to states, which then report to the national level [2]. Delays in each one of these reporting steps can result in substantial lag in the information available for forecasting, and these delays may vary on a variety of factors, including state infrastructure and threshold for reporting suspected vs. confirmed cases. National, state, and local reporting systems produce regular reports providing the most up-to-date statistics on disease diagnoses (e.g., weekly statistics provided by the National Notifiable Diseases Surveillance System), and these provisional reports may be revised/corrected through a process called backfill as new data become available. Real-time reports of current disease burden often substantially underestimate the number of cases eventually reported (hereafter referred to as the validation case counts) for a given time period [3].
Reporting delay can also be a result of the time indexing of the forecasted endpoint. During the COVID-19 pandemic, a common problem is to predict the number of COVID-19 positive cases that will eventually be reported, relative to the date of symptom onset (e.g., [4]). Restated, we may want to predict how many newly symptomatic cases there are today based on tests that have not yet been administered, which naturally results in a delay between symptom onset and official case reporting. Reporting delay also occurs when predicting hospitalization rates among newly-diagnosed patients, since hospitalization due to disease takes time to occur [5]. In this way, the problem of delayed event or case reporting can arise in a variety of ways, either through official reporting processes or implicitly through our definitions of what it means to be a "case".
When the goal is to forecast future disease burden, it is not always clear how one should use these error-prone real-time case reports. When the reporting delay is minimal, we may expect reporting to have little impact on model forecasts and use these real-time reports without correction. When reporting delay is substantial and when real-time estimates and forecasts of disease burden are relied upon to inform policy interventions, additional thought is needed to address the reporting delay.
One common strategy for handling reporting delay is to rescale real-time case reports using the expected amount of under-reporting (e.g., [6]). These scaling factors (hereafter referred to as reporting factors) can be estimated from historical data on real-time case reporting, if available, and sophisticated strategies have been developed to model reporting factors and the disease process jointly (e.g., [3,[7][8][9]). Other work has used external data to predict the validation cases (e.g., social media, Google Trends) and account for reporting delay using a weighted version of the real-time data and the validation data predictions [10]. Backcast imputation approaches handle the reporting error by generating past weeks' validation data using the most recent data reports for those weeks and accounting for uncertainty in reporting patterns [11]. These methods have been studied across multiple disciplines beyond infectious disease modeling such as in actuarial claims prediction [9] and in correction of cancer registry case reporting [12]. During the COVID-19 pandemic, a substantial literature on "nowcasting" has arisen to tackle reporting delay due to delays in testing after symptom onset and to leverage data on individual disease cases when available, where estimation is further complicated by incomplete data on date of symptom onset among confirmed cases [4,13,14]. Some of the studies describing how to handle reporting delay compare two or three methods head to head in narrow modeling settings, but there are no recommendations for how reporting delay should be handled for infectious disease forecasting in general practice.
In this paper, we propose a general framework for conceptualizing and addressing reporting delay in terms of a two-stage estimation problem: (1) estimation of the reporting factors as a function of time since the initial report and (2) disease modeling and forecasting, accounting for under-or over-reporting. This framework highlights and extends existing methods in the literature for handling delayed case reporting, and we demonstrate how these approaches can be implemented for infectious disease forecast modeling in general. We then propose several novel strategies for leveraging either historical data on case reporting or external social media/ Google trends data to estimate the amount of reporting error. We apply these methods to address reporting delay in data on dengue fever cases in Puerto Rico from 1990 to 2009 and to reports of influenza-like illness (ILI) in the United States between 2010 and 2019. Through a simulation study, we compare how these methods perform when all assumptions are satisfied and evaluate their robustness when assumptions are violated. This work provides intuition and guidance for the comparative performance of various methods for handling delay in disease case reporting and may serve as a useful resource to inform practical infectious disease forecasting efforts.
In Section 2, we introduce the data examples, and we describe the various conceptual approaches for handling reporting delay in Section 3. We implement various approaches for several data examples in Section 4, and we explore performance and robustness of these methods in greater detail in Section 5. We present a discussion in Section 6.

Disease surveillance data with reporting delay
We consider two motivating datasets in which delayed reporting of disease cases results in discrepancies between the number of cases reported in real-time and the number of cases eventually reported. In our first data example, we model cases of dengue fever reported between 1990 and 2009 in Puerto Rico. In our second data example, we explore influenza-like illnesses in the United States between 2010 and 2019.

Dengue fever surveillance data in Puerto Rico, 1990-2009
We obtain publicly-available data on dengue fever cases in Puerto Rico between 1990 and 2009 that are provided alongside McGough et al. (2020) [3] as part of the R package NobBS. These data provide the official diagnosis week and the week of case reporting for over 53,000 cases of laboratory-confirmed dengue fever in Puerto Rico as collected by the Puerto Rico Department of Health and the Centers for Disease Control and Prevention. Using these person-level data, we calculate the number of cases diagnosed in each calendar week, which we call the validation data. For each calendar week, we also calculate the number of these validation cases that were actually reported as part of the initial case counts from the first week and the number that were reported in each of the following 6 weeks. These weekly reporting data were constructed to mimic weekly "official" case reports. We will refer to these constructed "reports" throughout, but it should be understood that these are hypothetical weekly reports rather than officiallyreleased governmental statistics. A subset of the resulting case counts are shown in Fig A in S1 Text. In Fig 1a, we show that only a very small proportion (� 5%) of eventually-reported cases were included in the initial case "report" (i.e., officially reported within 1 week of diagnosis). The vast majority of cases were reported in the following six weeks.

Seasonal influenza-like illness in the United States, 2010-2019
We also consider publicly-available data on influenza-like illness (ILI) in the United States between 2010 and 2019 compiled as part of the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet). Both validation and officially-reported real-time data (i.e., data that were available up to d weeks after the initial case report) are available at https://github.com/ cmu-delphi/delphi-epidata using the Delphi API [15]. Validation data were defined as the case counts for each calendar week that were reported as of June 13th, 2021. In modeling of ILI cases, it is common to break up the data recorded by calendar year into flu seasons, where weeks are re-defined relative to the start of the season. For our analysis, we identified the start of each 35-week flu season as the 40th week in the calendar year as defined by the Morbidity and Mortality Weekly Report (i.e., MMWR week 40) produced by the US National Notifiable Diseases Surveillance System [16]. For example, the 2018-2019 season spans the end of 2018 and the beginning of 2019. In addition to the total number if ILI cases nationally, we also downloaded data stratified by state from 2010 through the 2018-2019 season. Real-time reporting data were available for the 2017-2018 and 2018-2019 seasons and also for the 2016-2017 season for some states. Florida was excluded from analysis due to the lack of real-time reporting data provided by the Delphi API. ILI cases occurring after the 2018-2019 season were not included in this analysis to avoid inclusion of COVID-19 diagnoses.
A mild amount of case reporting delay was evident for these data, where roughly 70-90% of eventually-reported cases were reported initially (Fig 1b). There was a clear seasonal trend in reporting at the national level, where more recent seasons had a higher proportion of validation cases reported initially. Unlike the dengue fever data, we did not obtain disease diagnosis and reporting dates for individual patients; rather, we considered the total number of reported cases for each week. Due to excluded providers or other reasons, it was possible for real-time case counts to be larger than the validation counts. Therefore, the discrepancy between the real-time reported and validation case numbers was due to a combination of under-reporting (i.e., not-yet-reported cases) and over-reporting (cases currently reported that were eventually excluded). This phenomenon was particularly evident in the 2017-2018 season data for Vermont, where the number of reported cases often decreased as the number of weeks since the initial report increased (Figs B-C in S1 Text).

Methods
In this section, we develop the notation and describe various methods for handling reporting delay in general. Table A in S1 Text provides a summary of the notation. Suppose we consider time series data of reported cases for an infectious disease of interest, where cases are reported across multiple calendar time-points t for each of multiple disease seasons s. For example, we may consider data on weekly case reports of seasonal influenza-like illness diagnoses, reported across several seasons. Throughout this paper, we will use "week" to refer to the calendar timepoint within each season and "year" to refer to the disease seasons, but these methods can be applied to more general time-scales. For a given calendar week t in a given year/season s, the first reported case counts are often inaccurate, and the final "validation" counts result from revisions at later dates. Let d denote the number of weeks since the first reported case counts for a given calendar week. The number of weeks since the week's first report will also be called the "lag" in reporting. Define N ts (d) to be the reported case counts for week t in season s reported with lag d, where N ts (0) corresponds to the very first report and N ts (1) corresponds to the final/validation case counts.
In this paper, we provide and compare strategies for forecasting future validation case counts using error-prone real-time reported case data, N ts (d). For example, we may want to obtain a forecast of validation cases at calendar time t = 5 for season s (denoted N 5s (1)) using the most recently-reported data for the previous 4 calendar weeks (N 1s (3), N 2s (2), N 3s (1), and N 4s (0)) along with case data from past seasons. The impact of and correction for reporting delay will naturally depend on how accurate the real-time case data is. Let π ts (d) denote expected proportion of validation cases for calendar week t in season s that are reported by lag d weeks as follows where N ts (d) and N ts (1) may be viewed as random variables with data realizations N ts (d) and N ts (1). A value of π ts (d) less than 1 indicates under-reporting and a value greater than 1 indicates over-reporting, relative to the validation counts. As d ! 1, we expect that π ts (d) ! 1 and N ts (d) ! N ts (1). The reporting factor for week t and season s at lag d is defined as 1 p ts ðdÞ . We emphasize that this reporting factor is a function of the lag week d, and we will assume there is some threshold τ such that π ts (d) = 1 for all d > τ. We will not know the reporting factors in practice, and additional work will be needed to estimate the degree of reporting error and to account for reporting delay in the analysis.
In the remainder of this section, we describe a general two-stage strategy for addressing reporting delay, including (stage 1) estimating reporting factors and (stage 2) forecast modeling, accounting for under-or over-reporting. First, we will describe several conceptual strategies for how to implement forecast modeling while addressing reporting delay given estimates of the reporting factors. Then, we will provide corresponding strategies for estimating reporting factors. These methods are summarized in Fig 2. We provide an extension of these methods for forecasting proportions rather than case counts in S1 Text Section 10.

Conceptual strategies for handling reporting delay
In this section, we provide several conceptual strategies for handling reporting delay. In implementing each of these methods, we assume that the forecasting model structure is fixed beforehand. Several of these methods require estimates of π ts (d). We discuss how to obtain these estimates in the next section.
Rescaling of real-time data. One general strategy for correcting reporting delay is to use the most recently-reported case counts, N ts (d), and assumptions about the relationship between real-time and validation data (i.e., assumptions about π ts (d)) to predict N ts (1) as follows: where π ts (d) is replaced with an estimate [6]. Rescaling strategies are often used to correct for general case under-or over-reporting in disease modeling literature, but a key difference from simple rescaling approaches is that the correction factors are functions of the lag d and possibly t, s, and additional covariates. Given a predictionN ts ð1Þ from Eq 2, we can then fit our forecasting model usingN ts ð1Þ in place of the real-time data. Modeling real-time data with mean model offset. The rescaling method involves preprocessing the real-time data to obtain predictions of the validation data for week t in season s. Then, the predictions are input into the forecast modeling pipeline. An alternative strategy involves specifying a forecasting model for the real-time data directly, accounting for the reporting delay in the model structure [3,9]. Usual implementations involve specifying a model for the incremental cases reported for a given calendar week across lag times d, conditional on the cases reported up to lag time d − 1 [9]. These incremental case models are linked together into a chain that ultimately can be used to predict future values for N ts (1). Letting n ts (d) = N ts (d) − N ts (d − 1) be the data realization of the random incremental counts n ts (d) reported on lag week d, we can model the random incremental counts using a negative binomial or Poisson regression model with the following mean structure: logðEðn ts ðdÞÞÞ ¼ logðEðN ts ð1ÞÞÞ þ logðy ts ðdÞÞ; ð3Þ where y ts (d) = π ts (d) − π ts (d − 1) corresponds to the expected proportion of eventually-reported cases that are reported on lag week d. This model has the same mean structure as a corresponding model for the validation case counts, N ts (1), but with an offset term. A key limitation of this formulation is that y ts (d) must be greater than zero; in other words, we can only apply this method when we have positive increments. In general, however, we may have overreporting of cases in addition to under-reporting. Using logic in S1 Text Section 3, we instead propose modeling the cumulative counts directly using a negative binomial or Poisson regression model with mean structure as follows: This mean structure formulation requires only that π ts (d) > 0 for all d > 0. Individual increments, however, may increase or decrease. In implementing mean model offset approaches like Eq 3, researchers often estimate π ts (d) and the disease model parameters jointly and specify a prior distribution for y ts (d) or π ts (d) (e.g., [3]). To make the estimation strategy more compatible with existing out-of-the-box software, however, we propose pre-estimating π ts (d) and then fitting the disease model treating log ðp ts ðdÞÞ as a fixed offset in the mean model. Alternative distributional assumptions for N ts (d) may also be considered (e.g., lognormal as in [17]), but this offset-based method is limited to settings where mean counts are modeled using some type of log link function. Compartmental susceptible-infectious-recovered models, for example, are not compatible with this method. Imputation of validation case counts. The rescaling method involves replacing the realtime data with a predicted value of N ts (1) prior to disease modeling. However, this approach does not capture the uncertainty in this prediction. Recasting this approach in a missing data framework, the rescaling method predicts N ts (1) using the expected value of N ts (1) given the observed data. It is well-known in the missing data literature, however, that so-called conditional mean imputation may result in under-coverage of model parameters [18]. Similarly, we could see under-coverage of resulting disease model forecasts.
To address this challenge, we can perform disease modeling based on draws or imputations of the validation data (e.g., [7,11]). In order to keep the handling of reporting delay separate from the disease modeling, we propose obtaining M multiple imputations/draws of the validation data for the current season up to the current calendar time. Fixing these imputations, we can then fit the forecasting model to each of the M imputed datasets and obtain M forecasts (e.g., 1-week forecasts) and corresponding standard errors. We then combine results across imputed datasets using Rubin's multiple imputation combining rules [18]. Additional details about this algorithm can be found in S1 Text Section 4.
Many different approaches can be used to define a distribution for imputing validation case counts given the available data. Using results from the actuarial literature as in S1 Text Section 3, we propose imputing validation case counts from the following truncated normal distribution: where π ts (d) is replaced with an estimate and where truncation limits l and u restrict imputed values to be greater than N ts (d) if π ts (d) < 1 and less than N ts (d) if π ts (d) > 1. A key property of this distribution is that the variance of the imputed validation data decreases as π ts (d) increases, meaning that imputed N ts (1) will be closer to N ts (d) when the expected amount of reporting error is small. The truncated normal distribution in Eq 5 will have expectation different than N ts ðdÞ p ts ðdÞ due to non-symmetric bounds. In Fig E in S1 Text, we demonstrate that the percent error in this expectation relative to N ts ðdÞ p ts ðdÞ is small, particularly in settings with a large amount of reporting error.
A primary limitation of the multiple imputation approach is that it requires the forecasting model to be fit multiple times. For many slow estimation methods, this may not be feasible. When the forecasting model involves Markov Chain Monte Carlo (MCMC) estimation (e.g., some Bayesian models), however, a simple alternative is to handle the missing data within the MCMC estimation algorithm, where a single imputed value for each missing N ts (1) is generated using Eq 5 within each MCMC iteration as in Hohle et al. (2014) [7]. Parameters are then drawn within each iteration, conditioning on the imputed validation data. The resulting posterior forecast distributions can then be used directly.
Exclusion of most recent data reports. In some cases, the most recently reported realtime data may be too error-prone to contribute meaningfully to forecasting given the information available to the forecaster. In this setting, we propose excluding the most recently reported data from modeling and forecasting.

Estimating reporting factors
For many of the methods described in Section 3.1, the challenge of handling reporting delay boils down to specification of a relationship between the real-time and validation data, i.e., π ts (d). Below, we describe several different strategies for estimating π ts (d) using either realtime case reports for past seasons or by leveraging external data to directly nowcast N ts (1).
Reporting factors as a function of lag only. A common assumption in the literature is that the reporting factors are constant in t and s. Using data from past seasons/weeks in which both real-time and validation data are available, we estimate the inverse reporting factors aŝ p ts ðdÞ ¼pðdÞ ¼ ; ð6Þ withpðdÞ ¼ 1 for all d greater than some fixed threshold, τ. An equivalent estimator has been been explored in detail elsewhere (e.g., [9,19]). More sophisticated estimation strategies using parametric assumptions for π(d) can also be used, as discussed in England and Verrall (2002) [9]. An alternative strategy for estimating π(d) would be to average N ts ðdÞ N ts ð1Þ across s and t. Although not shown, simulations demonstrated that this proportion averaging approach tended to produced unstablepðdÞ with larger variability.
Reporting factors based on regression modeling of real-time data. The estimator in Eq 6 assumes that the reporting factor depends on d only, but we can imagine settings where the reporting may vary across seasons (s) and/or within seasons (t). Additionally, we may have covariates X predictive of the difference between real-time and validation case reports (e.g., number of patients tested for the disease or disease-related Google search volumes). In this setting, we propose estimating π ts (d) indirectly by modeling the validation case counts as a function of available data, the lag d, the season s, the week t, and/or covariates X. For example, we may fit a Poisson regression model with a log link, adjusting for lag, season, week, and X and using the log of the available counts as an offset in the mean model. See S1 Text Section 6 for an example. Through this model, we can incorporate seasonality and inter-/intra-season trends into our estimation fo π ts (d). Let function f represent the estimated mapping between the real-time data and the validation data. We can estimate the inverse reporting factors aŝ wherep ts ðdÞ ¼ 1 for d > τ and whereN ts ð1Þ ¼ f ðN ts ðdÞ; XÞ. The above strategy for estimating π ts (d) does not require reporting information for individual people; rather, it uses aggregate case counts by week t and lag d. In many modern nowcasting applications, however, line-list data with dates of disease diagnosis and reporting for individual people are available (e.g., [4]). In this setting, we can view the time between disease diagnosis and reporting as a time-to-event outcome, perhaps incorporating censoring or truncation. Classical modeling strategies for event time outcomes can be applied to estimate a delay distribution as a function of person-level predictors, which can then be aggregated to obtain estimates of π ts (d) overall. Approaches for handling settings where baseline disease diagnosis or symptom onset dates are unknown for patients in the historical dataset have been explored elsewhere [13,14,20].
Reporting factors using "local" real-time data. The previously-described methods leverage past season data on the relationship between real-time and validation case counts to estimate reporting factors. However, these methods may perform poorly if reporting in the current season differs substantially from reporting in previous seasons in an unpredictable way (i.e., not easily predicted based on past-seasons' reporting trends). In this setting, we would like to use real-time data from the most recent weeks to estimate reporting factors that better capture current reporting practices. The challenge, however, is that validation data are not yet available for recent weeks.
We propose using the changes in available case counts across d to obtain a conservative estimate of π ts (d) as follows:p ts ðdÞ ¼ where the numerator contains past weeks' available data and the denominator contains the most recently-reported data for the previous K weeks. Define τ to be the lag value after which we assume true π ts (d) = 1 for all d > τ. For this estimator, we definep ts ðdÞ ¼ 1 for d > min(K, τ). K can take any positive integer value, but we recommend setting K = τ in practice.
The estimator in Eq 8 is conservative in that it is biased toward 1 relative to the true π ts (d) when reporting is such that (1) N ts (j) � N ts (j + 1) for j � 0 (monotone under-reporting) or (2) N ts (j) � N ts (j + 1) for j � 0 (monotone over-reporting). This bias comes from using the most recent case counts rather than the corresponding validation values in the denominator. Despite this bias, Eq 8 may provide a more accurate estimate of π ts (d) than the previouslydescribed approaches when the current weeks' reporting is very different from reporting in past seasons.
We could further generalize this local estimation strategy to incorporate seasonality, covariate associations, and/or residual autocorrelation to estimate π ts (d). We propose viewing historical observed values of π ts (d) and local estimated values from Eq 8 as themselves a time series that can be used to forecast π ts (d) for each fixed value of d. For example, we can construct a time series of observed/estimated values of π ts (0) for prior weeks, fit an ARMA (S1 Text Section 5) model to some function of π ts (0) (e.g., logit(π ts (0))) adjusting for any covariates or seasonality we want to account for, and obtain a one-week-ahead forecast to estimate unknown π ts (0) for the current week. In this way, we can simultaneously combine historical information on seasonality and other trends while leveraging information on reporting from the recent past. When applied to our real data examples, this approach did not yield improved performance over Eq 8, so we did not evaluate this approach in additional detail. However, we hypothesize that there may be other infectious disease settings where this generalization could provide additional flexibility and corresponding improvements in forecast performance. Kline et al. (2021) [21] provides a related approach for modeling increments in reported COVID-19 cases as a function of reporting lag and day of week with an autoregressive structure.
Reporting factors based on proxy shrinkage. The preceding methods assume some prior information is available about the relationship between the real-time and validation case counts. For many diseases however, historical data on validation case counts are readily available but the real-time reports for past seasons are either unavailable or incomplete. In this setting, we propose leveraging external data sources to estimate π ts (d).
Suppose we have historical data on the relationship between validation case counts N ts (1) and some external data, collectively denoted p ts , that are associated with N ts (1). Examples of external data include Google search volumes, social media mentions (e.g., from Twitter), and Wikipedia searches [22,23]. Using these historical data, we fit a nowcast model for validation case counts given p ts . This model could be complicated and include data from many different sources, incorporate penalization, etc. Let g denote the mapping between p ts and model-predicted validation case counts. These predictions may be viewed as an error-prone proxy for the current season validation cases.
Using this external data model proxy, we propose estimating the validation case counts as a weighted average of the proxy and the observed real-time data as followŝ and we estimate corresponding inverse reporting factors asp ts ðdÞ ¼ N ts ðdÞ N ts ð1Þ . For large d, we expect N ts (d) and N ts (1) to be very close. Therefore, we will define weights w d such that w d ! 0 as d ! 1 and impose w d = 0 for all d > τ. As the number of lag weeks increases, the shrinkage estimator will put more and more weight on the observed data. In a related method, with ω = 0.75. The choice of weight and performance of this method will depend on the degree of reporting delay and how well the proxy model g predicts the validation case counts. Sensitivity analysis. All of the above methods for estimating reporting factors require either historical real-time data or a good proxy for the validation case counts based on external data. In practice, we may not have either available. In this case, one approach for implementing the methods in Section 3.1 is to perform multiple parallel analyses assuming different plausible reporting mechanisms in a sensitivity analysis framework. This approach can be used to evaluate the robustness of forecasts to assumptions about the delay mechanism.
On drawing reporting factors. Thus far, we have proposed a two-stage analysis procedure where we first estimate reporting factors π ts (d) and then apply the methods in Fig 2a to obtain forecasts. However, this approach does not account for uncertainty in the estimated values for π ts (d). Jones et al. (2020) [24] and others have highlighted the potential for poor performance of rescaling or multiplier strategies as in Eq 2 when a single estimate of π ts (d) is used without accounting for uncertainty. In Bayesian forecast modeling, a common solution is to build reporting delay into the model structure and forecast assuming a prior distribution for π ts (d) rather than fixing its value (e.g., [3]). All of the methods discussed in Section 3.1 be similarly modified to incorporate uncertainty in estimated π ts (d). As shown by our simulation results, however, concerns raised about uncertainty propagation for rescaling methods can be substantially mitigated by implementing imputation strategies as in Eq 5.

Application to dengue fever and seasonal influenza forecasting
We considered data on dengue fever cases in Puerto Rico (1990Rico ( -2009) and for seasonal influenza-like illness (ILI) cases in the United States (2010-2011 through 2018-2019 seasons). Our goal was to evaluate how the quality of disease forecasts is impacted by the handling of reporting delay. With the exception of the mean model offset method, all of the methods in Fig 2 for handling reporting delay can be applied in conjunction with any forecasting model. For demonstration, we considered two reasonable forecasting model settings: an auto-regressive moving average (ARMA) model for log-cases (e.g., [25]) and a Bayesian Gaussian process-based model (hereafter, referred to as Inferno from Osthus (2021) [26]). Both models assumed a log link mean structure. Additional details about these forecast models are provided in S1 Text Section 5.

Obtaining weekly forecasts, accounting for reporting delay
For each week after the first two seasons, we constructed an artificial dataset consisting of the real-time and historical observations that would have been available to use for forecasting. First assuming that reporting does not vary over t and s, we estimated reporting factorspðdÞ from Eq 6 using the previous 2 years' data. To allow estimated reporting factors to vary by t and s, we also estimatedp ts ðdÞ indirectly as in Eq 7 by fitting a Poisson regression model with a log link to the prior observed validation data, including d, s, and a 3-degree natural spline of t as predictors and treating the available data log(N ts (d)) as an offset. Additional details and diagnostics can be found in S1 Text Section 6.1. Finally, we estimate "local" π ts (d) using Eq 8 and the previous K weeks' real-time reported data (using K = 15 for ILI and K = 6 for dengue fever). As an additional sensitivity analysis, we also estimate "local" π ts (d) across a range of K values between 2 and 100 for each dataset.
Fixing these reporting factors and using either an ARMA (2,2) or Inferno forecasting model structure, we then fit the forecasting model using each of the rescaling, model offset, and imputation methods in Section 3.1. For ARMA, the imputation method was implemented using 10 imputed datasets and Eq 5. For Inferno, imputation using Eq 5 was implemented within a Bayesian MCMC estimation algorithm. We also fit the forecasting model to the observed realtime data without correction and after excluding the most recent 1-3 weeks of data. To benchmark the performance of these methods, we also fit the forecasting model using the validation data. Inferno model estimation was based on 2500 MCMC iterations after a burn-in of 2500 iterations. Results are also provided for an ensemble based on an equal weight linear combination of forecasts from the 13 methods (excluding validation data analysis).
For each forecast fit, we obtained a prediction for the validation case counts for the current week (nowcast) and for 1 and 4 weeks ahead (forecasts). We also calculated corresponding 50%, 67%, 95%, and 99% prediction intervals. For the Inferno model, ensemble forecasts and corresponding prediction intervals were obtained using the stacked draws across methods. For the ARMA model, we obtained 2500 draws of predicted log-case counts using the Gaussian prediction distribution for each of the methods. Ensemble forecasts and prediction intervals for the validation case counts were then obtained using these stacked draws. We repeated this process for each calendar week of available data. We then summarized the predictive performance using the metrics described in Section 4.2, aggregating across all calendar weeks and seasons for each dataset and forecast model structure separately.

Performance metrics
We evaluated the performance of the various methods based on properties of nowcasts (i.e., predicted validation values for the current week) and 1 and 4 week forecasts (i.e., predicted validation case counts for future weeks). Absolute prediction error for a given nowcast/forecast was defined as the absolute difference between the model estimate and the validation case counts. For assessing the accuracy of the point and interval estimates jointly, we calculated the weighted interval score from   [27] as follows: where K is the number of intervals (in our case, 11), ρ = (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02), u j and l j are the upper and lower confidence/credible interval limits corresponding to level 1 − ρ j , andf ts is the median of the nowcast/forecast distribution. These values of ρ were chosen to match the quantiles used by the US COVID Forecast Hub (https:// covid19forecasthub.org/). We evaluated the WIS using the validation values. We also calculated the empirical coverage of the 95% prediction intervals using the forecasts and validation case counts across all calendar weeks.

Results
Prediction error and coverage. Fig 3 presents the average absolute nowcast/forecast errors for the dengue fever and US national ILI outcomes along with corresponding coverages of 95% prediction intervals. Weighted interval scores are presented in Fig G in S1 Text. For both datasets and forecast model structures, analysis of the observed data without correction resulted in large undercoverage (2-60%) and higher prediction error relative to analysis of the validation data. Compared to uncorrected analysis of the observed data, the exclusion method (1-3 weeks) resulted in improved coverage and similar or reduced prediction error across all endpoints for the dengue fever data. In contrast, the exclusion method resulted in similar or increased prediction error in the national US ILI data.
The rescaling, offset, and imputation methods uniformly resulted in higher coverage and similar or reduced prediction error relative to uncorrected observed data analysis for the dengue fever and US national ILI data. For the dengue fever data, the imputation methods tended to give better coverage than the rescaling methods (e.g., around 65% vs nearly 90%). In general, the rescaling method can be unstable (i.e., produce high error) in the dengue fever data, particularly for the Inferno forecast model (Fig H in S1 Text). As described in S1 Text Section 7, this is due to a combination of very high under-reporting and a low disease rate, which results in seasons. The ensemble method corresponds to an equal-weight linear combination of all methods except validation data analysis. "Model" indicates that reporting factors were estimated via regression and allowed to vary by t and s. "Lag" indicates that reporting factors were estimated via Eq 6. "Local" indicates that reporting factors were estimated via Eq 8. Relative absolute biases are calculated relative to the largest value in each column.
https://doi.org/10.1371/journal.pcbi.1010115.g003 rescaled validation estimates that are very sensitive to small fluctuations in the observed case counts. Both the mean model offset method and the imputation method resulted in stabler nowcasts/forecasts in this setting.
The equal weight ensemble method attempts to produce more stable forecasts than any individual method by borrowing information across many methods. For both the Inferno dengue fever model and both ARMA and Inferno national influenza models, the ensemble estimation resulted in similar or better coverage than any of the individual methods, particularly for nowcasts. Resulting nowcast and forecast errors were generally between errors from the exclusion and rescaling/offset/imputation methods for the ILI data and similar or smaller than errors for other methods for the dengue fever data. Across both datasets, there is little evidence of improved forecast/nowcast performance for model-based vs. lag-based estimation of reporting factors. However, we observed lower forecast biases when reporting factors were estimated using local reporting data. We did not formally assess the performance of proxy-based reporting factor estimation for these data.
In Figs I-J in S1 Text, we present the relative accuracy of several methods across individual weeks during follow-up. We find that the relative performance of uncorrected data analysis tended to improve for the dengue fever data when validation case counts were very small, suggesting that the correction methods may over-correct in this setting. Similarly, uncorrected analysis of the US national ILI data gave improved relative performance shortly after the season peak for seasons with sharp decreases in cases, suggesting that the correction methods may be less able to adapt to these rapid decreases in validation case counts. Fig M in S1 Text provides bias and weighted interval scores across methods, focusing on weeks near the season peaks and minima.
Comparative rankings. We then calculated the proportion of weeks across seasons in which each of 7 methods performed best in terms of weighted interval scores for 1 week forecasts. For this analysis, we focused on local estimation of π ts (d) as it generally resulted in similar or better performance than the same methods using lag-based or model-based estimation.
If all methods performed equally, we would expect each method to be the best performer about 14% of the time (95% confidence interval of [12%,16%] for dengue fever and state-level ILI and [10%,18%] for national ILI, assuming independent trials). As shown in Fig 4, however, we saw that corrected analysis tended to outperform uncorrected analysis of the observed data for dengue fever and US national ILI data. For these data sources, these results suggest that we are almost always better off performing some kind of correction strategy to handle the reporting delay. Among the correction methods, the imputation and offset methods tended to have similar or better performance than the other methods in terms of 1-week forecast interval score rankings for the national influenza-like illness datasets. In contrast, the exclusion and rescaling methods had the best performance in the dengue fever data.
Uncorrected analysis results for state-level ILI tell a different story than seen for national US ILI and dengue fever, where analysis of the observed state ILI data without correction had the best weighted interval score performance at least 17%+ of the time. As shown in Figs K-L in S1 Text, the relative performance of the delay correction methods varied between states, but the imputation-based correction method tended to have the best performance on average. For the Inferno model, uncorrected analysis outperformed all correction methods on average (lowest forecast weighted interval score 25% of the time).
Unlike in the dengue fever and national ILI data, the reporting delay mechanism sometimes changed substantially between seasons for individual states (Figs B and C). Although not shown, this resulted in reduced forecast performance for the strategies that relied on past season reporting data to correct delay in the current season. As shown in Fig N in S1 Text, the local π ts (d) estimation method produced much better estimates of the inverse reporting factors for the current season. This serves as a cautionary note against using lag-based and modelbased π ts (d) estimation methods when past season information on reporting practices is poorly representative of the current season.
For all datasets, we observed better comparative performance for naive observed data analysis with the Inferno model than with the ARMA model. This was due to (1) a tendency for the observed data Inferno model to perform particularly well relative to the other methods in weeks where the validation case counts were very low (Figs I, J, and O in S1 Text) and (2) poor ARMA forecast performance across the board when analysis was based on real-time observed data (Fig P in S1 Text). These results suggest that ARMA forecast models may tend to perform poorly when applied to real-time data subject to a large amount of reporting error.
Weeks K used for local π ts (d) estimation. In the analyses presented in Fig 3, we apply the local estimation strategy in Eq 8 with K set equal to τ, the number of weeks beyond which we assume π ts (d) = 1. However, any positive integer value for K could theoretically have been chosen. In Figs Q-S in S1 Text, we present the π ts (0) estimates and corresponding nowcast/forecast performance obtained as a function of K for the dengue fever, national US ILI, and Vermont ILI datasets. We demonstrate that the estimate in Eq 8 looks more and more like the simple lag-based estimate using Eq 6 as K becomes large. The best nowcast and forecast performance was generally seen for K near τ, with larger K resulting in particularly large increases in prediction errors for the Vermont data.

Performance of methods for simulated real-time data
The analyses in Section 4 provide intuition on the performance of the various reporting delay correction methods in three real data settings. A key limitation of several of the methods explored in Section 4 is that they require high-quality data on real-time case reports from past seasons along with corresponding assumptions about how reporting varies across and within seasons. Since reporting delay for the current season is not well-understood at the time of forecasting in practice, it is important to evaluate the impact of these assumptions and violations on method performance. To further develop our intuition for when these methods will and will not perform well, we conducted a simulation study.

Simulation set-up
We simulated validation disease counts to mimic observed disease rates of dengue fever in Puerto Rico between 1990 and 2009 as follows: 1. Let β ts be the three-week moving average of the validation case counts in the actual dengue fever data, and let τ t be the average of N ts (1)/β ts across all seasons. Define θ ts = β ts � τ t to be the mean disease rate in the simulated data.
2. We generated 200 simulated versions of the validation data for each t and s by drawing from a negative binomial(r, p) distribution with p ¼ r rþy ts and r = 100. The hyperparameter r was chosen to mimic the variability in actual dengue fever validation data. For each set of simulated validation data, we also simulated 4 external proxy variables such that p ts = 2 log(N ts (1) + 0.1) + e ts , where e ts � N(0, σ 2 ) and where σ 2 took values in (0.01, 1,4,16). These error rates corresponded to correlations between transformed p ts and N ts (1) of 0.99, 0.80, 0.50, and 0.13.
In order to clarify general properties of the various methods, we also repeated the above simulation procedure with data simulated to mimic observed disease rates of national US influenza-like illness for the 2010-2011 to the 2018-2019 seasons. Detailed results for ILI are presented in S1 Text Section 9, and we focus on the results for dengue fever-like data in the main paper.

Performance under Scenarios 1-4 in the simulated dengue fever 2009 season
In this simulation study, our goal was to compare the performance of each of the reporting delay correction methods when the reporting factor estimates were correctly specified (Scenario 1 and model-based π ts (d) for Scenario 2) and when they were mis-specified (lag-based π ts (d) for Scenario 2 and Scenarios 3-4). For this exploration, we focused our attention on the first 100 simulated datasets corresponding to 2009.
For each calendar week in 2009 and simulated datasets under Scenarios 1-4, we applied the procedure described in Section 4.1 to construct an artificial real-time dataset that would have been available for forecasting. For each artificial dataset, we applied the methods in Fig 2  exactly as we did in the actual dengue fever data, but this time we repeated this analysis across 100 simulated versions of the data. We obtained nowcasts, forecasts, and corresponding 95% prediction intervals, and we summarized these nowcasts/forecasts using the performance metrics in Section 4.2. We then aggregated these metrics across the 100 simulated datasets and 50 calendar weeks in 2009 for each of the simulation scenarios. Reporting factors constant (Scenario 1). All reporting correction strategies resulted in lower prediction error and higher coverage relative to analysis of the observed data without correction (e.g., coverages < 10% for uncorrected analysis compared to coverages over 90%). We also see improvements over observed data analysis in terms of weighted interval scores for all reporting correction methods. The exclusion methods produced higher nowcast and forecast errors on average relative to the other correction methods. This indicates that the most recently reported cases can contribute useful information for nowcasting/forecasting even in the presence of extreme under-reporting as long as the reporting factors are correctly specified. Imputation methods resulted in higher nowcast and forecast coverage than the rescaling and model offset methods, although this coverage was overly conservative (greater than 95). Model-based estimation of π ts (d) resulted in a small increase in prediction error relative to lagbased estimation. More flexible estimation of π ts (d), therefore, may come with some small price in terms of prediction error when reporting is truly constant across t and s. Local estimation of π ts (d) resulted in small additional increases in prediction error and weighted interval scores.
Reporting factors vary by week (Scenario 2). When reporting factors truly varied within each season, correction methods using model-based estimates of π ts (d) outperformed use of lag-based estimates, which incorrectly assumed reporting did not vary over s or t. However, the difference in performance was modest, even in this setting with extreme intra-season variation in reporting. The local π ts (d) estimation method performed well in this setting, with higher coverages than the model-based method but also with slightly higher prediction error and weighted interval scores. As before, the exclusion method resulted in higher nowcast error than the other correction methods, but the exclusion method did provide better coverages than the rescaling, offset, and imputation methods.
Large reporting factor change between seasons (Scenarios 3-4). When reporting for the current season was very different than the previous seasons, reporting factor estimation using past season data (lag-based and model-based) resulted in poor estimates of the current season under-reporting. When the current season under-reporting was much better than previous seasons, use of past-season reporting factors resulted in over-correction and over-estimation of the validation case counts, and we saw huge prediction errors and low coverage (e.g., 10-60%) resulting from very poorly specified estimates of π ts (d) for the rescaling, offset, and imputation methods. When the current season under-reporting was much worse than previous seasons, use of past-season reporting factors resulted in under-correction and under-estimation of validation case counts, but resulting prediction errors were still less than those from analysis of  0.04, 0.54, 0.84, 0.0.94, 0.97, 0.99, 1). The ensemble method corresponds to an equal-weight linear combination of all methods except validation data analysis and exclusions of 4 and 5 weeks' data. "Model" indicates that reporting factors were estimated via regression and allowed to vary by t and s. "Lag" indicates that reporting factors were estimated via Eq 6. "Local" indicates that reporting factors were estimated via Eq 8. Relative absolute biases are calculated relative to the largest value in each column. Model-based π ts (d) estimation assumes reporting factors vary across weeks but incorrectly models how reporting factors vary across weeks.
https://doi.org/10.1371/journal.pcbi.1010115.g005 uncorrected observed data. Combined, these results indicate that we may be better off underestimating the degree of under-reporting than severely over-estimating it.
In contrast to the model-based and lag-based methods for estimating π ts (d), the local estimation method performed well with much higher coverages (e.g., < 20% vs 95% for 1 week forecasts) and much lower prediction errors for both Scenarios 3 and 4. The exclusion methods also had good performance. To summarize, the best performing methods were those that did not rely on past season data for handling of under-reporting.

Performance as a function of the reporting rate (Scenario 5)
Previous simulations demonstrate that all methods can perform well in ideal settings when reporting factors are correctly specified. However, the methods relying on reporting factor estimates can run into trouble when the reporting factors are poorly specified. We now suppose that we do not have good historical data from which to estimate these reporting factors. This may be the case, for example, if we are forecasting a newly-emerging disease. We want to evaluate how other methods (e.g., the exclusion method and the rescaling method with π ts (d) estimated using proxy shrinkage) perform for data with different amounts of under-reporting.
We considered simulation Scenario 5, where the true reporting mechanism for each season was varied between a = 0.05 (strong under-reporting) and a = 1 (no under-reporting). Reporting correction, estimation, and forecasting proceeded as before. Performance diagnostics were aggregated across weeks within each season and stratified by the true value of a. Each result, therefore, represents an aggregation across 50 weeks in each year and across 10 simulated datasets, resulting in 500 repeated comparisons.
Exclusion method performance as a function of a. In Fig X in S1 Text, we provide prediction errors and weighted interval scores for 1 week forecasts in 2006-2009 based on the uncorrected observed data, the validation data, and results from the exclusion method after excluding between 1 and 5 weeks of recent data. Results are shown as a function of true a. For the exclusion method, there is a trade-off: excluding recent data may avoid some bias due to inclusion of error-prone real-time data but will lose out on the information in the most recent data, which may also negatively impact forecast efforts. For roughly a > 0.50, the forecast error due to throwing out information outweighed the error due to including the real-time data, and prediction error for the exclusion method was higher than uncorrected analysis. For roughly a < 0.50, exclusion tended to outperform uncorrected analysis of the observed data, particularly in terms of weighted interval scores. These results may help guide decisions for whether or not to throw out the recent data as a function of the plausible range for a.
Performance of proxy shrinkage reporting factors as a function of a. Fig Y in S1 Text provides a similar exploration for the rescaling method with π ts (d) estimated using proxy shrinkage based on proxies of varying quality (i.e., correlation with N ts (1)). The proxy shrinkage method was implemented defining where ω took values 0.5 or 1. When the correlation between the proxy and the validation data was very small (e.g., <0.15), proxy shrinkage served to increase prediction error relative to uncorrected observed data analysis. It also increased weighted interval scores relative to uncorrected analysis unless the degree of underreporting was very high (e.g., a < 0.3). When the proxy correlation was high (e.g., 0.80), rescaling using proxy shrinkage reduced prediction error relative to uncorrected analysis and reduced weighted interval scores when the degree of under-reporting was moderate to high (e.g., a < 0.7). Any advantage of incorporating the proxy to address reporting delay depends on (1) the amount of reporting error and (2) the quality of the proxy. As a conservative ruleof-thumb, we suggest applying the proxy shrinkage method only if a < 0.5 and the proxy is of sufficient quality (e.g., correlation > 0.50).

Sensitivity analysis with rescaling method.
In settings where the degree of reporting error may be comparatively modest but we still want to evaluate the impact of under-reporting on forecasts, a sensitivity analysis approach may be appealing. In performing a sensitivity analysis, we calculate performance metrics across different assumed values for the reporting factors. In Fig 6, we performed such an analysis across the simulated datasets and across different true and assumed values for a to demonstrate how coverage in 2008 and 2009 forecasts varied as a function of assumed a. In practice, this exploration would be conducted a single time on the observed data, where the true reporting mechanism would be unknown. Unsurprisingly, we obtained the highest coverages when the value of a was correctly specified. Coverages were fairly robust to mispecification of a when the true and working values of a were both fairly high. When the true amount of under-reporting was very strong (e.g., a< 0.10), however, even mildly misspecified a resulted in a large loss of coverage.

Discussion
Delayed reporting of infectious disease cases presents a challenge to forecasting efforts, and no general recommendations for addressing reporting delay in the forecasting setting currently exist. In this work, we synthesize many existing strategies for handling reporting delay into a single unified framework and propose a two-stage estimation procedure that can be applied to forecast modeling in general. In the first stage of the proposed estimation procedure, we leverage either (a) historical data on the accuracy of real-time reporting or (b) external data correlated with disease rates (e.g., social media or Google trends) to estimate the amount of disease under-or over-reporting. In the second stage, we describe how to implement forecast modeling, accounting for the over-or under-reporting using existing methods in the literature (sometimes, with modification). We applied this methodology to address reporting delay in data on influenza-like illness diagnoses in the US in 2010-2019 and data on dengue fever cases in Puerto Rico between 1990 and 2009 [3]. This analysis demonstrates the potential for improving forecast performance by accounting for delayed case reporting. However, analysis of state-level influenza-like illness data highlights the need for good estimates of the current season reporting mechanisms for rescaling, mean model offset, and imputation methods; when reporting mechanisms are very poorly specified, these correction methods can sometimes do more harm than good.
The comparative performance of the various methods was further evaluated through a simulation study. In ideal settings with correctly specified reporting mechanisms, the rescaling, mean model offset, and imputation methods resulted in reduced forecast errors relative to exclusion of the most error-prone recently reported data. This indicates that, even in the setting with very high under-reporting, the observed data can contribute useful information for forecasting when the reporting mechanism is well-understood. Caution is needed when applying the rescaling method in the setting where disease rates and anticipated inverse reporting factors are low, since this method can be sensitive to small fluctuations in the observed case data in this setting. The mean model offset and imputation methods provide additional stability in this setting and tended to result in better forecast coverage than rescaling in general.
In implementing the rescaling, mean model offset, and imputation methods in this paper, we used a fixed estimate of π ts (d). However, this strategy does not account for our uncertainty in the estimate of π ts (d). One reasonable approach for addressing this issue is to view π ts (d) as an estimated parameter with a corresponding distribution (e.g., normal with meanp ts ðdÞ and variance based on estimate uncertainty) and apply the imputation correction method using separate draws of π ts (d) for generating each imputation. We did not implement this approach in this paper, but we hypothesize that this approach could be useful in settings where there is a lot of unexplained variability in past reporting data.
Several strategies were proposed for estimating π ts (d). When past season data were available and reporting varied predictably across weeks and seasons, the lag-based and model-based estimation strategies in Eqs 6 and 7 performed well. In settings where the current season's reporting practices differed dramatically from past seasons, however, these methods failed to capture current season reporting dynamics. To address this issue, we proposed an estimator that uses the "local" recent real-time data from the past few weeks to estimate reporting factors for the current week, providing more accurate estimates.
In practice, historical data on reporting delay may be unavailable or incomplete and reporting may be very poorly understood. In this setting, the strategy of excluding the most recent data from forecast modeling produced more reliable forecast performance than methods that used fixed estimates of π ts (d) based on real-time data. However, this exclusion strategy involves a trade-off between loss of information in recently reported data and protection against bias caused by reporting delay. Our simulations suggest that the exclusion method is best applied when the reporting error is high (e.g., when the initial case reports are less than 50% of the validation case counts).
In addition to excluding the most recent data, an alternative strategy is to leverage external data to estimate the amount of under-reporting indirectly by independently predicting the validation case counts. When these predictions are good enough, the comparison between these predicted validation case counts (here, called proxies) and the observed cases provides insight into reporting error. As with the exclusion method, this proxy approach features a trade-off between providing information for estimating reporting errors and adding more noise into the estimation. Through simulation, we evaluated how the performance of the proxy-based estimation strategy varied as a function of the amount of reporting error and the quality of the proxy (i.e., the correlation with validation case counts). We found that even poor proxies (e.g., correlation 0.13) contributed some information to under-reporting correction when the amount of under-reporting was very high. As the amount of under-reporting diminished, better and better proxies were needed to improve on uncorrected analysis, particularly in the setting of ILI forecasting.
Synthesizing the insights from our data analysis and simulations, we produced some general recommendations for the handling of reporting delay in Fig 2b. These recommendations provide guidance for handling delay in disease case reporting and may serve as a useful resource to inform practical infectious disease forecasting efforts. Overall, our results clearly demonstrate potential benefits of accounting for reporting delay to improve forecast performance. However, the best-performing methods require either high-quality data on historical reporting mechanisms or strongly predictive proxies of validation case counts. Historical reporting data are often not recorded or not publicly available for many commonly-studied diseases. This motivates additional thinking about disease reporting infrastructure, particularly as new initiatives are underway to expand reporting and forecasting capabilities at the US national level [28].
This work focuses on the problem of handling reporting delay and defines the "true" case rates we want to predict as the eventually-reported validation case rates. In reality, many diseases may be poorly or incompletely captured by official reporting mechanisms [29], as plainly demonstrated by the COVID-19 pandemic. Given estimates of the amount of reporting error relative to the unobserved "true" case counts, the rescaling, mean model offset, and imputation methods discussed in this paper can be applied. The challenge is then to estimate the amount of reporting error, a discussion of which is beyond the scope of this work. Alternatively, this problem can be handled in a sensitivity analysis framework, where we can compare the robustness of our forecasts across multiple plausible assumptions about the amount of reporting error relative to the "true" case counts. Additional work is needed to leverage multiple data streams to inform estimation of reporting error when the target is defined as the neverobserved "true" case counts.

S1 Text. Supporting information 1: Figures and Tables. Fig A: Comparing initially reported and validation case counts for Puerto Rico dengue fever and US national influenza-like illness.
In all plots, the highest (black) curve corresponds to validation case counts. The lowest (red) curve corresponds to the cases initially reported (lag = 0). Results from the current season may be viewed as the "truth" here, but data analysis and forecasting is conducted using estimates from the previous two seasons (except for local lag-based estimation). Table A Fig M: Forecast bias and weighted interval scores for 3 weeks just before and just after season peak and for 3 weeks surrounding the season minimum/valley. Results based on aggregating nowcast and forecast performance for 3 weeks just before and 3 weeks just after each season's peak. Results also provided after aggregating 4 weeks before and after season minimum/valley. For ILI, the season valley corresponds to the minimum case counts within the first 35 weeks of the flu season. Fig N: Estimated π ts (0) obtained from different lag estimation methods for state-level ILI in the 2018-2019 flu season. Data were downloaded on June 13th, 2021. The local lag method used data from the previous 15 weeks to estimate the inverse reporting factors. The standard lag method used data from the previous 2 seasons to estimate the inverse reporting factors. Observed inverse reporting factors for each week are also plotted. Fig O: Relative accuracy (1/absolute prediction error, scaled across methods) of 1 week ahead forecasts for dengue fever using real-time or validation data for modeling. Results based on absolute prediction error for 5 week rolling window centered at plotted week. Results for 50 weeks per season are shown. The black line represents observed validation case counts for each week. Fig P: Dengue fever data nowcasts and forecasts from ARMA and Inferno modeling of observed data, compared to validation and initial case reports. Fig Q: Impact of K on estimated π ts (0) using local estimation in dengue fever and national US ILI data (ARMA model) Local estimates for π ts (d) are obtained using Eq 8. Fig R: Impact of local estimation K on nowcast/forecast errors in dengue fever and national US ILI data using rescaling method (ARMA model). Results based on aggregation across 50 weeks (dengue fever) or 35 weeks (influenza) across all seasons. Local estimates for π ts (d) are obtained using Eq 8. in dengue fever simulated data (ARMA models, . Results correspond to the 2009 simulated data and are aggregated across 100 simulated datasets and 50 weeks. Relative weighted interval scores (WIS) are calculated relative to the largest value in each column. Fig X: Forecast performance of exclusion method by proportion of initially-reported cases in dengue dever simulated data (ARMA models, Scenario 5). Results aggregated across 50 weeks and 10 simulation replicates for each season/proportion. Fig Y: Forecast performance of rescaling method with reporting factors based on proxy shrinkage by proxy quality and proportion of initially-reported cases in dengue fever simulated data (ARMA models, Scenario 5). Results aggregated across 50 weeks and 10 simulation replicates for each season/proportion. Fig Z: Average nowcast and forecast biases and coverage of 95% confidence intervals for US ILI simulated data (ARMA models, . Results correspond to the 2018 flu season simulated data and are aggregated across 100 simulated datasets and 35 weeks. Relative biases are calculated relative to the largest value in each column. Fig AA: Median forecast weighted interval score for US ILI simulated data (ARMA models, . Results correspond to the 2018 flu season simulated data and are aggregated across 100 simulated datasets and 35 weeks. Relative weighted interval scores (WIS) are calculated relative to the largest value in each column.