Validation of Statistical Models for Estimating Hospitalization Associated with Influenza and Other Respiratory Viruses

Background: Reliable estimates of the disease burden associated with respiratory viruses are key to the deployment of preventive strategies such as vaccination and to resource allocation. Such estimates are particularly needed in tropical and subtropical regions where some methods commonly used in temperate regions are not applicable. While a number of alternative approaches to assess the influenza associated disease burden have been recently reported, none of these models have been validated with virologically confirmed data. Even fewer methods have been developed for other common respiratory viruses such as respiratory syncytial virus (RSV), parainfluenza and adenovirus.
Methods and Findings: We had recently conducted a prospective population-based study of virologically confirmed hospitalization for acute respiratory illnesses in persons <18 years residing in Hong Kong Island. Here we used this dataset to validate two commonly used models for estimation of influenza disease burden, namely the rate difference model and the Poisson regression model, and also explored the applicability of these models to estimating the disease burden of other respiratory viruses. The Poisson regression models with different link functions all yielded estimates well correlated with the virologically confirmed influenza associated hospitalization, especially in children older than two years. The disease burden estimates for RSV, parainfluenza and adenovirus were less reliable, with wide confidence intervals. The rate difference model was not applicable to RSV, parainfluenza and adenovirus and grossly underestimated the true burden of influenza associated hospitalization.
Conclusion: The Poisson regression model generally produced satisfactory estimates of the disease burden of respiratory viruses in a subtropical region such as Hong Kong.


Introduction
Respiratory viruses have been associated with substantial disease burden in relation to hospitalization and mortality. Reliable estimates of such disease burden are important to determine the costs and benefits associated with prevention and control strategies. Since various respiratory pathogens cannot be distinguished from each other on clinical features, and as the majority of respiratory disease is not investigated virologically, the disease burden is usually estimated by statistical models. These models have so far largely been applied to influenza and, to a more limited extent, to respiratory syncytial virus (RSV). The Centers for Disease Control and Prevention of the United States (US CDC) has applied a Serfling cyclical regression model to estimate the baseline mortality by incorporating the long term and seasonal trends of pneumonia and influenza (P&I) mortality [1]. This model may not be applicable to subtropical and tropical regions where seasonal trends of mortality and of influenza and other respiratory virus activity are relatively unpredictable. Even in temperate regions with well isolated winter peaks of influenza and RSV, concerns have been raised about the potential confounding effect of the co-circulation of other respiratory viruses on Serfling model derived estimates of influenza disease burden [2].
Two other approaches have been used to address these concerns and also to allow disease burden estimates to be made in subtropical and tropical regions with more variable and diffuse influenza virus activity than in temperate regions. One is the rate difference model (also known as the excess rate model), which defines the periods of influenza epidemics and baseline periods (based on a minimal threshold of laboratory defined virus activity) and then calculates rate differences in influenza morbidity by comparing epidemic and baseline periods [3,4]. This model may allow some correction for confounding effects of other viruses given the well separated peaks for different respiratory viruses. Another model is the Poisson regression model [5,6], which can adjust for confounding of seasonal trends of disease outcomes, co-circulation of other respiratory pathogens and other potential confounding factors. These two methods have been applied in temperate, subtropical and tropical settings with varying degrees of success, demonstrating substantial influenza associated mortality and hospitalization that is comparable among different geographic regions such as the US, Hong Kong and Singapore [3,4,6-9]. Similar approaches of rate difference and Poisson regression models have been applied to estimate the disease burden associated with RSV [10-13], but few have been developed for parainfluenza and adenovirus, which are less predominant causes of hospitalization than influenza and RSV.
A recent study by Thompson et al. [14] compared the performance of the Serfling cyclical regression, autoregressive integrated moving average (ARIMA) model, rate difference model and Serfling-Poisson regression model in estimating the excess mortality rates associated with influenza in the US. Both Serfling regression and ARIMA models may not be applicable to the subtropical and tropical regions as they need well-separated non-epidemic periods to define baseline levels of mortality. The latter two methods require virological surveillance data from the relevant region to be included in the analysis. The authors concluded that the Poisson regression models permit estimation of influenza associated deaths but require robust virological data, while the simple rate difference models may be useful in regions with sparse viral surveillance data or complex influenza seasonality [14]. However, none of the estimates from these models have been validated using a time series of virologically confirmed cases. One recent Canadian study [15] compared a convenience sample of virologically confirmed cases of RSV and influenza in children less than two years of age to validate estimates of hospitalization rates derived from statistical methods. However, their virologically confirmed cases were collected from one single hospital which served only 10% of the population of Quebec province, and it may be difficult to confidently extrapolate these data to the population denominator. Furthermore, their conclusion may not be applicable to older children or to tropical regions, which have more complex influenza seasonality patterns than temperate regions. There is thus a need for studies that validate such statistical estimates of disease burden against directly observed and virologically confirmed outcomes.
In this study we used the hospitalization data of virologically confirmed cases to validate two of the above-mentioned methods: rate difference and Poisson regression models, for estimating excess hospitalization associated with each respiratory virus.

Ethics Statement
This study has been approved by the Ethics Committee of Li Ka Shing Faculty of Medicine, the University of Hong Kong (EC1880-02). Consent was not required for this study as no personal information was collected from subjects.
From October 2003 to September 2006, the study subjects were recruited from the only two public hospitals located on Hong Kong Island: Pamela Youde Nethersole Eastern Hospital (PYNEH) and Queen Mary Hospital (QMH) [16,17]. All the pediatric patients who were admitted into these two hospitals for acute respiratory diseases (ARD) on one sampling day of each week were tested for infection with respiratory viruses by immunofluorescence and culture at the microbiology laboratory of QMH. In this study the weekly number of age-specific hospital admissions with laboratory confirmed virus infection was divided by its population denominator to obtain the "directly observed rate". This rate was regarded as a true virus associated hospitalization rate for acute respiratory disease in our study population against which currently used models can be validated. Specifically, anonymous data on children aged <18 years with a discharge diagnosis of ARD, 460-466 or 480-487 (International Classification of Diseases, 9th Revision, Clinical Modification) (ICDCM9) from the two public hospitals were obtained from the computerized database of the Hong Kong Hospital Authority with permission. Data for each record included ICDCM9 codes for up to 4 discharge diagnoses in addition to the ARD diagnosis, age, gender, dates of admission and discharge and disposition (alive or dead).

Rate difference model
As previously described, the period of influenza predominance was defined as a period of two or more consecutive weeks in which the weekly numbers were greater than or equal to 4% of the annual number of virologically confirmed influenza A and B diagnoses and less than 2% of the annual number of RSV diagnoses [3]. For comparison, periods of at least two consecutive weeks in which the numbers of both RSV and influenza virus diagnoses were less than 2% of their annual totals were defined as periods of baseline activity for both viruses. To calculate ARD hospitalization attributable to influenza, we compared mean hospitalization rates during the periods of influenza predominance with those during the baseline periods. The estimate for the whole of Hong Kong Island was made by multiplying the number of ARD hospitalizations in the two hospitals by the reciprocal of the proportion of pediatric patients served (i.e. 1/0.725). As RSV, parainfluenza and adenoviruses have less definable seasonality and their peaks usually overlapped with those of influenza (Figure S1), we could not find a satisfactory definition for their baseline and predominance periods, and therefore did not apply the rate difference model to these viruses.
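The period definitions and the rate comparison described above can be sketched in a few lines. This is a minimal illustration only, assuming one year of weekly counts held in plain lists; the function names (`label_weeks`, `keep_runs`, `rate_difference`), the `per` scaling of 10,000 and the toy inputs are hypothetical and are not taken from the study.

```python
def label_weeks(flu, rsv):
    """Label each week 'flu' (influenza predominance), 'base' (baseline) or None."""
    flu_total, rsv_total = sum(flu), sum(rsv)
    labels = []
    for f, r in zip(flu, rsv):
        rsv_low = r < 0.02 * rsv_total          # RSV below 2% of its annual total
        if f >= 0.04 * flu_total and rsv_low:   # influenza at or above 4% of annual total
            labels.append("flu")
        elif f < 0.02 * flu_total and rsv_low:  # both viruses below 2%
            labels.append("base")
        else:
            labels.append(None)
    return labels


def keep_runs(labels, min_len=2):
    """Keep a label only where it forms a run of at least min_len consecutive weeks."""
    kept = [None] * len(labels)
    start = 0
    while start < len(labels):
        end = start
        while end < len(labels) and labels[end] == labels[start]:
            end += 1
        if labels[start] is not None and end - start >= min_len:
            kept[start:end] = labels[start:end]
        start = end
    return kept


def rate_difference(admissions, labels, population, coverage=0.725, per=10_000):
    """Mean weekly rate in predominance weeks minus mean weekly rate in baseline weeks.

    Admissions are scaled by 1/coverage to extrapolate from the two hospitals.
    Returns None when either kind of period is missing (model not applicable).
    """
    flu_weeks = [a for a, lab in zip(admissions, labels) if lab == "flu"]
    base_weeks = [a for a, lab in zip(admissions, labels) if lab == "base"]
    if not flu_weeks or not base_weeks:
        return None
    mean_flu = sum(flu_weeks) / len(flu_weeks)
    mean_base = sum(base_weeks) / len(base_weeks)
    return (mean_flu - mean_base) / coverage / population * per
```

Returning `None` when no predominance or baseline period can be found mirrors the situation in which the model simply cannot be applied to a given season.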

Poisson Regression Model
As in our previous study, the Poisson regression model with a log link was used to model the weekly numbers of ARD admissions [6]. The Poisson model assumes that the mean of hospital admissions is equal to its variance, but this assumption was not supported in our data, as the variance of the hospitalization data was larger than its mean (termed over-dispersion). We therefore adopted a quasi-likelihood method, which allows greater variance than the conventional Poisson distribution, to adjust for this over-dispersion [18]. Our model differs in some respects from the Serfling-Poisson model used by Thompson and colleagues, which adopted a pair of sinusoidal terms to adjust for seasonal variation of mortality [14]. To make the model more robust to the effects of less predictable seasonality, we first built a core model to control for confounders, including long-term trends, seasonal patterns of ARD admissions and meteorological factors, with natural cubic spline smoothing functions of time, weekly average temperature and relative humidity. Smoothing functions were applied to remove seasonality and long term variations which are expected to be associated with time-varying confounders. The goodness of fit of the core model, or its adequacy in controlling for time-varying confounding, was judged by a lack of autocorrelation in its residuals. If there was still autocorrelation after adjusting for all the potential confounders, additional auto-regressive (AR) terms of residuals were added to the core model until the updated residuals were distributed randomly and independently of each other. The virus activity variables for influenza (type A and B), RSV, parainfluenza (type I, II and III) and adenovirus were then simultaneously entered into the core model.
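Two of the diagnostics mentioned above, over-dispersion and residual autocorrelation, can be illustrated with a short sketch. It assumes that fitted weekly means are already available from some fitted model; the Pearson-based dispersion estimate shown here is a commonly used choice for quasi-likelihood and is an assumption, not necessarily the estimator used in the study. All function names are hypothetical.

```python
import math


def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square divided by the residual degrees of freedom."""
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi2 / (len(observed) - n_params)


def pearson_residuals(observed, fitted):
    """Residuals standardized by the Poisson standard deviation sqrt(mu)."""
    return [(y - mu) / math.sqrt(mu) for y, mu in zip(observed, fitted)]


def autocorrelation(resid, lag):
    """Sample autocorrelation of the residual series at the given lag."""
    n = len(resid)
    mean = sum(resid) / n
    denom = sum((r - mean) ** 2 for r in resid)
    num = sum((resid[t] - mean) * (resid[t - lag] - mean) for t in range(lag, n))
    return num / denom
```

A dispersion value well above 1 points to over-dispersion; in a quasi-Poisson fit the standard errors are inflated by the square root of this value. Residual autocorrelations that stay within roughly ±1.96/√n at all lags are usually read as consistent with independent residuals.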
The baseline hospitalization specific to a certain virus was calculated as the expected number of hospitalizations when the weekly proportions of that virus were set equal to zero and the observed data of the other variables were simultaneously entered into the Poisson model. At this baseline level, the specified virus was assumed not to be circulating in the community, while the effects on hospitalization associated with cold winter, other co-circulating respiratory viruses and other unknown seasonal factors were represented. The age-specific excess hospitalization rate for each specified virus was defined as the difference between the annual sums of observed and virus-specific baseline hospitalization for each age group, divided by the age-specific population. The 95% confidence intervals of excess rates were estimated by bootstrapping the residuals of the full model 1,000 times. A detailed description of the Poisson modeling approach is provided in File S1.
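The baseline and excess calculation can be sketched as follows for a log-link model. Under a log link, setting the virus term to zero is equivalent to multiplying each fitted weekly mean by exp(-beta * x), where beta is the fitted coefficient and x the weekly virus activity. The function names, the `per` scaling and the example numbers are hypothetical; the bootstrap confidence intervals are not shown because they require refitting the full model.

```python
import math


def baseline_log_link(fitted, beta, virus_activity):
    """Expected weekly admissions with the virus term set to zero (log-link model)."""
    return [mu * math.exp(-beta * x) for mu, x in zip(fitted, virus_activity)]


def excess_rate(observed, baseline, population, per=10_000):
    """(Annual sum of observed - annual sum of baseline) / population, scaled by `per`."""
    return (sum(observed) - sum(baseline)) / population * per


# Toy example: two weeks, virus active only in the first week.
fitted = [20.0, 10.0]
baseline = baseline_log_link(fitted, beta=math.log(2), virus_activity=[1.0, 0.0])
rate = excess_rate(observed=[30, 10], baseline=baseline, population=10_000)
```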
In our previous study, the virus activity was measured by the weekly proportions of specimens positive for each virus (proportion variable). We calculated the proportions based on the virology data from the microbiology laboratory of QMH, which covered all the age groups on Hong Kong Island. We also repeated the analysis with the weekly numbers of positive specimens as an alternative proxy for virus activity (number variable) in our Poisson models.
The log-link Poisson models have been previously criticized for assuming that the numbers of hospital admissions increase exponentially with the proportions of positive isolates [19]. We therefore also built a Poisson model with an identity link, in which the hospital admissions increase proportionally with unit increase of virus activity. The age-specific excess hospitalization rate associated with influenza derived from each of these Poisson models was compared with the directly observed rate of influenza and the model that provided the most accurate estimate for influenza was chosen for analysis of other respiratory viruses.
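The contrast between the two link functions can be written out explicitly. The notation below is an assumption introduced for illustration: Y_t is the weekly number of admissions, x_t the virus activity variable, beta its coefficient, and s_t the combined contribution of the core-model terms (trend, seasonality, temperature, humidity).

```latex
% Log link: virus activity acts multiplicatively on expected admissions
\log \mathrm{E}[Y_t] = \alpha + \beta x_t + s_t
\quad\Longrightarrow\quad
\mathrm{E}[Y_t] = e^{\alpha + s_t}\, e^{\beta x_t}

% Identity link: virus activity acts additively
\mathrm{E}[Y_t] = \alpha + \beta x_t + s_t
```

Under the log link, a unit increase in x_t multiplies the expected admissions by e^{beta}; under the identity link it adds beta admissions regardless of the baseline level.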

Results
The virologically confirmed, population-based age-specific hospitalization rates for each respiratory virus during these three study years are shown in Table 1. [...] and 16 weeks respectively, in which neither influenza virus nor RSV was active (baseline period). The rate difference model could not be used for the 2004-05 season since an influenza predominance period could not be defined. When compared to the directly observed virologically confirmed influenza hospitalization rates for each age group, the rate difference model failed to produce estimates that closely matched the directly observed incidence rates for the other two years (Table 1 and Figure 2).

Influenza associated ARD hospitalization rates
Poisson regression model. The log-link Poisson models with the proportion variables yielded estimates close to the directly observed rates for children older than two years, but the estimates tended to be much higher for the <1 and 1 to <2 age groups (Table 1). The identity-link function overestimated the true disease burden to a greater extent than did the log-link function in most age-year categories, no matter which influenza proxy variable was used. The influenza effect for each age group was found to be statistically significant (p<0.05) in all the models, the only exception being the <1 age group in the identity-link Poisson models with proportion variables (Table 1). Overall, the log-link models with the number variables provided the estimates closest to the directly observed rates, despite the underestimation observed for the 2-5 and 5-10 age groups (Figure 2). The log-link Poisson models with the number variables were therefore chosen for further analysis of other respiratory viruses.

ARD hospitalization rates associated with other respiratory viruses
In the age group younger than one year, the estimates of excess hospitalization rates for parainfluenza were close to those directly observed, but those for RSV tended to be lower for the 2003-2004 and 2005-2006 seasons (Table 2). For children older than 1 year of age, the estimates for excess hospitalization associated with RSV or parainfluenza derived from the log-link Poisson model with the number variables had wide confidence intervals and deviated markedly from the directly observed rates. The estimates for adenovirus associated hospitalization were unreliable at all ages.

Discussion
As previously suspected, the rate difference model is not applicable when influenza does not appear as a sharp peak with reasonable separation from RSV [3]. We further demonstrate that this model greatly underestimates the actual disease burden (even when influenza predominant periods are separated from periods of RSV circulation) if significant influenza activity exists outside the predominant peak of influenza virus activity, as commonly occurs in the subtropics and tropics. In addition to RSV, other common respiratory viruses can also have significant confounding effects, especially on hospitalization of children, and these cannot be readily accounted for in this model.
In Poisson regression models, the baseline levels of influenza are predicted from a model fitted to observed numbers of hospital admissions. The excess hospitalization estimate is robust to potential confounding effects due to uncontrolled individual factors that do not change over a short period of time, such as smoking status and preexisting chronic conditions [20]. Given the relatively broad and variable seasonality of influenza in tropical and subtropical regions such as Hong Kong, it is not surprising that the Poisson regression model, which uses a nonparametric smoothing function for modeling any pattern of weekly hospitalization, outperformed the rate difference model in our study. Additionally, the Poisson regression model allows estimation of disease burden while more efficiently adjusting for co-morbidity caused by other respiratory viruses and also for confounding by seasonal variations of hospitalization and meteorological conditions [6]. However, the Poisson regression model has been criticized for being potentially inadequate in adjusting for confounding factors, which would result in allocating more of the variation in hospital admissions to explanatory variables (influenza virus activity in our model) than they deserve and thereby overestimating the influenza effects [21]. In this study, we carefully checked the adequacy of the core models in terms of adjusting for confounding factors. If autocorrelation of weekly hospitalization data was still detected in the residuals after adjusting for confounding factors by smoothing, this would suggest that some unobservable confounding might remain unadjusted for. We would then further remove the autocorrelation by adding AR terms of the residuals of the core models. In this way, the long-term and seasonality associated confounding factors were expected to be well controlled. The close match between Poisson regression estimates and directly observed numbers suggests that our control strategy for confounding was appropriate and sufficient.
The debate over log-link Poisson models has focused on whether it is appropriate to assume a multiplicative risk for hospitalization or mortality associated with a unit increase in the proportion of positive specimens [9]. Here we did a sensitivity analysis by using the identity-link Poisson regression model, which assumes additive risks associated with influenza virus activity. The results showed that the log-link Poisson models generally returned estimates smaller than those from the identity-link models. Thompson et al. also found that these two link functions produced similar results [9]. However, the lack of a clear biological rationale has made it difficult to choose the proper link function, and there are no epidemiological studies that could provide solid evidence to support a linear or log-linear relationship between influenza cases and hospital admissions. As the log-link function ensures a nonnegative estimate for predicted hospital admission numbers, the log-link Poisson regression model is probably more appropriate for influenza disease burden studies. In this study we adopted the quasi-likelihood method in Poisson models to control for over-dispersion in the hospitalization data. The negative binomial model, which is a generalized form of the Poisson model and addresses the over-dispersion problem by a scale parameter, has also been introduced into influenza disease burden studies [8,15]. We therefore also built negative binomial models to estimate influenza associated hospitalization rates for each age-disease category. The results showed that the estimates of negative binomial models with a log link or identity link tended to be slightly larger than those of log-link Poisson models in the age groups younger than 5 years, but the log-link negative binomial models markedly underestimated the rates in the 5-10 age group, and influenza effects were not statistically significant in the 0-1 and 5-10 age groups in the identity-link negative binomial models (data not shown). These findings suggest that the quasi-likelihood method performed better than negative binomial regression in terms of producing estimates closer to the directly observed data.
The optimal indicator of virus activity in such studies has also been contentious. We have previously used the proportion of specimens positive for influenza as a proxy for influenza virus activity. In this study, we carried out a sensitivity analysis by replacing the positive proportions with the numbers of positive specimens as the influenza virus activity variable in the models. Both the log-link and identity-link models with the number variable generally yielded estimates closer to the directly observed rates, compared with the models with the proportion variable. Our virology data were obtained from one laboratory which consistently received 100 to 200 specimens for respiratory virus diagnosis each week from patients admitted with acute respiratory disease; these specimen numbers were driven by clinical need and were not capped in any way. However, in data from a community influenza-like-illness surveillance program that covers a city or an even larger area, the influenza case numbers may be subject to bias introduced by pre-determined targets for the numbers of specimens to be collected each week. Alternatively, changes in the numbers of sentinel sites or in health seeking behavior may trigger artificial changes in specimen numbers. Therefore, the proportion of positive specimens is likely to be a more robust indicator in such situations.
A limitation of this study is that the nasopharyngeal specimens were tested by immunofluorescence and viral culture, not by the more sensitive polymerase chain reaction (PCR). We found that immunofluorescence and culture underestimate influenza cases by approximately 10% and RSV cases by 13% [23]. Other studies reported a 10% to 20% increase in samples positive for RSV and parainfluenza when PCR was used [24,25]. Hence the directly observed rates in the present study may underestimate the true disease burden of these respiratory viruses. Even if we take this potential underestimation into account by inflating the directly observed rates by 10%, the log-link Poisson model with the number variable remained the best model, offering the estimates closest to the directly observed rates of influenza and RSV, and the excess rates for parainfluenza in the <1 age group were almost equal to the true disease burden.
Although parainfluenza and adenovirus caused substantial hospitalizations of children in our study, the Poisson model did not return reliable estimates for these two respiratory viruses, with the exception of parainfluenza infection in those <1 year of age. The poor estimates in older children could be due to relatively low proportions of samples positive for parainfluenza and adenovirus from the surveillance network (Figure S1). Compared to influenza and RSV, these two viruses exhibit less clearly defined seasonal variations, which may further increase the difficulty in relating the variation of hospitalizations to the weekly proportions of parainfluenza and adenovirus.
In our study, the rate difference model tended to provide smaller estimates of influenza associated disease burden than did the Poisson model. Interestingly, Thompson et al. found that the rate difference model consistently returned higher estimates than the Poisson model in the US [14]. This could be a result of the different definitions adopted by the two studies. Their study did not consider co-circulation of RSV when defining baseline and epidemic periods, largely owing to the overlapping seasonal peaks of these two viruses in the US [22]. Moreover, the well defined single peak of influenza in winter in the US would allow a more accurate estimate from the rate difference model than in a tropical or subtropical region where the virus activity is more dispersed throughout the year.
Contrary to our findings, the Canadian study by Gilca et al. found that estimates derived from both log-link Poisson and rate difference models significantly underestimated the pneumonia and influenza admissions with laboratory confirmed influenza in children of two years old or younger [15]. Their study has the limitation that laboratory-confirmed diagnoses of influenza and RSV came from a convenience sample from one hospital that served 10% of Quebec children and may not represent the pediatric RSV/influenza epidemiology of the province. In contrast, our study is based on the data directly derived from 72.5% of all hospitalizations in Hong Kong Island, i.e. the population denominator. Gilca et al. adjusted for interaction between influenza and RSV by adding a product term into their Poisson models. We did a sensitivity analysis with the same approach and the results showed that influenza effects changed only marginally and the RSV estimates remained nearly unchanged (data not shown). Therefore, we did not need to include any interaction term in our full models.
Poisson regression models very accurately estimated the excess rate for influenza in the 2-18 year old age groups, but the influenza estimates for children younger than two years were much higher than the directly observed rates. Interestingly, the estimates for RSV were correspondingly lower in the young children. As the peaks of influenza and RSV occasionally overlapped in the study period, it is possible that such under- and over-estimation could be the result of multicollinearity between virus variables in our Poisson models. We therefore did a sensitivity analysis by assessing the effects of each single virus without adjustment for co-circulation of other viruses, i.e. with only one virus variable entered into the model. The results showed that the influenza estimates slightly increased after this model change, with the exception of the 2-<5 age group. The RSV associated rates slightly increased for the 1-<2 and 5-<10 age groups but decreased for children younger than 1 year old (Table S1). The Poisson estimates for parainfluenza and adenovirus without adjustment for co-circulation were dramatically decreased (compared to the full models), but this is more likely due to the unstable nature of these estimates, as shown by their wide confidence intervals.
The results indicate a mild extent of collinearity between influenza and RSV variables, which may not significantly affect our estimates. Some of the confidence intervals for estimated age-specific rates of influenza associated hospitalization tended to be rather wide, presumably due to the fact that the study covered a period of only three years. We would expect to get narrower confidence intervals with longer time series, although it is difficult to sustain the systematic virological diagnosis of all ARD admissions on an ongoing basis to provide the virologically confirmed independent validation of the estimates.
In conclusion, Poisson regression modeling is applicable to the assessment of disease burden due to influenza associated hospitalization in children. It also yields reasonable estimates for RSV and parainfluenza in children <1 year of age, but performs less well for older children. None of these methods provided reliable estimates of disease burden for adenovirus hospitalization. Although mortality was not assessed and could not be validated in this study, we could infer that the good correlation between the output of the Poisson model and virologically confirmed hospitalization will also be applicable to mortality estimates. Such an approach to estimation of influenza disease burden requires long-term virological surveillance data, which restricts its application in some geographic regions. However, with the heightened attention on influenza arising from the threat of avian influenza H5N1, and now from the recent pandemic H1N1, such virological data are becoming increasingly available in many countries, making Poisson regression modelling feasible in an increasing number of contexts.

Supporting Information
Figure S1 Proportions of specimens positive for common respiratory viruses. Data were collected from the influenza surveillance network of Hong Kong Island in the study period. (TIF)

Author Contributions