The authors have declared that no competing interests exist.
Contributed concepts: MP PJH CTW SR. Contributed feedback on analyses: CTW SR JAH. Organized the collaboration: SR. Conceived and designed the experiments: KMP JW CTW MP PJH WH HZ YG SR. Performed the experiments: JW WH HZ YG. Analyzed the data: KMP. Contributed reagents/materials/analysis tools: YG. Wrote the paper: KMP.
An ability to forecast the prevalence of specific subtypes of avian influenza viruses (AIV) in live-bird markets would greatly facilitate the implementation of preventative measures designed to minimize poultry losses and human exposure. The minimum requirement for developing predictive quantitative tools is surveillance data of AIV prevalence sampled frequently over several years. Recently, a 4-year time series of monthly sampling of hemagglutinin subtypes 1–13 in ducks, chickens and quail in live-bird markets in southern China has become available. We used these data to investigate whether a simple statistical model, based solely on historical data (variables such as the number of positive samples in host X of subtype Y time
H5 and H9 subtypes of avian influenza viruses (AIV) are two of the three avian subtypes (H7 is the third) known to cause infection in humans
The three subtypes of avian influenza that have occurred naturally in humans thus far are H5, H7 and H9
The main goal of the analyses described here was to investigate whether a regression model, with biologically interpretable parameters, could be developed from surveillance data as an easy-to-use tool for anticipating the prevalence of H9 or H5 in Chinese live-bird markets. As a case study, we focused on prevalence in chickens because this poultry species is a major staple with relatively high AIV prevalence. In addition, we used the statistical framework to investigate whether AIV prevalence in other host species was associated with prevalence levels in chickens. Lastly, we conducted model selection to identify which surveillance data were most crucial for predicting prevalence of H9 and H5. We discuss which missing data would likely improve model accuracy.
Routine sampling of
Cloacal and tracheal swabs were collected from each bird. Birds were counted as positive if virus was isolated from at least one of the two samples. Virus was isolated using embryonated chicken eggs, and AIV subtypes H1–13 and Avian Paramyxovirus type 1 (APMV-1) were identified using monospecific antisera in hemagglutination inhibition (HI) tests
Positive counts were aggregated at a monthly scale and transformed to counts per 100 birds ([count/sample size] × 100), a denominator that is close to the mean sample sizes (see below). Data from H2, H7, H8, H10, H12 and H13 were discarded since these subtypes were very rare. Because our previous work identified strong host preferences among subtypes, and because sample sizes differed among host species, it was important to model prevalence within individual host species
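The per-100 transformation described above is simple arithmetic, and a minimal sketch may make it concrete. The function name `per_100` and the monthly tallies below are hypothetical and are not taken from the surveillance data.

```python
def per_100(positives, sampled):
    """Positive isolates per 100 birds sampled: (count / sample size) x 100."""
    if sampled <= 0:
        raise ValueError("sample size must be positive")
    return 100.0 * positives / sampled


# Hypothetical monthly tallies for one host species: (positives, birds sampled).
monthly = [(3, 120), (0, 95), (7, 110)]

# One rate per month, comparable across months with different sample sizes.
rates = [per_100(pos, n) for pos, n in monthly]
```

Scaling to a common denominator keeps months with unequal sampling effort comparable, which is the stated reason for the transformation.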
Covariates were considered within the same time step and at a lag of one month because, by observation, peaks of incidence are of that approximate duration. Hence, if weather variables contribute to prevalence peaks, we might expect high prevalence to occur ∼1 month after a rise or dip in weather values. Furthermore, since the infectious period and transition times through the market are very short (∼5–10 days and 2–3 days, respectively), we did not expect prevalence of other subtypes to affect prevalence of H5 or H9 at more than one time step in the past (i.e., one month). All covariates were normalized by subtracting the mean of all points from each point and then dividing the result by the standard deviation. Because the infectious period is relatively short (usually <1 week
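The normalization and one-month lag described in this paragraph can be sketched as follows. This is an illustration only: the helper names are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev


def standardize(values):
    """Subtract the mean from each point, then divide by the standard deviation.

    Assumption: sample standard deviation (n - 1); the text does not specify.
    """
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant series")
    return [(v - mu) / sd for v in values]


def lag_by_one(values):
    """Align each month t with the covariate value from month t - 1.

    The first month has no predecessor, so it is marked as missing (None).
    """
    return [None] + list(values[:-1])


# Hypothetical monthly covariate (e.g., a weather variable).
covariate = [2.0, 4.0, 6.0]
z_same_step = standardize(covariate)    # covariate at time t
z_lagged = lag_by_one(z_same_step)      # covariate at time t - 1
```

Any month whose lagged value is missing would have to be dropped before fitting, so a lagged model is fit on one fewer point than the same-step model.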
We used generalized linear modeling. H9 data were modeled with a negative binomial (NB) error structure (log link; ‘glm.nb’ function in the ‘MASS’ package in R 2.15.1
First, we tested whether the prevalence of H9 and H5 in chickens in retail markets is associated with the prevalence of these subtypes in the other host species (this model is referred to as “DK+QA” since it includes parameters for subtype prevalence in ducks and quail), by constructing models with only these data (H9 or H5 in ducks and quail, separately) as covariates. Second, in order to examine forecast ability and to identify which variables were important for predicting the prevalence of H9 and H5, we conducted model selection using prevalence data from the other subtypes in each of the 3 hosts and weather data. We included data from the same time step as well as data from 1 month in the past. Due to the large number of possible models, we performed a preliminary step in which we fit all single-variable models and selected the variables that improved model AIC by at least 2 points over the intercept-only model. From this subset of variables, we fit all possible combinations and selected the top model (referred to as “Best”) by AIC.
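The two-stage selection described above (single-variable screening followed by an exhaustive search over the surviving variables) can be sketched as below. This is not the authors' code: `aic_of` is a hypothetical callable standing in for "fit the regression with these covariates and return its AIC", and the toy AIC values are invented purely to exercise the logic.

```python
from itertools import combinations


def select_best(candidates, aic_of, min_gain=2.0):
    """Screen single variables by AIC gain, then search all subsets of the survivors.

    aic_of(subset) is assumed to fit a model with the given tuple of covariates
    and return its AIC; the empty tuple denotes the intercept-only model.
    """
    null_aic = aic_of(())
    # Stage 1: keep variables whose single-variable model beats the
    # intercept-only model by at least `min_gain` AIC points.
    kept = [v for v in candidates if null_aic - aic_of((v,)) >= min_gain]
    # Stage 2: fit every combination of the kept variables; lowest AIC wins.
    best_subset, best_aic = (), null_aic
    for size in range(1, len(kept) + 1):
        for subset in combinations(kept, size):
            score = aic_of(subset)
            if score < best_aic:
                best_subset, best_aic = subset, score
    return kept, best_subset, best_aic


# Toy AIC table (invented values) in place of real model fits.
toy = {(): 100.0, ("a",): 95.0, ("b",): 99.5, ("c",): 97.0, ("a", "c"): 93.0}


def toy_aic(subset):
    return toy[tuple(sorted(subset))]


kept, best, best_aic = select_best(["a", "b", "c"], toy_aic)
```

The screening step matters because the exhaustive stage grows as 2^k in the number of kept variables, so pruning weak candidates first keeps the search tractable.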
We assessed the appropriateness of models by probability plots and autocorrelation function plots of residuals, and Vuong tests
First, we fit models of H9 in chickens using H9 prevalence in ducks and quail as covariates to evaluate whether these other host species of H9 are associated with H9 prevalence in chickens. H9 in ducks and quail were not correlated with H9 in chickens when considered on their own (
Data are the prevalence of H9 per 100 chickens sampled. Data were modeled by negative binomial regression with a log link. “Best” is the set of covariates that were selected by AIC: allH4, allH6, QAH6, QAH9, allH5_{t−1}, where “all” is the prevalence of subtype HX in all 3 host species (CK+DK+QA), “QA” is for prevalence in only quail, “DK” is for prevalence in only ducks, and t−1 denotes prevalence in the previous month.
Model  AIC  Pseudo-R^{2}  MSPE  NMSPE
Intercept-only  211.9  0  52.0  NA
DKH9  213.4  0.01  52.2  30.8
QAH9  211.5  0.06  56.3  3.8
DKH9+QAH9  213.2  0.07  59.2  3.2
Best  191.8  0.57  30.0  0.4
Cragg & Uhler’s method: (1 − (L_{0}/L_{m})^{2/N}) / (1 − L_{0}^{2/N}); L = likelihood; 0 = intercept-only model; m = full model; N = number of data points.
Mean Squared Prediction Error: sum((y − m)^{2})/N; smaller values indicate better fits; y = observed data; m = mean of predicted data; N = number of points predicted.
Normalized Mean Squared Prediction Error: sum(((y − m)/s)^{2})/N; smaller values indicate better fits; y = observed data; m = mean of predicted data; s = standard deviation of predicted data; N = number of points predicted.
Best model covariates: H4 prevalence in all hosts, H6 prevalence in all hosts, H6 prevalence in quail, H9 prevalence in quail, H5 prevalence in all hosts one month in the past.
Covariates in other models: DKH9 = H9 prevalence in ducks, QAH9 = H9 prevalence in quail, DKH9+QAH9 = sum of DKH9 and QAH9.
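As a rough consistency check on the pseudo-R^{2} definition given in the footnotes above, the sketch below computes Cragg & Uhler's statistic from log-likelihoods (working on the log scale avoids numerical underflow of the raw likelihoods). The parameter counts used to back out log-likelihoods from the reported AIC values of 211.9 and 191.8 are assumptions (2 for an intercept plus dispersion parameter; 7 for an intercept, five covariates and dispersion), as is N = 36 monthly points.

```python
import math


def cragg_uhler_r2(loglik_null, loglik_model, n):
    """(1 - (L0/Lm)^(2/N)) / (1 - L0^(2/N)), evaluated from log-likelihoods."""
    numerator = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    upper_bound = 1.0 - math.exp(2.0 * loglik_null / n)
    return numerator / upper_bound


def loglik_from_aic(aic, n_params):
    """Invert AIC = 2k - 2 log L to recover the log-likelihood."""
    return n_params - aic / 2.0


# Assumed parameter counts and sample size (not stated in the text).
ll_null = loglik_from_aic(211.9, 2)
ll_best = loglik_from_aic(191.8, 7)
r2 = cragg_uhler_r2(ll_null, ll_best, 36)
```

Under these assumptions the result should land close to the reported value of 0.57; it is a plausibility check rather than a reproduction of the original calculation.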
Covariate  Estimate  SE  P
Intercept  1.24  0.15  <0.0001
allH4  0.28  0.15  0.059
allH6  −0.56  0.29  0.058
QAH6  1.05  0.29  0.0003
QAH9  0.24  0.13  0.064
allH5_{t−1}  0.41  0.13  0.0021
Model covariates: allH4 = H4 prevalence in all hosts, allH6 = H6 prevalence in all hosts, QAH6 = H6 prevalence in quail, QAH9 = H9 prevalence in quail, allH5_{t−1} = H5 prevalence in all hosts one month in the past.
The best model of H9 prevalence fit the data quite well both quantitatively (pseudo-R^{2} = 0.57; large reduction in MSPE relative to the intercept-only model;
The model was fit (red) on the first 3 years of data (black). Forecasts are shown for the fourth year of data using 3 methods: 1) Forecasting the full 12 months of data (blue), 2) Iterative fitting and forecasting where additional data were included at each step (SxS A, purple), and 3) Iterative fitting and forecasting using a sliding window where model parameters were always estimated from 36 months of data (SxS B, green). B–D show an alternative way of viewing the fits. B shows the fit of the model and C and D show the fit of the forecasted points using the two best methods (SxS A (C) and SxS B (D)).
Method  MSPE  NMSPE
In-sample  30.0  0.4
Full 12-month forecast  24.5  1.9
SxS A  23.1  1.8
SxS B  24.1  1.6
In-sample data are for the fitted model. Other methods are described in
Mean Squared Prediction Error (MSPE): sum((y − m)^{2})/N; smaller values indicate better fits; y = observed data; m = mean of predicted data; N = number of points predicted.
Normalized Mean Squared Prediction Error (NMSPE): sum(((y − m)/s)^{2})/N; smaller values indicate better fits; y = observed data; m = mean of predicted data; s = standard deviation of predicted data; N = number of points predicted.
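The two error measures defined in these footnotes translate directly into code. The sketch below is illustrative only: the function names are hypothetical and the numbers are invented, not drawn from the study.

```python
def mspe(observed, predicted_mean):
    """Mean squared prediction error: sum((y - m)^2) / N."""
    n = len(observed)
    return sum((y - m) ** 2 for y, m in zip(observed, predicted_mean)) / n


def nmspe(observed, predicted_mean, predicted_sd):
    """Normalized MSPE: sum(((y - m) / s)^2) / N.

    Each error is scaled by the standard deviation of the prediction, so a
    miss counts for less where the model itself expects wide variation.
    """
    n = len(observed)
    return sum(
        ((y - m) / s) ** 2
        for y, m, s in zip(observed, predicted_mean, predicted_sd)
    ) / n


# Invented example: three predicted months.
y = [2.0, 0.0, 5.0]   # observed counts per 100 birds
m = [1.0, 1.0, 3.0]   # predicted means
s = [1.0, 2.0, 2.0]   # predicted standard deviations

raw_error = mspe(y, m)
scaled_error = nmspe(y, m, s)
```

Because the two measures weight errors differently, a method can rank well on one and poorly on the other, which is why both columns are reported.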
Modeling H5 data was more challenging because H5 is rare relative to H9 and its rarity increased in the last year of the time series, the section used for out-of-sample prediction. Similar to H9, H5 in ducks and quail were not correlated with H5 in chickens (
Data are the prevalence of H5 per 100 chickens sampled. Data were modeled by zero-inflated negative binomial regression with a log link on the abundance component. “Best” is the set of covariates that were selected by AIC: maximum wind speed and DKH9_{t−1}, where “DK” is for prevalence in ducks, and t−1 denotes prevalence in the previous month.
The model was fit (red) on the first 3 years of data (black). Forecasts are shown for the fourth year of data using 3 methods: 1) Forecasting the full 12 months of data (blue), 2) Iterative fitting and forecasting where additional data were included at each step (SxS A, purple), and 3) Iterative fitting and forecasting using a sliding window where model parameters were always estimated from 36 months of data (SxS B, green). B–D show an alternative way of viewing the fits. B shows the fit of the model and C and D show the fit of the forecasted points using the two best methods (SxS A (C) and SxS B (D)).
Model  AIC  Pseudo-R^{2}  MSPE  NMSPE
Intercept-only  71.8  0  1.7  NA
DKH5  71.4  0.14  1.7  111.1
QAH5  74.1  0.06  1.7  64.1
DKH5+QAH5  73.1  0.20  1.7  39.5
Best  60.9  0.49  0.8  0.8
Column statistics are by the same methods as described in
Cragg & Uhler’s method: (1 − (L_{0}/L_{m})^{2/N}) / (1 − L_{0}^{2/N}); L = likelihood; 0 = intercept-only model; m = full model; N = number of data points.
Mean Squared Prediction Error: sum((y − m)^{2})/N; smaller values indicate better fits; y = observed data; m = mean of predicted data; N = number of points predicted.
Normalized Mean Squared Prediction Error: sum(((y − m)/s)^{2})/N; smaller values indicate better fits; y = observed data; m = mean of predicted data; s = standard deviation of predicted data; N = number of points predicted.
Best model covariates: Maximum wind speed, H9 prevalence in ducks one month in the past.
Covariates in other models: DKH5 = H5 prevalence in ducks, QAH5 = H5 prevalence in quail, DKH5+QAH5 = sum of DKH5 and QAH5.
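Component  Covariate  Estimate  SE  P
Count (negative binomial)  Intercept  −0.96  0.58  0.099
Count (negative binomial)  Maximum wind speed  1.04  0.43  0.017
Count (negative binomial)  DKH9_{t−1}  0.47  0.20  0.022
Zero-inflation (binomial)  Intercept  −2.81  8.63  0.74
Zero-inflation (binomial)  Maximum wind speed  −0.60  0.94  0.53
Zero-inflation (binomial)  DKH9_{t−1}  −8.81  17.08  0.61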
The zero-inflated negative binomial model is a mixture of two separate data generation processes (i.e., model “components”): one to describe zeros (binomial) and the other to describe counts from a negative binomial model.
DKH9_{t−1} = H9 prevalence in ducks one month in the past.
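The two-component mixture described in the footnote above can be written out as a probability mass function. The sketch below is illustrative and assumes the mean/dispersion parameterization of the negative binomial (mean mu, variance mu + mu^2/theta); the parameter names `pi_zero`, `mu` and `theta` and the numeric values are hypothetical.

```python
import math


def nb_pmf(k, mu, theta):
    """Negative binomial probability of count k, with mean mu and dispersion theta."""
    log_p = (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)


def zinb_pmf(k, pi_zero, mu, theta):
    """Zero-inflated negative binomial probability of count k.

    With probability pi_zero the observation is a structural zero; otherwise
    it is drawn from the negative binomial (which can also produce zeros).
    """
    count_part = (1.0 - pi_zero) * nb_pmf(k, mu, theta)
    return pi_zero + count_part if k == 0 else count_part


# Hypothetical parameter values, chosen only to illustrate the excess of zeros.
p_zero_nb = nb_pmf(0, mu=1.5, theta=0.8)
p_zero_zinb = zinb_pmf(0, pi_zero=0.3, mu=1.5, theta=0.8)
total = sum(zinb_pmf(k, 0.3, 1.5, 0.8) for k in range(400))
```

The mixture places more probability on zero than the negative binomial alone, which is what makes it a natural choice for a rare subtype with many zero-count months.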
Method  MSPE  NMSPE
In-sample  0.8  0.8
Full 12-month forecast  2.9  1.1
SxS A  2.9  1.1
SxS B  0.1  8.0
In-sample data are for the fitted model. Other methods are described in
Mean Squared Prediction Error (MSPE): sum((y − m)^{2})/N; smaller values indicate better fits; y = observed data; m = mean of predicted data; N = number of points predicted.
Normalized Mean Squared Prediction Error (NMSPE): sum(((y − m)/s)^{2})/N; smaller values indicate better fits; y = observed data; m = mean of predicted data; s = standard deviation of predicted data; N = number of points predicted.
Anticipating the prevalence of specific subtypes of AIV in domestic poultry settings is critical for planning and implementing cost-effective public health preventions. To date, it has not been possible to forecast the prevalence of any AIV subtype in poultry because the appropriate data have not been available and the ecology and population dynamics of AIV are complex. Here, we evaluated the possibility that surveillance data that were collected for purposes other than our analyses could be used both to gain information on factors that may influence the dynamics of AIV within live-bird markets in southern China and to create a tool for predicting prevalence. Our two most important findings were that: prevalence of H9 and H5 in chickens was uncorrelated with prevalence of these subtypes in ducks or quail
One striking difference between the best models of H9 and H5 was that H5 dynamics were mainly associated with environmental variables (except for the influence of H9) whereas H9 was unaffected by weather and mainly associated with the dynamics of other subtypes. Our previous study
A second difference between H9 and H5 was in the relative performance of the alternate methods of out-of-sample prediction. The step-by-step method was best for both H5 and H9, but it was crucial to use the step-by-step sliding window approach for H5. This was because the time series shows an obvious change in prevalence patterns, from distinct peaks in the first two years to sporadic cases in the later years. Thus, in the case of H5 in chickens, inclusion of more data for model fitting decreased prediction accuracy due to a shift in the dynamics of the system. Excluding the very early data (the sliding window approach of step-by-step B) improved out-of-sample prediction accuracy because the earlier dynamics have less weight in the prediction. In the case of H9, where there was no apparent shift in prevalence regime, the penalty for including earlier data was much less severe, although it did exist. This suggests that in dynamical systems with complex ecology such as AIV in poultry, it can be a good strategy to update model parameters with the most recent data (and exclude earlier data) if the primary objective is to forecast prevalence. However, this would only be the case in systems that continually shift to different behaviors rather than those that cycle between similar behaviors.
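The difference between the two step-by-step schemes discussed above comes down to which months are used to re-estimate the model before each forecast. The sketch below generates the training windows for both schemes; it assumes one-month-ahead forecasts over a 48-month series with 36 months in the initial fit, and the function name `forecast_windows` is hypothetical.

```python
def forecast_windows(n_fit, n_total, sliding):
    """Yield (training_months, target_month) pairs for one-step-ahead forecasts.

    sliding=False: expanding window; every month before the target is used.
    sliding=True:  fixed-length window; only the latest n_fit months are used.
    """
    for target in range(n_fit, n_total):
        start = target - n_fit if sliding else 0
        yield range(start, target), target


# Months are indexed 0..47; the first 36 are the initial fitting period.
sxs_a = list(forecast_windows(36, 48, sliding=False))  # expanding window
sxs_b = list(forecast_windows(36, 48, sliding=True))   # 36-month sliding window

# For the final forecast, the expanding scheme trains on months 0-46,
# whereas the sliding scheme trains only on months 11-46.
last_train_a, last_target = sxs_a[-1]
last_train_b, _ = sxs_b[-1]
```

Under the sliding scheme the earliest months drop out of the fit as the forecast origin advances, which is why it adapts faster when the dynamics shift.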
The apparent regime shift for H5 may partly have been due to the low isolation rates in chickens relative to other host species (especially geese) and was not observed when data from the same time period in other Chinese provinces were summed with the Shantou data
The best model of H9 included prevalence data from H4, H6 and H5. Previous analyses of the data showed that H6 and H9 tend to coinfect with each other more often than with other subtypes and H6 and H5 were the only subtypes besides H9 that tended to infect quail
It is difficult to predict the future of biological systems using past events, even when data collection is designed for predictive modeling and the data are collected with high accuracy. The fact that our models, which are based on data collected primarily to obtain viral isolates, captured future AIV prevalence as well as they did shows that a simple statistical framework could serve as a tool for AIV control policy decisions. Moreover, from a management perspective, it is relevant to consider the qualitative fit, which is remarkably good: the models captured the timing of major peaks and did not predict outbreaks that did not occur, which is a key aspect for prevention. For example, although the H9 data show a double peak at months 23 and 26, with relatively high prevalence in between, our model predicts a rise beginning at month 21 with a single peak at month 24. From a management perspective, capturing the second peak in the double-peak sequence is not important since high surveillance and interventions could be initiated at the outset of the predicted rise, which would allow preparedness for the second peak that was not captured. Similarly, the model of H5 captured the timing of major peaks and although it did predict single cases when none occurred (i.e.,
In our data, the most frequent, consistent sampling interval was one month. Very few covariates showed any significant signal when lagged by one month, which is not completely surprising since the infectious period is much shorter (∼ 1 week). We would expect that potential effects from other subtypes or weather would occur on the time scale of the infectious period since this is also the maximum length of time that individuals remain in the market. We could not model lagged effects at these biologically relevant time scales (i.e., 1 week) because the sampling frequency was not high enough. Instead, we considered covariates from the same time step as the response variable since any potential lagged effects would be subsumed into the same time step. Thus, the models we presented could not be used for forecasting,
Nevertheless, we showed that covariates selected using past data can predict future data, highlighting that a simple statistical framework could be used for predicting prevalence patterns of specific subtypes despite the complex ecological context. In order to develop statistical forecasting tools that can be applied towards anticipating the timing of outbreaks, the time interval should approximate the infectious period of the virus (weekly at most) and turnover times of different poultry types. It may be that the same number of samples with much better temporal coverage would permit a statistical model similar to those presented here but with much higher utility.
Another crucial data gap that existed in the AIV surveillance program from which we obtained our data is concurrent information on the population dynamics of each host species (i.e., rates of influx and outflow from poultry holdings). These data should be relatively easy to collect (although they are subject to privacy laws in some areas) and are crucial for extending our strictly statistical method to incorporate mechanistic details such as transmission between hosts. Semi-mechanistic time series models (TSIR) have been very successful at forecasting the behavior of disease systems accurately
Our study has shown that reasonable forecasts can be made with a statistical model based solely on historical patterns. The limitation of this approach is that it is unclear whether our models will maintain reasonable accuracy in the long term, especially if large perturbations due to weather or human intervention cause a dramatic shift in AIV dynamics. Thus, to use our approach in the long term it may be important to periodically repeat the model selection routine in case predictor variables change. It will also be useful to collect surveillance data that would enable the development of mechanistic models that could be used to evaluate how interventions may affect prevalence and predictors of prevalence. A better mechanistic understanding of AIV prevalence in source populations and transmission within markets will help with developing models that produce reliable forecasts year after year.