An integer GARCH model for a Poisson process with time-varying zero-inflation

Isuru Panduka Ratnayake; V. A. Samaranayake

doi:10.1371/journal.pone.0285769

Abstract

A serially dependent Poisson process with time-varying zero-inflation is proposed. Such formulations have the potential to model count data time series arising from phenomena such as infectious diseases that ebb and flow over time. The model assumes that the intensity of the Poisson process evolves according to a generalized autoregressive conditional heteroscedastic (GARCH) formulation and allows the zero-inflation parameter to vary over time and be governed by a deterministic function or by an exogenous variable. Both the expectation maximization (EM) and the maximum likelihood estimation (MLE) approaches are presented as possible estimation methods. A simulation study shows that both parameter estimation methods provide good estimates. Applications to two real-life data sets on infant deaths due to influenza show that the proposed integer-valued GARCH (INGARCH) model provides a better fit in general than existing zero-inflated INGARCH models. We also extended a non-linear INGARCH model to include zero-inflation and an exogenous input. This extended model performed as well as our proposed model with respect to some criteria, but not with respect to all.

Citation: Ratnayake IP, Samaranayake VA (2023) An integer GARCH model for a Poisson process with time-varying zero-inflation. PLoS ONE 18(5): e0285769. https://doi.org/10.1371/journal.pone.0285769

Editor: Cathy W. S. Chen, FCU: Feng Chia University, TAIWAN

Received: September 25, 2022; Accepted: May 2, 2023; Published: May 18, 2023

Copyright: © 2023 Ratnayake, Samaranayake. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript and its Supporting Information files.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The standard Poisson point process, which assumes statistical independence between observations, is not suitable for modelling time series of counts that display serial dependence. For example, counts of infectious disease occurrences can be considered dependent on previous counts because of the infectious nature of the illness. One way to address this deficiency is to define a time series where the count at a given time is generated by a Poisson distribution whose intensity parameter depends on past counts and past intensities. Observe that for a discrete time process defined in such a manner, with a linear dependence structure and observed at equally spaced time points, the intensity parameter at a given time is the mean count at that time conditional on the past information. Because many count data processes contain more zeros than the underlying probability distribution can produce, that is, they exhibit zero-inflation, modelling serial dependence alone is not sufficient. Existing models for count data that accommodate serial dependence and zero-inflation, however, do not allow for the zero-inflation probability to vary over time cyclically or be driven by an exogenous variable. This becomes a handicap when modelling count data time series, such as infectious disease or death counts, that vary seasonally or with respect to an external factor that varies with time. To address this shortcoming, we propose the Time-Varying Zero-Inflated Poisson integer-valued GARCH model (TVZIP-INGARCH), which is based on a zero-inflated Poisson process whose intensity parameter satisfies the integer-valued GARCH (INGARCH) formulation introduced by Ferland, Latour, and Oraichi [1]. It is also a generalization of the model introduced by Zhu [2], which assumes constant zero-inflation. The proposed model can accommodate cases where the zero-inflation probability is driven by a deterministic function of time, such as a sinusoidal wave, or by exogenous variables.
Before we introduce this model formally, we will discuss some relevant INGARCH type count data time series models that have been proposed in the past. Note that we extended an existing non-linear INGARCH model without zero-inflation to include a zero-inflation component as well as an exogenous input variable. The performance of this new formulation and of several existing INGARCH type models with constant zero-inflation was compared to that of the proposed model with respect to model fit on two real-life data sets.

Rydberg and Shephard [3] were the first to propose a count data time series model that accounts for serial dependence. In their model, the current conditional mean is a linear function of both the observed count and the conditional mean at the previous time point. Similar models have been proposed by several other authors and these are discussed in Chapter 4 of the book by Kedem and Fokianos [4]. Heinen [5] generalized the lag one model of Rydberg and Shephard to include an arbitrary number of lags for both the past counts and past conditional means and named it the Autoregressive Conditional Poisson model with lags p and q (ACP (p, q)). The formulation of this model resembles that of the generalized autoregressive conditional heteroscedastic (GARCH) model of Bollerslev [6], but models the conditional mean rather than the conditional variance. Heinen derived the properties of his model only for the ACP (1, 1) case. The general case was investigated by Ghahramani and Thavaneswaran [7], who referred to the Heinen paper as the origin of the ACP model. Independently, Ferland et al. [1] proposed what was termed the Integer-valued GARCH (INGARCH) process, which is essentially the same as the ACP model of Heinen.

The INGARCH model of order (p, q) is defined as follows:

$$X_t \mid \mathcal{F}_{t-1} \sim \mathrm{Poisson}(\lambda_t), \qquad \lambda_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i X_{t-i} + \sum_{j=1}^{q} \beta_j \lambda_{t-j}, \tag{1}$$

where {X_t} is the count process and λ_t is the conditional mean of X_t given F_{t−1}, the sigma-field generated by the X_l and λ_l values for l<t. The model is defined with the parameter constraints α₀>0, α_i≥0 (i = 1,…,p), and β_j≥0 (j = 1,…,q) for positivity of λ_t, and α₁+…+α_p+β₁+…+β_q<1 for stationarity of X_t. In addition to presenting the model, Heinen [5] also derived the stationarity conditions and covariance functions, and addressed the problem of maximum likelihood estimation (MLE) of the parameters. As mentioned earlier, Zhu [2] extended this model to accommodate zero-inflation. In the following literature review, we first focus on additional work done on INGARCH models without zero-inflation before moving on to the zero-inflated models.
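As a minimal sketch of how an INGARCH (1, 1) recursion of this kind generates data, the following Python code simulates one path. The function names, the parameter values, the Knuth-style Poisson sampler, and the choice to start the recursion at the stationary mean α₀/(1−α₁−β₁) are illustrative assumptions and are not taken from the cited papers.

```python
import math
import random

def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method (adequate for small lam)."""
    if lam <= 0.0:
        return 0
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def simulate_ingarch11(n, alpha0, alpha1, beta1, seed=0):
    """Simulate X_t | past ~ Poisson(lam_t) with lam_t = alpha0 + alpha1*X_{t-1} + beta1*lam_{t-1}."""
    if not (alpha0 > 0 and alpha1 >= 0 and beta1 >= 0 and alpha1 + beta1 < 1):
        raise ValueError("parameters violate the positivity/stationarity constraints")
    rng = random.Random(seed)
    lam_prev = alpha0 / (1.0 - alpha1 - beta1)  # assumed start-up value: the stationary mean
    x_prev = poisson_draw(lam_prev, rng)
    xs, lams = [], []
    for _ in range(n):
        lam = alpha0 + alpha1 * x_prev + beta1 * lam_prev
        x = poisson_draw(lam, rng)
        xs.append(x)
        lams.append(lam)
        x_prev, lam_prev = x, lam
    return xs, lams

xs, lams = simulate_ingarch11(n=2000, alpha0=1.0, alpha1=0.2, beta1=0.2)
print(sum(xs) / len(xs))  # sample mean, to be compared with alpha0 / (1 - alpha1 - beta1)
```

Because every λ_t is at least α₀, the positivity constraint guarantees a valid Poisson intensity at each step.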

Weiß [8] extended the previous results on the class of INGARCH models and derived a set of Yule-Walker type equations for the autocorrelation function for the general INGARCH case. Several important theoretical contributions for both the linear and non-linear Poisson autoregressive models were made by Fokianos et al. [9]. For the model where the conditional mean is a linear function of the past conditional mean and past count, they proved that the maximum likelihood estimates are asymptotically normal. Note that the model they considered is INGARCH (1, 1), but the authors refrained from calling it such and labelled it Poisson autoregression because of their stated belief that the GARCH moniker should be reserved for formulations that model variance. In the non-linear case, where the conditional mean is a non-linear function of its past values and past counts, the authors established that the intensities of the Poisson distribution at each time point form a geometrically ergodic Markov chain under some general assumptions. Fokianos and Tjøstheim [10] further extended these results and showed that in the log-linear Poisson autoregression case, the maximum likelihood estimates are asymptotically normal, and that the estimated covariance matrix of the parameter estimates is consistent. Chen and Lee [11] extended the log-linear Poisson INGARCH formulation by introducing exogenous variables with linear effects into the equation for the logarithm of the conditional mean and proposed the log-linear Poisson INGARCHX and negative binomial INGARCHX models. Here X denotes a model with exogenous inputs. The authors suggest that the forecasting performance of the model can be improved by the addition of covariates, and that it helps to properly analyse the causal relationships between exogenous factors and the count time series. The softplus INGARCH model, a non-linear version of INGARCH, was proposed by Weiß, Zhu, and Hoshiyar [12]. This model utilizes the softplus link function as an alternative to the log-link function, thus allowing for negative autocorrelations. We extended this model to include zero-inflation and accommodate an exogenous variable as input. Details of this extension are given in a later section.

Negative binomial (NB), Generalized Poisson (GP), and Double Poisson (DP) are well-known discrete distributions that can also be used as alternatives to the Poisson distribution (Zhu [13]). A negative binomial INGARCH model (NB-INGARCH), which is an alternative to the Poisson INGARCH formulation, was proposed by Zhu [14], who also obtained the stationarity conditions and the autocorrelation function of the process. Ye, Garcia, Pourahmadi, and Lord [15] allowed the negative binomial INGARCH model of Zhu [14] to incorporate covariates, so that the relationship between a time series of counts and correlated external factors could be properly modelled. Xu, Xie, Goh, and Fu [16] adapted the INARCH formulation and proposed a new dispersed INGARCH (DINARCH) model to handle conditional overdispersion and underdispersion. Xu et al. [16] also mention that when the dispersion parameter is not constant this model coincides with that of Zhu [14] without the moving average terms.

The idea of modelling the zero-inflation probability in a count data process as a function of covariates dates back to Lambert [17]. Zhu [2], while introducing the zero-inflated Poisson model, also presented a zero-inflated negative binomial integer-valued GARCH model, but with the assumption of a constant zero-inflation probability. In that paper two types of negative binomial distributions are discussed, namely Negative Binomial 1 (NB1) and Negative Binomial 2 (NB2). Chen and Lee [18] investigated the zero-inflated generalized Poisson autoregressive (ZIGP-AR) model of Lee, Lee, and Chen [19] and proposed the zero-inflated generalized Poisson INGARCH (ZIGP-INGARCH) model, which allows for structural breaks. In the ZIGP-INGARCH model, zero-inflation is introduced to the generalized Poisson INGARCH (GP-INGARCH) model of Zhu [20]. Gonçalves, Lopes, and Silva [21] introduced a new class of zero-inflated INGARCH models that includes general compound Poisson deviates and named it the zero-inflated compound Poisson INGARCH (ZICP-INGARCH) process. This class includes both the zero-inflated Poisson and the zero-inflated negative binomial INGARCH models of Zhu [2].

More recent developments in the INGARCH literature are as follows. Xiong and Zhu [22] and Li, Chen, and Zhu [23] considered robust estimation methods for INGARCH models. Liu, Zhu, and Zhu [24] generalized the range of observations from infinite to categorical. Cui, Li and Zhu [25] and Xu and Zhu [26] generalized the INGARCH models from the non-negative integer-valued to the integer-valued case. Lee, Kim and Seok [27] introduced a one-parameter exponential family INGARCH model with zero-inflation, named ZIEF-INGARCH. This model can accommodate the above-mentioned zero-inflated versions of INGARCH models. In addition, they developed residual-based CUSUM tests that can be used to examine change points for the proposed model. A summary of various count time series models published recently can be found in Davis et al. [28].

None of the above models allows for a time-varying zero-inflation probability. It is imperative that such an accommodation be made because empirical count data time series with a large number of zero counts can display strong cyclical behavior or seasonality with respect to the observed zero values. Ignoring this time-varying property of the zero-inflation parameter decreases the predictive performance of the model. Recognizing this, Yang [29] discussed the importance of modelling the zero-inflation probability as a time-varying function. In that article, it was assumed that both the zero-inflation and the intensity parameter are driven by a linear combination of past observations of exogenous variables, which is connected to the conditional mean of the count via a log-link function. A recently introduced approach to modelling time-varying zero-inflation and the intensity parameter is the adaptive log-linear zero-inflated generalized Poisson model proposed by Xu et al. [30]. The authors assumed that the counts are from a zero-inflated generalized Poisson distribution, with the logarithm of the intensity propagated through a GARCH type model augmented with additional terms of exogenous variables and associated coefficients. All model parameters are assumed to be time-dependent. If this time dependence and the exogenous variable are removed, then the model becomes the log-linear Poisson autoregressive formulation proposed by Fokianos and Tjøstheim [10]. If the time dependence and zero-inflation are removed, then it becomes the log-linear INGARCHX model of Chen and Lee [11]. In applying the model to monthly crime data from New South Wales, Australia, the authors assumed constant coefficients over a local interval at any given time point t (with the interval adaptively selected from among a prechosen set of nested intervals) and estimated the parameters separately using data from each such interval. In their approach, the shifting of the intervals allows the parameters to change from one interval to another, thus allowing the zero-inflation as well as other model parameters to vary with time. Note that a version of this model with a time-varying exogenous input but non-time-varying zero-inflation was included in the comparison of our proposed model against several constant zero-inflation models in an empirical setting.

We propose a different approach based on a generalization of the model proposed by Zhu [2]. In our formulation, it is the zero-inflation probability, rather than the intensity of the Poisson process, that is allowed to be governed by exogenous variables. We also assume that the INGARCH model parameters remain constant over time, allowing for a more parsimonious model that is relatively easier to estimate. While the model proposed by Xu et al. [30] is more flexible, the approach we propose is presented as a simpler and practical alternative, albeit a less sophisticated one. As mentioned earlier, the proposed model can also accommodate the case where the zero-inflation probability is driven by a deterministic function of time or driven by an exogenous variable. In addition, the intensity parameter of the Poisson process is assumed to vary dynamically through a GARCH type model. Thus, the INGARCH part of the proposed model can be viewed as observation driven, in the sense that recursive substitutions can be employed to show that the current intensity of the process conditional on the past is a linear function of past observations and past intensities. Also note that we model the conditional mean rather than the logarithm of the conditional mean as is done in Xu et al. [30].

The remainder of this paper is organized as follows. First, the Time-Varying Zero-Inflated Poisson INGARCH model (TVZIP-INGARCH) is introduced. Two cases, namely a deterministic cyclically varying zero-inflation component and a zero-inflation parameter driven by an exogenous set of stochastic variables, are discussed. A section on parameter estimation procedures is presented next, followed by the results of a simulation study. The subsequent section introduces extensions to the softplus INGARCH model. The performance of this extended model, together with additional zero-inflated INGARCH type models, is compared to that of the proposed model in the next section, which also presents the results of fitting the above models to two empirical data sets. A discussion and conclusions are presented in the final section of this paper.

The time-varying zero-inflated INGARCH model

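As Zhu [2] stated, the probability mass function (pmf) of a zero-inflated Poisson model with parameter vector (λ,ω), with X representing the count, can be written in the following form:

$$P(X = k) = \omega\,\delta_{k,0} + (1-\omega)\,\frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots,$$

where 0<ω<1, λ>0, and δ_{k,0} equals 1 when k = 0 and 0 otherwise.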

Further, Zhu [2] presented the mean and the variance of the distribution as follows:

$$E(X) = (1-\omega)\lambda, \qquad \mathrm{Var}(X) = (1-\omega)\lambda(1+\omega\lambda).$$
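The standard zero-inflated Poisson moments, E(X) = (1−ω)λ and Var(X) = (1−ω)λ(1+ωλ), can be checked numerically against the pmf. The following Python sketch does so for one illustrative parameter pair; the function name, the parameter values, and the truncation point of the support are assumptions made only for this check.

```python
import math

def zip_pmf(k, lam, omega):
    """P(X = k) for a zero-inflated Poisson with intensity lam and zero-inflation omega."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    return omega * (1.0 if k == 0 else 0.0) + (1.0 - omega) * poisson

lam, omega = 2.0, 0.3    # illustrative values only
support = range(0, 60)   # truncated support; the tail beyond 60 is negligible for lam = 2
total = sum(zip_pmf(k, lam, omega) for k in support)
mean = sum(k * zip_pmf(k, lam, omega) for k in support)
var = sum((k - mean) ** 2 * zip_pmf(k, lam, omega) for k in support)

print(total)                                        # close to 1
print(mean, (1 - omega) * lam)                      # both close to 1.4
print(var, (1 - omega) * lam * (1 + omega * lam))   # both close to 2.24
```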

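Moving on to define the proposed time-varying zero-inflated INGARCH model, assume that {X_t} is a discrete time series of count data, and F_{t−1} is the sigma field generated by {X_l, l≤t−1}. The conditional distribution of X_t given F_{t−1} and ω_t is assumed to be a zero-inflated Poisson (ZIP) with parameter vector (μ_t,ω_t). Then,

$$P(X_t = k \mid \mathcal{F}_{t-1}, \omega_t) = \omega_t\,\delta_{k,0} + (1-\omega_t)\,\frac{\lambda_t^{k} e^{-\lambda_t}}{k!}, \qquad k = 0, 1, 2, \ldots \tag{2}$$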

The dynamic propagation of the conditional mean (μ_t) of the zero-inflated Poisson process is defined by

$$\mu_t = (1-\omega_t)\lambda_t,$$

with the intensity parameter λ_t of the Poisson process formulated as

$$\lambda_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i X_{t-i} + \sum_{j=1}^{q} \beta_j \lambda_{t-j},$$

where α₀>0, α_i≥0 for i = 1,…,p, β_j≥0 for j = 1,…,q, and ω_t = g(V_t,Γ). Furthermore, g(V_t,Γ) is a function of variables, propagating over time, which is used to model the time-varying zero-inflation. Note that the elements of the vector V_t may consist of stochastic exogenous variables that vary with time, or V_t may be a scalar equal to time t. In addition, Γ denotes a vector of parameters. It is assumed that 0<ω_t<1 for all t. Note that Fokianos et al. [9] defined their linear Poisson autoregressive model for t≥1, instead of for all integers t, with constant initial conditions. The model in (2) can also be defined in a similar manner. It is this alternative formulation that was used in our simulation study.

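The above model is denoted by TVZIP-INGARCH (p, q). If p>0 and q = 0, then the model becomes a TVZIP-INARCH model with order p, denoted by TVZIP-INARCH (p). The conditional mean and conditional variance of X_t given F_{t−1} and ω_t are specified by the following equations:

$$E(X_t \mid \mathcal{F}_{t-1}, \omega_t) = (1-\omega_t)\lambda_t, \qquad \mathrm{Var}(X_t \mid \mathcal{F}_{t-1}, \omega_t) = (1-\omega_t)\lambda_t(1+\omega_t\lambda_t). \tag{3}$$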

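The conditional variance to conditional mean ratio, or the dispersion ratio, of the TVZIP distribution is:

$$\frac{\mathrm{Var}(X_t \mid \mathcal{F}_{t-1}, \omega_t)}{E(X_t \mid \mathcal{F}_{t-1}, \omega_t)} = 1 + \omega_t\lambda_t. \tag{4}$$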

Note that results (3) and (4) can be derived using the definition of the conditional mean and variance and applying standard procedures utilized in deriving similar quantities related to GARCH type models.

The result in (4) indicates that TVZIP-INGARCH (p, q) can be used to model integer-valued time series with overdispersion if the values of ω_t and λ_t are uniformly bounded below by positive constants.
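One way to see why the dispersion ratio exceeds one is the following short derivation. It is a sketch that assumes only that, conditional on the past, X_t equals zero with probability ω_t and is otherwise Poisson with intensity λ_t; the shorthand E_{t−1} and Var_{t−1} for moments conditional on F_{t−1} and ω_t is introduced here for brevity.

```latex
\begin{aligned}
E_{t-1}(X_t)   &= (1-\omega_t)\lambda_t, \\
E_{t-1}(X_t^2) &= (1-\omega_t)\left(\lambda_t + \lambda_t^2\right), \\
\mathrm{Var}_{t-1}(X_t)
  &= (1-\omega_t)\left(\lambda_t + \lambda_t^2\right) - (1-\omega_t)^2\lambda_t^2 \\
  &= (1-\omega_t)\lambda_t\left[1 + \lambda_t - (1-\omega_t)\lambda_t\right]
   = (1-\omega_t)\lambda_t\left(1 + \omega_t\lambda_t\right), \\
\frac{\mathrm{Var}_{t-1}(X_t)}{E_{t-1}(X_t)} &= 1 + \omega_t\lambda_t > 1
  \quad \text{whenever } \omega_t > 0 \text{ and } \lambda_t > 0 .
\end{aligned}
```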

Case 1: Zero-inflation driven by a deterministic function of time.

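In Case 1, it is assumed that the zero-inflation function ω_t = g(V_t,Γ) is such that V_t is a scalar equal to t. For example, we may assume the function g to be defined as follows:

$$\omega_t = A \sin\left(\frac{2\pi t}{s}\right) + B \cos\left(\frac{2\pi t}{s}\right) + C, \tag{5}$$

where s is the seasonal length and A, B, and C are parameters.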

As mentioned above, the time-varying zero-inflation function ω_t = g(V_t,Γ) should always be bounded between zero and one. The range of values for A,B, and C in (5) that are needed to satisfy the above criterion are derived in S1 Appendix. Note that herein a simple example is used, where the function g consists of a sine function and a cosine function of equal period, but g could be any other function of time that, with proper selection of parameters, can be bounded between zero and one.
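As an illustration of the bounding requirement, the Python sketch below evaluates a zero-inflation function made of a sine and a cosine term of equal period plus a constant, and checks a sufficient condition for it to stay inside (0, 1). The condition uses the standard fact that A sin(x) + B cos(x) has amplitude √(A² + B²); it is not claimed to match the form of the ranges derived in S1 Appendix. The function names and the value of C are assumptions chosen for illustration.

```python
import math

def omega_sinusoidal(t, A, B, C, s=12):
    """Zero-inflation probability A*sin(2*pi*t/s) + B*cos(2*pi*t/s) + C at time t."""
    angle = 2.0 * math.pi * t / s
    return A * math.sin(angle) + B * math.cos(angle) + C

def stays_in_unit_interval(A, B, C):
    """True when the sinusoid stays strictly inside (0, 1) for every real t.

    A*sin(x) + B*cos(x) has amplitude sqrt(A^2 + B^2), so the extremes are C -/+ that amplitude.
    """
    amplitude = math.hypot(A, B)
    return C - amplitude > 0.0 and C + amplitude < 1.0

A, B, C = 0.25, -0.20, 0.40  # illustrative values; C is an assumption, not a value from the study
omegas = [omega_sinusoidal(t, A, B, C) for t in range(1, 121)]
print(stays_in_unit_interval(A, B, C), min(omegas), max(omegas))
```

Since the check covers every real t, it is sufficient but not strictly necessary when t only takes integer values.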

Case 2: Zero-inflation function driven by an exogenous variable.

The proposed model also accommodates the case where the zero-inflation probability is determined by one or more exogenous variables. In this case g(V_t,Γ) is considered a function, with the interval (0, 1) as its range, of the vector of exogenous variables V_t. One example of g is the logistic function. Note that V_t can be a scalar seasonal autoregressive time series, a vector seasonal time series, or a scalar or vector time series that varies non-seasonally. For illustrative purposes, consider the case where V_t is a scalar purely seasonal autoregressive time series with period s, g is the logistic function, and ε_t is a white noise error term. Then we can write,

$$\omega_t = \frac{\exp(\delta_0 + \delta_1 V_t)}{1 + \exp(\delta_0 + \delta_1 V_t)}, \qquad V_t = \eta V_{t-s} + \varepsilon_t. \tag{6}$$
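A minimal Python sketch of this second case is given below: a purely seasonal autoregressive series is simulated and passed through a logistic function to obtain zero-inflation probabilities. The Gaussian form of the white noise, the burn-in length, the zero start-up values, the parameter names delta0, delta1, and eta as used in the code, and the function names are all assumptions made for illustration.

```python
import math
import random

def simulate_seasonal_ar(n, eta, s=12, sigma=1.0, burn_in=240, seed=0):
    """Simulate V_t = eta * V_{t-s} + eps_t with Gaussian white noise eps_t (an assumption)."""
    if abs(eta) >= 1.0:
        raise ValueError("|eta| must be below 1 for stationarity")
    rng = random.Random(seed)
    v = [0.0] * s  # zero start-up values, washed out by the burn-in
    for _ in range(burn_in + n):
        v.append(eta * v[-s] + rng.gauss(0.0, sigma))
    return v[-n:]

def logistic_omega(v, delta0, delta1):
    """Map the exogenous value v to a zero-inflation probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(delta0 + delta1 * v)))

v = simulate_seasonal_ar(n=120, eta=0.25)
omega = [logistic_omega(x, delta0=-1.0, delta1=-1.0) for x in v]
print(min(omega), max(omega))
```

With a negative slope coefficient, larger values of the exogenous variable give smaller zero-inflation probabilities, and the logistic form keeps every probability strictly between zero and one without further constraints.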

Estimation procedure

Both the Expectation Maximization (EM) algorithm and the Maximum Likelihood (ML) method were developed to estimate the model parameters for the general TVZIP-INGARCH (p, q) case. Estimation for the TVZIP-INGARCH (p, q) process is discussed below, and the procedure for the TVZIP-INARCH (p) case follows in a similar manner.

Expectation maximization estimation for the TVZIP-INGARCH (p, q) process.

Let X₁,X₂,…,X_N be generated according to the model (2). There are two types of zeros generated by this model. They are the zeros arising from the Poisson distribution with intensity parameter λ_t and the zeros generated by a Bernoulli process with the probability of obtaining a zero specified by the zero-inflation parameter. Therefore, a given observation can be hypothetically categorized as arising out of a Bernoulli process or as an observation from the Poisson distribution. Let us define Z_t to be a Bernoulli random variable such that Z_t = 1 if X_t is generated from the Bernoulli process and Z_t = 0 if it is generated by the Poisson distribution. Then, P(Z_t = 1) = ω_t and P(Z_t = 0) = 1−ω_t.

Also, let Z =(Z₁,Z₂,…,Z_N), and ω_t = g(V_t,Γ). Note that , where r is the dimension of the vector V_t. For notational simplicity, we define the composite parameter vector , with the original parameters renamed as ϕ_k, k = 1, 2,…,r+p+q+2. This simplified notation is used in situations where generic statements are made without reference to a specific portion of (2).

Paralleling the derivations in Zhu [2], the conditional log likelihood can be written as (see S2 Appendix for details), (7)

The first derivatives of the conditional log likelihood function (7) with respect to and are as follows: (8) (9)

Finally, by combining (8) and (9) the first derivative of the conditional log likelihood function with respect to Φ is given by: (10)

The two-step (E step and M step) Expectation Maximization algorithm is now used to estimate the parameter vector . Let and we replace Z_t by and define . Following this replacement of Z in the conditional log likelihood function, is maximized.

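E Step: Determine τ_t using the equation

$$\tau_t = \begin{cases} \dfrac{\omega_t}{\omega_t + (1-\omega_t)e^{-\lambda_t}}, & X_t = 0, \\[2ex] 0, & X_t = 1, 2, \ldots \end{cases}$$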

M Step: After Z_t is replaced by its estimate, we proceed to maximize . First set for k = 1,2,…,r+p+q+2.

If, , the solution to the system of equations in (10) exists, then , where S(Φ) is the Fisher’s score matrix, and is the vector that maximizes the log likelihood, thus providing us with the estimate of the parameter vector .
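The E step described above can be sketched in Python as follows. The sketch assumes the usual zero-inflated Poisson posterior, in which an observed zero is attributed to the zero-inflation part with probability ω_t / (ω_t + (1−ω_t)e^{−λ_t}) and a positive count is attributed to the Poisson part with certainty; the function name and the input values are illustrative only.

```python
import math

def e_step(x, lam, omega):
    """Posterior probability tau_t that each observation came from the zero-inflation part.

    x, lam, and omega are equal-length sequences of counts, Poisson intensities, and
    zero-inflation probabilities evaluated at the current parameter values.
    """
    tau = []
    for x_t, lam_t, omega_t in zip(x, lam, omega):
        if x_t == 0:
            structural = omega_t
            poisson_zero = (1.0 - omega_t) * math.exp(-lam_t)
            tau.append(structural / (structural + poisson_zero))
        else:
            tau.append(0.0)  # a positive count can only come from the Poisson part
    return tau

tau = e_step(x=[0, 3, 0], lam=[1.0, 2.0, 0.5], omega=[0.2, 0.2, 0.6])
print(tau)
```

In an EM iteration these values would stand in for the unobserved indicators before the maximization step is carried out.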

Since a closed form solution does not exist, we require an iterative procedure to find the estimates. Let us consider the first order Taylor expansion of evaluated at the value around the initial parameter values Φ₀, yielding . We also let the matrix of the second derivatives of the log likelihood function to be defined as .

From the above, we obtain the first order approximation , and this result provides the standard Newton-Raphson algorithm. For an appropriately chosen initial value , the above Newton-Raphson algorithm can be used to obtain a sequence of improved estimates recursively. The improved estimates at i^th iteration are updated as the initial values for the next iteration as follows:

This process is repeated until the differences between successive estimates are sufficiently close to zero. In our study, convergence of the EM procedure was determined by using the criterion:

Maximum likelihood estimation for the TVZIP-INGARCH (p, q) process.

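The conditional likelihood function L(Φ) of the TVZIP-INGARCH model (2) is,

$$L(\Phi) = \prod_{t=1}^{N} \left[\omega_t + (1-\omega_t)e^{-\lambda_t}\right]^{I(X_t = 0)} \left[(1-\omega_t)\frac{\lambda_t^{X_t} e^{-\lambda_t}}{X_t!}\right]^{I(X_t > 0)}, \tag{11}$$

where I(·) denotes the indicator function.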

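The conditional log likelihood function, l(Φ), obtained from (11) is given by

$$l(\Phi) = \sum_{t=1}^{N} \left\{ I(X_t = 0)\log\left[\omega_t + (1-\omega_t)e^{-\lambda_t}\right] + I(X_t > 0)\left[\log(1-\omega_t) + X_t\log\lambda_t - \lambda_t - \log(X_t!)\right] \right\}, \tag{12}$$

where I(·) denotes the indicator function.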
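A minimal Python sketch of a conditional log likelihood of this type, specialized to the (1, 1) case, is given below. It assumes that the zero-inflation probabilities ω_t have already been evaluated, that a zero contributes log[ω_t + (1−ω_t)e^{−λ_t}], and that a positive count contributes the log of (1−ω_t) times the Poisson pmf. The function name and the start-up values lam0 and x0 are assumptions for illustration.

```python
import math

def tvzip_ingarch11_loglik(x, omega, alpha0, alpha1, beta1, lam0=0.0, x0=0):
    """Conditional log likelihood of a zero-inflated Poisson INGARCH(1, 1) series.

    omega holds the zero-inflation probability at each time point; lam0 and x0 are
    the assumed start-up intensity and count for the recursion.
    """
    loglik = 0.0
    lam_prev, x_prev = lam0, x0
    for x_t, omega_t in zip(x, omega):
        lam_t = alpha0 + alpha1 * x_prev + beta1 * lam_prev
        if x_t == 0:
            loglik += math.log(omega_t + (1.0 - omega_t) * math.exp(-lam_t))
        else:
            loglik += (math.log(1.0 - omega_t) + x_t * math.log(lam_t)
                       - lam_t - math.lgamma(x_t + 1))
        lam_prev, x_prev = lam_t, x_t
    return loglik

x = [0, 2, 0, 1, 3, 0]
omega = [0.3] * len(x)
print(tvzip_ingarch11_loglik(x, omega, alpha0=1.0, alpha1=0.2, beta1=0.2))
```

A numerical optimizer that minimizes its objective would be given the negative of this function, with ω_t recomputed from the zero-inflation parameters at each evaluation.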

Let and . Then, (13) and (14)

The first derivatives of the conditional log likelihood function (12) are as follows, (15) (16)

We can use Newton-Raphson (NR) iterative procedure to obtain the maximum likelihood estimated for the Eq (12) by setting for all k. With a reasonable initial starting value , the i^th iteration is calculated using , where and for k = 1,2,…,r+p+q+2. We stop the algorithm once pre specified convergence criteria is satisfied.

Both the EM and the ML estimation procedures require careful selection of initial values of the parameter vector to initiate the iterative algorithm. As described in Fokianos et al. [9], the starting value is computed by fitting a ARMA (p, q) model to the data. The initial value for the parameter in is selected from a range of values on a grid that satisfies the conditions of the time-varying zero inflation function. As discussed in Weiß [31], when q>0 the parameter estimates of a INGARCH (p, q) are extremely sensitive to the choice of the initial conditional mean. In other words, inappropriate initialization of may exhibit a significant effect on . To address this issue, two courses of actions for computing are suggested. One is to treat as a parameter during estimation such that and the other is to use a fixed value, for example . The results (see S3 Appendix) show that both initialization procedures produce relatively similar estimates, with a slightly better fit for the data observed for the initialization case with . Therefore, the initial values of this study are specified according to case.

Simulation study

We investigated the finite sample performance of the estimators using a simulation study. The poissrnd function of MATLAB software was employed to generate the relevant data, based on a recursively computed conditional intensity parameter. In order to initiate the recursive process, the intensity at time t = 0 and count data at times t≤0 were set to zero (i.e., λ₀ = 0 and X_l = 0 for l≤0). For time periods t≥1, X_t was generated as follows. For each t, let U_t be a random variable generated from a uniform (0,1) distribution and let ω_t be the zero-inflation probability at time t. Then, X_t was set to zero if U_t≤ω_t. Otherwise, X_t was generated from the Poisson distribution with intensity parameter λ_t, where λ_t was updated recursively using Eq (2). The process was repeated until the complete time series of length N was generated. Note that this is the same procedure Zhu [2] employed to generate data for his simulation study. Lengths of the time series studied were set to N = 120 and N = 360, and one thousand (m = 1000) simulation runs were carried out for each parameter and sample size combination. We carried out two separate sets of simulation studies based on the two types of zero-inflation function introduced in Section 2. The profile log likelihood function given in Eq (12) was maximized using the constrained nonlinear optimization function fmincon in MATLAB. The zero-inflation probability (ω_t = g(V_t,Γ)) was allowed to vary cyclically as a deterministic function of time or to be driven by an exogenous variable. Following Zhu [2], the Mean Absolute Deviation Error (MADE) was utilized as the evaluation criterion. The MADE is defined as

$$\mathrm{MADE} = \frac{1}{m}\sum_{j=1}^{m}\left|\hat{\phi}_j - \phi\right|,$$

where m is the number of replications, ϕ is the true value, and ϕ̂_j is the estimated value of ϕ at the j^th replication run. In addition, computational effort is expressed by CPU time (in seconds) and is used as a criterion to evaluate performance. Simulation results for Case 1 and Case 2 are reported below.
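The data-generating steps above can be sketched in Python for the (1, 1) case as follows. This is only an illustration: the study itself used MATLAB, the Knuth-style sampler here stands in for poissrnd, and the function names, the sinusoidal ω_t values (including the constant 0.40), and the parameter values are assumptions. The MADE helper averages the absolute deviations of the replication estimates from the true value.

```python
import math
import random

def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method (adequate for small lam)."""
    if lam <= 0.0:
        return 0
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def simulate_tvzip_ingarch11(omega, alpha0, alpha1, beta1, seed=0):
    """Generate one series: X_t = 0 if U_t <= omega_t, otherwise X_t ~ Poisson(lam_t)."""
    rng = random.Random(seed)
    lam_prev, x_prev = 0.0, 0  # start-up values: lambda_0 = 0 and X_0 = 0
    xs = []
    for omega_t in omega:
        lam_t = alpha0 + alpha1 * x_prev + beta1 * lam_prev
        u_t = rng.random()
        x_t = 0 if u_t <= omega_t else poisson_draw(lam_t, rng)
        xs.append(x_t)
        lam_prev, x_prev = lam_t, x_t
    return xs

def made(estimates, true_value):
    """Mean absolute deviation of the replication estimates from the true value."""
    return sum(abs(e - true_value) for e in estimates) / len(estimates)

# Illustrative sinusoidal zero-inflation probabilities with a 12-period cycle.
omega = [0.25 * math.sin(2 * math.pi * t / 12) - 0.20 * math.cos(2 * math.pi * t / 12) + 0.40
         for t in range(1, 121)]
xs = simulate_tvzip_ingarch11(omega, alpha0=1.0, alpha1=0.2, beta1=0.2)
print(sum(1 for x in xs if x == 0), "zeros out of", len(xs))
print(made([1.1, 0.9, 1.0], true_value=1.0))
```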

Simulation results for Case 1: Deterministic sinusoidal zero-inflation function.

In this portion of the simulation study, the sinusoidal zero-inflated function ω_t = g(V_t,Γ) expressed in Eq (5) was used to generate cyclically varying zero-inflation probabilities between zero and one. The following constraints are set to the parameters in the vector , where , , and a constant such that . Note that the above constraints, with δ = 0.0001, were applied in our simulation study in order to bound the zero-inflation probabilities between 0 and 1. A very small value for δ was selected to allow wider bounds for , |A|, and |B|.

Tables 1 through 3 provide the simulation results for the MLE estimation technique, while Tables 4 through 6 provide simulation results for the case where estimates were obtained using the EM algorithm. The period of the sinusoidal wave was set at s = 12, mimicking a 12-month cycle present in monthly data. The parameter vector for the simulation study consisted of A and B, the parameters in the sinusoidal model, together with (α₀,α₁), (α₀,α₁,α₂), or (α₀,α₁,β₁), the parameter combinations in the TVZIP-INARCH (1), TVZIP-INARCH (2), and TVZIP-INGARCH (1, 1) models, respectively. The parameter combination of Γ = (A,B)^T was set at (0.10, 0.10)^T, (0.25, −0.20)^T, and (−0.35, −0.30)^T, representing minimal to minimal, minimal to moderate, and minimal to maximum zero-inflation ranges. The following models were considered:

(A) TVZIP-INARCH (1) models:

A1.

A2.

A3.

(B) TVZIP-INARCH (2) models:

B1.

B2.

B3.

(C) TVZIP-INGARCH (1,1) models:

C1.

C2.

C3.

Note that the two estimation procedures were run on identical simulation samples for each model, parameter, and sample size combination, and hence variations due to sampling error will not be seen when comparing across estimation methods. The following tables provide summary data from the simulation runs, with the actual parameter values, estimated values, the mean absolute deviation (MADE) between them, and the CPU time taken by the estimation procedures.

Table 1. Means of MLE estimates, MADE (within parentheses), and computational efforts (CPU time) for TVZIP-INARCH (1) models where zero-inflation is driven by a sinusoidal function.

https://doi.org/10.1371/journal.pone.0285769.t001

Table 2. Means of MLE estimates and MADE (within parentheses), and computational efforts (CPU time) for TVZIP-INARCH (2) models where zero-inflation is driven by a sinusoidal function.

https://doi.org/10.1371/journal.pone.0285769.t002

Table 3. Means of MLE estimates, MADE (within parentheses), and computational efforts (CPU time) for TVZIP-INGARCH (1, 1) models where zero-inflation is driven by a sinusoidal function.

https://doi.org/10.1371/journal.pone.0285769.t003

Table 4. Means of EM estimates, MADE (within parentheses), and computational efforts (CPU time) for TVZIP-INARCH (1) models where zero-inflation is driven by a sinusoidal function.

https://doi.org/10.1371/journal.pone.0285769.t004

Table 5. Means of EM estimates, MADE (within parentheses), and computational efforts (CPU time) for TVZIP-INARCH (2) models where zero-inflation is driven by a sinusoidal function.

https://doi.org/10.1371/journal.pone.0285769.t005

Table 6. Means of EM estimates, MADE (within parentheses), and computational efforts (CPU time) for TVZIP-INGARCH (1, 1) models where zero-inflation is driven by a sinusoidal function.

https://doi.org/10.1371/journal.pone.0285769.t006

The above simulation results show that the MLE and EM procedures produced almost identical means of estimates for the parameters in the TVZIP-INARCH (1) and TVZIP-INARCH (2) models. For example, the means of estimates in Table 1 are almost identical to the corresponding means of estimates in Table 4. The MADE values for corresponding estimates are also almost identical across the two estimation methods. This similarity extends to the means of corresponding parameter estimates across Tables 2 and 5 as well. Even in cases where the means of estimates are not identical, they are extremely close. For instance, for Model A1 with N = 120 and the true value of α₀ equal to 1.00, the mean of the MLE estimates for this parameter is 1.0473 while the mean of the EM estimates is 1.0472.

However, for the TVZIP-INGARCH (1, 1) process, there are relatively larger differences between the means of estimates for the MLE and EM methods. For example, when N = 120 with true parameter values of α₀ = 1.00, α₁ = 0.20, β₁ = 0.20, the means of the parameter estimates for the MLE method (Table 3, row 1) differ from the EM mean estimates (Table 6, row 1). Generally, the larger sample size produced means of estimates closer to their true values. An exception to this is observed in the case of TVZIP-INGARCH (1, 1), where the means of estimates for (α₁,β₁) did not improve with increasing sample size. For instance, when N = 120 with true values of α₁ = 0.40, β₁ = 0.30, the means of the corresponding MLE estimates are 0.4668 and 0.2249 respectively (see Table 3, Model C3), but when the sample size increases to 360, the means of the corresponding estimates change to 0.4961 and 0.2008 respectively, which is a movement in the wrong direction. The MADE values decreased consistently with increasing sample size for both the MLE and EM estimation methods. Another relevant observation is that the simulation results for the INGARCH portion (α₁,β₁) of the model behave similarly to the results obtained by Zhu [2], in the sense that the means of the estimates are not very close to the true values even with larger sample sizes. We investigated whether this relative inaccuracy of the estimates is due to the initial values that were used for these two parameters. This was done by creating a grid of potential initial values centered on the original initial values obtained by fitting an ARMA model to the data. This approach did not improve the results and thus we reverted to using the original initial values. It is possible that many combinations of α₁ and β₁ values provide very similar likelihoods and thus the algorithm does not converge to the true values efficiently. Another important observation is that the EM algorithm took more CPU time than the MLE method across all parameter and model combinations.

Simulation study for Case 2: Zero-inflation function driven by an exogenous variable

In this part of the study, we allow the exogenous variable to generate zeros through a logistic model as described in Eq (6). The parameter vector for the simulation study under this scenario consists of δ₀ and δ₁, the parameters in the logistic part of the model, together with (α₀,α₁), (α₀,α₁,α₂), or (α₀,α₁,β₁), the parameter combinations for the TVZIP-INARCH (1), TVZIP-INARCH (2), and TVZIP-INGARCH (1, 1) models, respectively. The parameter combinations of δ₀ and δ₁ were set to (-2, 0), (-1, -1) and (2, 1), representing three types of changes in zero-inflation probability with respect to the exogenous variable. These are no change (δ₁ = 0), a decrease (δ₁<0), and an increase (δ₁>0) in the zero-inflation probability with increasing values of the exogenous variable. We generated an exogenous stationary AR (12) time series given in Eq (6) using η = 0.25. The following models were considered:

(A) TVZIP-INARCH (1) models:

A1.

A2.

A3.

(B) TVZIP-INARCH (2) models:

B1.

B2.

B3.

(C) TVZIP-INGARCH (1,1) models:

C1.

C2.

C3.

Tables 7 through 9 provide the simulation results for the MLE technique, while Tables 10 through 12 provide the simulation results for the EM algorithm.
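The data-generating mechanism for Case 2 can be sketched as follows for the TVZIP-INARCH (1) case. This is a minimal illustration under stated assumptions, not the authors' simulation code: the function names, the burn-in length, the standard normal innovations, and the lag structure of the exogenous autoregression (`lag`) are all assumptions, and the α values in the usage line are illustrative only. The logistic form ω_t = 1 / (1 + exp(-(δ₀ + δ₁V_t))) is assumed to correspond to Eq (6).

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate by multiplying uniforms (suitable for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_tvzip_inarch1(n, alpha0, alpha1, delta0, delta1,
                           eta=0.25, lag=12, burn_in=200, seed=0):
    """Simulate X_t ~ ZIP(lambda_t, omega_t) with
    lambda_t = alpha0 + alpha1 * X_{t-1} and
    omega_t  = logistic(delta0 + delta1 * V_t)."""
    rng = random.Random(seed)
    total = n + burn_in
    v = [0.0] * total       # exogenous series
    x = [0] * total         # simulated counts
    omega = [0.0] * total   # zero-inflation probabilities
    for t in range(total):
        # Exogenous autoregression; the lag structure is an assumption
        past_v = v[t - lag] if t >= lag else 0.0
        v[t] = eta * past_v + rng.gauss(0.0, 1.0)
        omega[t] = 1.0 / (1.0 + math.exp(-(delta0 + delta1 * v[t])))
        lam = alpha0 + alpha1 * (x[t - 1] if t > 0 else 0)
        if rng.random() < omega[t]:
            x[t] = 0                      # structural zero
        else:
            x[t] = poisson_draw(rng, lam)  # count from the Poisson part
    return x[burn_in:], v[burn_in:], omega[burn_in:]

# (delta0, delta1) = (-2, 0) is the "no change" setting from the text;
# alpha0 = 1.0 and alpha1 = 0.4 are illustrative values only.
x, v, omega = simulate_tvzip_inarch1(120, 1.0, 0.4, -2.0, 0.0)
```

With δ₁ = 0 the zero-inflation probability is the same at every time point, whereas negative or positive δ₁ makes it fall or rise with the exogenous series.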

Table 7. Means of MLE Estimates, MADE (within parentheses), and computational efforts (CPU time) for TVZIP-INARCH (1) models where zero-inflation is driven by an exogenous variable.

https://doi.org/10.1371/journal.pone.0285769.t007

Table 8. Means of MLE Estimates, MADE (within parentheses), and computational efforts (CPU time) for TVZIP-INARCH (2) models where zero-inflation is driven by an exogenous variable.

https://doi.org/10.1371/journal.pone.0285769.t008

Table 9. Means of MLE Estimates, MADE (within parentheses), and computational efforts (CPU time) for TVZIP-INGARCH (1, 1) models where zero-inflation is driven by an exogenous variable.

https://doi.org/10.1371/journal.pone.0285769.t009

Table 10. Means of EM Estimates, MADE (within parentheses), and computational efforts (CPU time) for TVZIP-INARCH (1) models where zero-inflation is driven by an exogenous variable.

https://doi.org/10.1371/journal.pone.0285769.t010

Table 11. Means of EM Estimates, MADE (within parentheses), and computational efforts (CPU time) for TVZIP-INARCH (2) models where zero-inflation is driven by an exogenous variable.

https://doi.org/10.1371/journal.pone.0285769.t011

Table 12. Means of EM Estimates, MADE (within parentheses), and computational efforts for TVZIP-INGARCH (1, 1) models where zero-inflation is driven by an exogenous variable.

https://doi.org/10.1371/journal.pone.0285769.t012

In general, the EM algorithm takes more computational effort than the MLE procedure to estimate the parameters of the same model. Both the MLE and EM methods produced fairly accurate estimates for the parameters across both TVZIP-INARCH (p) models. However, for the TVZIP-INGARCH (1, 1) case, the estimates of the GARCH portion of the parameters (α₁,β₁) are relatively less accurate under both procedures, even for the large sample size. For instance, when N = 120 with true values of α₁ = 0.40 and β₁ = 0.30, the means of the corresponding parameter estimates are 0.3916 and 0.2439 respectively for the MLE method, and when N = 360 these means change to 0.4794 and 0.2205 respectively (see Table 9, Model C3). A similar phenomenon is seen in the simulation results of Zhu [2]. As was done in Case 1, using additional initial value combinations did not yield any improvement in this regard. In general, MADE values decrease as the sample size increases.

Extension to the softplus INGARCH model

Herein we introduce a generalization of the softplus INGARCH model proposed by Weiß et al. [12] by incorporating zero-inflation into the underlying count-data distribution and allowing a set of exogenous variables to act as inputs to the model that defines the conditional mean of the process. In the following, we assume that the underlying distribution is a zero-inflated version of the Poisson, the negative binomial, or the generalized Poisson distribution. The zero-inflated softplus INGARCHX model we developed is defined as follows.

Let {X_t} be the count process, and let μ_t denote the conditional mean of X_t given the sigma-fields and such that . Here, is the sigma-field generated by and is the sigma-field generated by the set of exogenous covariates {V₁,…,V_t} with for j = 1,…,t. Note that Θ is a set of parameters defined later in the section.

Let be a probability mass function of a discrete random variable U_t, with conditional mean , for . We define λ_t such that if the distribution of U_t is Poisson or negative binomial, and if the distribution is generalized Poisson, where φ is the dispersion parameter. This defines the underlying count data distribution without zero inflation. Now let ω be the zero-inflation probability associated with X_t. Then the conditional probability mass function of X_t is given by,

where Then the conditional mean of X_t given and is for . If represents a Poisson, negative binomial Type 1, or negative binomial Type 2 distribution, then g(λ_t) = λ_t. If the underlying distribution is generalized Poisson, then In the proposed softplus INGARCHX version, we let λ_t be defined as with the softplus link function given by

s_c(x) = c ln(1 + exp(x/c)), with c>0.
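For readers who wish to experiment with the link function, the softplus function of Weiß et al. [12], s_c(x) = c ln(1 + exp(x/c)) with c > 0, can be evaluated as in the sketch below. The function name `softplus` and the overflow-safe rearrangement are implementation choices made here for illustration, not part of the paper.

```python
import math

def softplus(x, c=1.0):
    """Softplus s_c(x) = c * log(1 + exp(x / c)) for c > 0, evaluated stably."""
    if c <= 0:
        raise ValueError("c must be positive")
    z = x / c
    # log(1 + exp(z)) = max(z, 0) + log1p(exp(-|z|)), which avoids overflow
    return c * (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
```

The function is nonnegative for any real argument, so negative contributions to the linear predictor are permitted while the resulting mean parameter remains valid. As c approaches zero, s_c(x) approaches max(0, x).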

The set of parameters in the model can now be defined as

Real data examples

In this section the proposed TVZIP-INGARCH (p, q) model is applied to two real-world datasets. The time-varying component given by Eq (5) can be formulated in many ways. In example one, we selected two scenarios for this component, and labeled them Sc1 and Sc2. In example two, the time-varying component was labeled as obeying Scenario 4 (Sc4). Details regarding these scenarios can be found under each example. The proposed model’s performance is compared with constant zero-inflated probability versions of the INGARCH (p, q) model such as zero-inflated Poisson (ZIP), zero-inflated negative binomial Type 1(ZINB1), zero-inflated negative binomial Type 2 (ZINB2), all proposed by Zhu [2], and the zero-inflated generalized Poisson (ZIGP) model proposed by Chen and Lee [18]. The zero-inflated compound Poisson INGARCH models such as zero-inflated geometric Poisson INGARCH (ZIGEOMP-INGARCH) and the zero-inflated Neyman Type A INGARCH (ZINTA-INGARCH) of Gonçalves et al. [21], were also fitted to the data. In addition, the following formulations were also included in the comparison. The log-linear INGARCH model of Fokianos and Tjøstheim [10] was modified to include zero-inflation. We also modified the log-linear INGARCHX model of Chen and Lee [11] to include zero-inflation, but it should be noted that with the inclusion of zero-inflation the above two models are nested within the non-time-varying version of the model of Xu et al. [30]. We extended the softplus INGARCH model introduced by Weiß et al. [12], as described in an earlier section, by introducing zero inflation and an exogenous variable. Apart from our proposed models, this provides us with six distinct model categories or model structures, namely the zero-inflated versions of C1: INGARCH, C2: compound Poisson INGARCH, C3: log-linear INGARCH, C4: log-linear INGARCHX, C5: softplus INGARCH, and C6: softplus INGARCHX. We denote the versions of our proposed model under the C0 category. 
Note that model categories C1–C6 all have constant zero-inflation probability, and hence for consistency, we label them as falling under zero-inflation Scenario 3 (Sc3).

The INGARCH, log-linear INGARCH, log-linear INGARCHX, the softplus INGARCH, and the softplus INGARCHX models were fitted assuming the zero-inflated versions of the Poisson, negative binomial Type 1, negative binomial Type 2, and generalized Poisson distributions. For definitions of negative binomial Type 1 and Type 2 distributions, the reader is referred to Zhu [2]. Note that the compound Poisson model of Gonçalves et al. [21] was fitted assuming the zero-inflated geometric Poisson and the zero-inflated Neyman Type A distributions. The above combinations of model categories and distributions yield a total of twenty-three count data time series formulations. These twenty-three formulations are listed in Table 1 in S4 Appendix. These formulations were then fitted using three model orders, namely M1 with p = 1,q = 0; M2 with p = 2,q = 0; and M3 with p = 1,q = 1. This gives rise to sixty-nine total models.

Not all of these sixty-nine models were fitted to data in each of the two examples shown below. Example 1 does not contain data on an exogenous variable and hence the log-linear INGARCHX and the softplus INGARCHX model categories were not used in the first example. Models belonging to the above two categories, however, were fitted to the data in Example 2 in place of the log-linear INGARCH and the softplus INGARCH model categories. Excluding versions of the proposed model, forty-five models with constant zero-inflation probability were fitted to each of the example data sets, with eighteen models common to both sets (twelve models from C1 category and six models from C2 category).

Since we are comparing the proposed time-varying zero-inflation model against models with constant zero-inflation probability, an argument can be made that any superior performance by the proposed model can be due to the presence of a change point. Thus, fitting a suitably selected constant zero-inflation model separately to the two sub-series before and after a detected change point may negate any such superiority. To address this potential criticism, the following approach was taken with respect to the INGARCH models (C1 category) with zero-inflated versions of the Poisson, negative binomial Type 1, negative binomial Type 2, and generalized Poisson distributions under model orders M1, M2, and M3. First, we fitted the above set of constant zero-inflation models to the example data. Then the model with the smallest AIC value from among the zero-inflated INGARCH models within each model order M1, M2, M3 was selected. We repeated the same procedure using the Bayesian Information Criterion (BIC) and found that the models selected using the AIC values also had the lowest BIC values. Following that, the CUSUM test proposed by Lee et al. [27] was applied using the residuals from the selected model. If a change point was detected, the model under consideration was fitted independently to the two sub-series before and after the change point. It is noteworthy that a change point was detected every time this test was carried out. The AIC and BIC values were computed for all models, including in cases where the model was fitted to two sub-series. Note that the theoretical properties of the CUSUM test proposed by Lee et al. [27] are based on ten assumptions, which hold for the zero-inflated versions of the Poisson, negative binomial, and generalized Poisson distributions, according to Lee et al. [27] and Lee and Lee [32]. We did not, however, verify whether models in other categories would also satisfy all these assumptions. Thus, we carried out this procedure for all models falling into the zero-inflated INGARCH (C1) category but did not do so for other model categories.
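The information criteria used in this comparison can be computed as in the following sketch. The definitions AIC = -2 log L + 2k and BIC = -2 log L + k ln(n) are standard. The rule for combining two sub-series fits (adding the log-likelihoods and doubling the parameter count, without counting the change point itself as a parameter) is an assumption made here for illustration; the text does not state which convention was used. The function names and the numbers in the usage lines are hypothetical.

```python
import math

def aic(loglik, n_params):
    """Akaike information criterion."""
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion."""
    return -2.0 * loglik + n_params * math.log(n_obs)

def split_fit_criteria(loglik_before, loglik_after, n_params, n_before, n_after):
    """AIC and BIC when one model is fitted separately to two sub-series.

    Assumed convention: the two segments are treated as independent, so the
    log-likelihoods add and the number of parameters doubles.
    """
    loglik = loglik_before + loglik_after
    k = 2 * n_params
    return aic(loglik, k), bic(loglik, k, n_before + n_after)

# Hypothetical log-likelihoods, for illustration only
full_aic = aic(-230.0, 4)
split_aic, split_bic = split_fit_criteria(-100.0, -120.0, 4, 70, 90)
```

Under this convention the split fit is preferred only when its gain in log-likelihood outweighs the penalty for the additional parameters, and the BIC penalty per parameter, ln(n), exceeds the AIC penalty of 2 whenever n is at least 8.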

Following the above process, we identified models with the lowest AIC and BIC values within each model category (C1, C2, C3, C4, C5, C6) by model order (M1, M2, M3) combination. Models thus identified can be seen in cells highlighted in light gray in Table 1B of S5 Appendix (for Example 1) and Table 1B of S6 Appendix (for Example 2). Note that Table 1A in each Appendix provides the AIC and BIC values for versions of our proposed model. Within each model category (but across model orders), we identified the model with the lowest AIC value and the model with the lowest BIC value among models highlighted in light gray. Then from among the above models, we also identified the best model(s) across all model categories by using AIC and BIC values. After the best model or models across all model categories were selected, standardized Pearson residuals were calculated, and used to check model adequacy.

The first example is based on the Influenza A associated pediatric deaths data set downloaded from the Centers for Disease Control and Prevention web page. These data were modelled using the TVZIP-INGARCH (p, q) formulation with a sinusoidal zero-inflation function. In the second example, we used the data set on pediatric mortality caused by Influenza B, downloaded from the same web page, to demonstrate the performance of the TVZIP-INGARCH (p, q) process where the zero-inflation is driven by an exogenous variable. The data sets used in this study are available in the supplement labeled S1 Data.

Real data example—Use of a deterministic sinusoidal zero-inflation function

The TVZIP-INARCH (1), TVZIP-INARCH (2), and TVZIP-INGARCH (1, 1) models with sinusoidal zero-inflation were fitted to the Influenza A associated pediatric mortality data set. Performance of the proposed models was compared to those of the constant zero-inflated count data time series models mentioned above.

The data set provides weekly counts of U.S. pediatric deaths caused by the type A influenza virus over the period from week 40 of 2015 to week 43 of 2018. This consists of 160 weekly observations of pediatric death counts. The data were extracted from the weekly U.S. Influenza Surveillance Report, published by the Centers for Disease Control and Prevention (CDC) [33]. Summary statistics of the data showed a mean of 1.5063 and a variance of 6.352, suggestive of overdispersion. Fig 1 illustrates the frequency of each pediatric mortality count caused by the influenza A virus using a bar chart. Observe that there are 86 zeros, which comprise 53.75% of the total time points.
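The summary statistics quoted above can be turned into two quick diagnostics, sketched below: the dispersion index (variance divided by mean) and the share of zeros that a plain Poisson distribution with the same mean would produce. The variable names are illustrative, and the Poisson benchmark ignores serial dependence, so it is only a rough marginal comparison rather than a formal test.

```python
import math

# Summary statistics reported in the text for the Influenza A series
mean, variance = 1.5063, 6.352
n_obs, n_zeros = 160, 86

dispersion_index = variance / mean            # equals 1 for a Poisson distribution
observed_zero_share = n_zeros / n_obs         # 0.5375, as reported
poisson_zero_share = math.exp(-mean)          # P(X = 0) under Poisson(mean)
expected_poisson_zeros = n_obs * poisson_zero_share

print(f"dispersion index:       {dispersion_index:.2f}")
print(f"observed zero share:    {observed_zero_share:.4f}")
print(f"Poisson zero share:     {poisson_zero_share:.4f}")
print(f"expected Poisson zeros: {expected_poisson_zeros:.1f}")
```

The dispersion index is a little above 4, and a Poisson distribution with this mean would be expected to produce roughly 35 zeros in 160 weeks, far fewer than the 86 observed.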

Fig 1. Bar chart of Influenza A associated pediatric deaths.

https://doi.org/10.1371/journal.pone.0285769.g001

Fig 2 illustrates the original time series of the counts followed by the plots of its autocorrelation function (ACF) and the partial autocorrelation function (PACF), respectively. The bar chart shows the excess number of zeros in the data, and the time series plot demonstrates prolonged periods of zero counts, which supports the use of zero-inflated Poisson time series models to analyze this data. Furthermore, we can observe an annual seasonality in the portions of the time series with excessive zero counts.

Fig 2. Influenza A associated pediatric deaths, sample ACF plot, and sample PACF plot.

https://doi.org/10.1371/journal.pone.0285769.g002

To understand the zero-inflation behaviour of this data set, we aggregated the weekly data into their corresponding calendar months and constructed the total monthly zero mortality counts. These counts were then converted into monthly proportions by dividing each monthly total of zeros by the maximum monthly zero count over the observed period, which scales the values to lie at or below one. According to Fig 3, the plot of the monthly proportion of zero counts exhibits an approximately sinusoidal behaviour throughout the observed time span. Thus, the general sinusoidal function mentioned in Eq (5) can be used to model the zero-inflation behaviour of these data. More details of this modelling will be described later in this section.
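The aggregation described above can be sketched as follows. The function name `monthly_zero_proportion` and the toy data are hypothetical, and assigning each week to the calendar month of a single reference date is an assumption, since the text does not say how weeks that straddle two months were handled.

```python
from collections import Counter
from datetime import date

def monthly_zero_proportion(week_dates, counts):
    """Scaled monthly zero counts: zeros per calendar month / largest monthly zero count."""
    zeros = Counter()
    for d, x in zip(week_dates, counts):
        key = (d.year, d.month)
        zeros[key] += 1 if x == 0 else 0   # months without zeros still get an entry
    max_zeros = max(zeros.values())
    if max_zeros == 0:
        return {k: 0.0 for k in sorted(zeros)}
    return {k: zeros[k] / max_zeros for k in sorted(zeros)}

# Toy data, for illustration only
toy_dates = [date(2016, 1, 2), date(2016, 1, 9), date(2016, 1, 16),
             date(2016, 2, 6), date(2016, 2, 13)]
toy_counts = [0, 0, 3, 0, 2]
proportions = monthly_zero_proportion(toy_dates, toy_counts)
```

In the toy data, January has two zero weeks and February one, so the scaled values are 1.0 and 0.5. Note that this scaling yields a relative index bounded by one rather than the fraction of weeks in a month with zero counts.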

Fig 3. Monthly proportion of zero mortality counts versus the fitted sinusoidal zero-inflation function.

(Red) Monthly proportion of zero mortality counts. (Blue) Fitted sinusoidal zero-inflated function.

https://doi.org/10.1371/journal.pone.0285769.g003

As mentioned earlier, two scenarios were considered for the time-varying zero-inflation component of the proposed model. In the models listed under Scenario 1 (Sc1) we assumed a piecewise constant zero-inflation function. The monthly zero-inflation was allowed to vary according to a sinusoidal zero-inflation function with a 12-month period, but the weeks within a given month were assigned the same zero-inflation probability value associated with that month. The models under Scenario 2 (Sc2) used a sinusoidal function describing a time-varying zero-inflation probability that changes on a weekly basis. Note that all other model categories we consider fall under Scenario 3 (Sc3), where the zero-inflation probability remains constant over time. In addition to the proposed model under scenarios Sc1 and Sc2, we fitted models in categories C1, C2, C3, and C5. All models were fitted under the model orders M1, M2, M3 with the various underlying zero-inflated distributions described at the beginning of the real data example section. The EM algorithm was used to estimate the model parameters, and the results for the Sc1 and Sc2 scenarios are reported in Table 13. The EM algorithm was chosen over the MLE method to be in line with the method used in Zhu [2] for its real data example.
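The difference between the two scenarios can be illustrated with a short sketch. The sinusoid used here, level + amplitude · sin(2πt / period + phase), is a hypothetical stand-in for the general function in Eq (5), whose exact form is not reproduced in this section. The amplitude, level, 52-week cycle, and week-to-month mapping are likewise assumptions chosen only to show the mechanics.

```python
import math

def sinusoid(t, period, amplitude=0.3, level=0.5, phase=0.0):
    """Hypothetical sinusoidal zero-inflation probability.

    Stays inside (0, 1) whenever amplitude < min(level, 1 - level).
    """
    return level + amplitude * math.sin(2.0 * math.pi * t / period + phase)

def omega_sc1(month_index):
    """Sc1: one value per calendar month (12-month cycle), shared by its weeks."""
    return sinusoid(month_index, period=12.0)

def omega_sc2(week_index):
    """Sc2: the probability changes every week (52-week cycle assumed)."""
    return sinusoid(week_index, period=52.0)

# Hypothetical mapping of eight consecutive weeks to two calendar months
week_to_month = [0, 0, 0, 0, 1, 1, 1, 1]
sc1 = [omega_sc1(m) for m in week_to_month]
sc2 = [omega_sc2(w) for w in range(len(week_to_month))]
```

Under Sc1 the four weeks of a month share one probability, producing a step function, whereas under Sc2 every week receives its own value along a smooth curve.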

Table 13. Estimated parameters, standard errors (within parentheses), AIC and BIC for the pediatric death counts caused by virus A.

https://doi.org/10.1371/journal.pone.0285769.t013

Table 13 shows that the models which fall under Sc1 exhibited lower AIC and BIC values compared to their counterparts under Sc2. We conclude that the use of a piecewise constant time-varying zero-inflation function improved the fit compared to the version in which the time-varying zero-inflation probability changes on a weekly basis.

As described earlier, both the AIC and BIC values were used to identify the best models within each model category by model order combination for the rest of the constant zero-inflation models. The models thus identified are shown in cells highlighted in light gray in Tables 1A and 1B in S5 Appendix. Models with the lowest AIC or BIC values within each model category are identified in boldface type in Tables 1A and 1B in S5 Appendix. If a model has the lowest information criterion values with respect to both AIC and BIC, then it is identified in italic boldface font. These models with the lowest AIC or BIC within each model category are also listed in Table 14, together with the Sc1 and Sc2 versions of the proposed model under model order M2, which was selected as the best among all three orders.

Table 14. The AIC and BIC values for the best model in model categories C1, C2, C3 and C5 under Scenario 3 and the best model order for the proposed model under Scenarios Sc1 and Sc2.

https://doi.org/10.1371/journal.pone.0285769.t014

The performance of the two versions of the TVZIP-INGARCH (p, q) model with the lowest AIC and BIC values is compared with that of other INGARCH time series models that assume constant zero-inflation in Table 14. Among all the models, the TVZIP-INARCH (2) model under the Sc1 zero-inflation assumption has the lowest AIC and BIC values. Note that the second best model with respect to the AIC and BIC criteria is also a version of the proposed TVZIP-INGARCH model.

The TVZIP-INARCH (2) model with the piecewise constant zero-inflation formulation, which has the lowest AIC and BIC values among all models considered, was selected for further investigation. The standardized Pearson residuals were calculated for the selected model and used for diagnostic tests. The ACF plot and the histogram, both based on the standardized Pearson residuals, are presented in Fig 4.

Fig 4. The ACF plot and the histogram, both based on the standardized Pearson residuals of fitted TVZIP-INARCH (2) model.

https://doi.org/10.1371/journal.pone.0285769.g004

The standardized Pearson residuals display a mean (0.0017), which is close to zero, and a variance (0.8108), which is close to one. The ACF plot indicates that the residuals of the fitted model meet the assumption of no autocorrelation at the 5% significance level. Additionally, the mean of the fitted model (1.4608) is close to the observed mean of the count data series (1.5063). Based on the model selection criteria, the model fit, and the residual diagnostics, we conclude that the TVZIP-INARCH (2) model with the Sc1 formulation for the zero-inflation probability provided the best fit to the data. In this model, we assumed that the weeks within any given month have a constant zero-inflation probability, while the monthly zero-inflation varies cyclically.
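The residual check described above can be sketched for a zero-inflated Poisson model. For a count with Poisson mean λ_t and zero-inflation probability ω_t, the conditional mean is (1 - ω_t)λ_t and the conditional variance is (1 - ω_t)λ_t(1 + ω_tλ_t), so the standardized Pearson residual is the observed count minus the conditional mean, divided by the conditional standard deviation. The function names and the toy inputs below are hypothetical; in practice λ_t and ω_t would be the fitted values from the selected model.

```python
import math

def zip_pearson_residuals(counts, lambdas, omegas):
    """Standardized Pearson residuals for a zero-inflated Poisson model.

    Conditional mean:     (1 - w) * lam
    Conditional variance: (1 - w) * lam * (1 + w * lam)
    """
    residuals = []
    for x, lam, w in zip(counts, lambdas, omegas):
        mean = (1.0 - w) * lam
        var = (1.0 - w) * lam * (1.0 + w * lam)
        residuals.append((x - mean) / math.sqrt(var))
    return residuals

def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    m = sum(values) / n
    v = sum((r - m) ** 2 for r in values) / (n - 1)
    return m, v

# Toy inputs, for illustration only
res = zip_pearson_residuals([0, 2], [2.0, 2.0], [0.5, 0.5])
res_mean, res_var = mean_and_variance(res)
```

For a correctly specified model the residuals should have mean near zero, variance near one, and no significant autocorrelation, which is the basis for the comparison with 0.0017 and 0.8108 above.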

Real data example—Zero-inflation function is driven by exogenous variable

In this subsection we examine the performance of the TVZIP-INGARCH (p, q) models when the zero-inflation function is driven by an exogenous variable. The Influenza B associated pediatric mortality data set was used, and the TVZIP-INARCH (1), TVZIP-INARCH (2), and TVZIP-INGARCH (1, 1) models were fitted to the data. Results were compared with forty-five constant zero-inflated INGARCH (p, q) models. We selected the weekly average of nationwide low temperatures as the exogenous variable that drives the zero-inflation probability. This is motivated by one of the reasons generally accepted as a cause for the easy transmission of influenza during winter months, namely the cold temperatures driving people indoors. The Influenza B associated pediatric mortality data set was accessed from the weekly U.S. Influenza Surveillance Report [33], and the temperature data were extracted from the United States National Oceanic and Atmospheric Administration (NOAA) weather prediction center [34]. Both data sets span the period from week 40 of 2014 to week 39 of 2018. The influenza B data set contains 209 weekly observations of pediatric death counts. Summary statistics of pediatric mortality cases due to influenza B showed a mean of 0.8517 and a variance of 2.4538. Since the empirical variance is higher than the empirical mean, the data exhibit overdispersion. Fig 5 illustrates the frequency of each pediatric mortality count caused by the influenza B virus using a bar chart. The bar chart shows that there are 128 zeros, which comprise 61.24% of the total time points. This suggests that the pediatric mortality data are zero-inflated.

Fig 5. Bar chart of influenza B associated pediatric deaths.

https://doi.org/10.1371/journal.pone.0285769.g005

The time series, ACF, and PACF plots are given in Fig 6. Based on the time series plot, we can see that this data set exhibits an annual seasonality. Moreover, it shows clusters of zeros that occur periodically. Therefore, as discussed in the above section, we propose the TVZIP-INGARCH (p, q) model for this count data series.

Fig 6. Influenza B associated pediatric deaths, sample ACF plot, and sample PACF plot.

https://doi.org/10.1371/journal.pone.0285769.g006

In this example, the time-varying zero-inflation was modelled by considering an exogenous time series, namely the weekly average of nationwide low temperature. Comparison of the two time series plots is given in Fig 7.

Fig 7. Influenza B associated pediatric deaths and weekly average of nationwide low temperature.

(Blue) Time series plots of Influenza B associated pediatric deaths. (Red) Weekly average of nationwide low temperature.

https://doi.org/10.1371/journal.pone.0285769.g007

Fig 7 shows that periods with higher values of low temperature coincide with periods of zero pediatric mortality caused by Influenza B. This suggests that periods with high zero counts (low pediatric mortality) are related to periods with higher values of low temperature. Thus, we used the weekly average low temperature data to model the time-varying zero-inflated component in the pediatric mortality data set. In this example, the time-varying zero-inflation process was modelled using the formulation from Eq (6) with low temperature as the input variable to the logistic model. We label this zero-inflation scenario as Sc4. Results from the TVZIP-INGARCH formulation under model orders M1, M2, and M3, with varying underlying distributions, are given in Table 15. Note that we used the EM algorithm, as was done in Example 1.

Table 15. Estimated parameters, standard errors (within parentheses), AIC and BIC for the pediatric death counts caused by virus B.

https://doi.org/10.1371/journal.pone.0285769.t015

Results in Table 15 show that the TVZIP-INARCH (2) model fitted under Scenario 4 (Sc4) exhibited the lowest AIC and BIC values compared to the other models.

Forty-five models were fitted under the constant zero-inflation scenario (Sc3). Table 1B in S6 Appendix provides a list of these models, which fall into categories C1, C2, C4, and C6 with model orders M1, M2, and M3. The zero-inflated versions of the Poisson, negative binomial Type 1, negative binomial Type 2, and generalized Poisson distributions were considered as the underlying distributions associated with the counts in categories C1, C4, and C6. In C2, we assumed the zero-inflated geometric Poisson and zero-inflated Neyman Type A as the underlying distributions. As was done in Example 1, a test for a change point was conducted for the model with the lowest AIC value in each model order within category C1. Note that at this initial stage, models with the lowest AIC value under the C1 category had the lowest BIC value as well. If a change point was detected, the model under consideration was independently fitted to the sub-series before and after the change point. Comparisons of the versions of the proposed model with the forty-five existing models, as well as with models fitted to sub-series, are reported in Tables 1A and 1B in S6 Appendix. The models with the lowest AIC or BIC values in each model category by model order combination are highlighted in light gray. Models with the lowest AIC or BIC values within each model category are identified in boldface type in Tables 1A and 1B in S6 Appendix. If a model has the lowest information criterion values with respect to both AIC and BIC, then it is identified in italic boldface font.

The results for the TVZIP-INGARCH (p, q) models under Scenario 4 (Sc4), together with the constant zero-inflation models under Scenario 3 (Sc3) that had the lowest AIC or BIC values within each model category, are shown in Table 16.

Table 16. The AIC and BIC values for the best model in model category C1, C2, C4 and C6 with scenario Sc3 and Sc4.

https://doi.org/10.1371/journal.pone.0285769.t016

Based on the BIC criterion, the TVZIP-INARCH (2) model provided the best fit to the data, albeit with a BIC value only marginally lower than that of the next best model. However, a different result is obtained when considering the AIC values. The AIC values indicate that the ZINB2 softplus INARCHX (2) model, which assumes a constant zero-inflation, exhibits a marginally better fit to the data. This is the new model we introduced as a generalization of the softplus INGARCH formulation of Weiß et al. [12] by adding zero-inflation and an exogenous variable to the original version. The inclusion of an exogenous variable seems to lead to an enhanced performance with respect to AIC. This advantage disappears when BIC is used as the model selection criterion. Since both models perform well, model adequacy checks were carried out for the TVZIP-INARCH (2) and the ZINB2 softplus INARCHX (2) models. The ACF plots and the histograms of standardized Pearson residuals for the fitted models are presented in Figs 8 and 9 respectively.

Fig 8. The ACF plot and the histogram, both based on the standardized Pearson residuals of fitted TVZIP-INARCH (2) model.

https://doi.org/10.1371/journal.pone.0285769.g008

Fig 9. The ACF plot and the histogram, both based on the standardized Pearson residuals of fitted ZINB2 softplus INARCHX (2) model.

https://doi.org/10.1371/journal.pone.0285769.g009

The ACF plots indicate that the standardized Pearson residuals of both the TVZIP-INARCH (2) and the ZINB2 softplus INARCHX (2) models do not exhibit significant autocorrelations. However, the mean (-0.0054) and the variance (1.0006) of the residuals of the TVZIP-INARCH (2) model are closer to 0 and 1 respectively. The residuals of the ZINB2 softplus INARCHX (2) model have a mean (-0.0389) close to 0, but their variance (0.8557) is further away from 1 than the variance of the TVZIP-INARCH (2) model. While both models satisfied the residual checks for model adequacy, the TVZIP-INARCH (2) model may be considered to perform marginally better due to its slightly lower BIC value and its residual mean and variance being closer to the ideal values of 0 and 1 respectively. In addition, the mean of the fitted values of the TVZIP-INARCH (2) model (0.8421) is closer to the sample mean (0.8517) than that of the ZINB2 softplus INARCHX (2) model (0.9264). Thus, we can conclude that while both these models performed well, our proposed TVZIP-INGARCH model has a tenuous edge over the softplus INGARCHX model. In a practical sense, however, one can argue that both models perform equally well.

One reason for the good performance of the ZINB2 softplus INARCHX model may be the nature of the softplus function, which can bring the value of the conditional mean close to zero for some values of the exogenous variable. Thus, while the zero-inflated component was set up to be non-time-varying, the influence of the exogenous variable can bring the conditional mean of the process very close to zero over some periods. In other words, the softplus link function allows the exogenous variable to significantly increase the probability of obtaining zero counts in a time-varying fashion.

The main aim of this example is to illustrate the utility of the proposed formulation in modelling a real-life situation. Based on the results from this example, TVZIP-INARCH (2) formulation with zero-inflation modeled using an exogenous variable and the ZINB2 softplus INARCHX (2) model appear as good candidates to be included in the tool set a practitioner may consider when modeling count data time series with time-varying zero inflation, in situations where an appropriate exogenous variable is available. Equally important is the observation that no other model outperformed the proposed model with respect to any criteria based on residual analysis.

Conclusions

A time-varying zero-inflated Poisson integer GARCH (TVZIP-INGARCH) model was proposed to accommodate situations where zero-inflation is driven by either a deterministic function of time or a set of exogenous variables. Monte Carlo simulation results indicate that the Expectation Maximization (EM) and Maximum Likelihood Estimation (MLE) methods produce very similar results with respect to parameter estimates. Both the EM and MLE techniques estimated the model parameters with good accuracy when the underlying model has a purely ARCH component. When the model has a GARCH component, the GARCH parameters are estimated with less accuracy, which is a phenomenon also seen in the study of the existing ZIP-INGARCH model proposed by Zhu [2]. When tested on two real-life data sets, the TVZIP-INGARCH models performed better than the other constant zero-inflated INGARCH formulations, with one possible exception. Even in this case the competing model does not outperform the proposed model. These results illustrate the usefulness of the proposed model as one of the tools the practitioner can utilize for modeling empirical count data time series with time-varying zero-inflation. In addition, the flexibility of modelling zero-inflation through deterministic cyclical functions or through exogenous time series provides the proposed model with added versatility.

Supporting information

S1 Appendix. Derivation of conditions for zero-inflation probability to lie inside (0,1).

https://doi.org/10.1371/journal.pone.0285769.s001

(DOCX)

S2 Appendix. Derivation of the conditional mass function.

https://doi.org/10.1371/journal.pone.0285769.s002

(DOCX)

S3 Appendix. Initialization of the lambdas.

https://doi.org/10.1371/journal.pone.0285769.s003

(DOCX)

S4 Appendix. List of existing INGARCH type models used in the comparison study.

https://doi.org/10.1371/journal.pone.0285769.s004

(DOCX)

S5 Appendix. Model comparison results for the pediatric death counts caused by virus A.

https://doi.org/10.1371/journal.pone.0285769.s005

(DOCX)

S6 Appendix. Model comparison results for the pediatric death counts caused by virus B.

https://doi.org/10.1371/journal.pone.0285769.s006

(DOCX)

S1 Data. Datasets.

https://doi.org/10.1371/journal.pone.0285769.s007

(XLSX)

S1 File. Codes.

This file contains the codes for simulation study, real data example 1, and real data example 2.

https://doi.org/10.1371/journal.pone.0285769.s008

(ZIP)

Acknowledgments

The authors wish to thank the academic editor and the reviewers for their insightful and constructive comments, which helped to greatly enhance the quality and breadth of this paper.

References

1. Ferland R, Latour A, Oraichi D. Integer‐valued GARCH process. J Time Ser Anal. 2006;27(6): 923–942.
2. Zhu F. Zero-inflated Poisson and negative binomial integer-valued GARCH models. J Stat Plan Inference. 2012;142(4): 826–839.
3. Rydberg TH, Shephard N. BIN models for trade-by-trade data. Modelling the number of trades in a fixed interval of time. Technical report 0740. Econometric Society [Internet]. 2000 [Cited 2022 July 16]. Available from: http://ideas.repec.org/p/ecm/wc2000/0740.html
4. Kedem B, Fokianos K. Regression models for time series analysis. Wiley series in probability and statistics. Hoboken, NJ: Wiley-Interscience; 2002.
5. Heinen A. Modeling time series count data: An autoregressive conditional Poisson model. Technical report MPRA paper 8113. University Library of Munich, Munich, Germany [Internet]. 2003 [Cited 2022 July 16]. Available from: https://mpra.ub.uni-muenchen.de/8113/
6. Bollerslev T. Generalized autoregressive conditional heteroscedasticity. J Econom. 1986;31(3): 307–327.
7. Ghahramani M, Thavaneswaran A. On some properties of autoregressive conditional Poisson (ACP) models. Econ Lett. 2009;105(3): 273–275.
8. Weiß CH. Modelling time series of counts with overdispersion. Stat Methods Appl. 2009;18: 507–519.
9. Fokianos K, Rahbek A, Tjøstheim D. Poisson autoregression. J Am Stat Assoc. 2009;104(488): 1430–1439.
10. Fokianos K, Tjøstheim D. Log-linear Poisson autoregression. J Multivar Anal. 2011;102(3): 563–578.
11. Chen CWS, Lee S. Bayesian causality test for integer-valued time series models with applications to climate and crime data. J R Stat Soc Ser C Appl Stat. 2017;66: 797–814.
12. Weiß CH, Zhu F, Hoshiyar A. Softplus INGARCH models. Stat Sin. 2022;32: 1099–1120.
13. Zhu F. Modeling time series of counts with COM-Poisson INGARCH models. Math Comp Model. 2012;56(9–10): 191–203.
14. Zhu F. A negative binomial integer-valued GARCH model. J Time Ser Anal. 2011;32(1): 54–67.
15. Ye F, Garcia TP, Pourahmadi M, Lord D. Extension of negative binomial GARCH model analyzing effects of gasoline price and miles traveled on fatal crashes involving intoxicated drivers in Texas. Transp Res Rec. 2012;2279(1): 31–39.
16. Xu H-Y, Xie M, Goh TN, Fu X. A model for integer–valued time series with conditional overdispersion. Comput Stat Data Anal. 2012;56: 4229–4242.
17. Lambert D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics. 1992;34: 1–14.
18. Chen CWS, Lee S. Generalized Poisson autoregressive models for time series of counts. Comput Stat Data Anal. 2016;99: 51–67.
19. Lee S, Lee Y, Chen CWS. Parameter change test for zero-inflated generalized Poisson autoregressive models. Statistics. 2015;87(15): 2981–2996.
20. Zhu F. Modeling overdispersed or underdispersed count data with generalized Poisson integer-valued GARCH models. J Math Anal Appl. 2012;389(1): 58–71.
21. Esmeralda G, Nazaré M, Filipa S. Zero-inflated compound Poisson distributions in integer-valued GARCH models. Statistics. 2015;50(3): 558–578.
22. Xiong L, Zhu F. Robust quasi-likelihood estimation for the negative binomial integer-valued GARCH (1,1) model with an application to transaction counts. J Stat Plan Inference. 2019;203: 178–198.
23. Li Q, Chen H, Zhu F. Robust estimation for Poisson integer-valued GARCH models using a new hybrid loss. J Syst Sci Complex. 2021;34: 1578–1596.
24. Liu M, Zhu F, Zhu K. Modeling normalcy-dominant ordinal time series: an application to air quality level. J Time Ser Anal. 2022;43: 460–478.
25. Cui Y, Li Q, Zhu F. Modeling Z-valued time series based on new versions of the Skellam INGARCH model. Braz J Probab Stat. 2021;35: 293–314.
26. Xu Y, Zhu F. A new GJR-GARCH model for Z-valued time series. J Time Ser Anal. 2022;43: 490–500.
27. Lee S, Kim D, Seok S. Modeling and inference for counts time series based on zero-inflated exponential family INGARCH models. J Stat Comput Simul. 2021;91: 2227–2248.
28. Davis RA, Fokianos K, Holan SH, Joe H, Livsey J, Lund R, et al. Count time series: a methodological review. J Am Stat Assoc. 2021;116: 1533–1547.
29. Yang M. Statistical models for count time series with excess zeros. Doctoral dissertation, University of Iowa; 2012. [Cited 2022 July 16] Available from: https://iro.uiowa.edu/esploro/outputs/doctoral/Statistical-models-for-count-time-series/9983776903502771
30. Xu X, Chen Y, Chen CWS, Lin X. Adaptive log-linear zero-inflated generalized Poisson autoregressive model with applications to crime counts. Ann Appl Stat. 2020;14(3): 1493–1515.
31. Weiß CH. An introduction to discrete-valued time series. Hoboken, NJ, USA: John Wiley & Sons; 2018.
32. Lee Y, Lee S. CUSUM test for general nonlinear integer-valued GARCH models: comparison study. Ann Inst Stat Math. 2019;71: 1033–1057.
33. CDC [Internet]. Influenza-associated pediatric mortality [Cited 2022 July 16]. Available from: https://gis.cdc.gov/GRASP/Fluview/PedFluDeath.html
34. WPC (World Prediction Centre) [Internet]. National high and low temperature [Cited 2022 July 16]. Available from: https://www.wpc.ncep.noaa.gov/

