A generalized right truncated bivariate Poisson regression model with applications to health data

  • M. Ataharul Islam, 
  • Rafiqul I. Chowdhury

Abstract

A generalized right truncated bivariate Poisson regression model is proposed in this paper. Estimation and tests for goodness of fit and over or under dispersion are illustrated for both untruncated and right truncated bivariate Poisson regression models using marginal-conditional approach. Estimation and test procedures are illustrated for bivariate Poisson regression models with applications to Health and Retirement Study data on number of health conditions and the number of health care services utilized. The proposed test statistics are easy to compute and it is evident from the results that the models fit the data very well. A comparison between the right truncated and untruncated bivariate Poisson regression models using the test for nonnested models clearly shows that the truncated model performs significantly better than the untruncated model.

Introduction

Bivariate Poisson models address an increasingly important area of research, with applications in the many fields where paired count data are correlated. Paired counts arise commonly in areas such as traffic accidents (number of accidents and fatalities), medical research (number of visits to health professionals and utilization of health care services), epidemiology (number of episodes of depression and number of visits to doctors), marketing (number of sales of different products), sports (number of goals scored by opposing teams), econometrics (voluntary and involuntary job switching), etc. A typical example of such dependence arises in traffic accidents, where the extent of physical injuries may lead to fatalities. Leiter and Hamdan [1] suggested bivariate probability models applicable to traffic accidents and fatalities. A similar problem was addressed by Cacoullos and Papageorgiou [2]. Several other studies defined and studied the bivariate Poisson distribution [3–10].

Bivariate Poisson distributions have been developed under various assumptions. Among these, the most comprehensive treatment is that of Kocherlakota and Kocherlakota [11]. The bivariate Poisson form is obtained by a trivariate reduction method [8] that allows for correlation between the variables, which is treated as a nuisance parameter. This bivariate Poisson regression has been used by Jung and Winkelmann [8], Vernic [12] and Karlis and Ntzoufras [9, 13], among others. However, no attempt has so far been made to model right truncated bivariate Poisson regression, which can play an important role in analyzing bivariate count data arising in a wide range of problems.

Leiter and Hamdan [1] suggested joint distributions for number of accidents and number of fatalities based on Poisson-Bernoulli (or Binomial) and Poisson-Poisson distributions. An alternative Poisson-Binomial model was proposed by Cacoullos and Papageorgiou [2]. In this paper, we use the Poisson-Poisson distribution and the model for covariate dependence based on the extended generalized linear model proposed by Islam and Chowdhury [14]. The marginal and conditional models are expressed in terms of the exponential family of distributions. A generalized linear model for this bivariate form is then developed, and the link functions are identified and expressed as functions of covariates. In some cases, a right truncated Poisson distribution is needed for both outcome variables because of restrictions in the definition of the variables. A model that takes account of right truncation in the bivariate Poisson is shown, and a generalized bivariate count regression model is proposed using the marginal-conditional approach. As tests for both goodness of fit and over(under)dispersion are of prime concern in dealing with these models, some test procedures are shown in this paper. Gurmu and Trivedi [15, 16] suggested overdispersion tests for truncated Poisson regression models for the univariate case. In this study, the Gurmu and Trivedi [15] test is extended to the bivariate Poisson for both untruncated and right truncated Poisson regression. For testing the goodness of fit for right truncated bivariate data, the test proposed by Islam and Chowdhury [14] is extended to truncated data. The estimation and test procedures are applied to health data, namely, the Health and Retirement Study [17] data on number of health conditions ever had and utilization of healthcare services. Models with and without truncation are fitted to the same data, and goodness of fit tests are applied to compare the models.

The bivariate Poisson-Poisson model

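Let Y1, the number of occurrences of the first event in a given interval, follow a Poisson distribution with parameter λ1, and let the number of occurrences of the second event, Y2, for given Y1 = y1, where Y2 = Y21 + … + Y2y1, be Poisson with parameter λ2y1, with y21, …, y2y1 = 0, 1, …. Then the marginal, conditional and joint distributions of Y1 and Y2 can be shown as follows:

$$g(y_1) = \frac{e^{-\lambda_1}\lambda_1^{y_1}}{y_1!}, \quad y_1 = 0, 1, \ldots$$

$$g(y_2 \mid y_1) = \frac{e^{-\lambda_2 y_1}(\lambda_2 y_1)^{y_2}}{y_2!}, \quad y_2 = 0, 1, \ldots$$

Multiplying the marginal and conditional distributions shown above, we obtain the following joint distribution

$$g(y_1, y_2) = \frac{e^{-\lambda_1}\lambda_1^{y_1}\, e^{-\lambda_2 y_1}(\lambda_2 y_1)^{y_2}}{y_1!\, y_2!} \tag{1}$$

The probability mass function for Y2 is

$$g(y_2) = \frac{\lambda_2^{y_2}}{y_2!}\sum_{y_1=0}^{\infty} \frac{e^{-\lambda_1}\lambda_1^{y_1}\, e^{-\lambda_2 y_1}\, y_1^{y_2}}{y_1!} \tag{2}$$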
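
As a numerical illustration of the marginal-conditional construction, the following minimal Python sketch evaluates the joint mass function as the product of a Poisson(λ1) marginal and a Poisson(λ2y1) conditional, then checks that the probabilities sum to one and that the mean of Y2 is close to λ1λ2. The function names, the grid size and the values λ1 = 2.5 and λ2 = 0.3 are illustrative assumptions, not taken from the paper.

```python
import math

def poisson_pmf(y, lam):
    """Poisson mass function; a rate of 0 puts all mass on y = 0."""
    if lam == 0.0:
        return 1.0 if y == 0 else 0.0
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def joint_pmf(y1, y2, lam1, lam2):
    """g(y1, y2) = g(y1) * g(y2 | y1), with Y2 | y1 ~ Poisson(lam2 * y1)."""
    return poisson_pmf(y1, lam1) * poisson_pmf(y2, lam2 * y1)

lam1, lam2 = 2.5, 0.3  # illustrative values only
grid = range(60)       # wide enough that the neglected tail mass is tiny

# Total probability and E(Y2) computed by summation over the grid
total = sum(joint_pmf(a, b, lam1, lam2) for a in grid for b in grid)
mean_y2 = sum(b * joint_pmf(a, b, lam1, lam2) for a in grid for b in grid)
```

Under these assumptions `total` should be numerically indistinguishable from 1 and `mean_y2` from λ1λ2 = 0.75, which matches the marginal mean of Y2 implied by the model.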

It is noteworthy that, in the proposed model, the marginal models for Y1 and Y2 do not allow the same joint probability mass function from g(y2)g(y1|y2) and g(y1)g(y2|y1), which is expected, as we observe from the corresponding marginal probabilities.

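The bivariate Poisson-Poisson expression in Eq (1) can be written in bivariate exponential form as follows

$$g(y_1, y_2) = \exp\left\{ y_1 \ln\lambda_1 + y_2 \ln\lambda_2 - \lambda_1 - \lambda_2 y_1 + y_2 \ln y_1 - \ln y_1! - \ln y_2! \right\} \tag{3}$$

The link functions are ln λ1 = x′β1 and ln λ2 = x′β2, where x′ = (1, x1, …, xp), β1′ = (β10, β11, …, β1p) and β2′ = (β20, β21, …, β2p).

Hence, we can show that λ1 = exp(x′β1) and λ2 = exp(x′β2). It is noteworthy that E(Y1) = μ1 = λ1 and E(Y2) = μ2 = λ1λ2. Hence, μ1 = exp(x′β1) and μ2 = exp(x′β1 + x′β2).

The log likelihood function is

$$\ln L = \sum_{i=1}^{n} \left\{ y_{1i} \ln\lambda_{1i} + y_{2i} \ln\lambda_{2i} - \lambda_{1i} - \lambda_{2i} y_{1i} + y_{2i} \ln y_{1i} - \ln y_{1i}! - \ln y_{2i}! \right\} \tag{4}$$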
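
The log likelihood of the untruncated model can be sketched in a few lines of Python. This is a minimal illustration, assuming the log links ln λ1 = x′β1 and ln λ2 = x′β2 and a conditional rate λ2y1; the function names and the data layout (tuples of covariate vector, y1, y2) are hypothetical choices made for this example, and no optimizer is included.

```python
import math

def linear_predictor(x, beta):
    """Inner product x'beta for one observation."""
    return sum(xj * bj for xj, bj in zip(x, beta))

def log_likelihood(data, beta1, beta2):
    """Untruncated Poisson-Poisson log likelihood.

    data: iterable of (x, y1, y2), where x = (1, x1, ..., xp).
    """
    total = 0.0
    for x, y1, y2 in data:
        lam1 = math.exp(linear_predictor(x, beta1))
        lam2 = math.exp(linear_predictor(x, beta2))
        # Marginal part: Y1 ~ Poisson(lam1)
        total += y1 * math.log(lam1) - lam1 - math.lgamma(y1 + 1)
        # Conditional part: Y2 | y1 ~ Poisson(lam2 * y1)
        if y1 == 0:
            # The conditional rate is 0, so only y2 = 0 has positive probability
            if y2 != 0:
                return float("-inf")
            continue
        rate = lam2 * y1
        total += y2 * math.log(rate) - rate - math.lgamma(y2 + 1)
    return total

# Intercept-only toy example with beta1 = beta2 = 0, so lam1 = lam2 = 1
toy = [((1.0,), 2, 1)]
ll = log_likelihood(toy, (0.0,), (0.0,))
```

In practice the function would be passed (with its sign flipped) to a numerical optimizer to obtain the estimates of β1 and β2; the case y1 = 0 is handled separately because the term y2 ln y1 is only defined there when y2 = 0.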

Bivariate right truncated Poisson-Poisson model

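Let Y1 be the number of occurrences of the first event in a given interval, which has a right truncated Poisson distribution with mass function

$$g(y_1) = c_1 \frac{e^{-\lambda_1}\lambda_1^{y_1}}{y_1!}, \quad y_1 = 0, 1, \ldots, k_1 \tag{5}$$

where

$$c_1 = \left[\sum_{y_1=0}^{k_1} \frac{e^{-\lambda_1}\lambda_1^{y_1}}{y_1!}\right]^{-1}$$

Then the probability of the second event, Y2, for given y1, where Y2 = Y21 + … + Y2y1, the total number of second event cases among the truncated y1 events in the specified time interval, is:

$$g(y_2 \mid y_1) = c_2 \frac{e^{-\lambda_2 y_1}(\lambda_2 y_1)^{y_2}}{y_2!}, \quad y_2 = 0, 1, \ldots, k_2 \tag{6}$$

where

$$c_2 = \left[\sum_{y_2=0}^{k_2} \frac{e^{-\lambda_2 y_1}(\lambda_2 y_1)^{y_2}}{y_2!}\right]^{-1}$$

Multiplying the conditional and marginal probability functions, the joint distribution of Y1 and Y2 can be shown as follows

$$g(y_1, y_2) = c_1 c_2 \frac{e^{-\lambda_1}\lambda_1^{y_1}\, e^{-\lambda_2 y_1}(\lambda_2 y_1)^{y_2}}{y_1!\, y_2!} \tag{7}$$

The truncated Poisson-Poisson expression can be written in bivariate exponential form as follows

$$g(y_1, y_2) = \exp\left\{ y_1 \ln\lambda_1 + y_2 \ln\lambda_2 - \lambda_1 - \lambda_2 y_1 + y_2 \ln y_1 - \ln y_1! - \ln y_2! + \ln c_1 + \ln c_2 \right\} \tag{8}$$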

The expected value of Y1 can be shown as where incomplete gamma function approximations are obtained based on relationships between cumulative Poisson function and incomplete gamma functions shown in Abramowitz and Stegun Chapters 6, 7 and 26 [18].

Similarly the conditional expected value for Y2 can be expressed as
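
For a right truncated Poisson variable on 0, 1, …, k, the mean also has an exact closed form, E(Y) = λ P(X ≤ k − 1) / P(X ≤ k) with X an untruncated Poisson(λ) variable, because the sum of y e^{−λ}λ^y / y! over y = 1, …, k equals λ P(X ≤ k − 1). The Python sketch below checks this identity numerically. It is an illustration under stated assumptions rather than the authors' incomplete gamma expression: the function names and the values λ = 2.5 and k = 6 are hypothetical. The same check applies to the conditional mean of Y2 given y1 after replacing λ by λ2y1 and k by k2.

```python
import math

def poisson_pmf(y, lam):
    """Untruncated Poisson mass function for lam > 0."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def poisson_cdf(k, lam):
    """P(X <= k) for an untruncated Poisson(lam) variable."""
    return sum(poisson_pmf(j, lam) for j in range(k + 1))

def truncated_pmf(y, lam, k):
    """Right truncated Poisson mass function on 0, 1, ..., k."""
    if y < 0 or y > k:
        return 0.0
    return poisson_pmf(y, lam) / poisson_cdf(k, lam)

def truncated_mean(lam, k):
    """Mean obtained by direct summation over the truncated support."""
    return sum(y * truncated_pmf(y, lam, k) for y in range(k + 1))

lam, k = 2.5, 6  # illustrative values only
direct = truncated_mean(lam, k)
closed_form = lam * poisson_cdf(k - 1, lam) / poisson_cdf(k, lam)
```

The two quantities agree to rounding error, and both are smaller than λ, which reflects the downward shift of the mean caused by right truncation.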

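The link functions in model Eq (7) are ln λ1 = x′β1 and ln λ2 = x′β2, where β1′ = (β10, β11, …, β1p), x′ = (1, x1, …, xp), β2′ = (β20, β21, …, β2p), and the log likelihood is shown below:

$$\ln L = \sum_{i=1}^{n} \left\{ y_{1i} \ln\lambda_{1i} + y_{2i} \ln\lambda_{2i} - \lambda_{1i} - \lambda_{2i} y_{1i} + y_{2i} \ln y_{1i} - \ln y_{1i}! - \ln y_{2i}! + \ln c_{1i} + \ln c_{2i} \right\} \tag{9}$$

where c1i and c2i are the normalizing constants of the right truncated marginal and conditional distributions for observation i.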

Test for goodness of fit and under(Over) dispersion

In Sections 2 and 3, the proposed generalized models for untruncated or right truncated data are based on marginal and conditional probabilities as functions of covariates, so we need a test for goodness of fit suitable for such models. The test for goodness of fit proposed for the bivariate Poisson-Poisson model [14], using marginal and conditional means, is: (10) where T1y1 can be shown asymptotically as a χ2 with 2 degrees of freedom. Hence T1 is distributed asymptotically as χ2 with 2(k1 + 1) degrees of freedom as each T1y1 is an independent chi-square, k1 + 1 is the number of groups of distinct y1 values such as Y1 = 0 with frequency n0, Y1 = 1 with frequency n1, …, Y1 = k1 with frequency nk1, , and . Here , and .

Similarly, the goodness of fit test statistic as shown in Eq (10) can be modified for the bivariate truncated model too. In this case, the null and alternative hypotheses are: H0: the data follow the proposed truncated bivariate Poisson model and, H1: the data do not follow the proposed truncated bivariate Poisson model. The modified test statistic for testing the goodness of fit is: (11) where T1y1 can be shown asymptotically as a χ2 with 2 degrees of freedom. Hence T1 is distributed asymptotically as χ2 with 2(k1 + 1) degrees of freedom as each T1y1 is independent chi-square, k1 + 1 is the number of groups of distinct y1 values such as Y1 = 0 with frequency n0, Y1 = 1 with frequency n1, …, Y1 = k1 with frequency nk1, , , V1 and V2|1 are the variances of Y1 and Y2 given Y1 respectively (see Appendix). Using , we obtain and from V1 and V2|1 for i = 1, 2, …, ny1.

We can modify the test for goodness of fit as shown in Eqs (10) and (11) to develop test statistics for over(under)dispersion. As overdispersion and underdispersion may influence the fit of the proposed untruncated Poisson regression models, we have used the method of moments estimator [14, 19–21] to estimate the dispersion parameter for the untruncated model, φu,r where where

Using the mean, variance and correction factor [15] for right truncated Poisson marginal and conditional models, we can define and where , , h(kr, λri) = P(Yri = kr), H(kr, λri) = P(Yri ≤ kr) and then using these values we can estimate φr for the truncated model.
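
For the untruncated case, a method of moments estimate of dispersion is commonly computed as the Pearson chi-square statistic divided by its residual degrees of freedom [20]. The Python sketch below assumes that common form with the Poisson variance function V(μ) = μ; the paper's own expression may differ in detail, and the function name and arguments are hypothetical.

```python
def pearson_dispersion(y, mu, n_params):
    """Moment estimate of dispersion: Pearson chi-square / (n - n_params).

    y: observed counts; mu: fitted means (all positive);
    n_params: number of estimated regression parameters.
    Assumes the Poisson variance function V(mu) = mu.
    """
    if len(y) != len(mu):
        raise ValueError("y and mu must have the same length")
    dof = len(y) - n_params
    if dof <= 0:
        raise ValueError("need more observations than parameters")
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / dof

# Toy example: five counts with a common fitted mean of 2
phi_hat = pearson_dispersion([1, 3, 2, 4, 0], [2.0] * 5, n_params=1)
```

With this convention a value below 1 points to underdispersion and a value above 1 to overdispersion. For the conditional model the same function would be applied to the y2 values with fitted means λ2y1.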

A test for over(under)dispersion, T2, is proposed assuming adjusted variances for the marginal mean of Y1 and conditional means of Y2 given Y1 = y1. If there is over(under)dispersion, then the test results in acceptance of the null hypothesis that the expected values are equal to the adjusted variances.

Then T2 for untruncated bivariate Poisson regression model is: (12) where T2y1 can be shown asymptotically as a χ2 with 2 degrees of freedom. Hence T2 is distributed asymptotically as χ2 with 2(k1 + 1) degrees of freedom as each T2y1 is independent chi-square, , and u denotes untruncated model, , and k1 + 1 is the number of distinct counts observed for Y1. Similarly, for right truncated model T2 can be defined as follows: (13) where T2y1 can be shown asymptotically as a χ2 with 2 degrees of freedom. Hence T2 is distributed asymptotically as χ2 with 2(k1 + 1) degrees of freedom as each T2y1 is independent chi-square, , and , t denotes truncated model.

In the above tests for the goodness of fit, specific models (proposed models) are tested. Hence, the test for goodness of fit may be termed equivalently as the test for specification of the proposed models.

Test for model selection

Vuong [22] developed a general test for nonnested models. This statistic tests the hypothesis that two nonnested parametric models are equally distant, in the Kullback-Leibler sense, from the true data distribution. The test statistic is used to select a parsimonious model. Consider the following exponential density form for generalized linear models:

$$f(y; \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}$$

where θ is the natural link function, which is a function of the expected value of Y, b(θ) is a function of θ, a(ϕ) is the dispersion parameter and c(y, ϕ) is a function of y and ϕ. We can obtain the expected value and variance of Y from E(Y) = b′(θ) and Var(Y) = a(ϕ)b′′(θ), where b′′(θ) is the variance function, Var[E(Y)]. From the probability function of Y1 we can show that θ = ln μ1, E(Y1) = μ1, and Var(Y1) = μ1/ϕ1. Similarly, using the conditional probability function for Y2 given Y1 = y1, we find θ = ln μ2|y1, E(Y2|y1) = μ2|y1 and Var(Y2|y1) = μ2|y1/ϕ2.

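The Vuong test is a t or standard normal test for large samples:

$$V = \frac{\sqrt{n}\,\bar{m}}{s_m}, \qquad m_i = \ln\left[\frac{f(y_{1i}, y_{2i}; \hat\theta)}{g(y_{1i}, y_{2i}; \hat\theta')}\right]$$

where f(y1i, y2i; θ) is the probability function for the model with parameter vector θ, g(y1i, y2i; θ′) is the probability function for the model with parameter vector θ′, and m̄ and s_m are the mean and standard deviation of the m_i, i = 1, …, n.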

Adjusted Vuong test is where p = number of parameters in numerator in model f(y1i, y2i;θ) and q = number of parameters in denominator in model g(y1i, y2i;θ′). If V > 1.96 then model in the numerator is favored and if V < −1.96 then model in the denominator is favored.
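
The unadjusted statistic can be sketched in Python as follows, assuming the standard form V = √n m̄ / s_m, where m_i is the difference in log likelihood contributions of observation i under the two models. The function name and the input layout (two lists of per-observation log likelihoods) are hypothetical, and the adjustment for the numbers of parameters p and q is deliberately left out.

```python
import math

def vuong_statistic(loglik_f, loglik_g):
    """Unadjusted Vuong statistic from per-observation log likelihoods.

    loglik_f[i] = ln f(y1i, y2i; theta_hat)   (numerator model)
    loglik_g[i] = ln g(y1i, y2i; theta_hat')  (denominator model)
    """
    n = len(loglik_f)
    if n != len(loglik_g) or n < 2:
        raise ValueError("need two equal-length inputs with n >= 2")
    m = [lf - lg for lf, lg in zip(loglik_f, loglik_g)]
    m_bar = sum(m) / n
    # Standard deviation of m with divisor n, as in Vuong's variance estimate
    s_m = math.sqrt(sum((mi - m_bar) ** 2 for mi in m) / n)
    if s_m == 0.0:
        raise ValueError("log likelihood differences have zero variance")
    return math.sqrt(n) * m_bar / s_m

# Toy per-observation log likelihoods for two competing models
lf = [-1.0, -2.0, -1.5, -0.5]
lg = [-1.5, -2.2, -2.5, -0.6]
v = vuong_statistic(lf, lg)
```

Swapping the two arguments changes only the sign of the statistic, which corresponds to the decision rule above: large positive values favor the numerator model and large negative values favor the denominator model.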

Application

The models proposed in this paper are applied to the tenth wave of the Health and Retirement Study [17]. The outcome variables are number of conditions ever had (Y1), as mentioned by doctors, and utilization of healthcare services (Y2), where utilization of healthcare services includes services from hospital, nursing home, doctor, and home care. The explanatory variables are: gender (1 male, 0 female), age (in years), race (1 Hispanic, 0 others) and veteran status (1 yes, 0 no). The sample size is 5568.

Table 1 displays the average number of conditions (Y1) and utilization of health care services (Y2) and sample variances. The average number of conditions is 2.63 and the variance (2.05) appears to be lower than the mean. The average number of utilization of health care services is 0.77 and the corresponding variance is slightly higher (0.79). The measure of correlation between number of conditions and utilization of health care services can be estimated by where (see Leiter and Hamdan, [1]). The estimated correlation is 0.48 and the corresponding standard error is very small indicating significant association between the bivariate counts. The estimated standard error of r is
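
The reported correlation can be checked from the summary statistics alone. Under the Poisson-Poisson model, Cov(Y1, Y2) = λ1λ2, Var(Y1) = λ1 and Var(Y2) = λ1λ2(1 + λ2), so the correlation is √(λ2 / (1 + λ2)). The Python sketch below plugs in the moment estimate λ2 = ȳ2 / ȳ1; this choice of estimator and the function name are assumptions made for illustration, and the standard error is not computed.

```python
import math

def poisson_poisson_correlation(lam2):
    """Correlation of (Y1, Y2) implied by the Poisson-Poisson model."""
    return math.sqrt(lam2 / (1.0 + lam2))

mean_y1, mean_y2 = 2.63, 0.77  # sample means reported in the text
lam2_hat = mean_y2 / mean_y1   # moment estimate (an assumption of this sketch)
r = poisson_poisson_correlation(lam2_hat)
```

Rounded to two decimals, `r` is 0.48, in line with the estimate reported in the text.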

The dispersion parameters for number of conditions (Y1) and utilization of health care services (Y2) are 0.80 and 1.05, respectively. A test for over(under)dispersion using T2 indicates that there might be statistically significant over(under) dispersion in the bivariate count data. This indicates slight overdispersion for utilization of health care services and underdispersion for number of conditions. Using the mean, variance and correction factor [15] for right truncated Poisson marginal and conditional models, we can estimate the dispersion parameters.

Table 2 displays estimates of the parameters of the marginal and conditional models for the untruncated model, together with tests for significance of the parameters with and without adjustment for over or under dispersion. The selected explanatory variables are gender, age, race and veteran status. All the variables except race show statistically significant association with number of conditions. The marginal model indicates that the number of conditions is significantly lower for males than for females, while age and veteran status are positively associated with number of conditions. Thus the number of health conditions increases steadily with age, as expected, and veterans have a higher number of conditions than non-veterans. There is no significant difference in number of conditions by race. As there is evidence of under dispersion in the number of conditions, an adjustment is made using the dispersion parameter. After adjustment, the standard errors displayed in Table 2 are smaller, and the p-values appear to be smaller. The conditional model for utilization of healthcare services for given number of conditions indicates significant positive association with gender (male = 1, female = 0) and veteran status (yes = 1, no = 0), while age and race (1 for Hispanic, 0 otherwise) show negative association. Although number of conditions is lower for males, utilization of healthcare services appears to be higher among males than among females. Another important finding is that utilization of healthcare services decreases significantly with age for a given number of conditions, which implies that, with increased age, elderly people do not take healthcare services from different sources. Race is not associated significantly with number of conditions, but utilization of healthcare services shows a negative association with race, indicating lower utilization for Hispanics compared with other races. There is a positive association between utilization of healthcare services and veteran status, similar to the association found with number of conditions in the marginal model.

Table 2. Parameter estimates for the bivariate Poisson models using the data on number of conditions and utilization of healthcare services (HRS, 2010).

https://doi.org/10.1371/journal.pone.0178153.t002

The estimates for the right truncated bivariate Poisson model are displayed in Table 3. The results are generally similar to the fit of the untruncated model displayed in Table 2, but the parameter estimates for the right truncated model differ substantially from those of the untruncated model, most evidently in the conditional part of the joint model. Let us denote β1∗ = (β11, …, β1p)′ and β2∗ = (β21, …, β2p)′. Then, for the overall test of the null hypothesis H0: β∗ = 0, the following likelihood ratio test is used: −2[ln L(β̂10, β̂20) − ln L(β̂1, β̂2)], where β∗ = (β1∗′, β2∗′)′. Here the likelihood ratio test statistic is asymptotically chi-square with 2p degrees of freedom. Both the untruncated and truncated models appear statistically significant. The estimated regression coefficient for gender is 0.345 in the untruncated model compared with 0.453 in the truncated model. Similarly, the estimated regression coefficient for veteran status is 0.093 and 0.219 in the untruncated and truncated models, respectively. However, the standard errors have not changed much. The likelihood ratio test statistics show p-values less than 0.001 for both the untruncated and truncated models. It may be noted here that both the untruncated and truncated models are fitted to examine the changes in the estimation of parameters and tests attributable to truncation. Tests for over(under)dispersion for both models show that the fitted models are not significantly over or under dispersed.
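
Because the overall likelihood ratio test has 2p degrees of freedom, an even number, its p-value has a simple closed form: for a chi-square variable with 2m degrees of freedom, P(X > x) = e^{−x/2} Σ_{j=0}^{m−1} (x/2)^j / j!. The Python sketch below uses this identity with the standard library only. The function names and the log likelihood arguments are hypothetical, and the statistic is assumed to be minus twice the difference between the intercept-only and full log likelihoods.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper tail P(X > x) for a chi-square with an even number of df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    # Accumulate sum_{j=0}^{df/2 - 1} (x/2)^j / j!
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

def likelihood_ratio_test(loglik_null, loglik_full, p):
    """LR statistic and p-value for H0: all 2p slope coefficients are zero."""
    stat = -2.0 * (loglik_null - loglik_full)
    return stat, chi2_sf_even_df(stat, 2 * p)
```

For the application with four covariates (p = 4), the reference distribution would have 8 degrees of freedom, for example `likelihood_ratio_test(ll_null, ll_full, p=4)` with the two fitted log likelihoods supplied by the user.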

Table 3. Parameter estimates for the truncated bivariate Poisson models using the data on number of conditions and utilization of healthcare services (HRS, 2010).

https://doi.org/10.1371/journal.pone.0178153.t003

The estimated dispersion parameters for the marginal and conditional untruncated models are 0.80 and 1.05, respectively, and those for the truncated models are 0.90 and 0.83, respectively (see Table 4). After adjusting for dispersion in both the untruncated and truncated models, we observe that the test results for significance of the parameters remain similar.

Table 4. Model statistics for untruncated and truncated bivariate Poisson models for Health and Retirement Study (HRS) data.

https://doi.org/10.1371/journal.pone.0178153.t004

Table 4 summarizes the results for both untruncated and truncated bivariate Poisson regression models. We observe from the goodness of fit test statistic, T1, that both untruncated and truncated bivariate Poisson regression models fit the data very well. In other words, the null hypotheses that the data follow the proposed models are not rejected. To make a more precise decision about the choice of the better model, the Vuong test is employed. We have used the Vuong test for comparing non-nested models because non-nested models cannot be compared using the classical likelihood ratio based tests. Table 4 shows the comparison between untruncated and right truncated models, and it is found that the bivariate right truncated Poisson model appears to be significantly better than the untruncated model (p<0.001).

Conclusion

The use of bivariate count regression models can provide very useful insights for analyzing repeated measures data emerging from various fields including health. In this paper, a comprehensive methodology has been displayed to deal with different types of problems associated with bivariate count data. A new model is proposed using the Poisson-Poisson marginal-conditional approach for both untruncated and truncated data. In reality, right truncation may influence the analysis of such data, and the application displayed in this paper clearly demonstrates substantial change in the estimates of regression parameters. A generalized linear model for the bivariate Poisson-Poisson regression model is developed for right truncated data. The proposed model is applied to the Health and Retirement Study data on number of conditions and utilization of healthcare services. The problem of under or overdispersion in the count data is also examined, and appropriate procedures for adjusting the standard errors are shown. Tests for both goodness of fit and overdispersion have been used, and it is observed that the models fit the bivariate count data well and that the extent of under or overdispersion is not found to be significant in the application to the data on number of conditions and utilization of healthcare services from the Health and Retirement Study. In order to select a better model, an extended test for non-nested models for bivariate count data is shown in this paper, and it is found that the right truncated bivariate Poisson model is a better choice for the health data used in this study.

Appendix

Similarly,

Acknowledgments

We gratefully acknowledge that the study is supported by the HEQEP sub-project 3293, University Grants Commission of Bangladesh and the World Bank. The authors also gratefully acknowledge the HRS (Health and Retirement Study), conducted by the University of Michigan, for making the data publicly available.

Author Contributions

  1. Conceptualization: MAI RIC.
  2. Formal analysis: RIC.
  3. Funding acquisition: MAI.
  4. Investigation: MAI RIC.
  5. Methodology: MAI RIC.
  6. Project administration: MAI.
  7. Resources: MAI RIC.
  8. Software: RIC.
  9. Supervision: MAI RIC.
  10. Validation: MAI RIC.
  11. Visualization: MAI RIC.
  12. Writing – original draft: MAI RIC.
  13. Writing – review & editing: MAI RIC.

References

  1. Leiter RE, Hamdan MA. Some Bivariate Probability Models Applicable to Traffic Accidents and Fatalities. International Statistical Review. 1973; 41: 87–100.
  2. Cacoullos T, Papageorgiou H. On Some Bivariate Probability Models Applicable to Traffic Accidents and Fatalities. International Statistical Review. 1980; 48: 345–356.
  3. Consul PC, Jain GC. A Generalization of the Poisson Distribution. Technometrics. 1973; 15: 791–799.
  4. Consul PC. Some Bivariate Families of Lagrangian Probability Distributions. Communications in Statistics - Theory and Methods. 1994; 23: 2895–2906.
  5. Consul PC. Generalized Poisson Distributions: Properties and Applications. Marcel Dekker. 1989.
  6. Consul PC, Shoukri MM. The generalized Poisson distribution when the sample mean is larger than the sample variance. Communications in Statistics - Simulation and Computation. 1985; 14: 1533–1547.
  7. Holgate P. Estimation for the Bivariate Poisson Distribution. Biometrika. 1964; 51: 241–245.
  8. Jung R, Winkelmann R. Two Aspects of Labor Mobility: A Bivariate Poisson Regression Approach. Empirical Economics. 1993; 18: 543–556.
  9. Karlis D, Ntzoufras I. Bivariate Poisson and Diagonal Inflated Bivariate Poisson Regression Models in R. Journal of Statistical Software. 2005; 14: 1–36.
  10. Hofer V, Leitner J. A bivariate Sarmanov regression model for count data with generalised Poisson marginals. Journal of Applied Statistics. 2012; 39: 2599–2617.
  11. Kocherlakota S, Kocherlakota K. Bivariate Discrete Distributions. Marcel Dekker. 1992.
  12. Vernic R. On The Bivariate Generalized Poisson Distribution. ASTIN Bulletin: The Journal of the International Actuarial Association. 1997; 27: 23–32.
  13. Karlis D, Ntzoufras I. Analysis of Sports Data by Using Bivariate Poisson Models. Journal of the Royal Statistical Society Series D (The Statistician). 2003; 52: 381–393.
  14. Islam MA, Chowdhury RI. A Bivariate Poisson Models with Covariate Dependence. Bulletin of Calcutta Mathematical Society. 2015; 107: 11–20.
  15. Gurmu S, Trivedi PK. Overdispersion Tests for Truncated Poisson Regression Models. Journal of Econometrics. 1992; 54: 347–370.
  16. Cameron A, Trivedi P. Regression Analysis of Count Data. Oxford University Press. 1998.
  17. Health and Retirement Study (HRS). Wave 10 - Public Use Dataset. Produced and Distributed by the University of Michigan with Funding from the National Institute on Aging (Grant Number NIA U01AG09740). Ann Arbor, MI. 2010.
  18. Abramowitz M, Stegun IA. Handbook of Mathematical Functions With Formulas, Graphs, and Mathematical Tables. National Bureau of Standards, United States Department of Commerce. 1964.
  19. McCullagh P. The Conditional Distribution of Goodness-of-Fit Statistics for Discrete Data. Journal of the American Statistical Association. 1986; 81: 104–107.
  20. McCullagh P, Nelder JA. Generalized Linear Models, Second Edition. Chapman and Hall/CRC. 1989.
  21. Long JS. Regression Models for Categorical and Limited Dependent Variables. SAGE Publications, Thousand Oaks. 1997.
  22. Vuong QH. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica. 1989; 57: 307–333.