A semi-nonparametric Poisson regression model for analyzing motor vehicle crash data

This paper develops a semi-nonparametric Poisson regression model to analyze motor vehicle crash frequency data collected from rural multilane highway segments in California, US. Motor vehicle crash frequency on rural highway is a topic of interest in the area of transportation safety due to higher driving speeds and the resultant severity level. Unlike the traditional Negative Binomial (NB) model, the semi-nonparametric Poisson regression model can accommodate an unobserved heterogeneity following a highly flexible semi-nonparametric (SNP) distribution. Simulation experiments are conducted to demonstrate that the SNP distribution can well mimic a large family of distributions, including normal distributions, log-gamma distributions, bimodal and trimodal distributions. Empirical estimation results show that such flexibility offered by the SNP distribution can greatly improve model precision and the overall goodness-of-fit. The semi-nonparametric distribution can provide a better understanding of crash data structure through its ability to capture potential multimodality in the distribution of unobserved heterogeneity. When estimated coefficients in empirical models are compared, SNP and NB models are found to have a substantially different coefficient for the dummy variable indicating the lane width. The SNP model with better statistical performance suggests that the NB model overestimates the effect of lane width on crash frequency reduction by 83.1%.


Introduction
Statistical regression models are typically used in analyzing the likelihood and severity of vehicle crashes. Recent review studies [1][2][3][4] have summarized the innovative models for examining the impact of factors (e.g., traffic, roadway and vehicle characteristics, etc.) on the likelihood of a crash and its resulting injury severity. Regarding observed crash counts, previous studies often found that some crash data are likely to demonstrate heterogeneity. This heterogeneity in the crash data can be explained as the unknown variation of the impact of explanatory variables on crash. As discussed by Mannering, et al. [3], when the crash-related information is collected, some factors affecting the likelihood and severity of the vehicle crash may not be available to the transportation safety analysts (e.g., driver's weight, height, roadway lighting a1111111111 a1111111111 a1111111111 a1111111111 a1111111111

Modeling methodology
This section presents the modeling methodology adopted in this paper. First, the Poisson regression model is presented using the log-gamma heterogeneity (i.e., the Negative Binomial regression model). Although the focus of this paper is to develop a Poisson regression model for crash frequency with unobserved heterogeneity following a semi-nonparametric (SNP) distribution, it will be insightful to present this state-of-practice model for comparison.

Negative binomial (NB) regression model: Poisson regression model with log-gamma heterogeneity
Count data models are most suited to modeling dependent variable y i that constitutes a frequency or "count." The dependent variable can only take non-negative integer values. In this paper, y i represents crash frequency for road section i. The expectation of y i is assumed to be λ i and the count data model formulation is as follows: where x i is a vector of explanatory variables indicating characteristics for road section i; β is a vector of coefficients associated with x i . ε i is a random variable representing heterogeneity that accounts for unobserved factors and other random disturbances. Since y i constitutes count data, the probability of y i conditional on ε i is given as: The Negative Binomial (NB) regression model is formulated based on the assumption that exp (ε i ) = t i follows a gamma distribution, denoted as Γ(1/α 2 ,α 2 ). The corresponding probability density function is: where GðzÞ ¼ The expectation and standard deviation of t are equal to 1 and α, respectively. By integrating t i over its distributional domain, one may obtain the unconditional probability of y i as: Cameron and Trivedi [37] proposed this unconditional probability function with a closedform solution. This formulation has allowed the NB model to be widely applied for modeling count data in many different areas, including transportation. It is to be noted that the true heterogeneity in the model is not t i , but ε i , which accounts for the presence of unobserved variables or factors excluded from the vector x i . Since ε i is equal to ln(t i ), the underlying distributional assumption on ε i is the log-gamma distribution and the probability density function can be derived as: It is not a symmetric function with respect to the variable ε i , indicating that the distribution of the random variable ε i is asymmetric in nature [38].

Poisson regression with SNP heterogeneity
To improve the flexibility of the distribution for unobserved heterogeneity, one may choose to use the SNP distribution for representing heterogeneity ε i . The probability density function for the SNP distribution is usually specified as: In Eq (8), "K" is the length of the polynomial, "m" is an index increasing from 0 to "K", a m is a constant coefficient, and φ(ε) represents the probability density function (PDF) of the standard normal distribution. The denominator ensures that R þ1 À 1 fðεÞdε ¼ 1. The denominator in Eq (8) can be extended and written in the following form, where "n" is another index increasing from 0 to "K": P K n¼0 a m a n R þ1 À 1 ε mþn φðεÞdε ð9Þ R þ1 À 1 ε mþn φðεÞdε in Eq (9) is actually the expectation of ε m+n , which can be calculated based on the moment-generating function and derived recursion formulae.
Define IðnÞ ¼ R þ1 À 1 ε n φðεÞdε; then; Ið0Þ ¼ 1; Ið1Þ ¼ 0 and IðnÞ ¼ ðn À 1Þ Iðn À 2Þ; when n ! 2: Thus; the denominator R þ1 À 1 ð P K m¼0 a m ε m Þ 2 φðεÞdε ¼ P K m¼0 P K n¼0 a m a n Iðm þ nÞ: Under the assumption of the SNP distribution, one can integrate ε i over its distributional domain and obtain the unconditional probability of y i as: n¼0 a m a n Iðm þ nÞ ( ) The key difference in comparison to the NB regression model is that the unconditional probability function presented in Eq (12) does not have a closed-form solution. The numerical method of Gauss-Hermite quadrature is applied to approximate the unconditional probability as follows: n¼0 a m a n Iðm þ nÞ Gaussian quadrature [39] is a sophisticated procedure that can accurately evaluate the integrals in the likelihood function with a small number (usually 10-20) of supporting points. In this study, 30 supporting points are applied to ensure a high level of accuracy for integral evaluations. The values of nodes and weights of Gaussian-Hermit quadrature are listed in Table 1 for interested readers.
The log-likelihood function over the sample consisting of "N" observations can be formulated as: n¼0 a m a n Iðm þ nÞ The standard Maximum Likelihood Estimation (MLE) method can be applied to estimate unknown parameters in the vectors "β" and "a" by maximizing the log-likelihood function in Eq (14). The model estimation is an exploratory procedure, in which the polynomial length "K" needs to start from "1" and then gradually increases to involve more coefficients into the vector "a". The likelihood ratio test can be applied to examine whether adding more coefficients can significantly improve the goodness-of-fit of model. The model estimation results will be finalized when adding more coefficients fails to significantly improve the goodness-offit (GOF) measure.

Simulation experiments
In this section, a number of simulation experiments are conducted to demonstrate the capability of the SNP distribution to approximate different types of distributions for unobserved heterogeneities in Poisson regression models. Those distributions include log-gamma distributions, normal distributions, a bimodal distribution and a trimodal distribution.

SNP model approximating NB model (Poisson model with log-gamma heterogeneity)
The NB model is the most practical modeling approach for crash frequency. It will be insightful to examine whether an SNP distribution can well approximate the log-gamma heterogeneity in NB models. The simulation experiments are designed as below: where "x 1 " and "x 2 " independently follow a uniform distribution between 0 and 5; "ε" represents the unobserved heterogeneity following the log-gamma distribution, whose PDF is given in Eq (7) and the parameter α 2 = 0.8. Then, the count variable "y" is drawn from a Poisson distribution associated with the parameter λ.
The sample size is setup at 1000. Based on the random sample consisting of the dependent variable "y" and explanatory variables "x 1 " and "x 2 ", an NB model can be estimated and shown in the left part of Table 2. As expected, the model coefficients are highly consistent with their true values. With the same sample, an SNP model can be estimated as well and the estimation results are presented in the right part of Table 2 for comparison. It should be noted that the coefficient a 0 needs to be fixed at 1 for identification. The high flexibility of the SNP distribution causes that the intercept in a regression model and the expectation of "ε" may not be simultaneously identifiable. To solve this issue and facilitate comparisons, the intercept of the SNP model is fixed at the value of intercept in the NB model.
As shown, when the length of polynomial (i.e. the "K" value) reaches 4, the SNP model almost perfectly replicates the NB model results, including the log-likelihood value at convergence, model coefficients, and the plot of heterogeneity distribution (as in Fig 1). Table 3 provides similar comparisons between NB and SNP models when α 2 takes the value of 1.2. In this case, the approximation is not as good as before. However, the relative difference between model coefficients is still less than 3% while the log-likelihood values at convergence and the plots of heterogeneity distributions are close to each other (as in Fig 2).

SNP model approximating Poisson model with normal heterogeneities
In this subsection, the SNP distribution is applied to approximate normal heterogeneities in Poisson regression models. The simulation experiments are designed as below: where "x 1 " and "x 2 " still follow independently uniform distribution between 0 and 5; "ε" follows a normal distribution and PDF ε ð Þ ¼ 1 s ffiffiffi ffi  The sample size is also setup at 1000. Based on the random sample consisting of the dependent variable "y" and explanatory variables "x 1 " and "x 2 ", SNP models can be estimated to approximate the normal heterogeneities in the Poisson regression model. The model estimation results are presented in Table 4. With the polynomial length of 2, SNP models can almost perfectly approximate the normal distributions when "σ" takes the value of 0.8 or 1.2. The model coefficients are highly consistent with their true values and differences between the exact and simulated heterogeneity distributions are almost invisible (as in Fig 3).

SNP model approximating bimodal and trimodal distributions
This subsection further exhibits the great flexibility of the SNP distribution to approximate a bimodal distribution and a trimodal distribution. The simulation experiments are designed as below: where "x 1 " and "x 2 " still follow independently uniform distribution between 0 and 5. The unobserved heterogeneity ε = 3 Á D(u 1 > 0.4) + 1.5 Á u 2 + 0.5 Á η − 2.5, where D ( ) is an indicator function, "η" follows the standard normal distribution while "u 1 " and "u 2 " independently follow the standard uniform distribution. The sample size is setup at 500. Since it is challenging to derive the analytical PDF of the mixture distribution for the random variable "ε", Kernel Density Estimation (KDE) approach is employed to estimate density for each e i in the arithmetic sequence (e i = -6.0, -5.9, . . . 5.9, 6.0) based on the random sample and following equation:  In the formula, K h (u) = ϕ(u/h)/h. Namely, the PDF of standard normal distribution is chosen as the smooth function K h (u) and the bandwidth "h" is setup at 0.3. The estimated probability density is plotted as a solid curve in Fig 4. As shown, it is a typical bimodal distribution with two explicit modal points.
After the vector "λ" is generated, the count variable "y" is drawn from a Poisson distribution based on this vector. Then, an SNP model is estimated to approximate the bimodal distribution and the estimation results are provided in the left part of Table 5. When "K" reaches 5, the model coefficients are close to their true values and the SNP distribution can mimic the bimodal distribution reasonably well, which is plotted as a dashed curve in Fig 4. A last experiment is conducted to mimic a trimodal distribution. The unobserved heterogeneity ε = 3 D(u 1 > 0.8)-3 D(u 2 > 0.7) + 2 u 3 + 0.5 η -1.0, where "η" follows the standard  normal distribution while "u 1 ", "u 2 " and "u 3 " independently follow the standard uniform distribution. The KDE approach is applied to estimate kernel densities using a bandwidth of 0.4 for better smoothness. The estimated distribution is then plotted as a solid curve in Fig 5. The model coefficients, which are close to their true values, are presented in the right part of Table 5. The dashed line in Fig 5 represents the SNP distribution mimicking the trimodal distribution. As shown, the SNP distribution correctly exhibits the feature of the trimodal distribution with three modal points and mimics the overall distribution reasonably well. In summary, the simulation experiments demonstrate the strong capability of the SNP distribution to approximate different types of distributions (e.g. unimodal, bimodal and trimodal distributions) for unobserved heterogeneity in Poisson regression models. In terms of the performance, the SNP distribution can almost perfectly approximate a symmetric unimodal distribution like normal distribution, well approximate a skewed unimodal distribution like loggamma distribution and reasonably approximate bimodal and trimodal distributions. With consideration of heterogeneity following the SNP distribution, all the model coefficients are highly consistent or fairly close to their true values. Consequently, it should be appropriate to apply the flexible SNP distribution to explore potential problems, such as non-symmetricity, skewness or multimodality, etc., in the distribution of the unobserved heterogeneities within a Poisson regression model.

Data description
An empirical crash dataset is used to demonstrate the capability of SNP distribution in modeling unobserved heterogeneities. The crash observations were collected on 1443 rural highway sections in California State of USA from 1993 to 2002. This dataset contains sufficient explanatory variables, which can be used to develop a well-defined mean functional form for NB and SNP models. Table 6 provides the summary statistics of variables for the California data. The

Empirical estimation results
This section presents the comparison results between the NB and SNP models. Table 7 presents all the estimate results and overall performance measurements of both NB model and SNP model of crash frequency for comparisons. In the SNP model, the log-likelihood value at convergence can be gradually improved until the polynomial length "K" reaches 3. The performance measurements are listed at the bottom of the table, including the log-likelihood value at convergence [i.e. LL(β)], Deviance, Akaike information criterion (AIC) and Bayesian information criterion (BIC). The following formulae are used to compute those performance

measurements:
Deviance ¼ À 2 Á LLðbÞ; AIC ¼ 2½k À LLðbÞ; BIC ¼ lnðnÞ Á k À 2 Á LLðbÞ; where "n" represents the sample size and "k" represents the number of parameters estimated in the model. A greater value in LL(β) or a less value in Deviance indicates a better goodnessof-fit (GOF) for the data. As shown in Table 7, the SNP model greatly improves the GOF for the data relative to the NB model after 3 additional parameters for the SNP distribution are specified into the model. AIC and BIC are two alternative criteria for model selection by penalizing the number of parameters in models and avoiding overfitting issues. A smaller value of AIC or BIC indicates a better performance of the SNP model than that of the NB model. It implies that it is worth specifying additional coefficients to better describe the distribution of the unobserved heterogeneity and further improve the model performance. In addition, the Chi-squared test is applied to examine whether adding more coefficients can significantly improve the goodnessof-fit of the SNP model. When the polynomial length "K" reaches 3, the Chi-squared test value is 164.36 relative to the log-likelihood value with "K" at 1 and the critical value is 5.99 for 2 degrees of freedom. Since the increase of the polynomial length fails to further significantly improve the goodness-of-fit, the model is finalized at the polynomial length of 3. Fig 6 visualizes the SNP distribution and compares it with the estimated log-gamma distribution in the NB model. It is interesting to see that the estimated SNP distribution exhibits three visible modal points, although the left and right ones are fairly minor. They are presumed to correspond to three groups of observations in the sample. The major one occurs near -0.3 on the coordinate of "ε" and takes a density value of 0.56, corresponding to the largest group in the middle of the distributional domain. This group consists of almost all the (about 99.5%) observations in the sample. The left mode occurs near -3.3 on the coordinate and the relevant small group consists of about 0.4% of all the observations. The right modal point occurs near +2.6 on the coordinate and the relevant group only consists of 0.1% of observations. Since the heterogeneity "ε" represents unobserved or unspecified factors affecting crash frequency, those results indicate the existence of three groups of highway segments exposed to different levels of crash risk, which may be denoted as "low risk", "medium risk" and "high risk" groups. About 6 (= 1443Á0.4%) highway segments from the sample fall into the "low risk" group. If the expected crash frequencies are compared between the "low risk" and "medium risk" groups, the expectation of "low risk" can be only 5% of that of "medium risk" [i.e. exp(−3.3 + 0.3)] even if all the observed and specified factors are the same. Similarly, there are only about 1~2 (%1443Á0.1%) highway segments falling into the "high risk" group, where the expected crash frequency can be more than 18 times [i.e. exp(2.6 + 0. 3)] as much as that of "medium risk" group when all the observed and specified factors are the same.
However, the details revealed by the SNP distribution are ignored by the log-gamma distribution assumed in the NB model. If comparing two distributions, one may envision that the log-gamma distribution has already been extended to represent both left and middle groups. Unfortunately, the log-gamma distribution is a unimodal distribution and therefore cannot exhibit more than one mode to well represent a multimodal distribution. On the other hand, the log-gamma distribution is a skewed distribution in nature, which cannot well reflect a more symmetric error distribution of observations in the middle group. As a result, the GOF of the SNP model is much better than that of the NB model thanks to its advantages to have multiple modes and represent a more symmetric distribution.
In addition to the overall model performance, the SNP model brings great benefit to improve the precision of model estimators. If comparing the standard errors of coefficient estimators between SNP and NB models, one may see that some of them are reduced by a few times. As we know, MLE estimators are consistent and efficient only if the distributional assumption is valid. When the heterogeneity is well mimicked by the SNP distribution, the model coefficient estimators have much less standard errors and are more precise than those in the NB model based on the inappropriate unimodal and skewed distribution for unobserved heterogeneity. When comparing magnitude of estimated coefficients, one may find that SNP and NB models have similar coefficient estimators except that of the dummy variable indicating the lane width. The SNP model with better statistical performance suggests that the NB model substantially overestimates the effect of lane width on crash frequency reduction by 83.1% ({1 − exp(−0.1266)} v.s.{1 − exp(−0.0677)}). The striking difference is probably caused by a better representation of the error distribution in the SNP model.

Conclusions and discussions
In this paper, the authors specify a semi-nonparametric (SNP) distribution to represent the unobserved heterogeneity in a Poisson regression model for crash frequency analysis. Relative to the unimodal log-gamma distribution in the conventional negative binomial model, the SNP distribution is highly flexible to mimic different types of distributions. When the length of polynomial increases, the SNP distribution can approximate a large family of distributions, including symmetric or asymmetric unimodal distribution and different types of multimodal distributions. Traffic crash analysts can take advantage of its flexibility to release distributional restrictions imposed by the conventional modeling method and explore the most appropriate distributional form for the unobserved heterogeneity.
In the empirical study based on the crash dataset collected from the California State of USA, the SNP distribution classifies the observations from the sample into three groups, which are exposed to different levels of risk. The SNP model fits data substantially better than the conventional NB model and provides more precise model coefficient estimators. The NB model is found to substantially overestimate the effect of lane width on crash frequency reduction relative to the SNP model based on more robust estimation of unobserved heterogeneity.
Future research may be carried out in the following three directions. At first, an approach may be required to classify observations into the groups identified by the SNP model. With this approach, there may be great potential to identify "high-risk" and "low-risk" locations associated with unobserved risk factors for further considerations. In addition, the crash model may be re-estimated based on the observations belonging to the "medium-risk" group, where the unobserved heterogeneity is more narrowly distributed. If it can be realized, the goodness-of-fit of the model may be further improved, while all the model coefficients will reflect the situation with the most "medium-risk" locations since "outliers" in "high-risk" and "low-risk" locations are omitted from the sample. Second, the SNP model needs to be applied to some other crash frequency datasets to further examine its applicability in different occasions. Third, there are different methods to capture the heterogeneity. Instead of modifying the distribution of the random component ε i , a random parameter model can also be explored to capture the heterogeneity and improve the goodness-of-fit of model. In future research, a random parameter model may be developed and compared with the SNP model and the traditional NB model.

Author Contributions
Data curation: Dominique Lord.