Geographically weighted regression analysis for nonnegative continuous outcomes: An application to Taiwan dengue data

Abstract

Geographically Weighted Regression (GWR) has gained widespread popularity across various disciplines for investigating spatial heterogeneity with respect to data relationships in georeferenced datasets. However, GWR is typically limited to the analysis of continuous dependent variables, which are assumed to follow a symmetric normal distribution. In many fields, nonnegative continuous data are often observed and may contain substantial amounts of zeros followed by a right-skewed distribution of positive values. With such outcomes, GWR may not provide adequate insights into spatially varying regression relationships. This study extends GWR on the basis of a compound Poisson distribution. Such an extension not only allows for exploration of relationship heterogeneity but also accommodates nonnegative continuous response variables. We provide a detailed specification of the proposed model and discuss related modeling issues. Through simulation experiments, we assess the performance of this novel approach. Finally, we present an empirical case study using a dataset on dengue fever in Tainan, Taiwan, to demonstrate the practical applicability and utility of our proposed methodology.

Introduction

Spatial heterogeneity, which refers to the instability of data relationships between dependent and independent variables, is recognized as a crucial factor in spatial data analysis [1–3]. This recognition has sparked interest in spatially varying coefficient models. Geographically weighted regression (GWR) has emerged as a popular tool, pioneered by [4]. With a nonparametric modeling framework, GWR employs kernel-based density functions centered on each observation to estimate local parameters, utilizing geographical weights determined by distance. Due to its simplicity and interpretability, numerous extensions of GWR have been developed to address complex data structures and analytical challenges. These extensions include methods for spatiotemporal data (e.g., [5, 6]), count outcomes (e.g., [7, 8]), and approaches dealing with spatial autocorrelation and relationship heterogeneity (e.g., [9, 10]). Significant advancements have also been made in quantile versions [3, 11], multivariate responses [12], machine learning techniques (e.g., [13, 14]), and models capable of handling zero-inflated data [15].

This paper aims to enhance the development of Geographically Weighted Regression (GWR) by introducing a flexible approach for modeling nonnegative continuous data that may exhibit severe skewness and a substantial proportion of zeros. Such data, also called semi-continuous variables, arise frequently in various fields, including health services (e.g., medical costs), environmental sciences (e.g., daily precipitation levels), ecology (e.g., total weight of a particular fish species), social studies (e.g., alcohol consumption), and actuarial research (e.g., insurance claim losses), among others. In spatial applications, it is not unusual for researchers to be interested in analyzing semi-continuous outcomes. For example, Reich et al. [16] investigated the spatial impact of neighborhood environmental characteristics on the physical activity levels of pregnant women across different regions. Their study focused on activity intensity, a semi-continuous variable characterized by a large number of zero values (indicating no activity during pregnancy) and highly right-skewed positive, continuous values (indicating varying levels of activity among exercising pregnant women). Modeling such spatial semi-continuous outcomes is challenging due to the point mass at zero and the presence of positive heavy-tailed continuous values. Traditional spatial models based on normal, gamma, and log-normal distributions can yield biased estimates and incorrect inferences when applied to such data [17–19].

Given the mixture of zero and non-zero values, semi-continuous data are often analyzed using two-part models, which involve a two-stage process: one governing the occurrence of zeros and another determining the observed value given a positive non-zero continuous response [17–19]. Several developments have extended two-part models for application to spatial semi-continuous data. Dreassi et al. [20] developed a hierarchical Bayesian approach that constructs a two-part random effect model for small area estimations and handles semi-continuous, skewed, spatially structured data. Neelon et al. [21] proposed a broad class of Bayesian two-part models in which various parametric and nonparametric modeling specifications can be considered for the spatial density estimation of semi-continuous data. More recently, Paradinas et al. [22] focused on spatiotemporal extensions for two-part models to infer the spatiotemporal behavior of the process under study. While quite appealing, these spatial two-part modeling approaches share some disadvantages in practical data analysis.

On one hand, spatial two-part modeling approaches explicitly separate zeros and positive values using two submodels: a binary model (typically logistic regression) that estimates the probability of the outcome being positive in Part I, and a (generalized) linear regression model that estimates the amount of the (transformed) positive value in Part II [21, 23]. One important issue in such model specification is determining the distributional form in Part II. Various choices are suggested in the literature for modeling the positive component, including the lognormal model, log-skewed normal model, and generalized gamma model. An orthodox approach can fit all of these models for a particular dataset and then select the best one based on measures of goodness of fit. However, this approach requires specific fitting algorithms for each candidate model, which can pose computational challenges in empirical applications. Additionally, analysts may struggle to confidently select the most suitable model because different criteria for model comparison can lead to different selected models. On the other hand, spatial two-part models allow for two sets of covariates in the two parts of the model. However, such a feature raises concerns about how to define the predictors to be considered. The number of predictor variables also plays a role. Although one could assume identical predictors for both parts of the model, adding an additional linear function in Part I increases model complexity and makes interpretation of the results more difficult. Another option is to use different independent variables in the two equations, but doing so requires conceptual reasons or appropriate justifications.

Drawing from previous discussions, we advocate a unified approach that enables simultaneous modeling of both zero and positive observations without splitting them into separate parts. In statistics, the compound Poisson model (termed CPM for simplicity) presents an interesting alternative for the analysis of semi-continuous data. This modeling technique assumes that the response variable follows a single distribution known as the compound Poisson distribution, which belongs to the Tweedie exponential dispersion family [24, 25]. The CPM has a unique parameter characterizing the power relation between the distribution mean and variance. As such, it offers flexibility in modeling continuous response data with large variances and/or excessive zeros [26]. Moreover, it has been shown that the CPM is comparable in model fit with the two-part models but only includes a one-stage modeling process, thereby enhancing interpretability [27]. Owing to these advantages, several researchers have adopted the CPM for spatial data analysis. For instance, Arcuti et al. [28] introduced a spatiotemporal generalized additive model that draws upon the compound Poisson distributional assumption to deal with spatial zero-inflated abundance data. Swallow et al. [19] developed a Bayesian hierarchical model utilizing the compound Poisson distribution to directly handle the excess number of zeros in nonnegative continuous data while also addressing spatial and temporal correlation. Despite these advancements, little effort has been made to extend the CPM with a local spatial modeling perspective for properly exploring spatially varying relationships with semi-continuous responses, particularly within the GWR framework.

Our motivating example herein is the analysis of dengue data in Taiwan during the 2015 calendar year. The variable of interest is the dengue intensity index, which measures the severity of dengue transmission. This index is a nonnegative continuous variable that can exhibit a point mass at zero and a heavily right-skewed distribution of positive values. Since dengue data are typically georeferenced, spatial models that account for these data characteristics are of interest when estimating the dengue intensity indices in relation to regional covariates such as socioeconomic, demographic, or environmental factors. Although it is essential to consider local impacts in dengue epidemiology [29], traditional GWR assumes a normal distribution for the outcome variable at each local calibration and cannot effectively address the issues of zero-inflation and heavy skewness in the data. Consequently, it fails to provide appropriate information about spatial heterogeneity. We intend to tackle the analytical challenges by proposing a more suitable approach for dengue data analysis.

This study fills the methodology gap and demonstrates an extension of CPM to GWR by proposing a novel approach named geographically weighted compound Poisson regression model (GWCPM). We begin with a brief review of non-spatial CPM and then present the framework of GWCPM, covering aspects such as bandwidth selection, statistical inferences, and tests of spatial nonstationarity. Simulation experiments are conducted to validate the proposed methodology. Finally, we apply GWCPM to the empirical data on dengue fever in Taiwan, demonstrating its practical utility. The paper concludes with a discussion of the findings and implications.

Methodology

A non-spatial CPM

Let Yi be a response for the ith observation, i = 1, …, n, and let Xi = [Xi1, …, Xic] denote a row vector of covariates with dimension (1 × c) for the ith observation, which includes the intercept. According to Jørgensen [24], the non-spatial CPM is a statistical model where the response variable is assumed to follow the compound Poisson (CP) distribution, denoted as Yi ∼ CP(μi, ϕ, p). This three-parameter distribution can be expressed as

$$f_Y(y_i; \mu_i, \phi, p) = a(y_i; \phi, p)\,\exp\left\{\frac{1}{\phi}\left(\frac{y_i\,\mu_i^{1-p}}{1-p} - \frac{\mu_i^{2-p}}{2-p}\right)\right\} \tag{1}$$

for yi ≥ 0, where

$$a(y_i; \phi, p) = \begin{cases} 1, & y_i = 0, \\[4pt] \dfrac{1}{y_i}\displaystyle\sum_{k=1}^{\infty} \dfrac{y_i^{-k\alpha}\,(p-1)^{k\alpha}}{\phi^{k(1-\alpha)}\,(2-p)^{k}\,k!\,\Gamma(-k\alpha)}, & y_i > 0, \end{cases} \tag{2}$$

with Γ being the gamma function and α = (2 − p)/(1 − p). The distribution mean μi is typically modeled based on the covariate vector Xi with the associated regression parameters β = [β1, …, βc]t (dimension c × 1):

$$g(\mu_i) = X_i\beta \tag{3}$$

for a known monotonic link function g that relates μi to the linear predictor ηi = g(μi) = Xiβ. The variance of Yi has the form

$$\mathrm{Var}(Y_i) = \phi\,\mu_i^{p} \tag{4}$$

which characterizes a so-called power mean-variance relationship [24, 30]. Here, ϕ > 0 is the dispersion parameter, and p ∈ (1, 2) is a power index parameter that controls the shape of the response distribution. In Eq (2), a(yi; ϕ, p) represents a normalizing function and needs to be evaluated using an infinite series expansion. Further technical information can be found in [30, 31].
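To make the role of the normalizing function concrete, the following is a minimal Python sketch of the CP log-density for 1 < p < 2. It assumes the standard Tweedie series form of a(y; ϕ, p) with α = (2 − p)/(1 − p), sums the series in log space, and truncates it at a fixed number of terms. The function name `cp_logpdf` and the truncation point `kmax` are illustrative choices, not part of the original method, and a production implementation would select the summation range adaptively.

```python
import math

def cp_logpdf(y, mu, phi, p, kmax=300):
    """Log-density (log-mass at y == 0) of the compound Poisson law, 1 < p < 2."""
    if not (1.0 < p < 2.0):
        raise ValueError("this sketch only covers 1 < p < 2")
    if y < 0:
        return -math.inf
    # Exponential-family part: (1/phi) * (y*mu^(1-p)/(1-p) - mu^(2-p)/(2-p))
    kernel = (y * mu ** (1.0 - p) / (1.0 - p) - mu ** (2.0 - p) / (2.0 - p)) / phi
    if y == 0:
        return kernel  # a(0; phi, p) = 1, so this is log P(Y = 0)
    alpha = (2.0 - p) / (1.0 - p)  # negative when 1 < p < 2
    # Log of the k-th series term; the series is truncated at kmax (assumption).
    log_terms = [
        k * alpha * (math.log(p - 1.0) - math.log(y))
        - k * (1.0 - alpha) * math.log(phi)
        - k * math.log(2.0 - p)
        - math.lgamma(k + 1.0)
        - math.lgamma(-k * alpha)
        for k in range(1, kmax + 1)
    ]
    top = max(log_terms)  # log-sum-exp for numerical stability
    log_series = top + math.log(sum(math.exp(t - top) for t in log_terms))
    return kernel - math.log(y) + log_series
```

For example, `math.exp(cp_logpdf(0.0, 0.5, 1.0, 1.5))` gives the probability mass at zero for μ = 0.5, ϕ = 1, and p = 1.5.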

The variance function of (4) plays an important role in the CP distribution. Depending on the power index parameter p, the CP density is a mixed distribution characterized by a positive probability mass at zero and continuous positive support. This feature offers flexibility in modeling various characteristics of nonnegative continuous data. As displayed in Fig 1, the CP densities with mean (μ) and dispersion (ϕ) parameters fixed at 0.5 and 1, respectively, have different shapes, skewness, variabilities, and discrete probability of Y = 0 (solid circles) for various p values. In other words, the key advantage of the CP distribution lies in its ability to simultaneously capture both a mass at zero and a unimodal or multimodal distribution with a long right tail. This capability is attributed to the unique specification of the power index parameter p, which defines the variance function and characterizes the distribution on varying shapes [24].

Fig 1. Compound Poisson distribution densities with μ = 0.5 and ϕ = 1 for different values of p.

https://doi.org/10.1371/journal.pone.0315327.g001
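The mixed discrete-continuous nature of the CP distribution can also be seen by simulation. The sketch below relies on the standard representation of a Tweedie variable with 1 < p < 2 as a Poisson-distributed number of gamma-distributed summands; the mapping from (μ, ϕ, p) to the Poisson rate and gamma shape and scale is the usual one and reproduces the mean μ and variance ϕμ^p. The function names are illustrative, and the Poisson sampler (Knuth's method) is only adequate for small rates.

```python
import math
import random

def cp_components(mu, phi, p):
    """Map (mu, phi, p), 1 < p < 2, to the Poisson rate and gamma shape/scale."""
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))    # Poisson rate
    shape = (2.0 - p) / (p - 1.0)                # gamma shape
    scale = phi * (p - 1.0) * mu ** (p - 1.0)    # gamma scale
    return lam, shape, scale

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small rates only."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def sample_cp(mu, phi, p, rng):
    """One CP draw: an exact zero when the Poisson count is zero."""
    lam, shape, scale = cp_components(mu, phi, p)
    n = sample_poisson(lam, rng)
    return math.fsum(rng.gammavariate(shape, scale) for _ in range(n))

rng = random.Random(42)
draws = [sample_cp(0.5, 1.0, 1.5, rng) for _ in range(20000)]
zero_share = sum(d == 0.0 for d in draws) / len(draws)
```

With μ = 0.5, ϕ = 1, and p = 1.5 (one of the settings in Fig 1), the share of exact zeros should be close to exp(−λ), where λ is the Poisson rate returned by `cp_components`.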

GWCPM specification

We develop GWCPM as an extension of the described non-spatial CPM to GWR. Suppose that we now have data collected from n locations in space. Adopting the notation used in the previous subsection, let Yi be a semi-continuous random variable collected from the ith location (ui, vi), i = 1, 2, …, n, and let Xi denote the row vector of c georeferenced covariates at location i. The GWCPM is then specified by

$$Y_i \sim CP\big(\mu_i,\ \phi(u_i, v_i),\ p(u_i, v_i)\big), \qquad g(\mu_i) = X_i\,\beta(u_i, v_i) \tag{5}$$

where β(ui, vi) indicates the regression coefficient vector at location (ui, vi), such that the mean and variance of Yi can be expressed as

$$E(Y_i) = \mu_i = g^{-1}\big(X_i\,\beta(u_i, v_i)\big), \qquad \mathrm{Var}(Y_i) = \phi(u_i, v_i)\,\mu_i^{\,p(u_i, v_i)} \tag{6}$$

for the parameters ϕ(ui, vi) and p(ui, vi) defined similarly as in the non-spatial CPM. Expanding upon the non-spatial CPM, GWCPM retains the power relationship between the mean and variance. In contrast to the non-spatial CPM, the GWCPM considers spatially varying parameters through ϕ(ui, vi), p(ui, vi) and β(ui, vi). The vector β(ui, vi) = [β1(ui, vi), …, βc(ui, vi)]t consists of local regression coefficients for the explanatory variables Xi evaluated at each location (ui, vi). Thus, the GWCPM can be viewed as a spatial local variant of the non-spatial CPM. It enables users not only to examine the spatial heterogeneity in the data relationships across geographical locations but also to handle spatial semi-continuous outcomes with excessive zeros and severe skewness through the dispersion and index parameters.

It is worth noting that the proposed GWCPM (Eqs 5 and 6) specifies the local index parameter within the range of 1 < p(ui, vi) < 2. To enhance its analytical capabilities, we also explore a variant of GWCPM where p(ui, vi) can take on arbitrary values beyond this range, but not between 0 and 1. Such relaxation with the unbounded power parameter can automatically adapt to a wider range of variance structures [32], making it more useful than the original GWCPM in handling diverse characteristics of nonnegative continuous data at each local calibration.

Estimation and inference for GWCPM

Following the concepts of GWR, the GWCPM is calibrated by a kernel regression methodology in which the model estimators are obtained in a pointwise way. The model parameters of GWCPM can be estimated using the maximum likelihood estimation method. Let λ(ui, vi) = {β(ui, vi), ϕ(ui, vi), p(ui, vi)}. The local log-likelihood function at location i is expressed by

$$L\big(\lambda(u_i, v_i)\big) = \sum_{j=1}^{n} w_{ij}\,\log f_Y\big(y_j;\ \mu_j,\ \phi(u_i, v_i),\ p(u_i, v_i)\big) \tag{7}$$

where μj = g−1(Xjβ(ui, vi)) is the mean evaluated at location j with the parameters at regression point i, and fY(yj; μj, ϕ(ui, vi), p(ui, vi)) represents the CP density of Eq (1) with the parameterization of Eq (6). The term wij = K(dij, h) is the geographical weight assigned locally to observation (Xj, Yj), calculated based on a kernel function K that places more weight on observations closer to (ui, vi) than on those farther away. It depends on the distance dij between the given location (ui, vi) and the jth designed location (uj, vj), as well as a bandwidth h. Multiple choices are available for kernel weight schemes in local calibrations. Two prevalent options are the fixed gaussian and adaptive bisquare kernels. The fixed gaussian kernel employs a constant bandwidth, so the same distance-decay weighting is applied at every regression point regardless of the local density of observations. In contrast, the adaptive bisquare kernel adjusts its bandwidth to the density of neighboring points surrounding each estimation point with a nearest-neighbor approach. For further details, please refer to Fotheringham et al. [1].
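The two kernel schemes can be sketched as follows, using the forms most commonly seen in the GWR literature: a gaussian weight exp(−0.5 (d/h)²) with a single bandwidth h, and a bisquare weight (1 − (d/hᵢ)²)² that drops to zero at hᵢ, the distance from regression point i to its k-th nearest observation. The exact kernel constants used by the authors are not stated here, so these formulas and the function names are assumptions.

```python
import math

def gaussian_weight(d, h):
    """Fixed gaussian kernel: smooth decay, never exactly zero."""
    return math.exp(-0.5 * (d / h) ** 2)

def bisquare_weight(d, h):
    """Bisquare kernel: compact support, zero at and beyond the bandwidth."""
    return (1.0 - (d / h) ** 2) ** 2 if d < h else 0.0

def adaptive_bandwidth(dists_from_i, k):
    """Distance to the k-th nearest observation (the point itself counts as 1st)."""
    return sorted(dists_from_i)[k - 1]

def local_weights(coords, i, h=None, k=None):
    """Fixed gaussian weights if h is given, adaptive bisquare weights if k is given."""
    ui, vi = coords[i]
    d = [math.hypot(u - ui, v - vi) for u, v in coords]
    if h is not None:
        return [gaussian_weight(dj, h) for dj in d]
    hi = adaptive_bandwidth(d, k)
    return [bisquare_weight(dj, hi) for dj in d]
```

With the adaptive scheme, regression points in sparse areas automatically receive a wider bandwidth, while the number of observations with nonzero weight stays roughly constant.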

Maximizing the local log-likelihood in (7) at each location involves solving the derivative equations $\partial L/\partial \beta(u_i, v_i) = 0$, $\partial L/\partial \phi(u_i, v_i) = 0$, and $\partial L/\partial p(u_i, v_i) = 0$. Since the CP density does not have a closed form [30, 31], an iterative maximization algorithm is required to solve the optimization problem numerically. In this paper, we utilize the Quasi-Newton algorithm, which avoids the need for explicitly computing the second derivative (i.e., the Hessian matrix). Instead, it iteratively approximates the Hessian matrix using information from the first derivatives, providing an efficient approach for optimization. By employing the Quasi-Newton approach for parameter estimation, the numerically approximated Hessian matrix, denoted as $\hat{H}(\hat{\lambda}(u_i, v_i))$, can be computed at the optimum point. Consequently, the asymptotic distribution of $\hat{\lambda}(u_i, v_i)$ is normal with mean λ(ui, vi) and variance given by the inverse of the negative Hessian of the local log-likelihood, $[-\hat{H}(\hat{\lambda}(u_i, v_i))]^{-1}$. For testing hypotheses such as H0: βk(ui, vi) = 0 for k = 1, …, c, H0: ϕ(ui, vi) = 0, or H0: p(ui, vi) = 0, corresponding local pseudo t-statistics can be computed. These statistics serve as tools to evaluate the statistical significance of the local parameters of the GWCPM.
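As a worked illustration of the first derivative equation, assume a log link, μj = exp(Xjβ(ui, vi)), and write ϕi and pi as shorthand for ϕ(ui, vi) and p(ui, vi) (this shorthand is ours). With the local log-likelihood taken as the kernel-weighted sum $\ell_i = \sum_j w_{ij} \log f_Y(y_j; \mu_j, \phi_i, p_i)$, and noting that the normalizing function a(y; ϕ, p) does not depend on μ, the score for the regression coefficients follows directly from the exponential-family form of the CP density:

```latex
\begin{aligned}
\frac{\partial \log f_Y(y_j;\mu_j,\phi_i,p_i)}{\partial \mu_j}
  &= \frac{1}{\phi_i}\left(y_j\,\mu_j^{-p_i}-\mu_j^{1-p_i}\right)
   = \frac{y_j-\mu_j}{\phi_i\,\mu_j^{p_i}},\\[4pt]
\frac{\partial \ell_i}{\partial \beta(u_i,v_i)}
  &= \sum_{j=1}^{n} w_{ij}\,\frac{(y_j-\mu_j)\,\mu_j^{1-p_i}}{\phi_i}\,X_j^{t} = 0 .
\end{aligned}
```

The second line uses ∂μj/∂β = μj Xjt under the log link. The derivatives with respect to ϕ and p involve the series in a(y; ϕ, p) and have no comparably simple form, which is one reason a numerical optimizer is needed.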

Bandwidth selection

As mentioned above, local calibrations for GWCPM entail a bandwidth h that regulates both the model complexity and the parameter estimates. To determine the optimal bandwidth, we adopt the leave-one-out cross-validation technique suggested by Fotheringham et al. [1]. We attempt to minimize the prediction error, which is calculated as follows:

$$CV(h) = \sum_{i=1}^{n} \big(y_i - \hat{y}_{\neq i}(h)\big)^2$$

where $\hat{y}_{\neq i}(h)$ is the fitted value of Yi with calibration location i left out of the estimation dataset using bandwidth h, and CV is termed the cross-validation score. The cross-validation minimization given above is an iterative process wherein the GWCPM is refitted multiple times using the data excluding location i (i = 1, 2, …, n) over a range of bandwidth values. Therefore, numerical search routines for function minimization can be applied to improve efficiency.
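The cross-validation search can be sketched in a few lines. In the sketch below, the local GWCPM fit is replaced by a much simpler stand-in, a gaussian-kernel weighted mean that leaves out observation i, so that the example stays self-contained; `loo_prediction` is therefore a placeholder for the real leave-one-out GWCPM prediction. The golden-section search is one common choice of numerical search routine and is an assumption here, since the text does not name the routine used.

```python
import math

def loo_prediction(coords, y, i, h):
    """Stand-in local fit: gaussian-kernel weighted mean of y, leaving out i."""
    ui, vi = coords[i]
    num = den = 0.0
    for j, ((u, v), yj) in enumerate(zip(coords, y)):
        if j == i:
            continue
        w = math.exp(-0.5 * (math.hypot(u - ui, v - vi) / h) ** 2)
        num += w * yj
        den += w
    return num / den

def cv_score(coords, y, h, predict=loo_prediction):
    """Sum of squared leave-one-out prediction errors for bandwidth h."""
    return sum((y[i] - predict(coords, y, i, h)) ** 2 for i in range(len(y)))

def golden_section(f, lo, hi, tol=1e-3):
    """Minimise a (roughly unimodal) function f on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                      # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = f(c)
        else:                            # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = f(d)
    return (a + b) / 2.0
```

Because each evaluation of the score refits the local model n times, a search that needs few function evaluations matters far more for GWCPM than for this toy predictor.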

Test of spatial nonstationarity and semiparametric model

For GWCPM, it is essential to identify whether the model parameters (regression coefficients, dispersion parameter, power index parameter) really vary across space. We carry out this assessment using a Monte Carlo randomization testing approach, which has been widely used in the GWR literature because it does not require the determination of a sampling distribution for the variance of the local parameter estimates. The testing procedure involves the following steps: (i) Calculate the variance of the local parameter estimates for GWCPM using the observed spatial data; (ii) Randomly permute the geographic coordinates of the observations and perform GWCPM for this new configuration; (iii) Calculate the variance of local parameters according to the new estimates; (iv) Repeat the previous two steps for a total of R times, say 999; (v) Compute the number of times, say r, that the simulated variances obtained from Step (iii) exceed the original value in Step (i). The Monte Carlo p-value can then be approximated by r/(R + 1).
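Steps (i) to (v) translate into a short permutation loop. The sketch below is generic in the local estimator: `fit` is any function returning one local estimate per location, and a kernel-weighted local mean (`local_means`) stands in for the GWCPM fit so that the example runs on its own. Both names are illustrative. The p-value follows the r/(R + 1) form given in the text.

```python
import math
import random

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def local_means(coords, y, h):
    """Stand-in for a local fit: one kernel-weighted mean per location."""
    out = []
    for ui, vi in coords:
        w = [math.exp(-0.5 * (math.hypot(u - ui, v - vi) / h) ** 2)
             for u, v in coords]
        out.append(sum(wj * yj for wj, yj in zip(w, y)) / sum(w))
    return out

def monte_carlo_pvalue(coords, y, fit, R=999, seed=0):
    rng = random.Random(seed)
    observed = variance(fit(coords, y))            # step (i)
    r = 0
    for _ in range(R):                             # step (iv)
        shuffled = coords[:]
        rng.shuffle(shuffled)                      # step (ii)
        if variance(fit(shuffled, y)) > observed:  # steps (iii) and (v)
            r += 1
    return r / (R + 1)
```

A small p-value indicates that the observed spatial arrangement produces more variable local estimates than random rearrangements of the locations do, which is evidence of nonstationarity.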

If the Monte Carlo tests yield non-significant results for some parameters, this suggests that those parameters of the GWCPM are spatially stationary, while others vary geographically [1, 33]. In response to such scenarios, we introduce a semiparametric model for the analysis of nonnegative continuous data. For cases where only some regression coefficients of independent variables remain constant, we suggest the following semiparametric model:

$$Y_i \sim CP\big(\mu_i,\ \phi(u_i, v_i),\ p(u_i, v_i)\big), \qquad g(\mu_i) = X_{iz}\,\beta_z + X_{is}\,\beta_s(u_i, v_i) \tag{8}$$

where Xiz represents a vector of cz independent variables with constant regression coefficients βz and Xis is defined similarly for a vector of cs independent variables with spatially varying coefficients βs(ui, vi); c = cz + cs. A similar model specification can be constructed by further considering global dispersion and/or power parameters. For example, the semiparametric GWCPM with a constant power parameter can be specified as

$$Y_i \sim CP\big(\mu_i,\ \phi(u_i, v_i),\ p\big), \qquad g(\mu_i) = X_{iz}\,\beta_z + X_{is}\,\beta_s(u_i, v_i) \tag{9}$$

Regarding the semiparametric model estimations, we implement a two-stage calibration procedure inspired by the approach of Li and Mei [34]. Using the model in Eq (8) as an example, we first perform a standard GWCPM in which the fixed parameters βz are allowed to vary spatially. We then obtain the final estimators of βz by averaging the corresponding local estimates across all locations. In the second stage, another GWCPM is fitted to compute refined estimators for the spatially varying coefficients βs(ui, vi), while keeping the estimated constant coefficients $\hat{\beta}_z$ unchanged in the log-likelihood function. This two-stage method can be more effective than the iterative back-fitting approach typically used for mixed GWR models [12, 34].

Simulation study

We conduct two simulation experiments to assess the estimation accuracy and model performance for both GWCPM and semiparametric GWCPM. The first experiment generates data for GWCPM using the equation $Y_i \sim CP(\mu_i, \phi(u_i, v_i), p(u_i, v_i))$, where μi = exp(β0(ui, vi) + β1(ui, vi)Xi); Xi are obtained from random draws of the uniform distribution on the interval [0, 1], denoted Xi ∼ U(0, 1). Following Chen et al. [3], three sets of real coordinates are used for the geographical system (ui, vi) in the experiment: 159 counties in Georgia, 577 minor civil divisions in Georgia, and 1054 minor civil divisions in North Carolina. The true varying coefficients for each configuration are generated based on the eigenvectors of the transformed binary spatial weight matrix derived from the coordinate system, following the approach proposed by Páez et al. [35]. We select four of them, namely the first (e1), fourth (e4), eighth (e8), and ninth (e9) eigenvectors, and use them to set the true local parameters.

For the semiparametric GWCPM, we extend the data-generating process to include an additional independent variable X2 with a global constant coefficient β2. Also, we assume fixed dispersion and power parameters across locations. Specifically, the data are simulated as follows:

$$Y_i \sim CP(\mu_i, \phi, p), \qquad \mu_i = \exp\big(\beta_0(u_i, v_i) + \beta_1(u_i, v_i)\,X_{i1} + \beta_2\,X_{i2}\big)$$

Here, β0(ui, vi), β1(ui, vi), and Xi1 are specified in the same manner as the first simulation, with β2 = 1.5, Xi2 ∼ U(0, 1), p = 1.5, and ϕ = 0.5.

Throughout the experiments, we employ the fixed gaussian kernel for the model calibrations. A total of 100 simulations are performed for each sample size (n = 159, 577, 1054), and the optimal bandwidth is computed using the CV method for each replication. The simulation results are examined with respect to the absolute bias (ABS) and root mean square error (RMSE).
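The two accuracy measures can be computed as in the sketch below. It assumes that, within one replication, the absolute bias is the mean of |estimate − truth| over locations and the RMSE is the square root of the mean squared difference, and that both are then averaged across replications. The text does not spell out the exact averaging order, so that part, like the function names, is an assumption.

```python
import math

def mean_abs_bias(estimates, truth):
    """Mean absolute difference between local estimates and true values."""
    return sum(abs(e - t) for e, t in zip(estimates, truth)) / len(truth)

def rmse(estimates, truth):
    """Root mean square error of local estimates against true values."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth))

def average_over_replications(metric, replications, truth):
    """Average a per-replication metric over all simulated datasets."""
    return sum(metric(est, truth) for est in replications) / len(replications)
```

Here `replications` would hold one list of local estimates per simulated dataset, for example 100 lists of length n for a given coefficient surface.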

Table 1 presents the results of the mean ABS and RMSE for these two simulation cases. As can be seen, both the mean ABS and RMSE of the β regression coefficients for GWCPM decrease with the increase in the sample size, which follows the theoretical expectation. There seems to be a trade-off between the accuracy of estimating the dispersion and power index parameters, where decreasing one tends to increase the other as the sample size grows. This may be consistent with the known fact that the power index parameter can significantly influence the estimation of the dispersion parameter, and they exhibit a reciprocal relationship [26, 36]. Nevertheless, the results are reasonable, and overall, the GWCPM performs well in estimating the true parameters. Similar results are also found for the case of semiparametric GWCPM. This suggests that our proposed two-stage estimation approach has favorable finite sample properties.

Empirical application

Data and variables

The empirical application involves the analysis of dengue data from Taiwan, an island that has regularly experienced dengue epidemics in recent decades. In 2015, southern Taiwan, particularly Tainan City, experienced the deadliest dengue fever outbreak to date, with greater severity than in other regions. Such geographical disparities highlight the spatial variation in dengue epidemics across different regions, emphasizing the impact of locality. Investigating the spatial nonstationarity of potential risk factors for dengue epidemics is crucial for formulating effective, location-specific dengue control strategies.

To this end, we collect and analyze a dataset of 752 ‘Li’ units in Tainan City for the year 2015. The primary outcome is the dengue index of transmission intensity, a metric developed to assess the severity of epidemic transmission based on the frequency of cases occurring in consecutive weeks. We first obtain dengue case data at the ‘Li’ level from the database maintained by Taiwan’s Centers for Disease Control (Taiwan CDC). We then convert the aggregated dengue counts into an intensity index following the definition of Wen et al. [37, 38]. According to their research, this intensity index is a nonnegative continuous measure that can take on both zero and positive values, rather than being count-based. High values of the index indicate a time-concentrated dengue transmission in the area, while ‘Li’ units with no dengue cases result in zero values for the index.

Building upon previous studies analyzing Taiwan dengue data [39], we consider eight explanatory variables consisting of socio-demographic characteristics, environmental attributes, and meteorological conditions. The socio-demographic factors include population density (X1), percent of the susceptible population (X2), and median income (X3), which are collected from the social-economic platform of the National Geographic Information System maintained by the Department of Statistics, Ministry of Interior. Because both the population density and median income vary greatly and are highly skewed, we transform them using the natural logarithm. The Breteau index (BI), extracted through the active vector surveillance system of Taiwan CDC, quantifies the density of immature Aedes mosquitoes by measuring the number of positive containers per 100 houses. We calculate the maximum BI (X4) at the ‘Li’ level and use it as an environmental indicator in the analysis. Our last four independent variables are meteorological measures obtained from the Central Weather Bureau in Taiwan: the mean days of most suitable temperature (MST; X5) for dengue vector mosquitoes, the total number of rainy days from June to November (X6), the longest duration of dryness days from January to May (X7), and the annual maximum temperature (X8). These climatic summaries are spatially interpolated using the inverse distance weighted (IDW) method to obtain representative results for each ‘Li’ unit. Full descriptive statistics for all variables are given in Table 2.

The spatial distribution of dengue transmission intensity is given in Fig 2. It shows that ‘Li’ units around the inner downtown area of Tainan (e.g. West Central, North, South, Yongkang, Anping, East districts) with denser populations tend to have higher transmission. Fig 3 depicts the partial histogram for the observed values of the dengue transmission intensity up to 60. There are 10 additional observations above 60 with a maximum value of 132.8. We observe that approximately 26% of the 752 ‘Li’ units in Tainan city reported zero intensity values. The distribution of dengue transmission intensity is highly right-skewed with a cluster at zero and a long tail to the right. The continuous positive intensity values range from 0.13 to 121.08, averaging around 7.66 with a large variance of 226.68. These data characteristics suggest that dengue transmission intensities are potentially zero-inflated with a skewed heavy-tailed distribution of the positive outcomes, which poses challenges for statistical analysis. In particular, conventional data transformations may prove ineffective in normalizing the distribution due to the substantial proportion of zero values, rendering models based on normal distributions inappropriate. Other common models, such as those based on lognormal, gamma, or inverse gaussian distributions, may also be unsuitable as they do not account for zero observations in their modeling process. Consequently, the compound Poisson model emerges as a potential candidate for fitting the dengue transmission intensity data. While recent studies have discussed the relevance of the compound Poisson distribution for infectious disease modeling [40, 41], we highlight the interest in exploring its potential within the framework of GWR for dengue analysis. We use the proposed GWCPM approach to effectively handle the distinctive features of dengue transmission intensity outcomes, namely a considerable proportion of zeros, severe right skewness, and large variance. This approach allows us not only to identify potential factors of the dengue intensity index but also to examine whether these impacts are stable over space.

Fig 2. The spatial distribution of dengue transmission intensity at 752 ‘Li’ units in Tainan city for year 2015.

Source: The figure is created by the authors in ArcGIS 10.3. The shapefile used to create the map is publicly available from the Taiwan Ministry of the Interior (https://www.tgos.tw); no copyrighted material was used.

https://doi.org/10.1371/journal.pone.0315327.g002

Fig 3. The partial histogram of dengue transmission intensity (up to 60) in Tainan city for year 2015.

https://doi.org/10.1371/journal.pone.0315327.g003

Model results

We first analyze the data from a global perspective by fitting a non-spatial global CPM as the base model and using an OLS model for comparison. Table 3 shows that the parameter estimates obtained from the two models are quite different, which is in line with our expectations given the different distributional assumptions (CP and normal distributions) of the two models. In both models, the logarithm of median income is negatively associated with dengue intensity; the higher the median income in the area, the lower the intensity of dengue transmission. Conversely, all other variables (excluding intercept) demonstrate positive and statistically significant effects on dengue intensity for CPM. The OLS model reveals a similar pattern, except for the significance of the effects of Maximum BI and total rainy days from June to November. The proportion of the susceptible population may be the most influential risk factor, as indicated by the highest estimated coefficient. We utilize various goodness of fit measures, including R2 (pseudo-R2 for CPM), deviance residuals, mean square prediction error (MSPE), and mean absolute prediction error (MAPE), to compare the models. The results indicate that the CPM outperforms the OLS model. To confirm this finding, we produce the sample quantile-quantile (QQ) plot of the quantile residuals, a tool often used to assess how well the underlying distribution in regression models fits the original data [42]. The results for OLS and CPM are shown in Fig 4. Points that deviate from the reference line indicate poor model performance and an inappropriate distributional assumption for fitting the considered data. It is clear from the graph that the CPM is appropriate and the OLS model seriously violates the normality assumption. The dispersion and power parameters ($\hat{\phi}$ and $\hat{p}$) for the CPM are estimated at 1.4724 and 1.6947 respectively. These observations suggest that a compound Poisson distribution is suitable for global modeling of dengue transmission intensity. In addition, spatial autocorrelation analysis using Moran’s I statistics shows a significant positive correlation in the residuals of the non-spatial CPM model, indicating the need for spatial modeling in the analysis of the data.
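For readers unfamiliar with the residual autocorrelation check, Moran's I can be computed directly from its definition, I = (n / S0) · Σi Σj wij zi zj / Σi zi², where zi are mean-centred residuals and S0 is the sum of all spatial weights. The sketch below uses a hypothetical binary adjacency matrix for units arranged on a line; the spatial weight matrix actually used for the ‘Li’ units is not described in this section, and the significance test is omitted.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a square spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(zi * zi for zi in z)

def chain_weights(n):
    """Hypothetical binary weights: units on a line, neighbours are adjacent."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]

trend = morans_i([1.0, 2.0, 3.0, 4.0, 5.0], chain_weights(5))      # positive
checker = morans_i([1.0, -1.0, 1.0, -1.0], chain_weights(4))       # negative
```

Values near zero are what one hopes to see in the residuals of an adequate spatial model, whereas clearly positive values, as reported for the global CPM, point to spatial structure left unexplained.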

Fig 4. Sample quantile-quantile (QQ) plot of quantile residuals for non-spatial CPM (left panel) and OLS (right panel) models.

https://doi.org/10.1371/journal.pone.0315327.g004

Table 3. Parameter estimates for non-spatial CPM and OLS model.

https://doi.org/10.1371/journal.pone.0315327.t003

For spatial data analysis, we implement GWCPM to investigate the spatial heterogeneity in data relationships. To assess the effectiveness of GWCPM, we also perform a comparison with GWR. Additionally, we conduct an analysis using GWCPM without the boundary constraint p ∈ (1, 2) to allow for analytical flexibility. Both fixed gaussian and adaptive bisquare kernel weighting functions are used for local calibrations. The model results are evaluated by the same criteria considered in Table 3. The metrics in Table 4 show that all GWCPM models (GWCPMa, GWCPMb, GWCPMc, GWCPMd) provide better fits than the non-spatial global CPM (see Table 3) and have significantly lower Moran’s I values in the model residuals. This finding is consistent with the point made by Fotheringham et al. [1] regarding the explanatory power of local models and the potential reduction in error autocorrelation by allowing for spatial heterogeneity in the regression parameters. Among the GWCPM analyses, models with the adaptive bisquare kernel (GWCPMa, GWCPMb) fit the dengue data better than those with fixed kernels (GWCPMc, GWCPMd). The GWCPM with no constraint on the power parameter (GWCPMb) is the best model. This conclusion is further supported by two-way comparisons using the Gini index [43] in Table 5, where the model is chosen based on a “mini-max” argument following Zhang [36]. The GWCPMb has the smallest maximal Gini indices across all rows, indicating better predictive accuracy for dengue transmission intensity.

Turning to the GWR analysis, the GWR model with the fixed Gaussian kernel (GWRa) appears to have slightly better fitting accuracy than the best GWCPM model (GWCPMb) in terms of pseudo-R², MSPE, and MAPE. However, the sample QQ plot of the quantile residuals (Fig 5) shows that the GWRa model fits poorly at both the lower and upper quantiles. This suggests that GWR modeling cannot effectively deal with the substantial proportion of zeros and the skewed, heavy-tailed distribution of the positive dengue intensities. It should be emphasized that GWR typically assumes normally distributed errors. This assumption makes it unsuitable for fitting dengue transmission intensity data with zero-inflation or heavy tails and hence leads to potentially incorrect inferences. In contrast, the sample QQ plot of the quantile residuals for GWCPMb shows that the points mostly lie along the diagonal line, reinforcing the superiority of GWCPM over GWR.

Fig 5. Sample quantile-quantile (QQ) plot of quantile residuals for GWCPMb (left panel) and GWRa (right panel) models.

https://doi.org/10.1371/journal.pone.0315327.g005

Table 6 gives the five-number summary of the GWCPMb estimation results, based on an unbounded power parameter and an adaptive bisquare kernel weighting scheme. The last column of Table 6 reports the results of the Monte Carlo test for spatial nonstationarity based on 999 replications. All the p-values are less than the 5% significance level, which validates the application of the GWCPM model and rules out the need for a semiparametric counterpart.
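The logic of such a Monte Carlo test can be sketched as follows: the variability of the local estimates is compared with the variability obtained after randomly reassigning observations to locations. This sketch is a stand-in under stated assumptions, not the procedure of this study. It uses a Gaussian-kernel weighted least-squares slope for a single covariate instead of the compound Poisson likelihood, takes the variance of the local slopes as the test statistic, and all function names and the toy data are hypothetical.

```python
import math
import random


def local_slopes(coords, x, y, bandwidth):
    """Kernel-weighted least-squares slope of y on x at every location."""
    slopes = []
    for (u0, v0) in coords:
        w = [math.exp(-0.5 * (math.hypot(u - u0, v - v0) / bandwidth) ** 2)
             for (u, v) in coords]
        sw = sum(w)
        xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
        ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
        sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
        slopes.append(sxy / sxx)
    return slopes


def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def nonstationarity_pvalue(coords, x, y, bandwidth, n_rep=999, seed=0):
    """Share of permuted data sets whose local-slope variance reaches the observed one."""
    rng = random.Random(seed)
    observed = variance(local_slopes(coords, x, y, bandwidth))
    shuffled = list(coords)
    exceed = 0
    for _ in range(n_rep):
        rng.shuffle(shuffled)  # reassign observations to random locations
        if variance(local_slopes(shuffled, x, y, bandwidth)) >= observed:
            exceed += 1
    return (exceed + 1) / (n_rep + 1)


# Toy data (assumed): the slope of y on x grows from west to east on a 6 x 6 grid.
rng = random.Random(42)
grid = [(float(u), float(v)) for u in range(6) for v in range(6)]
x = [rng.gauss(0.0, 1.0) for _ in grid]
y = [u * xi + rng.gauss(0.0, 0.1) for (u, _), xi in zip(grid, x)]
print(nonstationarity_pvalue(grid, x, y, bandwidth=1.0, n_rep=199))
```

With 999 replications the smallest attainable p-value is 1/1000, so a p-value below 5% means that fewer than 50 of the permuted data sets showed at least as much variation in the local estimates as the observed data.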

To gain a deeper understanding of the estimated varying relationships across space, we employ the mapping technique proposed by Matthews and Yang [44] to visualize the results of the GWCPMb. Due to space constraints, we do not include all maps and limit the presentation to two covariates, the logarithm of population density and the mean days of MST. Both are known to play important roles in dengue transmission intensity [39]. Figs 6 and 7 show the spatial distributions of the local parameter estimates for these two variables; only the areas with significant effects at the 5% level are colored. Several findings can be drawn from the figures. First, as shown in Fig 6, larger local estimates of the logarithm of population density are observed in some parts along the coast (e.g., Cigu, Jiangjun, and Beimen districts). This means that, in these areas, ‘Li’ units with greater population density are more favorable for dengue transmission. Certain ‘Li’ units in northern Tainan (Yanshui, Xinying districts) and on the east side of central districts (e.g., Rende, Guiren, Guanmiao, Xinhua, New Downtown districts) also exhibit a positive but weaker ecological association than the coastal regions. Second, in Fig 7, the locations surrounding inner downtown Tainan (e.g., Rende, Guiren, Guanmiao, Yongkang, Xinhua, Xijang, Anan, New Downtown districts) as well as some northwestern areas (Haoying, Xuejia districts) show strong, significant positive relationships, meaning that dengue transmission intensity rises with increasing mean days of MST. By contrast, the general pattern of negative local estimates in the same map indicates that the intensity of dengue transmission is less susceptible to increases in the mean days of MST in parts of northern Tainan (Baihe and Dongshan districts). Finally, both maps in Figs 6 and 7 clearly depict the substantial spatial heterogeneity in the relationships between the two covariates and dengue transmission intensity. Significant effects are concentrated in specific parts of Tainan city, rather than being observed uniformly across all ‘Li’ units. Such locality cannot be revealed by conventional global CPM models.

Fig 6. Local estimates of the logarithm of population density from the GWCPMb model.

Source: The figure is created by the authors in ArcGIS 10.3. The shapefile used to create the map is publicly available from the Taiwan Ministry of the Interior (https://www.tgos.tw); no copyrighted material was used.

https://doi.org/10.1371/journal.pone.0315327.g006

Fig 7. Local estimates of the mean MST days from the GWCPMb model.

Source: The figure is created by the authors in ArcGIS 10.3. The shapefile used to create the map is publicly available from the Taiwan Ministry of the Interior (https://www.tgos.tw); no copyrighted material was used.

https://doi.org/10.1371/journal.pone.0315327.g007

As described in the Methodology section, GWCPM modeling includes varying power and dispersion parameters to characterize the relationship between the mean and variance at each location, which allows it to capture various data characteristics of nonnegative continuous outcomes. Thus, it is of interest to examine the spatial pattern of this power mean-variance relationship in the case study. The corresponding maps of the estimated power and dispersion parameters computed from the GWCPMb are reported in Figs 8 and 9. We see from Fig 8 that the power index estimates are highest, with values close to 3, in the inner downtown area of Tainan city and decrease outward. The estimates of the varying dispersion parameter exhibit the reverse pattern (Fig 9). This pattern suggests an inverse relationship between the power and dispersion parameters, consistent with the CPM literature [26, 36]. We note that the ‘Li’ units in inner downtown Tainan in particular (West Central, North, East, South, Yongkang, Anping districts) suffered the dengue outbreak in 2015, resulting in high dengue transmission intensities with large variations. Therefore, the varying estimates for the power and dispersion parameters correspond to this reality and successfully capture the data characteristics of dengue intensities.
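To make the power mean-variance relationship concrete, the sketch below simulates from a compound Poisson-gamma distribution and compares the sample variance with φμ^p and the share of exact zeros with exp(−λ). It relies on the standard Tweedie parameterization for 1 < p < 2, in which the response is a Poisson(λ) sum of gamma variables. The symbols μ, φ, p and λ, the function names, and the chosen parameter values are assumptions for illustration, and the sketch does not cover power estimates outside (1, 2) such as those permitted in GWCPMb.

```python
import math
import random


def cp_components(mu, phi, p):
    """Poisson rate and gamma shape/scale implied by (mu, phi, p), for 1 < p < 2."""
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))
    shape = (2.0 - p) / (p - 1.0)
    scale = phi * (p - 1.0) * mu ** (p - 1.0)
    return lam, shape, scale


def poisson_draw(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate(mu, phi, p, size, seed=0):
    """Draw `size` compound Poisson-gamma values with mean mu and variance phi * mu**p."""
    rng = random.Random(seed)
    lam, shape, scale = cp_components(mu, phi, p)
    draws = []
    for _ in range(size):
        n = poisson_draw(lam, rng)
        # A sum of n gamma(shape, scale) variables is gamma(n * shape, scale);
        # n = 0 yields an exact zero.
        draws.append(rng.gammavariate(n * shape, scale) if n > 0 else 0.0)
    return draws


# Illustrative parameter values (assumed, not estimates from the dengue data).
mu, phi, p = 2.0, 1.5, 1.5
sample = simulate(mu, phi, p, size=20000)
mean = sum(sample) / len(sample)
var = sum((s - mean) ** 2 for s in sample) / len(sample)
zero_share = sum(s == 0.0 for s in sample) / len(sample)
print("sample mean", round(mean, 3), "target", mu)
print("sample variance", round(var, 3), "target", round(phi * mu ** p, 3))
print("share of zeros", round(zero_share, 3),
      "target", round(math.exp(-cp_components(mu, phi, p)[0]), 3))
```

Under this parameterization the variance φμ^p depends jointly on p and φ, so a larger power parameter can be offset by a smaller dispersion parameter at a given mean, which is in line with the reverse spatial patterns described for Figs 8 and 9.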

Fig 8. Local estimates of the power parameter from the GWCPMb model.

Source: The figure is created by the authors in ArcGIS 10.3. The shapefile used to create the map is publicly available from the Taiwan Ministry of the Interior (https://www.tgos.tw); no copyrighted material was used.

https://doi.org/10.1371/journal.pone.0315327.g008

Fig 9. Local estimates of the dispersion parameter from the GWCPMb model.

Source: The figure is created by the authors in ArcGIS 10.3. The shapefile used to create the map is publicly available from the Taiwan Ministry of the Interior (https://www.tgos.tw); no copyrighted material was used.

https://doi.org/10.1371/journal.pone.0315327.g009

Conclusion and discussion

GWR has become a widely used approach to explore spatial heterogeneity in data relationships. Despite its popularity, GWR encounters difficulties when modeling nonnegative continuous or so-called semi-continuous outcomes often characterized by significant numbers of zeros and a highly right-skewed distribution of positive values. This paper introduces the GWCPM to resolve such limitations in local spatial modeling approaches. The GWCPM builds upon the traditional non-spatial CPM within the GWR framework. Its primary advantage is the capability to handle the intricate structure and distinctive characteristics of semi-continuous data in local modeling through a single compound Poisson distribution, wherein the distribution variance is expressed as a function of the mean using both dispersion and power index parameters. This power mean-variance relationship, unique to the compound Poisson distribution, offers the flexibility to capture varying levels of data dispersion and tackle issues like zero-inflation and heavy tails. As a result, the GWCPM serves as an effective tool for analyzing spatial data with semi-continuous response variables while allowing for the examination of spatially varying relationships. We also present a semiparametric version of the GWCPM to accommodate spatially invariant parameters, thus enhancing analytical flexibility by controlling for constant factors across different locations. In simulation experiments, we demonstrate the reliable performance of both the GWCPM and its semiparametric counterpart, validating their effectiveness in modeling spatial heterogeneity in semi-continuous data.

Our GWCPM has demonstrated its practical applicability through an empirical analysis of dengue transmission data from Tainan City in southern Taiwan. The results reveal several key findings. On the one hand, we find significant spatially varying relationships between dengue transmission and its risk factors, which cannot be captured by global models. For instance, the global CPM indicates a broad positive relationship between mean MST days and dengue transmission, consistent with existing literature [39]. In contrast, GWCPM identifies more localized variations, showing negative associations in parts of northern Tainan and positive associations in southern regions. This observation echoes the view that space plays a crucial role in influencing the magnitude and direction of effects among variables of interest (e.g., [29, 39, 45, 46]). On the other hand, the GWCPM analysis underscores the necessity for a place-based approach in developing intervention strategies against dengue transmission at the local level. A one-size-fits-all policy may overlook crucial spatial nuances, as evidenced by the significant nonstationarity detected by GWCPM. For example, reducing urbanization in coastal areas of Tainan, where population density is positively associated with transmission intensity, may help control outbreaks. Public health interventions in these regions could benefit from monitoring land cover use and understanding urban dynamics and environmental changes [45–48]. Similarly, climate-related vector control efforts, such as reducing mosquito breeding habitats and deploying mosquito traps, should be prioritized in southern areas with high MST days. This suggestion aligns with previous studies emphasizing the importance of localized modeling in epidemiological or public health research [9, 11, 29]. Moreover, compared to the GWR results, GWCPM effectively handles the dengue transmission data with excessive zeros and severe right skewness. Our proposed method thus offers a robust framework and blueprint for dengue data analysis, enabling more precise assessments of spatially varying relationships.

Some issues warrant further discussion. First, the GWCPM method assumes that the response variable follows a compound Poisson distribution, which belongs to the Tweedie exponential dispersion family and specifies variance proportional to a power of the mean [24]. Should this assumption be violated, the validity of GWCPM may be compromised. While the QQ plot of the quantile residuals offers a straightforward empirical assessment of this assumption, additional sensitivity analyses and simulations are necessary to investigate the effects of the dispersion and power index parameters on the model’s performance. In addition, given that GWCPM is developed in the context of GWR, it may be limited in capturing complex nonlinear relationships. This limitation might be addressed by drawing on the geographically neural network weighted regression (GNNWR) model recently introduced by Du et al. [15], which utilizes a spatially weighted neural network in place of the kernel function. Such a possibility deserves future investigation. It is also worth noting that in spatial data analysis, researchers typically seek to identify locations where the estimated regression coefficients are statistically significant, in addition to parameter estimation and model prediction. Although GNNWR has demonstrated superior fitting accuracy and predictive performance relative to GWR, the main strength of the proposed GWCPM lies in the statistical inference of local model parameters. Specifically, GWCPM considers variance structures and different shapes of the response variable distribution through the power mean-variance specification, ensuring reliable standard error estimation and robust inferences.

Furthermore, statistical inference in GWCPM currently relies on asymptotic theory, which sometimes presents problems when the Hessian matrix is not positive definite. In the GWR literature, bootstrapping [49] has been suggested as an inferential framework for GWR-type modeling techniques, including the mixed GWR [33, 50], semiparametric geographically weighted generalized linear models [34], and geographically weighted quantile regression [3]. Therefore, further research in this regard could establish bootstrap inference for GWCPM. Another limitation pertains to the development of the semiparametric GWCPM, as inference on local parameter estimates has not yet been established for the model. Future work could also employ the bootstrap method to address this concern. Additionally, recent studies have adopted bootstrap tests to determine constant coefficients. Similar efforts can be made not only to complement the semiparametric GWCPM framework but also to provide a model selection tool for data analysis. Finally, future work should include an empirical application of the semiparametric GWCPM to illustrate its utility in allowing some regression parameters to remain spatially invariant under appropriate circumstances.

In summary, this study introduces the GWCPM and demonstrates its strength in examining spatially varying coefficients for semi-continuous dependent variables. From both theoretical and empirical standpoints, this approach holds significant potential for advancing geographical analysis and fostering new discussions in the field.

References

  1. Fotheringham AS, Brunsdon C, Charlton M. Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Chichester: John Wiley & Sons; 2002.
  2. Harris P. A simulation study on specifying a regression model for spatial data: choosing between autocorrelation and heterogeneity effects. Geographical Analysis. 2019; 51(2):151–181.
  3. Chen VYJ, Yang TC, Matthews SA. Exploring heterogeneities with geographically weighted quantile regression: An enhancement based on the bootstrap approach. Geographical Analysis. 2020; 52:642–661. pmid:33888913
  4. Brunsdon C, Fotheringham AS, Charlton M. Geographically weighted regression-modelling spatial non-stationarity. Journal of the Royal Statistical Society Series D (The Statistician). 1998; 47(3):431–443.
  5. Huang B, Wu B, Barry M. Geographically and temporally weighted regression for modeling spatio-temporal variation in house prices. International Journal of Geographical Information Science. 2010; 24(3):383–401.
  6. Fotheringham AS, Crespo R, Yao J. Geographical and temporal weighted regression (GTWR). Geographical Analysis. 2015; 47(4):431–452.
  7. Nakaya T, Fotheringham AS, Brunsdon C, Charlton M. Geographically weighted Poisson regression for disease association mapping. Statistics in Medicine. 2005; 24:2695–2717. pmid:16118814
  8. Silva AR, Rodrigues TCV. Geographically Weighted Negative Binomial Regression—incorporating overdispersion. Statistics and Computing. 2014; 24:769–783.
  9. Shoff C, Chen VYJ, Yang TC. When homogeneity meets heterogeneity: the geographically weighted regression with spatial lag approach to prenatal care utilization. Geospatial Health. 2014; 8(2):557–568. pmid:24893033
  10. Geniaux G, Martinetti D. A new method for dealing simultaneously with spatial autocorrelation and spatial heterogeneity in regression models. Regional Science and Urban Economics. 2018; 72:74–85.
  11. Chen VYJ, Deng WS, Yang TC, Matthews SA. Geographically weighted quantile regression (GWQR): An application to US mortality data. Geographical Analysis. 2012; 44(2):134–150. pmid:25342860
  12. Chen VYJ, Yang TC, Jian HL. Geographically weighted regression modeling for multiple outcomes. Annals of the Association of American Geographers. 2022; 112(5):1278–1295.
  13. Li L. Geographically weighted machine learning and downscaling for high-resolution spatiotemporal estimations of wind speed. Remote Sensing. 2019; 11(11):1378.
  14. Kalogirou S. Destination choice of Athenians: An application of geographically weighted versions of standard and zero inflated Poisson spatial interaction models. Geographical Analysis. 2016; 48(2):191–230.
  15. Du Z, Wang Z, Wu S, Zhang F, Liu R. Geographically neural network weighted regression for the accurate estimation of spatial non-stationarity. International Journal of Geographical Information Science. 2020; 34(7):1353–1377.
  16. Reich BJ, Fuentes M, Herring AH, Evenson KR. Bayesian variable selection for multivariate spatially varying coefficient regression. Biometrics. 2010; 66(3):772–782. pmid:19817742
  17. Neelon B, O’Malley AJ, Smith VA. Modeling zero-modified count and semicontinuous data in health services research, part 1: Background and overview. Statistics in Medicine. 2016a; 35:5070–5093. pmid:27500945
  18. Neelon B, O’Malley AJ, Smith VA. Modeling zero-modified count and semicontinuous data in health services research, part 2: Case studies. Statistics in Medicine. 2016b; 35:5094–5111. pmid:27500973
  19. Swallow B, Buckland ST, King R, Toms MP. Bayesian hierarchical modelling of continuous non‐negative longitudinal data with a spike at zero: An application to a study of birds visiting gardens in winter. Biometrical Journal. 2016; 58(2):357–371. pmid:25737026
  20. Dreassi E, Petrucci A, Rocco E. Small area estimation for semicontinuous skewed spatial data: an application to the grape wine production in Tuscany. Biometrical Journal. 2014; 56(1):141–156. pmid:24214421
  21. Neelon B, Zhu L, Neelon SEB. Bayesian two-part spatial models for semicontinuous data with application to emergency department expenditures. Biostatistics. 2015; 16:465–479. pmid:25649743
  22. Paradinas I, Conesa D, López-Quílez A, Bellido JM. Spatio-temporal model structures with shared components for semi-continuous species distribution modelling. Spatial Statistics. 2017; 22:434–450.
  23. Liu L, Strawderman RL, Johnson BA, O’Quigley JM. Analyzing repeated measures semi-continuous data, with application to an alcohol dependence study. Statistical Methods in Medical Research. 2012; 25(1):133–152. pmid:22474003
  24. Jørgensen B. The Theory of Dispersion Models. London: Chapman & Hall; 1997.
  25. Bonat WH, Jørgensen B. Multivariate covariance generalized linear models. Journal of the Royal Statistical Society: Series C (Applied Statistics). 2016; 65(5):649–675.
  26. Bonat WH, Kokonendji CC. Flexible Tweedie regression models for continuous data. Journal of Statistical Computation and Simulation. 2017; 87(11):2138–2152.
  27. Kurz CF. Tweedie distributions for fitting semicontinuous health care utilization cost data. BMC Medical Research Methodology. 2017; 17(1):171. pmid:29258428
  28. Arcuti S, Pollice A, Ribecco N, D’Onghia G. Bayesian spatiotemporal analysis of zero-inflated biological population density data by a delta-normal spatiotemporal additive model. Biometrical Journal. 2016; 58(2):372–386. pmid:26418888
  29. Lin CH, Wen TH. Using geographically weighted regression (GWR) to explore spatial varying relationships of immature mosquitoes and human densities with the incidence of dengue. International Journal of Environmental Research and Public Health. 2011; 8(7):2798–2815. pmid:21845159
  30. Dunn PK, Smyth G. Series evaluation of Tweedie exponential dispersion model densities. Statistics and Computing. 2005; 15(4):267–280.
  31. Dunn PK, Smyth G. Evaluation of Tweedie exponential dispersion model densities by Fourier inversion. Statistics and Computing. 2008; 18(1):73–86.
  32. Dunn PK. tweedie: Tweedie exponential family models. R package version 2.1.7; 2013.
  33. Mei CL, Xu M, Wang N. A bootstrap test for constant coefficients in geographically weighted regression models. International Journal of Geographical Information Science. 2016; 30(8):1622–1643.
  34. Li D, Mei C. A two-stage estimation method with bootstrap inference for semi-parametric geographically weighted generalized linear models. International Journal of Geographical Information Science. 2018; 32(9):1860–1883.
  35. Páez A, Farber S, Wheeler D. A simulation-based study of geographically weighted regression as a method for investigating spatially varying relationships. Environment and Planning A. 2011; 43(12):2992–3010.
  36. Zhang Y. Likelihood-based and Bayesian methods for Tweedie compound Poisson linear mixed models. Statistics and Computing. 2013; 23:743–757.
  37. Wen TH, Lin NH, Lin CH, King CC, Su MD. Spatial mapping of temporal risk characteristics to improve environmental health risk identification: a case study of a dengue epidemic in Taiwan. Sci Total Environ. 2006; 367:631–640. pmid:16584757
  38. Wen TH, Lin NH, Chao DY, Hwang KP, Kan CC, Lin KCM, et al. Spatial-temporal patterns of dengue in areas at risk of dengue hemorrhagic fever in Kaohsiung, Taiwan, 2002. International Journal of Infectious Diseases. 2010; 14(4):334–343. pmid:19716331
  39. Chen THK, Chen VYJ, Wen TH. Revisiting the role of rainfall variability and its interactive effects with the built environment in urban dengue outbreaks. Applied Geography. 2018; 101:14–22.
  40. Enki DG, Noufaily A, Farrington P, Garthwaite P, Andrews N, Charlett A. Taylor’s power law and the statistical modelling of infectious disease surveillance data. Journal of the Royal Statistical Society Series A: Statistics in Society. 2017; 180(1):45–72.
  41. Watts M, Kotsilla P, Mortyn PG, Monteys VSi, Brancati CU. Influence of socio-economic, demographic and climate factors on the regional distribution of dengue in the United States and Mexico. International Journal of Health Geographics. 2020; 19:44. pmid:33138827
  42. Dunn PK, Smyth GK. Randomized Quantile Residuals. Journal of Computational and Graphical Statistics. 1996; 5:1–10.
  43. Frees EW, Meyers G, Cummings DA. Summarizing insurance scores using a Gini index. Journal of the American Statistical Association. 2011; 495:1085–1098.
  44. Matthews SA, Yang T-C. Mapping the results of local statistics: Using geographically weighted regression. Demographic Research. 2012; 26:151. pmid:25578024
  45. Jiang L, Lai Y, Guo R, Li X, Hong W, Tang X. Measuring the impact of government intervention on the spatial variation of market-oriented urban redevelopment activities in Shenzhen, China. Cities. 2024; 147:104834.
  46. Yu B, Zhou X. Land finance and urban Sprawl: Evidence from prefecture-level cities in China. Habitat International. 2024; 148:103074.
  47. He Q, Xia P, Hu C, Li B. Public information, actual intervention and inflation expectations. Transformations in Business & Economics. 2022; 21(57C):644–666.
  48. Gong X, Hou Z, Wan Y, Zhong Y, Zhan M, Lv K. Multispectral and SAR image fusion for multi-scale decomposition based on least squares optimization rolling guidance filtering. IEEE Transactions on Geoscience and Remote Sensing. 2024; 62:5401920.
  49. Efron B. Bootstrap methods: another look at the jackknife. The Annals of Statistics. 1979; 7(1):1–26.
  50. Harris P, Brunsdon C, Lu B, Nakaya T, Charlton M. Introducing bootstrap methods to investigate coefficient non-stationarity in spatial regression models. Spatial Statistics. 2017; 21:241–261.