Abstract
Spatial Autoregressive (SAR) models are widely used to analyze interactions among regions. However, the traditional model assumes a constant spatial autocorrelation coefficient, which fails to effectively capture spatial heterogeneity. To address this issue, we propose a novel Spatial Single-Index Varying Coefficient Autoregressive (SSIVCAR) model. By introducing a single-index varying coefficient function, this model allows the spatial correlation strength to change with the characteristics of spatial units, thereby capturing spatial dependence relationships more accurately. To estimate the model parameters, we combine spline methods with two-stage least squares, and we assess the model’s finite-sample performance through Monte Carlo simulations. The simulation results show that the proposed model performs significantly better in capturing spatial heterogeneity and improving estimation accuracy. Finally, we apply the model to analyze the impact of digital economy development on environmental quality and find that it has significant heterogeneous effects across different regions. This study provides a new framework for analyzing complex spatial dependence structures and offers valuable insights for regional governance policies.
Citation: Zhao J, Pu Y (2025) Characterization and estimation of heterogeneous spatial autocorrelation in spatial autoregressive models. PLoS One 20(7): e0327316. https://doi.org/10.1371/journal.pone.0327316
Editor: Mohamed R. Abonazel, Cairo University, EGYPT
Received: February 16, 2025; Accepted: June 12, 2025; Published: July 1, 2025
Copyright: © 2025 Zhao, Pu. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: https://www.stats.gov.cn/sj/ndsj/ https://cnki.nbsti.net/CSYDMirror/trade/yearbook/Single/N2023070131?z=Z012 https://geodata.ucdavis.edu/gadm/gadm4.1/shp/gadm41_CHN_shp.zip
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Spatial statistics is a distinct and well-established discipline focused on analyzing spatially referenced data. In recent years, it has played an increasingly important role in the study of environmental economics, regional development, and related fields. A central feature of spatial data is spatial dependence, which captures the interrelatedness of observations across geographic space. For example, Anselin [1] pointed out that the diffusion of air pollutants often crosses administrative boundaries, leading to a high correlation in environmental quality between regions; LeSage [2] further emphasized that ignoring spatial dependence may lead to significant bias in estimating policy effects. However, in addition to spatial dependence, spatial heterogeneity between regions, such as differences in economic structure, industrial layout, and population distribution, also significantly affects model identification and policy formulation.
The traditional Spatial Autoregressive (SAR) model [3], which captures spatial dependence through a fixed parameter $\rho$, has become the benchmark tool in empirical research due to its simplicity. However, this homogeneity assumption often fails in real-world scenarios. For example, Feng [4] found that in the Yangtze River Delta region, the spatial correlation of industrial pollution emissions is as high as 0.8 due to the close collaboration of industrial chains, whereas in ecologically fragile western regions, the spatial correlation is below 0.3 due to geographic barriers. Ignoring this heterogeneity may lead to an underestimation of the collaborative governance needs in economically developed areas, while overestimating the independent emission reduction potential in less developed areas.
To address this issue, researchers have attempted improvements from multiple perspectives. One approach introduces spatial heterogeneity structures, such as multiscale models [5,6,20] and spatial varying-coefficient models [7–9], by assigning different parameters across regions to capture regional differences [10]. These methods have alleviated the limitations of the homogeneity assumption to some extent, but there are still difficulties in handling high-dimensional feature spaces and capturing dynamic patterns. Another line of research adopts nonparametric or semiparametric methods, attempting to flexibly capture spatial heterogeneity without pre-setting specific functional forms [11]. Although these methods improve model flexibility, they still focus on explaining the relationships between variables and tend to overlook the variation in the spatial autoregressive coefficient $\rho$. In these methods, spatial effects are often simplified or omitted, making it difficult to fully reveal the differences in spatial dependence between spatial units. Therefore, these methods remain limited in addressing the combined effects of spatial heterogeneity and spatial dependence [12].
To address these limitations, we propose a novel spatial single-index varying coefficient autoregressive (SSIVCAR) model. Unlike the SAR model, SSIVCAR links regional characteristics $u_i$ to the spatial autoregressive coefficient via a nonlinear single-index function $g(\cdot)$. This design captures the heterogeneity in spatial correlation and reveals how regional characteristics influence spatial dependence, enabling more accurate modeling of heterogeneous spatial structures. It relaxes the constant correlation assumption in the SAR model, allowing spatial effects to vary across regions. While maintaining the basic structure of the SAR model, the second part still adopts a linear regression model form, preserving the simplicity and interpretability of the model. The model also ensures computational efficiency when dealing with large-scale spatial data by combining spatial correlation with linear covariate relationships. Furthermore, considering the endogeneity issue, we adopt a joint estimation method of spline functions and two-stage least squares (2SLS) to eliminate potential endogeneity bias and improve the accuracy of parameter estimates.
Monte Carlo simulations were conducted to evaluate the performance of the proposed model under finite sample conditions. The results show that the SSIVCAR model outperforms the SAR model in terms of estimation accuracy and robustness. Under conditions of strong spatial heterogeneity, the proposed model significantly improves the reliability of the estimation results. Finally, we apply the SSIVCAR model to examine the impact of digital economy development on environmental quality. The empirical findings indicate that digital economy development has a significant and heterogeneous impact on various types of environmental pollution.
The remainder of the paper is organized as follows. Sect 2 discusses related works; Sect 3 introduces and derives the SSIVCAR model, including its basic framework and estimation methods; Sect 4 presents Monte Carlo simulations to evaluate the model’s performance under finite samples; Sect 5 provides an empirical analysis, applying the SSIVCAR model to investigate the impact of digital economy development on environmental pollution in China; Sect 6 concludes the paper by summarizing the main findings and suggesting directions for future research.
Related works
The SAR model serves as a foundational framework for analyzing spatial dependence. Its classic form was systematically developed by Anselin [1], characterizing interactions between neighboring units through a spatial lag term. LeSage and Pace [2] further expanded the Bayesian estimation method and developed a spatial econometrics toolbox that is widely used in policy evaluation. To address endogeneity, Kelejian [15] proposed the generalized spatial two-stage least squares (GS2SLS), providing a theoretical basis for instrumental variable selection. Elhorst et al. [16] developed a dynamic spatial panel model that allows for both temporal and spatial dependence; Lee et al. [12] derived the asymptotic properties of the quasi-maximum likelihood estimator (QMLE). In recent years, high-dimensional data analysis has driven SAR models toward sparse modeling. Lam et al. [17] proposed a penalized maximum likelihood method suitable for variable selection in high-dimensional settings. Additionally, Ejigu [38] introduced a covariate-dependent weighting matrix that integrates spatial proximity and observed characteristics. This approach challenges the traditional assumption that geographic closeness alone defines neighborhood structure and demonstrates improved model fit in environmental applications. These developments extend the flexibility of SAR-type models by allowing spatial dependence to vary with contextual covariates.
To address the limitations of global spatial models in capturing local heterogeneity, researchers have developed the geographically weighted regression (GWR) framework. GWR performs localized parameter estimation through a spatial kernel function, as systematically outlined by Brunsdon and Fotheringham [18]. Brunsdon et al. [19] proposed an adaptive bandwidth selection algorithm that optimizes the kernel range via cross-validation. To enhance model flexibility, Comber et al. [20] developed the multi-scale GWR (MGWR), which allows different variables to operate at different spatial scales. With the development of machine learning, Hagenauer [21] integrated GWR with artificial neural networks, while Lu et al. [22] proposed a covariate-weighted GWR (CS-GWR) to address dimensionality issues via variable selection. However, GWR methods face two major limitations: First, they rely heavily on geographic distance weights, limiting their ability to model heterogeneity arising from socio-economic factors [14]; Second, their computational complexity grows quadratically with sample size, limiting scalability in big data contexts [23]. To partially address these issues, Ejigu [41] applied Bayesian geostatistical models to analyze modern contraceptive use in Ethiopia. By incorporating individual- and service-level covariates, the study demonstrated how flexible spatial models can better capture social and demographic heterogeneity beyond geographic proximity.
Spatially varying coefficient models (SVCM) have also garnered considerable attention for modeling local structural variation. The original varying coefficient model framework was proposed by Hastie and Tibshirani [24], enhancing model flexibility through localized parameterization. Aquaro et al. [10] extended this framework to spatial settings by developing heterogeneous coefficient spatial regression models. Yang [25] proposed instrumental variable quantile estimation with nonparametric spatial weights, and Sun [26] introduced local linear methods to improve estimation efficiency. More recently, Chen et al. [27] adopted spline basis functions to approximate varying coefficient surfaces and established corresponding asymptotic theory; Liu [28] developed an adaptive LASSO penalized SVCM for simultaneous variable selection and estimation. However, SVCM methods still face two key challenges: the curse of dimensionality when handling high-dimensional covariates [29], and limited attention to spatial lag endogeneity [30]. Furthermore, most existing approaches assume stationarity in spatial dependence. Addressing this, Ejigu [39] proposed a geostatistical model incorporating covariate-driven nonstationarity in the covariance structure. Simulation results and case studies in disease mapping show that ignoring nonstationarity can lead to biased inference and misestimated uncertainty. Moreover, Ejigu [40] applied spatial models to map malaria risk in Mozambique, identifying social and environmental determinants of childhood infection. These studies underscore the practical relevance of flexible spatial modeling in real-world epidemiological applications.
In the realm of semiparametric modeling, the single-index model (SIM) strikes a balance between flexibility and interpretability by reducing dimensionality. Its theoretical foundation was established by Ichimura [31], who proposed a semi-parametric least squares estimator. Yu and Ruppert [32] introduced penalized spline estimation to control model complexity and mitigate overfitting. In spatial econometrics, Klein et al. [33] developed a single-index spatial error model (SI-SEM) to capture nonlinear housing price diffusion mechanisms; Li [34] embedded the single index structure into a spatial lag framework and proposed a semiparametric SAR model; Guan [35] extended this framework to construct a single-index spatially varying coefficient model (SI-SVCM), using splines to approximate locally varying effects. However, existing SIM-based models seldom address heterogeneity in spatial lag parameters or account for endogeneity within a unified framework. These gaps motivate the development of more general spatial semiparametric frameworks capable of capturing complex spatial interaction patterns.
Model
Preliminary experiment
The SAR model is widely used in spatial econometrics to capture spatial dependence and autocorrelation. The standard form of the SAR model is given by [1]:
$$Y = \rho W Y + X\beta + \varepsilon, \tag{1}$$
where $Y$ is the dependent variable, $W$ is the spatial weight matrix, $\rho$ is the spatial autoregressive coefficient, $X$ is the matrix of explanatory variables, $\beta$ is the regression coefficient, and $\varepsilon$ is the error term. The spatial correlation coefficient $\rho$ in the model is assumed to be a constant, representing the overall degree of spatial autocorrelation in the sample.
However, the assumption that $\rho$ is a constant implies that spatial correlation is the same across all samples, which may not hold in practice. In many real-world applications, spatial effects may vary across regions or units. For example, in regional economic studies, the economic behaviors of neighboring regions may be influenced by different spatial effects. Therefore, assuming that all samples share the same spatial correlation coefficient $\rho$ may be overly simplistic and neglect the issue of spatial heterogeneity.
This raises a fundamental question: Should the spatial correlation coefficient be assumed identical across all observations? In other words, can the model be extended to allow observation-specific spatial correlation coefficients to better capture spatial dependence and heterogeneity? The answer to this question has important implications for whether the standard SAR model should be extended.
To test this hypothesis, we conducted a preliminary experiment to examine the variation in spatial correlation coefficients across different spatial units. Specifically, we defined a square region and divided it into a $15 \times 15$ grid, yielding 225 spatial units. In the experiment, the spatial correlation coefficient $\rho_i$ for each unit was assigned a different value, following a sine function to simulate the heterogeneous spatial dependence that may exist across regions. The regression coefficients $\beta$ were held fixed, and the explanatory variables X were generated using four different distributions: uniform distribution, normal distribution, chi-square distribution, and beta distribution. Specifically, for the uniform distribution, X was uniformly distributed within the interval [0,1]; for the normal distribution, X was drawn from a normal distribution with a mean of 0 and variance of 1; for the chi-square distribution, X was sampled from a chi-square distribution with 2 degrees of freedom; and for the beta distribution, X was generated from a beta distribution with fixed shape parameters. The error term $\varepsilon$ was drawn from a normal distribution with a mean of 0 and a variance of 1. To ensure the robustness of the results, the experiment was repeated 300 times using a Monte Carlo simulation. Table 1 summarizes the results from applying the SAR model across 300 simulations for each distribution.
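The data-generating step of such an experiment can be sketched in Python as follows. This is a minimal illustration rather than the code used for Table 1: the function name, the ring-shaped neighborhood, the particular sine pattern for $\rho_i$, and the values of $\beta$ are all assumptions chosen for the example.

```python
import numpy as np

def simulate_heterogeneous_sar(W, rho, X, beta, sigma=1.0, rng=None):
    """Draw y from y = diag(rho) W y + X beta + eps with unit-specific rho.

    Returns (y, eps) so the structural equation can be checked afterwards.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = W.shape[0]
    eps = rng.normal(0.0, sigma, size=n)
    # Reduced form: y = (I - diag(rho) W)^{-1} (X beta + eps)
    A = np.eye(n) - np.diag(rho) @ W
    y = np.linalg.solve(A, X @ beta + eps)
    return y, eps

# Hypothetical layout: a ring of n units, each linked to its two neighbours.
n = 12
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5   # row-normalised weights

rng = np.random.default_rng(0)
rho = 0.5 + 0.3 * np.sin(2 * np.pi * np.arange(n) / n)  # assumed sine pattern
X = rng.uniform(0.0, 1.0, size=(n, 2))                   # uniform covariates
beta = np.array([1.0, 2.0])                              # assumed true values
y, eps = simulate_heterogeneous_sar(W, rho, X, beta, rng=rng)
```

Because every $\rho_i$ stays below 1 in absolute value and $W$ is row-normalised, the matrix $I - \mathrm{diag}(\rho)W$ is invertible, so the reduced form is well defined. Fitting a constant-$\rho$ SAR model to data generated this way is what exposes the bias discussed next.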
The estimation results in Table 1 clearly demonstrate that the SAR model is inadequate for capturing heterogeneous spatial dependence. Under the uniform distribution, the estimated spatial correlation coefficient deviates significantly from the true value, showing that the SAR model tends to overestimate $\rho$ when spatial correlation varies across regions. Additionally, the two regression coefficient estimates are heavily biased, deviating by 191.5% and 123.7% from their true values, respectively. The model also overestimates the variance parameter $\sigma^2$ by more than 37 times. These biases are not unique to the uniform distribution. Under the normal distribution, $\hat{\rho}$ is far from the true value, indicating a failure of the model to correctly capture spatial dependence. The estimates of the two regression coefficients also show significant bias, deviating from the true values by 24.8% and 22.8%, respectively. The model also overestimates $\sigma^2$. Under the chi-square distribution, the model performs slightly better in estimating $\rho$, but the estimates of the regression coefficients remain biased, and $\hat{\sigma}^2$ is grossly overestimated at 25.6878. The best performance is seen with the beta distribution, where $\hat{\rho}$ is closer to the true value but still exhibits bias. Similarly, the regression coefficient estimates deviate significantly from their true values, and $\sigma^2$ is still overestimated. In all cases, the SAR model fails to accurately estimate the spatial correlation, regression coefficients, and variance, confirming its limitations when dealing with spatially heterogeneous data.
Spatial single-index varying coefficient autoregressive model
In the SAR model, spatial autocorrelation is primarily governed by the spatial weight matrix W and the spatial autoregressive coefficient $\rho$. The matrix W reflects the adjacency relationships among geographic locations or spatial units and is typically constructed using an adjacency matrix, a distance matrix, or a standardized matrix based on specific rules. The spatial correlation coefficient $\rho$ quantifies the strength of association among spatial units, with its value usually ranging between –1 and 1; negative and positive values correspond to negative and positive spatial effects, respectively.
However, as demonstrated in the previous section, the assumption of a constant spatial autoregressive coefficient imposes inherent limitations on the model. Specifically, it fails to capture spatial heterogeneity, where the degree of spatial dependence may vary across different spatial units. In practical applications, spatial interactions are often influenced by region-specific factors such as geographic features, socioeconomic conditions, and policy environments. As a result, imposing a globally constant $\rho$ may oversimplify and misrepresent the true spatial structure. To better capture underlying spatial heterogeneity, the model should be extended to allow unit-specific spatial autoregressive coefficients.
Consider the following SSIVCAR model:
$$y_i = g(u_i^{\top}\theta)\sum_{j=1}^{n} w_{ij}\,y_j + x_i^{\top}\beta + \varepsilon_i, \quad i = 1,\ldots,n, \tag{2}$$
where $y_i$ denotes the response variable, and the single-index parameter $\theta$ and the regression coefficient $\beta$ are parameters to be estimated. The element $w_{ij}$ denotes the $(i,j)$th entry of the given $n \times n$ spatial adjacency matrix $W$, and $x_i$ represents the observed variables. The error term $\varepsilon_i$ is assumed to be independent and normally distributed with mean 0 and variance $\sigma^2$. The function $g(\cdot)$ is an unknown single-index function that captures spatial heterogeneity, where $u_i$ denotes the spatial location parameters for the $i$th observation. By combining the spatial adjacency matrix with the single-index function $g(\cdot)$, the model flexibly reflects the influence of spatial heterogeneity while capturing spatial autocorrelation.
Compared to the SAR model, the SSIVCAR model replaces the global spatial autoregressive coefficient $\rho$ with a nonlinear function $g(u_i^{\top}\theta)$. Since $g(u_i^{\top}\theta)$ follows the structure of a single-index model, this formulation not only captures spatial heterogeneity but also mitigates the curse of dimensionality.
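As a quick consistency check on this formulation, the SAR model can be recovered as a special case. The notation below is an assumption made for illustration: $t_i$ stands for the single index of unit $i$, and $\beta$ and $\varepsilon_i$ denote the regression coefficients and error term.

```latex
g(t_i) \equiv \rho \ \text{ for all } i
\quad\Longrightarrow\quad
y_i = \rho \sum_{j=1}^{n} w_{ij}\, y_j + x_i^{\top}\beta + \varepsilon_i ,
\qquad\text{i.e.}\qquad
Y = \rho W Y + X\beta + \varepsilon .
```

In this sense a constant coefficient function corresponds to the homogeneous case, and any departure of $g$ from a constant reflects heterogeneity in spatial dependence.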
Model estimation
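For ease of estimation, the model can be rewritten in matrix form as follows:
$$Y = GWY + X\beta + \varepsilon, \tag{3}$$
where $Y = (y_1,\ldots,y_n)^{\top}$, $X = (x_1,\ldots,x_n)^{\top}$, and $G = \mathrm{diag}\{g_1,\ldots,g_n\}$ with $g_i = g(u_i^{\top}\theta)$, $\theta$ being the single-index parameter. Here, $\mathrm{diag}\{\cdot\}$ denotes the diagonal matrix whose diagonal elements are the entries inside the braces.
Assuming that the matrix $I - GW$ is nonsingular, equation (3) can be rewritten as:
$$Y = (I - GW)^{-1}(X\beta + \varepsilon), \tag{4}$$
which further implies:
$$WY = W(I - GW)^{-1}X\beta + W(I - GW)^{-1}\varepsilon. \tag{5}$$
Assuming that X is exogenous, we analyze the relationship between $WY$ and $\varepsilon$. Specifically, we have:
$$E\big[(WY)^{\top}\varepsilon\big] = E\big[\varepsilon^{\top}(I - GW)^{-\top}W^{\top}\varepsilon\big] = \sigma^2\,\mathrm{tr}\big(W(I - GW)^{-1}\big), \tag{6}$$
where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. Since $\mathrm{tr}\big(W(I - GW)^{-1}\big) \neq 0$ in general, it follows that there is a correlation between $WY$ and $\varepsilon$ in the SSIVCAR model, leading to an endogeneity problem.
Therefore, the SSIVCAR model cannot be directly estimated using conventional methods and requires the use of appropriate instrumental variables or alternative estimation techniques to address endogeneity.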
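The correlation between the spatial lag and the error term can be checked numerically. The sketch below is illustrative only: the ring-shaped neighborhood, the values placed on the diagonal of $G$, and the variable names are assumptions. It compares $\sigma^2\,\mathrm{tr}(W(I-GW)^{-1})$ with a Monte Carlo average of $(WY)^{\top}\varepsilon$, setting $X\beta = 0$ since the exogenous part contributes nothing to the expectation.

```python
import numpy as np

n = 6
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5   # hypothetical ring neighbours

g = np.linspace(0.2, 0.6, n)     # assumed unit-specific spatial coefficients
G = np.diag(g)
sigma2 = 1.0

# M = W (I - G W)^{-1}, so that WY = M (X beta + eps)
M = W @ np.linalg.inv(np.eye(n) - G @ W)
theoretical = sigma2 * np.trace(M)

# Monte Carlo average of (WY)' eps with X beta = 0
rng = np.random.default_rng(1)
eps = rng.normal(0.0, np.sqrt(sigma2), size=(50_000, n))
wy = eps @ M.T                                  # one row per replication
empirical = np.mean(np.sum(wy * eps, axis=1))

print(f"trace formula: {theoretical:.3f}, Monte Carlo: {empirical:.3f}")
```

Both quantities are clearly positive in this configuration, which is the numerical counterpart of the argument that ordinary least squares on the spatial lag would be inconsistent.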
The detailed estimation procedure is as follows:
Step 1: To estimate the single-index component $g(u_i^{\top}\theta)$, an initial value $\hat{\theta}^{(0)}$ of the index parameter $\theta$ is first specified. This initial estimate can be obtained using the de-nonlinearization method proposed by Xue et al. [36] to estimate the index parameter in a single-index model, which then serves as the initial value for the SSIVCAR model. Alternatively, when the dimension of U is low, a grid search method can be employed to obtain a more accurate initial estimate.
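Step 2: Let $a < \kappa_1 < \cdots < \kappa_l < b$ be $l$ knots on the interval $[a,b]$. Given the current estimate $\hat{\theta}$ of the index parameter, define $t_i = u_i^{\top}\hat{\theta}$, which is used to construct the basis functions. The $p$th-order truncated power spline basis function is given by:
$$B(t) = \big(1, t, t^2, \ldots, t^p, (t-\kappa_1)_+^p, \ldots, (t-\kappa_l)_+^p\big)^{\top}, \tag{7}$$
where the notation $(\cdot)_+^p$ is defined as:
$$(t-\kappa)_+^p = \begin{cases}(t-\kappa)^p, & t > \kappa,\\ 0, & t \le \kappa.\end{cases} \tag{8}$$
By constructing the spline basis function $B(t_i)$, we can flexibly approximate the unknown function $g(t)$. In subsequent iterations, $t_i$ is updated based on the current value of $\hat{\theta}$, thereby optimizing the estimation of the model parameters. Let the coefficients for the truncated power spline basis functions be $\delta = (\delta_0, \delta_1, \ldots, \delta_{p+l})^{\top}$. Then,
$$g(t) \approx B(t)^{\top}\delta. \tag{9}$$
Substituting this approximation into equation (3) yields:
$$Y \approx D\delta + X\beta + \varepsilon, \tag{10}$$
where $D = \mathrm{diag}(WY)\,\mathbb{B}$ and $\mathbb{B} = \big(B(t_1), \ldots, B(t_n)\big)^{\top}$.
As mentioned earlier, since the model suffers from an endogeneity problem and X is exogenous, one can choose linearly independent variables including $X$, $WX$, $W^2X$, $WU$, and $W^2U$ as instrumental variables for $WY$.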
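Step 3: Let Q be the matrix of instrumental variables constructed from the linearly independent variables $X$, $WX$, $W^2X$, $WU$, and $W^2U$. Then, the two-stage least squares (2SLS) method can be used for estimation.
In the first stage, to address the endogeneity problem, the fitted values of $WY$ are estimated as:
$$\widehat{WY} = Q(Q^{\top}Q)^{-1}Q^{\top}WY. \tag{11}$$
Substituting $\widehat{WY}$ into equation (10) transforms the model into:
$$Y \approx \hat{D}\delta + X\beta + \varepsilon, \tag{12}$$
where $\hat{D} = \mathrm{diag}(\widehat{WY})\,\mathbb{B}$ and $\mathbb{B}$ is the matrix whose $i$th row is the spline basis vector $B(t_i)^{\top}$.
At this point, the endogeneity problem has been addressed through the instrumental variable matrix Q.
In the second stage, the spline coefficients $\delta$ and the regression coefficients $\beta$ are estimated based on the modified model. Specifically, we have:
$$(\hat{\delta}^{\top}, \hat{\beta}^{\top})^{\top} = (Z^{\top}Z)^{-1}Z^{\top}Y, \tag{13}$$
where $Z = (\hat{D}, X)$.
This two-stage estimation procedure addresses the endogeneity problem while ensuring the efficiency and consistency of the parameter estimates.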
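Step 4: To ensure model identifiability, we impose a parameter constraint on the index parameter $\theta = (\theta_1, \ldots, \theta_m)^{\top}$ using the “omit-one-component" method. Specifically, let $\theta^{(r)} = (\theta_1, \ldots, \theta_{r-1}, \theta_{r+1}, \ldots, \theta_m)^{\top}$. Then, we reparameterize $\theta$ as:
$$\theta = \theta(\theta^{(r)}) = \big(\theta_1, \ldots, \theta_{r-1}, (1 - \|\theta^{(r)}\|^2)^{1/2}, \theta_{r+1}, \ldots, \theta_m\big)^{\top}. \tag{14}$$
Substituting the spline coefficient and regression coefficient estimates $\hat{\delta}$ and $\hat{\beta}$ obtained in Step 3 into equation (3), we construct the loss function:
$$L(\theta^{(r)}) = \sum_{i=1}^{n}\Big(y_i - B(t_i)^{\top}\hat{\delta}\sum_{j=1}^{n} w_{ij}\,y_j - x_i^{\top}\hat{\beta}\Big)^2, \tag{15}$$
where $t_i = u_i^{\top}\theta(\theta^{(r)})$. We then minimize the loss function L using the Trust Region Method (TRM) [37] to obtain an estimate $\hat{\theta}^{(r)}$, from which a new estimate of $\theta$, denoted by $\hat{\theta}$, is computed.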
Step 5: Substitute the estimate of the index parameter obtained in Step 4 back into Step 2, and then repeat Steps 2 through 4 until the estimate converges. The final converged value is the optimal estimate of the index parameter.
Step 6: Substitute the optimal estimate from Step 5 into Steps 2 through 4 to obtain the optimal estimates of the remaining model parameters, including the spline coefficients and the regression coefficients. The predicted values of the model and the optimal estimate of the residual variance are then computed from these estimates.
In summary, through the iterative optimization process described above, we obtain the optimal estimates of the index parameter, the coefficient function $g$, the regression coefficients, and the error variance.
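The two-stage step and the identifiability constraint in the procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors’ MATLAB implementation: the function names are hypothetical, the basis matrix `B` (one row of spline basis values per unit) and the instrument matrix `Q` are taken as given, and the second stage is read as regressing the response on the fitted spatial lag multiplied row-wise into the basis, alongside the covariates.

```python
import numpy as np

def two_stage_least_squares(y, Wy, B, X, Q):
    """Return (spline coefficients, regression coefficients) from 2SLS."""
    # Stage 1: project the spatial lag Wy onto the column space of Q
    coef, *_ = np.linalg.lstsq(Q, Wy, rcond=None)
    Wy_hat = Q @ coef
    # Stage 2: regress y on [diag(Wy_hat) B, X]
    D_hat = Wy_hat[:, None] * B
    Z = np.hstack([D_hat, X])
    params, *_ = np.linalg.lstsq(Z, y, rcond=None)
    k = B.shape[1]
    return params[:k], params[k:]

def reparameterize(theta_free, r=0):
    """Rebuild a unit-norm index vector from its free components.

    The omitted component (position r, zero-based) is set to
    sqrt(1 - ||theta_free||^2), so it is non-negative by construction.
    """
    theta_free = np.asarray(theta_free, dtype=float)
    sq = float(theta_free @ theta_free)
    if sq >= 1.0:
        raise ValueError("free components must have squared norm below 1")
    return np.insert(theta_free, r, np.sqrt(1.0 - sq))

theta = reparameterize([0.6], r=0)   # two-dimensional index vector
```

Using `np.linalg.lstsq` instead of forming $(Q^{\top}Q)^{-1}$ explicitly is a numerical-stability choice made for this sketch. In the iterative scheme, an optimizer would search over the free components only, calling `reparameterize` before each evaluation of the loss.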
Monte Carlo simulation
This section reports a Monte Carlo simulation study that evaluates the finite-sample performance of the spatial lag single-index varying coefficient model. Simulations were performed on a computer running 64-bit Windows 11 with a 2.2GHz AMD processor, using MATLAB 2024b as the programming environment. Optimization was carried out using MATLAB’s Trust-Region Method (TRM).
Evaluation metrics
To assess the accuracy of the parameter estimates, we employ the following three statistical measures:
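- Standard Error (S.E.):
$$\mathrm{S.E.} = \sqrt{\frac{1}{n_{mc}-1}\sum_{i=1}^{n_{mc}}\big(\hat{\vartheta}_i - \bar{\vartheta}\big)^2}, \tag{19}$$
where $\hat{\vartheta}_i$ denotes the estimated value of the parameter $\vartheta$ in the $i$th simulation and $\bar{\vartheta}$ is the average estimate.
- Bias:
$$\mathrm{Bias} = \bar{\vartheta} - \vartheta_0, \tag{20}$$
where $\vartheta_0$ is the true value of the parameter.
- Mean Squared Error (MSE):
$$\mathrm{MSE} = \frac{1}{n_{mc}}\sum_{i=1}^{n_{mc}}\big(\hat{\vartheta}_i - \vartheta_0\big)^2, \tag{21}$$
where $n_{mc}$ represents the total number of simulations.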
Lower values of these measures indicate greater estimation accuracy and a better approximation of the unknown function.
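The three measures can be computed from a vector of replicated estimates as sketched below. The function name is hypothetical, and the use of the sample standard deviation (divisor $n_{mc}-1$) for the S.E. is an assumption of this sketch.

```python
import numpy as np

def monte_carlo_summary(estimates, true_value):
    """Mean, S.E., bias and MSE of replicated estimates of one parameter."""
    est = np.asarray(estimates, dtype=float)
    mean = est.mean()
    return {
        "mean": mean,
        "se": est.std(ddof=1),                       # spread around the mean
        "bias": mean - true_value,                   # systematic deviation
        "mse": np.mean((est - true_value) ** 2),     # spread around the truth
    }

summary = monte_carlo_summary([1.0, 2.0, 3.0], true_value=2.0)
```

The S.E. measures variability around the average estimate, whereas the MSE measures variability around the true value, so an estimator can have a small S.E. and still a large MSE when it is biased.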
Construction of spatial location information and model specification
In the simulation experiments, two types of spatial region structures are considered to reflect different forms of spatial organization: (1) a regular square grid with evenly spaced locations, and (2) an irregular circular region with randomly distributed points. Schematic illustrations of these two spatial settings are shown in Fig 1.
(1) Regular square grid setting: Assume that m = 2, so that geographic locations are represented by two-dimensional coordinates. The simulation region is a square, with its lower-left corner designated as the origin. A Cartesian coordinate system is imposed along the horizontal and vertical directions. Each side of the square is divided into $h-1$ equal segments. By connecting the division points along both axes, a total of $h \times h$ intersection points are generated, including those on the boundary. These intersection points represent the simulated geographic locations.
Assuming a sample size of $n = h^2$, the coordinates of each geographic location can be defined as:
$$u_i = \big(\operatorname{mod}(i-1, h),\ \lfloor (i-1)/h \rfloor\big)^{\top}, \quad i = 1, \ldots, n, \tag{22}$$
where $\operatorname{mod}(i-1, h)$ denotes the remainder when $i-1$ is divided by $h$, $\lfloor (i-1)/h \rfloor$ denotes the integer part of the quotient, and $\lfloor\cdot\rfloor$ denotes the floor function.
(2) Irregular circular region setting: To better reflect the fact that real-world spatial structures are often irregular and do not conform to regular grids, we introduce an additional simulation setting based on a circular region with randomly distributed spatial units and irregular neighborhood structures. Specifically, $n$ spatial units are generated by randomly placing $n$ points within a unit circle. Each spatial location is drawn from a uniform distribution over the disk $\{(u_1, u_2): u_1^2 + u_2^2 \le 1\}$. To construct the spatial weight matrix W, we adopt a k-nearest neighbor strategy: for each point $u_i$, we identify its k closest neighbors (measured by Euclidean distance) and assign a weight of 1 to each of these links, setting all other weights to 0. The resulting adjacency matrix is then row-normalized.
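Both spatial layouts can be generated with a few lines of Python. The sketch below is illustrative: the function names are hypothetical, the grid coordinates are left unscaled (remainder and integer quotient of $i-1$ by $h$), and row-normalising the Rook matrix is an assumption, since the text states row-normalisation only for the k-nearest neighbor matrix.

```python
import numpy as np

def grid_coordinates(h):
    """Coordinates of the h*h grid points, indexed i = 1, ..., h^2."""
    i = np.arange(1, h * h + 1)
    return np.column_stack([(i - 1) % h, (i - 1) // h]).astype(float)

def rook_weights(h):
    """Row-normalised Rook contiguity matrix for the h x h grid (h >= 2)."""
    U = grid_coordinates(h)
    # Rook neighbours share an edge: Manhattan distance exactly 1
    dist = np.abs(U[:, None, :] - U[None, :, :]).sum(axis=2)
    W = (dist == 1).astype(float)
    return W / W.sum(axis=1, keepdims=True)

def disk_points(n, rng):
    """n points uniform on the unit disk (sqrt of a uniform radius)."""
    radius = np.sqrt(rng.uniform(0.0, 1.0, n))
    angle = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])

def knn_weights(U, k):
    """Row-normalised k-nearest-neighbour matrix from Euclidean distances."""
    n = U.shape[0]
    dist = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)            # a unit is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.arange(n)[:, None], nearest] = 1.0
    return W / k                              # every row has exactly k ones

W_grid = rook_weights(15)                     # 225 units on a regular grid
points = disk_points(225, np.random.default_rng(0))
W_disk = knn_weights(points, k=4)             # k = 4 is an arbitrary choice
```

Taking the square root of a uniform draw for the radius is what makes the points uniform over the area of the disk; drawing the radius uniformly would concentrate points near the centre. Note that the k-nearest neighbor matrix is generally not symmetric, unlike the Rook matrix before normalisation.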
Model specification: The simulated data follow the SSIVCAR model with the sample size set to n = 225 (h = 15) and with fixed true values of the single-index and regression parameters. The covariates $x_{ij}$ are independently drawn from the uniform distribution $U(-1,1)$, and $u_i$ is as defined in (22). The elements $w_{ij}$ are taken from the Rook contiguity matrix W (in the square grid case) or from the k-nearest neighbor matrix (in the circular region). The coefficient function $g$ is specified in advance, and the error term $\varepsilon_i$ is independently drawn from a normal distribution with mean 0.
Comparative experiments
Parameter estimation is conducted under the following settings. The number of simulations is set to $n_{mc} = 300$, following standard practice in Monte Carlo studies. A key principle of Monte Carlo simulations is that increasing the number of simulations generally leads to more stable and accurate estimates, particularly for complex models. The choice of 300 simulations ensures sufficient convergence of the results without excessive computational cost.
The node step is set to 10, motivated by the use of spline functions to approximate the unknown function g. The choice of step size in spline-based methods directly affects the resolution of the approximation. Specifically, it determines how frequently the index variable is sampled, thereby affecting the spline’s ability to capture the underlying behavior of the function. A smaller step size yields a finer discretization and a more accurate approximation of g, but increases the computational burden. Conversely, a larger step decreases the model’s ability to capture variations in the function. In this study, to select the nodes, we sort the independent variable of the unknown function and choose every 10th value as a node. This method spaces the nodes evenly in rank order, so that each interval between adjacent nodes contains the same number of observations, providing a well-distributed sample of the independent variable.
The order of the spline basis functions is set to $p = 3$. The choice of p is driven by the need for a balance between model flexibility and smoothness. In spline approximation, the order p controls the degree of smoothness and the model’s ability to fit the data. A higher order increases the flexibility of the spline and may lead to overfitting, whereas a lower order may fail to adequately capture the complexity of the underlying function. Setting p = 3 is a common choice in spline-based approximation, as it provides sufficient flexibility to approximate complex functions while maintaining smoothness and avoiding overfitting.
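The node selection rule and the cubic truncated power basis can be sketched as follows. The function names are hypothetical, and two details are assumptions of this sketch: the first node is taken at the 10th sorted value, and any node equal to the largest observation is dropped because its truncated term would be identically zero.

```python
import numpy as np

def select_knots(t, step=10):
    """Every `step`-th value of the sorted index variable, interior only."""
    t_sorted = np.sort(np.asarray(t, dtype=float))
    knots = t_sorted[step - 1::step]
    return knots[knots < t_sorted[-1]]

def truncated_power_basis(t, knots, p=3):
    """Rows are (1, t, ..., t^p, (t-k_1)_+^p, ..., (t-k_l)_+^p); needs p >= 1."""
    t = np.asarray(t, dtype=float)
    knots = np.asarray(knots, dtype=float)
    poly = t[:, None] ** np.arange(p + 1)                         # 1, t, ..., t^p
    trunc = np.clip(t[:, None] - knots[None, :], 0.0, None) ** p  # (t - k)_+^p
    return np.hstack([poly, trunc])

rng = np.random.default_rng(0)
t = rng.uniform(0.0, 1.0, 225)        # stand-in for the index values t_i
knots = select_knots(t, step=10)
B = truncated_power_basis(t, knots, p=3)
```

Under these assumptions, a sample of 225 distinct index values gives 22 nodes and a basis matrix with 4 + 22 = 26 columns. Because the index values change whenever the index parameter is updated, the nodes and the basis would be recomputed at each iteration of the estimation procedure.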
Table 2 presents the average estimated values, S.E., biases, and MSE for the various parameters.
Table 2 shows the parameter estimation results for a sample size of n = 225 and $n_{mc} = 300$ simulations. The results indicate that the model exhibits good parameter estimation precision and stability under small sample conditions. The estimates of the two single-index parameters show small biases and low values of both MSE and S.E., indicating a high accuracy in capturing spatial location effects. In addition, the estimates of the two regression parameters are close to their true values, with biases within an acceptable range, although the MSE is slightly higher, as expected under a finite sample. It is worth noting that the standard error of one of these estimates is relatively large, suggesting some uncertainty in its estimation under small-sample conditions. The estimate of the error variance $\sigma^2$ is close to the true value, though its standard error is relatively high, possibly due to the influence of random variation. Overall, the estimation results under small-sample conditions demonstrate the model’s robustness and adaptability. The estimation precision for the single-index parameters is particularly notable, while the regression parameters and error variance remain within reasonable bounds.
Furthermore, the fitting performance of the function g is illustrated in Fig 2, where Fig 2(a) shows the true surface for the square region and Fig 2(b) shows the fitted surface for the square region. Fig 2(c) shows the true curve and Fig 2(d) presents the fitted curve for the circular region. These plots enable a direct comparison of the fitting accuracy across regular (square) and irregular (circular) spatial domains.
(a) True surface for square region; (b) fitted surface for square region; (c) true curve for circular region; (d) fitted curve for circular region.
Fig 2 compares the true and fitted representations of the unknown function g across two distinct spatial regions. The true surface for the square region (Fig 2 (a)) exhibits smooth variation with respect to the spatial location parameter $u$, showing periodic fluctuations. The fitted surface for the square region (Fig 2 (b)) successfully reproduces these global features, particularly the amplitude and trends near the extrema, confirming the model’s ability to capture the main characteristics of the underlying function. For the circular region, the true curve (Fig 2 (c)) presents a non-periodic pattern reflecting the more irregular distribution of the spatial locations. The fitted curve (Fig 2 (d)) closely aligns with the true curve, indicating that the model can effectively accommodate irregular spatial structures. Although some minor discrepancies are observed in finer details, the overall fit remains strong, suggesting that the spatial lag single-index varying coefficient model is robust for both regular and irregular spatial structures.
To further evaluate the distributional characteristics and normality of the parameter estimates, we present histograms and Q-Q plots for four of the model parameters, including the two single-index parameters. The histograms provide an intuitive depiction of the distributional concentration of the estimates, while the Q-Q plots verify whether the estimates conform to the normality assumption.
Figs 3 and 4 demonstrate that the estimates for ,
,
, and
are well concentrated and approximately normally distributed. The histograms reveal that the estimates of
(Fig 3 (a)) and
(Fig 3 (b)) exhibit symmetric, bell-shaped distributions centered around the true values, indicating high precision and stability in estimating the single-index parameters. For
(Fig 3 (c)) and
(Fig 3 (d)), although the tails are slightly heavier, the overall distribution is consistent with normality. The corresponding Q-Q plots (Fig 4) further confirm that most points align closely with the theoretical diagonal, with only minor deviations in the tails, thereby validating the normality assumption of the parameter estimates.
(a) Distribution of ; (b) distribution of
; (c) distribution of
; (d) distribution of
.
(a) Q-Q plot for ; (b) Q-Q plot for
; (c) Q-Q plot for
; (d) Q-Q plot for
.
In the following analysis, the regular square spatial structure is used as the baseline to examine how different values of the error variance and alternative specifications of spatial adjacency matrices influence the model’s estimation performance.
To further evaluate the model’s performance, we compared the parameter estimation results under different error variances (i.e.,
, 0.25, 0.01). Table 3 presents the estimated parameter values, S.E., Biases, and MSE under these three conditions. This comparison facilitates analysis of how the estimation accuracy and stability vary with different levels of error variance.
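The summary measures reported for each parameter can be computed from the replication estimates as in the sketch below. It assumes that S.E. denotes the sample standard deviation of the estimates across replications, which the text does not state explicitly; the function name `mc_summary` and the example numbers are hypothetical.

```python
from statistics import mean, stdev


def mc_summary(estimates, true_value):
    """Summarize Monte Carlo estimates of a single parameter."""
    est_mean = mean(estimates)
    return {
        "estimate": est_mean,
        # Assumed definition: spread of the estimates across replications.
        "se": stdev(estimates),
        "bias": est_mean - true_value,
        # MSE is measured around the true value, so it absorbs both variance and bias.
        "mse": mean([(e - true_value) ** 2 for e in estimates]),
    }


# Hypothetical estimates from five replications with an assumed true value of 1.0.
summary = mc_summary([0.9, 1.1, 1.0, 1.2, 0.8], true_value=1.0)
print(summary)
```

Because the MSE equals the variance of the estimates plus the squared bias (up to the divisor used for the variance), a falling MSE with a stable bias points to reduced sampling variability.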
As shown in Table 3, the estimation accuracy of the model parameters improves notably as
decreases: the estimates approach the true values, and both S.E. and MSE decrease substantially. For the single-index parameters
and
, both bias and MSE are relatively large under
, but they decrease rapidly as
is reduced to 0.25 and 0.01, with the MSE nearly approaching zero. Similarly, the regression parameters
and
exhibit a comparable pattern: both standard errors and MSE are large under
due to high noise, but decrease substantially as
becomes smaller. For the error variance
itself, the estimates converge to the true value and exhibit lower S.E. and MSE as the true variance decreases. These results further validate the model’s adaptability and robustness under varying levels of error variance.
Thus, the comparative experiments indicate that lower error variance leads to improved estimation accuracy, with estimates being closer to the true values and exhibiting significantly lower standard errors and MSE.
Fig 5 illustrates the fitted results for the unknown function g under different error variances (
, 0.25, 0.01).
(a) ; (b)
; (c)
.
Fig 5 shows that as the error variance decreases, the fit of the unknown function g improves substantially. When
(Fig 5 (a)), the fitted surface captures the overall trend but displays noticeable local discrepancies; as
decreases to 0.25 (Fig 5 (b)), the differences between the fitted and true surfaces diminish, enabling the model to more accurately capture the global structure of the function; and when
(Fig 5 (c)), the fit is nearly perfect, with the fitted surface almost exactly matching the true surface.
Fig 5 presents the fitted results for the unknown function g under three distinct noise variance settings. As the variance decreases, the fitting performance improves progressively. When
, the model captures the global shape of the function but exhibits noticeable deviations in regions of high curvature. For
, the estimation becomes more stable, with reduced deviations in both peak and valley regions. At a low noise level of
, the fitted surface shows excellent agreement with the true function, confirming the model’s robustness under low-noise conditions. These results demonstrate that lower noise variance leads to improved estimation accuracy, particularly in regions where the function exhibits rapid variation.
To further analyze the impact of spatial adjacency structures on parameter estimation, we consider four types of spatial weight matrices: the Bishop matrix, the Case matrix, the Distance matrix, and a random matrix. For each matrix, we recorded the estimated values, S.E., Biases, and MSE of the parameters. The Bishop and Case matrices define weights based on neighborhood structures and are suited to capturing local and complex spatial dependencies, respectively; the Distance matrix characterizes global spatial dependence based on pairwise distances; while the random matrix, independent of geographic location, allocates weights based on other associated features, offering broader applicability. Table 4 presents the parameter estimation results under each of the four spatial weight matrix settings.
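To make the contrast between neighborhood-based and distance-based weights concrete, the sketch below builds three matrices on a regular grid. It rests on assumptions that go beyond the text: the Bishop matrix is taken to mean bishop contiguity (cells sharing only a corner), the distance matrix is taken to be inverse Euclidean distance, and rook contiguity is included only for comparison. The Case and random matrices are not reproduced, and all function names are hypothetical.

```python
import math


def grid_coords(rows, cols):
    """Coordinates of the cells of a regular rows x cols grid, row by row."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def contiguity_weights(coords, kind):
    """Binary contiguity: 'rook' = shared edge, 'bishop' = shared corner only."""
    n = len(coords)
    W = [[0.0] * n for _ in range(n)]
    for i, (ri, ci) in enumerate(coords):
        for j, (rj, cj) in enumerate(coords):
            dr, dc = abs(ri - rj), abs(ci - cj)
            if kind == "rook" and dr + dc == 1:
                W[i][j] = 1.0
            elif kind == "bishop" and dr == 1 and dc == 1:
                W[i][j] = 1.0
    return W


def inverse_distance_weights(coords):
    """Dense weights that decay with Euclidean distance (zero on the diagonal)."""
    n = len(coords)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                W[i][j] = 1.0 / math.dist(coords[i], coords[j])
    return W


def row_standardize(W):
    """Scale each row to sum to one; rows without neighbors stay zero."""
    out = []
    for row in W:
        total = sum(row)
        out.append([w / total if total > 0 else 0.0 for w in row])
    return out


coords = grid_coords(3, 3)
bishop = contiguity_weights(coords, "bishop")
rook = contiguity_weights(coords, "rook")
inv_dist = inverse_distance_weights(coords)
W_bishop = row_standardize(bishop)
```

The contiguity matrices are sparse and purely local, whereas the inverse-distance matrix links every pair of units, which is one way to read the local versus global distinction drawn in the text.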
As shown in Table 4, the parameter estimation results vary under different spatial adjacency matrices. The Bishop matrix performs exceptionally well in capturing local spatial relationships, with very small biases and low MSE for and
, indicating that it effectively characterizes local dependency. The Case matrix demonstrates high flexibility in scenarios with strong spatial dependence, producing estimates for
and
that are close to their true values and stable estimation of
. The Distance matrix shows remarkable advantages in modeling global spatial relationships, achieving the highest estimation accuracy for
and
, along with highly stable estimates for
and
. The random matrix exhibits a balanced performance, with biases and MSE for all parameters within a reasonable range, indicating its adaptability in capturing both global and local relationships. These findings confirm that the choice of spatial weight matrix significantly affects model performance. The Bishop and Distance matrices offer distinct advantages for local and global modeling, respectively, while the Case and random matrices provide flexible alternatives for capturing complex spatial dependencies, offering diverse options for practical applications.
Fig 6 presents the fitted results for the unknown function g under four spatial weight matrices: Bishop, Case, Distance, and Random. The figure clearly illustrates how different spatial weight matrices influence the fitting performance.
(a) Bishop matrix; (b) case matrix; (c) distance matrix; (d) random matrix.
Fig 6 presents the estimated surfaces of the coefficient-modulating function under four spatial weight matrices: Bishop, Case, Distance, and Random. While the overall shape of the estimated surfaces remains consistent—reflecting the smooth sinusoidal structure of the true function—subtle differences among them highlight the influence of the spatial weight matrix on estimation precision. Specifically, the Bishop and Case matrices, which emphasize local connectivity, result in slightly sharper surface transitions, particularly along the diagonal regions where spatial dependence is strongest. The Distance matrix yields a smoother and more stable surface, suggesting its strength in capturing global spatial patterns. In contrast, the Random matrix, although not based on geographic proximity, still produces a reasonably smooth fit, indicating its ability to balance local and global spatial information in a data-driven manner.
Empirical analysis: The impact of digital economy development on the environment
In recent years, the digital economy has emerged as a key driver of global economic transformation. It not only promotes technological progress, optimizes resource allocation, and enhances production efficiency, but also offers new pathways for addressing environmental challenges. Through low-carbon technological innovation, increased informatization, and improved energy efficiency, the digital economy is expected to enhance environmental quality alongside economic growth. However, the actual impact of the digital economy on environmental governance remains uncertain. On one hand, it may reduce resource waste and optimize energy structures, thereby benefiting the environment; on the other hand, its expansion may result in higher energy consumption and carbon emissions, with its environmental effects exhibiting significant spatial heterogeneity across urban areas.
Against this background, this section uses Chinese cities as the study sample to examine the impact of digital economy development on environmental quality. The analysis addresses the following key questions: (1) Does digital economy development significantly improve environmental quality? (2) Does this impact vary spatially across regions with different geographic and economic characteristics? (3) When accounting for spatial effects, what mechanisms underlie the influence of digital economy development on various environmental indicators? In addition, the empirical analysis integrates urban environmental data with digital economy indicators to explore their spatial interrelationship, thereby offering new insights into the coordinated development of the digital economy and environmental governance.
All variables used in this study are derived from data for the year 2022, which is the most recent year with comprehensive and consistent records across all Chinese prefecture-level cities. Although the original data sources cover multiple years, this study adopts a cross-sectional design to focus on spatial heterogeneity rather than temporal variation. Potential dynamic effects of digital economy development on environmental quality will be examined in future research using panel data models.
To comprehensively examine the impact of digital economy development on urban environmental quality in China, this study defines environmental quality as the condition of the urban natural environment, as reflected by levels of air and industrial pollution. This definition is consistent with prior research in environmental economics and urban studies, where environmental quality is typically measured by pollution indicators that reflect environmental pressure and degradation caused by anthropogenic activities.
Based on this definition, four indicators are selected to characterize environmental quality, covering two primary dimensions: air pollution and industrial pollution. The selected indicators include average PM2.5 concentration, industrial wastewater discharge, industrial smoke and dust emissions, and industrial sulfur dioxide emissions. The average PM2.5 concentration (unit: μg/m³) reflects the level of fine particulate matter in ambient air and serves as a key indicator of urban air pollution. Industrial wastewater discharge (unit: ten thousand tons) captures the intensity of water pollution resulting from industrial processes. Industrial smoke and dust emissions (unit: ten thousand tons) and industrial sulfur dioxide emissions (unit: ten thousand tons) quantify the release of solid particles and acidic gases from industrial operations, respectively. All data are sourced from the China Urban Statistical Yearbook. These indicators offer a comprehensive and objective representation of urban environmental quality.
In addition to environmental quality outcomes, the empirical model includes explanatory variables capturing the level of digital economy development and other socioeconomic characteristics.
The digital economy indicator serves as the primary explanatory variable in the model. To capture the multidimensional characteristics of urban digitalization, a composite index is constructed using the entropy method based on five components: the Digital Inclusive Finance Index (sourced from the Peking University Digital Finance Research Center), the number of international internet users per 100 people, the proportion of employees in the information transmission, computer service, and software industries, per capita telecommunication service volume, and the number of mobile phone subscribers per 100 people. The latter four indicators are sourced from the China Urban Statistical Yearbook [44].
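A common form of the entropy method can be sketched as follows. This is an illustration under stated assumptions rather than the authors' exact procedure: every component is treated as a "larger is better" indicator, min-max normalization is assumed, and the function name `entropy_index` and the toy data are hypothetical.

```python
import math


def entropy_index(rows):
    """rows[i][j] = value of indicator j for city i; returns (weights, scores)."""
    n, m = len(rows), len(rows[0])
    columns = list(zip(*rows))

    # Step 1: min-max normalize each indicator to [0, 1].
    normalised = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        normalised.append([(v - lo) / span if span > 0 else 0.0 for v in col])

    # Step 2: entropy of each indicator; low entropy means more discriminating power.
    divergences = []
    for col in normalised:
        total = sum(col)
        if total == 0:
            divergences.append(0.0)  # a constant indicator carries no information
            continue
        shares = [v / total for v in col]
        entropy = -sum(p * math.log(p) for p in shares if p > 0) / math.log(n)
        divergences.append(1.0 - entropy)

    # Step 3: weights proportional to (1 - entropy).
    weight_sum = sum(divergences)
    weights = [d / weight_sum for d in divergences]

    # Step 4: composite score as the weighted sum of normalized indicators.
    scores = [sum(weights[j] * normalised[j][i] for j in range(m)) for i in range(n)]
    return weights, scores


# Toy data: four hypothetical cities, two hypothetical indicators.
weights, scores = entropy_index([[1, 1], [2, 1], [3, 1], [4, 10]])
print(weights, scores)
```

In the toy data the second indicator is concentrated in a single city, so it has lower entropy and receives the larger weight.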
To account for other factors potentially affecting environmental quality, five control variables are included in the model. These include: (1) population density, capturing the intensity of population concentration and its associated environmental burden; (2) per capita GDP, indicating the level of regional economic development; (3) foreign direct investment, measured as the ratio of actual utilized foreign capital to GDP, reflecting the degree of economic openness; (4) scientific expenditure, defined as the proportion of fiscal expenditure allocated to science and technology, reflecting government support for innovation; and (5) green innovation, measured by the number of authorized green patents, indicating the technological capacity for environmentally sustainable development. Among these variables, scientific expenditure and green innovation are also interpreted as partial indicators of environmental governance capacity, understood as institutional and policy mechanisms for managing environmental outcomes.
Taken together, the selected variables span multiple dimensions, including environmental quality, digital economy, and governance capacity, and thus provide a comprehensive foundation for analyzing the mechanisms through which digital economy development affects urban environmental outcomes. Table 5 summarizes the definitions and data sources of all variables used in the empirical analysis.
During data preprocessing, all variables were transformed using the natural logarithm to reduce scale differences and improve distributional smoothness. Table 6 presents the descriptive statistics of the main variables in logarithmic form, including the mean, standard deviation, median, minimum, maximum, skewness, and kurtosis.
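The preprocessing and summary step can be sketched as below. The moment-based (population) definitions of skewness and kurtosis are an assumption, since the text does not say which convention underlies Table 6; kurtosis is reported here without subtracting 3. The function name `describe_log` and the sample values are hypothetical.

```python
import math
from statistics import mean, median, pstdev


def describe_log(values):
    """Log-transform strictly positive values and return descriptive statistics."""
    logs = [math.log(v) for v in values]
    mu, sd = mean(logs), pstdev(logs)
    standardized = [(x - mu) / sd for x in logs]
    return {
        "mean": mu,
        "sd": sd,
        "median": median(logs),
        "min": min(logs),
        "max": max(logs),
        # Positive skewness = long right tail; negative = long left tail.
        "skewness": mean([z ** 3 for z in standardized]),
        # Raw (non-excess) kurtosis; a normal distribution gives 3.
        "kurtosis": mean([z ** 4 for z in standardized]),
    }


# Hypothetical raw values whose logarithms are symmetric around zero.
symmetric = describe_log([math.exp(x) for x in (-2, -1, 0, 1, 2)])
# Hypothetical raw values with one very large observation (long right tail).
right_tailed = describe_log([1.0, 1.0, 1.0, 1.0, math.exp(5)])
print(symmetric["skewness"], right_tailed["skewness"])
```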
Table 6 shows that the variables differ significantly in terms of their means, ranges, and skewness. For example, has a mean of -5.6424, a standard deviation of 0.3856, and a skewness of 1.2690, indicating a moderately right-skewed distribution with slight kurtosis;
has a skewness of -2.7312 and kurtosis of 30.8390, reflecting a highly peaked and left-skewed distribution, which suggests that green innovation in a few cities is exceptionally low on the log scale, while most cities cluster at comparatively higher levels.
has a skewness of -2.8132, indicating a long left tail: a small number of cities record exceptionally low values, while most cities cluster at comparatively higher levels.
The empirical data are collected at the prefecture-level city scale, encompassing a broad range of urban areas across China. Each observation corresponds to a distinct city, inherently characterized by spatial attributes. To explicitly model the spatial structure among these cities, a spatial weight matrix W is constructed based on first-order Rook contiguity. In this approach, two cities are defined as neighbors if they share a common administrative boundary. The contiguity relationships are derived from official geographic boundary shapefiles using GIS software, and the resulting binary adjacency matrix is row-standardized to facilitate interpretation. This spatial weight matrix is applied consistently throughout the spatial dependence diagnostics and econometric modeling.
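The construction of a row-standardized contiguity matrix, and the neighbor average it induces, can be sketched as follows. The adjacency list stands in for the output of the GIS step and is entirely hypothetical, as are the city labels, the values in `y`, and the function names.

```python
def row_standardized_weights(neighbors):
    """Build a row-standardized W from a dict mapping each city to its neighbors."""
    names = sorted(neighbors)
    index = {name: k for k, name in enumerate(names)}
    n = len(names)
    W = [[0.0] * n for _ in range(n)]
    for name, adjacent in neighbors.items():
        for other in adjacent:
            # Each neighbor receives an equal share, so every non-empty row sums to one.
            W[index[name]][index[other]] = 1.0 / len(adjacent)
    return names, W


def spatial_lag(W, y):
    """Return Wy: for each city, the weighted average of its neighbors' values."""
    return [sum(w * v for w, v in zip(row, y)) for row in W]


# Hypothetical boundary-sharing relationships among four cities.
neighbors = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
names, W = row_standardized_weights(neighbors)
y = [10.0, 20.0, 30.0, 40.0]  # hypothetical values for A, B, C, D
lagged = spatial_lag(W, y)
print(dict(zip(names, lagged)))
```

With row standardization, each element of Wy is a simple average over a city's contiguous neighbors, which is what makes the spatial lag coefficient interpretable as a response to the neighborhood mean.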
In spatial econometrics, Moran’s I is a classical test used to detect whether a variable exhibits significant spatial autocorrelation. Spatial autocorrelation refers to the extent to which observations at nearby locations are statistically correlated [42]. The global Moran’s I statistic is computed to assess the spatial autocorrelation of each response variable: ,
,
, and
.
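Moran’s I is defined as [43]:
$$I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}} \cdot \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(y_i-\bar{y})(y_j-\bar{y})}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$$
where $n$ is the number of observations; $y_i$ and $y_j$ are the observed values for spatial units $i$ and $j$; $\bar{y}$ is the mean of variable $y$; $w_{ij}$ denotes the $(i,j)$th element of the spatial weight matrix $W$, representing the spatial relationship between units $i$ and $j$; and $\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}$ is the total sum of all spatial weights.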
The value of Moran’s I generally falls within the range of –1 to 1. A value of I > 0 indicates positive spatial autocorrelation, meaning that neighboring regions tend to have similar values; I < 0 implies negative spatial autocorrelation, where neighboring regions tend to have dissimilar or contrasting values; and a value of I close to 0 suggests the absence of significant spatial autocorrelation.
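The statistic can be computed directly from its definition. The sketch below uses a hypothetical chain of four units with binary weights; the function name `morans_i` and the data are illustrative only.

```python
def morans_i(y, W):
    """Global Moran's I for values y and a square weight matrix W (list of lists)."""
    n = len(y)
    y_bar = sum(y) / n
    dev = [v - y_bar for v in y]
    total_weight = sum(sum(row) for row in W)
    cross = sum(W[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / total_weight) * cross / sum(d * d for d in dev)


# Hypothetical chain of four units: 0-1, 1-2, 2-3 are neighbors.
W_chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
trend = morans_i([1, 2, 3, 4], W_chain)          # similar neighbors
alternating = morans_i([1, -1, 1, -1], W_chain)  # contrasting neighbors
print(trend, alternating)
```

The smoothly increasing series gives a positive value, while the alternating series gives a negative one, matching the interpretation of the sign given above.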
To evaluate the significance of spatial autocorrelation, a Monte Carlo simulation with 999 permutations was conducted to test Moran’s I statistic. Table 7 presents Moran’s I indices along with their corresponding significance test results for the response variables.
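A permutation test of this kind can be sketched as follows. The pseudo p-value formula (count + 1)/(permutations + 1) and the z-score based on the permutation distribution are common conventions assumed here, not details confirmed by the text; the chain-shaped weight matrix, the data, and the function names are hypothetical.

```python
import random
from statistics import mean, pstdev


def morans_i(y, W):
    """Global Moran's I for values y and a square weight matrix W."""
    n = len(y)
    y_bar = sum(y) / n
    dev = [v - y_bar for v in y]
    total_weight = sum(sum(row) for row in W)
    cross = sum(W[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / total_weight) * cross / sum(d * d for d in dev)


def chain_weights(n):
    """Binary weights for n units on a line, each linked to its immediate neighbors."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]


def permutation_test(y, W, n_perm=999, seed=0):
    """One-sided test for positive autocorrelation by shuffling values over locations."""
    rng = random.Random(seed)
    observed = morans_i(y, W)
    shuffled = list(y)
    simulated = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        simulated.append(morans_i(shuffled, W))
    n_extreme = sum(1 for s in simulated if s >= observed)
    p_value = (n_extreme + 1) / (n_perm + 1)
    z_score = (observed - mean(simulated)) / pstdev(simulated)
    return observed, z_score, p_value


# Hypothetical, strongly clustered data: values increase steadily along the chain.
observed, z_score, p_value = permutation_test(list(range(20)), chain_weights(20))
print(round(observed, 4), round(z_score, 2), p_value)
```

With 999 permutations the smallest attainable pseudo p-value is 0.001, which is why results from such tests are often reported at that floor.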
As shown in Table 7, the Moran’s I statistics for all response variables are substantially greater than zero, with corresponding z-scores exceeding conventional critical values and p-values well below 0.05. These results indicate significant positive spatial autocorrelation across all variables. In particular, PM2.5 concentration shows a Moran’s I value of 0.8199, indicating an extremely high degree of spatial clustering: neighboring cities tend to exhibit similar air quality levels. This spatial autocorrelation may stem from cross-regional pollutant diffusion, similarities in industrial structure, and insufficient environmental policy coordination among adjacent cities. For other variables,
exhibits a Moran’s I of 0.4065, significant at a high level, suggesting strong spatial autocorrelation in industrial wastewater discharge—likely linked to hydrological connectivity within river basins. Although the Moran’s I values for
and
(0.2190 and 0.2477, respectively) are relatively lower, they still reveal statistically significant positive spatial autocorrelation. This suggests that industrial particulate emissions and sulfur dioxide emissions also exhibit moderate regional clustering.
In addition, Moran’s I scatterplots were constructed to visualize spatial relationships. These scatterplots divide the observations into four quadrants: high-high, low-low, high-low, and low-high. The first and third quadrants represent positive spatial autocorrelation, while the second and fourth indicate negative spatial autocorrelation.
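The quadrant assignment behind a Moran scatterplot can be sketched as below. Each unit is classified by the sign of its own deviation from the mean and the sign of its neighbors' average deviation; treating a deviation of exactly zero as "high" is an arbitrary choice made here, and the chain-shaped weights, data, and function name are hypothetical.

```python
def moran_quadrants(y, W):
    """Label each unit HH, LL, HL, or LH using a row-standardized weight matrix W."""
    n = len(y)
    y_bar = sum(y) / n
    dev = [v - y_bar for v in y]
    labels = []
    for i in range(n):
        lag = sum(W[i][j] * dev[j] for j in range(n))  # neighbors' average deviation
        own = "H" if dev[i] >= 0 else "L"
        nbr = "H" if lag >= 0 else "L"
        labels.append(own + nbr)
    return labels


# Hypothetical row-standardized weights for four units on a line.
W_line = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
clustered = moran_quadrants([1, 2, 9, 10], W_line)
checkerboard = moran_quadrants([1, 10, 1, 10], W_line)
print(clustered, checkerboard)
```

Units labeled HH or LL fall in the first and third quadrants and contribute to positive autocorrelation, while HL and LH units fall in the fourth and second quadrants.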
Fig 7 shows the Moran’s I scatterplots for ,
,
, and
, offering a visual interpretation of the spatial autocorrelation patterns for each variable.
(a) ; (b)
; (c)
; (d)
.
As shown in Fig 7, the scatterplots of all response variables exhibit significant positive spatial autocorrelation, with most observations concentrated in the first and third quadrants. In particular, the scatterplot of PM2.5 concentration displays the highest degree of concentration, further confirming its strong spatial clustering and suggesting that PM2.5 concentrations tend to be similar across neighboring cities. Similarly, the large number of observations in the first and third quadrants for
indicates strong spatial autocorrelation in industrial wastewater discharge, possibly linked to regional water systems and cross-boundary watershed pollution. Although the scatter distributions for
and
are more dispersed, the predominance of points in the first and third quadrants indicates that industrial smoke and dust emissions and sulfur dioxide emissions still exhibit positive spatial correlation. This pattern may be attributed to the inter-city spillover effects of industrial activities, such as shared energy structures or industrial configurations. Overall, the Moran’s I scatterplots visually confirm the spatial autocorrelation patterns of the response variables, consistent with the statistical results reported in Table 7.
To further assess whether the response variables exhibit significant spatial dependence, both LM-Lag and LM-Error tests are conducted. The LM-Lag test evaluates the suitability of a spatial lag model, while the LM-Error test detects spatial dependence in the error terms. These diagnostic tests provide critical evidence for selecting an appropriate spatial econometric specification. The null hypothesis for the LM-Lag test is the absence of a spatial lag effect, whereas the LM-Error test assumes no spatial autocorrelation in the residuals. The test results are shown in Table 8.
The LM-Lag test results indicate that most response variables exhibit significant spatial lag effects, whereas the LM-Error test results are generally not statistically significant. This suggests that spatial dependence among the environmental quality indicators is primarily reflected through lag effects, while spatial autocorrelation in the residuals is relatively weak. Specifically, the LM-Lag test results for PM2.5 concentration, industrial wastewater discharge, and industrial emissions indicate significant spatial lag dependence—implying that environmental quality among neighboring cities is highly interdependent—likely due to cross-regional pollutant diffusion, interlinked economic activities, and inadequate environmental policy coordination. Although the LM-Lag test for industrial smoke and dust emissions is only marginally significant, it suggests that spatial dependence still exists and should not be entirely overlooked.
The empirical analysis begins with the application of the SAR model to examine the impact of digital economy development on urban environmental quality. The SAR model captures spatial dependence via a fixed spatial lag coefficient, assuming a homogeneous influence of neighboring observations across all regions. While this model offers a baseline framework for evaluating spatial interactions, it may fail to account for spatial heterogeneity that often characterizes real-world environmental processes. Therefore, the SAR model is first estimated and its performance evaluated, prior to introducing a more flexible specification. The SAR model employed in this study is specified as follows:
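For orientation, a SAR model of this kind is commonly written in matrix form as sketched below. The symbols are assumptions chosen for illustration and are not necessarily the notation used elsewhere in this paper: $y$ is the $n \times 1$ response vector, $W$ the row-standardized spatial weight matrix, $\rho$ the constant spatial lag coefficient, $X$ the matrix of explanatory variables with coefficient vector $\beta$, and $\varepsilon$ the error term.

```latex
y = \rho W y + X\beta + \varepsilon
```

The single scalar $\rho$ is what imposes the same strength of neighbor influence on every city.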
Table 9 presents the estimation results of the SAR model applied to four environmental indicators. The results indicate that the estimated coefficients for the digital economy and green innovation are significantly negative in the models for and
, aligning with theoretical expectations that technological progress reduces pollution. However, in the models for
and
, most explanatory variables display weak or insignificant effects. Specifically, in the
, the coefficients for the digital economy, population density, and foreign direct investment are not statistically significant, suggesting that the model fails to capture meaningful relationships. In the
, although several variables are statistically significant, the estimated signs and magnitudes vary considerably compared to those in other pollutant models. Furthermore, although the spatial lag parameter
is statistically significant across all models, its estimated magnitude varies considerably, indicating differing degrees of spatial dependence among the pollutants. These findings suggest that the SAR model, due to its constant spatial effect assumption, lacks the flexibility needed to capture spatial heterogeneity in environmental processes.
Based on the framework of the SSIVCAR model and the empirical data, the model is specified as follows:
where yi denotes the response variable for city i, specifically one of ,
,
, or
.
The single-index varying coefficient captures the spatially varying intensity of the lag effect. In this study, the spatial location ui is defined using each city’s longitude and latitude, and
represents the combined influence of spatial position on the lag intensity. By integrating yi with the weighted average of its neighbors,
, the model can effectively capture the spatial correlation characteristics of urban environmental quality.
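Read together with the description above, the unit-level structure of such a model can be sketched as follows. The notation is assumed for illustration: $g(\cdot)$ is the unknown coefficient function, $\theta$ the single-index parameter vector, $u_i$ the longitude and latitude of city $i$, $w_{ij}$ the spatial weights, $x_i$ the explanatory variables with coefficients $\beta$, and $\varepsilon_i$ the error term.

```latex
y_i = g\!\left(u_i^{\top}\theta\right) \sum_{j=1}^{n} w_{ij}\, y_j + x_i^{\top}\beta + \varepsilon_i
```

Setting $g$ to a constant recovers a SAR model with a single spatial lag coefficient, which is the sense in which the varying coefficient relaxes the homogeneity assumption.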
We then estimate the parameters separately for the four response variables and denote significance levels (with * indicating significance, ** indicating high significance, and *** indicating extreme significance). Table 10 presents the estimation results for
,
,
, and
.
Table 10 shows that the single-index parameters and
are highly significant (with p < 0.001) for
and
, indicating pronounced spatial lag effects for these pollutants. In addition,
is also significant in the model for
(p < 0.05). These results reflect a significant spatial association in environmental pollution across regions. Among the explanatory variables, the digital economy indicator (
) has a significantly negative effect on all response variables—most notably on
(coefficient = –0.5123, p < 0.001)—suggesting that digital development plays an effective role in reducing industrial emissions and improving environmental quality. The effect of population density (
) varies across different pollutants: it is significantly positive for
(p < 0.01), indicating that population concentration may exacerbate industrial wastewater discharge, whereas its significantly negative effect on
(p < 0.01) may be related to industrial restructuring and enhanced environmental governance. Per capita GDP (
) is significantly positively correlated in all models, suggesting that economic development is associated with greater environmental pressure. Finally, green innovation (
) exhibits a significantly negative effect across all models, underscoring the important role of technological advancement in pollution mitigation.
In addition to the parameter estimates, we provide a statistical summary of spatial autoregressive coefficients, represented by the estimated values of for the response variables
,
,
, and
. The corresponding results are summarized in Table 11.
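A summary of this kind can be produced by evaluating the fitted coefficient function at every city and then taking simple statistics. In the sketch below, the coordinates, the index parameters `theta`, and the stand-in function `g_hat` are all hypothetical; in practice `g_hat` would be the estimated spline fit.

```python
import math
from statistics import mean, median


def location_coefficients(coords, theta, g):
    """Evaluate g(theta_1 * longitude + theta_2 * latitude) for each location."""
    return [g(theta[0] * lon + theta[1] * lat) for lon, lat in coords]


def summarize(values):
    """Mean, median, minimum, and maximum of the location-specific coefficients."""
    return {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical (longitude, latitude) pairs and index parameters.
coords = [(116.4, 39.9), (121.5, 31.2), (113.3, 23.1), (104.1, 30.7)]
theta = (0.6, 0.8)


def g_hat(index):
    """Hypothetical stand-in for the fitted coefficient function."""
    return 0.5 * math.sin(index / 10.0)


summary = summarize(location_coefficients(coords, theta, g_hat))
print(summary)
```

A wide gap between the minimum and maximum relative to the mean is the numerical signature of the heterogeneity that a constant-coefficient model would average away.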
Table 11 reveals that the spatial autoregressive coefficients of the response variables exhibit heterogeneity. For example, the coefficient for PM2.5 concentration has a mean of 0.1259 and a median of 0.1261, ranging from –0.2834 to 0.7665, indicating that while the overall spatial lag effect for PM2.5 is weak, certain regions exhibit significant positive correlation, likely reflecting the diffusion characteristics of air pollution among urban agglomerations. Similarly,
has a mean of 0.1334 and ranges from –0.7700 to 0.5199, suggesting that while most areas exhibit positive lag effects, negative spatial dependence in some regions may be linked to disparities in local environmental governance. For
, the mean spatial autoregressive coefficient is 0.1484, with a maximum of 0.9210, indicating that industrial smoke and dust emissions are subject to strong spatial spillover effects in certain areas. Finally,
has a mean of 0.0651 and a median of 0.0593, but reaches a maximum of 0.9400, suggesting that sulfur dioxide emissions exhibit very strong spatial association in specific urban regions.
Furthermore, Fig 8 presents the spatial distribution of the estimated autoregressive coefficients for ,
,
, and
.
(a) ; (b)
; (c)
; (d)
. Map data source: GADM, available at https://geodata.ucdavis.edu/gadm/gadm4.1/shp/gadm41_CHN_shp.zip. Data is licensed under the CC BY License [45].
Fig 8 illustrates the distribution of the spatial autoregressive coefficients for ,
,
, and
. The analysis shows that the spatial autoregressive coefficient of different environmental pollution indicators exhibits significant heterogeneity across regions, which reflects the combined influence of regional economic structures, industrial layouts, and pollutant dispersion mechanisms. For example, in Fig 8 (a),
exhibits generally positive spatial correlation across most regions; northern regions (e.g., the Beijing-Tianjin-Hebei and Northeast regions) exhibit particularly high correlation (up to 0.7665), while southern regions show weaker or even negative correlations. This indicates that air pollution in northern regions exhibits marked cross-regional diffusion, likely driven by high heating demand, industrial emissions, and regional climatic conditions. In particular, the Beijing-Tianjin-Hebei region, characterized by high industrial concentration and pollutant emissions, faces severe cross-regional air pollution transmission. Therefore, establishing coordinated regional governance mechanisms in the Beijing-Tianjin-Hebei and surrounding areas is essential, along with promoting technological upgrades for key pollution sources, encouraging new energy substitution, and implementing coal-to-electricity conversion policies. Additionally, optimizing transportation systems and accelerating the phase-out of high-emission diesel vehicles will help reduce the cross-regional transmission of pollutants.
For industrial wastewater discharge (Fig 8 (b)), the spatial autoregressive coefficient is more pronounced in coastal areas and certain economically developed inland cities, with a maximum value of 0.5199. By contrast, some central and western regions exhibit negative or insignificant spatial correlation. This suggests that industrial wastewater discharge in coastal areas has strong regional linkage, closely related to dense industrial clustering and well-developed water networks. For example, in the middle and lower reaches of the Yangtze River, industrial wastewater discharge affects not only local water quality but also has cascading impacts on upstream and downstream regions. These findings highlight the need for integrated basin-wide governance of water pollution, the promotion of advanced wastewater treatment technologies, and the implementation of recycled water utilization policies to mitigate cross-regional contamination.
The spatial autoregressive coefficient of industrial smoke and dust emissions (Fig 8 (c)) is particularly high in energy-intensive cities in central and western regions (e.g., Shanxi, Shaanxi, Inner Mongolia), with a maximum value reaching 0.9210. The concentration of coal, steel, and other energy-intensive industries in these regions results in substantial emissions of industrial smoke and dust, which tend to follow fixed diffusion paths and generate cumulative regional impacts. In addition, some industrial cities in eastern coastal areas also exhibit high correlation intensity, indicating that industrial smoke and dust emissions adversely affect the air quality in adjacent cities. In response, central and western regions should prioritize industrial restructuring in high-energy-consuming sectors, accelerate supply-side reforms in the coal industry, and promote the adoption of low- and zero-emission technologies. Strengthening regional air quality monitoring networks and improving the precision of pollution source identification can further enhance the control of particulate matter pollution in key areas.
For sulfur dioxide emissions (Fig 8 (d)), the spatial autoregressive coefficient is highest in the Northeast, Bohai Rim, and East China regions, with a maximum value of 0.9400. These regions are characterized by concentrations of high-polluting industries such as chemicals and steel, leading to strong spatial linkage in sulfur dioxide emissions. In contrast, some industrial cities in central and western regions show lower sulfur dioxide emission intensities with weak spatial correlation due to limited atmospheric dispersion. To address these issues, regions with high pollution levels should accelerate the implementation of total SO2 emission control policies. In particular, ultra-low emission retrofit technologies should be prioritized in the Northeast and Bohai Rim regions to reduce SO2 emissions from coal-fired power plants and the steel industry. Moreover, optimizing the regional distribution of heavy and chemical industries can gradually reduce the concentration of high-polluting sectors in a single area, thereby promoting cleaner production and improving regional air quality.
In summary, the spatial autoregressive coefficient distribution of different pollutants indicates that environmental governance must fully account for regional differences and pollutant diffusion patterns. In northern regions, PM2.5 control requires the establishment of cross-regional collaborative mechanisms and the promotion of clean energy substitution; in coastal economic zones, industrial wastewater management should focus on basin-wide coordination; and in central, western, and northeastern industrial areas, accelerating industrial upgrading and the promotion of clean technologies is essential to mitigate the impact of heavy-polluting industries. Targeted policies that avoid resource wastage while effectively reducing cross-regional pollutant diffusion provide a scientific basis for the comprehensive improvement of regional environmental quality.
To further investigate the impact of digital economy development on environmental quality, we examine a set of key variables that capture both environmental outcomes and socioeconomic characteristics. The response variables are PM2.5 concentration, industrial wastewater discharge, industrial smoke and dust emissions, and sulfur dioxide emissions, representing different types of pollution indicators. The explanatory variables include the digital economy indicator, along with control variables for population density, per capita GDP, foreign direct investment, scientific expenditure, and green innovation. Table 5, presented earlier, summarizes the descriptions and sources of these variables.
During the data preprocessing stage, all variables are log-transformed to reduce scale disparities and achieve more normalized distributions. Table 6 presents the descriptive statistics of the log-transformed variables.
Furthermore, we calculate the global Moran’s I for the response variables to assess their spatial autocorrelation, with the results presented in Table 7 and visualized via Moran’s I scatterplots (Fig 7). LM-Lag and LM-Error tests (Table 8) further confirm the presence of significant spatial lag effects, thereby supporting the application of a SAR model.
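To make the statistic concrete, the following is a minimal sketch of the global Moran's I, computed as I = (n / S0) · (z'Wz) / (z'z), where z is the mean-centered variable and S0 is the sum of all spatial weights. The function name `morans_i`, the five-unit line layout, and the toy values are assumptions made for illustration only; they are not the data or code used in this study.

```python
import numpy as np

def morans_i(y, W):
    """Global Moran's I for values y under a spatial weight matrix W (zero diagonal)."""
    y = np.asarray(y, dtype=float)
    W = np.asarray(W, dtype=float)
    n = y.size
    z = y - y.mean()          # deviations from the mean
    s0 = W.sum()              # sum of all weights
    return (n / s0) * (z @ W @ z) / (z @ z)

# Hypothetical layout: five units on a line, adjacent units are neighbours.
W = np.zeros((5, 5))
for i in range(4):
    W[i, i + 1] = W[i + 1, i] = 1.0

y_trend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # similar values cluster together
y_alt = np.array([1.0, -1.0, 1.0, -1.0, 1.0])    # neighbours take opposite values

i_trend = morans_i(y_trend, W)   # positive spatial autocorrelation
i_alt = morans_i(y_alt, W)       # negative spatial autocorrelation
```

A smooth spatial trend yields a positive value, while an alternating pattern yields a negative one, which is the sense in which a significantly positive Moran's I supports modeling a spatial lag.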
Based on the SSIVCAR model framework, we estimate the model specified in equation (26) for each response variable. Table 10 shows that the digital economy indicator has a significantly negative effect on environmental pollutants, whereas population density, per capita GDP, and green innovation exhibit heterogeneous effects across different pollutant indicators.
In addition, we analyze the spatial autoregressive coefficients, as summarized in Table 11 and illustrated in Fig 8. The results reveal significant heterogeneity in spatial correlation across different pollutants. For example, northern cities such as those in the Beijing-Tianjin-Hebei and Northeast regions exhibit high spatial correlation in PM2.5 concentrations, while coastal areas show strong positive spatial correlation in industrial wastewater discharge. Energy-intensive cities in central and western regions display very high spatial correlation in industrial smoke and dust emissions, and regions in the Northeast, Bohai Rim, and East China demonstrate pronounced spatial clustering in sulfur dioxide emissions.
These empirical findings indicate that digital economy development plays a significant role in improving environmental quality, although the magnitude and direction of its effects vary across pollutants and regions. The results underscore the need for targeted, region-specific environmental policies, for instance establishing cross-regional coordination mechanisms in northern China to control PM2.5, promoting basin-wide coordination for water pollution control in coastal areas, and advancing industrial restructuring and the adoption of clean technologies in energy-intensive regions.
Conclusion
This paper proposes a novel SSIVCAR model to address the limitations inherent in the SAR model. The SAR model assumes a uniform spatial correlation across all spatial units, which fails to capture the spatial heterogeneity commonly observed in applications. To overcome this shortcoming, we incorporate a non-linear single-index varying coefficient function into the classical framework. This formulation allows the spatial autoregressive coefficients to vary dynamically with the characteristics of individual spatial units, thereby providing a more accurate representation of complex spatial dependence structures.
Methodologically, we propose a joint estimation procedure that combines spline approximation with 2SLS to estimate the model parameters, and we derive explicit expressions for the parameter estimators. In addition, a series of Monte Carlo simulation experiments is designed to comprehensively evaluate the finite-sample performance of the model. The experimental results demonstrate that, compared to the SAR model, the proposed model offers substantial improvements in capturing spatial heterogeneity and more accurately characterizes variations across spatial units. In comparative experiments, we further investigate the model’s performance under various spatial adjacency structures and different levels of random noise, with results indicating that the proposed model remains robust and broadly applicable across a range of conditions.
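The idea of combining a spline approximation with 2SLS can be illustrated with a simplified sketch. It is not the estimation procedure of this paper: it assumes that the index value u for each unit is observed directly (so no index direction is estimated), uses a linear truncated-power basis instead of a specific spline family, and relies on simulated data with a ring-shaped weight matrix. All names (`spline_basis`, `tsls`, `rho`, `u`) and parameter values are hypothetical.

```python
import numpy as np

def spline_basis(u, knots):
    """Linear truncated-power basis: [1, u, (u - k1)+, ..., (u - kK)+]."""
    u = np.asarray(u, dtype=float)
    cols = [np.ones_like(u), u] + [np.clip(u - k, 0.0, None) for k in knots]
    return np.column_stack(cols)

def tsls(y, Z, H):
    """2SLS: project regressors Z onto instruments H, then regress y on the fit."""
    Z_hat = H @ np.linalg.lstsq(H, Z, rcond=None)[0]
    return np.linalg.lstsq(Z_hat, y, rcond=None)[0]

rng = np.random.default_rng(0)
n = 400

# Assumed ring layout: each unit has two neighbours, rows sum to one.
idx = np.arange(n)
W = np.zeros((n, n))
W[idx, (idx + 1) % n] = 0.5
W[idx, (idx - 1) % n] = 0.5

X = rng.normal(size=(n, 2))
beta = np.array([1.0, -0.5])
u = rng.uniform(0.0, 1.0, size=n)      # index value, assumed observed
rho = 0.2 + 0.5 * u                    # true varying spatial coefficient
eps = 0.1 * rng.normal(size=n)

# Reduced form: y = (I - diag(rho) W)^{-1} (X beta + eps)
y = np.linalg.solve(np.eye(n) - rho[:, None] * W, X @ beta + eps)

# Approximate rho(u) by B(u) @ gamma, so the endogenous regressors are B_k(u) * (Wy).
B = spline_basis(u, knots=[0.5])
Wy = W @ y
Z = np.column_stack([B * Wy[:, None], X])

# Instruments: X plus basis-weighted spatial lags of X (WX and W^2 X).
WX, W2X = W @ X, W @ W @ X
H = np.column_stack(
    [X] + [B[:, [k]] * M for k in range(B.shape[1]) for M in (WX, W2X)]
)

delta = tsls(y, Z, H)
gamma_hat, beta_hat = delta[: B.shape[1]], delta[B.shape[1]:]
rho_hat = B @ gamma_hat                # estimated spatial coefficient per unit
```

The sketch treats each basis-weighted spatial lag as a separate endogenous regressor, which is why the instrument set interacts the same basis functions with lags of the exogenous covariates.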
Finally, to validate the practical applicability of the proposed model, we apply it to examine the impact of digital economy development on environmental pollution in China. We select PM2.5 concentration, industrial wastewater discharge, industrial smoke and dust emissions, and sulfur dioxide emissions as representative indicators of environmental pollution, and incorporate the digital economy indicator and relevant control variables into a comprehensive empirical framework. The study finds that digital economy development in China has a significant and spatially heterogeneous effect on various environmental pollution indicators. Specifically, for PM2.5 and industrial wastewater, digital economy development significantly reduces air pollution and wastewater discharge, with more pronounced effects in economically developed coastal regions. For industrial smoke and dust and sulfur dioxide emissions, the impact is more complex: it exhibits negative correlations in some regions, while in others it is shaped by industrial composition and the degree of policy coordination. Digital economy development contributes positively to pollution control by promoting green technological innovation, optimizing industrial structures, and improving resource use efficiency; however, its overall effectiveness remains constrained by regional economic development levels, industrial concentration, and the effectiveness of environmental governance policies.
Moreover, the empirical results of the SSIVCAR model further reveal the cross-regional characteristics and spillover effects of environmental pollution. The findings indicate that the diffusion of PM2.5 pollution is particularly pronounced in the Beijing-Tianjin-Hebei region, while industrial wastewater discharge exhibits strong regional interdependence in the Yangtze River Delta. These results suggest that environmental governance policies should be tailored to the spatial diffusion characteristics of pollutants by enhancing inter-regional collaboration and optimizing pollution control mechanisms. In addition, the findings highlight the importance of accounting for spatial heterogeneity in digital economy development to achieve the dual objectives of environmental improvement and sustained economic growth.
Future research may extend the application of this model to other domains or incorporate panel or time-series data to investigate time-varying spatial correlation effects, thereby providing deeper theoretical insights for spatial economics and evidence-based support for environmental governance policies.
References
- 1. Anselin L. Spatial Econometrics: Methods and Models. Springer Netherlands. 1988. https://doi.org/10.1007/978-94-015-7799-1
- 2. LeSage J, Pace RK. Introduction to Spatial Econometrics. Chapman and Hall/CRC. 2009. https://doi.org/10.1201/9781420064254
- 3. Elhorst JP. Spatial Econometrics. Springer Berlin Heidelberg. 2014. https://doi.org/10.1007/978-3-642-40340-8
- 4. Feng T, Du H, Lin Z, Zuo J. Spatial spillover effects of environmental regulations on air pollution: Evidence from urban agglomerations in China. J Environ Manage. 2020;272:110998. pmid:32854900
- 5. Overmars KP, de Koning GHJ, Veldkamp A. Spatial autocorrelation in multi-scale land use models. Ecological Modelling. 2003;164(2–3):257–70.
- 6. Chen X, Dai E. Comparison of spatial autoregressive models on multi-scale land use. Trans Chin Soc Agric Eng. 2011;27:324–31.
- 7. Wheeler DC, Calder CA. An assessment of coefficient accuracy in linear regression models with spatially varying coefficients. J Geograph Syst. 2007;9(2):145–66.
- 8. Kim M, Wang L. Generalized spatially varying coefficient models. J Comput Graph Stat. 2020;30(1):1–10.
- 9. Murakami D, Griffith DA. Spatially varying coefficient modeling for large datasets: Eliminating N from spatial regressions. Spatial Statistics. 2019;30:39–64.
- 10. Aquaro M, Bailey N, Pesaran MH. Estimation and inference for spatial models with heterogeneous coefficients: an application to US house prices. J Appl Econ. 2020;36(1):18–44.
- 11. Li Q, Racine JS. Nonparametric econometrics: theory and practice. Princeton, NJ, USA: Princeton University Press; 2007. https://doi.org/10.1093/erae/jbn027
- 12. Lee L-F. Asymptotic distributions of quasi-maximum likelihood estimators for spatial autoregressive models. Econometrica. 2004;72(6):1899–925.
- 13. Su L, Yang Z. QML estimation of dynamic panel data models with spatial errors. J Econ. 2015;185(1):230–58.
- 14. Geniaux G, Martinetti D. A new method for dealing simultaneously with spatial autocorrelation and spatial heterogeneity in regression models. Reg Sci Urban Econ. 2018;72:74–85.
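- 15. Kelejian HH, Prucha IR. A generalized spatial two-stage least squares procedure for estimating a spatial autoregressive model with autoregressive disturbances. J Real Estate Finance Econ. 1998;17(1):99–121.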
- 16. Elhorst JP. Dynamic spatial panels: models, methods, and inferences. J Geogr Syst. 2011;14(1):5–28.
- 17. Lam C, Souza PCL. Estimation and selection of spatial weight matrix in a spatial lag model. J Bus Econ Stat. 2019;38(3):693–710.
- 18. Brunsdon C, Fotheringham S, Charlton M. Geographically weighted regression. J R Stat Soc D. 1998;47(3):431–43.
- 19. Brunsdon C, Fotheringham AS, Charlton ME. Geographically weighted regression: a method for exploring spatial nonstationarity. Geogr Anal. 1996;28(4):281–98.
- 20. Comber A, Brunsdon C, Charlton M, Harris P. The GWR route map: a guide to the informed application of geographically weighted regression. arXiv Prepr. 2020.
- 21. Hagenauer J, Helbich M. A geographically weighted artificial neural network. International Journal of Geographical Information Science. 2021;36(2):215–35.
- 22. Lu B, Brunsdon C, Charlton M, Harris P. Geographically weighted regression with parameter-specific distance metrics. Int J Geogr Inform Sci. 2016;31(5):982–98.
- 23. Li, Fotheringham, Li, Oshan. Fast geographically weighted regression (FastGWR): a scalable algorithm to investigate spatial process heterogeneity in millions of observations. Int J Geogr Inf Sci. 2019;33(1):155–75.
- 24. Hastie T, Tibshirani R. Varying-coefficient models. J R Stat Soc Ser B: Stat Methodol. 1993;55(4):757–79.
- 25. Yang Z. Instrumental variable quantile estimation of spatial autoregressive models. Proceedings of the 1st world conference of the spatial econometrics association, Cambridge, UK; 2007. https://ink.library.smu.edu.sg/soe_research/1074/
- 26. Sun Y. Functional-coefficient spatial autoregressive models with nonparametric spatial weights. J Econ. 2016;195(1):134–53.
- 27. Chen Z, Chen M, Xing G. Bayesian estimation of partially linear additive spatial autoregressive models with P-splines. Math Probl Eng. 2021;2021:1–14.
- 28. Liu H, Cui X. Adaptive estimation for spatially varying coefficient models. MATH. 2023;8(6):13923–42.
- 29. Fan J, Zhang W. Statistical methods with varying coefficient models. Stat Interface. 2008;1(1):179–95. pmid:18978950
- 30. Lee L. GMM and 2SLS estimation of mixed regressive, spatial autoregressive models. J Econ. 2007;137(2):489–514.
- 31. Ichimura H. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. J Econ. 1993;58(1–2):71–120.
- 32. Yu Y, Ruppert D. Penalized spline estimation for partially linear single-index models. J Am Stat Assoc. 2002;97(460):1042–54.
- 33. Klein N, Kneib T, Klasen S, Lang S. Bayesian structured additive distributional regression for multivariate responses. J R Stat Soc Ser C: Appl Stat. 2014;64(4):569–91.
- 34. Li Y, Ying C. Semi-functional partial linear spatial autoregressive model. Commun Stat – Theory Methods. 2020;50(24):5941–54.
- 35. Guan X, Liu H, You J, Song X. Estimation and inference for dynamic single-index varying-coefficient models. Stat Sin. 2023;33:85–105. https://doi.org/10.5705/ss.202019.0467
- 36. Xue L, Wang Q. Empirical likelihood for single-index varying-coefficient models. Bernoulli. 2012;18(3).
- 37. Conn AR, Gould NIM, Toint PL. Trust region methods. Society for Industrial and Applied Mathematics: Philadelphia, PA, USA; 2000. https://doi.org/10.1137/1.9780898719857
- 38. Ejigu BA, Wencheko E. Introducing covariate dependent weighting matrices in fitting autoregressive models and measuring spatio-environmental autocorrelation. Spatial Stat. 2020;38:100454.
- 39. Ejigu BA, Wencheko E, Moraga P, Giorgi E. Geostatistical methods for modelling non-stationary patterns in disease risk. Spatial Stat. 2020;35:100397.
- 40. Ejigu BA. Geostatistical analysis and mapping of malaria risk in children of Mozambique. PLoS One. 2020;15(11):e0241680. pmid:33166322
- 41. Ejigu BA, Shiferaw S, Moraga P, Seme A, Yihdego M, Zebene A, et al. Spatial analysis of modern contraceptive use among women who need it in Ethiopia: using geo-referenced data from performance monitoring for action. PLoS One. 2024;19(4):e0297818. pmid:38573989
- 42. Getis A. Spatial autocorrelation. Handbook of applied spatial analysis. Springer Berlin Heidelberg; 2009. p. 255–78. https://doi.org/10.1007/978-3-642-03647-7_14
- 43. Moran PAP. Notes on continuous stochastic phenomena. Biometrika. 1950;37(1/2):17.
- 44. National Bureau of Statistics of China. China Urban Statistical Yearbook. China Statistics Press; 2023.
- 45. GADM. GADM version 4.1: Database of global administrative areas. Available at: https://geodata.ucdavis.edu/gadm/gadm4.1/shp/gadm41_CHN_shp.zip. Data licensed under the CC BY License. 2020.