
Multi-scale dynamic graph neural network for PM2.5 concentration prediction in regional station cluster

  • Xin Lu,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing – original draft

    Affiliations College of Computer Science and Mathematics, Central South University of Forestry and Technology, Changsha, China, Hunan Changsha-Zhuzhou-Xiangtan City Cluster National Research Station of Ecosystem, Hunan Botanical Garden, Changsha, China

  • Juyang Liao,

    Roles Project administration, Validation

    Affiliation Hunan Changsha-Zhuzhou-Xiangtan City Cluster National Research Station of Ecosystem, Hunan Botanical Garden, Changsha, China

  • Huihua Huang,

    Roles Software

    Affiliation College of Computer Science and Mathematics, Central South University of Forestry and Technology, Changsha, China

  • Jiazhen Li,

    Roles Validation

    Affiliation College of Computer Science and Mathematics, Central South University of Forestry and Technology, Changsha, China

  • Hao Huang,

    Roles Validation

    Affiliation College of Computer Science and Mathematics, Central South University of Forestry and Technology, Changsha, China

  • Yingfang Zhu

    Roles Conceptualization, Project administration, Resources, Validation, Writing – review & editing

    zhuyingfang@csuft.edu.cn

    Affiliation College of Computer Science and Mathematics, Central South University of Forestry and Technology, Changsha, China

Abstract

Accurate prediction of PM2.5 concentrations is crucial for public health and environmental management. However, effectively capturing complex spatiotemporal dependencies across multiple time scales remains a persistent challenge for existing methods, particularly in regions with sparse monitoring stations. This study proposes a Multi-Scale Dynamic Graph Neural Network (MSDGNN) for PM2.5 forecasting in station clusters. The model incorporates multi-scale temporal modeling (hourly, daily, weekly) to capture both short- and long-term dependencies. A learnable mapping matrix dynamically groups stations to strengthen spatial correlation learning. Furthermore, MSDGNN employs multi-head attention and spatiotemporal graph attention mechanisms to construct dynamic graphs, utilizing adaptive adjacency matrices and Chebyshev graph convolutions for effective feature propagation. We evaluated MSDGNN on 22 air quality monitoring stations in the Chang-Zhu-Tan region. Results demonstrated that our model reduces MAE by 6.77% and RMSE by 8.67% compared to the best baseline, validating its capability to learn complex dependencies and deliver accurate predictions under diverse spatiotemporal conditions.

Introduction

Amid rapid economic growth and extensive urban expansion, air pollution has emerged as a critical global environmental and public health challenge [1]. Particulate matter (PM), especially fine particulate matter PM2.5, is among the most harmful pollutants due to its ability to penetrate deeply into the human respiratory system [2]. Epidemiological studies indicate that for every 10 μg/m³ increase in PM2.5 concentration, cardiovascular mortality rises by approximately 11% [3]. Consequently, reliable prediction of PM2.5 levels is essential for safeguarding public health and mitigating adverse health impacts [4].

Current methodologies for PM2.5 concentration prediction primarily fall into two categories: numerical simulation-based approaches and data-driven models. Numerical simulation models [5–7], grounded in atmospheric dynamics, utilize emission inventories, meteorological data, and boundary conditions to simulate pollutant diffusion and chemical reactions [8]. However, these models often face limitations in capturing localized variations, particularly in regions with complex topography or heterogeneous pollution sources [9]. In contrast, data-driven approaches learn directly from high-frequency, multi-dimensional sensor data, offering a powerful means to address complex regional characteristics [10].

Traditional machine learning techniques, such as Linear Regression (LR) [11], Random Forests (RF) [12], and Support Vector Regression (SVR) [13], have demonstrated improvements over classical statistical methods (e.g., Historical Average, ARMA [14], ARIMA [15]) in handling large datasets. Nevertheless, these methods often fail to adequately capture the temporal dependencies critical for accurate PM2.5 forecasting [16]. Deep learning models, including Convolutional Neural Networks (CNN) [17,18], Long Short-Term Memory networks (LSTM) [19,20], and Gated Recurrent Units (GRU) [21,22], effectively address temporal modeling challenges [23]. Despite these improvements, a common limitation of these approaches is their predominant focus on either spatial or temporal patterns, often at the expense of simultaneous multi-scale integration.

Unlike traditional CNNs and RNNs designed for Euclidean data, graph neural networks (GNNs) perform convolutions on irregular, non-Euclidean structures, offering flexible modeling of complex relationships between nodes [24]. This capability enables effective capture of spatial correlations in pollutant dispersion and multi-dimensional associative features [25]. For instance, Bai et al. [26] proposed an Adaptive Graph Convolutional Recurrent Network (AGCRN), which incorporates Node Adaptive Parameter Learning (NAPL) to generate node-specific parameters, thereby capturing spatial dependencies and site-specific characteristics. Li et al. [27] developed a Dynamic Graph Convolutional Recurrent Network (DGCRN) that employs hypernetworks to dynamically generate adjacency matrices at each time step, capturing dynamic spatiotemporal correlations. Choudhury et al. [28] proposed AGCTCN, combining spatial attention with graph and temporal convolutions to model spatiotemporal dependencies for short-term PM2.5 forecasting.

In parallel, Transformer-based architectures have gained prominence for sequence modeling, leveraging multi-head self-attention to process multiple feature subspaces in parallel and handle long-range dependencies. Liang et al. [29] proposed AirFormer, which incorporates Dartboard Spatial MSA and Causal Temporal MSA to model spatiotemporal dependencies for nationwide PM2.5 predictions. Zhang et al. [30] introduced Crossformer, which decomposes and aggregates multi-scale information through multi-head attention for comprehensive feature representations.

Despite these advancements, spatiotemporal data in PM2.5 prediction inherently exhibit multi-scale characteristics: spatially, from individual sites to regional clusters; temporally, from hourly fluctuations to weekly periodic patterns. Existing GNN and Transformer-based models remain inadequate for effectively capturing these cross-scale spatiotemporal dependencies simultaneously. Furthermore, in practice, monitoring stations are often sparsely distributed due to economic and logistical constraints, resulting in weak data associations between distant stations. This sparsity leads to insufficient modeling of long-range spatiotemporal dependencies, hindering accurate capture of pollutant transport patterns from distant sources [31].

To address these challenges, we propose a Multi-Scale Dynamic Graph Neural Network (MSDGNN) for PM2.5 forecasting. The model employs parallel multi-scale temporal modeling to capture both short-term fluctuations and long-term periodic patterns. Spatially, it utilizes a learnable grouping module that dynamically clusters stations based on functional similarities, enhancing long-range dependency modeling in sparse networks. The architecture integrates Multi-Head Attention for global spatial correlations and Spatiotemporal Graph Attention for dynamic relationship weighting. These features are processed through an Adaptive Graph Convolution module using Chebyshev approximation on a hybrid graph structure. Extensive evaluations on real-world data from 22 monitoring stations in the Chang-Zhu-Tan region demonstrate that our model significantly outperforms existing state-of-the-art methods.

Data and problem statement

Study area and data

The Chang-Zhu-Tan (CZT) urban agglomeration, situated in the middle reaches of the Yangtze River in China, encompasses the cities of Changsha, Zhuzhou, and Xiangtan. As a pivotal region within the Yangtze River Economic Belt, the CZT has experienced rapid economic growth and urbanization, leading to increasing challenges associated with PM2.5 pollution.

Hourly air quality data were collected from 22 monitoring stations across the three cities via the National Urban Air Quality Real-Time Release Platform (https://air.cnemc.cn:18007/). The dataset spans from January 2020 to December 2023 and includes measurements of six major pollutants (PM2.5, PM10, NO₂, CO, O₃, SO₂) as well as the Air Quality Index (AQI).

Problem statement

PM2.5 prediction faces three interconnected challenges: capturing multi-scale temporal patterns, modeling spatial correlations between stations, and handling sparse monitoring networks. We formulate this as a spatiotemporal graph learning problem with dynamic station grouping.

Spatiotemporal graph definition.

A set of stations is defined as S = {s1, s2, …, sN} and a set of station groups is defined as G = {g1, g2, …, gM}, where N indicates the total number of stations and M reflects the total number of station groups. Each station graph is associated with a specific number M of station groups. A station graph Gs = (Vs, Es, D, E) is utilized to represent the relationships among stations, where Vs represents the collection of station nodes, Es refers to the set of edges, D is the list of node attributes, and E is the edge adjacency matrix. Similarly, a station group graph Gg = (Vg, Eg, H, T) is employed to represent the relationships among station groups, where Vg includes the group nodes, Eg represents the list of edges among these groups, H stands for the matrix of group node attributes, and T depicts the matrix of edge attributes.

Multi-scale temporal modeling.

PM2.5 concentrations exhibit patterns across multiple temporal scales. Given the current time t0 and prediction horizon Tp, we extract three complementary time segments. For recent observations, Xh ∈ R^(N×F×Th) captures short-term fluctuations from the past Th hours. For daily patterns, Xd ∈ R^(N×F×D) extracts same-hour data from the past D days to model diurnal cycles. For weekly patterns, Xw ∈ R^(N×F×W) samples the same day-hour from the past W weeks to capture weekly periodicity.

Here, N represents the number of monitoring stations and F denotes the feature dimension, comprising six pollutant measurements (PM2.5, PM10, NO₂, CO, O₃, SO₂) collected at each station. This multi-scale temporal decomposition allows the model to capture both immediate fluctuations and recurring patterns that characterize PM2.5 dynamics.
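As a concrete illustration, the three segments can be sliced from an hourly data array as follows (a minimal NumPy sketch; the function name and the default window lengths Th = 24, D = 7, W = 4 are illustrative, not the paper's settings):

```python
import numpy as np

def extract_segments(X, t0, Th=24, D=7, W=4):
    """Slice the three temporal components from an hourly series X of
    shape (T, N, F), ending just before the current time t0.
    Hypothetical helper; window lengths are illustrative defaults."""
    # Recent component: the Th hours immediately preceding t0.
    Xh = X[t0 - Th:t0]                                        # (Th, N, F)
    # Daily component: the same clock hour from each of the past D days.
    Xd = np.stack([X[t0 - 24 * d] for d in range(D, 0, -1)])  # (D, N, F)
    # Weekly component: the same day-hour from each of the past W weeks.
    Xw = np.stack([X[t0 - 168 * w] for w in range(W, 0, -1)]) # (W, N, F)
    return Xh, Xd, Xw
```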

Prediction task.

Given historical observations at multiple scales {Xh, Xd, Xw} and station set S, we learn a function fθ to predict future PM2.5 concentrations:

Ŷ = fθ({Xh, Xd, Xw}, S)  (1)

Here, θ represents the model parameters to be learned, and the predicted PM2.5 concentrations over the next Tp time steps are represented by Ŷ ∈ R^(N×Tp).

Research methodology

Overview of method

The MSDGNN architecture addresses three key challenges through an integrated approach. Multi-scale temporal modeling (hourly, daily, weekly components) captures dependencies across different time horizons. The station grouping module discovers spatial correlations beyond geographic proximity, particularly valuable for sparse monitoring networks. Adaptive graph convolution dynamically modulates these spatial relationships based on learned attention weights.

As illustrated in Fig 1, MSDGNN consists of three parallel components processing historical data at different timescales. Each component contains multiple sequentially connected spatiotemporal blocks (ST-Blocks), with residual connections between blocks to optimize training efficiency. A fully connected layer with feature fusion combines outputs from the three components to generate final predictions. Each ST-Block integrates three modules: a station grouping module, an attention enhancement module, and a dynamic adaptive graph convolution network.

Fig 1. Schematic architecture of the proposed Multi-Scale Dynamic Graph Neural Network (MSDGNN).

The model processes input data at hourly (H), daily (D), and weekly (W) temporal scales through parallel components. Each component consists of stacked Spatiotemporal Blocks (ST-Blocks) that incorporate Multi-Head Attention (MHA), a Spatiotemporal Graph Attention Network (ST-GAT), a Station Grouping Module (SGM), and a Dynamic Adaptive Graph Convolution Network (DAGCN). Outputs from each scale are fused to generate the final prediction. Residual connections are employed to facilitate training.

https://doi.org/10.1371/journal.pone.0338392.g001

Station grouping learning module

The model captures dependencies at multiple spatial scales by considering stations and station groups across different temporal scales. According to atmospheric dispersion theory, PM2.5 concentrations exhibit spatial heterogeneity driven by emission sources, meteorological transport, and chemical transformations. These factors naturally partition monitoring stations into functional groups.

Station group dependency modeling.

We employ a learnable mapping matrix S ∈ R^(N×M) to dynamically discover these latent patterns, where each element Si,j represents the probability of station i belonging to group j. The matrix is initialized using Xavier uniform initialization, with row-wise softmax normalization maintaining probabilistic constraints during training:

Si,j = exp(Ŝi,j) / Σk=1..M exp(Ŝi,k)  (2)

Here, Ŝ denotes the unnormalized assignment scores. The soft assignment is constrained so that the probabilities for each station sum to unity (Σj Si,j = 1). The weights in the matrix S thus reflect the relevance of each station to the different groups. A concrete example of this probability distribution for 6 stations across 3 groups is visualized in the inset table of Fig 2, which illustrates the hierarchical processing and bidirectional information flow of the Station Grouping Learning Module.
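The row-wise softmax normalization of Eq (2) can be sketched as follows (illustrative NumPy code, not the training implementation):

```python
import numpy as np

def soft_group_assignment(scores):
    """Row-wise softmax turning raw N x M scores into soft station-to-group
    membership probabilities; each station's row sums to one."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # stabilised softmax
    return e / e.sum(axis=1, keepdims=True)
```

Because the normalization is a softmax rather than a hard argmax, every station retains a nonzero membership in every group, which keeps the assignment differentiable during training.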

Fig 2. Station Grouping Learning Module (SGM) Architecture.

The diagram depicts the hierarchical processing pipeline: input features X are processed by MHA and ST-GAT to compute dynamic attention, followed by the SGM which maps stations (yellow) to groups (green) via a learnable matrix S (see inset). Bidirectional information flow (dashed lines) and feature fusion via DAGCN enable the final prediction .

https://doi.org/10.1371/journal.pone.0338392.g002

The station group representation Hj for group gj is computed as:

Hj = Σi=1..N Si,j Xi  (3)

Here, Xi represents the air quality data of station si. This formulation enables stations with similar pollution patterns to be assigned to the same group, creating virtual monitoring regions that transcend geographical boundaries. After obtaining Hj for each station group, the dependencies among station groups are further computed, and temporal information is introduced to form more dynamic spatial dependencies. The edge attributes Ti,j between station groups Hi and Hj are computed as:

Ti,j = MLP([Hi ∥ Hj ∥ time])  (4)

The temporal encoding time ∈ R6 captures temporal features through sinusoidal functions that preserve natural periodicity, incorporating sine and cosine transformations of the hour of day (h ∈ [0,23], encoded as sin(2πh/24) and cos(2πh/24)), day of week (d ∈ [0,6], encoded as sin(2πd/7) and cos(2πd/7)), and month of year (m ∈ [1,12], encoded as sin(2πm/12) and cos(2πm/12)). The MLP leverages these encoded temporal patterns to dynamically modulate inter-group connections, strengthening connections from industrial zones during weekday working hours when PM2.5 transport is typically high, while weakening them during weekend nights when industrial emissions reduce. This temporal modulation effectively captures time-varying inter-group influences that reflect changing pollution transport patterns driven by diurnal boundary layer dynamics and seasonal variations.
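The six-dimensional encoding can be written directly from these definitions (minimal sketch; the function name is illustrative):

```python
import math

def time_encoding(hour, weekday, month):
    """Six-dimensional periodic encoding of hour of day (0-23),
    day of week (0-6), and month of year (1-12) via paired
    sine/cosine transforms, preserving the natural periodicity."""
    return [
        math.sin(2 * math.pi * hour / 24), math.cos(2 * math.pi * hour / 24),
        math.sin(2 * math.pi * weekday / 7), math.cos(2 * math.pi * weekday / 7),
        math.sin(2 * math.pi * month / 12), math.cos(2 * math.pi * month / 12),
    ]
```

The sine/cosine pairing ensures that adjacent times map to nearby points on the unit circle, so hour 23 and hour 0 are encoded as neighbors rather than as opposite extremes of a linear scale.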

Based on these dependencies, message aggregation and representation updates are performed. The message-passing mechanism comprises two main steps:

ri = ρg({(Hj, Ti,j) : j ∈ N(i)}),  H′i = φg(Hi, ri)  (5)

Here, H′i represents the revised station group representation, ri denotes the information aggregated from neighboring groups, and ρg and φg represent the message aggregation and representation update functions, respectively, typically implemented by MLPs.

Station dependency modeling.

The updated station group representations H′, which encode regional pollution patterns, are then propagated back to individual stations to enrich local measurements with broader contextual information. Specifically, the representation of each station is refined by integrating information from all groups to which it belongs, weighted by the soft-assignment probabilities from matrix S:

Zi = Σj=1..M Si,j H′j  (6)

Here, Zi represents an intermediate station representation fused with group-level insights. The final station representation is obtained through an additional message-passing step that incorporates information from neighboring stations, capturing both direct influences and complex correlations:

mi = ρs({(Zn, En,i) : n ∈ N(i)}),  X′i = φs(Zi, mi)  (7)

Here, En,i represents the edge attributes between stations n and i, mi represents the collection of all messages sent to station i from its neighbors, and X′i is the final updated station representation. This hierarchical, bidirectional information flow (upward aggregation from stations to groups and downward refinement from groups to stations) enables the model to effectively integrate multi-scale spatial dependencies for accurate prediction.

Attention enhancement module

The attention enhancement module captures complex spatiotemporal dependencies through two complementary mechanisms: global spatial attention via Multi-Head Self-Attention (MSA) and dynamic spatiotemporal attention that adapts to temporal variations. These mechanisms work together to identify both long-range dependencies and time-varying patterns in PM2.5 propagation.

Spatial multi-head self-attention.

MSA captures global-scale spatial dependencies beyond local graph neighborhoods. Specifically, given the air quality data X, MSA generates multiple subspaces via linear transformations, producing queries Q, keys K, and values V. They are computed as Q = XWQ, K = XWK, V = XWV, where WQ, WK, and WV are learnable parameters. Then, the spatial attention weights are calculated via a scaled dot-product operation:

Attention(Q, K, V) = softmax(QKᵀ / √dk) V  (8)

Here, dk denotes the dimensionality of the key subspace. This mechanism identifies dependencies between distant monitoring stations that may share similar emission sources or meteorological patterns despite geographical separation.
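A single attention head over stations can be sketched as follows (illustrative NumPy code; MSDGNN runs several such heads in parallel over learned subspaces and concatenates their outputs):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def spatial_attention(X, Wq, Wk, Wv):
    """One attention head over stations: rows of X are station features,
    and the N x N weight matrix A couples every station pair, independent
    of geographic distance."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    dk = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(dk))  # scaled dot-product weights
    return A @ V, A
```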

Dynamic spatiotemporal graph attention mechanism.

After processing by MSA, the dynamic spatiotemporal attention mechanism further refines the broad spatial relationships captured, adjusting and optimizing for time-sensitive changes. Specifically, spatiotemporal attention matrices are generated from dynamic temporal and spatial data and are employed to adjust dependencies among stations across various temporal scales and spatial ranges.

The temporal attention matrix AT is calculated as:

AT = softmax(Vt · σ((X^(r−1))ᵀ U1 U2 X^(r−1) + bt))  (9)

Here, Vt, U1, U2, and bt are learnable parameters, σ is the sigmoid activation, and X^(r−1) is the MSA output, representing monitoring station features at iteration r−1. Similarly, the spatial attention matrix Satt is computed as:

Satt = softmax(Vs · σ(XT W1 (W2 XT)ᵀ + bs))  (10)

Here, Vs, W1, and W2 are learnable parameters, bs represents the bias term, and XT represents the information integrated with temporal attention. By integrating AT and Satt, the model can comprehensively consider dynamic changes in both time and space.

Dynamic adaptive graph convolution network

The DAGCN propagates features across the station network. DAGCN combines an adaptive adjacency matrix, which captures persistent spatial relationships between stations, with dynamic attention weighting that adjusts information flow based on current conditions. The following section first describes the graph construction method, followed by an explanation of the attention-modulated spectral graph convolution operation.

Adaptive adjacency matrix.

We construct two complementary adjacency matrices. The geographical adjacency matrix Ageo derives from monitoring station distances, where Ageo[i,j] = 1 if the Euclidean distance between stations i and j is within 18 km, and 0 otherwise, reflecting direct spatial proximity and local pollutant transport between neighboring stations. To capture latent dependencies beyond physical proximity, we learn an adaptive adjacency matrix Ãadp through node embeddings E1, E2 ∈ RN×d:

Ãadp = softmax(ReLU(E1 E2ᵀ))  (11)

This learned matrix identifies hidden correlations such as stations sharing similar emission sources or meteorological patterns, optimized through end-to-end training. The node embeddings E1 and E2 are learned during training and remain fixed during inference, capturing persistent spatial correlations between stations. The hybrid graph combines both matrices:

Ahyb = λ Ageo + (1 − λ) Ãadp  (12)

Here, λ ∈ [0,1] is a learnable parameter balancing geographical priors with data-driven patterns. This approach proves particularly valuable for sparse monitoring networks.
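The construction of the geographic, adaptive, and hybrid matrices can be sketched as follows (illustrative NumPy code; in the model the embeddings and the balancing parameter are trained end-to-end, whereas here they are fixed inputs):

```python
import numpy as np

def row_softmax(z):
    """Row-wise numerically stable softmax."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def hybrid_adjacency(coords_km, E1, E2, lam, radius_km=18.0):
    """Blend a distance-thresholded geographic graph with a learned
    adaptive graph. coords_km holds planar station positions in km;
    E1, E2 are node embeddings; lam plays the role of the learnable
    balancing parameter (a fixed scalar in this sketch)."""
    d = np.linalg.norm(coords_km[:, None, :] - coords_km[None, :, :], axis=-1)
    A_geo = (d <= radius_km).astype(float)
    np.fill_diagonal(A_geo, 0.0)                     # drop self-loops
    A_adp = row_softmax(np.maximum(E1 @ E2.T, 0.0))  # softmax(ReLU(E1 E2^T))
    return lam * A_geo + (1.0 - lam) * A_adp
```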

Chebyshev polynomial graph convolution.

To avoid direct eigenvector decomposition of the Laplacian matrix and reduce computational complexity, we use Chebyshev polynomials to approximate the convolution kernel:

gθ ⋆ x = Σk=0..K−1 θk Tk(L̃) x  (13)

In this context, θk are learnable parameters, and Tk(L̃) represents the k-th order Chebyshev polynomial matrix. The Chebyshev polynomial matrices are computed through an efficient recursive formulation:

T0(L̃) = IN,  T1(L̃) = L̃,  Tk(L̃) = 2L̃ Tk−1(L̃) − Tk−2(L̃)  (14)

Here, L̃ = 2L/λmax − IN is the scaled Laplacian derived from the hybrid adjacency matrix, IN is the identity matrix, and λmax denotes the largest eigenvalue of the graph Laplacian L.
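The scaled Laplacian and the recursion of Eq (14) translate directly into code (minimal NumPy sketch):

```python
import numpy as np

def scaled_laplacian(A):
    """L~ = 2L / lambda_max - I_N for a symmetric adjacency matrix A,
    where L = D - A is the combinatorial graph Laplacian."""
    L = np.diag(A.sum(axis=1)) - A
    lam_max = np.linalg.eigvalsh(L).max()
    return 2.0 * L / lam_max - np.eye(A.shape[0])

def cheb_polynomials(L_tilde, K):
    """Chebyshev basis T_0 .. T_{K-1} via the recursion
    T_k = 2 L~ T_{k-1} - T_{k-2}, with T_0 = I_N and T_1 = L~."""
    T = [np.eye(L_tilde.shape[0]), L_tilde.copy()]
    for _ in range(2, K):
        T.append(2.0 * L_tilde @ T[-1] - T[-2])
    return T[:K]
```

Because only sparse matrix products appear in the recursion, the K-term expansion avoids the eigendecomposition that an exact spectral filter would require.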

PM2.5 propagation patterns vary with meteorological conditions and temporal dynamics. We modulate the Chebyshev polynomial matrices with spatial attention computed from current input features:

T̃k(L̃) = Tk(L̃) ⊙ Satt  (15)

Here, ⊙ denotes the element-wise (Hadamard) product. This modulation allows the model to dynamically reweight the information flow between stations at each time step, enhancing the importance of connections critical under current conditions (e.g., wind-driven transport).

Finally, convolution is employed to capture the evolving trends of PM2.5 concentrations over varying time periods. The complete formula is formulated as:

X^(r) = Σk=0..K−1 θk T̃k(L̃) X^(r−1)  (16)

Here, θk represents learnable weight parameters, X^(r−1) denotes the cross-city structure-enhanced features obtained from the station grouping module at iteration r−1, and T̃k(L̃) aggregates information from k-hop neighborhoods with attention-modulated weights.

Multi-component fusion

To comprehensively process features at various spatiotemporal scales, a multi-component fusion strategy is employed in this study. By weighting and combining near-hour, daily, and weekly features, multi-scale information integration is achieved. The fusion formula is given by:

Ŷ = Wh ⊙ Ŷh + Wd ⊙ Ŷd + Ww ⊙ Ŷw  (17)

Here, Wh, Wd, and Ww are the weights of each component, and Ŷh, Ŷd, and Ŷw are the predicted values for the short-term, daily cycle, and weekly cycle components, respectively.
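The fusion amounts to an element-wise weighted sum of the three component forecasts (illustrative sketch with fixed scalar weights; in MSDGNN the weights are learned):

```python
import numpy as np

def fuse_components(Yh, Yd, Yw, Wh, Wd, Ww):
    """Element-wise weighted fusion of the short-term, daily, and weekly
    component forecasts. Fixed weights are used here purely for
    illustration; the model learns them during training."""
    return Wh * Yh + Wd * Yd + Ww * Yw
```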

Model training and evaluation metrics

The model was trained with a fixed Chebyshev polynomial order and temporal convolution kernel size of 3. Each graph convolutional layer and temporal convolutional layer employed 64 filters. The GNN architecture consisted of two layers with a hidden dimension of 32. The Mean Squared Error (MSE) between predictions and ground truth was used as the loss function, optimized via backpropagation with a batch size of 32 and a learning rate of 0.001.

Hyperparameters were determined through grid search on the validation set, with search ranges and selected values presented in Table 1.

Table 1. Hyperparameter Search Ranges and Selected Values.

https://doi.org/10.1371/journal.pone.0338392.t001

The selected configuration balances model capacity and computational efficiency, achieving optimal validation performance while mitigating underfitting and overfitting risks.

The dataset was split chronologically into training, validation, and testing sets with a ratio of 60:20:20. This partitioning strategy preserves temporal dependencies, prevents data leakage, and simulates real-world forecasting scenarios. The 60% training set facilitates effective learning of multi-scale temporal patterns, while the 20% validation and test sets enable robust evaluation across seasonal variations. This approach aligns with established practices in spatiotemporal forecasting literature [26] and ensures comparability with state-of-the-art methods.
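The split can be expressed as index ranges over the chronologically ordered time steps (illustrative helper):

```python
def chronological_split(num_steps, ratios=(0.6, 0.2, 0.2)):
    """Chronological 60/20/20 split: earlier observations train the model,
    later ones validate and test it, so no future information leaks
    backwards into training."""
    n_train = int(num_steps * ratios[0])
    n_val = int(num_steps * ratios[1])
    return (range(0, n_train),
            range(n_train, n_train + n_val),
            range(n_train + n_val, num_steps))
```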

All implementations were based on PyTorch [32], with experiments conducted on a server equipped with an NVIDIA RTX 3090 GPU.

To evaluate the performance of various models from multiple angles, four metrics are selected: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Weighted Absolute Percentage Error (WAPE), and Correlation Coefficient (CORR). These measures provide comprehensive insights into the predictive accuracy and effectiveness of the models. The corresponding formulas are as follows:

MAE = (1/n) Σi |yi − ŷi|  (18)

RMSE = √[(1/n) Σi (yi − ŷi)²]  (19)

WAPE = Σi |yi − ŷi| / Σi |yi|  (20)

CORR = cov(y, ŷ) / (σ(y) · σ(ŷ))  (21)

Here, yi and ŷi denote the observed and predicted PM2.5 concentrations, n is the number of evaluated samples, cov(·,·) is the covariance, and σ(·) is the standard deviation.
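The four metrics translate directly into code (minimal NumPy sketch):

```python
import numpy as np

def mae(y, p):
    """Mean Absolute Error."""
    return float(np.mean(np.abs(y - p)))

def rmse(y, p):
    """Root Mean Squared Error."""
    return float(np.sqrt(np.mean((y - p) ** 2)))

def wape(y, p):
    """Weighted Absolute Percentage Error: total absolute error
    normalised by the total absolute observed value."""
    return float(np.sum(np.abs(y - p)) / np.sum(np.abs(y)))

def corr(y, p):
    """Pearson correlation between observations and predictions."""
    yc, pc = y - y.mean(), p - p.mean()
    return float(np.sum(yc * pc) / np.sqrt(np.sum(yc ** 2) * np.sum(pc ** 2)))
```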

Statistical significance testing

All experimental results are reported as the average over five independent runs with different random seeds (42, 123, 256, 512, 1024) to ensure reliability. The statistical significance of performance differences—both for MSDGNN versus baseline models and for the complete model versus its ablation variants—was evaluated with paired t-tests (***p < 0.001, **p < 0.01, *p < 0.05).
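The paired t statistic underlying these tests can be computed from matched per-seed scores as follows (illustrative sketch; the values in the usage example are hypothetical, and a full analysis would additionally look up the p-value from the t distribution with n − 1 degrees of freedom):

```python
import math

def paired_t_statistic(a, b):
    """t statistic of a paired t-test on matched per-seed scores:
    the same seeds and data splits underlie both samples, so the
    test is applied to their element-wise differences."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

A pairing over seeds is stricter than comparing two independent averages, because run-to-run variation shared by both models cancels in the differences.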

Results and analysis

Performance comparison with baselines

The proposed MSDGNN model was compared with seven established models—DGCRN [27], Crossformer [30], Autoformer [33], AGCRN [26], Informer [34], LSTM [35], and Historical Average (HA)—to evaluate multi-step prediction capabilities on the PM2.5 test dataset. Predictive analyses were conducted for 16-hour, 24-hour, and 32-hour horizons. For the 32-hour prediction, models were required to forecast concentrations for the subsequent 32 time steps, with evaluation metrics averaged across all points from hour 1 to hour 32. Table 2 presents the comparative performance, where the best result for each metric is highlighted in bold.

As shown in Table 2, MSDGNN achieved superior performance across all evaluation metrics and prediction horizons. Specifically, for the 32-hour prediction, MSDGNN reduced MAE and RMSE by 6.77% and 8.67%, respectively, compared to the second-best model, Crossformer. Corresponding improvements of 10.17% in WAPE and 8.45% in CORR further underscore MSDGNN’s strong adaptability to sparse-site data. This consistent advantage stems from the model’s enhanced capability to efficiently capture multi-scale spatiotemporal features, thereby improving the representation of intricate dependencies.

To evaluate performance stability across forecasting periods, Fig 3 illustrates the RMSE, MAE, and WAPE metrics for MSDGNN and baseline models. Throughout the prediction horizon, MSDGNN exhibited more stable and accurate prediction capabilities. In contrast, HA maintained consistently high errors, as expected from a simple baseline that ignores temporal and spatial dependencies. DGCRN showed relatively strong performance for horizons shorter than 12 hours, but its error increased sharply beyond this point, revealing limitations in modeling long-term dependencies. Similarly, LSTM and Autoformer exhibited marked performance degradation at longer horizons, indicating a failure to fully capture long-range spatiotemporal correlations. MSDGNN, by comparison, maintained low error rates across both short and long horizons, reflecting its robust ability to model complex relationships. Its superior error control, particularly when handling multi-scale features and sparse-site data, demonstrates greater robustness and adaptability, making it a more reliable solution for complex PM2.5 concentration forecasting.

Fig 3. Trends of Evaluation Metrics Over Multiple Time Steps.

Fig 3 illustrates the performance of MSDGNN compared to other baseline models across different prediction horizons using RMSE, MAE, and WAPE metrics. The three vertically arranged subplots depict the error rate trends over 30 time steps for each model. The results demonstrate that MSDGNN (green line) maintains consistently lower and more stable error rates throughout the entire prediction period.

https://doi.org/10.1371/journal.pone.0338392.g003

We further evaluated model performance at the site level using boxplots of the error distributions for each station, as shown in Fig 4. For MAE and RMSE, MSDGNN exhibited lower medians, first and third quartiles, and fewer outliers compared to other models, indicating smaller overall prediction errors and higher stability across the monitoring network. The narrower interquartile range further confirms the consistency of its predictions. For scale-invariant metrics, MSDGNN achieved a higher median CORR and a lower median WAPE, reflecting a stronger correlation between predictions and observations, as well as higher overall accuracy. In contrast, models like Crossformer and Informer showed higher error medians and greater dispersion, suggesting limitations in handling the spatiotemporal complexities of PM2.5 data. While AGCRN and DGCRN exhibited more concentrated error distributions, their accuracy remained lower than that of MSDGNN. LSTM and HA produced the widest boxplots and most dispersed error distributions, resulting in larger, more unstable prediction errors. These results can be attributed to MSDGNN’s effective capture of multi-scale spatiotemporal characteristics and its use of an adaptive graph convolution module, which enhances the model’s adaptability to data from sparsely distributed stations.

Fig 4. Comparative Prediction Performance Across Multiple Stations.

Fig 4 presents the error distributions of different models at the site level through boxplots for four evaluation metrics (MAE, RMSE, WAPE, and CORR). The results demonstrate that MSDGNN (yellow) exhibits lower median values and interquartile ranges with fewer outliers for MAE, RMSE, and WAPE metrics, while achieving higher values for the CORR metric.

https://doi.org/10.1371/journal.pone.0338392.g004

To visually summarize the comprehensive performance, a Taylor diagram was employed in Fig 5, incorporating standard deviation, correlation coefficient, and centered root-mean-square error (RMSE). MSDGNN showed the closest alignment with observed values, with a correlation coefficient near 0.7, a standard deviation most consistent with observations, and the smallest centered RMSE. These findings highlight MSDGNN’s superior ability to capture both the trends and variability of PM2.5 concentrations with minimal error.

Fig 5. Model Performance Evaluation Using Taylor Diagram.

Fig 5 compares the standard deviation, correlation coefficient, and centered root-mean-square error of predictions from different models against observations. The proximity of MSDGNN point (yellow) to the observed reference point (purple) indicates its superior performance.

https://doi.org/10.1371/journal.pone.0338392.g005

Ablation studies

Ablation studies were conducted to assess the contribution of each key module within the MSDGNN framework. The results are summarized in Table 3, where the best performance for each evaluation metric is highlighted in bold.

Table 3. Impact of Key Modules on Overall Model Performance.

https://doi.org/10.1371/journal.pone.0338392.t003

For comparison, the performance of each ablation model is also plotted in Fig 6. The results demonstrate that the complete MSDGNN model outperforms all ablation variants across all evaluation metrics and prediction horizons.

Fig 6. Comparison of Ablation Experiment Results.

Fig 6 illustrates the performance differences between the complete MSDGNN model and its ablation variants (without multi-head attention: w/o MHA; without adaptive module: w/o Aadp; without station grouping module: w/o SGM; without time component: w/o Time) across 16-hour, 24-hour, and 32-hour prediction horizons.

https://doi.org/10.1371/journal.pone.0338392.g006

Performance declined markedly upon the removal of any major component. Eliminating the Multi-Head Attention (MHA) module increased MAE by approximately 15.94%, 18.19%, and 18.11% for the 16-hour, 24-hour, and 32-hour predictions, respectively, confirming its critical role in modeling long-range spatial relationships under sparse-station conditions. The most severe degradation occurred when the adaptive module (Aadp) was removed, which raised RMSE for the 24-hour forecast by approximately 23.82%, demonstrating its importance in capturing dynamic spatiotemporal variations. Without the Station Grouping Module (SGM), CORR values decreased by about 1.62%, 2.00%, and 2.16% for the respective horizons, indicating its essential role in capturing spatial dependencies and enabling effective interactions among stations. While removing the time component only slightly affected short-term WAPE (a 0.29% reduction at 16 hours), it caused substantial performance drops in long-term predictions, with MAE increasing by 19.19% and CORR decreasing by 2.00% at the 32-hour horizon, underscoring its necessity for modeling long-term temporal dependencies. In conclusion, the complete model delivers more accurate predictions across all time horizons, validating the synergistic contribution of each integrated module.

Analysis of the station grouping mechanism

The station grouping learning module significantly enhanced MSDGNN’s prediction performance. This module operates by learning soft assignments derived from station feature representations, rather than relying solely on geographic coordinates. This enables the identification of latent spatial structures that conventional distance-based approaches tend to overlook. Although the optimal number of groups may vary across datasets, the data-driven strategy established here offers a principled and geography-agnostic framework for modeling spatial dependencies in sparse monitoring networks.
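As a rough illustration of the mechanism described above (not the paper's actual implementation; all names and shapes are hypothetical, and the learnable logits are replaced by random values for the sketch), a soft assignment matrix S can be obtained by a row-wise softmax over per-station logits and used both for soft pooling of station features and for the hard per-station labels used in visualization:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes taken from the paper's setting (22 stations, 3 groups);
# the feature dimension is arbitrary here.
n_stations, n_groups, n_feat = 22, 3, 8

# In the model these logits would be learnable parameters derived from
# station feature representations; random values stand in for them here.
logits = rng.normal(size=(n_stations, n_groups))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

S = softmax(logits)                        # soft assignments; each row sums to 1
X = rng.normal(size=(n_stations, n_feat))  # station feature representations

# Soft pooling to group-level features (the form used in training/inference)
X_group = S.T @ X                          # shape (n_groups, n_feat)

# Hard assignment, used only to color stations in a grouping map
hard = S.argmax(axis=1)                    # one group id per station
```

Because S is learned from features rather than coordinates, two geographically distant stations with similar pollution dynamics can receive high membership in the same group, which is the behavior the distance-based baselines cannot express.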

We analyzed two critical hyperparameters: the number of GNN layers and the number of station groups. As illustrated in Fig 7, performance was evaluated under different hyperparameter settings. Fig 7(a) indicates that a two-layer GNN structure achieved the best performance (MAE ≈ 9.48), balancing spatial feature extraction and model complexity. A single-layer model lacked sufficient expressive power, while a three-layer network introduced over-smoothing. Fig 7(b) shows that the model performed best when the number of station groups was set to three (MAE ≈ 9.5). This optimal grouping reflects the underlying functional regions within the monitoring network. Using fewer than three groups failed to distinguish key spatial patterns, whereas more groups led to diminishing returns—consistent with spatial complexity theory, where granularity must balance heterogeneity capture and overfitting risk.
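For context, the GNN layers evaluated in Fig 7(a) are built on Chebyshev graph convolutions (per the model description). The sketch below is a self-contained single layer with illustrative names and shapes, assuming a symmetric normalized Laplacian; stacking more such layers repeatedly smooths node features over the graph, which is the over-smoothing effect cited for the three-layer configuration:

```python
import numpy as np

def cheb_conv(X, A, weights):
    """One Chebyshev graph convolution of order K = len(weights).

    X: (N, F_in) node features; A: (N, N) symmetric adjacency;
    weights: list of (F_in, F_out) matrices, one per Chebyshev order.
    """
    N = A.shape[0]
    d = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d)
    mask = d > 0
    d_inv_sqrt[mask] = d[mask] ** -0.5
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    L = np.eye(N) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    lam_max = max(np.linalg.eigvalsh(L).max(), 1e-8)
    L_tilde = 2.0 * L / lam_max - np.eye(N)  # rescale spectrum to [-1, 1]

    Tx_prev, Tx = X, L_tilde @ X             # T_0 X and T_1 X
    out = Tx_prev @ weights[0]
    if len(weights) > 1:
        out = out + Tx @ weights[1]
    for k in range(2, len(weights)):
        # Chebyshev recurrence: T_k = 2 * L_tilde * T_{k-1} - T_{k-2}
        Tx_prev, Tx = Tx, 2.0 * L_tilde @ Tx - Tx_prev
        out = out + Tx @ weights[k]
    return out
```

Each additional order K widens the receptive field by one hop without adding a layer, so depth and polynomial order jointly control how far information propagates across the station graph.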

Fig 7. Evaluation Results for Different GNN Layers and Station Group Numbers.

Subplot (a) demonstrates the impact of GNN layer count on model performance, indicating that a two-layer structure (MAE ≈ 9.48) outperforms both single-layer and three-layer configurations. Subplot (b) depicts the influence of station group quantity, revealing that the model achieves the lowest MAE (approximately 9.5) when the number of groups is set to three.

https://doi.org/10.1371/journal.pone.0338392.g007

Fig 8 visualizes the grouping result for the Chang-Zhu-Tan region, where each station is assigned to its most probable group based on the learned matrix S, distinguished by red (Group No.1), blue (Group No.2), and green (Group No.3). Stations are distributed across the three cities, with some clustered near the central river and others in peripheral areas.

Fig 8. Results of Station Grouping.

Fig 8 presents the visualization of station groupings in the Chang-Zhu-Tan region when the number of groups is set to three. The 22 stations are plotted using their geographical coordinates (longitude and latitude) and colored by group: red (Group No.1), blue (Group No.2), and green (Group No.3). The resulting distribution reflects functional similarity rather than geographical proximity. For visualization clarity, each station is assigned to the group with the highest probability in the learned matrix S; during model training and inference, however, the full soft assignment probabilities are used in computations (as shown in Equations 3 and 6), allowing stations to partially belong to multiple groups with different membership weights.

https://doi.org/10.1371/journal.pone.0338392.g008

As summarized in Table 4, the three groups exhibit distinct statistical characteristics. Group No.2 shows the highest variance (136.2) and prediction errors, yet maintains the strongest correlation (0.67), reflecting high variability but well-captured dynamics. Group No.3 is the most stable (Var = 119.1) and has the lowest errors, indicating higher prediction accuracy. Group No.1 falls between the two. These differences confirm that MSDGNN adaptively learns inherent data characteristics, enabling performance-optimized spatial grouping without predefined partitions.

Case study: analysis of prediction results at individual stations

To gain deeper insights into the model’s performance, we selected Station 1339A, which exhibited the best MAE performance, and Station 1559A, which exhibited the worst. As shown in Figs 9 and 10, the model effectively captures periodicity and trends in the test set for both stations, demonstrating robust spatiotemporal modeling capabilities. This ability stems from the effective extraction of both local fluctuations and global trends in multi-scale spatiotemporal features.

Fig 9. Prediction Results for Station 1339A (Best Performance).

The left panel depicts long-term trends from May to November 2023, and the right panel shows a magnified view of short-term details from May 13 to June 7, 2023. The high degree of overlap between predictions (green solid lines) and actual observations (yellow dashed lines) demonstrates the model’s success in capturing periodic fluctuations and trend variations at both scales.

https://doi.org/10.1371/journal.pone.0338392.g009

Fig 10. Prediction Results for Station 1559A (Worst Performance).

Despite relatively lower overall prediction performance at this station, the model’s predictions (green solid lines) still effectively track the trend variations of actual observations (yellow dashed lines), particularly during periods of sharp concentration increases in November 2023. This demonstrates the model’s robust capability in modeling multi-scale spatiotemporal features at more challenging sites.

https://doi.org/10.1371/journal.pone.0338392.g010

Conclusion

In this study, we proposed the Multi-Scale Dynamic Graph Neural Network (MSDGNN), a novel framework for predicting PM2.5 concentrations. By leveraging a multi-scale spatiotemporal modeling architecture, MSDGNN effectively captures complex dependencies at both global and local scales. Comprehensive evaluation on a real-world air quality dataset from the Chang-Zhu-Tan region demonstrates that our model maintains high predictive accuracy even with a sparse network of monitoring stations. This highlights MSDGNN’s significant advantages in handling the challenges posed by sparse station distribution and complex spatiotemporal correlations.

The MSDGNN framework serves as a robust and generalizable tool for spatiotemporal pattern extraction and forecasting. Its modular design ensures that it is not solely limited to PM2.5 prediction but also provides a foundational architecture for modeling a wide range of other spatiotemporal phenomena.

Future research will focus on extending the application of MSDGNN to other urban areas and further optimizing its architecture to enhance computational efficiency. We plan to employ adaptive methods, such as Fast Fourier Transform (FFT) analysis, for the automatic identification of dominant periods within time series data. This will allow for dynamic temporal scale selection, moving beyond the fixed hourly, daily, and weekly scales used in the current study. Such advancements are expected to improve the model’s adaptability and performance across diverse geographical and temporal contexts.
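As a sketch of the planned FFT-based period identification (names and parameters are illustrative; this is not part of the current model), dominant periods can be read off the amplitude spectrum of a series, so that hourly data with a daily cycle should return a period near 24:

```python
import numpy as np

def dominant_periods(x, dt=1.0, top_k=3):
    """Return the top_k dominant periods (in units of dt) of a 1-D series,
    taken from the FFT amplitude spectrum with the DC component excluded."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                 # remove DC offset before transforming
    amp = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=dt)
    idx = np.argsort(amp[1:])[::-1][:top_k] + 1  # skip the DC bin
    return 1.0 / freqs[idx]

# Two weeks of hourly data with a clean daily (24-step) cycle
t = np.arange(24 * 14)
x = np.sin(2 * np.pi * t / 24.0)
```

Periods detected this way could seed the temporal scales directly, replacing the fixed hourly/daily/weekly choice with data-driven ones per region.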

Acknowledgments

We thank Jiahao Guo for his valuable assistance in data collection and management, and Zexin Guo and Shuang Wang for their technical support.
