Abstract
Meteorological sensors deployed on ocean buoys frequently suffer from data loss or outliers due to electromagnetic interference and component failures caused by harsh weather and environmental conditions. Accurate reconstruction of corrupted buoy data remains a significant challenge, as conventional interpolation and imputation methods often fail to capture the inherent spatio-temporal dependencies in marine meteorological variables. To address this issue, this paper proposes a novel deep learning model that integrates Transformer and Graph Attention Network (GAT) architectures, termed the Spatio-Temporal Dual-Attention Network (ST-DAN). The model processes two aspects of the data in parallel: on one hand, it captures temporal dependencies through a Transformer enhanced by positional encoding; on the other, it models inter-variable spatial correlations with a GAT built on a physically informed adjacency matrix, which dynamically adjusts the influence weights between variables to significantly enhance reconstruction accuracy. To evaluate the ST-DAN model, extensive experiments were conducted using the ERA5 reanalysis dataset and in-situ observations from a Qingdao buoy, focusing on the reconstruction of temperature and wind speed data. The experimental results show that ST-DAN outperformed baseline models (e.g., ARIMA, RNN, Bi-LSTM, and Transformer) across metrics including MAE, MSE, RMSE, and R², indicating that the proposed model is highly robust and achieves high-precision interpolation and anomaly correction for meteorological data.
Citation: Song M, Huang J, Chen S, Fu X, Liu S, Li W, et al. (2026) An intelligent method for Buoy meteorological data restoration using a Spatio-Temporal Dual-Attention Network with transformer and GAT. PLoS One 21(2): e0343310. https://doi.org/10.1371/journal.pone.0343310
Editor: Babak Mohammadi, Swedish Meteorological and Hydrological Institute, SWEDEN
Received: November 12, 2025; Accepted: February 4, 2026; Published: February 18, 2026
Copyright: © 2026 Song et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The complete code of this study has been publicly released on GitHub, including all datasets used in the experiment. The repository link is https://github.com/Khalil-gua/ST-DAN.git. This repository contains all relevant scripts and documents for viewing and copying.
Funding: This work is supported by Laoshan Laboratory Independent Innovation Science and Technology Project (Grant No. LSKJ202502405), the Taishan Industrial Program “Marine Observation and Detection Buoy Equipment R&D and Industrialization Team”, Major Scientific Research Project for the Construction of State Key Laboratory at Qilu University of Technology (Shandong Academy of Sciences) (Grant No. 2025ZDGZ01) and Qilu University of Technology (Shandong Academy of Sciences) Major innovation project of science, education and production integration pilot project (2025ZDZX05). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Meteorological research relies on long-term observational data to analyze regional climate characteristics. As crucial platforms for acquiring meteorological and hydrological data, ocean buoys offer the advantages of long-term, continuous, and stable monitoring. They can consistently observe parameters such as sea surface temperature, wind speed, wind direction, and atmospheric pressure. Real-time buoy data are widely used to test and refine Numerical Weather Prediction (NWP) models [1,2], which is of great significance for studying air–sea interactions [3]. Multiple meteorological agencies, including the China Meteorological Administration and the European Centre for Medium-Range Weather Forecasts, utilize buoy data to enhance forecast accuracy. For instance, the TRITON buoy network monitors variations in the Pacific and Indian Oceans to study the El Niño–Southern Oscillation (ENSO), monsoons, and interdecadal climate variability [4]. Since 2024, the Hong Kong Observatory has been experimenting with integrating buoy data into coupled forecasting models (UWIN-CM) to achieve real-time forecasting of tropical cyclones, thereby strengthening capabilities for predicting and preventing extreme weather events [5]. Furthermore, buoy data constitute a key component of climate change research: by providing reliable sea surface temperature observations [6], they help improve the credibility of climate projections [7,8]. Consequently, the multi-parameter data provided by buoys hold substantial value for enhancing weather prediction accuracy, understanding climate change, and monitoring the marine environment, and the quality of these data directly impacts the reliability of scientific research results [9].
The marine meteorological environment is highly dynamic and variable, and buoy-based meteorological sensors are frequently susceptible to data gaps and anomalies caused by a combination of external and internal factors. External challenges include long-term exposure to sunlight leading to component aging and sensor damage caused by strong winds or extreme cold, while internal issues encompass accuracy degradation from a lack of regular calibration and power insufficiency due to battery decay. Missing values and outliers can distort data characteristics and alter the statistical properties of datasets, such as expected values and higher-order moments, thereby increasing the difficulty of data analysis. It is therefore essential to accurately identify and reconstruct anomalous data in order to provide reliable observations for marine scientific research.
Conventional approaches to handling anomalous meteorological data often rely on statistical imputation methods [10]. These range from simple univariate techniques, such as mean, median, or mode filling, to more sophisticated models like K-Nearest Neighbors (KNN) and ARIMA. While straightforward to implement, simple statistical methods often fail to capture the variability and correlations in meteorological data, and they generally adapt poorly to the characteristics of time series. Yozgatligil et al. [11] compared various time series imputation methods, noting that although simpler algorithms (e.g., SAA, NR, NRWC) and computationally intensive approaches (e.g., EM-MCMC) are effective to some extent, they often come with high computational costs. The ARIMA model is better suited to non-stationary series with trend or seasonal components and has been applied to tasks such as missing data estimation in water quality monitoring and multi-scale change point detection [12]. However, its reliance on specific distributional assumptions limits its performance on data that lack significant trends.
With advancements in artificial intelligence, deep learning has demonstrated considerable effectiveness in data imputation tasks [13]. Classic sequential models such as RNN, GRU, and LSTM have shown promising performance across various applications [14–18]. Nonetheless, they are often plagued by issues including vanishing or exploding gradients, difficulty in capturing long-range dependencies, high computational complexity, and a tendency to overfit [19–21]. Studies indicate that LSTM, while mitigating the problem to some extent, may still experience gradient vanishing [22], and its gating mechanisms can induce gradient anomalies when saturated [23]. In contrast, the Transformer architecture, by leveraging a global self-attention mechanism, demonstrates an enhanced capability to capture long-term dependencies and complex temporal patterns, thereby offering a more effective solution for imputation in long-sequence data [24,25].
With the widespread application of Transformers in temporal prediction, various optimized variants such as Informer [26] and Autoformer [27] have been proposed to improve the computational efficiency and expressive power of the original model, and these have proven effective in multiple experiments through sparse attention or sequence decomposition mechanisms [27–32]. However, meteorological data exhibit both temporal and spatial dependencies; modeling only temporal features can easily ignore spatial correlations, and vice versa [13].
Unlike structured data such as images, spatial dependencies in meteorological data often exist as irregular graph structures (non-Euclidean data), which necessitate processing with Graph Neural Networks (GNNs). Representative GNN architectures include Graph Convolutional Networks (GCN) [33,34], Graph Attention Networks (GAT) [35], and Graph Autoencoders (GAE) [36]. These models effectively capture complex spatial dependencies through graph convolution operations, attention-weighted aggregation, and autoencoding mechanisms, respectively, making them suitable for tasks such as node classification and link prediction.
To enhance the physical consistency and accuracy of meteorological data repair, this paper constructs an advanced spatio-temporal joint modeling framework, termed the Spatio-Temporal Dual-Attention Network (ST-DAN). The ST-DAN model integrates a Transformer encoder, a physically constrained adjacency matrix, and a Graph Attention Network (GAT) to form a dual-link reasoning architecture. This design enables the separate extraction and subsequent fusion of temporal and spatial features for predictive repair. In the temporal encoding link, a Transformer encoder captures long-range dependencies in the time series; in the spatial encoding link, a GAT extracts inter-variable spatial dependencies. Crucially, the native attention mechanism in the GAT is replaced by a physically constrained adjacency matrix, which effectively suppresses connections between irrelevant features while strengthening correlations between related ones. This matrix dynamically adjusts influence weights during training, thereby enhancing prediction accuracy while reducing computational overhead. In general, the ST-DAN model successfully leverages both temporal and spatial constraints to achieve effective restoration of meteorological data.
The remainder of this paper is organized as follows. The first section presents the research background and the motivation behind the proposed model; the second section describes the algorithm, structure, working principle, and function of each part of the model; the third section covers the processing of the experimental datasets and the experimental environment; the fourth section consists of the experiments and the analysis of their results; the fifth section contains the conclusion and future prospects.
2. Methodologies
2.1. Modeling of buoy meteorological data restoration
In the modeling process of buoy meteorological data restoration, the input data are first divided into meteorological features $X_S$ and temporal features $X_T$ (the latter including the target feature). These features are then fed into the spatial encoding link and the temporal encoding link, respectively. In the spatial encoding link, we introduce a unique physical constraint matrix to replace the native attention mechanism of the GAT, which avoids connections between irrelevant features and outputs the spatial feature $H_S$ after training and learning; in the temporal encoding link, the data are first positionally encoded and then passed through a multi-layer Transformer encoder, which outputs the temporal feature $H_T = \mathrm{TransformerEncoder}(X_T + PE)$, where $PE$ is the positional encoding. The features output by the two links are fused through the feature fusion layer to obtain the fused feature $H_F$.
In the prediction output stage, the sliding window technique is used to convert the original sequence into input–output sample pairs, effectively modeling temporal dependencies and feature interactions. Given a temporal series $X \in \mathbb{R}^{T \times d}$, where $T$ is the length of the series and $d$ is the feature dimension, training samples $(X_i, y_i)$ are generated via the sliding window. Here the input sequence $X_i \in \mathbb{R}^{L \times d}$ contains the features of $L$ time steps, and the target value $y_i$ corresponds to the target variable at the next time step. This approach enables the model to learn the mapping from the past $L$ time steps to the next $P$ time steps. For example, in the dataset used in this experiment, where the temporal resolution is 1 hour, we set $L = 24$ and $P = 1$. That is, when a data point is identified as an anomaly or a missing value, the 24 hours of data preceding that point are used to predict and replace its value, thereby achieving data repair. Additionally, to handle consecutive missing values and avoid error accumulation, a bidirectional prediction strategy is adopted at the prediction stage: the input sequence is predicted first in the forward temporal direction and then in the reverse temporal direction. The final predicted output is obtained as follows:

$$\hat{y} = w \cdot f\left(H_F^{fwd}\right) + (1 - w) \cdot f\left(H_F^{bwd}\right) \quad (1)$$
where $\hat{y}$ represents the final predicted value, $H_F^{fwd}$ denotes the forward fusion feature, and $H_F^{bwd}$ denotes the backward fusion feature. To ensure a reasonable prediction output, a dynamic weight $w$ is applied based on the relative position of the predicted point within the gap of missing data: a higher weight is assigned to positions closer to the corresponding prediction direction, with the weight constrained to the range [0.1, 0.9]. The weighting function maps the relative position of the predicted point within the gap linearly onto this range.
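To make the windowing and bidirectional fusion concrete, the following minimal Python sketch illustrates the procedure. The helper names and the linear form of the position-to-weight map are assumptions for illustration (the text only fixes the [0.1, 0.9] range and the monotonic behavior), not the exact implementation from the released repository.

```python
import numpy as np

def make_windows(series, L=24, P=1):
    """Build (X, y) pairs by sliding a window of length L over the series;
    `series` has shape (T, d) and the target feature is assumed to be column 0."""
    X, y = [], []
    for t in range(series.shape[0] - L - P + 1):
        X.append(series[t:t + L])        # past L steps, all d features
        y.append(series[t + L, 0])       # target variable at the next step
    return np.stack(X), np.array(y)

def fuse_bidirectional(y_fwd, y_bwd, pos, gap_len):
    """Weight forward/backward predictions inside a gap of consecutive missing
    values; the linear position-to-weight map is an assumed form that respects
    the stated [0.1, 0.9] constraint."""
    p = pos / max(gap_len - 1, 1)                    # relative position in [0, 1]
    w = np.clip(0.1 + 0.8 * (1.0 - p), 0.1, 0.9)     # nearer the forward end -> larger w
    return w * y_fwd + (1.0 - w) * y_bwd
```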
2.2. Constructing spatio-temporal dual attention network model
The dual-path architecture, owing to its capability to collaboratively process features of different natures within data, has demonstrated significant technical advantages across numerous fields. Addressing the characteristics of meteorological data, which inherently encompass both their own temporal patterns and spatial correlations among elements, this study adopts a dual-path structure and proposes a Spatio-temporal Dual Attention Network (ST-DAN) that integrates Transformer and Graph Attention Networks (GAT), aiming to achieve high-precision time-series data imputation. The temporal path of this model employs Transformer encoder layers, specifically designed to capture the dynamic temporal features of individual meteorological elements [37]. The spatial path utilizes GAT to model the complex interrelationships among different meteorological elements [38]. Furthermore, this study introduces a key optimization to the attention mechanism by incorporating a novel physical adjacency matrix to replace the traditional attention weight generation mechanism. This enables the dynamic and appropriate adjustment of influence weights among different elements, allowing the model to capture dependency relationships that more closely align with real physical processes, thereby yielding more accurate prediction results.
The overall architecture of ST-DAN is shown in Fig 1; it is composed of a temporal encoding link, a spatial encoding link, and a feature fusion layer. The temporal and spatial encoding links process the original input data in parallel, and the feature fusion layer combines their outputs to produce the final prediction. Prior to data input, the Z-score method is used to standardize the data, transforming the input features into a distribution with a mean of 0 and a standard deviation of 1, so as to improve the stability of model training. The input data grid contains m-dimensional temporal elements, such as year, month, day, hour, and minute, as well as n-dimensional meteorological elements, such as temperature, pressure, wind speed, and wave period. The data are then split into the m-dimensional temporal features plus the 1-dimensional target meteorological element, which enter the temporal encoding link as $X_T$; all n-dimensional meteorological elements enter the spatial encoding link as $X_S$. The temporal encoder first encodes the observation time using a positional encoding function and then employs a multi-layer Transformer encoder, whose attention mechanism captures the temporal dependence in the data and outputs the temporal features; the spatial encoding link uses a graph attention network to learn the relationships among multiple variables and outputs the spatial features. The temporal and spatial features are then fed into the feature fusion module. A decoupling-fusion strategy is applied, in which a gating mechanism, generated via a linear transformation and a sigmoid function, dynamically adjusts the weights of the temporal and spatial features. The module finally produces a fused representation, based on which the final prediction result is generated. The detailed algorithms and working principles of each module in the ST-DAN model are described as follows.
2.2.1. Temporal encoding link.
In the temporal encoding link, a temporal encoder is constructed to fuse time information and target features. The temporal encoder adopts a multi-layer Transformer encoder structure, as illustrated in Fig 2. A four-layer Transformer encoder is adopted in this architecture. The self-attention mechanism contained in the Transformer encoder enables the model to capture long-range temporal dependencies in the data, thereby effectively fitting the evolving trends over time and obtaining representative temporal features.
After the input sequence enters the temporal encoding link, it is first projected from its original $L \times (m+1)$ dimensions to $L \times d_{model}$, where $d_{model}$ is the Transformer encoder model dimension, set to 256 here. The positional encoder is then used to encode the temporal information, which is combined with the target feature. Because the Transformer model only attends to content similarity between elements and lacks the ability to perceive their order, positional encoding is essential. The positional encoder uses sine/cosine functions to generate a fixed positional encoding vector [39], as shown in formulas (2) and (3):

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \quad (2)$$

$$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \quad (3)$$

where $pos$ is the time-step position, $i$ is the dimension index, $PE_{(pos,2i)}$ is the positional encoding of the even dimension indices, and $PE_{(pos,2i+1)}$ is that of the odd dimension indices. The positional encoder generates a position matrix $PE \in \mathbb{R}^{L \times d_{model}}$ based on the input data $X_T$.
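As a reference, formulas (2) and (3) can be realized in a few lines of PyTorch (the framework used in Section 3.3); this is a minimal sketch of the standard sinusoidal encoding rather than the authors' exact module.

```python
import torch

def sinusoidal_pe(L: int, d_model: int) -> torch.Tensor:
    """Fixed sine/cosine positional encoding from formulas (2) and (3):
    even dimensions use sin, odd dimensions use cos."""
    pos = torch.arange(L, dtype=torch.float32).unsqueeze(1)   # (L, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)      # even dimension indices
    div = torch.pow(10000.0, i / d_model)                     # 10000^(2i/d_model)
    pe = torch.zeros(L, d_model)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe                                                 # (L, d_model)
```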
The input sequence, after fusion with the positional encoding, is fed into the attention mechanism layer, whose calculation logic is shown in Fig 3. First, the input is multiplied by the attention weight matrices $W_Q$, $W_K$, and $W_V$ to obtain $Q$, $K$, and $V$, which correspond to the query, key, and value matrices, respectively. The attention score is then calculated as shown in formula (4):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \quad (4)$$

where $\mathrm{Attention}(Q,K,V)$ represents the attention score and $d_k$ denotes the dimension of the key matrix $K$.
The multi-head attention mechanism performs multiple attention operations on the same input. Each attention module, or “head,” has the same structure but uses independently learned weight matrices, allowing the model to jointly attend to information from different representation subspaces at different positions. This enables the model to capture more extensive and diverse contextual features and enhances its expressive power [39].
After the attention score is computed, it is combined with the original input sequence through a residual connection followed by layer normalization:

$$x = \mathrm{LayerNorm}\big(X + \mathrm{Attention}(Q, K, V)\big) \quad (5)$$

where $x$ represents the layer-normalized result and $\mathrm{LayerNorm}$ is the layer normalization function.
Subsequently, the representation $x$ enters the Feed-Forward Network (FFN). The FFN acts independently on the representation vector at each position of the sequence output by the self-attention layer. Its main function is to perform a nonlinear transformation and deep feature abstraction on each position vector, which already contains global context information. Its structure follows an expansion–nonlinearity–contraction design. First, the input vector $x$ is linearly projected to a higher dimension through a weight matrix $W_1$ and a bias $b_1$; the expanded dimension $d_{ff}$ is set to 256 here. A nonlinear activation function, specifically ReLU, is then applied to the projected result. Finally, the activated high-dimensional vector is linearly projected back to the original $d_{model}$ dimensions through a weight matrix $W_2$ and a bias $b_2$, according to the formula:

$$\mathrm{FFN}(x) = W_2 \cdot \mathrm{ReLU}(W_1 x + b_1) + b_2 \quad (6)$$

Here, $\mathrm{FFN}(x)$ represents the feature abstraction obtained after the nonlinear transformation of the feedforward fully connected layer. The parameters ($W_1$, $W_2$, $b_1$, $b_2$) of the FFN are shared across all sequence positions within the same layer, but the computation at each position is independent.
FFN complements the self-attention mechanism, which focuses on learning inter-position dependencies, and together they form the powerful feature learning capability of the Transformer encoder layer. By introducing nonlinear transformations and high-dimensional mappings, the FFN layer significantly enhances the model’s representational capacity and serves as a key component for the model to learn complex features while contributing a substantial number of parameters.
Subsequently, $\mathrm{FFN}(x)$ undergoes another residual connection and layer normalization with $x$, ultimately yielding the temporal feature matrix $H_T \in \mathbb{R}^{B \times L \times d_h}$ of the target element, where $B$ represents the batch size and $d_h$ denotes the size of the hidden layer:

$$H_T = \mathrm{LayerNorm}\big(x + \mathrm{FFN}(x)\big) \quad (7)$$
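Putting the pieces of this subsection together, a minimal PyTorch sketch of the temporal encoding link might look as follows. The number of attention heads is an assumption (the text does not specify it), while $d_{model} = 256$, $d_{ff} = 256$, and the 4 encoder layers follow the values given above; the `sinusoidal_pe` helper is the one sketched earlier.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Sketch of the temporal link: project (m+1)-dim inputs to d_model,
    add the fixed positional encoding, then apply a 4-layer Transformer encoder."""
    def __init__(self, in_dim, d_model=256, n_heads=8, n_layers=4, d_ff=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=d_ff,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):                      # x: (B, L, m+1)
        h = self.proj(x)                       # (B, L, d_model)
        h = h + sinusoidal_pe(x.size(1), h.size(-1)).to(x.device)
        return self.encoder(h)                 # temporal features H_T: (B, L, d_model)
```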
2.2.2. Spatial encoding link.
In the application of graph neural networks, modeling based on graph structures can effectively capture the complex relationships between variables. The introduction of attention mechanisms further enhances the model’s ability to capture dynamic spatial dependencies [40]. Additionally, architectures that integrate graph convolution with temporal convolution have been shown to significantly improve the modeling performance of spatiotemporal sequence data, achieving superior results in prediction tasks involving multi-source heterogeneous data [41].
Inspired by this, the present study treats the sequences of various meteorological elements as nodes in a graph structure and designs a Physics-aware Spatial Encoder (PhysicsAwareGAT), whose core is an improved Graph Attention Network (Fig 4). This encoder uses the correlation coefficient graph between meteorological elements (Fig 5) as its underlying structure. Through the graph attention mechanism, it adaptively learns the strength of associations between elements, thereby achieving effective modeling of spatial correlations.
After the input sequence enters the spatial encoding link, an adaptive physical constraint matrix is generated. The attention mechanism in traditional graph attention networks relies on data-driven, implicitly learned relationships, automatically capturing the relationships between features and dynamically adjusting weights. This is convenient and effective for data where the correlations between features cannot be known directly. However, during learning, spurious correlations may be captured that lack physical interpretability. For meteorological data, the relationships between features are known in advance; for example, temperature is related to wind speed and pressure, but not to precipitation. The adaptive physical mask matrix mechanism proposed in this paper replaces the traditional attention mechanism: through a predefined physical constraint matrix, it forcibly prohibits connections that violate domain knowledge while ensuring that genuinely correlated elements remain connected. This enhances the interpretability of the model, avoids unnecessary computation, and improves model accuracy and training efficiency.
First, a physical prior matrix $A_{ij}$ (an $n \times n$ Boolean matrix, as shown in Fig 6) is preset. Its rows and columns correspond to the set of meteorological features, where $n$ is determined by the number of meteorological features in the collected data. In this matrix, the row index represents the target feature (i.e., the affected feature), while the column index represents the source feature (i.e., the influencing feature). For each row corresponding to a target feature, the entries in the columns associated with the source features are set to either 0 or 1: 0 indicates that the source feature is irrelevant to the target feature, and 1 indicates that it is relevant. The values of this matrix must be manually defined based on expert knowledge.
Taking this experiment as an example, the five meteorological features correspond to temperature, wind speed, pressure, wave period, and humidity, respectively. If the goal is to predict temperature, the entries of the first row are set to 1 for the features physically related to temperature and to 0 for the rest; if the goal is to predict wind speed, the second row is set analogously. This approach effectively eliminates correlations between irrelevant features.
However, the Boolean matrix Aij can only express the irrelevant or relevant relationships between meteorological features (i.e., values of 0 or 1), but it cannot quantify the influence weights between the elements. To address this, we introduce an adaptive physical constraint matrix module. This module remaps Aij into a physical learning matrix Mij, whose initial values are consistent with Aij, but the type is converted from a Boolean matrix to a floating-point matrix, serving as a learnable parameter matrix. To broaden the value range, increase the optimization gradient, and make weight changes smoother, the weight range of Mij is constrained to [0, 2]. This setting uses 1 as the center point: values greater than 1 enhance correlation, while values less than 1 weaken correlation. This interval amplification helps avoid the issue of universally small weights and enhances distinguishability.
In each training round, Mij functions similarly to a convolution kernel, sliding along the temporal dimension with a specified stride over the data, and performs feature extraction via matrix multiplication. After the training round, the parameters of Mij are updated using gradient descent. During this process, Aij remains fixed. Through the Hadamard product of Mij and Aij, the weights corresponding to originally irrelevant elements are guaranteed to remain zero, thereby forcibly masking irrelevant connections.
The updated weight matrix is normalized to the [0, 1] interval using the Sigmoid function and then multiplied by 2 to restore it to the [0, 2] range, generating a new matrix for the next training round. This mechanism effectively prevents weights from drifting out of bounds through cumulative effects during iteration. Finally, the adaptive physical constraint matrix $\tilde{M}_{ij}$ is output for the subsequent prediction task:

$$\tilde{M}_{ij} = 2\,\sigma(M_{ij}) \odot A_{ij} \quad (8)$$

where $\sigma$ denotes the Sigmoid function and $\odot$ denotes the Hadamard product. This process achieves an effective integration of physical priors and data-driven learning, retaining the physical constraints on feature relationships while allowing the model to adaptively adjust the influence weights. In the actual computation, the weight range is constrained to [0, 2], where 0 indicates irrelevance and 2 indicates strong correlation. After the adaptive physical constraint matrix $\tilde{M}_{ij}$ is obtained, it is used as a mask to perform graph convolution operations with the input sequence $X_S$:

$$h_i = \sum_{j \in N(i)} \tilde{M}_{ij}\, x_j \quad (9)$$

where $h_i$ represents the meteorological feature obtained after the calculation, $x_j$ represents the original value of node $j$, and $N(i)$ represents the set of columns whose entries are 1 in the $i$-th row of $A_{ij}$. At this stage, the meteorological features are not yet the final spatial features and require further transformation through the feature processing layer.
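A minimal PyTorch sketch of the masked aggregation in formulas (8) and (9) is given below; the class name `PhysicsConstrainedConv` is hypothetical, and initializing the learnable matrix from $A_{ij}$ follows the description above, though the exact parameterization in the released code may differ.

```python
import torch
import torch.nn as nn

class PhysicsConstrainedConv(nn.Module):
    """Sketch of the adaptive physical constraint matrix: a learnable matrix M
    is squashed to [0, 2] with 2*sigmoid(M) and hard-masked by the Boolean
    prior A via a Hadamard product, so forbidden feature pairs stay at zero."""
    def __init__(self, A: torch.Tensor):          # A: (n, n) 0/1 prior, rows = targets
        super().__init__()
        self.register_buffer("A", A.float())      # fixed physical prior
        self.M = nn.Parameter(A.float().clone())  # learnable weights, initialized from A

    def forward(self, x):                         # x: (B, L, n) meteorological features
        M_eff = 2.0 * torch.sigmoid(self.M) * self.A   # formula (8), range [0, 2]
        return torch.einsum("bln,in->bli", x, M_eff)   # formula (9): h_i = sum_j M_ij * x_j
```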
In the feature processing layer, the obtained meteorological features undergo a nonlinear transformation via the ReLU activation function and are projected into a high-dimensional space, ultimately yielding the spatial features $H_S$ of the target element as output:

$$H_S = W_2^{S} \cdot \mathrm{ReLU}\big(W_1^{S} h + b_1^{S}\big) + b_2^{S} \quad (10)$$

where $W_1^{S}$ and $W_2^{S}$ are the weight matrices of the spatial encoding link, and $b_1^{S}$ and $b_2^{S}$ are its bias vectors.
2.2.3. Features fusion.
The application of gating mechanisms in deep learning models for feature fusion has become a key technology for enhancing the efficiency of multi-modal and multi-level information integration. This mechanism dynamically regulates the contribution of features from different sources or levels, enabling the selective retention of critical information and the suppression of noise, thereby significantly improving the model’s representational capacity and generalization performance. It has been widely applied and proven effective in time series forecasting [42–44].
In this experiment, after the spatial and temporal pathways output spatial and temporal features respectively, a gating mechanism is similarly adopted for feature integration. The feature fusion module in this study achieves the organic integration of multi-modal information through a decoupling-fusion strategy. Its design adheres to the following principles: Physical Consistency, ensuring spatial features carry explicit meteorological physical relationships; Temporal Dynamics, capturing periodic and trend patterns within temporal features; and Adaptive Interaction, dynamically adjusting the contribution of spatio-temporal features via the gating mechanism. The structure diagram is shown below (Fig 7).
Here, $W_g$ and $b_g$ denote the learnable weight matrix and bias of the linear fusion layer. Spatial Features denotes the feature matrix $H_S$ output by the improved graph attention network in the spatial link; Temporal Features denotes the feature matrix $H_T$ generated by the Transformer encoder in the temporal link after concatenating the target meteorological feature with the temporal features. Weight_T and Weight_S represent the weights assigned to the temporal and spatial features, respectively. After the two feature matrices are concatenated along the hidden dimension, adaptive weight allocation is achieved through a gated fusion network: first, a gate vector is generated via a linear transformation followed by Sigmoid activation,

$$g = \sigma\big(W_g[H_T; H_S] + b_g\big) \quad (11)$$

and the fused feature is then obtained as

$$H_F = g \odot H_T + (1 - g) \odot H_S \quad (12)$$

This design ensures the physical consistency of the spatial features through the physical mask matrix, while utilizing the dynamic weights to adjust the contributions of the spatio-temporal features.
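A compact sketch of the gated fusion of formulas (11) and (12), under the assumption that a single linear layer over the concatenated features generates the gate:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Sketch of the decoupling-fusion step: a sigmoid gate generated from the
    concatenated features assigns complementary weights to H_T and H_S."""
    def __init__(self, d_hidden):
        super().__init__()
        self.gate = nn.Linear(2 * d_hidden, d_hidden)

    def forward(self, h_t, h_s):               # both: (B, L, d_hidden)
        g = torch.sigmoid(self.gate(torch.cat([h_t, h_s], dim=-1)))  # formula (11)
        return g * h_t + (1.0 - g) * h_s       # formula (12): fused feature H_F
```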
After the model training is completed, during the application phase, the data is fed into the trained model, and the last time step of the output is taken as the predicted value to replace the outliers, thereby accomplishing data repair.
2.2.4. Loss and evaluation metric.
The loss function adopts the Mean Absolute Error (MAE) as the training objective, calculated as shown in formula (13):

$$\mathrm{MAE} = \frac{1}{M}\sum_{i=1}^{M}\left|y_i - \hat{y}_i\right| \quad (13)$$

where $M$ represents the number of samples, $y_i$ denotes the true value, and $\hat{y}_i$ represents the predicted value. MAE is robust against physical outliers: when faced with extreme values in meteorological data, MAE is less affected, meeting the fault-tolerance requirements of meteorological prediction with respect to outliers. MAE also has the advantage of interpretability, as its physical units are consistent with the prediction target, facilitating direct evaluation of model performance. During training, a gradient clipping mechanism is introduced to prevent gradient explosion, as shown in formula (14):

$$g \leftarrow \begin{cases} g, & \lVert g \rVert_2 \le \theta \\ \theta \cdot \dfrac{g}{\lVert g \rVert_2}, & \lVert g \rVert_2 > \theta \end{cases} \quad (14)$$

where $\theta$ is the gradient threshold, preset to 1, $g$ is the current gradient, and $\lVert g \rVert_2$ is the L2 norm of $g$.
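In PyTorch, the MAE objective and the norm-based clipping of formula (14) correspond directly to `nn.L1Loss` and `torch.nn.utils.clip_grad_norm_`. The sketch below shows one training step; `model`, `optimizer`, `x`, and `y` are assumed to be defined elsewhere.

```python
import torch
import torch.nn as nn

criterion = nn.L1Loss()  # MAE, formula (13)

def train_step(model, optimizer, x, y, clip_theta=1.0):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # formula (14): rescale g to theta * g / ||g||_2 whenever ||g||_2 > theta
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_theta)
    optimizer.step()
    return loss.item()
```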
In addition to using MAE as the loss function, we also employ MSE, RMSE, R², and AR² as evaluation metrics, calculated as shown in formulas (15)–(17):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}} \quad (15)$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \quad (16)$$

$$AR^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1} \quad (17)$$

MSE stands for Mean Squared Error, RMSE for Root Mean Squared Error, R² for the coefficient of determination, and AR² for the adjusted coefficient of determination. All five metrics reflect the discrepancy between the predicted and true values: MAE, MSE, and RMSE are better the closer they are to 0, while R² and AR² are better the closer they are to 1. Here, $n$ represents the sample size and $k$ denotes the number of explanatory variables.
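For reference, the five metrics can be computed with plain NumPy as follows (a sketch; `k` is the number of explanatory variables, which equals 1 in our experiments, making AR² coincide with R²):

```python
import numpy as np

def evaluate(y_true, y_pred, k=1):
    """MAE, MSE, RMSE, R^2 and adjusted R^2 per formulas (13) and (15)-(17)."""
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    n = len(y_true)
    ar2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return mae, mse, rmse, r2, ar2
```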
3. Experimental data processing and hyperparameter setting
3.1. Experimental data
In the experiment, the ERA5 dataset was first used for verification. The selected area was Qingdao, China (latitude 36.00–36.25°N, longitude 120.25–120.50°E), spanning three years from January 1, 2022, to December 31, 2024, with a temporal resolution of one hour, for a total of 26,304 records. The observed elements included temperature, pressure, wave period, wind speed, and rainfall; training and prediction were conducted with temperature as the target feature. Buoy data measured in the Xiaomai Island sea area of Qingdao were then used for testing: a total of 42,646 records with a temporal resolution of ten minutes, from June 1, 2024, to June 1, 2025. The measured data include five elements: temperature, wind speed, pressure, wave period, and humidity, with wind speed as the prediction target. During the experiment, the data were divided into a training set and a test set (ground-truth values $Y$) at a ratio of 8:2, and 20% of the test-set values were then corrupted for model prediction. Missing values and outliers were added to the test set at random. First, 15% of the temperature data in the test set were randomly marked as missing values; considering that missing values may occur consecutively in practice, the maximum number of consecutive missing values was limited to 4 to mimic gaps in actually collected data. The marked series is denoted $Y'$. Subsequently, 5% of the data were replaced with Gaussian noise values.
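A minimal sketch of the missing-value injection, assuming NumPy; run lengths and positions are drawn at random, with runs capped at 4 consecutive points as described above (overlapping draws make the final fraction approximate):

```python
import numpy as np

def inject_missing(y, frac=0.15, max_run=4, seed=0):
    """Mask roughly `frac` of the series as NaN in runs of <= max_run points,
    mimicking consecutive sensor dropouts."""
    rng = np.random.default_rng(seed)
    y = y.astype(float).copy()
    budget = int(frac * len(y))
    while budget > 0:
        run = min(int(rng.integers(1, max_run + 1)), budget)
        start = int(rng.integers(0, len(y) - run))
        y[start:start + run] = np.nan
        budget -= run
    return y
```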
3.2. Data noise injection
We have designed an adaptive Gaussian noise injection method based on stratified sampling. The noise injection process comprises the following three steps.
The first step involves randomly shuffling the candidate set $V$, where $V$ represents the non-null values in the data $Y'$. $V$ is then divided into $D$ layers, denoted $V_1, V_2, \dots, V_D$. The number of noise points $m$ to be allocated to each layer is calculated as

$$m = \left\lfloor \frac{Q}{D} \right\rfloor$$

where $Q$ represents the total number of noise points to be inserted. For the remainder $R = Q \bmod D$, one additional sample is allocated to each of the first $R$ layers to keep the sample distribution as balanced as possible. Once the number of noise points for each layer is determined, $m$ original data points $x_k$ are drawn independently from each layer.
The second step involves generating adaptive Gaussian noise. The intensity of the noise is crucial to the model's prediction performance. In this study, Gaussian noise is generated adaptively based on the standard deviation of the original data, using the noise generation function $\varepsilon \sim N\big(0, (0.5\sigma)^2\big)$, where $\sigma$ represents the standard deviation of the original data. Scaling the standard deviation by 0.5 keeps the noise intensity within a reasonable range, ensuring that the noise introduces data variation without excessively distorting the original data characteristics.
The third step injects the generated noise into $x_k$ to obtain the data point $x'_k$ after noise injection:

$$x'_k = x_k + \varepsilon$$

Through index matching, this process adds noise precisely to the selected data points, injecting a given proportion of noise uniformly across the time dimension. This approach better simulates real-world scenarios while preserving the distribution characteristics of the original data. Following these steps yields a test set $Y'$ containing 15% missing values and 5% outliers.
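The three steps above can be sketched as follows, assuming NumPy; the number of layers `n_layers` and the indexing helpers are illustrative choices, not the exact values from the experiment:

```python
import numpy as np

def inject_noise(y, frac=0.05, n_layers=10, seed=0):
    """Stratified adaptive Gaussian noise: shuffle the non-null candidates V,
    split them into D layers, spread Q = frac*|V| noise points across them
    (the remainder R goes to the first R layers), and add N(0, (0.5*std)^2)
    noise at each selected point."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    idx = np.where(~np.isnan(y))[0]            # candidate set V
    rng.shuffle(idx)                           # step 1: random ordering
    layers = np.array_split(idx, n_layers)     # D layers
    Q = int(frac * len(idx))
    m, R = divmod(Q, n_layers)                 # per-layer count and remainder
    sigma = np.nanstd(y)                       # step 2: adaptive noise scale
    for d, layer in enumerate(layers):
        take = m + (1 if d < R else 0)         # balance the remainder
        picks = layer[:take]
        y[picks] += rng.normal(0.0, 0.5 * sigma, size=len(picks))  # step 3
    return y
```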
3.3. Experimental environment hyperparameter settings
The hardware configuration of the experimental environment used in this study includes an NVIDIA GTX 1050 Ti GPU, an Intel Core i5-11400F CPU, and 32 GB of DDR4 memory. The software configuration includes the deep learning framework PyTorch 2.7, the deep learning computing component CUDA 12.4, the data processing libraries pandas 2.2.3 and scikit-learn 1.6.1, and the visualization component Matplotlib 3.9.2.
For the ERA5 dataset, the training hyperparameters adopted are shown in Table 1; the input length is 24 data points and the prediction length is 1, i.e., 24 consecutive hours are used to predict one data point. To ensure that the model correctly captures the dependencies in the data while remaining computationally efficient, multiple experiments showed that a hidden dimension of 256 yielded the best results. Based on regularization theory [45] and experimental tuning, a 4-layer Transformer encoder was found to effectively prevent overfitting or underfitting, with good convergence at a learning rate of 1e-4.
For the buoy’s actual measurement dataset, due to the increase in time resolution from 1 hour to 10 minutes, the training strategy was adjusted accordingly: to ensure consistency in time span, the input sequence length was extended from 24 to 144; at the same time, as the time dimension included minute information, minute features were added to the input dimension; other parameters were also adjusted synchronously. The corresponding training strategy is detailed in Table 2.
4. Experiment and result analysis
This section focuses on the experiments and result analysis. Firstly, ablation experiments are designed to replace or remove various components of the model in order to verify the effectiveness of the temporal and spatial links in the proposed model. Secondly, comparative experiments are conducted by comparing the target elements (air temperature and wind speed) of two different datasets with mainstream temporal interpolation models, in order to verify the effectiveness of the ST-DAN model proposed in this paper. Finally, the ST-DAN model is trained using experimental datasets with good data quality and real-world datasets containing outliers, respectively, and the robustness of the model is verified through prediction results.
4.1. Ablation experiments
First, ablation studies were conducted based on the ERA5 dataset to validate the effectiveness of the proposed model’s components from three dimensions: the temporal pathway, the spatial pathway, and the spatio-temporal dual pathway. For the temporal pathway, a comparison was made with a standalone Transformer model; for the spatial pathway, a comparison was made with the traditional Graph Attention Network (GAT). The experimental results are shown in Table 3, presenting evaluations for both the training and testing phases. The MAE, MSE, and RMSE values for the single temporal pathway and single spatial pathway network models were consistently worse than those of the spatio-temporal dual-pathway network model (ST-DAN).
Specifically, ST-DAN achieved an MAE of 0.0386, which is 37.6% and 18.2% lower than ST-GAT and the standalone Transformer model, respectively. Its MSE was 0.0262, representing reductions of 23.8% and 1.3% compared to ST-GAT and Transformer, respectively. The RMSE was 0.1617, showing reductions of 12% and 0.6% compared to ST-GAT and Transformer, respectively. These results validate the effectiveness of the individual model components and fully demonstrate the necessity of spatio-temporal dual-pathway fusion and its performance advantages.
4.2. Comparative experiments
To validate the effectiveness and superiority of the proposed ST-DAN model in outlier correction tasks, comparative experiments were conducted with the statistical model ARIMA, sequential models from the RNN family, models from the Transformer family, and the spatio-temporal dual-branch model ST-LSTM.
As a classical time series analysis tool, ARIMA achieves data stationarity through differencing and combines autoregressive and moving average processes to capture long-term trends and short-term fluctuations in the data. The RNN family includes RNN, Bi-GRU, and Bi-LSTM. These models leverage the inherent structure of hidden layer neurons to endow the model with memory capacity for sequential data, enabling them to learn temporal dependencies within the data. Bi-GRU and Bi-LSTM are bidirectional variants of GRU and LSTM, respectively, capable of learning contextual information from both past and future, effectively handling long sequence problems. The Transformer family includes Transformer, Informer, and Autoformer. These models utilize the attention mechanism to capture long-range temporal dependencies for higher prediction reliability. Additionally, the ST-LSTM model was constructed by replacing the Transformer encoder in ST-DAN with an LSTM model, serving as a spatio-temporal dual-branch comparison model.
To ensure the fairness of the comparative experiments, the RNN family models (RNN, Bi-GRU, Bi-LSTM) used identical training parameter settings during training. Similarly, the Transformer family models (Transformer, Informer, Autoformer) maintained consistent internal parameters. To address the architectural differences between the two families, hyperparameter adjustments were made based on the proposed ST-DAN model’s hyperparameters for key parameters such as learning rate, hidden layer dimension, and number of network layers. This aimed to make models from different families comparable in terms of architectural depth and overall capacity. Meanwhile, the number of training epochs and batch size remained consistent for all models to ensure uniform training conditions. Furthermore, the same early stopping threshold mechanism was applied across all experiments, effectively suppressing overfitting while fully exploring the performance potential of each model, thereby enabling a comparative evaluation of optimal performance under fair training conditions.
In the experiments, the eight models—ARIMA, RNN, Bi-GRU, Bi-LSTM, ST-LSTM, Informer, Autoformer, and ST-DAN—were used to perform missing value imputation and outlier correction on the temperature data from the ERA5 dataset. The performance metrics of the eight models are shown in Table 4, including both training and test set evaluations. The table shows that, except for the ARIMA model which does not involve a training/prediction split, all models experienced a performance drop in the testing phase compared to the training phase. Since the actual number of input and output variables for all models was 1, the values for AR² and R² are identical.
The proposed ST-DAN model demonstrated significant advantages across all four metrics: MAE, MSE, RMSE, and R². Its MAE was 0.0386, representing a 29.3% reduction compared to the relatively well-performing ST-LSTM model; MSE was 0.0262, a reduction of 54.9%; RMSE was 0.1619, a reduction of 32.9%; and R² was 0.9996, an improvement of 0.4%. These results indicate that the ST-DAN model better captures the underlying trends of the real data, and the corrected data possesses higher reference value.
Figs 8 and 9 provide a visual comparison using bar charts of the training evaluation, from which it is evident that the proposed ST-DAN model demonstrated the best performance across all metrics during the training phase.
Figs 10 and 11 present intuitive comparative bar charts for the testing evaluation across various metrics. Compared to the training phase, all models exhibited a decline in their metrics to varying degrees; however, the ST-DAN model proposed in this paper still demonstrated the best performance during the testing process.
In the wind speed prediction experiment using buoy-measured data, a comparative analysis was conducted involving Bi-GRU, Bi-LSTM, Transformer, Informer, Autoformer, and the ST-DAN model. The experimental results for both the training and testing phases are presented in Table 5.
Compared to the experimental results on the ERA5 dataset (Table 4), in the context of buoy-measured data, the actual data exhibit stronger nonlinear fluctuation characteristics due to the complex and variable marine observation environment. The inclusion of such fluctuating data in the training set led to a certain degree of decline in the overall prediction accuracy of all models. Nevertheless, the proposed ST-DAN model still maintained a high level of performance.
Among the compared models in the testing phase, the Transformer model demonstrated relatively good performance, with MAE, MSE, RMSE, R², and AR² values of 0.1543, 0.1697, 0.4119, 0.9028, and 0.9028, respectively. However, compared to Transformer, the ST-DAN model showed significant advantages across all metrics: an MAE of 0.0833, a reduction of 46.0%; an MSE of 0.1036, a reduction of 38.9%; an RMSE of 0.3218, a reduction of 21.9%; and an R² of 0.9732, an improvement of 7.2%.
It is noteworthy that, since the number of input and output variables was one for all models, the values of the adjusted coefficient of determination (AR²) and the coefficient of determination (R²) are identical. The experimental results validate the significant advantages of the ST-DAN model in wind speed prediction tasks under complex real-world marine environments.
Figs 12 and 13 present intuitive comparative bar charts of the evaluation data for each model during the training phase. Due to the presence of some latent outliers in the training set, the performance of all models declined to some extent; however, the proposed ST-DAN still remained optimal.
As shown in Figs 14 and 15, during the testing phase of wind speed prediction, the ST-DAN model exhibited only a minor performance difference compared to its training phase, indicating excellent generalization capability. Meanwhile, the model maintained its superior performance over all other compared models. These results collectively demonstrate that ST-DAN possesses reliable data restoration capability and strong robustness when handling complex meteorological data.
4.3. Visual analysis of data repair results
This study selects the relatively well-performing Transformer model and the ST-DAN model for comparative analysis of their effectiveness in temperature data restoration. The results before and after temperature data restoration are shown in Figs 16 and 17, where the solid blue line represents the true values, the red dashed line represents the predicted values, red circles indicate missing values, and black dots mark anomalies.
In Figs 16 and 17, the red and green boxes mark two regions with significant differences; their enlarged local details are shown in Figs 18 and 19, respectively, covering a total of seven data points requiring restoration. Based on the discrepancy between the restored data (red dashed line) and the true data (solid blue line), the data restored by ST-DAN align almost perfectly with the true values, with no predicted values deviating from the underlying trend (Table 6).
The results of the wind speed data before and after restoration are shown in Figs 20 and 21. As before, the solid blue line represents the ground-truth values, the red dashed line the predicted values, and the red hollow circles the missing values. Due to the presence of genuine anomalies in the measured data, the training set contained some anomalous samples, leading to a certain decline in overall prediction quality (Figs 22 and 23).
The ground truth values at these locations, along with the restoration results from the Transformer model and the ST-DAN model, are detailed in Table 7. The results demonstrate that despite interference from anomalous data, ST-DAN effectively captures the underlying variation trends of the data, exhibiting a favorable goodness of fit. This further validates the strong robustness of the proposed model.
4.4. Analysis of computational complexity
The variables in the computational complexity analysis are defined as follows: $B$ represents the batch size, $L$ denotes the length of the input sequence, $H$ signifies the dimension of the hidden layer, $n$ stands for the number of Transformer layers, and $C$ indicates the number of meteorological features. For the time complexity, in the data preprocessing stage, time features are added to each record and the data are standardized, with a time complexity of $O(BLC)$. In the model computation, the spatial encoding link processes the physical relationships and mask-matrix calculations of the $C$ (here 5) meteorological features, with a time complexity of $O(BLC^2)$; in the temporal encoding link, the time complexity of the multi-layer Transformer encoder with its multi-head attention mechanism is $O\big(nB(L^2H + LH^2)\big)$; the feature fusion layer involves two linear transformations, with a time complexity of $O(BLH^2)$. Therefore, the total time complexity is $O\big(nB(L^2H + LH^2) + BLC^2\big)$.
Regarding the space complexity, the temporal encoding link is primarily determined by the number of Transformer layers $n$ and is expressed as $O(nH^2)$. In the spatial encoding link, the space complexity is mainly determined by the physical relationship matrix and is expressed as $O(C^2)$. The remaining contribution comes from the feature fusion layer, with a space complexity of $O(H^2)$. Therefore, the total space complexity is expressed as $O(nH^2 + C^2)$.
4.5. Robustness analysis
The ST-DAN model described in this paper leverages advanced concepts of spatio-temporal feature fusion and multi-head self-attention algorithms. It capitalizes on the technical advantages of its dual link architecture to establish a physical constraint mechanism suitable for complex anomaly scenarios in meteorological data repair, demonstrating strong robustness. The spatial link models the physical relationships between meteorological variables, while the temporal link captures the evolutionary patterns of target variables. This decoupled design prevents the cross-contamination of spatio-temporal features, ensuring that when one feature link is compromised by anomalies, the other can maintain baseline predictive capability. Furthermore, by utilizing a physical constraint matrix based on learnable relationships derived from meteorological priors, the model is compelled to adhere to fundamental physical laws, thereby avoiding repaired results that contradict physical principles.
On the other hand, this paper adopts an unsupervised learning training scheme, using the training set derived from the ERA5 reanalysis for training and prediction. ST-DAN achieved high performance, with an MAE of 0.0386, an MSE of 0.0262, and an RMSE of 0.1617. When subsequently trained on buoy observation data containing outliers, the ST-DAN model maintained high performance, with the MAE increasing to 0.0622, the MSE to 0.0487, and the RMSE to 0.2280. Although R² decreases slightly, it remains high (see Table 7 for details). Additionally, the data repair results for both the ERA5 data and the buoy observation data do not violate the data variation trend. These experimental results collectively demonstrate the model's strong robustness.
5. Conclusion
The Spatio-Temporal Dual Attention Network (ST-DAN) proposed in this study integrates a Transformer encoder with an improved Graph Attention Network (GAT) to construct a dual-link reasoning architecture, effectively addressing the task of missing value imputation and outlier correction in buoy meteorological data. The temporal link employs the global self-attention mechanism of the Transformer encoder to accurately capture long-term temporal dependencies of individual elements, while the spatial link enhances the physical correlation between meteorological elements and dynamically optimizes weights by introducing a GAT model with a physically constrained adjacency matrix. The synergistic combination of the two significantly improves the accuracy and physical consistency of data repair. Experimental results show that ST-DAN performs excellently on both the ERA5 dataset and actual buoy data from Xiaomai Island in Qingdao, and demonstrates good robustness in continuous missing scenarios and noisy data, validating its effectiveness in meteorological data repair tasks.
Nevertheless, the proposed model has certain limitations. The construction of the physical adjacency matrix relies on domain expertise, which limits its scalability, and the parallel computation of the dual links results in slower training. Future research can be advanced in three directions: first, exploring automated methods for generating constraint matrices by integrating data-driven approaches with physical rules, to reduce dependence on expert knowledge; second, introducing sparse attention mechanisms and lightweight model designs to improve computational efficiency; third, extending the model to applications such as multi-buoy collaborative data repair and high-frequency data imputation during extreme weather events. Incorporating additional meteorological physical equations could further enhance the model's interpretability and generalization capability, thereby providing more comprehensive technical support for the high-quality application of marine meteorological observation data.
Acknowledgments
The authors are grateful to the Ocean Buoy Technology Team from the Institute of Oceanographic Instrumentation, Shandong Academy of Sciences, for their invaluable assistance with the acquisition and processing of ocean buoy data.
References
- 1. Ma Z, Han W, Zhao C, Zhang X, Yang Y, Wang H, et al. A case study of evaluating the GRAPES_Meso V5.0 forecasting performance utilizing observations from South China Sea Experiment 2020 of the “Petrel Project”. Atmospheric Research. 2022;280:106437.
- 2. Zheng S, Chen Y, Huang X-Y, Chen M, Chen X, Huang J. Ensemble-based diurnally varying background error covariances and their impact on short-term weather forecasting. Atmospheric and Oceanic Science Letters. 2022;15(6):100225.
- 3. Zheng L, Li T, Liu D. Evaluation of sub-seasonal prediction skill for an extreme precipitation event in Henan province, China. Front Earth Sci. 2023;11.
- 4. Lai SK, Chan PW, He Y, Chen SS, Kerns BW, Su H, et al. Real-Time Operational Trial of Atmosphere–Ocean–Wave Coupled Model for Selected Tropical Cyclones in 2024. Atmosphere. 2024;15(12):1509.
- 5. Yu Y-H, Gang Y-S, Lee W-B. Development of a Floating Buoy for Monitoring Ocean Environments. Journal of the Korean Society of Marine Engineering. 2009;33(5):705–12.
- 6. Kent EC, Kennedy JJ, Smith TM, Hirahara S, Huang B, Kaplan A, et al. A Call for New Approaches to Quantifying Biases in Observations of Sea Surface Temperature. Bulletin of the American Meteorological Society. 2017;98(8):1601–16.
- 7. Okafor N, Ingle R, Saunders M, Delaney D. Assessing and Improving the Quality of IoT Sensor Data in Environmental Monitoring Networks: A Focus on Peatlands. In: Authorea Preprints, 2024.
- 8. Huang J-X, Li Q-S, Han X-L, He J-Y. Recovery of Long-Term Missing Wind Velocity Data for Structural Health Monitoring Based on Deep Learning and Meteorological Databases. J Struct Eng. 2025;151(6).
- 9. Topaloğlu F. Machine Learning-Based Approaches and Comparisons for Estimating Missing Meteorological Data and Determining the Optimum Data Set in Nuclear Energy Applications. IEEE Access. 2025;13:37019–34.
- 10. Ramos‐Calzado P, Gómez‐Camacho J, Pérez‐Bernal F, Pita‐López MF. A novel approach to precipitation series completion in climatological datasets: application to Andalusia. Intl Journal of Climatology. 2008;28(11):1525–34.
- 11. Yozgatligil C, Aslan S, Iyigun C, Batmaz I. Comparison of missing value imputation methods in time series: the case of Turkish meteorological data. Theor Appl Climatol. 2012;112(1–2):143–67.
- 12. Haile TT, Tian F, AlNemer G, Tian B. Multiscale Change Point Detection for Univariate Time Series Data with Missing Value. Mathematics. 2024;12(20):3189.
- 13. Lalic B, Stapleton A, Vergauwen T, Caluwaerts S, Eichelmann E, Roantree M. A comparative analysis of machine learning approaches to gap filling meteorological datasets. Environ Earth Sci. 2024;83(24).
- 14. Shaikh ZM, Ramadass S. Unveiling deep learning powers: LSTM, BiLSTM, GRU, BiGRU, RNN comparison. IJEECS. 2024;35(1):263.
- 15. Subha J, Saudia S. Precipitation forecast using RNN variants by analyzing Optimizers and Hyperparameters for Time-series based Climatological Data. Int j electr comput eng syst (Online). 2024;15(3):261–74.
- 16. Ng YN, Lim HY, Cham YC, Abu Bakar MA, Mohd Ariff N. Comparison Between LSTM, GRU and VARIMA in Forecasting of Air Quality Time Series Data. Mal J Fund Appl Sci. 2024;20(6):1248–60.
- 17. Gao S, Huang Y, Zhang S, Han J, Wang G, Zhang M, et al. Short-term runoff prediction with GRU and LSTM networks without requiring time step optimization during sample generation. Journal of Hydrology. 2020;589:125188.
- 18. Wang J, Li X, Li J, Sun Q, Wang H. NGCU: A New RNN Model for Time-Series Data Prediction. Big Data Research. 2022;27:100296.
- 19. Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw. 1994;5(2):157–66. pmid:18267787
- 20. Niu Z, Zhong G, Yue G, Wang L-N, Yu H, Ling X, et al. Recurrent attention unit: A new gated recurrent unit for long-term memory of important parts in sequential data. Neurocomputing. 2023;517:1–9.
- 21. Al-Selwi SM, Hassan MF, Abdulkadir SJ, Muneer A. LSTM Inefficiency in Long-Term Dependencies Regression Problems. ARASET. 2023;30(3):16–31.
- 22. Pei W, Feng X, Fu C, Cao Q, Lu G, Tai Y-W. Learning Sequence Representations by Non-local Recurrent Neural Memory (Version 1). In: arXiv, 2022.
- 23. Chandar S, Sankar C, Vorontsov E, Kahou SE, Bengio Y. Towards non-saturating recurrent units for modelling long-term dependencies. In: 2019.
- 24. Wu N, Green B, Ben X, O’Banion S. Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case. In: arXiv, 2020.
- 25. Li S, Jin X, Xuan Y, Zhou X, Chen W, Wang YX, et al. Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting. arXiv. 2019.
- 26. Zhou H, Li J, Zhang S, Zhang S, Yan M, Xiong H. Expanding the prediction capacity in long sequence time-series forecasting. Artificial Intelligence. 2023;318:103886.
- 27. Wu H, Xu J, Wang J, Long M. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. arXiv. 2021.
- 28. Cui Y, Li Z, Wang Y, Dong D, Gu C, Lou X, et al. Informer model with season-aware block for efficient long-term power time series forecasting. Computers and Electrical Engineering. 2024;119:109492.
- 29. Yang C, Shu Z. Long-term rolling prediction of transformer power load capacity based on the informer model. J Phys: Conf Ser. 2024;2782(1):012023.
- 30. Yu Y, Zhao S, Han L, Peng L, Xu Y, Tian Q, et al. Cycleift: A deep transfer learning model based on informer with cycle fine-tuning for water quality prediction. Stoch Environ Res Risk Assess. 2025;39(7):2873–85.
- 31. Cao D, Zhang S. AD-autoformer: decomposition transformers with attention distilling for long sequence time-series forecasting. J Supercomput. 2024;80(14):21128–48.
- 32. Ahn JY, Kim Y, Park H, Park SH, Suh HK. Evaluating Time-Series Prediction of Temperature, Relative Humidity, and CO2 in the Greenhouse with Transformer-Based and RNN-Based Models. Agronomy. 2024;14(3):417.
- 33. Yu D, Yang B, Liu D, Wang H, Pan S. A survey on neural-symbolic learning systems. Neural Networks. 2023;166:105–26.
- 34. Li K, Lu G, Luo G, Cai Z. Seed-free Graph De-anonymiztiation with Adversarial Learning. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020. 745–54.
- 35. Lu B, Gan X, Jin H, Fu L, Wang X, Zhang H. Make More Connections: Urban Traffic Flow Forecasting with Spatiotemporal Adaptive Gated Graph Convolution Network. ACM Trans Intell Syst Technol. 2022;13(2):1–25.
- 36. Ward IR, Joyner J, Lickfold C, Guo Y, Bennamoun M. A Practical Tutorial on Graph Neural Networks. ACM Comput Surv. 2022;54(10s):1–35.
- 37. Ahmed A, Sun G, Bilal A, Li Y, Ebad SA. Precision and efficiency in skin cancer segmentation through a dual encoder deep learning model. Sci Rep. 2025;15(1):4815. pmid:39924555
- 38. Ji A, Xie Y, Fan H, Xue X, Zhang M. Dual-attention deep learning model for real-time advance rate prediction in TBM operations. Automation in Construction. 2025;175:106210.
- 39. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention Is All You Need. arXiv. 2017.
- 40. Peng D, Zhang Y. MA-GCN: A Memory Augmented Graph Convolutional Network for traffic prediction. Engineering Applications of Artificial Intelligence. 2023;121:106046.
- 41. Xia H, Chen X, Wang Z, Chen X, Dong F. A Multi-Modal Deep-Learning Air Quality Prediction Method Based on Multi-Station Time-Series Data and Remote-Sensing Images: Case Study of Beijing and Tianjin. Entropy (Basel). 2024;26(1):0. pmid:38275499
- 42. Feng Z, Shen J, Zhou Q, Hu X, Yong B. Hierarchical gated pooling and progressive feature fusion for short-term PV power forecasting. Renewable Energy. 2025;247:122929.
- 43. Shen N, Wang J, Wang X, Ma J, Wu S. PLSTM-MTGF: A deep learning fusion model enabling real-time multi-target monitoring of penicillin fermentation via near-infrared spectroscopy. Spectrochim Acta A Mol Biomol Spectrosc. 2026;349:127358. pmid:41406794
- 44. Chen H. Deep Transfer Recommendation Model with Spatio-Temporal Enhancement and Adaptive Gating Fusion. In: 2025 8th International Conference on Advanced Algorithms and Control Engineering (ICAACE), 2025. 2014–7.
- 45. Moradi R, Berangi R, Minaei B. A survey of regularization strategies for deep models. Artif Intell Rev. 2019;53(6):3947–86.