Abstract
Time series forecasting remains a fundamental yet challenging task due to its inherent non-linear dynamics, inter-variable dependencies, and long-term temporal correlations. Existing approaches often struggle to jointly capture local temporal continuity and global contextual relationships, particularly under complex external influences. To overcome these limitations, we propose KALFormer, a knowledge-augmented attention learning transformer framework that integrates sequential modeling with external information fusion. KALFormer enhances spatiotemporal representation and contextual reasoning by integrating Long Short-Term Memory (LSTM) encoders, Transformer-based self-attention mechanisms, and knowledge-aware modules. Extensive experiments on six public benchmark datasets demonstrate that KALFormer achieves an average improvement of 8.4% in MSE and MAE compared with representative baseline models, highlighting its robustness, interpretability, and reliability for long-term time series forecasting. The source code is available at https://github.com/dxpython/KALFormer.
Citation: Dong X, Yang Q, Cheng W, Zhang Y (2026) KALFormer: Knowledge-augmented attention learning for long-term time series forecasting with transformer. PLoS One 21(1): e0338052. https://doi.org/10.1371/journal.pone.0338052
Editor: Rafael Duarte Coelho dos Santos, Instituto Nacional de Pesquisas Espaciais, BRAZIL
Received: July 24, 2025; Accepted: November 17, 2025; Published: January 5, 2026
Copyright: © 2026 Dong et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Data download: https://doi.org/10.5281/zenodo.17068599.
Funding: This work was financially supported by the Undergraduate Teaching Reform Project (2024-26) from Guizhou University of Traditional Chinese Medicine, the Doctoral Start-up Fund (2019-76) from Guizhou University of Traditional Chinese Medicine, and the National Natural Science Foundation of China (Grant No. 32360152).
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Time series forecasting is a critical task in a wide range of fields [1], where accurate prediction of future trends is essential. The complexity of time series data [2] and the challenge of capturing long-range dependencies [3] are fundamental difficulties that have persisted in this domain. As the length of sequences grows, the contextual relationships between distant time steps become increasingly intricate. Traditional models, such as Long Short-Term Memory (LSTM) [22–24] and Recurrent Neural Networks (RNN) [4], often exhibit limitations in capturing these distant dependencies, leading to reduced forecasting accuracy, particularly when dealing with long-range patterns.
Additionally, in many real-world applications, forecasting tasks involve multiple correlated variables that evolve under the influence of external and contextual factors [5]. For example, policy changes, economic trends, and environmental conditions such as weather jointly affect multiple variables in domains like finance [7] and energy management [6]. Therefore, it is crucial to model these inter-variable dependencies and nonlinear interactions rather than treating them as isolated exogenous influences. This motivates a multivariate forecasting formulation, where all variables are predicted jointly to capture both temporal dynamics and cross-variable relationships [32].
To tackle these issues, we propose KALFormer, Knowledge-Augmented Attention Learning for Long-Term Time Series Forecasting with Transformer. This model integrates LSTM, Knowledge Augmented Network (KAN) [29], Multi-Head Attention (MHA) [12], and Transformer architecture [26], addressing both long-range dependencies and disruptive factors. Specifically, the KAN component enhances the model’s ability to integrate domain-informed external factors, while the Transformer architecture—with its multi-head attention mechanism and feedforward networks—facilitates the efficient fusion of features across different time steps.
In summary, our contributions are as follows:
- A unified framework, KALFormer, is developed by integrating LSTM, self-attention, KAN, and Transformer modules to address contextual dependencies and external interferences in long-term time series forecasting.
- The incorporation of knowledge-augmented representations through KAN enhances the model’s interpretability and environmental awareness, enabling more reliable prediction in complex domains such as finance and energy.
- Comprehensive experiments on multiple benchmark datasets demonstrate that KALFormer achieves consistently improved predictive accuracy and robustness compared with existing forecasting approaches.
2 Related work
2.1 Long-term dependency models for series prediction
2.1.1 RNN-based and CNN-based methods.
RNNs [4] and their variants, such as LSTM [22–24] and Gated Recurrent Units (GRU) [8], are among the earliest deep learning architectures applied to time series forecasting (more comprehensive reviews of RNN architectures are provided in [25,30]). Due to their recursive structure, RNNs process sequential data step by step, updating hidden states at each time step to capture temporal dependencies [9–11]. This characteristic makes them well-suited for modeling short- to medium-term dependencies in time series data.
Several improvements have been proposed to enhance the performance of RNN-based models. For instance, Sagheer et al. introduced a deep Long Short-Term Memory (DLSTM) model [9], where a genetic algorithm was employed to optimize the model architecture. When evaluated on oil industry data, the DLSTM model outperformed various statistical and computational intelligence techniques, demonstrating its effectiveness in capturing both past and future dependencies [31]. Similarly, Chang et al. developed the Memory Time-series Network (MTNet) [10], which integrates a large memory module, three independent encoders, and an autoregressive component, thereby improving the model’s ability to learn from long-term historical data.
Despite these advancements, RNN-based methods face inherent challenges, particularly when dealing with very long sequences. One major limitation is their difficulty in effectively capturing long-range dependencies due to information dissipation during backpropagation. Although LSTM and GRU partially address this issue through gating mechanisms, they still struggle when processing extremely long sequences [16]. Additionally, the sequential nature of RNNs restricts their ability to leverage parallel computation, resulting in inefficient training and slower inference times for large-scale time series data.
To mitigate these issues, alternative architectures, such as convolutional neural networks (CNNs), have been explored for time series forecasting. CNNs can efficiently extract local temporal patterns using convolutional filters, enabling faster computations and improved scalability compared to RNNs. However, CNNs alone often lack the ability to capture long-term dependencies without additional modifications, such as dilated convolutions or hybrid models that integrate recurrent structures.
More recently, TimesNet [36] has emerged as an alternative approach that extends CNN-based methods by focusing on temporal 2D-variation modeling. Unlike conventional sequence-based models, TimesNet represents time series as 2D tensors, allowing it to effectively analyze temporal data with non-uniform frequencies and resolutions. This makes it particularly advantageous in scenarios where traditional RNNs and CNNs struggle with irregular time series patterns. By leveraging its novel representation, TimesNet outperforms traditional models in handling dynamic temporal patterns and provides a scalable solution for forecasting applications.
In summary, while RNNs and their variants have played a crucial role in time series forecasting, their inherent limitations in handling long-term dependencies and computational efficiency have motivated the development of alternative approaches, including CNN-based models, TimesNet, attention-based models, and hybrid architectures.
2.1.2 Attention-based methods.
To overcome the inherent limitations of RNN architectures, the Attention mechanism was introduced, emerging as a powerful tool for time series modeling, particularly in solving long-range dependency issues. By calculating the relevance between different time steps within an input sequence, Attention enables models to identify which time points are crucial for the current prediction. This allows models to focus on the most relevant information without being constrained by the step-by-step processing inherent in RNNs, thereby mitigating the vanishing gradient problem and inefficiencies associated with sequential data processing.
One of the prominent Attention-based models is the Transformer [26], which fully leverages the Attention mechanism and completely eliminates the need for recurrent structures. By employing parallel computations, the Transformer architecture [27] dramatically enhances training efficiency. Its Multi-Head Attention mechanism [12] allows the model to attend to different subspaces of the input, thus capturing a wide array of global dependencies and providing a more comprehensive understanding of long-range temporal relationships.
Although the Transformer architecture has shown strong capability in modeling long-range dependencies, it still faces limitations in capturing complex spatiotemporal interactions and non-stationary dynamics that often occur in real-world time series. To address these challenges, several Transformer-based variants have been proposed to enhance representation of temporal and cross-variable relationships. For example, the iTransformer [33] adopts an inverted architecture to improve modeling of long-range dependencies through self-attention mechanisms, while the Crossformer [34] introduces a cross-dimension dependency module that jointly captures temporal and spatial correlations in multivariate series. Likewise, the Deep Time-Index model [35] learns time-indexed representations to balance short- and long-term dependencies, thereby improving forecasting performance across diverse domains.
Informer [14] further refines Transformer-based time series forecasting by addressing the high complexity and memory usage issues associated with long-sequence processing. It introduces the ProbSparse self-attention mechanism, self-attention distillation, and a generative decoder, which significantly improve computational efficiency while maintaining strong predictive capabilities. Non-stationary Transformers [15] further advance this paradigm by incorporating both stationary and non-stationary attention mechanisms, allowing the model to adaptively balance predictability and flexibility, which is crucial for forecasting non-stationary data.
The VMD-Crossformer [13] was proposed for short-term power load forecasting. By combining Variational Mode Decomposition (VMD) with the Crossformer architecture, the model effectively captures both temporal and dimensional dependencies. The Two-Stage Attention layer in Crossformer is designed to simultaneously capture dependencies across time and feature dimensions, improving forecasting performance.
Collectively, these Attention-based methods represent significant advancements in time series forecasting, each employing unique strategies to handle complex, high-dimensional temporal data. By leveraging self-attention mechanisms, they offer improved efficiency, scalability, and forecasting accuracy over traditional RNN-based approaches, making them highly effective for a wide range of real-world applications.
2.2 Handling external factors in time series prediction
In practical applications, the dynamic behavior of time series data is often shaped not only by historical trends but also by various other factors. For example, fluctuations in electricity load may be affected by weather conditions, social activities, and holidays. Consequently, relying solely on historical data for time series modeling can be insufficient for capturing the complex interplay between these influencing elements.
To address these complexities more effectively, KAN has been applied to time series forecasting [29]. KAN enhances the model’s interpretive ability by integrating external knowledge bases or graph data into the forecasting process. These additional sources may include domain-specific expertise, industry standards, or other relevant contextual information. By leveraging Graph Convolutional Networks (GCN) [19–21], this information is encoded into low-dimensional embeddings that are subsequently fused with historical time series data.
In this context, GCNs model the relationships between influencing factors, generating contextually meaningful embedding vectors. These embeddings, when combined with time series features, enable the model to leverage both historical data and a deeper understanding of the broader influences shaping the predictions. For example, Han et al. [17] introduced the Graph Hawkes Neural Network, designed to capture complex dependencies within dynamic sequences and predict future events. Experimental results on large-scale temporal multi-relational datasets validated its effectiveness. Similarly, Zhong et al. [18] proposed the Knowledge Graph-Augmented Network for fine-grained sentiment analysis, showcasing how contextual knowledge can be efficiently combined with linguistic data. Their method integrated multiple perspectives, including syntax, context, and domain information, to extract sentiment features with greater accuracy.
While GCNs excel at modeling the underlying relationships between influencing factors, they aggregate all neighboring nodes with uniform weights, which limits their capacity to model heterogeneous or asymmetric dependencies among variables [28]. To overcome this limitation, our framework replaces static GCN aggregation with a graph attention mechanism, allowing the model to dynamically learn the relative importance of neighboring nodes. This adaptive weighting enables more flexible and context-aware information propagation, thereby improving the expressiveness and robustness of the knowledge integration process.
3 Method
3.1 Overall architecture
In time series forecasting, models typically face three fundamental challenges. First, sequence data are inherently complex, with multidimensional features and nonlinear relationships that hinder traditional approaches from capturing intricate dynamic patterns. Second, time series often exhibit long-term dependencies, and models frequently struggle to preserve distant historical information, which reduces prediction accuracy for long sequences. Third, external influences such as economic trends and environmental changes further complicate forecasting, increasing task difficulty.
To address these issues, we propose the KALFormer model, Knowledge-Augmented Attention Learning for Long-Term Time Series Forecasting with Transformer. KALFormer integrates LSTM [22], Transformer [26], MHA, and the KAN [29] to capture long-range dependencies and complex temporal patterns more effectively. The overall architecture is illustrated in Fig 1.
The framework integrates Long Short-Term Memory (LSTM) units for temporal encoding, a self-attention mechanism for capturing global dependencies, a Graph Neural Network (GNN)-based Knowledge-Augmented Network (KAN) for nonlinear relational interactions, and a multi-layer Transformer encoder–decoder for feature fusion and sequence prediction. The model outputs are normalized and passed through a linear layer and Softmax for forecasting probabilities.
3.2 Sequential backbone with LSTM and attention
Recurrent architectures are widely adopted for sequential modeling due to their ability to preserve contextual information across time. Among them, LSTMs mitigate the vanishing gradient problem by introducing gating mechanisms that regulate information flow. Nevertheless, standard LSTMs often struggle to capture very long-range dependencies and complex temporal dynamics, which limits their capacity for modeling global context. To address this issue, we employ a sequential backbone that combines stacked LSTM layers with a self-attention mechanism, thereby enabling the extraction of both local continuity and long-distance dependencies.
Formally, let $X \in \mathbb{R}^{N \times D}$ denote the input sequence, where N is the number of time steps and D is the feature dimension. Temporal features are obtained by passing X through two stacked LSTM layers:

$$H = \mathrm{LSTM}_2\big(\mathrm{LSTM}_1(X)\big).$$

The memory update within each LSTM cell is expressed compactly as

$$
i_t = \sigma\big(W_i[h_{t-1}; x_t] + b_i\big), \quad
f_t = \sigma\big(W_f[h_{t-1}; x_t] + b_f\big), \quad
o_t = \sigma\big(W_o[h_{t-1}; x_t] + b_o\big),
$$
$$
\tilde{c}_t = \tanh\big(W_c[h_{t-1}; x_t] + b_c\big), \quad
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad
h_t = o_t \odot \tanh(c_t),
$$

where $i_t$, $f_t$, and $o_t$ denote the input, forget, and output gates, respectively, and $\tilde{c}_t$ is the candidate memory cell. The vector $c_t$ represents the internal cell state that carries long-term information, while $h_t$ denotes the hidden state at time step t, which serves as both the output of the current cell and the input to subsequent layers or future time steps. This formulation highlights the gated mechanism that allows the LSTM to preserve long-term dependencies while adaptively integrating new temporal information.
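As an illustration, the gate equations above can be sketched in NumPy. The weight layout (one fused matrix applied to the concatenation $[h_{t-1}; x_t]$, split in input/forget/output/candidate order) is an assumption chosen for compactness, not the paper's implementation; frameworks such as PyTorch's `torch.nn.LSTM` fuse these operations internally.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_{t-1}; x_t] to the four gate pre-activations.

    W: (4*d_h, d_h + d_x), b: (4*d_h,), split in i/f/o/candidate order.
    """
    d_h = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i = sigmoid(z[:d_h])               # input gate
    f = sigmoid(z[d_h:2 * d_h])        # forget gate
    o = sigmoid(z[2 * d_h:3 * d_h])    # output gate
    c_tilde = np.tanh(z[3 * d_h:])     # candidate memory cell
    c = f * c_prev + i * c_tilde       # new cell state
    h = o * np.tanh(c)                 # new hidden state
    return h, c

rng = np.random.default_rng(0)
d_x, d_h = 3, 4
W = rng.normal(scale=0.1, size=(4 * d_h, d_h + d_x))
b = np.zeros(4 * d_h)
h, c = lstm_cell_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

Because $h_t = o_t \odot \tanh(c_t)$ with both factors bounded, every component of the hidden state stays in $(-1, 1)$.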
To complement sequential modeling, a self-attention layer is applied to the hidden representations $H$. In this mechanism, each time step interacts with all others through a set of learned projections that generate a query–key–value (QKV) triplet. Specifically, the hidden sequence is linearly mapped to query, key, and value matrices as $Q = H W_Q$, $K = H W_K$, and $V = H W_V$, where $W_Q$, $W_K$, and $W_V$ are trainable weight matrices. The attention-enhanced representation is then computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,$$

where $d_k$ denotes the key dimension. This mechanism assigns adaptive weights to temporal positions, enabling the model to highlight informative time steps while suppressing redundant or less relevant ones.
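A minimal NumPy sketch of this scaled dot-product self-attention follows; the shapes and the row-wise softmax mirror the formulation above, while the random initialization is purely illustrative.

```python
import numpy as np

def self_attention(H, W_q, W_k, W_v):
    """Scaled dot-product self-attention over hidden states H (N x d_h)."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (N, N) pairwise relevance
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax: rows sum to 1
    return A @ V, A

rng = np.random.default_rng(0)
N, d_h, d_k = 6, 8, 4
H = rng.normal(size=(N, d_h))
out, A = self_attention(H,
                        rng.normal(size=(d_h, d_k)),
                        rng.normal(size=(d_h, d_k)),
                        rng.normal(size=(d_h, d_k)))
print(out.shape)  # (6, 4)
```

Each row of the attention matrix is a probability distribution over all time steps, which is what lets the model weight distant positions as easily as adjacent ones.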
By integrating LSTM and self-attention, the backbone leverages the complementary strengths of both modules: the LSTM ensures sequential continuity and memory preservation, whereas the attention mechanism dynamically highlights salient dependencies irrespective of temporal distance. The resulting representation serves as a robust foundation for subsequent knowledge integration and fusion.
3.3 Knowledge integration through knowledge-augmented network
Although recurrent and attention-based encoders effectively capture temporal dependencies, they remain limited in explicitly modeling the cross-variable interactions that characterize multivariate time series. To address this limitation, a KAN is introduced to integrate structured relational information into the representation learning process.
Specifically, a variable graph is defined as $\mathcal{G} = (\mathcal{V}, \mathcal{E}, A)$, where $\mathcal{V}$ denotes the set of variables (nodes), $\mathcal{E}$ represents the set of edges encoding statistical or semantic relations among variables, and A is the weighted adjacency matrix that quantifies connection strengths. Each node $v_i \in \mathcal{V}$ corresponds to one feature dimension in the multivariate input sequence $X \in \mathbb{R}^{N \times D}$, where N is the sequence length (time steps) and D is the feature dimension, i.e., the number of observed variables or sensors in the dataset. The value of D is determined directly by the dataset configuration (e.g., the number of sensors in traffic data or meteorological indicators in weather data) and remains fixed during both training and inference, ensuring a consistent graph structure for the KAN across all temporal windows.
The adjacency matrix combines statistical correlation with domain priors in a convex form,

$$A = \lambda \,\mathrm{norm}\big(\mathrm{ReLU}(C)\big) + (1 - \lambda)\,\mathrm{norm}(P),$$

where $C$ is the Pearson correlation matrix estimated from training data, $\mathrm{ReLU}(\cdot)$ ensures nonnegativity, P encodes domain-specific structural priors (e.g., sensor connectivity in traffic or meteorological coupling patterns), $\mathrm{norm}(\cdot)$ denotes row-wise normalization, and λ is a learnable mixing coefficient. To encourage sparsity, only the top-k strongest connections per node are retained, followed by the addition of self-loops and symmetric normalization,

$$\hat{A} = \tilde{D}^{-1/2}\big(A_{\text{top-}k} + I\big)\tilde{D}^{-1/2},$$

where $\tilde{D}$ is the degree matrix of $A_{\text{top-}k} + I$. This yields a normalized adjacency matrix $\hat{A}$ for stable propagation.
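The construction above can be sketched in NumPy as follows. The prior `P`, the fixed mixing weight `lam`, and the choice `k=3` are illustrative assumptions (in the model, λ is learned), so this is a sketch of the recipe rather than the authors' implementation.

```python
import numpy as np

def build_adjacency(X_train, P, lam=0.5, k=3):
    """Blend Pearson correlations with a prior P, sparsify, and normalize."""
    C = np.corrcoef(X_train, rowvar=False)        # (D, D) Pearson correlations
    C = np.maximum(C, 0.0)                        # ReLU: keep nonnegative links
    row_norm = lambda M: M / np.maximum(M.sum(axis=1, keepdims=True), 1e-8)
    A = lam * row_norm(C) + (1 - lam) * row_norm(P)

    # Keep only the top-k strongest connections per node.
    keep = np.argsort(A, axis=1)[:, -k:]
    mask = np.zeros_like(A)
    np.put_along_axis(mask, keep, 1.0, axis=1)
    A = A * mask

    # Self-loops plus symmetric normalization: D^{-1/2} (A + I) D^{-1/2}.
    A = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                     # 200 steps, D = 5 variables
A_hat = build_adjacency(X, P=np.eye(5))
print(A_hat.shape)  # (5, 5)
```

Because the graph is over variables rather than time steps, $\hat{A}$ is only $D \times D$ and can be recomputed cheaply whenever the variable set changes.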
Each node $v_i$ is initialized with an embedding $e_i \in \mathbb{R}^{d}$, collected in $E \in \mathbb{R}^{D \times d}$. Information is propagated through a two-layer graph neural network (GNN) based on neighborhood aggregation,

$$K = \sigma\big(\hat{A}\,\sigma(\hat{A} E W^{(1)})\,W^{(2)}\big),$$

where $W^{(1)}$ and $W^{(2)}$ are trainable weights and $\sigma(\cdot)$ denotes a nonlinearity. The output $K \in \mathbb{R}^{D \times d}$ serves as knowledge embeddings that encode cross-variable dependencies and domain-informed relationships.
To condition temporal features on these knowledge embeddings, cross-attention is applied with queries from the temporal representation $H$ and keys/values from K,

$$Z = \mathrm{softmax}\!\left(\frac{(H W_Q)(K W_K)^{\top}}{\sqrt{d_k}}\right) K W_V,$$

producing a knowledge-informed context $Z$. The final integration employs a learnable gating mechanism,

$$g = \sigma\big(w_g^{\top}[\bar{h}; \bar{z}] + b_g\big), \qquad \tilde{H} = g \odot H + (1 - g) \odot Z,$$

where $\bar{h}$ and $\bar{z}$ denote global summaries of $H$ and $Z$ (e.g., mean pooling over time). The gate $g$ is initialized to 0.5 and optimized jointly with the rest of the model, balancing contributions from sequential and knowledge pathways.
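A toy NumPy sketch of this pathway is given below. To keep it short, the cross-attention step is simplified to mean pooling of the knowledge embeddings (an assumption for illustration, not the full module), so the code demonstrates only the two-layer GNN propagation and the scalar gated blend; zero-initializing `w_g` would give the stated starting gate of 0.5.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def knowledge_gate(H, A_hat, E, W1, W2, w_g):
    """Two-layer GNN over variable embeddings, then a scalar gate blending
    temporal features H with a pooled knowledge summary (toy fusion)."""
    K = relu(A_hat @ relu(A_hat @ E @ W1) @ W2)   # (D, d) knowledge embeddings
    h_bar = H.mean(axis=0)                        # global temporal summary
    z_bar = K.mean(axis=0)                        # global knowledge summary
    g = sigmoid(w_g @ np.concatenate([h_bar, z_bar]))  # scalar gate in (0, 1)
    return g * H + (1 - g) * z_bar, g             # broadcast the knowledge term

rng = np.random.default_rng(0)
N, d, D = 6, 4, 5
H = rng.normal(size=(N, d))
A_hat = np.eye(D)                                 # identity graph for the demo
E = rng.normal(size=(D, d))
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
fused, g = knowledge_gate(H, A_hat, E, W1, W2, rng.normal(size=2 * d))
print(fused.shape)  # (6, 4)
```

The gate is a single sigmoid scalar here; in the full model it is learned jointly with both pathways so that the network can down-weight the knowledge branch when the priors are uninformative.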
3.4 Representation fusion with multi-head transformer
Although knowledge-conditioned features enhance cross-variable modeling, a single attention mechanism may still be insufficient to capture the diverse dependency patterns present in long and heterogeneous sequences. To enrich representational capacity, we incorporate an MHA mechanism, which projects the input into multiple subspaces and performs attention operations in parallel. This design allows the model to simultaneously attend to distinct aspects of temporal and relational structure, thereby improving robustness and generalization. The overall architecture is illustrated in Fig 2.
Each attention head performs a scaled dot-product attention using individual query (Q), key (K), and value (V) projections. The outputs from all heads are concatenated and linearly transformed to form the final attention representation, enabling the model to jointly attend to information from different subspaces.
Formally, given the hidden representation $\tilde{H}$ after knowledge integration, the query, key, and value matrices for the m-th head are defined as

$$Q_m = \tilde{H} W_Q^{(m)}, \quad K_m = \tilde{H} W_K^{(m)}, \quad V_m = \tilde{H} W_V^{(m)},$$

where $W_Q^{(m)}$, $W_K^{(m)}$, and $W_V^{(m)}$ are head-specific projection matrices. The output of each head is computed using scaled dot-product attention:

$$\mathrm{head}_m = \mathrm{softmax}\!\left(\frac{Q_m K_m^{\top}}{\sqrt{d_k}}\right) V_m.$$

Outputs from all M heads are concatenated and linearly transformed:

$$\mathrm{MultiHead}(\tilde{H}) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_M)\,W_O,$$

with $W_O$ denoting the output projection. This formulation enables the network to capture complementary patterns from multiple representation subspaces in parallel.
By integrating multi-head attention, the model leverages diverse relational cues and contextual information across temporal positions, providing a more expressive fusion of sequential and knowledge-informed representations. This enriched feature space serves as the foundation for subsequent prediction layers.
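The multi-head computation above can be sketched with an explicit per-head loop (production implementations such as PyTorch's `torch.nn.MultiheadAttention` instead reshape one large projection into heads for efficiency; the loop form is chosen here purely for readability).

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(H, Wq, Wk, Wv, Wo, M=4):
    """Multi-head self-attention: per-head projections, scaled dot-product
    attention, concatenation, and output projection. Wq/Wk/Wv are lists of
    M head-specific (d_model, d_k) matrices."""
    heads = []
    for m in range(M):
        Q, K, V = H @ Wq[m], H @ Wk[m], H @ Wv[m]
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        heads.append(A @ V)                       # (N, d_k) per head
    return np.concatenate(heads, axis=-1) @ Wo    # (N, d_model)

rng = np.random.default_rng(0)
N, d_model, M = 6, 8, 4
d_k = d_model // M
proj = lambda: [rng.normal(size=(d_model, d_k)) for _ in range(M)]
out = multi_head_attention(rng.normal(size=(N, d_model)),
                           proj(), proj(), proj(),
                           rng.normal(size=(M * d_k, d_model)), M=M)
print(out.shape)  # (6, 8)
```

Each head works in a d_model/M-dimensional subspace, so the parameter count matches a single full-width attention layer while allowing M independent dependency patterns.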
4 Experiment
4.1 Data foundations and evaluation methodology
We evaluated the proposed model on seven real-world benchmark datasets (Traffic, Weather, Electricity, ETTh1, ETTh2, ETTm1, and ETTm2), as summarized in Table 1. Their temporal resolutions are hourly for Traffic, Electricity, ETTh1, and ETTh2, 10 minutes for Weather, and 15 minutes for ETTm1 and ETTm2. Each dataset was reformulated as a graph, where nodes represent variables and edges encode statistical associations or domain-specific priors, enabling expressive spatiotemporal learning. All datasets are publicly available at Zenodo (https://doi.org/10.5281/zenodo.17068599).
For preprocessing, data were chronologically divided into training, validation, and test sets in an 8:1:1 ratio to prevent temporal leakage. The input window was fixed at 96, and forecasting horizons of 96 and 192 steps were used. All variables were normalized using z-score statistics computed from the training set, and missing values were imputed via linear interpolation to ensure temporal continuity. Experimental configurations are summarized in Table 2.
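Under the stated preprocessing protocol (chronological 8:1:1 split, z-score statistics from the training portion only, input window 96, horizon 96), the sliding-window construction can be sketched as follows; the toy sine series is illustrative only.

```python
import numpy as np

def make_splits(X, input_len=96, horizon=96):
    """Chronological 8:1:1 split, train-statistics z-score, sliding windows."""
    n = len(X)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train, val, test = X[:n_train], X[n_train:n_train + n_val], X[n_train + n_val:]

    # Normalize with training-set statistics only, avoiding temporal leakage.
    mu, sd = train.mean(axis=0), train.std(axis=0) + 1e-8
    norm = lambda S: (S - mu) / sd

    def windows(S):
        xs, ys = [], []
        for t in range(len(S) - input_len - horizon + 1):
            xs.append(S[t:t + input_len])                       # encoder input
            ys.append(S[t + input_len:t + input_len + horizon]) # forecast target
        return np.stack(xs), np.stack(ys)

    return [windows(norm(s)) for s in (train, val, test)]

X = np.sin(np.linspace(0, 50, 2000))[:, None]   # toy univariate series
(trX, trY), (vaX, vaY), (teX, teY) = make_splits(X)
print(trX.shape, trY.shape)  # (1409, 96, 1) (1409, 96, 1)
```

Splitting before windowing guarantees that no test-period values leak into the training windows or the normalization statistics.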
Model performance was evaluated using MSE and MAE. Additionally, a direction-based accuracy metric was used to assess the proportion of correctly predicted trend directions between consecutive time steps, defined as:
$$\mathrm{DirAcc} = \frac{1}{T-1}\sum_{t=1}^{T-1}\mathbb{1}\big[(y_{t+1} - y_t)(\hat{y}_{t+1} - \hat{y}_t) > 0\big],$$

where $\mathbb{1}[\cdot]$ denotes the indicator function, and $y_t$ and $\hat{y}_t$ are the true and predicted values at time t, respectively. This metric evaluates the correctness of trend prediction rather than magnitude, complementing error-based measures.
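The metric reduces to counting how often the true and predicted first differences share a sign; a minimal implementation (with a strict inequality, so a flat step counts as a miss):

```python
import numpy as np

def direction_accuracy(y_true, y_pred):
    """Fraction of consecutive steps where predicted and true trends agree."""
    dy_true = np.diff(y_true)   # true step-to-step changes
    dy_pred = np.diff(y_pred)   # predicted step-to-step changes
    return float(np.mean((dy_true * dy_pred) > 0))

y  = np.array([1.0, 2.0, 1.5, 1.8, 2.5])
yh = np.array([1.1, 1.9, 1.6, 1.5, 2.4])
print(direction_accuracy(y, yh))  # 0.75 (one of four moves mispredicted)
```

For multivariate outputs the same computation can simply be averaged over variables.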
To ensure statistical robustness, each experiment was repeated three times with different random seeds, and the results are presented as mean ± standard deviation (Std).
4.2 SOTA comparison and mechanistic interpretation
The comparative analysis summarized in Table 3 and illustrated in Fig 3 indicates that KALFormer yields consistently lower error values compared with several representative forecasting architectures, including iTransformer [33], Crossformer [34], DeepTime [35], and TimesNet [36]. Across all benchmark datasets, the model maintains reduced MSE and MAE, reflecting stable predictive behavior under diverse temporal conditions. For example, on the ETTm2 dataset, KALFormer records an MSE of 0.022 and an MAE of 0.115, while Crossformer and DeepTime reach 0.309 and 0.287, respectively. This corresponds to an average improvement of 8.4% in MSE and MAE across the six benchmark datasets. Similar trends across the Traffic, Weather, and Electricity datasets suggest that the framework preserves robustness across distinct forecasting horizons and data characteristics.
Heatmap visualization of MSE and MAE values for different models on six public datasets. Darker blue shades indicate lower error values. KALFormer consistently achieves the lowest MSE and MAE across all datasets, demonstrating superior generalization and robustness.
This performance pattern can be attributed to the complementary functions of its principal modules. The LSTM backbone captures local temporal continuity, while the attention mechanism enhances feature selectivity by emphasizing informative patterns and suppressing redundant signals. The KAN integrates contextual information and refines representation consistency, thereby reducing noise sensitivity and improving robustness. Finally, the Transformer layer aggregates these enriched representations across time and variables, enabling the model to capture long-range dependencies and complex global interactions. Through this coordinated interaction, the framework maintains a balance between localized detail and global temporal context, facilitating more stable and interpretable sequence learning.
A structural interpretation of Fig 3 provides additional evidence of the model’s robustness and consistency. The darkest blue regions, corresponding to the lowest error magnitudes, are uniformly distributed along KALFormer’s columns, indicating not only numerical superiority but also statistical stability across datasets. This distribution reflects a balanced bias–variance configuration enabled by residual normalization and adaptive attention scaling. Consequently, KALFormer achieves a harmonized forecasting mechanism that captures both localized transitions and global temporal patterns, resulting in interpretable, reproducible, and domain-generalizable predictive performance.
4.3 Ablation study and mechanistic validation
The empirical results in Table 4 demonstrate a clear hierarchical improvement as architectural components are progressively integrated. As illustrated in Fig 4, the variations in Mean Squared Error (MSE) and Mean Absolute Error (MAE) across different model configurations further confirm this trend. Starting from the standalone LSTM and Transformer baselines, it can be observed that while the Transformer alone achieves better long-range dependency modeling compared with the LSTM, it still lacks fine-grained temporal continuity. The addition of the attention mechanism enhances temporal selectivity, enabling the network to emphasize informative subsequences and suppress redundant noise. The inclusion of the KAN introduces contextual priors that enrich representation learning and provide structural consistency across variables; however, without temporal calibration, it tends to generate inconsistent feature embeddings. When attention, KAN, and Transformer modules are jointly applied, the model attains both lower error values and reduced output variance, indicating that their coordinated interaction effectively balances local continuity and global dependency modeling, leading to improved generalization and convergence stability.
Each configuration represents a variant of the KALFormer architecture, isolating the contribution of individual modules. Results are reported as mean ± Std over three independent runs. KALFormer achieves the lowest error, confirming the effectiveness of multi-level fusion.
A complementary perspective emerges when observing the convergence dynamics illustrated in Fig 5. The learning curves reveal that the Transformer baseline exhibits more stable loss decay than the recurrent-only LSTM, confirming its advantage in capturing long-range temporal dependencies. Nevertheless, it converges to a suboptimal plateau due to the lack of short-term temporal refinement. When integrated with LSTM and KAN, the optimization trajectory becomes both faster and smoother, suggesting a more favorable optimization landscape. The full architecture achieves higher accuracy and smaller variance across multiple runs, reflecting that structured temporal weighting, knowledge-guided embedding, and cross-layer communication jointly mitigate gradient degradation and stabilize training during extended forecasting horizons.
The figure reports trend-based accuracy (%) and loss (mean ± Std) for each model variant. KALFormer exhibits the highest accuracy (95.19%) and the lowest loss (0.1631), highlighting its predictive precision and training stability compared with other configurations.
Deeper insight into the internal mechanism can be gained from the attention distributions visualized in Fig 6. The short-horizon attention maps on the minute-level Electricity Transformer Temperature (ETTm1) dataset display compact diagonal concentration, corresponding to local temporal dependencies, while the long-horizon scenario reveals extended diagonal patterns associated with periodic or global correlations. In the Electricity dataset, attention spreads across multiple variables, uncovering latent inter-series relationships that enhance the model’s capacity to learn cross-dimensional dynamics. These observations indicate that the model gradually reallocates its attention from localized memory to broader temporal abstraction as the forecasting horizon increases.
Normalized attention maps of KALFormer on ETTm1 and Electricity datasets for 96- and 192-step forecasts, showing a shift from local diagonal focus to broader periodic patterns across variables.
The comparative visualization in Fig 7 provides additional evidence of how attention allocation evolves across different temporal scales. The diffusion of attention energy from narrow peaks toward more distributed regions implies an adaptive rebalancing between short-term precision and long-term contextual awareness. Such behavior reflects a coherent integration of relational knowledge and multi-scale attention, mediated through Transformer fusion. Collectively, these analyses suggest that KALFormer’s performance arises from a systematic coordination among its modules, where temporal alignment, contextual embedding, and attention modulation jointly enhance both interpretability and predictive consistency.
As the forecasting horizon extends, attention becomes more diffuse, reflecting adaptive balance between short-term precision and global contextual awareness.
4.4 Complexity and efficiency comparison
To comprehensively demonstrate the practical feasibility and computational efficiency of KALFormer, we conduct a detailed comparison with the baseline LSTM and BiLSTM models across four key aspects: forecasting accuracy, model complexity, training time, and inference latency. Table 5 reports the MSE, MAE, parameter count, total training duration, and average inference latency on seven representative benchmark datasets.
The results indicate that KALFormer consistently achieves lower MSE and MAE than both LSTM and BiLSTM, underscoring its superior predictive capability for long-term time series forecasting. While the parameter count of KALFormer is approximately five to six times larger than that of BiLSTM due to the incorporation of knowledge-augmented and attention modules, this increase in model size is well justified by the significant improvements in accuracy. Moreover, although training time naturally grows with the enhanced architecture, the inference latency remains sub-millisecond across all datasets. This balance between predictive performance and computational efficiency suggests that the additional complexity introduced by KALFormer does not impose a prohibitive cost in real-time applications, confirming its practical suitability for robust time series forecasting in dynamic environments.
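Inference latency of the kind reported in Table 5 is typically measured with a wall-clock harness that discards warm-up iterations. The sketch below is a minimal, hypothetical harness, not taken from the paper's released code; `measure_latency` and `dummy_forward` are illustrative stand-ins for a trained model's forward pass.

```python
import time
import statistics

def measure_latency(fn, *args, warmup=10, runs=200):
    """Average wall-clock latency of fn(*args) in milliseconds.

    Warm-up calls are discarded so one-time costs (caching, lazy
    initialization) do not inflate the reported average.
    """
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.mean(samples)

# Hypothetical stand-in for a model's forward pass on a 96-step input window.
dummy_forward = lambda window: sum(x * x for x in window)
latency_ms = measure_latency(dummy_forward, list(range(96)))
```

Averaging over many runs with `time.perf_counter` keeps the measurement robust to scheduler jitter, which matters when the quantity of interest is sub-millisecond.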
5 Conclusion
This study presents KALFormer, a knowledge-augmented attention learning framework for long-term time series forecasting that integrates sequential modeling, attention mechanisms, and knowledge-driven feature enhancement. By combining LSTM-based temporal encoding, self-attention for global dependency modeling, and a knowledge-augmented Transformer fusion strategy, the model effectively reconciles local precision with global contextual reasoning. Experiments on multiple benchmark datasets verify the consistent superiority of KALFormer in terms of accuracy, robustness, and adaptability, emphasizing the complementary roles of attention and Transformer modules in mitigating the limitations of recurrent architectures and enhancing nonlinear dependency learning. Although the framework demonstrates strong generalization, future extensions may focus on adaptive knowledge fusion, scalability to cross-domain or irregular data, and deployment in dynamic streaming environments. Overall, the proposed approach contributes a flexible and interpretable paradigm for developing resilient forecasting models applicable to diverse real-world temporal systems.