Figures
Abstract
Time series forecasting is essential in energy, finance, and meteorology. However, existing Transformer-based models face challenges with computational inefficiency and poor generalization for long-term sequences. To address these issues, this study proposes the KEDformer framework, which integrates knowledge extraction and seasonal-trend decomposition to optimize model performance. By leveraging sparse attention and autocorrelation, KEDformer reduces computational complexity from O(L²) to O(L log L), enhancing the model's ability to capture both short-term fluctuations and long-term patterns. Experiments on five public datasets covering energy, transportation, and weather tasks demonstrate that KEDformer consistently outperforms traditional models, with average improvements of 10.4% in MSE and 2.9% in MAE prediction accuracy.
Citation: Qin Z, Wei B, Gao C, Ni J (2025) KEDformer: Knowledge extraction seasonal trend decomposition for long-term sequence prediction. PLoS One 20(10): e0335047. https://doi.org/10.1371/journal.pone.0335047
Editor: Ayesha Maqbool, National University of Sciences and Technology NUST, PAKISTAN
Received: February 3, 2025; Accepted: October 6, 2025; Published: October 24, 2025
Copyright: © 2025 Qin et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All five datasets (ETT, Electricity, Exchange, Traffic, Weather) are available from the Figshare database (accession number: https://figshare.com/s/bee931e38c4b9e9a935e).
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Long-term forecasting plays a critical role in decision-making domains such as transportation logistics [1], healthcare monitoring [2,3], utility management [4,5], and energy optimization [6,7]; however, as the forecasting horizon increases, so do computational demands and the complexity of modeling temporal dependencies. Traditional time series decomposition methods, while useful, often rely on linear assumptions, limiting their effectiveness in handling complex multivariate scenarios or unpredictable, non-stationary data patterns, and they struggle to capture the interplay between components such as trends, seasonality, and irregularities [8]. Recent advancements have integrated deep learning approaches into the decomposition process to improve forecasting accuracy [9,10], leveraging representation learning and nonlinear transformations to better capture dynamic dependencies and multi-scale interactions within time series data [11,12]. More recently, Transformers have excelled in various tasks, such as computer vision (CV) [13,14], natural language processing (NLP) [15], and time series forecasting, due to their powerful modeling capabilities and flexibility; in long-term forecasting tasks, however, they face significant challenges. The computational complexity of the traditional self-attention mechanism is O(L²), where L represents the sequence length, leading to an increasing demand for memory and computational resources and limiting applicability in resource-constrained or real-time analysis scenarios. To address this, many researchers have improved the self-attention mechanism to reduce the computational complexity of long-term forecasting and enhance the applicability of models in practical scenarios [16–19]. Additionally, Transformers often struggle to model long-term dependencies effectively due to noise interference, where irrelevant information weakens the attention distribution and degrades overall performance [19,20].
Consequently, capturing long-term dependencies in time series data while ensuring computational efficiency over extended prediction horizons remains a significant challenge.
To address the challenges in long-term time series forecasting, we propose an end-to-end Knowledge Extraction Decomposition (KEDformer) framework. The core of this framework is the Knowledge Extraction Attention module (KEDA, blue block), which reduces model parameters through autocorrelation and sparse attention mechanisms. The autocorrelation mechanism estimates the correlations of subsequences within a specific time period, while the sparse attention filters the weight matrix of these correlations, thereby reducing computational overhead and mitigating the interference from irrelevant features. This design reduces the computational complexity from O(L²) to O(L log L), significantly decreasing memory usage and enhancing the model's ability to process long sequences. Additionally, KEDformer integrates a mixed time series pooling decomposition method (MSTP, yellow block) that decomposes the input time series data into seasonal and trend components, further improving prediction accuracy. This approach captures both short-term fluctuations and long-term patterns, making the predictions more consistent with real-world temporal dynamics. Therefore, the KEDformer framework not only addresses the computational bottleneck of traditional Transformers in long-term forecasting but also enhances their performance and robustness in complex sequence tasks. In summary, the contributions of this study are as follows:
- We introduce a knowledge extraction mechanism that combines sparse attention and autocorrelation to reduce the computational cost of the self-attention layer. This mechanism reduces the computational overhead from quadratic O(L²) to near-linear O(L log L) complexity.
- Furthermore, by employing seasonal-trend decomposition, KEDformer effectively captures both long-term trends and seasonal patterns, overcoming the limitations of the Transformer model in capturing long-term dependencies.
- Extensive experiments on five public datasets demonstrate the effectiveness and competitiveness of the proposed KEDformer, which outperforms all previous Transformer-based models across various forecasting applications.
2 Related work
2.1 Transformer-based long-term time series forecasting
Transformer-based models have demonstrated exceptional performance in time series forecasting due to their powerful self-attention mechanism and parallel processing capabilities, excelling at capturing long-term dependencies and handling long-sequence data [21,22]. However, traditional Transformer models still face several challenges in time series forecasting, such as high computational complexity and difficulty addressing noise in long-term dependencies [15,18]. For instance, the core self-attention mechanism exhibits quadratic computational complexity with respect to sequence length, which limits its efficiency in long-sequence tasks [23].
To overcome these limitations, various advancements have been proposed in recent years. For example, Synthesizer [24] investigated the importance of dot-product interactions, introduced randomly initialized, learnable attention mechanisms, and demonstrated competitive performance in specific tasks. Furthermore, FNet [25] replaced self-attention with Fourier transforms, showcasing its effectiveness in mixing sequence features. Another approach utilized Gaussian distributions to construct attention weights, enabling a focus on local windows and improving the performance of models in capturing local dependencies [26]. Pyraformer employed a pyramid attention structure to address the complexity of handling long-range dependencies, while TFT integrated multivariate features and time-varying information to improve multi-step forecasting [27]. More recently, Informer introduced the ProbSparse attention mechanism and distillation techniques, reducing computational complexity to O(L log L) and significantly improving efficiency. LogTrans employed logarithmic sparse attention to further alleviate the computational burden of long-sequence predictions [28], while AST combined adversarial training and sparse attention to enhance robustness in complex scenarios [29]. ACSAformer [30] enhances the modeling capability for complex high-dimensional data by integrating sparse attention and adaptive graph convolution, and SFDformer [31] improves the modeling and prediction of complex dynamic features by integrating Fourier transform, time series decomposition, and sparse attention mechanisms. Additionally, Autoformer leveraged time series decomposition and autocorrelation mechanisms for long-term sequence forecasting [16,32], and FEDformer utilized frequency-domain enhancements to optimize performance on long sequences [29,33]. We summarize the main characteristics and advantages of some published studies in the literature in Table 1.
Although these studies have made progress in optimizing computational efficiency and capturing long-term dependencies, they still show instability in modeling complex long-term and non-periodic dependencies.
Unlike previous studies, the proposed KEDformer integrates sparse attention mechanisms with autocorrelation strategies at the architectural level and introduces a dominant weight distribution selection mechanism. This mechanism dynamically selects representative weights from the sparse attention matrix, thereby enhancing the model’s ability to capture key dependencies in time series. It not only inherits the computational efficiency of sparse attention but also alleviates the potential information loss often seen in traditional sparse schemes when modeling global contexts. Compared with Informer, which implements sparse attention through probabilistic sampling, KEDformer combines dominant distribution selection with explicit autocorrelation enhancement to improve the stability of modeling long-range, non-periodic dependencies. Unlike Autoformer, which entirely replaces attention with autocorrelation, KEDformer retains the sparse attention backbone and embeds auxiliary autocorrelation modules, forming a complementary modeling framework. This collaborative fusion strategy achieves a balance between performance and efficiency, demonstrating structural innovation and practical adaptability beyond existing methods.
2.2 Decomposition of time series
Time series decomposition is a traditional approach that breaks down time series data into components such as trend, seasonality, and residuals, revealing the intrinsic patterns within the data [34,35]. Among traditional methods, ARIMA [36] uses differencing and parameterized modeling to decompose and forecast non-stationary time series effectively. In contrast, the Prophet model combines trend and seasonal components while accommodating external covariates [29], making it suitable for modeling complex time series patterns. Matrix decomposition-based methods, such as DeepGLO, extract global and local features through low-rank matrix decomposition, while N-BEATS employs a hierarchical structure to dissect trends and periodicity [35]. However, these approaches primarily focus on the static decomposition of historical sequences and often fall short in capturing dynamic interactions for future forecasting.
More recently, deep learning models have increasingly incorporated time series decomposition to enhance predictive power. For example, Autoformer introduces an embedded decomposition module, treating trend and seasonal components as core building blocks to achieve progressive decomposition and forecasting [37]. FEDformer combines Fourier and wavelet transforms to decompose time series into components of varying frequencies, capturing global characteristics and local structures while significantly reducing computational complexity and improving the accuracy of long-sequence predictions. Similarly, ETSformer [38] adopts a hierarchical decomposition framework inspired by exponential smoothing, segmenting time series into level, growth, and seasonality components [39]. By integrating exponential smoothing attention and frequency-domain attention mechanisms, ETSformer effectively extracts key features, demonstrating superior performance across multiple datasets [40]. Inspired by these studies, our proposed KEDformer approach integrates decomposition modules dynamically with a progressive decomposition strategy, not only significantly improving computational efficiency but also enabling the simultaneous modeling of both short-term and long-term patterns.
3 Methodology
3.1 Background
The Long Sequence Time Forecasting (LSTF) problem is defined within a rolling forecasting setup, where predictions over an extended future horizon are made based on past observations within a fixed-size window [23]. At each time point t, the input sequence consists of observations with multiple feature dimensions, X^t = {x_1^t, …, x_{Lx}^t | x_i^t ∈ ℝ^{dx}}, and the output sequence Y^t = {y_1^t, …, y_{Ly}^t | y_i^t ∈ ℝ^{dy}} predicts multiple future values. The output length Ly is intentionally set to be relatively long to capture complex dependencies over time. This setup enables the model to predict multiple attributes, making it well-suited for multivariate time series applications.
3.2 Data decomposition
In this section, we will cover the following aspects of KEDformer: (1) the decomposition process designed to capture seasonal and trend components in time series data; and (2) the architecture of the KEDformer encoder and decoder.
Time series decomposition. To capture complex temporal patterns in long-term predictions, we utilize a decomposition approach that separates sequences into trend, cyclical, and seasonal components. These components correspond to the long-term progression and seasonal variations inherent in the data. However, directly decomposing future sequences is impractical due to the uncertainty of future data. To address this challenge, we introduce a novel internal operation within the sequence decomposition block, referred to as the autocoupling mechanism in KEDformer, as shown in Fig 1. This mechanism enables the progressive extraction of long-term stationary trends from predicted intermediate hidden states. Specifically, we adjust the moving average to smooth periodic fluctuations and emphasize the long-term trends. For the length-L input sequence X ∈ ℝ^{L×d}, the procedure is as follows:

X_t = AvgPool(Padding(X)),
X_s = X − X_t,

where X_s and X_t represent the seasonal part and the extracted trend component, respectively. We use AvgPool(·) with padding for the moving average and filling operations to maintain a constant sequence length. We summarize the above process as X_s, X_t = SeriesDecomp(X), which is a within-model block.
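As a concrete illustration of the block above, a minimal NumPy sketch of SeriesDecomp is given below; the function name, the replication-style padding, and the default kernel size are illustrative assumptions, not the authors' released implementation:

```python
import numpy as np

def series_decomp(x, kernel_size=25):
    """Split a 1-D series into seasonal and trend parts.

    The trend is a centered moving average; edge values are replicated
    (the 'filling operation') so the output keeps the input length L.
    The seasonal part is the residual x - trend.
    """
    pad = (kernel_size - 1) // 2
    x_padded = np.concatenate(
        [np.full(pad, x[0]), x, np.full(kernel_size - 1 - pad, x[-1])]
    )
    kernel = np.ones(kernel_size) / kernel_size
    trend = np.convolve(x_padded, kernel, mode="valid")  # length L
    seasonal = x - trend
    return seasonal, trend
```

By construction, seasonal + trend reconstructs the input exactly, and for a smooth ramp the interior of the trend coincides with the series itself.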
Model input.
In Fig 2, the encoder's input consists of the past I time steps, denoted as X_en ∈ ℝ^{I×d}. In the decomposition architecture, the input to the decoder is composed of both a seasonal component, X_des ∈ ℝ^{(I/2+O)×d}, and a trend-cyclical component, X_det ∈ ℝ^{(I/2+O)×d}, both of which are subject to further refinement. Each initialization consists of two elements: (1) the decomposed component derived from the latter half of the encoder's input, X_en,I/2:I, of length I/2, which provides recent information, and (2) placeholders of length O, filled with scalar values. The formulation is as follows:

X_ens, X_ent = SeriesDecomp(X_en,I/2:I),
X_des = Concat(X_ens, X_0),
X_det = Concat(X_ent, X_mean),

where X_ens and X_ent ∈ ℝ^{(I/2)×d} denote the seasonal and trend-cyclical components of X_en,I/2:I, respectively. The placeholders, labeled as X_0 and X_mean ∈ ℝ^{O×d}, are populated with zeros and the mean values of X_en, respectively.
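The initialization above can be sketched as follows for a single variable; `decoder_init` and its arguments are hypothetical names, and the real model operates on embedded multivariate tensors rather than 1-D arrays:

```python
import numpy as np

def series_decomp(x, kernel_size=25):
    # Moving-average trend with edge replication; seasonal = residual.
    pad = (kernel_size - 1) // 2
    xp = np.concatenate([np.full(pad, x[0]), x, np.full(kernel_size - 1 - pad, x[-1])])
    trend = np.convolve(xp, np.ones(kernel_size) / kernel_size, mode="valid")
    return x - trend, trend

def decoder_init(x_enc, pred_len, kernel_size=25):
    """Build the decoder's seasonal/trend seeds.

    The latter half of the encoder input is decomposed (recent information);
    the O-step placeholders are zeros for the seasonal seed and the mean of
    the encoder input for the trend seed, as described in the text.
    """
    half = x_enc[len(x_enc) // 2:]
    seasonal, trend = series_decomp(half, kernel_size)
    x_des = np.concatenate([seasonal, np.zeros(pred_len)])
    x_det = np.concatenate([trend, np.full(pred_len, x_enc.mean())])
    return x_des, x_det
```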
Encoder.
In Fig 2, the encoder follows a multilayer architecture, defined as X_en^l = Encoder(X_en^{l−1}), where X_en^l represents the output of the l-th encoder layer. The initial input, X_en^0, corresponds to the embedded historical time series. The Encoder function, Encoder(·), is formally expressed as:

S_en^{l,1}, _ = SeriesDecomp(KEDA(X_en^{l−1}) + X_en^{l−1}),
S_en^{l,2}, _ = SeriesDecomp(FeedForward(S_en^{l,1}) + S_en^{l,1}),

where X_en^l = S_en^{l,2}, and S_en^{l,i}, i ∈ {1, 2}, represents the seasonal component after the i-th decomposition block in the l-th layer; the trend parts (denoted "_") are discarded in the encoder.
Decoder.
In Fig 2, the decoder has two roles: the accumulation of the trend time series part and the knowledge extraction stacking of the seasonal time series part. For example, X_de^l = Decoder(X_de^{l−1}, X_en^N), where X_de^l represents the output of the l-th decoder layer. The decoder is formalized as:

S_de^{l,1}, T_de^{l,1} = MSWTDecomp(KEDA(X_de^{l−1}) + X_de^{l−1}),
S_de^{l,2}, T_de^{l,2} = MSWTDecomp(KEDA(S_de^{l,1}, X_en^N) + S_de^{l,1}),
S_de^{l,3}, T_de^{l,3} = MSWTDecomp(FeedForward(S_de^{l,2}) + S_de^{l,2}),    (11)
X_de^l = S_de^{l,3},    (12)
T_de^l = T_de^{l−1} + W_{1,l}·T_de^{l,1} + W_{2,l}·T_de^{l,2} + W_{3,l}·T_de^{l,3},    (13)

Eq (11) denotes the final decomposition stage within the decoder, where the output from the previous residual block, S_de^{l,2}, is passed through a feed-forward layer and added back via a residual connection. The resulting sequence is then processed by the MSWTDecomp function, which applies multi-scale wavelet transform-based decomposition to extract seasonal (S_de^{l,3}) and trend (T_de^{l,3}) components. This design ensures that each decoder layer can refine the representation by isolating frequency-aware patterns, thereby improving long-horizon forecasting accuracy.

In Eq (12), the final seasonal component S_de^{l,3}, obtained from the MSWTDecomp operation, is directly used as the decoder's output representation X_de^l. This substitution simplifies the decoder architecture by avoiding explicit trend aggregation in the seasonal stream and emphasizes the dominant short-term patterns captured via multi-scale decomposition, which are particularly useful for downstream forecasting tasks.

In this context, S_de^{l,i} and T_de^{l,i}, where i ∈ {1, 2, 3}, represent the seasonal and trend components, respectively, after the i-th decomposition block within the l-th layer. The matrix W_{i,l}, where i ∈ {1, 2, 3}, serves as the projection matrix for the i-th extracted trend component, T_de^{l,i}.
To improve the flexibility of trend modeling, the decoder weight matrix in Eq (13) is designed as a learnable parameter and initialized using Xavier uniform initialization to ensure numerical stability at the early stage of training. During training, this matrix is updated via backpropagation to adaptively capture the evolving trend across multiple time steps. Additionally, to mitigate the risk of error accumulation during multi-step trend extrapolation, we adopt a residual-style skip connection in the decoder and apply L2 regularization and early stopping. These strategies help to stabilize trend prediction and enhance robustness against long-term forecast deviations.
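For reference, Xavier uniform initialization draws each weight from U(−a, a) with a = √(6 / (fan_in + fan_out)); a minimal sketch (the helper name is ours, not the paper's API):

```python
import numpy as np

def xavier_uniform(shape, rng=None):
    """Glorot/Xavier uniform init: U(-a, a) with a = sqrt(6 / (fan_in + fan_out)).

    Keeps the variance of activations and gradients roughly constant across
    layers, which stabilizes the early stage of training.
    """
    rng = rng or np.random.default_rng(0)
    fan_in, fan_out = shape
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=shape)
```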
3.3 Knowledge extraction process
3.3.1 Autocorrelation function.
To enhance the capability of modeling long-range dependencies in time series, we introduce the autocorrelation function (ACF) as an auxiliary mechanism. ACF measures the correlation between a time series and a lagged version of itself, allowing the model to explicitly encode repeating patterns or structural temporal regularities.
Given a time series {x_t}, t = 1, …, L, its autocorrelation at lag l is defined as:

ρ(l) = Σ_{t=1}^{L−l} (x_t − μ)(x_{t+l} − μ) / Σ_{t=1}^{L} (x_t − μ)²,

where μ is the mean of the time series. This function quantifies how past values influence future values at different time scales.
In our framework, ACF values are computed for each time segment and used to weight candidate connections in the attention mechanism. Specifically, when constructing the query-key similarity map, we incorporate a weighting term based on ACF to prioritize temporally correlated elements. The combined similarity score becomes:

Score(q_i, k_j) = q_i k_j^T / √d + λ·ρ(|i − j|),

where λ controls the balance between dot-product similarity and autocorrelation guidance.
This integration ensures that attention weights are not only based on instantaneous token-level similarity but also reflect broader temporal structure, especially beneficial for periodic or structured sequences. In summary, the ACF module enriches the temporal modeling capacity by explicitly introducing interpretable lag-based correlations into the attention computation.
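A compact sketch of this ACF-guided scoring, assuming the lag between positions i and j is |i − j| and using a sample autocorrelation estimate (the function names and the default λ are illustrative):

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation rho(l) for lags 0..max_lag-1."""
    x = x - x.mean()
    denom = np.dot(x, x) + 1e-12
    return np.array([np.dot(x[: len(x) - l], x[l:]) / denom for l in range(max_lag)])

def guided_scores(Q, K, series, lam=0.5):
    """Dot-product similarity blended with a lag-based autocorrelation prior:
    Score(q_i, k_j) = q_i k_j^T / sqrt(d) + lam * rho(|i - j|)."""
    L, d = Q.shape
    dots = Q @ K.T / np.sqrt(d)
    lags = np.abs(np.arange(L)[:, None] - np.arange(L)[None, :])
    return dots + lam * acf(series, L)[lags]
```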
3.3.2 Self-attention mechanism.
The canonical self-attention mechanism is defined by the tuple inputs Q, K, and V, which correspond to the query, key, and value matrices, respectively. This mechanism performs scaled dot-product attention, computed as:

A(Q, K, V) = Softmax(QK^T / √d)·V,

where Q ∈ ℝ^{L_Q×d}, K ∈ ℝ^{L_K×d}, and V ∈ ℝ^{L_K×d}, with d representing the input dimension. To further analyze the self-attention mechanism, we focus on the attention distribution of the i-th query, denoted as q_i, which is based on an asymmetric kernel smoother. The attention for the i-th query is formulated in probabilistic terms:

A(q_i, K, V) = Σ_j [k(q_i, k_j) / Σ_l k(q_i, k_l)]·v_j = Σ_j p(k_j | q_i)·v_j,

where p(k_j | q_i) = k(q_i, k_j) / Σ_l k(q_i, k_l), and k(q_i, k_j) represents the asymmetric exponential kernel exp(q_i k_j^T / √d). This self-attention mechanism combines the values and produces outputs by computing the probability p(k_j | q_i). However, this process involves quadratic dot-product computations, resulting in a complexity of O(L²), which poses a significant limitation in memory usage, particularly for models designed to enhance predictive capacity.
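The canonical mechanism and its L×L score matrix can be sketched as follows; returning the probability matrix P alongside the output makes the quadratic memory footprint explicit:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention.

    The scores matrix is L_Q x L_K, so time and memory grow quadratically
    with sequence length; each row of P is the distribution p(k_j | q_i).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V, P
```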
3.3.3 Knowledge selection.
Knowledge Selection refers to selecting the most representative queries from multiple candidates for knowledge extraction. By measuring the distance between the query distribution and a reference distribution, the relevance of each query can be evaluated, enabling the identification and utilization of important queries. From Eq (15), the attention of the i-th query across all keys is represented as a probability distribution p(k_j | q_i), where the output is computed by aggregating the values v weighted by this probability. High dot-product values between query-key pairs lead to a non-uniform attention distribution, as dominant query-key pairs shift the attention probability away from a uniform distribution. If p(k_j | q_i) closely resembles the uniform distribution, q(k_j | q_i) = 1/L_K, then the self-attention essentially produces an averaged summation over the values v, diminishing the significance of individual values.

To mitigate this, we introduce a knowledge extraction mechanism that evaluates the similarity between the attention probability p and a baseline distribution q using the Kullback-Leibler (KL) divergence [41,42]. This measure effectively reduces the influence of less significant queries. The similarity between p and q for the i-th query is computed as:

KL(q‖p) = ln Σ_{j=1}^{L_K} e^{q_i k_j^T/√d} − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d − ln L_K.

From this, we define the distillation measure M(q_i, K) for the i-th query as:

M(q_i, K) = ln Σ_{j=1}^{L_K} e^{q_i k_j^T/√d} − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d.
A larger M(qi,K) value indicates that the i-th query has a more diverse attention distribution, potentially focusing on dominant dot-product pairs in the tail of the self-attention output. This approach allows the model to prioritize influential query-key pairs, thereby improving the overall effectiveness of the knowledge extraction process.
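A sketch of the measure and the resulting query selection, written with a numerically stable log-sum-exp (the function names are ours, not the paper's API):

```python
import numpy as np

def sparsity_measure(Q, K):
    """M(q_i, K): log-sum-exp of the scaled scores minus their arithmetic mean.

    By Jensen's inequality M >= ln(L_K) > 0; a larger M means the query's
    attention distribution is farther from uniform, i.e. more informative.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    m = scores.max(axis=1)
    lse = m + np.log(np.exp(scores - m[:, None]).sum(axis=1))  # stable LSE
    return lse - scores.mean(axis=1)

def top_u_queries(Q, K, u):
    """Indices of the u queries with the largest measure, best first."""
    return np.argsort(sparsity_measure(Q, K))[-u:][::-1]
```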
3.3.4 Decoupled knowledge extraction.
Period-based dependencies. The period-based dependencies are quantified using the autocorrelation function, which measures the similarity between different time points in a time series, revealing its underlying periodic characteristics. For a discrete time series {X_t}, the autocorrelation function is defined as:

R_XX(τ) = (1/L) Σ_{t=1}^{L} X_t·X_{t−τ},

where τ represents the time lag, L is the total length of the series, and X_t and X_{t−τ} are the values at the current and lagged time points, respectively. The autocorrelation function computes the cumulative similarity over lagged time intervals, reflecting the degree of self-similarity within the series for various time delays. Peaks in the autocorrelation values indicate potential periodicity and help identify the likely period lengths.
By identifying the peaks of the autocorrelation function, the most probable period lengths can be determined. These period lengths not only capture the dominant periodic patterns in the series but also serve as weighted features, enhancing interpretability and predictive capabilities.
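Peak-based period detection can be sketched with the Wiener-Khinchin identity, which obtains the autocorrelation as the inverse FFT of the power spectrum in O(L log L); the zero-padding and top-k selection below are illustrative choices:

```python
import numpy as np

def top_periods(x, k=2):
    """Return the k lags with the highest autocorrelation.

    Wiener-Khinchin: ACF = inverse FFT of the power spectrum.  Zero-padding
    to length 2L avoids circular wrap-around; lag 0 is excluded since it is
    a trivial maximum.
    """
    x = x - x.mean()
    L = len(x)
    f = np.fft.rfft(x, n=2 * L)
    r = np.fft.irfft(f * np.conj(f))[:L]   # linear autocorrelation, lags 0..L-1
    r[0] = -np.inf
    return np.argsort(r)[-k:][::-1]
```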
Time-delay aggregation. The time-delay aggregation method for knowledge acquisition focuses on estimating the correlation of sub-sequences within a specific period. Therefore, we propose an innovative time-delay aggregation module that can perform hierarchical convolution operations on sub-sequences based on the selected time delays τ_1, …, τ_k, thereby narrowing down the key knowledge weight matrix. This process captures sub-sequences from the same location and similar positions within the period, extracting the potential key-weight aggregation matrix. Finally, we apply the Softmax function to normalize the weights, enhancing the accuracy of sub-sequence aggregation.
For a time series x of length L, after projection and filtering of the weight matrix, we obtain the query Q, key K, and value V. The knowledge extraction attention mechanism is then as follows:

τ_1, …, τ_k = argTopk_{τ ∈ {1, …, L}} (R_{Q,K}(τ)),
R̂_{Q,K}(τ_1), …, R̂_{Q,K}(τ_k) = Softmax(R_{Q,K}(τ_1), …, R_{Q,K}(τ_k)),
KEDA(Q, K, V) = Σ_{i=1}^{k} Roll(V, τ_i)·R̂_{Q,K}(τ_i),

where argTopk(·) is used to obtain the top k parameters of self-attention, and we let k = ⌊c·log L⌋, where c is a hyperparameter. R_{Q,K} represents the self-attention matrix between sequences Q and K, and Top_u selects the most important u queries in the weight matrix. R̂_{Q,K} represents the self-attention matrix after filtering between sequences Q and K. Roll(X, τ) denotes the operation of temporally shifting X by τ, where the elements shifted out from the front are reintroduced at the end. For the encoder-decoder attention, K and V come from the encoder output X_en^N and are adjusted to length O, with Q originating from the previous block of the decoder.
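A minimal sketch of the aggregation step, assuming Roll(V, τ) corresponds to `np.roll` with a negative shift so that elements leaving the front re-enter at the end (names are illustrative):

```python
import numpy as np

def time_delay_agg(V, corr, k=2):
    """Aggregate rolled copies of the values at the top-k correlation lags.

    The selected correlations are softmax-normalized to form the weights
    of the weighted sum over the rolled value sequences.
    """
    lags = np.argsort(corr)[-k:]
    w = np.exp(corr[lags] - corr[lags].max())
    w /= w.sum()                       # softmax over the selected lags
    return sum(wi * np.roll(V, -int(tau), axis=0) for wi, tau in zip(w, lags))
```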
In summary, as illustrated in Fig 3, the collaborative interaction between the Knowledge Extraction Attention module and the time-series pooling decomposition method markedly enhances the predictive efficiency of KEDformer within the overall architecture (Fig 4).
The input length is set to I = 96, and the prediction lengths are O ∈ {96, 192, 336, 720}.
4 Experiment
4.1 Datasets
In order to evaluate the effectiveness of the proposed KEDformer model, five public datasets were employed: four periodic datasets and one non-periodic dataset. They encompass a variety of tasks and are described in detail as follows:
- ETT [43]: This dataset comprises four sub-datasets (ETTh1, ETTh2, ETTm1, and ETTm2). The data in ETTh1 and ETTh2 were sampled every hour, while the data in ETTm1 and ETTm2 were sampled every 15 minutes. These datasets include load and oil temperature measurements collected from power transformers between July 2016 and July 2018.
- Electricity [44]: This dataset contains hourly electricity consumption data from 321 customers, spanning 2012 to 2014.
- Exchange Rates [45]: This non-periodic dataset records the daily exchange rates of eight countries from 1990 to 2016.
- Traffic [46]: This dataset consists of hourly traffic data from the California Department of Transportation, capturing road occupancy through various sensors on the Bay Area Highway.
- Weather [47]: This dataset includes meteorological data recorded every 10 minutes throughout 2020, with 21 indicators such as temperature and humidity.
In accordance with standard protocols, all datasets were chronologically split into training, validation, and test sets. The ETT dataset was partitioned using a 6:2:2 ratio [43], while the other datasets followed a 7:1:2 split [44–47].
4.2 Implementation details
Under the experimental configuration summarized in Table 2, this study embeds residual connections into the decomposition module of the Transformer-based model [32], thereby enhancing its capability to model time series, smoothing periodic fluctuations, and emphasizing long-term trends. The model is optimized using the L2 loss function and the ADAM optimizer [48], with a fixed random seed set to ensure reproducibility. Hyperparameters are systematically tuned on the validation set via a grid search strategy [49]. Specifically, we evaluated all combinations of key parameters and adopted MSE and MAE as performance metrics, with the final optimal configuration summarized in Table 3.
To mitigate overfitting and improve generalization, two complementary strategies were employed: (1) an early stopping mechanism, in which training was terminated if validation performance did not improve for 10 consecutive epochs; and (2) regularization, achieved by introducing dropout in critical layers and applying a weight decay term of 0.1. These measures effectively reduced the risk of over-parameterization and enhanced the robustness of the model in long-horizon forecasting tasks.
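The early-stopping rule described above can be sketched as a small stateful helper (a patience of 10 as in the text; the `min_delta` tolerance is an illustrative addition):

```python
class EarlyStopping:
    """Stop training when validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        # Returns True when training should stop.
        if val_loss < self.best - self.min_delta:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```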
It is noteworthy that long-term sequence forecasting is inherently more prone to overfitting than short-term tasks, as the model must simultaneously capture local perturbations and global trends over extended horizons. Excessive reliance on training-specific patterns may therefore impair generalization. To address this issue, we further assessed model robustness through repeated experiments with different random seeds. Specifically, all experiments were conducted three times, and the mean and standard deviation were reported to ensure the reliability and reproducibility of the results.
4.3 Baselines
We evaluated eleven representative baseline models for time-series forecasting. Among them, Transformer-based models include Autoformer [32], Informer [17], Reformer [50], Pyraformer [51], FEDformer [28], and LogTrans [18]; recurrent neural network (RNN)–based models comprise LSTNet [52], LSTM [53], and DeepAR [54]; the convolutional neural network (CNN)–based model is TCN [55]; and statistical decomposition and linear models include Prophet [56] and ARIMA [57]. These baselines cover a broad spectrum of mainstream forecasting paradigms, from long-range dependency modeling and frequency-domain decomposition to seasonal-trend analysis, thereby providing a comprehensive and systematic benchmark for this study.
4.4 Performance comparison
The Mean Squared Error (MSE) emphasizes large deviations by penalizing them more heavily, while the Mean Absolute Error (MAE) provides a more intuitive measure of the average prediction bias. As both metrics are widely used and complementary in time series forecasting tasks, we adopt MSE and MAE as the two primary indicators to evaluate the predictive accuracy of our model. Their formulations are given in Eqs (25) and (26):

MSE = (1/n) Σ_{x=1}^{n} (y_x − ŷ_x)²,    (25)
MAE = (1/n) Σ_{x=1}^{n} |y_x − ŷ_x|,    (26)

In the formulas of MSE and MAE, n is the total number of samples, y_x is the true value of the x-th sample, and ŷ_x is the predicted value of the x-th sample.
In multi-step forecasting tasks, errors are first computed at each prediction horizon and then averaged across the entire forecast range. Let O denote the prediction length, N the number of test samples, and D the number of variables. The aggregated Mean Squared Error (MSE) and Mean Absolute Error (MAE) are formulated as follows:

MSE = (1/(N·O·D)) Σ_{i=1}^{N} Σ_{t=1}^{O} Σ_{d=1}^{D} (y_{i,t,d} − ŷ_{i,t,d})²,
MAE = (1/(N·O·D)) Σ_{i=1}^{N} Σ_{t=1}^{O} Σ_{d=1}^{D} |y_{i,t,d} − ŷ_{i,t,d}|,

Here, y_{i,t,d} and ŷ_{i,t,d} represent the ground-truth and predicted values of the i-th sample at forecasting step t and variable dimension d, respectively. For each dataset, the final MSE and MAE are obtained by aggregating across all time steps and all variables in the test set.
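Because the aggregation is a plain mean over samples, horizon steps, and variables, the computation reduces to a few lines (the function name is ours):

```python
import numpy as np

def aggregate_mse_mae(y_true, y_pred):
    """MSE and MAE averaged over all axes: samples (N), steps (O), variables (D).

    Inputs are arrays of shape (N, O, D); the same code also handles the
    single-sample or univariate cases by broadcasting over fewer axes.
    """
    err = y_true - y_pred
    return float(np.mean(err ** 2)), float(np.mean(np.abs(err)))
```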
4.4.1 Multivariate results.
In multivariate long-horizon forecasting, KEDformer demonstrates consistent stability and superior accuracy across all benchmark datasets (Table 4). Under the input-96-predict-336 setting, KEDformer achieves substantial gains on five real-world datasets: it attains the best MSE in 11 out of 20 comparisons and the best MAE in 8 out of 20, indicating robust performance across data regimes and horizons.
In contrast, the baselines exhibit structural limitations under long horizons, multivariate coupling, and pronounced non-stationarity. Transformer incurs quadratic complexity with sequence length and lacks mechanisms tailored to temporal non-stationarity (i.e., evolving data statistics due to seasonality, policy shifts, or holidays), leading to slow adaptation and accumulated error. Informer, Reformer, Pyraformer, and FEDformer gain efficiency via attention sparsification or downsampling, which narrows effective context and increases the risk of missing weak yet critical long-range dependencies; frequency-domain pipelines further introduce phase and amplitude distortions when reconstructing to the time domain. LSTNet, LSTM, and TCN are constrained in modeling very long dependencies and cross-variable interactions: RNNs suffer gradient decay, TCNs remain locally biased and less robust to phase shifts, and LSTNet’s linear residual path under-represents nonlinear cross-variable relations, leading to underfitting.
KEDformer’s advantage stems from the integration of an explicit autocorrelation module and fidelity-oriented near-linear sparse attention. Autocorrelation explicitly retrieves repeated patterns and latent dependencies across variables and temporal lags, while sparse attention suppresses redundancy and preserves salient channels and dominant delays. Additionally, the MSTP decomposition jointly models local perturbations and global trends, strengthening representation of complex temporal structure. Together, these components—trained end-to-end—yield superior accuracy and robustness in multivariate forecasting.
4.4.2 Univariate results.
In the univariate forecasting setting, we selected the representative ETTm2 and Exchange datasets for evaluation, as they respectively exhibit strong periodic industrial characteristics and highly volatile financial behaviors. This complementary design enables a comprehensive assessment of the model’s robustness and generalization across diverse time series scenarios, with the results summarized in Table 5.
Compared with multiple baselines, KEDformer achieves overall state-of-the-art performance in long-term forecasting. Under the input-96-predict-336 configuration, it delivers the best results in 5 out of 8 cases for both Mean Squared Error (MSE) and Mean Absolute Error (MAE). To further enrich the baseline pool, LogTrans, DeepAR, Prophet, and ARIMA were additionally included in the univariate experiments. The results highlight their inherent limitations: LogTrans, while employing logarithmic sparse attention to reduce computational cost, fails to capture long-term dependencies and non-periodic patterns, leading to error accumulation over extended horizons; DeepAR, as an autoregressive RNN-based method, relies on step-by-step predictions, hindering parallelism and amplifying errors in long sequences; Prophet and ARIMA, constrained by linear decomposition and stationarity assumptions, lack adaptability to nonlinear dynamics and multi-scale temporal variations.
By contrast, KEDformer excels through its autocorrelation mechanism for enhanced global dependency modeling, sparse attention for effective redundancy filtering, and trend–seasonal decomposition module for precise temporal structure learning, thereby fully leveraging its representational advantages in low-dimensional sequences.
4.4.3 Ablation research.
To systematically evaluate the impact of the proposed Knowledge Extraction and Decomposition Attention (KEDA) module on model performance, we conducted a set of ablation experiments. Three model variants were designed for comparison:
- KEDformer: completely replaces the original self-attention and cross-attention mechanisms with KEDA;
- KEDformer V1: replaces only the self-attention mechanism with KEDA while retaining the original cross-attention;
- KEDformer V2: uses conventional attention mechanisms in both positions.
These models were evaluated on the Exchange and Weather datasets, and the results are shown in Table 6. KEDformer outperformed its counterparts in 14 out of 16 test cases, while KEDformer V1 showed improvements in only 2 cases. The results demonstrate the significant advantage of KEDA as a unified attention mechanism. The KEDA module integrates an autocorrelation mechanism with a sparse attention strategy: the autocorrelation component precisely captures key dependencies across time segments, enhancing the model’s temporal modeling capability, while the sparse attention mechanism sparsifies the attention weight matrix, effectively reducing interference from irrelevant positions.
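The sparsification step can be illustrated in isolation. The sketch below is our own minimal version, not the paper's KEDA code: it keeps the top-k weights in each query row of a dense attention matrix and renormalizes, which is the standard way such pruning suppresses interference from low-scoring positions.

```python
import numpy as np

def sparsify_attention(weights, k):
    """Keep only the k largest weights in each row of an attention
    matrix and renormalize, zeroing out low-scoring (noisy) positions."""
    L = weights.shape[-1]
    # indices of the k largest entries per row
    idx = np.argpartition(weights, L - k, axis=-1)[..., L - k:]
    mask = np.zeros_like(weights)
    np.put_along_axis(mask, idx, 1.0, axis=-1)
    sparse = weights * mask
    return sparse / sparse.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.random((4, 16))
attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # row softmax
sparse_attn = sparsify_attention(attn, k=4)
```

After pruning, each row remains a valid probability distribution over exactly k positions, so the downstream weighted sum of values is unchanged in form but attends only to the retained positions.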
5 Discussion
5.1 Time series decomposition effects on the model
The time series decomposition applied to the ETTm1 dataset has clearly revealed distinct seasonal and trend-cyclical patterns, enabling the KEDformer model to more accurately capture periodic variations and long-term trends within the data. By effectively separating seasonal patterns from trend components, the integration of decomposition has significantly enhanced the model’s predictive accuracy. Focusing on short-term fluctuations while retaining an understanding of long-term trends, the model is better equipped for accurate forecasting. As illustrated in Fig 5, these improvements can be attributed to several key factors: First, the explicit modeling of seasonal variations allows the model to more effectively adapt to recurring patterns, thereby strengthening its ability to project future values based on historical data. Second, the decomposition process helps identify significant features within the data, enabling the model to prioritize relevant information during the forecasting process.
In the left subfigure (a), the raw time series data is shown without decomposition, displaying interwoven fluctuations and trends. In contrast, the right subfigure (b) presents the time series decomposed into three components: the original time series in purple, the trend-cyclical component in beige, and the seasonal component in teal.
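A minimal moving-average decomposition of the kind used in such blocks can be sketched as follows. The kernel size and edge-padding scheme here are illustrative assumptions following the common Autoformer-style series-decomposition pattern, not KEDformer's exact configuration.

```python
import numpy as np

def series_decomp(x, kernel=25):
    """Split a series into a trend component (moving average with
    replicated-edge padding) and a seasonal component (the residual)."""
    pad = kernel // 2
    padded = np.concatenate([np.full(pad, x[0]), x, np.full(pad, x[-1])])
    trend = np.convolve(padded, np.ones(kernel) / kernel, mode="valid")
    seasonal = x - trend
    return seasonal, trend

# Linear trend plus a daily (period-24) cycle: the moving average over a
# window spanning roughly one period should recover the linear trend.
t = np.arange(400, dtype=float)
x = 0.01 * t + np.sin(2 * np.pi * t / 24)
seasonal, trend = series_decomp(x, kernel=25)
```

The two components sum back to the original series by construction, which is why decomposition blocks can be inserted repeatedly inside encoder and decoder layers without losing information.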
5.2 Effect of the number of KEDformer mechanisms on encoding and decoding
In this study, we conducted comparative experiments on the Exchange dataset, varying the number of KEDformer mechanisms. The results, illustrated in Fig 5, demonstrate that the model achieves superior performance when the number of KEDformer mechanisms in the decoding phase exceeds that in the encoding phase. This improvement can be attributed to the model’s enhanced ability to focus on the most informative features during decoding, allowing it to effectively capture dependencies between the predicted outputs and historical inputs. Conversely, performance declines when the number of KEDformer mechanisms in the decoding phase equals that in the encoding phase.
5.3 Effect of KEDformer on computational efficiency
We conducted experiments to evaluate the impact of increasing the number of KEDformer mechanisms on the computational efficiency of the model, as shown in Fig 6. The results demonstrate that the model achieves improved efficiency as the number of KEDformer mechanisms increases across various datasets, including ETTm1, ETTm2, and Weather. Notably, the time required for each epoch decreases significantly as the number of KEDformer mechanisms grows. The most pronounced improvement is observed on the ETTm1 dataset, where computation time drops from 794.0 seconds to 467.2 seconds. This enhancement can be attributed to the model’s improved ability to capture temporal dependencies and optimize resource utilization, enabling parallel processing and more effective distribution of the computational load.
In a comparative experiment that controls the number of KEDformer mechanisms during the encoding and decoding processes, we set the input length I = 96 and the prediction lengths .
5.4 Efficiency and performance analysis
In this study, we conducted a comprehensive analysis of the computational efficiency and predictive performance of models employing different self-attention mechanisms, with the results presented in Fig 7. On the Exchange dataset, the KEDformer model ranks second in memory usage (measured in GB), a result primarily attributable to its sparse attention mechanism. Sparse attention is one of the key techniques for reducing computational complexity and memory consumption: by computing attention only at key positions in the input sequence rather than globally over all positions, it reduces the complexity from the traditional O(L²) to O(L log L), significantly decreasing memory usage. KEDformer ranks third in running time but achieves the highest prediction accuracy. This superior performance is primarily attributable to its optimized knowledge extraction mechanism and seasonal-trend decomposition approach, which significantly enhance the model’s ability to capture key patterns in time series data. However, when dealing with time series that lack clear periodicity, performance may deteriorate, as the seasonal-trend decomposition can fail to extract relevant information. Additionally, an improper configuration of the number of KEDformer mechanisms can reduce computational efficiency and negatively impact the final prediction results. In this study, computational efficiency was assessed by measuring the time required for each model to complete a training epoch (in seconds), while predictive performance was evaluated using the Mean Squared Error (MSE) and Mean Absolute Error (MAE).
The input length is set to I = 96, and the prediction steps are . The time required for each epoch is used as an indicator of the model’s computational speed.
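The complexity reduction can be sketched as follows. This is a simplified ProbSparse-style illustration (in the spirit of Informer), not KEDformer's exact selection rule: only u = O(log L) "active" queries attend over all keys, while the remaining queries fall back to the mean of the values. For clarity the full score matrix is computed here; an efficient implementation samples keys when ranking queries, which is what yields the O(L log L) cost.

```python
import numpy as np

def prob_sparse_attention(Q, K, V, c=5):
    """Attend with only u = c*ln(L) 'active' queries (those whose score
    distribution deviates most from uniform); inactive queries fall back
    to the mean of V. Cost drops from O(L^2) toward O(L log L)."""
    L, d = Q.shape
    u = min(L, int(np.ceil(c * np.log(L))))
    scores = Q @ K.T / np.sqrt(d)               # computed fully here for clarity
    sparsity = scores.max(-1) - scores.mean(-1) # max-mean sparsity measurement
    active = np.argsort(sparsity)[-u:]          # u most "peaked" queries
    out = np.tile(V.mean(0), (L, 1))            # lazy-query fallback
    a = scores[active]
    a = np.exp(a - a.max(-1, keepdims=True))
    out[active] = (a / a.sum(-1, keepdims=True)) @ V
    return out, u

rng = np.random.default_rng(0)
L, d = 96, 8
Q, K, V = rng.normal(size=(3, L, d))
out, u = prob_sparse_attention(Q, K, V)
```

With L = 96 and c = 5, only about two dozen queries receive full attention, so the number of softmax rows grows logarithmically rather than linearly with the sequence length.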
5.5 Computation efficiency
In the multivariate setting and with the current optimal implementation of all methods, KEDformer achieves a significant enhancement in computational efficiency compared to conventional Transformer models. This improvement effectively addresses the challenges associated with the quadratic time complexity O(L²) and memory usage O(L²) inherent to standard self-attention mechanisms. By employing sparse attention and autocorrelation strategies, KEDformer reduces both the time complexity and memory usage to O(L log L), thereby enhancing the model’s capability to process long sequence data. As illustrated in Fig 7, KEDformer maintains its time and memory complexity while significantly improving prediction accuracy, enabling the model to handle longer sequences more efficiently. During the testing phase, KEDformer completes predictions in a single step, in contrast to traditional models that require O(L) steps, thereby substantially increasing its efficiency. As demonstrated in Table 7, KEDformer strikes a superior balance between computational efficiency and predictive accuracy, rendering it a practical solution for long-term time series forecasting tasks in resource-constrained environments.
6 Conclusion
This study proposes KEDformer, a novel and efficient framework for long-term time series forecasting tasks. By introducing a sparse attention mechanism, the model reduces the quadratic complexity of standard self-attention to near-linear complexity, thereby significantly improving processing speed for long sequences. In addition, KEDformer integrates seasonal–trend decomposition with an autocorrelation mechanism to jointly model short-term perturbations and long-term structures, effectively mitigating information loss and producing forecasts that better align with real-world temporal dynamics.
Extensive experiments on multiple public benchmark datasets demonstrate that KEDformer consistently outperforms mainstream Transformer-based models in terms of long-term prediction accuracy, stability, and generalization ability, highlighting its strong adaptability across diverse forecasting tasks. Notably, for typical periodic data, the decomposition and autocorrelation modules provide substantial modeling advantages; however, in certain non-periodic or highly volatile scenarios (e.g., financial exchange rates), the effectiveness of these mechanisms is relatively constrained. This limitation is not inherent to the model itself but reflects the varying sensitivities of different data structures to specific pattern extraction methods.
To further enhance the model’s generality and flexibility, future work will explore dynamic structure selection mechanisms based on data characteristics, enabling adaptive adjustment of modeling strategies. In addition, we plan to investigate more interpretable sparse attention strategies to improve the identification of critical features and redundancy filtering in non-periodic scenarios. These directions are expected to broaden the applicability of KEDformer to a wider range of time series modeling tasks.
Overall, KEDformer marks an important step forward in long-term time series forecasting, offering a promising pathway to address the trade-off between efficiency and accuracy under high-dimensional input conditions. For example, in healthcare monitoring scenarios, KEDformer can capture both long-term trends and short-term fluctuations in physiological signals, thereby assisting in disease risk prediction. This illustrates that its applicability extends beyond standard benchmark tasks and can be adapted to more complex real-world domains. With continued structural optimization and mechanism refinement, its scope is expected to expand to increasingly diverse and challenging forecasting applications.
References
- 1. Zhang C, Patras P. Long-term mobile traffic forecasting using deep spatio-temporal neural networks. In: Proceedings of the Eighteenth ACM International Symposium on Mobile Ad Hoc Networking and Computing. 2018. p. 231–40.
- 2. Maray N, Ngu AH, Ni J, Debnath M, Wang L. Transfer learning on small datasets for improved fall detection. Sensors (Basel). 2023;23(3):1105. pmid:36772148
- 3. Alam M, Ashraf Z, Singh P, Pandey B, Rehman K, Aldasheva L. Deep learning techniques for intrusion detection systems in healthcare environments. In: 2025 IEEE 14th International Conference on Communication Systems and Network Technologies (CSNT). 2025. p. 105–11. https://doi.org/10.1109/csnt64827.2025.10968425
- 4. Feng Z, Zhang Y, Chen Z, Ni J, Feng Y, Xie Y. Machine learning to assess and ensure safe drinking water supply: a systematic review. 2024.
- 5. Fu X, Zhang C, Xu Y, Zhang Y, Sun H. Statistical machine learning for power flow analysis considering the influence of weather factors on photovoltaic power generation. IEEE Trans Neural Netw Learn Syst. 2025;36(3):5348–62. pmid:38587954
- 6. Somu N, M R GR, Ramamritham K. A hybrid model for building energy consumption forecasting using long short term memory networks. Applied Energy. 2020;261:114131.
- 7. Fu X, Zhang C, Zhang X, Sun H. A novel GAN architecture reconstructed using bi-LSTM and style transfer for PV temporal dynamics simulation. IEEE Trans Sustain Energy. 2024;15(4):2826–9.
- 8. Gupta M, Wadhvani R, Rasool A. Comprehensive analysis of change-point dynamics detection in time series data: a review. Expert Systems with Applications. 2024;248:123342.
- 9. Lin Y, Koprinska I, Rana M. SSDNet: state space decomposition neural network for time series forecasting. In: 2021 IEEE International Conference on Data Mining (ICDM). IEEE; 2021. p. 370–8. https://doi.org/10.1109/icdm51629.2021.00048
- 10. Zhou L, Gao J. Temporal spatial decomposition and fusion network for time series forecasting. In: 2023 International Joint Conference on Neural Networks (IJCNN). 2023. p. 1–10. https://doi.org/10.1109/ijcnn54540.2023.10191934
- 11. Gao J, Song X, Wen Q, Wang P, Sun L, Xu H. RobustTAD: robust time series anomaly detection via decomposition and convolutional neural networks. arXiv preprint 2020. https://arxiv.org/abs/2002.09545
- 12. Asadi R, Regan AC. A spatio-temporal decomposition based deep neural network for time series forecasting. Applied Soft Computing. 2020;87:105963.
- 13. Dosovitskiy A. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint 2020. https://arxiv.org/abs/2010.11929
- 14. Ni J, Tang H, Shang Y, Duan B, Yan Y. Adaptive cross-architecture mutual knowledge distillation. In: 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG). 2024. p. 1–5. https://doi.org/10.1109/fg59268.2024.10581969
- 15. Devlin J. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint 2018. https://arxiv.org/abs/1810.04805
- 16. Wu H, Xu J, Wang J, Long M. Autoformer: decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems. 2021;34:22419–30.
- 17. Zhou H, Zhang S, Peng J, Zhang S, Li J, Xiong H, et al. Informer: beyond efficient transformer for long sequence time-series forecasting. AAAI. 2021;35(12):11106–15.
- 18. Beltagy I, Peters ME, Cohan A. Longformer: the long-document transformer. arXiv preprint 2020. https://arxiv.org/abs/2004.05150
- 19. Wang S, Li BZ, Khabsa M, Fang H, Ma H. Linformer: self-attention with linear complexity. arXiv preprint 2020. https://arxiv.org/abs/2006.04768
- 20. Jaszczur S, Chowdhery A, Mohiuddin A, Kaiser L, Gajewski W, Michalewski H. Sparse is enough in scaling transformers. Advances in Neural Information Processing Systems. 2021;34:9895–907.
- 21. Zhuang B, Liu J, Pan Z, He H, Weng Y, Shen C. A survey on efficient training of transformers. arXiv preprint 2023. https://arxiv.org/abs/2302.01107
- 22. Vaswani A. Attention is all you need. In: Advances in Neural Information Processing Systems. 2017.
- 23. Wang W, Liu Y, Sun H. TLNets: transformation learning networks for long-range time-series prediction. arXiv preprint 2023. https://arxiv.org/abs/2305.15770
- 24. Abbasimehr H, Paki R. Improving time series forecasting using LSTM and attention models. J Ambient Intell Human Comput. 2021;13(1):673–91.
- 25. Tay Y, Bahri D, Metzler D, Juan DC, Zhao Z, Zheng C. Synthesizer: rethinking self-attention for transformer models. In: International Conference on Machine Learning. 2021. p. 10183–92.
- 26. Lee-Thorp J, Ainslie J, Eckstein I, Ontanon S. FNet: mixing tokens with Fourier transforms. arXiv preprint 2021. https://arxiv.org/abs/2105.03824
- 27. Lim B, Arık SÖ, Loeff N, Pfister T. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting. 2021;37(4):1748–64.
- 28. Zhou T, Ma Z, Wen Q, Wang X, Sun L, Jin R. FEDformer: frequency enhanced decomposed transformer for long-term series forecasting. In: International Conference on Machine Learning. 2022. p. 27268–86.
- 29. Wu S, Xiao X, Ding Q, Zhao P, Wei Y, Huang J. Adversarial sparse transformer for time series forecasting. Advances in Neural Information Processing Systems. 2020;33:17105–15.
- 30. Qin Z, Wei B, Gao C, Zhu F, Qin W, Zhang Q. ACSAformer: a crime forecasting model based on sparse attention and adaptive graph convolution. Front Phys. 2025;13.
- 31. Qin Z, Wei B, Gao C, Chen X, Zhang H, In Wong CU. SFDformer: a frequency-based sparse decomposition transformer for air pollution time series prediction. Front Environ Sci. 2025;13:1549209.
- 32. Yu S, Peng J, Ge Y, Yu X, Ding F, Li S, et al. A traffic state prediction method based on spatial–temporal data mining of floating car data by using autoformer architecture. Computer Aided Civil Eng. 2024;39(18):2774–87.
- 33. Hong J-T, Bai Y-L, Huang Y-T, Chen Z-R. Hybrid carbon price forecasting using a deep augmented FEDformer model and multimodel optimization piecewise error correction. Expert Systems with Applications. 2024;247:123325.
- 34. Vartholomaios A, Karlos S, Kouloumpris E, Tsoumakas G. Short-term renewable energy forecasting in Greece using prophet decomposition and tree-based ensembles. In: International Conference on Database and Expert Systems Applications. 2021. p. 227–38. https://doi.org/10.48550/arXiv.2107.03825
- 35. Chen W, Yang Y, Liu J. A combination model based on sequential general variational mode decomposition method for time series prediction. arXiv preprint 2024. https://arxiv.org/abs/2406.03157
- 36. Kontopoulou VI, Panagopoulos AD, Kakkos I, Matsopoulos GK. A review of ARIMA vs. machine learning approaches for time series forecasting in data driven networks. Future Internet. 2023;15(8):255.
- 37. Wang X, Wang Y, Peng J, Zhang Z, Tang X. A hybrid framework for multivariate long-sequence time series forecasting. Appl Intell. 2022;53(11):13549–68.
- 38. Woo G, Liu C, Sahoo D, Kumar A, Hoi S. ETSformer: exponential smoothing transformers for time-series forecasting. arXiv preprint 2022. https://arxiv.org/abs/2202.01381
- 39. Niyogi D. A novel method combines moving fronts, data decomposition and deep learning to forecast intricate time series. arXiv preprint 2023. https://arxiv.org/abs/2303.06394
- 40. Zhong S, Song S, Zhuo W, Li G, Liu Y, Chan SHG. A multi-scale decomposition MLP-Mixer for time series analysis. arXiv preprint 2023. https://arxiv.org/abs/2310.11959
- 41. Ni J, Sarbajna R, Liu Y, Ngu AHH, Yan Y. Cross-modal knowledge distillation for vision-to-sensor action recognition. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. https://doi.org/10.1109/icassp43922.2022.9746752
- 42. Ni J, Ngu AHH, Yan Y. Progressive cross-modal knowledge distillation for human action recognition. In: Proceedings of the 30th ACM International Conference on Multimedia. 2022. p. 5903–12. https://doi.org/10.1145/3503161.3548238
- 43. Lee H, Lee C, Lim H, Ko S. TILDE-Q: a transformation invariant loss function for time-series forecasting. arXiv preprint 2022. https://arxiv.org/abs/2210.15050
- 44. Fu Y, Virani N, Wang H. Masked multi-step probabilistic forecasting for short-to-mid-term electricity demand. arXiv preprint 2023. https://arxiv.org/abs/2302.06818
- 45. Sanchis-Agudo M, Wang Y, Duraisamy K, Vinuesa R. Easy attention: a simple self-attention mechanism for transformers. arXiv preprint 2023. https://arxiv.org/abs/2308.12874
- 46. Hua Y, Zhao Z, Li R, Chen X, Liu Z, Zhang H. Deep learning with long short-term memory for time series prediction. IEEE Commun Mag. 2019;57(6):114–9.
- 47. Meenal R, Binu D, Ramya KC, Michael PA, Kumar KV, Rajasekaran E. Weather forecasting for renewable energy system: a review. Archives of Computational Methods in Engineering. 2022;29(5).
- 48. Kingma DP, Ba JL. Adam: a method for stochastic optimization. In: International Conference on Learning Representations. 2014.
- 49. Mustaqeem M, Mustajab S, Alam M, Jeribi F, Alam S, Shuaib M. A trustworthy hybrid model for transparent software defect prediction: SPAM-XAI. PLoS One. 2024;19(7):e0307112. pmid:38990978
- 50. Kitaev N, Kaiser Ł, Levskaya A. Reformer: the efficient transformer. arXiv preprint 2020. https://arxiv.org/abs/2001.04451
- 51. Liu S, Yu H, Liao C, Li J, Lin W, Liu AX, et al. Pyraformer: low-complexity pyramidal attention for long-range time series modeling and forecasting. In: International Conference on Learning Representations. 2022.
- 52. Li L, Wang K, Li S, Feng X, Zhang L. LST-Net: learning a convolutional neural network with a learnable sparse transform. In: European Conference on Computer Vision. Springer; 2020. p. 562–79.
- 53. Yu Y, Si X, Hu C, Zhang J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019;31(7):1235–70. pmid:31113301
- 54. Nie X, Zhou X, Li Z, Wang L, Lin X, Tong T. LogTrans: providing efficient local-global fusion with transformer and CNN parallel network for biomedical image segmentation. In: 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys). IEEE; 2022. p. 769–76.
- 55. Hewage P, Behera A, Trovati M, Pereira E, Ghahremani M, Palmieri F, et al. Temporal convolutional neural (TCN) network for an effective weather forecasting using time-series data from the local weather station. Soft Comput. 2020;24(21):16453–82.
- 56. Taylor SJ, Letham B. Forecasting at scale. The American Statistician. 2018;72(1):37–45.
- 57. Salinas D, Flunkert V, Gasthaus J, Januschowski T. DeepAR: probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting. 2020;36(3):1181–91.
- 58. Wu N, Green B, Ben X, O’Banion S. Deep transformer models for time series forecasting: the influenza prevalence case. arXiv preprint 2020.