Figures
Abstract
Time series forecasting is essential in energy, finance, and meteorology. However, existing Transformer-based models face challenges with computational inefficiency and poor generalization for long-term sequences. To address these issues, this study proposes the KEDformer framework, which integrates knowledge extraction and seasonal-trend decomposition to optimize model performance. By leveraging sparse attention and autocorrelation, KEDformer reduces computational complexity from O(L²) to O(L log L), enhancing the model's ability to capture both short-term fluctuations and long-term patterns. Experiments on five public datasets covering energy, transportation, and weather tasks demonstrate that KEDformer consistently outperforms traditional models, with average improvements of 10.4% in MSE and 2.9% in MAE prediction accuracy.
Citation: Qin Z, Wei B, Gao C, Ni J (2025) KEDformer: Knowledge extraction seasonal trend decomposition for long-term sequence prediction. PLoS One 20(10): e0335047. https://doi.org/10.1371/journal.pone.0335047
Editor: Ayesha Maqbool, National University of Sciences and Technology NUST, PAKISTAN
Received: February 3, 2025; Accepted: October 6, 2025; Published: October 24, 2025
Copyright: © 2025 Qin et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All five datasets (ETT, Electricity, Exchange, Traffic, Weather) are available from the Figshare database (accession number: https://figshare.com/s/bee931e38c4b9e9a935e).
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Long-term forecasting plays a critical role in decision-making domains such as transportation logistics [1], healthcare monitoring [2,3], utility management [4,5], and energy optimization [6,7]; however, as the forecasting horizon increases, so do computational demands and the complexity of modeling temporal dependencies. Traditional time series decomposition methods, while useful, often rely on linear assumptions, limiting their effectiveness in handling complex multivariate scenarios or unpredictable, non-stationary data patterns, and they struggle to capture the interplay between components such as trends, seasonality, and irregularities [8]. Recent advancements have integrated deep learning approaches into the decomposition process to improve forecasting accuracy [9,10], leveraging representation learning and nonlinear transformations to better capture dynamic dependencies and multi-scale interactions within time series data [11,12]. More recently, Transformers have excelled in various tasks, such as computer vision (CV) [13,14], natural language processing (NLP) [15], and time series forecasting, due to their powerful modeling capabilities and flexibility; in long-term forecasting tasks, however, they face significant challenges. The computational complexity of the traditional self-attention mechanism is O(L²), where L represents the sequence length, leading to an increasing demand for memory and computational resources and limiting applicability in resource-constrained or real-time analysis scenarios. To address this, many researchers have improved the self-attention mechanism to reduce the computational complexity of long-term forecasting and enhance the applicability of models in practical scenarios [16–19]. Additionally, Transformers often struggle to model long-term dependencies effectively due to noise interference, where irrelevant information weakens the attention distribution and degrades overall performance [19,20].
Consequently, capturing long-term dependencies in time series data while ensuring computational efficiency over extended prediction horizons remains a significant challenge.
To address the challenges in long-term time series forecasting, we propose an end-to-end Knowledge Extraction Decomposition (KEDformer) framework. The core of this framework is the Knowledge Extraction Attention module (KEDA, blue block), which reduces model parameters through autocorrelation and sparse attention mechanisms. The autocorrelation mechanism estimates the correlations of subsequences within a specific time period, while the sparse attention filters the weight matrix of these correlations, thereby reducing computational overhead and mitigating the interference from irrelevant features. This design reduces the computational complexity from O(L²) to O(L log L), significantly decreasing memory usage and enhancing the model's ability to process long sequences. Additionally, KEDformer integrates a mixed time series pooling decomposition method (MSTP, yellow block) that decomposes the input time series data into seasonal and trend components, further improving prediction accuracy. This approach captures both short-term fluctuations and long-term patterns, making the predictions more consistent with real-world temporal dynamics. Therefore, the KEDformer framework not only addresses the computational bottleneck of traditional Transformers in long-term forecasting but also enhances their performance and robustness in complex sequence tasks. In summary, the contributions of this study are as follows:
- We introduce a knowledge extraction mechanism that combines sparse attention and autocorrelation to reduce the computational cost of the self-attention layer. This mechanism reduces the computational overhead from quadratic O(L²) to near-linear O(L log L) complexity.
- Furthermore, by employing seasonal-trend decomposition, KEDformer effectively captures both long-term trends and seasonal patterns, overcoming the limitations of the Transformer model in capturing long-term dependencies.
- Extensive experiments on five public datasets demonstrate the effectiveness and competitiveness of the proposed KEDformer, which outperforms all previous Transformer-based models across various forecasting applications.
2 Related work
2.1 Transformer-based long-term time series forecasting
Transformer-based models have demonstrated exceptional performance in time series forecasting due to their powerful self-attention mechanism and parallel processing capabilities, excelling at capturing long-term dependencies and handling long-sequence data [21,22]. However, traditional Transformer models still face several challenges in time series forecasting, such as high computational complexity and difficulty addressing noise in long-term dependencies [15,18]. For instance, the core self-attention mechanism exhibits quadratic computational complexity with respect to sequence length, which limits its efficiency in long-sequence tasks [23].
To overcome these limitations, various advancements have been proposed in recent years. For example, Synthesizer [24] investigated the importance of dot-product interactions, introduced randomly initialized, learnable attention mechanisms, and demonstrated competitive performance in specific tasks. Furthermore, FNet [25] replaced self-attention with Fourier transforms, showcasing its effectiveness in mixing sequence features. Another approach utilized Gaussian distributions to construct attention weights, enabling a focus on local windows and improving the performance of models in capturing local dependencies [26]. Pyraformer employed a pyramid attention structure to address the complexity of handling long-range dependencies, while TFT integrated multivariate features and time-varying information to improve multi-step forecasting [27]. More recently, Informer introduced the ProbSparse attention mechanism and distillation techniques, reducing computational complexity to O(L log L) and significantly improving efficiency. LogTrans employed logarithmic sparse attention to further alleviate the computational burden of long-sequence predictions [28], while AST combined adversarial training and sparse attention to enhance robustness in complex scenarios [29]. ACSAformer [30] enhances the modeling capability for complex high-dimensional data by integrating sparse attention and adaptive graph convolution, and SFDformer [31] improves the modeling and prediction of complex dynamic features by integrating Fourier transform, time series decomposition, and sparse attention mechanisms. Additionally, Autoformer leveraged time series decomposition and autocorrelation mechanisms for long-term sequence forecasting [16,32], and FEDformer utilized frequency-domain enhancements to optimize performance on long sequences [29,33]. We summarize the main characteristics and advantages of some published studies in the literature in Table 1.
Although these studies have made progress in optimizing computational efficiency and capturing long-term dependencies, they still show instability in modeling complex long-term and non-periodic dependencies.
Unlike previous studies, the proposed KEDformer integrates sparse attention mechanisms with autocorrelation strategies at the architectural level and introduces a dominant weight distribution selection mechanism. This mechanism dynamically selects representative weights from the sparse attention matrix, thereby enhancing the model’s ability to capture key dependencies in time series. It not only inherits the computational efficiency of sparse attention but also alleviates the potential information loss often seen in traditional sparse schemes when modeling global contexts. Compared with Informer, which implements sparse attention through probabilistic sampling, KEDformer combines dominant distribution selection with explicit autocorrelation enhancement to improve the stability of modeling long-range, non-periodic dependencies. Unlike Autoformer, which entirely replaces attention with autocorrelation, KEDformer retains the sparse attention backbone and embeds auxiliary autocorrelation modules, forming a complementary modeling framework. This collaborative fusion strategy achieves a balance between performance and efficiency, demonstrating structural innovation and practical adaptability beyond existing methods.
2.2 Decomposition of time series
Time series decomposition is a traditional approach that breaks down time series data into components such as trend, seasonality, and residuals, revealing the intrinsic patterns within the data [34,35]. Among traditional methods, ARIMA [36] uses differencing and parameterized modeling to decompose and forecast non-stationary time series effectively. In contrast, the Prophet model combines trend and seasonal components while accommodating external covariates [29], making it suitable for modeling complex time series patterns. Matrix decomposition-based methods, such as DeepGLO, extract global and local features through low-rank matrix decomposition, while N-BEATS employs a hierarchical structure to dissect trends and periodicity [35]. However, these approaches primarily focus on the static decomposition of historical sequences and often fall short in capturing dynamic interactions for future forecasting.
More recently, deep learning models have increasingly incorporated time series decomposition to enhance predictive power. For example, Autoformer introduces an embedded decomposition module, treating trend and seasonal components as core building blocks to achieve progressive decomposition and forecasting [37]. FEDformer combines Fourier and wavelet transforms to decompose time series into components of varying frequencies, capturing global characteristics and local structures while significantly reducing computational complexity and improving the accuracy of long-sequence predictions. Similarly, ETSformer [38] adopts a hierarchical decomposition framework inspired by exponential smoothing, segmenting time series into level, growth, and seasonality components [39]. By integrating exponential smoothing attention and frequency-domain attention mechanisms, ETSformer effectively extracts key features, demonstrating superior performance across multiple datasets [40]. Inspired by these studies, our proposed KEDformer approach integrates decomposition modules dynamically with a progressive decomposition strategy, not only significantly improving computational efficiency but also enabling the simultaneous modeling of both short-term and long-term patterns.
3 Methodology
3.1 Background
The Long Sequence Time Forecasting (LSTF) problem is defined within a rolling forecasting setup, where predictions over an extended future horizon are made based on past observations within a fixed-size window [23]. At each time point t, the input sequence consists of observations with multiple feature dimensions, X^t = {x_1^t, …, x_{Lx}^t | x_i^t ∈ ℝ^{dx}}, and the output sequence Y^t = {y_1^t, …, y_{Ly}^t | y_i^t ∈ ℝ^{dy}} predicts multiple future values. The output length Ly is intentionally set to be relatively long to capture complex dependencies over time. This setup enables the model to predict multiple attributes, making it well-suited for multivariate time series applications.
3.2 Data decomposition
In this section, we will cover the following aspects of KEDformer: (1) the decomposition process designed to capture seasonal and trend components in time series data; and (2) the architecture of the KEDformer encoder and decoder.
Time series decomposition. To capture complex temporal patterns in long-term predictions, we utilize a decomposition approach that separates sequences into trend, cyclical, and seasonal components. These components correspond to the long-term progression and seasonal variations inherent in the data. However, directly decomposing future sequences is impractical due to the uncertainty of future data. To address this challenge, we introduce a novel internal operation within the sequence decomposition block, referred to as the autocoupling mechanism in KEDformer, as shown in Fig 1. This mechanism enables the progressive extraction of long-term stationary trends from predicted intermediate hidden states. Specifically, we adjust the moving average to smooth periodic fluctuations and emphasize the long-term trends. For the length-L input sequence X ∈ ℝ^{L×d}, the procedure is as follows:

X_t = AvgPool(Padding(X)),
X_s = X − X_t,

where X_s and X_t represent the seasonal part and the extracted trend component, respectively. We use AvgPool(·) with padding for the moving average and filling operations to maintain a constant sequence length. We summarize the above process as X_s, X_t = SeriesDecomp(X), which is a within-model block.
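As a concrete illustration of the block above, a minimal NumPy sketch of SeriesDecomp is given below; the function name, the replication-style padding, and the default kernel size are illustrative assumptions, not the authors' released implementation:

```python
import numpy as np

def series_decomp(x, kernel_size=25):
    """Split a 1-D series into seasonal and trend parts.

    The trend is a centered moving average; edge values are replicated
    (the 'filling operation') so the output keeps the input length L.
    The seasonal part is the residual x - trend.
    """
    pad = (kernel_size - 1) // 2
    x_padded = np.concatenate(
        [np.full(pad, x[0]), x, np.full(kernel_size - 1 - pad, x[-1])]
    )
    kernel = np.ones(kernel_size) / kernel_size
    trend = np.convolve(x_padded, kernel, mode="valid")  # length L
    seasonal = x - trend
    return seasonal, trend
```

By construction, seasonal + trend reconstructs the input exactly, and for a smooth ramp the interior of the trend coincides with the series itself.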
Model input.
In Fig 2, the encoder's input consists of the past I time steps, denoted as X_en ∈ ℝ^{I×d}. In the decomposition architecture, the input to the decoder is composed of both a seasonal component, X_des ∈ ℝ^{(I/2+O)×d}, and a trend-cyclical component, X_det ∈ ℝ^{(I/2+O)×d}, both of which are subject to further refinement. Each initialization consists of two elements: (1) the decomposed component derived from the latter half of the encoder's input, X_en,I/2:I, of length I/2, which provides recent information, and (2) placeholders of length O, filled with scalar values. The formulation is as follows:

X_ens, X_ent = SeriesDecomp(X_en,I/2:I),
X_des = Concat(X_ens, X_0),
X_det = Concat(X_ent, X_mean),

where X_ens and X_ent ∈ ℝ^{(I/2)×d} denote the seasonal and trend-cyclical components of X_en,I/2:I, respectively. The placeholders, labeled as X_0 and X_mean ∈ ℝ^{O×d}, are populated with zeros and the mean values of X_en, respectively.
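The initialization above can be sketched as follows for a single variable; `decoder_init` and its arguments are hypothetical names, and the real model operates on embedded multivariate tensors rather than 1-D arrays:

```python
import numpy as np

def series_decomp(x, kernel_size=25):
    # Moving-average trend with edge replication; seasonal = residual.
    pad = (kernel_size - 1) // 2
    xp = np.concatenate([np.full(pad, x[0]), x, np.full(kernel_size - 1 - pad, x[-1])])
    trend = np.convolve(xp, np.ones(kernel_size) / kernel_size, mode="valid")
    return x - trend, trend

def decoder_init(x_enc, pred_len, kernel_size=25):
    """Build the decoder's seasonal/trend seeds.

    The latter half of the encoder input is decomposed (recent information);
    the O-step placeholders are zeros for the seasonal seed and the mean of
    the encoder input for the trend seed, as described in the text.
    """
    half = x_enc[len(x_enc) // 2:]
    seasonal, trend = series_decomp(half, kernel_size)
    x_des = np.concatenate([seasonal, np.zeros(pred_len)])
    x_det = np.concatenate([trend, np.full(pred_len, x_enc.mean())])
    return x_des, x_det
```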
Encoder.
In Fig 2, the encoder follows a multilayer architecture, defined as X_en^l = Encoder(X_en^{l−1}), where X_en^l represents the output of the l-th encoder layer. The initial input, X_en^0, corresponds to the embedded historical time series. The Encoder function, Encoder(·), is formally expressed as:

S_en^{l,1}, _ = SeriesDecomp(KEDA(X_en^{l−1}) + X_en^{l−1}),
S_en^{l,2}, _ = SeriesDecomp(FeedForward(S_en^{l,1}) + S_en^{l,1}),

where X_en^l = S_en^{l,2}, and S_en^{l,i}, i ∈ {1, 2}, represents the seasonal component after the i-th decomposition block in the l-th layer; the trend parts (denoted "_") are discarded in the encoder.
Decoder.
In Fig 2, the decoder has two roles: the accumulation of the trend time series part and the knowledge extraction stacking of the seasonal time series part. For example, X_de^l = Decoder(X_de^{l−1}, X_en^N), where X_de^l represents the output of the l-th decoder layer. The decoder is formalized as:

S_de^{l,1}, T_de^{l,1} = MSWTDecomp(KEDA(X_de^{l−1}) + X_de^{l−1}),
S_de^{l,2}, T_de^{l,2} = MSWTDecomp(KEDA(S_de^{l,1}, X_en^N) + S_de^{l,1}),
S_de^{l,3}, T_de^{l,3} = MSWTDecomp(FeedForward(S_de^{l,2}) + S_de^{l,2}),    (11)
X_de^l = S_de^{l,3},    (12)
T_de^l = T_de^{l−1} + W_{1,l}·T_de^{l,1} + W_{2,l}·T_de^{l,2} + W_{3,l}·T_de^{l,3},    (13)

Eq (11) denotes the final decomposition stage within the decoder, where the output from the previous residual block, S_de^{l,2}, is passed through a feed-forward layer and added back via a residual connection. The resulting sequence is then processed by the MSWTDecomp function, which applies multi-scale wavelet transform-based decomposition to extract seasonal (S_de^{l,3}) and trend (T_de^{l,3}) components. This design ensures that each decoder layer can refine the representation by isolating frequency-aware patterns, thereby improving long-horizon forecasting accuracy.

In Eq (12), the final seasonal component S_de^{l,3}, obtained from the MSWTDecomp operation, is directly used as the decoder's output representation X_de^l. This substitution simplifies the decoder architecture by avoiding explicit trend aggregation in the seasonal stream and emphasizes the dominant short-term patterns captured via multi-scale decomposition, which are particularly useful for downstream forecasting tasks.

In this context, S_de^{l,i} and T_de^{l,i}, where i ∈ {1, 2, 3}, represent the seasonal and trend components, respectively, after the i-th decomposition block within the l-th layer. The matrix W_{i,l}, where i ∈ {1, 2, 3}, serves as the projection matrix for the i-th extracted trend component, T_de^{l,i}.
To improve the flexibility of trend modeling, the decoder weight matrix in Eq (13) is designed as a learnable parameter and initialized using Xavier uniform initialization to ensure numerical stability at the early stage of training. During training, this matrix is updated via backpropagation to adaptively capture the evolving trend across multiple time steps. Additionally, to mitigate the risk of error accumulation during multi-step trend extrapolation, we adopt a residual-style skip connection in the decoder and apply L2 regularization and early stopping. These strategies help to stabilize trend prediction and enhance robustness against long-term forecast deviations.
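For reference, Xavier uniform initialization draws each weight from U(−a, a) with a = √(6 / (fan_in + fan_out)); a minimal sketch (the helper name is ours, not the paper's API):

```python
import numpy as np

def xavier_uniform(shape, rng=None):
    """Glorot/Xavier uniform init: U(-a, a) with a = sqrt(6 / (fan_in + fan_out)).

    Keeps the variance of activations and gradients roughly constant across
    layers, which stabilizes the early stage of training.
    """
    rng = rng or np.random.default_rng(0)
    fan_in, fan_out = shape
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=shape)
```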
3.3 Knowledge extraction process
3.3.1 Autocorrelation function.
To enhance the capability of modeling long-range dependencies in time series, we introduce the autocorrelation function (ACF) as an auxiliary mechanism. ACF measures the correlation between a time series and a lagged version of itself, allowing the model to explicitly encode repeating patterns or structural temporal regularities.
Given a time series {x_t}, t = 1, …, L, its autocorrelation at lag l is defined as:

ρ(l) = Σ_{t=1}^{L−l} (x_t − μ)(x_{t+l} − μ) / Σ_{t=1}^{L} (x_t − μ)²,

where μ is the mean of the time series. This function quantifies how past values influence future values at different time scales.
In our framework, ACF values are computed for each time segment and used to weight candidate connections in the attention mechanism. Specifically, when constructing the query-key similarity map, we incorporate a weighting term based on ACF to prioritize temporally correlated elements. The combined similarity score becomes:

Score(q_i, k_j) = q_i k_j^T / √d + λ·ρ(|i − j|),

where λ controls the balance between dot-product similarity and autocorrelation guidance.
This integration ensures that attention weights are not only based on instantaneous token-level similarity but also reflect broader temporal structure, especially beneficial for periodic or structured sequences. In summary, the ACF module enriches the temporal modeling capacity by explicitly introducing interpretable lag-based correlations into the attention computation.
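A compact sketch of this ACF-guided scoring, assuming the lag between positions i and j is |i − j| and using a sample autocorrelation estimate (the function names and the default λ are illustrative):

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation rho(l) for lags 0..max_lag-1."""
    x = x - x.mean()
    denom = np.dot(x, x) + 1e-12
    return np.array([np.dot(x[: len(x) - l], x[l:]) / denom for l in range(max_lag)])

def guided_scores(Q, K, series, lam=0.5):
    """Dot-product similarity blended with a lag-based autocorrelation prior:
    Score(q_i, k_j) = q_i k_j^T / sqrt(d) + lam * rho(|i - j|)."""
    L, d = Q.shape
    dots = Q @ K.T / np.sqrt(d)
    lags = np.abs(np.arange(L)[:, None] - np.arange(L)[None, :])
    return dots + lam * acf(series, L)[lags]
```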
3.3.2 Self-attention mechanism.
The canonical self-attention mechanism is defined by the tuple inputs Q, K, and V, which correspond to the query, key, and value matrices, respectively. This mechanism performs scaled dot-product attention, computed as:

A(Q, K, V) = Softmax(QK^T / √d)·V,

where Q ∈ ℝ^{L_Q×d}, K ∈ ℝ^{L_K×d}, and V ∈ ℝ^{L_K×d}, with d representing the input dimension. To further analyze the self-attention mechanism, we focus on the attention distribution of the i-th query, denoted as q_i, which is based on an asymmetric kernel smoother. The attention for the i-th query is formulated in probabilistic terms:

A(q_i, K, V) = Σ_j [k(q_i, k_j) / Σ_l k(q_i, k_l)]·v_j = Σ_j p(k_j | q_i)·v_j,

where p(k_j | q_i) = k(q_i, k_j) / Σ_l k(q_i, k_l), and k(q_i, k_j) represents the asymmetric exponential kernel exp(q_i k_j^T / √d). This self-attention mechanism combines the values and produces outputs by computing the probability p(k_j | q_i). However, this process involves quadratic dot-product computations, resulting in a complexity of O(L²), which poses a significant limitation in memory usage, particularly for models designed to enhance predictive capacity.
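The canonical mechanism and its L×L score matrix can be sketched as follows; returning the probability matrix P alongside the output makes the quadratic memory footprint explicit:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention.

    The scores matrix is L_Q x L_K, so time and memory grow quadratically
    with sequence length; each row of P is the distribution p(k_j | q_i).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V, P
```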
3.3.3 Knowledge selection.
Knowledge Selection refers to selecting the most representative queries from multiple candidates for knowledge extraction. By measuring the distance between the query distribution and a reference distribution, the relevance of each query can be evaluated, enabling the identification and utilization of important queries. From Eq (15), the attention of the i-th query across all keys is represented as a probability distribution p(k_j | q_i), where the output is computed by aggregating the values v weighted by this probability. High dot-product values between query-key pairs lead to a non-uniform attention distribution, as dominant query-key pairs shift the attention probability away from a uniform distribution. If p(k_j | q_i) closely resembles the uniform distribution, q(k_j | q_i) = 1/L_K, then the self-attention essentially produces an averaged summation over the values v, diminishing the significance of individual values.

To mitigate this, we introduce a knowledge extraction mechanism that evaluates the similarity between the attention probability p and a baseline distribution q using the Kullback-Leibler (KL) divergence [41,42]. This measure effectively reduces the influence of less significant queries. The similarity between p and q for the i-th query is computed as:

KL(q‖p) = ln Σ_{j=1}^{L_K} e^{q_i k_j^T/√d} − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d − ln L_K.

From this, we define the distillation measure M(q_i, K) for the i-th query as:

M(q_i, K) = ln Σ_{j=1}^{L_K} e^{q_i k_j^T/√d} − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d.
A larger M(qi,K) value indicates that the i-th query has a more diverse attention distribution, potentially focusing on dominant dot-product pairs in the tail of the self-attention output. This approach allows the model to prioritize influential query-key pairs, thereby improving the overall effectiveness of the knowledge extraction process.
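A sketch of the measure and the resulting query selection, written with a numerically stable log-sum-exp (the function names are ours, not the paper's API):

```python
import numpy as np

def sparsity_measure(Q, K):
    """M(q_i, K): log-sum-exp of the scaled scores minus their arithmetic mean.

    By Jensen's inequality M >= ln(L_K) > 0; a larger M means the query's
    attention distribution is farther from uniform, i.e. more informative.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    m = scores.max(axis=1)
    lse = m + np.log(np.exp(scores - m[:, None]).sum(axis=1))  # stable LSE
    return lse - scores.mean(axis=1)

def top_u_queries(Q, K, u):
    """Indices of the u queries with the largest measure, best first."""
    return np.argsort(sparsity_measure(Q, K))[-u:][::-1]
```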
3.3.4 Decoupled knowledge extraction.
Period-based dependencies. The period-based dependencies are quantified using the autocorrelation function, which measures the similarity between different time points in a time series, revealing its underlying periodic characteristics. For a discrete time series {X_t}, the autocorrelation function is defined as:

R_XX(τ) = (1/L) Σ_{t=1}^{L} X_t·X_{t−τ},

where τ represents the time lag, L is the total length of the series, and X_t and X_{t−τ} are the values at the current and lagged time points, respectively. The autocorrelation function computes the cumulative similarity over lagged time intervals, reflecting the degree of self-similarity within the series for various time delays. Peaks in the autocorrelation values indicate potential periodicity and help identify the likely period lengths.
By identifying the peaks of the autocorrelation function, the most probable period lengths can be determined. These period lengths not only capture the dominant periodic patterns in the series but also serve as weighted features, enhancing interpretability and predictive capabilities.
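Peak-based period detection can be sketched with the Wiener-Khinchin identity, which obtains the autocorrelation as the inverse FFT of the power spectrum in O(L log L); the zero-padding and top-k selection below are illustrative choices:

```python
import numpy as np

def top_periods(x, k=2):
    """Return the k lags with the highest autocorrelation.

    Wiener-Khinchin: ACF = inverse FFT of the power spectrum.  Zero-padding
    to length 2L avoids circular wrap-around; lag 0 is excluded since it is
    a trivial maximum.
    """
    x = x - x.mean()
    L = len(x)
    f = np.fft.rfft(x, n=2 * L)
    r = np.fft.irfft(f * np.conj(f))[:L]   # linear autocorrelation, lags 0..L-1
    r[0] = -np.inf
    return np.argsort(r)[-k:][::-1]
```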
Time-delay aggregation. The time-delay aggregation method for knowledge acquisition focuses on estimating the correlation of sub-sequences within a specific period. Therefore, we propose an innovative time-delay aggregation module that can perform hierarchical convolution operations on sub-sequences based on the selected time delays τ_1, …, τ_k, thereby narrowing down the key knowledge weight matrix. This process captures sub-sequences from the same location and similar positions within the period, extracting the potential key-weight aggregation matrix. Finally, we apply the Softmax function to normalize the weights, enhancing the accuracy of sub-sequence aggregation.
For a time series x of length L, after projection and filtering of the weight matrix, we obtain the query Q, key K, and value V. The knowledge extraction attention mechanism is then as follows:

τ_1, …, τ_k = argTopk_{τ ∈ {1, …, L}} (R_{Q,K}(τ)),
R̂_{Q,K}(τ_1), …, R̂_{Q,K}(τ_k) = Softmax(R_{Q,K}(τ_1), …, R_{Q,K}(τ_k)),
KEDA(Q, K, V) = Σ_{i=1}^{k} Roll(V, τ_i)·R̂_{Q,K}(τ_i),

where argTopk(·) is used to obtain the top k parameters of self-attention, and we let k = ⌊c·log L⌋, where c is a hyperparameter. R_{Q,K} represents the self-attention matrix between sequences Q and K, and Top_u selects the most important u queries in the weight matrix. R̂_{Q,K} represents the self-attention matrix after filtering between sequences Q and K. Roll(X, τ) denotes the operation of temporally shifting X by τ, where the elements shifted out from the front are reintroduced at the end. For the encoder-decoder attention, K and V come from the encoder output X_en^N and are adjusted to length O, with Q originating from the previous block of the decoder.
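A minimal sketch of the aggregation step, assuming Roll(V, τ) corresponds to `np.roll` with a negative shift so that elements leaving the front re-enter at the end (names are illustrative):

```python
import numpy as np

def time_delay_agg(V, corr, k=2):
    """Aggregate rolled copies of the values at the top-k correlation lags.

    The selected correlations are softmax-normalized to form the weights
    of the weighted sum over the rolled value sequences.
    """
    lags = np.argsort(corr)[-k:]
    w = np.exp(corr[lags] - corr[lags].max())
    w /= w.sum()                       # softmax over the selected lags
    return sum(wi * np.roll(V, -int(tau), axis=0) for wi, tau in zip(w, lags))
```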
In summary, as illustrated in Fig 3, the collaborative interaction between the Knowledge Extraction Attention module and the time-series pooling decomposition method markedly enhances the predictive efficiency of KEDformer within the overall architecture (Fig 4).
The input length is set to I = 96, and the prediction lengths are O ∈ {96, 192, 336, 720}.
4 Experiment
4.1 Datasets
In order to evaluate the effectiveness of the proposed KEDformer model, five public datasets were employed: four periodic datasets and one non-periodic dataset. They encompass a variety of tasks and are described in detail as follows:
- ETT [43]: This dataset comprises four sub-datasets (ETTh1, ETTh2, ETTm1, and ETTm2). The data in ETTh1 and ETTh2 were sampled every hour, while the data in ETTm1 and ETTm2 were sampled every 15 minutes. These datasets include load and oil temperature measurements collected from power transformers between July 2016 and July 2018.
- Electricity [44]: This dataset contains hourly electricity consumption data from 321 customers, spanning 2012 to 2014.
- Exchange Rates [45]: This non-periodic dataset records the daily exchange rates of eight countries from 1990 to 2016.
- Traffic [46]: This dataset consists of hourly traffic data from the California Department of Transportation, capturing road occupancy through various sensors on the Bay Area Highway.
- Weather [47]: This dataset includes meteorological data recorded every 10 minutes throughout 2020, with 21 indicators such as temperature and humidity.
In accordance with standard protocols, all datasets were chronologically split into training, validation, and test sets. The ETT dataset was partitioned using a 6:2:2 ratio [43], while the other datasets followed a 7:1:2 split [44–47].
4.2 Implementation details
Under the experimental configuration summarized in Table 2, this study embeds residual connections into the decomposition module of the Transformer-based model [32], thereby enhancing its capability to model time series, smoothing periodic fluctuations, and emphasizing long-term trends. The model is optimized using the L2 loss function and the ADAM optimizer [48], with a fixed random seed set to ensure reproducibility. Hyperparameters are systematically tuned on the validation set via a grid search strategy [49]. Specifically, we evaluated all combinations of key parameters and adopted MSE and MAE as performance metrics, with the final optimal configuration summarized in Table 3.
To mitigate overfitting and improve generalization, two complementary strategies were employed: (1) an early stopping mechanism, in which training was terminated if validation performance did not improve for 10 consecutive epochs; and (2) regularization, achieved by introducing dropout in critical layers and applying a weight decay term of 0.1. These measures effectively reduced the risk of over-parameterization and enhanced the robustness of the model in long-horizon forecasting tasks.
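The early-stopping rule described above can be sketched as a small stateful helper (a patience of 10 as in the text; the `min_delta` tolerance is an illustrative addition):

```python
class EarlyStopping:
    """Stop training when validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        # Returns True when training should stop.
        if val_loss < self.best - self.min_delta:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```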
It is noteworthy that long-term sequence forecasting is inherently more prone to overfitting than short-term tasks, as the model must simultaneously capture local perturbations and global trends over extended horizons. Excessive reliance on training-specific patterns may therefore impair generalization. To address this issue, we further assessed model robustness through repeated experiments with different random seeds. Specifically, all experiments were conducted three times, and the mean and standard deviation were reported to ensure the reliability and reproducibility of the results.
4.3 Baselines
We evaluated eleven representative baseline models for time-series forecasting. Among them, Transformer-based models include Autoformer [32], Informer [17], Reformer [50], Pyraformer [51], FEDformer [28], and LogTrans [18]; recurrent neural network (RNN)–based models comprise LSTNet [52], LSTM [53], and DeepAR [54]; the convolutional neural network (CNN)–based model is TCN [55]; and statistical decomposition and linear models include Prophet [56] and ARIMA [57]. These baselines cover a broad spectrum of mainstream forecasting paradigms, from long-range dependency modeling and frequency-domain decomposition to seasonal-trend analysis, thereby providing a comprehensive and systematic benchmark for this study.
4.4 Performance comparison
The Mean Squared Error (MSE) emphasizes large deviations by penalizing them more heavily, while the Mean Absolute Error (MAE) provides a more intuitive measure of the average prediction bias. As both metrics are widely used and complementary in time series forecasting tasks, we adopt MSE and MAE as the two primary indicators to evaluate the predictive accuracy of our model. Their formulations are given in Eqs (25) and (26):

MSE = (1/n) Σ_{x=1}^{n} (y_x − ŷ_x)²,    (25)
MAE = (1/n) Σ_{x=1}^{n} |y_x − ŷ_x|,    (26)

In the formulas of MSE and MAE, n is the total number of samples, y_x is the true value of the x-th sample, and ŷ_x is the predicted value of the x-th sample.
In multi-step forecasting tasks, errors are first computed at each prediction horizon and then averaged across the entire forecast range. Let O denote the prediction length, N the number of test samples, and D the number of variables. The aggregated Mean Squared Error (MSE) and Mean Absolute Error (MAE) are formulated as follows:

MSE = (1/(N·O·D)) Σ_{i=1}^{N} Σ_{t=1}^{O} Σ_{d=1}^{D} (y_{i,t,d} − ŷ_{i,t,d})²,
MAE = (1/(N·O·D)) Σ_{i=1}^{N} Σ_{t=1}^{O} Σ_{d=1}^{D} |y_{i,t,d} − ŷ_{i,t,d}|,

Here, y_{i,t,d} and ŷ_{i,t,d} represent the ground-truth and predicted values of the i-th sample at forecasting step t and variable dimension d, respectively. For each dataset, the final MSE and MAE are obtained by aggregating across all time steps and all variables in the test set.
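Because the aggregation is a plain mean over samples, horizon steps, and variables, the computation reduces to a few lines (the function name is ours):

```python
import numpy as np

def aggregate_mse_mae(y_true, y_pred):
    """MSE and MAE averaged over all axes: samples (N), steps (O), variables (D).

    Inputs are arrays of shape (N, O, D); the same code also handles the
    single-sample or univariate cases by broadcasting over fewer axes.
    """
    err = y_true - y_pred
    return float(np.mean(err ** 2)), float(np.mean(np.abs(err)))
```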
4.4.1 Multivariate results.
In multivariate long-horizon forecasting, KEDformer demonstrates consistent stability and superior accuracy across all benchmark datasets (Table 4). Under the input-96-predict-336 setting, KEDformer achieves substantial gains on five real-world datasets: it attains the best MSE in 11 out of 20 comparisons and the best MAE in 8 out of 20, indicating robust performance across data regimes and horizons.
In contrast, the baselines exhibit structural limitations under long horizons, multivariate coupling, and pronounced non-stationarity. Transformer incurs quadratic complexity with sequence length and lacks mechanisms tailored to temporal non-stationarity (i.e., evolving data statistics due to seasonality, policy shifts, or holidays), leading to slow adaptation and accumulated error. Informer, Reformer, Pyraformer, and FEDformer gain efficiency via attention sparsification or downsampling, which narrows effective context and increases the risk of missing weak yet critical long-range dependencies; frequency-domain pipelines further introduce phase and amplitude distortions when reconstructing to the time domain. LSTNet, LSTM, and TCN are constrained in modeling very long dependencies and cross-variable interactions: RNNs suffer gradient decay, TCNs remain locally biased and less robust to phase shifts, and LSTNet’s linear residual path under-represents nonlinear cross-variable relations, leading to underfitting.
KEDformer’s advantage stems from the integration of an explicit autocorrelation module and fidelity-oriented near-linear sparse attention. Autocorrelation explicitly retrieves repeated patterns and latent dependencies across variables and temporal lags, while sparse attention suppresses redundancy and preserves salient channels and dominant delays. Additionally, the MSTP decomposition jointly models local perturbations and global trends, strengthening representation of complex temporal structure. Together, these components—trained end-to-end—yield superior accuracy and robustness in multivariate forecasting.
4.4.2 Univariate results.
In the univariate forecasting setting, we selected the representative ETTm2 and Exchange datasets for evaluation, as they respectively exhibit strong periodic industrial characteristics and highly volatile financial behaviors. This complementary design enables a comprehensive assessment of the model’s robustness and generalization across diverse time series scenarios, with the results summarized in Table 5.
Compared with multiple baselines, KEDformer achieves overall state-of-the-art performance in long-term forecasting. Under the input-96-predict-336 configuration, it delivers the best results in 5 out of 8 cases for both Mean Squared Error (MSE) and Mean Absolute Error (MAE). To further enrich the baseline pool, LogTrans, DeepAR, Prophet, and ARIMA were additionally included in the univariate experiments. The results highlight their inherent limitations: LogTrans, while employing logarithmic sparse attention to reduce computational cost, fails to capture long-term dependencies and non-periodic patterns, leading to error accumulation over extended horizons; DeepAR, as an autoregressive RNN-based method, relies on step-by-step predictions, hindering parallelism and amplifying errors in long sequences; Prophet and ARIMA, constrained by linear decomposition and stationarity assumptions, lack adaptability to nonlinear dynamics and multi-scale temporal variations.
By contrast, KEDformer excels through its autocorrelation mechanism for enhanced global dependency modeling, sparse attention for effective redundancy filtering, and trend–seasonal decomposition module for precise temporal structure learning, thereby fully leveraging its representational advantages in low-dimensional sequences.
4.4.3 Ablation research.
To systematically evaluate the impact of the proposed Knowledge Extraction and Decomposition Attention (KEDA) module on model performance, we conducted a set of ablation experiments. Three model variants were designed for comparison:
- KEDformer: completely replaces the original self-attention and cross-attention mechanisms with KEDA;
- KEDformer V1: replaces only the self-attention mechanism with KEDA while retaining the original cross-attention;
- KEDformer V2: uses conventional attention mechanisms in both positions.
These models were evaluated on the Exchange and Weather datasets, and the results are shown in Table 6. KEDformer outperformed its counterparts in 14 out of 16 test cases, while KEDformer V1 showed improvements in only 2 cases. The results demonstrate the significant advantage of KEDA as a unified attention mechanism. The KEDA module integrates an autocorrelation mechanism with a sparse attention strategy: the autocorrelation component precisely captures key dependencies across time segments, enhancing the model’s temporal modeling capability, while the sparse attention mechanism sparsifies the attention weight matrix, effectively reducing interference from irrelevant positions.
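The sparsification step can be illustrated in isolation. The sketch below is our own minimal version, not the paper's KEDA code: it keeps the top-k weights in each query row of a dense attention matrix and renormalizes, which is the standard way such pruning suppresses interference from low-scoring positions.

```python
import numpy as np

def sparsify_attention(weights, k):
    """Keep only the k largest weights in each row of an attention
    matrix and renormalize, zeroing out low-scoring (noisy) positions."""
    L = weights.shape[-1]
    # indices of the k largest entries per row
    idx = np.argpartition(weights, L - k, axis=-1)[..., L - k:]
    mask = np.zeros_like(weights)
    np.put_along_axis(mask, idx, 1.0, axis=-1)
    sparse = weights * mask
    return sparse / sparse.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.random((4, 16))
attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # row softmax
sparse_attn = sparsify_attention(attn, k=4)
```

After pruning, each row remains a valid probability distribution over exactly k positions, so the downstream weighted sum of values is unchanged in form but attends only to the retained positions.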
5 Discussion
5.1 Time series decomposition effects on the model
The time series decomposition applied to the ETTm1 dataset has clearly revealed distinct seasonal and trend-cyclical patterns, enabling the KEDformer model to more accurately capture periodic variations and long-term trends within the data. By effectively separating seasonal patterns from trend components, the integration of decomposition has significantly enhanced the model’s predictive accuracy. Focusing on short-term fluctuations while retaining an understanding of long-term trends, the model is better equipped for accurate forecasting. As illustrated in Fig 5, these improvements can be attributed to several key factors: First, the explicit modeling of seasonal variations allows the model to more effectively adapt to recurring patterns, thereby strengthening its ability to project future values based on historical data. Second, the decomposition process helps identify significant features within the data, enabling the model to prioritize relevant information during the forecasting process.
In the left subfigure (a), the raw time series data is shown without decomposition, displaying interwoven fluctuations and trends. In contrast, the right subfigure (b) presents the time series decomposed into three components: the original time series in purple, the trend-cyclical component in beige, and the seasonal component in teal.
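A minimal moving-average decomposition of the kind used in such blocks can be sketched as follows. The kernel size and edge-padding scheme here are illustrative assumptions following the common Autoformer-style series-decomposition pattern, not KEDformer's exact configuration.

```python
import numpy as np

def series_decomp(x, kernel=25):
    """Split a series into a trend component (moving average with
    replicated-edge padding) and a seasonal component (the residual)."""
    pad = kernel // 2
    padded = np.concatenate([np.full(pad, x[0]), x, np.full(pad, x[-1])])
    trend = np.convolve(padded, np.ones(kernel) / kernel, mode="valid")
    seasonal = x - trend
    return seasonal, trend

# Linear trend plus a daily (period-24) cycle: the moving average over a
# window spanning roughly one period should recover the linear trend.
t = np.arange(400, dtype=float)
x = 0.01 * t + np.sin(2 * np.pi * t / 24)
seasonal, trend = series_decomp(x, kernel=25)
```

The two components sum back to the original series by construction, which is why decomposition blocks can be inserted repeatedly inside encoder and decoder layers without losing information.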
5.2 Effect of the number of KEDformer mechanisms on encoding and decoding
In this study, we conducted comparative experiments on the Exchange dataset, varying the number of KEDformer mechanisms. The results, illustrated in Fig 5, demonstrate that the model achieves superior performance when the number of KEDformer mechanisms in the decoding phase exceeds that in the encoding phase. This improvement can be attributed to the model’s enhanced ability to focus on the most informative features during decoding, allowing it to effectively capture dependencies between the predicted outputs and historical inputs. Conversely, performance declines when the number of KEDformer mechanisms in the decoding phase equals that in the encoding phase.
5.3 Effect of KEDformer on computational efficiency
We conducted experiments to evaluate the impact of increasing the number of KEDformer mechanisms on the computational efficiency of the model, as shown in Fig 6. The results demonstrate that the model achieves improved efficiency as the number of KEDformer mechanisms increases across various datasets, including ETTm1, ETTm2, and Weather. Notably, the time required for each epoch decreases significantly as the number of KEDformer mechanisms grows. The most pronounced improvement is observed on the ETTm1 dataset, where computation time drops from 794.0 seconds to 467.2 seconds. This enhancement can be attributed to the model’s improved ability to capture temporal dependencies and optimize resource utilization, enabling parallel processing and more effective distribution of the computational load.
In a comparative experiment that controls the number of KEDformer mechanisms during the encoding and decoding processes, we set the input length I = 96 and the prediction lengths .
5.4 Efficiency and performance analysis
In this study, we conducted a comprehensive analysis of the computational efficiency and predictive performance of models employing different self-attention mechanisms, with the results presented in Fig 7. On the Exchange dataset, the KEDformer model ranks second in memory usage (measured in GB), a result primarily attributable to its sparse attention mechanism. Sparse attention is one of the key techniques for reducing computational complexity and memory consumption: by computing attention only at key positions in the input sequence rather than globally over all positions, it reduces the complexity from the traditional O(L²) to O(L log L), significantly decreasing memory usage. KEDformer ranks third in running time but achieves the highest prediction accuracy. This superior performance is primarily attributable to its optimized knowledge extraction mechanism and seasonal-trend decomposition approach, which significantly enhance the model’s ability to capture key patterns in time series data. However, when dealing with time series that lack clear periodicity, performance may deteriorate, as the seasonal-trend decomposition can fail to extract relevant information. Additionally, an improper configuration of the number of KEDformer mechanisms can reduce computational efficiency and negatively impact the final prediction results. In this study, computational efficiency was assessed by measuring the time required for each model to complete a training epoch (in seconds), while predictive performance was evaluated using the Mean Squared Error (MSE) and Mean Absolute Error (MAE).
The input length is set to I = 96, and the prediction steps are . The time required for each epoch is used as an indicator of the model’s computational speed.
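The complexity reduction can be sketched as follows. This is a simplified ProbSparse-style illustration (in the spirit of Informer), not KEDformer's exact selection rule: only u = O(log L) "active" queries attend over all keys, while the remaining queries fall back to the mean of the values. For clarity the full score matrix is computed here; an efficient implementation samples keys when ranking queries, which is what yields the O(L log L) cost.

```python
import numpy as np

def prob_sparse_attention(Q, K, V, c=5):
    """Attend with only u = c*ln(L) 'active' queries (those whose score
    distribution deviates most from uniform); inactive queries fall back
    to the mean of V. Cost drops from O(L^2) toward O(L log L)."""
    L, d = Q.shape
    u = min(L, int(np.ceil(c * np.log(L))))
    scores = Q @ K.T / np.sqrt(d)               # computed fully here for clarity
    sparsity = scores.max(-1) - scores.mean(-1) # max-mean sparsity measurement
    active = np.argsort(sparsity)[-u:]          # u most "peaked" queries
    out = np.tile(V.mean(0), (L, 1))            # lazy-query fallback
    a = scores[active]
    a = np.exp(a - a.max(-1, keepdims=True))
    out[active] = (a / a.sum(-1, keepdims=True)) @ V
    return out, u

rng = np.random.default_rng(0)
L, d = 96, 8
Q, K, V = rng.normal(size=(3, L, d))
out, u = prob_sparse_attention(Q, K, V)
```

With L = 96 and c = 5, only about two dozen queries receive full attention, so the number of softmax rows grows logarithmically rather than linearly with the sequence length.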
5.5 Computation efficiency
In the multivariate setting and with the current optimal implementation of all methods, KEDformer achieves a significant enhancement in computational efficiency compared to conventional Transformer models. This improvement effectively addresses the challenges associated with the quadratic time complexity O(L²) and memory usage O(L²) inherent to standard self-attention mechanisms. By employing sparse attention and autocorrelation strategies, KEDformer reduces both the time complexity and memory usage to O(L log L), thereby enhancing the model’s capability to process long sequence data. As illustrated in Fig 7, KEDformer maintains its time and memory complexity while significantly improving prediction accuracy, enabling the model to handle longer sequences more efficiently. During the testing phase, KEDformer completes predictions in a single step, in contrast to traditional models that require O(L) steps, thereby substantially increasing its efficiency. As demonstrated in Table 7, KEDformer strikes a superior balance between computational efficiency and predictive accuracy, rendering it a practical solution for long-term time series forecasting tasks in resource-constrained environments.
6 Conclusion
This study proposes KEDformer, a novel and efficient framework for long-term time series forecasting tasks. By introducing a sparse attention mechanism, the model reduces the quadratic complexity of standard self-attention to near-linear complexity, thereby significantly improving processing speed for long sequences. In addition, KEDformer integrates seasonal–trend decomposition with an autocorrelation mechanism to jointly model short-term perturbations and long-term structures, effectively mitigating information loss and producing forecasts that better align with real-world temporal dynamics.
Extensive experiments on multiple public benchmark datasets demonstrate that KEDformer consistently outperforms mainstream Transformer-based models in terms of long-term prediction accuracy, stability, and generalization ability, highlighting its strong adaptability across diverse forecasting tasks. Notably, for typical periodic data, the decomposition and autocorrelation modules provide substantial modeling advantages; however, in certain non-periodic or highly volatile scenarios (e.g., financial exchange rates), the effectiveness of these mechanisms is relatively constrained. This limitation is not inherent to the model itself but reflects the varying sensitivities of different data structures to specific pattern extraction methods.
To further enhance the model’s generality and flexibility, future work will explore dynamic structure selection mechanisms based on data characteristics, enabling adaptive adjustment of modeling strategies. In addition, we plan to investigate more interpretable sparse attention strategies to improve the identification of critical features and redundancy filtering in non-periodic scenarios. These directions are expected to broaden the applicability of KEDformer to a wider range of time series modeling tasks.
Overall, KEDformer marks an important step forward in long-term time series forecasting, offering a promising pathway to address the trade-off between efficiency and accuracy under high-dimensional input conditions. For example, in healthcare monitoring scenarios, KEDformer can capture both long-term trends and short-term fluctuations in physiological signals, thereby assisting in disease risk prediction. This illustrates that its applicability extends beyond standard benchmark tasks and can be adapted to more complex real-world domains. With continued structural optimization and mechanism refinement, its scope is expected to expand to increasingly diverse and challenging forecasting applications.
References
- 1. Zhang C, Patras P. Long-term mobile traffic forecasting using deep spatio-temporal neural networks. In: Proceedings of the Eighteenth ACM International Symposium on Mobile Ad Hoc Networking and Computing. 2018. p. 231–40.
- 2. Maray N, Ngu AH, Ni J, Debnath M, Wang L. Transfer learning on small datasets for improved fall detection. Sensors (Basel). 2023;23(3):1105. pmid:36772148
- 3. Alam M, Ashraf Z, Singh P, Pandey B, Rehman K, Aldasheva L. Deep learning techniques for intrusion detection systems in healthcare environments. In: 2025 IEEE 14th International Conference on Communication Systems and Network Technologies (CSNT). 2025. p. 105–11. https://doi.org/10.1109/csnt64827.2025.10968425
- 4. Feng Z, Zhang Y, Chen Z, Ni J, Feng Y, Xie Y. Machine learning to assess and ensure safe drinking water supply: a systematic review. 2024.
- 5. Fu X, Zhang C, Xu Y, Zhang Y, Sun H. Statistical machine learning for power flow analysis considering the influence of weather factors on photovoltaic power generation. IEEE Trans Neural Netw Learn Syst. 2025;36(3):5348–62. pmid:38587954
- 6. Somu N, M R GR, Ramamritham K. A hybrid model for building energy consumption forecasting using long short term memory networks. Applied Energy. 2020;261:114131.
- 7. Fu X, Zhang C, Zhang X, Sun H. A novel GAN architecture reconstructed using bi-LSTM and style transfer for PV temporal dynamics simulation. IEEE Trans Sustain Energy. 2024;15(4):2826–9.
- 8. Gupta M, Wadhvani R, Rasool A. Comprehensive analysis of change-point dynamics detection in time series data: a review. Expert Systems with Applications. 2024;248:123342.
- 9. Lin Y, Koprinska I, Rana M. SSDNet: state space decomposition neural network for time series forecasting. In: 2021 IEEE International Conference on Data Mining (ICDM). IEEE; 2021. p. 370–8. https://doi.org/10.1109/icdm51629.2021.00048
- 10. Zhou L, Gao J. Temporal spatial decomposition and fusion network for time series forecasting. In: 2023 International Joint Conference on Neural Networks (IJCNN). 2023. p. 1–10. https://doi.org/10.1109/ijcnn54540.2023.10191934
- 11. Gao J, Song X, Wen Q, Wang P, Sun L, Xu H. RobustTAD: robust time series anomaly detection via decomposition and convolutional neural networks. arXiv preprint 2020. https://arxiv.org/abs/2002.09545
- 12. Asadi R, Regan AC. A spatio-temporal decomposition based deep neural network for time series forecasting. Applied Soft Computing. 2020;87:105963.
- 13. Dosovitskiy A. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint 2020. https://arxiv.org/abs/2010.11929
- 14. Ni J, Tang H, Shang Y, Duan B, Yan Y. Adaptive cross-architecture mutual knowledge distillation. In: 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG). 2024. p. 1–5. https://doi.org/10.1109/fg59268.2024.10581969
- 15. Devlin J. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint 2018. https://arxiv.org/abs/1810.04805
- 16. Wu H, Xu J, Wang J, Long M. Autoformer: decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems. 2021;34:22419–30.
- 17. Zhou H, Zhang S, Peng J, Zhang S, Li J, Xiong H, et al. Informer: beyond efficient transformer for long sequence time-series forecasting. AAAI. 2021;35(12):11106–15.
- 18. Beltagy I, Peters ME, Cohan A. Longformer: the long-document transformer. arXiv preprint 2020. https://arxiv.org/abs/2004.05150
- 19. Wang S, Li BZ, Khabsa M, Fang H, Ma H. Linformer: self-attention with linear complexity. arXiv preprint 2020. https://arxiv.org/abs/2006.04768
- 20. Jaszczur S, Chowdhery A, Mohiuddin A, Kaiser L, Gajewski W, Michalewski H. Sparse is enough in scaling transformers. Advances in Neural Information Processing Systems. 2021;34:9895–907.
- 21. Zhuang B, Liu J, Pan Z, He H, Weng Y, Shen C. A survey on efficient training of transformers. arXiv preprint 2023. https://arxiv.org/abs/2302.01107
- 22. Vaswani A. Attention is all you need. In: Advances in Neural Information Processing Systems. 2017.
- 23. Wang W, Liu Y, Sun H. TLNets: transformation learning networks for long-range time-series prediction. arXiv preprint 2023. https://arxiv.org/abs/2305.15770
- 24. Abbasimehr H, Paki R. Improving time series forecasting using LSTM and attention models. J Ambient Intell Human Comput. 2021;13(1):673–91.
- 25. Tay Y, Bahri D, Metzler D, Juan DC, Zhao Z, Zheng C. Synthesizer: rethinking self-attention for transformer models. In: International Conference on Machine Learning. 2021. p. 10183–92.
- 26. Lee-Thorp J, Ainslie J, Eckstein I, Ontanon S. FNet: mixing tokens with Fourier transforms. arXiv preprint 2021. https://arxiv.org/abs/2105.03824
- 27. Lim B, Arık SÖ, Loeff N, Pfister T. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting. 2021;37(4):1748–64.
- 28. Zhou T, Ma Z, Wen Q, Wang X, Sun L, Jin R. FEDformer: frequency enhanced decomposed transformer for long-term series forecasting. In: International Conference on Machine Learning. 2022. p. 27268–86.
- 29. Wu S, Xiao X, Ding Q, Zhao P, Wei Y, Huang J. Adversarial sparse transformer for time series forecasting. Advances in Neural Information Processing Systems. 2020;33:17105–15.
- 30. Qin Z, Wei B, Gao C, Zhu F, Qin W, Zhang Q. ACSAformer: a crime forecasting model based on sparse attention and adaptive graph convolution. Front Phys. 2025;13.
- 31. Qin Z, Wei B, Gao C, Chen X, Zhang H, In Wong CU. SFDformer: a frequency-based sparse decomposition transformer for air pollution time series prediction. Front Environ Sci. 2025;13:1549209.
- 32. Yu S, Peng J, Ge Y, Yu X, Ding F, Li S, et al. A traffic state prediction method based on spatial–temporal data mining of floating car data by using autoformer architecture. Computer Aided Civil Eng. 2024;39(18):2774–87.
- 33. Hong J-T, Bai Y-L, Huang Y-T, Chen Z-R. Hybrid carbon price forecasting using a deep augmented FEDformer model and multimodel optimization piecewise error correction. Expert Systems with Applications. 2024;247:123325.
- 34. Vartholomaios A, Karlos S, Kouloumpris E, Tsoumakas G. Short-term renewable energy forecasting in Greece using prophet decomposition and tree-based ensembles. In: International Conference on Database and Expert Systems Applications. 2021. p. 227–38. https://doi.org/10.48550/arXiv.2107.03825
- 35. Chen W, Yang Y, Liu J. A combination model based on sequential general variational mode decomposition method for time series prediction. arXiv preprint 2024. https://arxiv.org/abs/2406.03157
- 36. Kontopoulou VI, Panagopoulos AD, Kakkos I, Matsopoulos GK. A review of ARIMA vs. machine learning approaches for time series forecasting in data driven networks. Future Internet. 2023;15(8):255.
- 37. Wang X, Wang Y, Peng J, Zhang Z, Tang X. A hybrid framework for multivariate long-sequence time series forecasting. Appl Intell. 2022;53(11):13549–68.
- 38. Woo G, Liu C, Sahoo D, Kumar A, Hoi S. ETSformer: exponential smoothing transformers for time-series forecasting. arXiv preprint 2022. https://arxiv.org/abs/2202.01381
- 39. Niyogi D. A novel method combines moving fronts, data decomposition and deep learning to forecast intricate time series. arXiv preprint 2023. https://arxiv.org/abs/2303.06394
- 40. Zhong S, Song S, Zhuo W, Li G, Liu Y, Chan SHG. A multi-scale decomposition MLP-Mixer for time series analysis. arXiv preprint 2023. https://arxiv.org/abs/2310.11959
- 41. Ni J, Sarbajna R, Liu Y, Ngu AHH, Yan Y. Cross-modal knowledge distillation for vision-to-sensor action recognition. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. https://doi.org/10.1109/icassp43922.2022.9746752
- 42. Ni J, Ngu AHH, Yan Y. Progressive cross-modal knowledge distillation for human action recognition. In: Proceedings of the 30th ACM International Conference on Multimedia. 2022. p. 5903–12. https://doi.org/10.1145/3503161.3548238
- 43. Lee H, Lee C, Lim H, Ko S. TILDE-Q: a transformation invariant loss function for time-series forecasting. arXiv preprint 2022. https://arxiv.org/abs/2210.15050
- 44. Fu Y, Virani N, Wang H. Masked multi-step probabilistic forecasting for short-to-mid-term electricity demand. arXiv preprint 2023. https://arxiv.org/abs/2302.06818
- 45. Sanchis-Agudo M, Wang Y, Duraisamy K, Vinuesa R. Easy attention: a simple self-attention mechanism for transformers. arXiv preprint 2023. https://arxiv.org/abs/2308.12874
- 46. Hua Y, Zhao Z, Li R, Chen X, Liu Z, Zhang H. Deep learning with long short-term memory for time series prediction. IEEE Commun Mag. 2019;57(6):114–9.
- 47. Meenal R, Binu D, Ramya KC, Michael PA, Kumar KV, Rajasekaran E. Weather forecasting for renewable energy system: a review. Archives of Computational Methods in Engineering. 2022;29(5).
- 48. Kingma DP, Ba JL. Adam: a method for stochastic optimization. In: International Conference on Learning Representations. 2014.
- 49. Mustaqeem M, Mustajab S, Alam M, Jeribi F, Alam S, Shuaib M. A trustworthy hybrid model for transparent software defect prediction: SPAM-XAI. PLoS One. 2024;19(7):e0307112. pmid:38990978
- 50. Kitaev N, Kaiser Ł, Levskaya A. Reformer: the efficient transformer. arXiv preprint 2020. https://arxiv.org/abs/2001.04451
- 51. Liu S, Yu H, Liao C, Li J, Lin W, Liu AX, et al. Pyraformer: low-complexity pyramidal attention for long-range time series modeling and forecasting. In: International Conference on Learning Representations. 2022.
- 52. Li L, Wang K, Li S, Feng X, Zhang L. LST-Net: learning a convolutional neural network with a learnable sparse transform. In: European Conference on Computer Vision. Springer; 2020. p. 562–79.
- 53. Yu Y, Si X, Hu C, Zhang J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019;31(7):1235–70. pmid:31113301
- 54. Nie X, Zhou X, Li Z, Wang L, Lin X, Tong T. LogTrans: providing efficient local-global fusion with transformer and CNN parallel network for biomedical image segmentation. In: 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys). IEEE; 2022. p. 769–76.
- 55. Hewage P, Behera A, Trovati M, Pereira E, Ghahremani M, Palmieri F, et al. Temporal convolutional neural (TCN) network for an effective weather forecasting using time-series data from the local weather station. Soft Comput. 2020;24(21):16453–82.
- 56. Taylor SJ, Letham B. Forecasting at scale. The American Statistician. 2018;72(1):37–45.
- 57. Salinas D, Flunkert V, Gasthaus J, Januschowski T. DeepAR: probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting. 2020;36(3):1181–91.
- 58. Wu N, Green B, Ben X, O’Banion S. Deep transformer models for time series forecasting: the influenza prevalence case. arXiv preprint 2020.