
CL-Informer: Long time series prediction model based on continuous wavelet transform

  • Baijin Liu ,

    Roles Conceptualization, Data curation, Methodology, Visualization, Writing – original draft, Writing – review & editing

    1692797200@qq.com

    Affiliation Jilin Institute of Chemical Technology, Longtan, Jilin, Jilin, China

  • Zimei Li,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Writing – review & editing

    Affiliation Changchun Institute of Technology, Chaoyang, Changchun, Jilin, China

  • Zhanlin Li,

    Roles Visualization

    Affiliation Jilin Institute of Chemical Technology, Longtan, Jilin, Jilin, China

  • Cheng Chen

    Roles Visualization

    Affiliation Jilin Institute of Chemical Technology, Longtan, Jilin, Jilin, China

Abstract

Time series data, which record how quantities change over time, remain challenging to predict. To improve the accuracy of time series prediction, we propose a deep learning model, CL-Informer. CL-Informer extends the Informer model with an embedding layer based on the continuous wavelet transform, enabling the model to capture multi-scale characteristics of the data, and with LSTM layers that further capture dependencies in the data and process the redundant information introduced by the continuous wavelet transform. To demonstrate the reliability of the proposed CL-Informer model, we compare it with mainstream forecasting models such as Informer, Informer+, and Reformer on five datasets. Experimental results demonstrate that CL-Informer achieves an average reduction of 30.64% in MSE across various univariate prediction horizons and of 10.70% in MSE across different multivariate prediction horizons, improving on Informer's accuracy in long sequence prediction.

Introduction

Early research on the TSF problem was primarily based on classical mathematical models and algorithms rooted in statistical principles and assumptions, such as the auto-regressive (AR) [1], moving average (MA) [2], autoregressive integrated moving average (ARIMA) [3], and seasonal autoregressive integrated moving average (SARIMA) [4] models, among others. These models assume stationarity of the data and capture autocorrelation, moving averages, and seasonality by establishing lag values and residuals for time series data. However, traditional statistical methods are typically built upon linear model assumptions. In real-world time series data, non-linear trends, seasonality, and other intricate patterns may not be effectively captured by conventional statistical approaches.

To enhance the accuracy of time series predictions, machine learning-based methods have been extensively employed. The recurrent neural network (RNN) is a neural network well suited to processing sequences [5]; however, it encounters issues such as vanishing gradients, exploding gradients, and limited parallelism [6]. The Long Short-Term Memory network (LSTM) partially addresses the vanishing and exploding gradient problems of RNNs by incorporating gate mechanisms and cell states, enabling the network to capture a degree of long-term dependency within the processed sequence. Nevertheless, LSTM still struggles to capture long-term dependencies when dealing with extensive periods or complex dependency relationships within the sequence. The literature introduces a time series modeling method based on convolutional neural networks (CNN) [7], known as the temporal convolutional network (TCN). While TCN captures local patterns and dependencies through convolution operations on time series data, its structure may not adequately capture long-term dependencies, since TCN's convolution operations primarily focus on local neighborhoods [8]. Neither RNN-based nor TCN-based models explicitly model distant temporal dependencies or facilitate efficient information exchange between them.

The Transformer model (Vaswani et al., 2017) is a neural network architecture based on the self-attention mechanism [9], initially designed for natural language processing tasks such as machine translation. It eliminates the sequential processing constraint and enables parallel processing of sequence data. While self-attention has proven highly effective at capturing dependencies among elements, its computational complexity grows quadratically with sequence length. Various self-attention mechanisms have been proposed in recent years to address this issue. The LogSparse Transformer (Li et al., 2019) introduced LogSparse self-attention, which breaks the memory bottleneck by selecting elements at exponentially growing intervals [10]. Performer (Choromanski et al., 2020) is a Transformer-based acceleration model that utilizes a low-rank attention mechanism and random feature mapping to reduce computation and storage complexity [11]. Nyströmformer explores linear attention in a dual softmax form, reducing the computational complexity of self-attention through low-rank matrix approximation [12] and thereby improving the scalability and efficiency of the model to some extent. However, the Nyström method approximates large kernel matrices, which may increase computational cost when dealing with large datasets or long sequences. Informer (Zhou et al., 2021) introduces sparse self-attention to replace conventional self-attention [13], achieving O(L log L) time complexity and O(L log L) memory usage in dependency alignment. Although Informer demonstrates outstanding performance in capturing long-range dependencies with its self-attention mechanism, its use of distillation for performance improvement and compression can weaken or lose long-term dependencies in temporal sequences. Informer is also a Transformer-based network, so its ability to capture local dependencies is relatively weak.

We propose CL-Informer, an improved hybrid model based on Informer. To strengthen Informer's ability to capture local and long-range dependencies, we improve the model in two aspects: adding an embedding layer based on the continuous wavelet transform to the encoding of the input sequence, and adding two LSTM layers after the self-attention module while removing the 'distillation' operation [14]. Although distillation improves the model's training efficiency and generalization ability, it may hinder the LSTM's capture of long-term dependencies in time series and thus reduce predictive performance. By improving both local and long-range dependency modeling, the model can comprehensively capture correlations across different scales and ranges in a time series, improving its modeling ability and prediction accuracy. This makes the model more adaptable to various complex time series data, leading to better performance and more accurate results in practical applications.

The main contributions of this paper are as follows:

  1. We design an embedding layer, CWT, based on the continuous wavelet transform, through which the model can better learn and process data at different scales and improve prediction accuracy.
  2. We add LSTM layers after the sparse self-attention blocks of the encoder and decoder to further capture the long-range and local dependencies of the sequence, establishing the long-term forecasting model CL-Informer based on CWT and LSTM.
  3. Extensive experiments show that our CL-Informer model has better predictive performance in long-term series prediction.

Proposed deep learning model

Problem definition

Multivariate time series forecasting is a time series analysis method used to predict multiple future values or trends simultaneously. Typically, we have an input sequence X^t = {x^t_1, …, x^t_{Lx} | x^t_i ∈ R^{dx}}, and our goal is to forecast the output sequence Y^t = {y^t_1, …, y^t_{Ly} | y^t_i ∈ R^{dy}} for multiple consecutive future time points based on past data. Here, dx and dy represent the dimensions of the input and output variables, respectively. The model for Long Sequence Time-series Forecasting (LSTF) can be represented as Eq (1):

Ŷ^t = f(X^t; Λ) (1)

The predicted time series, denoted as Ŷ^t, is influenced by the hyperparameters Λ of the prediction model.
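As a concrete illustration, the window construction implied by this definition can be sketched as follows (a minimal NumPy sketch; the function name `make_windows` and the stride-1 traversal are illustrative assumptions, with Lx = 168 matching the input length used later in the experiments):

```python
import numpy as np

def make_windows(series, L_x=168, L_y=24):
    """Slice a 1-D series into (input, target) pairs with stride 1.

    Each input X^t covers L_x past steps; each target Y^t covers the
    next L_y steps, mirroring the LSTF mapping X^t -> Y^t above.
    """
    X, Y = [], []
    for t in range(len(series) - L_x - L_y + 1):
        X.append(series[t : t + L_x])
        Y.append(series[t + L_x : t + L_x + L_y])
    return np.array(X), np.array(Y)

series = np.arange(1000, dtype=float)
X, Y = make_windows(series)
print(X.shape, Y.shape)  # (809, 168) (809, 24)
```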

CL-Informer

The overall architecture of the CL-Informer model is similar to that of Informer, as shown in Fig 1. For long sequence time series forecasting tasks, we employ a two-stage process. The first stage is the Time Series Embedding Layer, and the second is the Encoder-Decoder. The Time Series Embedding Layer maps the original time series data into a low-dimensional continuous vector space. The encoder part of the model encodes the embedded representation of the time series data into a fixed-length vector, capturing the relationships between different time steps. Subsequently, the decoder part decodes the encoded vector into the predicted target sequence.

Model input.

The way the original time series is fed into CL-Informer differs from Informer. Informer places more emphasis on modeling time information: by introducing input embeddings and positional encoding, it encodes time series data to capture temporal and sequential information, allowing the model to perceive the order of time and local/global temporal dependencies within the sequence. In the embedding layer of CL-Informer, we additionally transform the time series data into time-frequency information using the continuous wavelet transform and encode it, enabling the model to capture more local and global dependencies.

As shown in Fig 2, a CWT coding layer is added to the time series embedding layer [15]. The continuous wavelet transform has multi-scale and time-frequency localization properties and has been widely used in signal processing and analysis. It can provide a more comprehensive characterization of a time series, especially for series with non-stationarity, transient characteristics, or frequency changes. Specifically, the time series embedding layer comprises position coding, time coding, input sequence value coding, and continuous wavelet transform coding. The calculation process can be written as Eq (2):

Embed(X) = Position(X) + Time(T) + Value(X) + CWT(X) (2)

thumbnail
Fig 2. The architecture of the time series embedding layer.

https://doi.org/10.1371/journal.pone.0303990.g002

In the equation, X represents the values of the input long time series, T represents the time information generated by the time series, Position(X) represents the positional encoding of the time series, Time(T) represents the encoding of the time information, Value(X) represents the encoding of the input time series values, and CWT(X) represents the encoding of the continuous wavelet transform of the time series.

The implementation process of the continuous wavelet embedding layer is shown in Fig 3. In this process, Cwt [16] denotes the computation of the continuous wavelet transform. We use the 'Morlet' wavelet as the wavelet basis to compute the transformation of the input time series in both the time and frequency domains at multiple scales. The specific calculation is given by Eqs (3) and (4):

W(a, b) = (1/√a) ∫ f(t) Ψ*((t − b)/a) dt (3)

Ψ(t) = π^(−1/4) e^(iω0·t) e^(−t²/2) (4)

In this context, a represents the scale factor, b represents the translation factor, f(t) denotes the time series as a function of time, and Ψ(t) is the 'Morlet' wavelet function, where ω0 is its central frequency. Applying Cwt to the time series yields a coefficient matrix W ∈ R^(s×Lx) at s scales, with Lx representing the length of the input sequence. W is obtained through scale transformation analysis of the time-frequency characteristics of the time series at different scales using the 'Morlet' wavelet. This matrix reflects the scale coefficients in the time-frequency domain, portraying the series' variations across different spatial and frequency scales and thereby extracting both local structures and global features. Subsequently, we aggregate along the scale dimension to merge the frequency-domain features obtained at different scales within the same time step, yielding Wt. Finally, the resulting multi-scale time-frequency features are encoded through a convolution block with a kernel of size 3, as represented by Eqs (5) and (6), ultimately yielding the continuous wavelet embedding CWT(X) for the time series:

Wt(b) = Σ_a W(a, b) (5)

CWT(X) = Conv1d(Wt) (6)
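The pipeline of Eqs (3)-(6) can be sketched numerically. The code below is an illustrative NumPy sketch, not the paper's implementation: the scale range, the central frequency ω0 = 6, and the simple averaging kernel standing in for the learned convolution block are all assumptions.

```python
import numpy as np

def morlet(t, w0=6.0):
    """Morlet wavelet Psi(t) = pi^(-1/4) e^(i*w0*t) e^(-t^2/2), as in Eq (4)."""
    return np.pi ** -0.25 * np.exp(1j * w0 * t) * np.exp(-t ** 2 / 2.0)

def cwt_morlet(f, scales):
    """Discretized Eq (3): W(a, b) = a^(-1/2) * sum_t f(t) Psi*((t - b) / a)."""
    L = len(f)
    t = np.arange(L)
    W = np.empty((len(scales), L), dtype=complex)
    for i, a in enumerate(scales):
        for b in range(L):
            W[i, b] = (f * np.conj(morlet((t - b) / a))).sum() / np.sqrt(a)
    return W  # coefficient matrix, shape (num_scales, L_x)

x = np.sin(np.linspace(0, 20 * np.pi, 168))       # test signal of length 168
W = cwt_morlet(x, scales=np.arange(1, 17))         # (16, 168) coefficients
w_t = np.abs(W).sum(axis=0)                        # aggregate over scales (Eq 5)
kernel = np.ones(3) / 3.0                          # stand-in for the learned conv
cwt_feat = np.convolve(w_t, kernel, mode="same")   # kernel-3 encoding (Eq 6 analogue)
print(W.shape, cwt_feat.shape)  # (16, 168) (168,)
```

In the actual model the width-3 convolution is a learned Conv1d projecting into the embedding dimension; the fixed averaging kernel here only demonstrates the shape of the computation.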

By selecting an appropriate wavelet basis function, the continuous wavelet transform can better capture signal characteristics and achieve accurate frequency analysis and time-frequency representation. To capture more time series features in the time-frequency domain, the transformer oil temperature datasets ETTh1, with a granularity of 1 hour, and ETTm1, with a granularity of 15 minutes, were selected for wavelet selection in this study. Both datasets were taken from the same transformer in a county in China and span two years.

This study selected three commonly used wavelet bases, the Morlet, Gaus, and Marr wavelets, to verify their ability to capture temporal features in CL-Informer. For the Gaus wavelet, the eighth-order basis from the Gaussian wavelet family was chosen. To compare the feature-capturing abilities of the three wavelet bases, the MSE for single-element prediction was compared on the ETTh2 dataset with an input length of 168 and prediction lengths of {24, 48, 168, 336, 720}.

Fig 4 shows that, compared with the Gaus8 and Morlet wavelet bases on the ETTh2 dataset, the model is relatively less sensitive to the Marr wavelet basis. The Gaussian and Morlet wavelet bases have a similar influence on the model's mean squared error. However, the support length of the Gaussian wavelet basis is longer and its computational complexity is higher. To reduce the computational complexity of the model, we compared Gaussian wavelet bases of different support lengths, evaluating their effect on MSE and MAE for single-element prediction on the ETTh1 dataset with an input length of 168 and a prediction length of 24.

thumbnail
Fig 4. Sensitivity of three wavelet bases to the model.

https://doi.org/10.1371/journal.pone.0303990.g004

From Fig 5, it can be observed that as the support length of the Gaus wavelet increases, the prediction performance gradually improves. Among them, the Gaus6 wavelet basis exhibits the best prediction performance and reduces the wavelet transformation’s complexity compared to the Gaus8 wavelet basis. Therefore, we will compare the Gaus6 wavelet basis with the Morlet wavelet basis in single-element and multi-element predictions on the ETTm1 dataset to select the wavelet basis with better prediction accuracy.

thumbnail
Fig 5. Gaussian wavelet prediction of different support degree.

https://doi.org/10.1371/journal.pone.0303990.g005

As can be seen from Fig 6, during training with an input length of 168 and prediction lengths of {24, 48, 96, 288, 672} on the ETTm1 dataset with a granularity of 15 minutes, CL-Informer employed both the Morlet and Gaus6 wavelets for the continuous wavelet transform. In most cases, the model using the Morlet wavelet exhibited lower mean squared error (MSE) and mean absolute error (MAE) than the model using the Gaus6 wavelet. This indicates that the Morlet wavelet achieves better accuracy in both single-element and multi-element predictions. Moreover, the computational complexity of the continuous Morlet transform is lower than that of Gaus6. Therefore, this study selected the Morlet wavelet as the basis function for the continuous transform.

thumbnail
Fig 6. Morlet and Gaus6 for single and multi-element prediction in ETTm1.

https://doi.org/10.1371/journal.pone.0303990.g006

Encoder-Decoder.

The main component of Informer is the self-attention mechanism, specifically the ProbSparse attention mechanism that reduces time complexity. ProbSparse attention creates a sparse matrix based on the KL divergence, selecting a few high-scoring dot products and taking the average of the other low-scoring dot products to reduce time complexity and memory usage. Compared to the regular Transformer's self-attention, it achieves a complexity of O(L log L). ProbSparse defines Q, K, and V as the query, key, and value matrices; however, it only includes the key-value pairs under the Top-u sparse queries, where the size of u is determined by a sampling parameter and d represents the corresponding dimensionality. The definition of ProbSparse attention is shown in Eq (7):

A(Q, K, V) = Softmax(Q̄Kᵀ / √d)V (7)

where Q̄ is a sparse matrix of the same size as Q that contains only the Top-u queries.
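A simplified sketch of this selection rule is shown below (illustrative NumPy code, not the library implementation: real ProbSparse samples the keys when scoring queries, while this version scores them exactly, and u is set here to c·ln L with an assumed sampling factor c = 5):

```python
import numpy as np

def probsparse_attention(Q, K, V, u):
    """Sketch of ProbSparse attention: score every query with the
    max-minus-mean sparsity measure, run full softmax attention only
    for the Top-u queries, and assign the remaining "lazy" queries the
    mean of V, matching the averaging of low-scoring dot products
    described in the text.
    """
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                  # (L_Q, L_K) scaled dot products
    M = S.max(axis=1) - S.mean(axis=1)        # sparsity measure per query
    top = np.argsort(M)[-u:]                  # Top-u "active" queries
    out = np.tile(V.mean(axis=0), (Q.shape[0], 1))  # lazy queries -> mean(V)
    A = np.exp(S[top] - S[top].max(axis=1, keepdims=True))
    out[top] = (A / A.sum(axis=1, keepdims=True)) @ V  # softmax rows
    return out

rng = np.random.default_rng(0)
L, d = 96, 16
Q, K, V = rng.normal(size=(L, d)), rng.normal(size=(L, d)), rng.normal(size=(L, d))
out = probsparse_attention(Q, K, V, u=int(np.ceil(np.log(L))) * 5)
print(out.shape)  # (96, 16)
```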

As shown in Fig 1, we also adopt the ProbSparse self-attention mechanism in the CL-Informer model, but we delete the distillation module and add LSTM layers. Because we add the continuous wavelet transform to the model input to enhance its local and global dependencies, the input also picks up noise and some redundant information. The LSTM layer can control the transfer of information through its gated units, and its memory cell can effectively capture and store long-term memory, selectively transferring and forgetting information as needed.

As shown in Fig 7, the LSTM [17] memory unit can selectively forget the information redundancy introduced by the continuous wavelet transform coding and further extract the long-range dependencies of the input time series, giving the model better robustness. The definition of the LSTM at time t is as follows:

f_t = σ(W_f[h_(t−1), x_t] + b_f) (8)

i_t = σ(W_i[h_(t−1), x_t] + b_i) (9)

o_t = σ(W_o[h_(t−1), x_t] + b_o) (10)

c̃_t = tanh(W_c[h_(t−1), x_t] + b_c) (11)

c_t = f_t ⊙ c_(t−1) + i_t ⊙ c̃_t (12)

h_t = o_t ⊙ tanh(c_t) (13)

To better exploit the long-term dependencies learned by the LSTM, the output of the updated LSTM unit is represented as follows:

y_t = W_y tanh(W_d h_t + b_d) (14)

where W_f, W_i, W_o, W_c, W_d, and W_y, together with b_f, b_i, b_o, b_c, and b_d, are the learnable parameters; c_t and h_t are the cell state and hidden layer state variables; tanh is the hyperbolic tangent activation function; and σ is the sigmoid activation function.
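A minimal NumPy sketch of the gate-by-gate update in Eqs (8)-(13) is given below (illustrative dimensions and random weights; not the trained model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM cell update following Eqs (8)-(13): gates computed from
    the concatenated [h_{t-1}, x_t], then cell and hidden state updates."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(p['Wf'] @ z + p['bf'])        # forget gate, Eq (8)
    i = sigmoid(p['Wi'] @ z + p['bi'])        # input gate,  Eq (9)
    o = sigmoid(p['Wo'] @ z + p['bo'])        # output gate, Eq (10)
    c_hat = np.tanh(p['Wc'] @ z + p['bc'])    # candidate,   Eq (11)
    c = f * c_prev + i * c_hat                # cell state,  Eq (12)
    h = o * np.tanh(c)                        # hidden,      Eq (13)
    return h, c

rng = np.random.default_rng(1)
dx, dh = 8, 16                                # toy input / hidden sizes
p = {k: rng.normal(scale=0.1, size=(dh, dh + dx)) for k in ('Wf', 'Wi', 'Wo', 'Wc')}
p.update({k: np.zeros(dh) for k in ('bf', 'bi', 'bo', 'bc')})
h, c = np.zeros(dh), np.zeros(dh)
for t in range(168):                          # run over an input window
    h, c = lstm_step(rng.normal(size=dx), h, c, p)
print(h.shape, c.shape)  # (16,) (16,)
```

Because h_t = o_t ⊙ tanh(c_t) with o_t ∈ (0, 1), every component of the hidden state stays strictly inside (−1, 1).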

In the CL-Informer model, we use a 2-layer LSTM to further capture long-term dependencies in the time series; to reduce computational complexity, we reduce the number of encoder layers to one, lowering the spatial complexity of the computation. Finally, the output is obtained through a fully connected layer for result prediction.

Results

Experiment

Implementation step.

Using the ADAM optimizer with an initial learning rate of 1e−4, the model was trained for 10 epochs with a batch size of 32, with early stopping applied during training. All experiments were repeated 10 times and conducted on an NVIDIA RTX 4070 GPU with 12 GB of memory. CL-Informer differs from Informer and InformerStack in having only one encoder layer and an eight-head attention decoder layer. The prediction windows are consistent across all configurations based on the dataset's timestamps; the input sequence length for the CL-Informer encoder is set to 168, and the start token (label length) for the decoder is fixed at 48. Two evaluation metrics, MSE and MAE, were employed for each prediction window, with a stride of one applied to traverse the entire dataset.
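The two evaluation metrics follow their standard definitions; the snippet below (the example numbers are purely illustrative) shows how they are computed over a prediction window:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over every point in every prediction window."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error over every point in every prediction window."""
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 5.0])
print(mse(y_true, y_pred), mae(y_true, y_pred))  # 0.5625 0.625
```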

Data.

To evaluate CL-Informer, we conducted experiments using multiple datasets. To explore the model's long-dependency enhancement effect, we used the five datasets mentioned in Zhou et al. (2021).

  1. ETTh and ETTm are datasets of two different granularities containing the load characteristics of seven oil and power transformers from July 2016 to July 2018. ETTh comprises the two 1-hour-granularity datasets {ETTh1, ETTh2}, and ETTm comprises the 15-minute-granularity dataset ETTm1.
  2. The Weather dataset contains local climate data for nearly 1,600 locations in the United States between 2010 and 2013. Data from each site were collected hourly over a total of four years. The dataset contains the target value "wet bulb" and 11 climate features.
  3. The ECL dataset collected the hourly electricity consumption of 321 customers from 2012 to 2014. We split the datasets following Zhou et al. (2021): for the ETT datasets, the training/validation/test sets contain 12/4/4 months of data; for the Weather dataset, the split is 28/10/10 months; and for ECL, it is 15/3/4 months.

Baselines.

For time series prediction, we chose six models to compare with CL-Informer: two RNN-based models, LSTMa (Bahdanau et al., 2015) [18] and LSTNet (Lai et al., 2018) [19]; DeepAR [20], a neural network based on autoregressive recurrence (Flunkert et al., 2017); and the Transformer-based Reformer (Kitaev, Kaiser, and Levskaya, 2020) [21], LogSparse Transformer (Li et al., 2019), and Informer (Zhou et al., 2021). To better explore CL-Informer's improvement in long-time-series prediction accuracy, we also added a normalized self-attention variant (Informer+) to the experiments for comparison.

Results and analysis

As shown in Fig 8, with an input length of 168 and prediction lengths extending to {24, 48, 96, 288, 672}, the CL-Informer model exhibits lower mean squared error (MSE) and mean absolute error (MAE) values than the Informer model in both single-element and multi-element predictions on the ETTm1 dataset. This indicates that the introduced measures, namely the CWT embedding layer, the removal of distillation, and the incorporation of LSTM, enhance the model's prediction accuracy.

thumbnail
Fig 8. MSE and MAE graphs of different prediction lengths of CL-Informer and Informer under ETTm1.

https://doi.org/10.1371/journal.pone.0303990.g008

To thoroughly compare CL-Informer with Informer and other time series models, we conducted comprehensive multi-element and single-element prediction experiments on different datasets and under different settings. For all datasets, the input length is 168; the prediction lengths for ETTh1, ETTh2, and Weather are {24, 48, 168, 336, 720}, for ETTm1 {24, 48, 96, 288, 672}, and for ECL {48, 168, 336, 720, 960}. The best results for each forecast horizon are highlighted in bold.

Single-element time prediction.

Table 1 shows that CL-Informer outperforms all models in almost all cases, performing only slightly worse than DeepAR on a few datasets. For example, in the input-168-predict-720 setting, compared to the previous state-of-the-art results, CL-Informer achieves a 37.9% reduction in MSE for ETTh1 (0.269→0.160), a 10.2% reduction for ETTh2 (0.277→0.249), a 34.8% reduction for Weather (0.359→0.234), and a 37.1% reduction for ECL (0.540→0.385). Additionally, in the input-168-predict-672 setting, CL-Informer achieves a 57.0% reduction in MSE for ETTm1 (0.512→0.220). Overall, CL-Informer exhibits an MSE reduction of 35.4% across these settings.

thumbnail
Table 1. Data set (5 cases) single element long series time prediction results.

https://doi.org/10.1371/journal.pone.0303990.t001

Furthermore, we can observe that CL-Informer demonstrates good stability as the prediction length increases, particularly in longer prediction horizons. This indicates that CL-Informer is more suitable for long-term time series forecasting, which is significant for real-world applications like long-term energy consumption planning and weather forecasting.

Multi-element time prediction.

Table 2 shows the results on four typical datasets. Our model performs well in the long-term prediction task compared to other models. For example, in the input-168-predict-336 setting, MSE decreased by 5.6% (1.128→1.068) on ETTh1 and by 8.9% (0.702→0.639) on Weather compared with the previous state-of-the-art results; ECL was reduced by 24.9% (0.381→0.286); and in the input-168-predict-288 setting, the MSE on ETTm1 was reduced by 15.3% (1.056→0.894). Overall, the MSE of CL-Informer in the aforementioned settings decreased by 13.6%. We find that, in multi-element time prediction, CL-Informer's long time series prediction is better than that of RNN-based time series models and some Transformer-based time series models.

thumbnail
Table 2. Data set (4 cases) multi-element long series time prediction results.

https://doi.org/10.1371/journal.pone.0303990.t002

CL-Informer predicted effect.

The two panels of Fig 9 show the single-element prediction effect of CL-Informer on ETTm1. Since CL-Informer is a multi-step prediction model, we randomly select the predicted and actual values for a single step to compare. Fig 9(a) shows the curves of the actual and predicted values on the ETTm1 dataset under the CL-Informer input-168-predict-288 setting. Fig 9(b) shows the actual and predicted values within a single prediction window. From Fig 9, we can see that CL-Informer performs well when predicting long series.

thumbnail
Fig 9. Prediction effect slice of CL-Informer under ETTm1.

https://doi.org/10.1371/journal.pone.0303990.g009

Discussion

To verify CL-Informer's enhancement of the Informer model, we conducted additional ablation experiments on ETTh1, comparing the change in MSE and MAE of the C-Informer model after removing the LSTM coding layer. Tables 1 and 2 compare CL-Informer with Informer and the mainstream models LogTrans, Reformer, LSTMa, and LSTNet; in this ablation experiment, we compare our approach against Informer. In Table 3, we compare the MSE and MAE performance of CL-Informer, C-Informer, and Informer with input lengths {288, 576} and prediction lengths {720, 1440}. For a fair comparison, C-Informer, like CL-Informer, kept the encoder at one layer, and the other hyperparameters were unchanged. The ETTh1 dataset was used for this model ablation.

thumbnail
Table 3. Single element predictive ablation experiment of CL-Informer.

https://doi.org/10.1371/journal.pone.0303990.t003

From Table 3, it can be observed that removing the LSTM layer resulted in inferior performance in terms of MSE and MAE compared to including the LSTM layer. The average MAE increased by 6.4%, indicating poorer performance without the LSTM layer. Furthermore, the C-Informer, which only includes the CWT embedding layer, outperformed the Informer model in terms of MSE and MAE, with an average decrease in MAE of 20.6%. As the LSTM and CWT embedding layers are gradually removed, the model’s performance continuously deteriorates. This indicates that by adding the CWT embedding layer along with the LSTM layer while removing distillation operations, the model’s predictive accuracy is indeed improved. Fig 10 also demonstrates that as the CWT embedding layer and LSTM layer are added to the Informer model, the error consistently decreases, indicating that CL-Informer exhibits good predictive performance.

thumbnail
Fig 10. Error comparison as the CWT embedding layer and LSTM layer are added to Informer.

https://doi.org/10.1371/journal.pone.0303990.g010

Although capturing more scale of time series features through continuous wavelet transform has enabled CL-Informer to outperform the Informer model in predictive accuracy across multiple datasets, further research is needed regarding selecting wavelet basis functions and extracting time-frequency domain features.

Conclusion

This paper studies the problem of long-sequence time series prediction, and an improved model for long-sequence prediction based on Informer, CL-Informer, is proposed. Specifically, to improve the model's prediction accuracy on long series, we embed a continuous wavelet transform (CWT) layer in the embedding stage and extract multi-scale features and dependencies from the time series through the CWT. On this basis, we remove the 'distillation' operation from the encoder-decoder, add a long short-term memory network (LSTM), and reduce the number of encoder layers to one, finally obtaining the prediction model CL-Informer.

By comparing with other experimental models, it has been demonstrated that CL-Informer is more effective in discovering predictive dependencies. Additionally, under various experimental settings, CL-Informer consistently achieves better predictive performance. For single-variable prediction horizons, the CL-Informer model exhibited an average decrease of 30.64% in Mean Squared Error (MSE), while for multi-variable predictions, the reduction was 10.70%. This further validates the effectiveness and stability of the proposed model.

Supporting information

S1 Fig. CWT diagram of the embedding layer architecture.

Show the composition and execution flow of CWT embedding layer.

https://doi.org/10.1371/journal.pone.0303990.s001

(VSDX)

S2 Fig. Univariate line comparison plot between the Morlet and Gaus6 wavelet bases on the ETTm1 dataset.

Line plots of MSE and MAE for both wavelet bases.

https://doi.org/10.1371/journal.pone.0303990.s002

(VSDX)

S3 Fig. Multivariate line comparison plot between the Morlet and Gaus6 wavelet bases on the ETTm1 dataset.

Line plots of MSE and MAE for both wavelet bases.

https://doi.org/10.1371/journal.pone.0303990.s003

(VSDX)

S1 Table. CL-Informer compared to Informer in univariate accuracy optimization percentage.

MSE and MAE improvements of CL-Informer compared with Informer on five datasets.

https://doi.org/10.1371/journal.pone.0303990.s004

(XLSX)

S2 Table. CL-Informer compared to Informer in multivariate accuracy optimization percentage.

MSE and MAE improvements of CL-Informer compared with Informer on four datasets.

https://doi.org/10.1371/journal.pone.0303990.s005

(XLSX)

S1 File. The five kinds of data sets used in the experiments.

https://doi.org/10.1371/journal.pone.0303990.s006

(7Z)

Acknowledgments

The author expresses gratitude to Zimei Li.

References

  1. Dickey David A, Fuller Wayne A. Distribution of the Estimators for Autoregressive Time Series with a Unit Root. Journal of the American Statistical Association. 1979;74(366a):427–431.
  2. Alevizakos Vasileios, Chatterjee Kashinath, Koukouvinos Christos. The triple exponentially weighted moving average control chart. Quality Technology & Quantitative Management. 2021;18(3):326–354.
  3. Svetunkov Ivan, Boylan John E. State-space ARIMA for supply-chain forecasting. International Journal of Production Research. 2020;58(3):818–827.
  4. Zhang G Peter. Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing. 2003;50:159–175.
  5. Sherstinsky Alex. Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) network. Physica D: Nonlinear Phenomena. 2020;404:132306.
  6. Greff Klaus, Srivastava Rupesh K, Koutnik Jan, Steunebrink Bas R, Schmidhuber Jurgen. LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems. 2017;28(10):2222–2232.
  7. Chua Leon O. CNN: A Vision of Complexity. International Journal of Bifurcation and Chaos. 1997;07(10):2219–2425.
  8. Wang Daheng, Shiralkar Prashant, Lockard Colin, Huang Binxuan, Dong Xin Luna, Jiang Meng. TCN: Table Convolutional Network for Web Table Interpretation. In: Proceedings of the Web Conference 2021. WWW'21. New York, NY, USA: Association for Computing Machinery; 2021. p. 4020–4032. Available from: https://doi.org/10.1145/3442381.3450090.
  9. Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N, et al. Attention Is All You Need. CoRR. 2017;abs/1706.03762.
  10. Li Shiyang, Jin Xiaoyong, Xuan Yao, Zhou Xiyou, Chen Wenhu, Wang Yu-Xiang, et al. Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting. In: Advances in Neural Information Processing Systems. vol. 32. Curran Associates, Inc.; 2019. Available from: https://proceedings.neurips.cc/paper/2019/hash/6775a0635c302542da2c32aa19d86be0-Abstract.html.
  11. Choromanski Krzysztof Marcin, Likhosherstov Valerii, Dohan David, Song Xingyou, Gane Andreea, Sarlós Tamás, et al. Rethinking Attention with Performers. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net; 2021. Available from: https://openreview.net/forum?id=Ua6zuk0WRH.
  12. Xiong Yunyang, Zeng Zhanpeng, Chakraborty Rudrasis, Tan Mingxing, Fung Glenn, Li Yin, et al. Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention. Proceedings of the AAAI Conference on Artificial Intelligence. 2021;35(16):14138–14148.
  13. Zhou Haoyi, Zhang Shanghang, Peng Jieqi, Zhang Shuai, Li Jianxin, Xiong Hui, et al. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting; 2021. Available from: http://arxiv.org/abs/2012.07436.
  14. Aasim, Singh S N, Mohapatra Abheejeet. Repeated wavelet transform based ARIMA model for very short-term wind speed forecasting. Renewable Energy. 2019;136:758–768.
  15. ALTobi Maamar Ali Saud, Bevan Geraint, Wallace Peter, Harrison David, Ramachandran K P. Fault diagnosis of a centrifugal pump using MLP-GABP and SVM with CWT. Engineering Science and Technology, an International Journal. 2019;22(3):854–861.
  16. Stockwell R G, Mansinha L, Lowe R P. Localization of the complex spectrum: the S transform. IEEE Transactions on Signal Processing. 1996;44(4):998–1001.
  17. Van Houdt Greg, Mosquera Carlos, Nápoles Gonzalo. A review on the long short-term memory model. Artificial Intelligence Review. 2020;53(8):5929–5955.
  18. Bahdanau Dzmitry, Cho Kyunghyun, Bengio Yoshua. Neural Machine Translation by Jointly Learning to Align and Translate. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings; 2015. Available from: http://arxiv.org/abs/1409.0473.
  19. Lai Guokun, Chang Wei-Cheng, Yang Yiming, Liu Hanxiao. Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks; 2017. Available from: https://arxiv.org/abs/1703.07015v3.
  20. Flunkert Valentin, Salinas David, Gasthaus Jan. DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks. CoRR. 2017;abs/1704.04110.
  21. Kitaev Nikita, Kaiser Lukasz, Levskaya Anselm. Reformer: The Efficient Transformer. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net; 2020. Available from: https://openreview.net/forum?id=rkgNKkHtvB.