
Energy consumption prediction using the GRU-MMattention-LightGBM model with features of Prophet decomposition

Abstract

The prediction of energy consumption is of great significance to the stability of the regional energy supply. In previous research on energy consumption forecasting, researchers have continually proposed improved neural network prediction models or improved machine learning models to predict time series data. Combining machine learning and neural network models that perform well in energy consumption prediction, we propose a hybrid model architecture, GRU-MMattention-LightGBM, with feature selection based on Prophet decomposition. During the prediction process, the Prophet features are first extracted from the original time series. We select the best LightGBM model on the training set and save the best parameters. Then, the Prophet features are input to GRU-MMattention for training. Finally, an MLP is used to learn the final prediction weights between LightGBM and GRU-MMattention. After the prediction weights are learned, the final prediction result is determined. The innovation of this paper is a structure that learns the internal correlation between features, based on Prophet feature extraction combined with gating and attention mechanisms. The structure also inherits the strong anti-noise ability of the LightGBM method, which reduces the impact of energy consumption mutation points on the overall prediction effect of the model. In addition, we propose a simple method to select the time window length hyperparameter using ACF and PACF diagrams. The MAPE of the GRU-MMattention-LightGBM model is 1.69%, and the relative error is 8.66% less than that of the GRU structure and 2.02% less than that of the LightGBM prediction. Compared with a single method, the prediction accuracy and stability of this hybrid architecture are significantly improved.

1. Introduction

Short-term energy consumption forecasting estimates, analyzes, and anticipates the future development of an energy system, mainly by constructing a mathematical model that reflects the internal activities of the energy system and its external connections. The prediction and accurate control of energy consumption have become important for energy savings and emission reduction. Forecasting short-term energy consumption data helps ensure the stable operation of the system and supports the scheduling of energy consumption plans. Constructing an energy prediction model is therefore a comprehensive modeling task.

The energy consumption forecasting problem can be regarded as a traditional time series forecasting problem. Luthuli et al. [1] compared conventional energy prediction models and proposed time series decomposition with ANN feature extraction; on this basis, they used the machine learning SVM method to predict energy consumption. However, the residual component was discarded and only trend and seasonality were predicted, so the information in the series was not fully exploited. Jain et al. [2] proposed an SVM model based on time window features to predict energy consumption, but its prediction results were not accurate enough. To capture the nonlinear relationships in time series, Chen, YB et al. [3] proposed support vector regression (SVR) and extreme learning machines (ELMs) to predict energy consumption. Sungwoo Park et al. [4] applied LightGBM to energy prediction based on sliding time window features, which exploited the advantages of ensemble learning and achieved good prediction results.

In addition, neural network models also offer good nonlinear prediction performance and can flexibly adjust feature construction for multistep prediction. Pei, SQ et al. [5] used long short-term memory (LSTM) for multistep forecasting of energy load. Sehovac, L et al. [6] simplified the LSTM architecture and used the more concise GRU structure, which also achieved good results in energy consumption prediction. Jarábek, T et al. [7] used a recurrent neural network (RNN) containing a decoder to further improve model performance. Atef, S et al. [8] added a bidirectional structure to the gating mechanism, and the bidirectional LSTM made energy consumption prediction more accurate. However, RNN-based models are sensitive to different random weight initializations and thus prone to overfitting. Therefore, a single network structure lacks predictive stability and generalization ability.

Although these single-output forecasting models have demonstrated good forecasting accuracy, they are limited in handling uncertainties such as sudden load changes. To address these problems, many studies have begun to combine traditional time series models, machine learning models, and neural network models. Park et al. [9] proposed a two-stage STLF model: the first stage used extreme gradient boosting (XGB) and random forest (RF) prediction methods, and the second stage combined their predictions using multiple linear regression (MLR) models based on sliding windows. Without sufficient subsampling, randomized trees can lead to overfitting [10]. Xie, Y et al. [11] proposed a two-stage prediction model combining ATT-BiLSTM and MLP: first, a 24-hour prediction was made with ATT-BiLSTM, and then the results were input into an MLP as new features to obtain the final prediction. However, their deep neural network model is also sensitive to different random weight initializations, which makes it unstable, and it is not combined with machine learning methods. Yuxuan L et al. [12] proposed a method combining a pretrained GRU and LightGBM. However, the GRU is only used for feature extraction in their work, the periodicity of energy consumption data is not well captured, and GRU-attention has higher prediction accuracy than GRU alone. Jung S et al. [13] constructed a hybrid model of empirical mode decomposition (EMD) and gated recurrent units (GRUs) based on an attention mechanism. Bu S J et al. [14] proposed an attention-based neural network model to predict energy consumption; this method uses only an attention structure, has a poor ability to learn the periodic characteristics of power consumption, and its accuracy is not as good as that of GRU-attention.

LightGBM, as a single machine learning model, has strong predictive ability on general time series, but its accuracy in predicting periodic time series is still inadequate. Deep learning methods such as LSTM, GRU and other RNN-based models rely heavily on the choice of initialization parameters, and their stability and accuracy are not good enough. A single GRU or a single attention mechanism is not as good at predicting periodic time series as a combination of the two. Current research cannot solve these three key problems simultaneously. Based on simple time series decomposition, we propose a feature selection method based on Prophet decomposition. The gating-attention mechanism can learn the internal correlation structure between features. The ensemble learning structure has a strong anti-noise ability and can reduce the impact of mutation points on the overall prediction effect of the model. We therefore propose a hybrid model architecture, GRU-MMattention-LightGBM, which inherits the advantages of each single architecture and has a more stable prediction effect than any single model.

2. Model

2.1 Prophet decomposition

Prophet was developed by Facebook’s data science team in 2017 (Taylor S J et al. [15]). It uses a decomposable time series model (Chung et al. [16]) with three main components: trend, seasonality, and residuals. Although the influences on a time series are complex, all series fluctuations can be decomposed into four parts: trend factors $T_t$, cyclic fluctuations $C_t$, seasonal changes $S_t$, and residuals $R_t$.

Harvey et al. [17] simplified the above four items into three ($T_t$, $S_t$, $R_t$). In this paper, we combine the characteristics of energy consumption data and make the following simplification in the time series decomposition model, as given by Eq (1).

$x_t = T_t + S_t + R_t$ (1)

In the Prophet model, the trend term $T_t$ has two variants, namely, a piecewise model based on linear regression and a saturated growth model based on logistic regression. The linear part is shown in Eq (2).

$T_t = (k + \alpha(t)^{T}\delta)\,t + (m + \alpha(t)^{T}\gamma)$ (2)

where k represents the growth rate of the model, δ is the change in k, m is the offset parameter, t is the timestamp, α(t) is the indicator function, $\alpha(t)^{T}$ is the transpose of α(t), and γ is the offset of the smoothing process, whose function is to make the piecewise function continuous.

The expression of the saturated growth model based on logistic regression is Eq (3).

$T_t = \dfrac{C}{1 + \exp(-k(t - m))}$ (3)

where C represents the model carrying capacity, k represents the growth rate, and m is the offset parameter. When the rate k is adjusted, the offset parameter must also be adjusted. The piecewise logistic growth model is then given by Eq (4).

$T_t = \dfrac{C(t)}{1 + \exp\!\left(-(k + \alpha(t)^{T}\delta)\,(t - (m + \alpha(t)^{T}\gamma))\right)}$ (4)

The Prophet model can incorporate trend changes by setting change points that alter the growth rate. Assuming that for timestamp t there are n change points $s_1, \dots, s_n$, then $\alpha_j(t) = 1$ if $t \ge s_j$ and $\alpha_j(t) = 0$ otherwise. Additionally, for a change point $s_j$, its offset $\gamma_j$ is set to $-s_j\delta_j$ so that the piecewise trend remains continuous. In the trend generation model, there are S change points over a history of length T, and the rate change at each change point is drawn as $\delta_j \sim \mathrm{Laplace}(0, \tau)$.

The seasonal and residual terms can be represented as Eq (5) and Eq (6).

$S_t = \sum_{n=1}^{N}\left(a_n \cos\dfrac{2\pi n t}{P} + b_n \sin\dfrac{2\pi n t}{P}\right)$ (5)

$R_t = x_t - T_t - S_t$ (6)

Zarnowitz V et al. [18] pointed out that trend, season, and residual are interrelated rather than mutually independent and established a model of interdependence among the three. Therefore, we propose the following more general assumptions, as given by Eq (7), Eq (8), and Eq (9).

$T_t = f_1(T_{t-1},\dots,T_{t-k},\,S_{t-1},\dots,S_{t-k},\,R_{t-1},\dots,R_{t-k})$ (7)

$S_t = f_2(T_{t-1},\dots,T_{t-k},\,S_{t-1},\dots,S_{t-k},\,R_{t-1},\dots,R_{t-k})$ (8)

$R_t = f_3(T_{t-1},\dots,T_{t-k},\,S_{t-1},\dots,S_{t-k},\,R_{t-1},\dots,R_{t-k})$ (9)

$f_1$, $f_2$, and $f_3$ are different nonlinear functions, expressed here in implicit form. Each of $T_t$, $S_t$, and $R_t$ is therefore determined by the joint action of $T_{t-i}$, $S_{t-i}$, and $R_{t-i}$ for $i \in [1, k]$, $i \in \mathbb{N}$. The structure of the feature set based on Prophet decomposition is thus given by Eq (10).

$F = [T_{t-1},\dots,T_{t-k},\,S_{t-1},\dots,S_{t-k},\,R_{t-1},\dots,R_{t-k}]$ (10)

F contains the trend, seasonality, and residual of the first K periods of the energy consumption sequence. Together, they serve as the feature inputs to the predictive model. Finally, $f_1$, $f_2$, and $f_3$ are learned and $x_t$ is predicted through the following hybrid architecture.
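As an illustration of this feature construction, the following is a minimal sketch (not the authors' code) that decomposes an hourly series with the Prophet package and builds K lagged copies of the trend, seasonal, and residual components as in Eq (10). The dataframe layout, the combined-seasonality shortcut, and the helper name prophet_features are assumptions.

```python
# Minimal sketch of Prophet-based feature construction (Eq (10)); assumes a
# dataframe with Prophet's standard columns "ds" (timestamp) and "y" (consumption).
import pandas as pd
from prophet import Prophet

def prophet_features(df: pd.DataFrame, k: int = 3) -> pd.DataFrame:
    m = Prophet(daily_seasonality=True, yearly_seasonality=True)
    m.fit(df)
    comp = m.predict(df[["ds"]])
    feats = pd.DataFrame({
        "trend": comp["trend"].values,
        "season": (comp["yhat"] - comp["trend"]).values,   # all seasonal terms combined
        "residual": df["y"].values - comp["yhat"].values,   # part unexplained by Prophet
    })
    out = pd.DataFrame({"y": df["y"].values})
    for lag in range(1, k + 1):                 # K lagged components per Eq (10)
        for col in ("trend", "season", "residual"):
            out[f"{col}_lag{lag}"] = feats[col].shift(lag)
    return out.dropna().reset_index(drop=True)  # keep rows with a full lag history
```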

2.2 LightGBM

LightGBM is a variant of the gradient boosting decision tree (GBDT) [19]. We use the Prophet method to extract features and transform the time series forecasting problem of electricity consumption into a supervised learning problem. The hope is that the ensemble learner can better learn the nonlinear interaction between trend, seasonality, and residual from the features produced by Prophet. LightGBM is an additive model built with a boosting strategy. During training, the forward stagewise algorithm is used for greedy learning: each iteration learns a CART tree that fits the residual between the prediction of the previous t−1 trees and the true value of the training sample, so each new weak learner improves on the previous ensemble. Similar to XGBoost, LightGBM is explicitly regularized: the objective consists of a loss term and an L1+L2 regularization term. The approximate objective function is obtained by a second-order Taylor expansion of the loss function, as shown in Eq (11).

$Obj^{(t)} \simeq \sum_{i=1}^{n}\left[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i)\right] + \Omega(f_t)$ (11)

where $g_i$ and $h_i$ are the first- and second-order gradients of the loss with respect to the prediction of the previous ensemble, and $\Omega(f_t)$ is the regularization term.

The GBDT needs to scan all the data to estimate all possible split points for information gain, which takes considerable time and memory. LightGBM uses the histogram decision tree algorithm to reduce the memory footprint and improve the calculation speed. It also uses the GOSS (gradient-based one-side sampling) algorithm, which retains samples with large gradients and randomly samples those with small gradients, further saving space and time. The EFB (exclusive feature bundling) algorithm is used to bundle mutually exclusive features, which reduces the dimensionality and time complexity of the algorithm. LightGBM grows trees with a leafwise algorithm under a depth constraint. In addition, it supports efficient parallel computing and increases the cache hit rate.

2.3 GRU

Cho K et al. [20] proposed an improved gating structure in 2014, the gated recurrent unit (GRU), to better address the long-term dependency problem. The GRU simplifies the gate functions of the LSTM by combining the forget gate and the input gate into a single update gate. The update gate acts on both the cell state and the hidden state, which reduces the complexity of the network unit, reduces the number of parameters, and greatly shortens the training time of the model. We use the GRU to forecast the time series; because timeliness is important to the whole system, we choose the GRU structure, which consumes less time while achieving a prediction effect comparable to that of the LSTM. A schematic diagram of its gate control structure is shown in Fig 1.

For each unit in the sequence, we denote by σ the sigmoid function and by tanh the hyperbolic tangent function. $x_t$ is the input at time t, and $h_{t-1}$ is the hidden state at time t−1, which contains the dependency information of each previous moment. $r_t$ represents the reset gate, and $z_t$ represents the update gate. Both gates concatenate the input of the current moment with the hidden state of the previous moment and pass the result through the sigmoid function, which constrains the output to [0, 1]; the output is inhibited as it approaches 0 and activated as it approaches 1.

First, the reset gate and update gate formulas are Eq (12) and Eq (13).

$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$ (12)

$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$ (13)

Then, the reset gate is used to reset the information, and the data are scaled to the range of [–1, 1] by the Tanh function to obtain nt. nt contains the information to be added at the current moment, which is equivalent to memorizing the state of the current moment, as given by Eq (14).

$n_t = \tanh(W_n \cdot [r_t \odot h_{t-1}, x_t])$ (14)

The last stage outputs the final hidden state. This step forgets some of the information passed down from the previous state and adds some of the information carried by the current input. The current output $y_t$ is then produced from the hidden state, as given by Eq (15) and Eq (16).

$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot n_t$ (15)

$y_t = \sigma(W_o \cdot h_t)$ (16)
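The update rules in Eqs (12)–(16) can be written compactly in code. Below is a minimal PyTorch sketch of a single GRU step; the class name, weight shapes, and output layer are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of one GRU step following Eqs (12)-(16); illustrative only.
import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.wr = nn.Linear(input_size + hidden_size, hidden_size)  # reset gate weights
        self.wz = nn.Linear(input_size + hidden_size, hidden_size)  # update gate weights
        self.wn = nn.Linear(input_size + hidden_size, hidden_size)  # candidate state weights
        self.wo = nn.Linear(hidden_size, hidden_size)                # output weights

    def forward(self, x_t, h_prev):
        xh = torch.cat([x_t, h_prev], dim=-1)
        r_t = torch.sigmoid(self.wr(xh))                                    # Eq (12)
        z_t = torch.sigmoid(self.wz(xh))                                    # Eq (13)
        n_t = torch.tanh(self.wn(torch.cat([x_t, r_t * h_prev], dim=-1)))   # Eq (14)
        h_t = (1 - z_t) * h_prev + z_t * n_t                                # Eq (15)
        y_t = torch.sigmoid(self.wo(h_t))                                   # Eq (16)
        return h_t, y_t
```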

2.4 MMattention

Vaswani et al. [21] proposed the multihead attention mechanism, which uses different heads for different representation subspaces under the self-attention structure. Multihead attention enables the model to jointly attend to information from different representation subspaces at different positions. In addition, multihead attention can combine information from different head positions to capture the intraday variation regularity of energy consumption more effectively. In this paper, a multihead self-attention structure is proposed for the energy consumption prediction problem, as shown in Fig 2.

Fig 2. Schematic diagram of the multihead self-attention structure.

We take $x_i$ as the input feature vector corresponding to the Prophet decomposition of consumption for each hour of the day; there are 24 hours in a day, so the input of the masked multihead attention block is a vector sequence whose length M is 24. The sequence of vectors can be represented as an input matrix I, given by Eq (17).

$I = [x_1, x_2, \dots, x_{24}]^{T}$ (17)

2.4.1 Self-attention mechanism.

The essence of a self-attention function can be described as mapping a query and a set of key-value pairs to an output, where query, key, value, and output are all vectors. Each feature of the input has a set of vectors consisting of q, k and v. Q, K and V are the concatenation of all vector sequences of q, k and v, respectively. More intuitively, the attention mechanism is an operation that computes the similarity between a query and a key and extracts the query-related values for a weighted sum, as given by Eq (18), Eq (19), and Eq (20).

$Q = I W_q$ (18)

$K = I W_k$ (19)

$V = I W_v$ (20)

$W_q$, $W_k$, and $W_v$ are all matrices that need to be updated through iterative training. The Q and K matrices corresponding to the input vectors both have dimension $d_k$, and the V matrix has dimension $d_v$. In the attention matrix, the larger the value of an element, the stronger the interaction between energy consumption in the corresponding periods. The correlation matrix is given by Eq (21).

$A = \dfrac{QK^{T}}{\sqrt{d_k}}$ (21)

The output of the self-attention layer is the weighted sum of the values, where the weight corresponding to each value is the inner product of the corresponding query and key divided by $\sqrt{d_k}$, so that the inner product does not become too large, as given by Eq (22).

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$ (22)

2.4.2 Multi-head self-attention mechanism.

The multihead self-attention layer is given by Eq (23) and Eq (24).

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^{O}$ (23)

$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$ (24)

where the projections $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are parameter matrices holding the head-specific weights for queries, keys, and values, and $W^{O}$ linearly combines the outputs concatenated from all heads $\mathrm{head}_i$.

The difference between energy prediction and a text translation task is that text translation can use information following the input when producing the output, whereas energy consumption forecasting can only predict the future based on past consumption, so we propose a masked multihead attention structure. This masking ensures that the predicted value at time i can rely only on known outputs before time i. The elements of the masked correlation matrix C can be expressed as Eq (25).

$C_{ij} = \begin{cases} A_{ij}, & j \le i \\ 0, & j > i \end{cases}$ (25)

C is a lower triangular matrix, which ensures that the network does not see future information when making predictions, so the results are not deceptively accurate.
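A minimal sketch of this masked scaled dot-product self-attention is given below, assuming a single hourly input matrix of shape (M, d_model) with M = 24; the layer sizes and class name are illustrative, not the authors' configuration.

```python
# Minimal sketch of masked scaled dot-product self-attention (Eqs (18)-(25)).
import math
import torch
import torch.nn as nn

class MaskedSelfAttention(nn.Module):
    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        self.wq = nn.Linear(d_model, d_k, bias=False)   # W_q
        self.wk = nn.Linear(d_model, d_k, bias=False)   # W_k
        self.wv = nn.Linear(d_model, d_k, bias=False)   # W_v
        self.d_k = d_k

    def forward(self, I):                                 # I: (M, d_model), M = 24 hours
        Q, K, V = self.wq(I), self.wk(I), self.wv(I)      # Eqs (18)-(20)
        scores = Q @ K.T / math.sqrt(self.d_k)            # Eq (21)
        mask = torch.tril(torch.ones_like(scores)).bool()
        scores = scores.masked_fill(~mask, float("-inf")) # keep only j <= i, Eq (25)
        return torch.softmax(scores, dim=-1) @ V          # Eq (22)
```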

2.5 GRU-MMattention hybrid model architecture

As shown in Fig 3, to enable the network model to jointly pay attention to different representation subspace information at different locations, that is, to learn the interaction relationship between electricity consumption in different periods within a day, we added a masked-multihead attention layer. Due to the large network depth, the Add&Norm layer is adopted to improve the prediction effect and speed up the network convergence. Without layer normalization, the gradient descent process is slow, and the descent trajectory fluctuates greatly. Then, the dimensionality is reduced with a feed-forward layer. Next, the data pass through the Add&Norm layer, and finally, the final result is output through the Linear layer.
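A minimal sketch of this stack (GRU, masked multihead attention, Add&Norm, feed-forward, Add&Norm, Linear) is given below. The use of PyTorch's built-in nn.MultiheadAttention, the layer sizes, and the single next-step output are assumptions for illustration, not the authors' exact network.

```python
# Minimal sketch of the GRU-MMattention stack described above; illustrative only.
import torch
import torch.nn as nn

class GRUMMAttention(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64, heads: int = 2):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.mha = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.ff = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, hidden))
        self.norm2 = nn.LayerNorm(hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                        # x: (batch, seq_len, n_features)
        h, _ = self.gru(x)
        seq_len = h.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        a, _ = self.mha(h, h, h, attn_mask=causal)   # masked multihead attention
        h = self.norm1(h + a)                        # Add & Norm
        h = self.norm2(h + self.ff(h))               # feed-forward + Add & Norm
        return self.out(h[:, -1])                    # next-step consumption estimate
```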

2.6. GRU-MMattention-LightGBM model prediction process

Hybrid model training mainly includes three steps: feature construction, model training and prediction. The process is shown in Fig 4.

Fig 4. GRU-MMattention-LightGBM model prediction process.

Step 1, feature extraction on original data using Prophet.

After cleaning the original data, according to the Prophet decomposition method in section 2.1, the time series of energy consumption is decomposed into trend, seasonality, and residual. Then, features are built from these components as in Eq (10).

Step 2, model training.

The LightGBM model is trained on the training set. We pick the model with the best prediction performance on the validation set. After that, the parameters of the LightGBM model are frozen and saved. Then, the hyperparameters of GRU-MMattention are set and the network is trained on the training set. The results obtained by the neural network and the results of LightGBM on the training set are used simultaneously as the input of a multilayer perceptron (MLP). The weights of both models are learned by the MLP. Finally, only the parameters of the neural network are iteratively adjusted according to the results on the validation set, and the best hybrid model is selected.
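A minimal sketch of the weighting step is shown below: the frozen LightGBM prediction and the GRU-MMattention prediction are concatenated and fed to a small MLP that learns how to combine them. The hidden width and the input names lgb_pred and nn_pred are illustrative assumptions.

```python
# Minimal sketch of the MLP that learns the combination weights of the two models.
import torch
import torch.nn as nn

class CombinerMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))

    def forward(self, lgb_pred, nn_pred):        # each of shape (batch, 1)
        return self.mlp(torch.cat([lgb_pred, nn_pred], dim=-1))
```

During training of the hybrid model, only the neural network and combiner parameters are updated; the LightGBM outputs enter as fixed inputs.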

Step 3, prediction.

The ensemble model trained in step 2 is saved, and prediction is performed on the test set.

3. Empirical analysis

3.1 Data processing process

We selected the energy consumption data of the US PJM regional energy supply company across 14 regions from the Kaggle data website (https://www.kaggle.com/datasets/robikscube/hourly-energy-consumption). The energy consumption data from 0:00 on January 1, 2015, to 23:00 on August 3, 2018, were selected at a frequency of 1 hour, for a total of 31440 samples. We performed k-nearest neighbor imputation for the 4 missing values and removed 2 duplicate values. The first 85% of the data form the training set, the next 5% the validation set, and the final 10% the test set.
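A minimal sketch of the chronological 85/5/10 split is shown below; the file name "PJME_hourly.csv" and the column name "Datetime" are assumed examples from the Kaggle dataset, not necessarily the exact file the authors used.

```python
# Minimal sketch of the chronological train/validation/test split (85% / 5% / 10%).
import pandas as pd

data = pd.read_csv("PJME_hourly.csv", parse_dates=["Datetime"]).sort_values("Datetime")
n = len(data)
train = data.iloc[: int(0.85 * n)]               # earliest 85% of the hours
val = data.iloc[int(0.85 * n): int(0.90 * n)]    # next 5%
test = data.iloc[int(0.90 * n):]                 # most recent 10%
```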

3.2 Feature window length selection

Fig 5 shows that the hourly electricity consumption data have a significant intraday cycle, and the daily electricity consumption data have a significant seasonal effect. Combined with the cutoff behavior of the PACF, the hyperparameter for the feature window length K is selected as 3. In addition, in a grid-search preexperiment over K ∈ [1, 6] on LightGBM and GRU, the test set performs best when K = 3. Therefore, the window length K for feature selection is 3.
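The ACF/PACF inspection can be reproduced with statsmodels, as in the minimal sketch below; "consumption" is assumed to be the hourly training series (a pandas Series), and the lag range is an illustrative choice.

```python
# Minimal sketch of the ACF/PACF plots used to choose the window length K.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(consumption, lags=48, ax=axes[0])    # pronounced 24-hour intraday cycle
plot_pacf(consumption, lags=48, ax=axes[1])   # cutoff after a few lags suggests K = 3
plt.tight_layout()
plt.show()
```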

Fig 5. ACF and PACF of electricity consumption per hour and daily.

3.3 Feature selection and normalization

Based on the cleaned data, Prophet decomposition is performed on the original data according to the method in section 2.1, as shown in Fig 6.

It can be seen from subgraphs 1 and 2 of Fig 6 that the energy consumption cycle is decomposed very neatly, and the feature extraction is very effective.

To improve the convergence speed of the neural network, when the neural network model is built, minimum–maximum normalization is applied to each feature dimension across all samples so that the data lie in the range [0, 1]. The normalization is given by Eq (26).

$x'_{ij} = \dfrac{x_{ij} - \min_{i}(x_{ij})}{\max_{i}(x_{ij}) - \min_{i}(x_{ij})}$ (26)

where $x'_{ij}$ represents the normalized data, $x_{ij}$ represents the original data, N is the number of samples, $\min_{i}(x_{ij})$ is the minimum value of the same feature dimension across all samples, and $\max_{i}(x_{ij})$ is the corresponding maximum value.

Since the normalized data prediction is not the real predicted value, it is necessary to save the conversion factor to facilitate denormalization after the prediction and obtain the actual predicted value. The denormalization method is shown in Eq (27).

$\hat{y} = \hat{y}'\,(y_{\max} - y_{\min}) + y_{\min}$ (27)

$\hat{y}$ represents the final predicted value after denormalization, $\hat{y}'$ represents the model prediction obtained under normalized data, $y_{\max}$ represents the maximum value of the labels in the training set, and $y_{\min}$ represents the minimum value of the labels in the training set.
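A minimal sketch of the normalization in Eq (26) and the denormalization in Eq (27) with scikit-learn is shown below; the placeholder arrays stand in for the real Prophet feature matrix and labels.

```python
# Minimal sketch of min-max normalization (Eq (26)) and denormalization (Eq (27)).
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.random.rand(100, 9) * 5000        # placeholder Prophet feature matrix
y_train = np.random.rand(100, 1) * 5000        # placeholder consumption labels

x_scaler = MinMaxScaler().fit(X_train)          # per-feature min/max from training data
X_train_scaled = x_scaler.transform(X_train)    # Eq (26): scale features to [0, 1]

y_scaler = MinMaxScaler().fit(y_train)
y_train_scaled = y_scaler.transform(y_train)
y_back = y_scaler.inverse_transform(y_train_scaled)   # Eq (27): back to real units
```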

3.4 Evaluation indicators

This article addresses a time series forecasting problem. The labels are numerical, and we are primarily concerned with the gap between the actual value and the predicted value. Therefore, MAPE, MAE, and RMSE are selected to measure the prediction accuracy and generalization ability of different models. $y_i$ represents the true value and $\hat{y}_i$ the predicted value, i = 1, 2, …, N. Eq (28), Eq (29), and Eq (30) are as follows:

$\mathrm{MAPE} = \dfrac{1}{N}\sum_{i=1}^{N}\left|\dfrac{y_i - \hat{y}_i}{y_i}\right| \times 100\%$ (28)

$\mathrm{MAE} = \dfrac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$ (29)

$\mathrm{RMSE} = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$ (30)
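These metrics translate directly into a few lines of NumPy, as in the sketch below.

```python
# Minimal NumPy sketch of the evaluation metrics in Eqs (28)-(30).
import numpy as np

def mape(y, y_hat):
    return np.mean(np.abs((y - y_hat) / y)) * 100   # Eq (28), in percent

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))               # Eq (29)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))       # Eq (30)
```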

3.5 Results

3.5.1 GRU.

We used a grid search to tune the hyperparameters in the preexperiment: the hidden size was searched over {16, 32, 64, 128} and the number of layers over {1, 2, 3, 4}. The group with the lowest MAPE and MAE is selected for comparison, as shown in Table 1.

Table 1. Comparison of the results of P-feature and TF-feature in GRU.

3.5.2 GRU- MMattention.

Due to the logical requirements of time series prediction, we use the masked multihead attention architecture with the number of heads set to 2. To capture the interaction between different periods within a day, we also added positional encoding to the architecture. GRU-MMattention likewise uses grid search to tune the hyperparameters in the preexperiment: the hidden size is searched over {16, 32, 64, 128} and the number of layers over {1, 2, 3, 4}. We select the set of hyperparameters with the lowest MAPE and MAE, as shown in Table 2. The prediction results of GRU and GRU-MMattention are shown in Fig 7.

Fig 7. The prediction results in the test set.

(a) GRU; (b) GRU-MMattention.

For all samples predicted by this model, MAPE is 4.71%. The MAE is 700.38. The RMSE is 800.39.

3.5.3 LightGBM.

Jiang X et al. [22] denoted LightGBM using moving temporal window features as TFLightGBM. We denote the LightGBM model with Prophet decomposition features as PlightGBM. We set the number of leaves to 31, the maximum depth to 5, the learning rate to 0.01, the number of estimators to 100, the minimum subtree weight to 0.01, and the minimum subtree sample count to 20.
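This configuration maps naturally onto LightGBM's scikit-learn API, as in the sketch below; mapping "minimum subtree weight" and "minimum subtree sample" to min_child_weight and min_child_samples is my assumption, and the feature names are illustrative.

```python
# Minimal sketch of the PlightGBM configuration listed above.
import lightgbm as lgb

plightgbm = lgb.LGBMRegressor(
    num_leaves=31,
    max_depth=5,
    learning_rate=0.01,
    n_estimators=100,
    min_child_weight=0.01,
    min_child_samples=20,
)
# plightgbm.fit(X_train_prophet, y_train)   # Prophet-decomposition features as inputs
```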

As Table 3 shows, PlightGBM is much more accurate than TFLightGBM, which shows that Prophet decomposition is significant in feature construction. The PlightGBM prediction results are shown in Fig 8.

Table 3. Comparison of the results of P-feature and TF-feature in LightGBM.

3.5.4 GRU-MMatten-LGB.

Based on the trained LightGBM model with Prophet feature decomposition from section 3.5.3, the parameters of the LightGBM model are frozen and saved. Then, we set the hyperparameters of the optimized GRU-MMattention as in section 3.5.2 and train on the training set. The results obtained by the neural network and the results of LightGBM on the training set are used simultaneously as the input of the MLP. The weights of both models are learned by the MLP. We set epochs = 400 when retraining the hybrid model. The prediction results are shown in Fig 9.

For all samples predicted by the model, MAPE = 1.69%, MAE = 157.0741, and RMSE = 207.682.

3.6 Model comparison results

3.6.1 Comparison of the methods’ accuracy.

Based on the results in 3.5, we compare the optimal models in each structure, as shown in Table 4 and Fig 10.

3.6.2 Comparison of the antinoise ability of the methods.

The strong anti-noise capability in this paper refers to the model's resistance to noise in the training set. Specifically, if noise appears in the training data and the model learns from the contaminated data, yet the prediction result differs little from that obtained without noise, we can conclude that the model has good anti-noise ability. We selected energy consumption data from a PJM regional power supply company different from the regions used above. The power consumption data from 0:00 on January 1, 2017, to 23:00 on August 3, 2018, were selected at a frequency of 1 hour, for a total of 11380 samples. After data cleaning, we added noise to 10% of the data in the training set, with Noise ~ N(0, 1000), where 1000 is the standard deviation of the power consumption data itself. Finally, the model is retrained, and the metrics are calculated on the test set. The experiment was repeated 10 times, and the metrics were recorded each time. After a t test of the 10 experimental results, if the metrics do not change significantly, the GRU-MMatten-LGB method can be considered to have strong anti-noise ability.
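A minimal sketch of this noise experiment and the one-sample t test is given below; "train_model" and "evaluate_mape" are assumed helper callables standing in for the retraining and test-set evaluation steps, not the authors' code.

```python
# Minimal sketch of the noise experiment: contaminate 10% of the training labels
# with N(0, 1000) noise, retrain, and compare the 10 noisy MAPEs against the
# noise-free MAPE with a one-sided one-sample t test.
import numpy as np
from scipy import stats

def noisy_copy(y_train, frac=0.10, sd=1000.0, rng=None):
    rng = rng or np.random.default_rng()
    y = np.array(y_train, dtype=float, copy=True)
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    y[idx] += rng.normal(0.0, sd, size=idx.size)   # Noise ~ N(0, 1000)
    return y

def noise_experiment(train_model, evaluate_mape, y_train, mape_clean, runs=10):
    mapes = [evaluate_mape(train_model(noisy_copy(y_train))) for _ in range(runs)]
    # H0: mean noisy MAPE equals the noise-free MAPE; alternative: it is greater
    t_stat, p_value = stats.ttest_1samp(mapes, popmean=mape_clean, alternative="greater")
    return float(np.mean(mapes)), float(p_value)
```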

Table 5 shows that, without noise, the MAPE of GRU is 0.028152, the MAPE of LightGBM is 0.026251, the MAPE of GRU-attention is 0.020580, and the MAPE of GRU-MMatten-LGB is 0.019802.

Table 6. The MAPE of the noise experiment on all methods.

Table 7 shows that the above p values are all greater than 0.1, so there is no reason to reject the null hypothesis. Therefore, we conclude that the distribution of MAPE in the noise experiments conforms to a normal distribution. We can then test whether the mean MAPE of the noise experiments is significantly greater than the given population mean, the noise-free MAPE, which is denoted as $\mathrm{MAPE}_{normal}$.

$\mathrm{Mean}_{noise}$ represents the mean MAPE of a given model over the 10 noise experiments; the data are shown in Table 6.

$\mathrm{Mean}_{normal}$ represents the MAPE of a given model without noise; the data are shown in Table 5. The results of the t test are shown in Table 8.

According to the results of Tables 6 and 8, the p value of the t test in the GRU noise experiment is less than 0.1. The mean MAPE of the GRU model in 10 noise experiments is 0.03327, which can be considered to be significantly higher than 0.028152 when noise is not added. Therefore, we can conclude that the GRU model has poor anti-noise ability. Similarly, the p value of the t test in the GRU-MMattention noise experiment is less than 0.1, and we can also draw the conclusion that GRU-MMattention has poor anti-noise ability.

However, for the LightGBM method in the noise experiment, its p value of the t test is more than 0.1. The mean MAPE on LightGBM in 10 noise experiments is 0.02720, which cannot be judged to be significantly greater than 0.026851 without noise. It can be considered that the impact of noise on the prediction results of the LightGBM model is not significant, and the LightGBM method has a strong anti-noise ability. Similarly, the p value of the t test on GRU-MMatten-LGB is also more than 0.1. It can be concluded that GRU-MMatten-LGB has a strong anti-noise ability.

4. Conclusion

Based on PJM District Energy Company’s energy consumption data in 14 districts from January 1, 2015, to August 3, 2018, our conclusions are as follows:

  1. For feature construction, the method based on Prophet decomposition is more predictive than the simple time window method. The MAPE, MAE and MSE of PlightGBM were 3.71%, 530.62 and 631.86, respectively. The MAPE, MAE and MSE of TFLightGBM were 4.6%, 657.14 and 700.76, respectively. The prediction accuracy of LightGBM using Prophet features is improved by 1.1%. Similarly, P-GRU improves the prediction accuracy by approximately 2% over TFGRU. Therefore, Prophet decomposition is effective for energy consumption prediction and can be regarded as a good distributed representation in representation learning.
  2. As far as a single model is concerned, GRU-MMattention can improve GRU, and the masked multihead-attention architecture has reference significance for energy consumption prediction. The MAPE, MAE and MSE of P-GRU were 10.35%, 1594.04 and 2011.069, respectively. For the GRU-MMattention architecture, MAPE = 4.71%, MAE = 700.38, and RMSE = 800.39. The GRU-MMattention architecture is approximately 6% better than GRU. The combination of the gating-attention mechanism is more stable and accurate than a single gating mechanism.
  3. Comparing the evaluation indicators of the four model architectures of GRU, GRU-MMattention, LightGBM, and GRU-MMatten-LGB, it can be clearly seen that the structure of GRU-MMatten-LGB is practical and effective for energy prediction. The MAPE of the GRU-MMattention-LightGBM model is 1.69%, which is 8.66% lower than the GRU structure relative error and 2.02% lower than the LightGBM prediction relative error. Compared with the single model, the prediction accuracy and stability of the hybrid architecture have been significantly improved.
  4. Although GRU and GRU-MMattention are poor in anti-noise ability, the GRU-MMatten-LGB structure has the characteristics of a strong anti-noise ability of the LightGBM method, which can reduce the impact of the energy consumption mutation point on the overall prediction effect of the model.

References

  1. Luthuli Q W, Folly K A. Short term load forecasting using artificial intelligence[C]. 2016 IEEE PES PowerAfrica. IEEE, 2016: 129–133.
  2. Jain A, Satish B. Clustering based short term load forecasting using support vector machines[C]. 2009 IEEE Bucharest PowerTech. IEEE, 2009: 1–8.
  3. Chen Y, Xu P, Chu Y, et al. Short-term electrical load forecasting using the Support Vector Regression (SVR) model to calculate the demand response baseline for office buildings[J]. Applied Energy, 2017, 195: 659–670.
  4. Park S, Jung S, Jung S, et al. Sliding window-based LightGBM model for electric load forecasting using anomaly repair[J]. The Journal of Supercomputing, 2021: 1–22.
  5. Pei S, Qin H, Yao L, et al. Multistep ahead short-term load forecasting using hybrid feature selection and improved long short-term memory network[J]. Energies, 2020, 13(16): 4121.
  6. Sehovac L, Nesen C, Grolinger K. Forecasting building energy consumption with deep learning: A sequence to sequence approach[C]. 2019 IEEE International Congress on Internet of Things (ICIOT). IEEE, 2019: 108–116.
  7. Jarábek T, Laurinec P, Lucká M. Energy load forecast using S2S deep neural networks with k-Shape clustering[C]. 2017 IEEE 14th International Scientific Conference on Informatics. IEEE, 2017: 140–145.
  8. Atef S, Eltawil A B. Assessment of stacked unidirectional and bidirectional long short-term memory networks for electricity load forecasting[J]. Electric Power Systems Research, 2020, 187: 106489.
  9. Park S, Moon J, Jung S, et al. A two-stage industrial load forecasting scheme for day-ahead combined cooling, heating and power scheduling[J]. Energies, 2020, 13(2): 443.
  10. Speiser J L, Miller M E, Tooze J, et al. A comparison of random forest variable selection methods for classification prediction modeling[J]. Expert Systems with Applications, 2019, 134: 93–101.
  11. Xie Y, Ueda Y, Sugiyama M. A Two-Stage Short-Term Load Forecasting Method Using Long Short-Term Memory and Multilayer Perceptron[J]. Energies, 2021, 14(18): 5873.
  12. Yuxuan L, Yang C, Sun Y. Dynamic time features expanding and extracting method for prediction model of sintering process quality index[J]. IEEE Transactions on Industrial Informatics, 2021.
  13. Jung S, Moon J, Park S, et al. An Attention-Based Multilayer GRU Model for Multistep-Ahead Short-Term Load Forecasting[J]. Sensors, 2021, 21(5): 1639.
  14. Bu S J, Cho S B. Time series forecasting with multiheaded attention-based deep learning for residential energy consumption[J]. Energies, 2020, 13(18): 4722.
  15. Taylor S J, Letham B. Forecasting at scale[J]. The American Statistician, 2018, 72(1): 37–45.
  16. Chung J, Gulcehre C, Cho K H, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling[J]. arXiv preprint arXiv:1412.3555, 2014.
  17. Harvey A C, Peters S. Estimation procedures for structural time series models[J]. Journal of Forecasting, 1990, 9(2): 89–108.
  18. Zarnowitz V, Ozyildirim A. Time series decomposition and measurement of business cycles, trends and growth cycles[J]. Journal of Monetary Economics, 2006, 53(7): 1717–1739.
  19. Ke G, Meng Q, Finley T, et al. LightGBM: A highly efficient gradient boosting decision tree[J]. Advances in Neural Information Processing Systems, 2017, 30: 3146–3154.
  20. Cho K, Van Merriënboer B, Bahdanau D, et al. On the properties of neural machine translation: Encoder-decoder approaches[J]. arXiv preprint arXiv:1409.1259, 2014.
  21. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]. Advances in Neural Information Processing Systems. 2017: 5998–6008.
  22. Jiang X, Luo Y, Zhang B. Prediction of PM2.5 Concentration Based on the LSTM-TSLightGBM Variable Weight Combination Model[J]. Atmosphere, 2021, 12(9): 1211.