Improved Neural Networks with Random Weights for Short-Term Load Forecasting

An effective short-term load forecasting model plays a significant role in improving the management efficiency of an electric power system. This paper proposes a new forecasting model based on improved neural networks with random weights (INNRW). The key is to introduce a weighting technique to the inputs of the model and to use a novel neural network to forecast the daily maximum load. Eight factors are selected as the inputs. A mutual information weighting algorithm is then used to allocate different weights to the inputs. Neural networks with random weights and kernels (KNNRW) are applied to approximate the nonlinear function between the selected inputs and the daily maximum load because of their fast learning speed and good generalization performance. In an application to the daily load in Dalian, the results of the proposed INNRW are compared with several previously developed forecasting models. The simulation experiment shows that the proposed model performs the best overall in short-term load forecasting.


Introduction
Like water supply, gas supply, communications, and transportation systems, the electric power system is a necessary component of urban lifeline engineering. Accurate load forecasting is increasingly important since it is critical for the planning, operation and investment of power systems [1]. Improving the accuracy of load forecasting helps to raise power supply efficiency and reduce operating costs [2].
Load forecasting can be classified into long-term, mid-term, short-term and very short-term forecasting, based on the forecasting horizon. During the past decades, researchers have developed many different kinds of methods to improve load forecasting accuracy [1], especially in the field of short-term load forecasting [3][4][5]. Most of these methods have been restricted in practical applications due to the randomness and nonlinearity of the short-term load. In contrast, some intelligent forecasting algorithms, such as the artificial neural network (ANN) [6,7] and the support vector machine (SVM), have been widely used [2]. Park et al. first used ANN to forecast short-term load [8]. Lee et al. analyzed the influence of different ANN structures on forecasting results [5]. Hippert et al. reviewed ANN methods for short-term load forecasting and pointed out the overfitting problems of these methods [4]. Taylor et al. took the weather into account when modeling with ANN methods [9]. SVM also performs well in short-term load forecasting. Moreover, as SVM is based on the structural risk minimization framework, it can overcome overfitting problems effectively [10]. However, the effectiveness of SVM depends on the selection of the kernel, the kernel's parameters, and the regularization parameter. Typically, each combination of parameters is checked using cross validation, and the best combination is often selected by grid search, whose computational complexity grows exponentially with the number of parameters. Simulated annealing [11], genetic algorithms [12] and particle swarm optimization have been used by some researchers to select proper SVM parameters.
Recently, researchers have been adapting the ANN to different forecasting tasks and have obtained satisfying results [13]. Nevertheless, traditional ANNs are usually trained by gradient-based learning algorithms, which may suffer from drawbacks such as slow convergence, local minima, and overfitting. To address these problems, this paper focuses on an improved machine learning algorithm based on neural networks with random weights (NNRW) [14]. There are three layers in NNRW: an input layer, a hidden layer, and an output layer. In NNRW, the weights connecting the input layer to the hidden layer, as well as the bias values of the hidden layer, are randomly generated before the learning process. Only the weights connecting the hidden layer to the output layer are trained, by fast linear regression. Because of its rapid learning speed and good generalization performance, NNRW has been successfully used in computational intelligence and machine learning applications such as electricity price forecasting [15], power loss analysis [16], lying and truth-telling classification [17], and attention-deficit/hyperactivity disorder (ADHD) classification [18]. The structure of NNRW, i.e. the number of hidden nodes, is one of the important factors that affect its performance, and it is determined empirically by the user. Recently, neural networks with random weights and kernels (KNNRW) [19][20][21][22] have been proposed by replacing the hidden node mapping with a kernel mapping, so that the number of hidden nodes no longer needs to be determined.
Based on the analysis above, this paper proposes a short-term load forecasting method based on KNNRW, which can combine the fast learning speed of NNRW and the good generalization performance of SVM. Eight relevant factors (e.g., the historical load data, the temperature data, and the holiday data) are first selected as the inputs of the forecasting model. It is known that the inputs are treated equally in KNNRW. However, different inputs may have different influences on the forecasting values. As a result, a mutual information weighting algorithm is then applied to allocate different weights to the inputs according to the corresponding influences. Finally, the resulting improved neural networks with random weights is used to approximate the nonlinear function between the selected inputs and the daily maximum load.

Basic Neural Networks with Random Weights
NNRW was proposed by Schmidt et al. [14]. Similar ideas have also been put forward by other researchers, such as Pao et al. [23] and Huang et al. [24]. Pao et al. described such randomized learner models as the random vector functional-link (RVFL) net [23]. Huang et al. defined such machine learning models as the extreme learning machine (ELM) [24]. Further research on RVFL and ELM has produced a number of theoretical results [22,25,26]. A feedforward NNRW has a simple three-layer structure: an input layer, an output layer, and a hidden layer consisting of a large number of nonlinear processing nodes. Mathematically, NNRW [14] can be expressed as follows:

$$o_k = w^T g(W_{in} x_k + b), \quad k = 1, 2, \ldots, N \qquad (1)$$

where $W_{in} \in \mathbb{R}^{L \times m}$ is the input weight matrix, $b \in \mathbb{R}^{L}$ is the bias value vector of the hidden layer, $w \in \mathbb{R}^{L}$ is the output weight vector, $g(\cdot)$ is the activation function ($g(\cdot)$ could be almost any nonlinear piecewise continuous activation function or any linear combination of these functions), $N$ is the number of samples, $L$ is the number of hidden layer nodes, $x_k \in \mathbb{R}^{m}$ is the input vector with $m$ features, and $o_k \in \mathbb{R}$ is the output value. The output of the proposed forecasting model is the maximum load of the next day. Consequently, we use the single output form of NNRW in this paper.
For $N$ arbitrary distinct samples $\{x_i \in \mathbb{R}^{m}, t_i \in \mathbb{R}\}$, NNRW with $L$ hidden nodes can approximate these $N$ samples with zero error. This means that $\sum_{k=1}^{N} \|o_k - t_k\| = 0$, i.e., there exists $w$ in NNRW such that

$$w^T g(W_{in} x_k + b) = t_k, \quad k = 1, 2, \ldots, N \qquad (2)$$

The matrix-vector formulation of (2) can be written as

$$H w = T \qquad (3)$$

where

$$H = \begin{bmatrix} g(W_{in} x_1 + b)^T \\ \vdots \\ g(W_{in} x_N + b)^T \end{bmatrix} \in \mathbb{R}^{N \times L}, \qquad T = [t_1, \ldots, t_N]^T$$

In the NNRW model, $W_{in}$ and $b$ are generated randomly beforehand and remain fixed in the training process. $w$ is the only parameter that needs to be tuned through training. It can be calculated analytically as follows:

$$w = H^{\dagger} T \qquad (4)$$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of the matrix $H$. The training of the NNRW model can be summarized as follows:
a. Randomly generate the input weight $W_{in}$ and the hidden layer bias $b$;
b. Calculate the hidden layer output matrix $H$;
c. Calculate the output weight $w$ by (4).
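Training steps a to c can be illustrated with a minimal NumPy sketch. The sigmoid activation, the uniform range [-1, 1] for the random weights, and the function names `hidden_output`, `train_nnrw` and `predict_nnrw` are assumptions made for illustration and are not specified in the paper.

```python
import numpy as np

def hidden_output(X, W_in, b):
    """Hidden layer output matrix H (N x L); sigmoid g is an assumption."""
    return 1.0 / (1.0 + np.exp(-(X @ W_in.T + b)))

def train_nnrw(X, t, n_hidden, seed=0):
    """Fit a single-output NNRW. X: (N, m) inputs, t: (N,) targets."""
    rng = np.random.default_rng(seed)
    # Step a: random input weights and biases (uniform range is assumed).
    W_in = rng.uniform(-1.0, 1.0, size=(n_hidden, X.shape[1]))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    # Step b: hidden layer output matrix H.
    H = hidden_output(X, W_in, b)
    # Step c: output weights from the Moore-Penrose pseudoinverse, as in (4).
    w = np.linalg.pinv(H) @ t
    return W_in, b, w

def predict_nnrw(X, W_in, b, w):
    """Network output o = H w for new inputs."""
    return hidden_output(X, W_in, b) @ w
```

Only step c involves the targets, so the cost of training is essentially that of one least-squares solve.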
As can be seen from the above, the training process of NNRW is a simple linear regression process, which can overcome the limitations of traditional ANNs effectively. Despite the success of NNRW, there is still room for improvement, such as the determination of the structure (i.e., the number of the hidden layer nodes), and the ill-conditioned solution in the training process [22].

Neural Networks with Random Weights and Kernels
In order to overcome the aforementioned shortcomings of NNRW, neural networks with random weights and kernels (KNNRW) have been proposed by introducing the kernel function mapping of SVM as the hidden node mapping of NNRW [19,21].
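The optimization problem of NNRW can be written as:

$$\min_{w,\,\xi}\; L_P = \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{N}\xi_i^2 \quad \text{s.t.} \quad h(x_i)\,w = t_i - \xi_i, \; i = 1, \ldots, N \qquad (5)$$

where $\xi_i$ is the training error related to the $i$th training sample $x_i$, $C$ is the regularization coefficient, and $h(x_i)$ denotes the $i$th row of $H$. The corresponding dual optimization problem of (5) can be formulated as:

$$L_D = \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{N}\xi_i^2 - \sum_{i=1}^{N}\alpha_i\left(h(x_i)\,w - t_i + \xi_i\right) \qquad (6)$$

where $\alpha_i$ is the Lagrange multiplier with respect to the $i$th training sample $x_i$. The corresponding Karush-Kuhn-Tucker (KKT) conditions are as follows:

$$\frac{\partial L_D}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N}\alpha_i h(x_i)^T = H^T\alpha \qquad (7)$$

$$\frac{\partial L_D}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i = C\xi_i, \quad i = 1, \ldots, N \qquad (8)$$

$$\frac{\partial L_D}{\partial \alpha_i} = 0 \;\Rightarrow\; h(x_i)\,w - t_i + \xi_i = 0, \quad i = 1, \ldots, N \qquad (9)$$

with $\alpha = [\alpha_1, \ldots, \alpha_N]^T$. Substituting (7) and (8) into (9), the following equation can be obtained:

$$\left(\frac{I}{C} + HH^T\right)\alpha = T \qquad (10)$$

where $I$ is an identity matrix. Considering (7) and (10), the weight $w$ can be calculated as:

$$w = H^T\left(\frac{I}{C} + HH^T\right)^{-1} T \qquad (11)$$

Thus, the output function of NNRW can be written as:

$$f(x) = h(x)\,w = h(x)H^T\left(\frac{I}{C} + HH^T\right)^{-1} T \qquad (12)$$

It can be seen from (12) that the specific form of $h(x)$ is not important as long as the dot products $HH^T$ and $h(x)H^T$ are known. As a result, if the hidden node mapping $h(x)$ is unknown, we can define the kernel matrix of KNNRW as follows:

$$\Omega = HH^T, \qquad \Omega_{i,j} = h(x_i)\,h(x_j)^T = K(x_i, x_j) \qquad (13)$$

Consequently, the output function can be rewritten as:

$$f(x) = \begin{bmatrix} K(x, x_1) & \cdots & K(x, x_N) \end{bmatrix}\left(\frac{I}{C} + \Omega\right)^{-1} T \qquad (14)$$

In the kernel implementation of NNRW, $h(x)$ can be unknown, while the corresponding kernel function $K(u, v)$ usually should be given (e.g., $K(u, v) = \exp(-\gamma\|u - v\|^2)$, where $\gamma$ is the kernel width). Hence, the number of hidden layer nodes no longer needs to be determined. Moreover, KNNRW has the following universal approximation capability:

Theorem [21] (Universal Approximation Capability): A widespread type of hidden node mapping $h(x)$ can be used in NNRW so that NNRW can approximate any continuous target function. In other words, given any continuous target function $g(x)$, there is a weight vector $w$ such that

$$\lim_{L \to \infty} \left\| h(x)\,w - g(x) \right\| = 0 \qquad (15)$$

With this universal approximation capability, KNNRW can use a wide range of feature mappings, such as Sigmoid, radial basis function (RBF), trigonometric, and polynomial mappings. The optimization objective functions of KNNRW are similar to those of the traditional SVM/least squares support vector machine (LS-SVM). However, KNNRW does not have any constraints on the Lagrange multipliers.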
As a result, KNNRW can obtain a better solution than SVM/LS-SVM. In addition, as KNNRW does not need the bias values that SVM requires, it is superior to the traditional SVM/LS-SVM algorithms in scalability and learning speed [21].
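The kernel form of the output function, with the Gaussian kernel quoted in the text, can be sketched in a few lines of NumPy. The function names `gaussian_kernel`, `fit_knnrw` and `predict_knnrw` are hypothetical, and solving the linear system directly instead of forming the inverse is an implementation choice of this sketch, not something stated in the paper.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2) for every row pair of A and B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

def fit_knnrw(X, t, C, gamma):
    """Solve (I/C + Omega) alpha = T, where Omega is the kernel matrix."""
    omega = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(np.eye(len(X)) / C + omega, t)

def predict_knnrw(X_new, X_train, alpha, gamma):
    """f(x) = [K(x, x_1), ..., K(x, x_N)] alpha for each row x of X_new."""
    return gaussian_kernel(X_new, X_train, gamma) @ alpha
```

With this formulation only the regularization coefficient `C` and the kernel parameter `gamma` remain to be chosen; no hidden layer size appears anywhere.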

Inputs of KNNRW
In this section, the proposed INNRW was used to forecast the short-term load of Dalian, China. The output of the model was the daily maximum load. According to the analysis in [4], load data have weekly and monthly characteristics. Taking Dalian as an example, it can be seen from Figs 1 and 2 that the load remains stable on weekdays and drops noticeably at weekends; values within the same month show approximately the same tendency; and values in every week follow a regular pattern. Therefore, we took both weekly and monthly characteristics as inputs. Additionally, temperature has been verified to be an essential factor influencing the maximum load [9], and the temperature showed an obvious correlation with the maximum load. Therefore, we selected the temperature as another input.
Meanwhile, holidays also affected the maximum load, because the drop in industrial power consumption during holidays decreases the total power consumption. For example, there were 6 Chinese legal holiday periods in 2012: from 1st January to 3rd January, from 22nd January to 28th January, from 2nd April to 4th April, from 29th April to 1st May, from 22nd June to 24th June, and from 30th September to 7th October. In addition, it can be clearly seen from Fig 3 that the load data have an obvious holiday characteristic, that is, the load drops sharply during the holidays. Consequently, the binary encoded holiday data served as an input in this paper. As the maximum load was closely related to the historical maximum load, which can be verified by analyzing the load data as a time series, we selected the maximum load of the day before and that of the day last week as inputs of KNNRW.
Finally, the inputs selected for the INNRW were month of the year, day of the month, day of the week, week number, holiday indicator, daily average temperature, maximum electricity load of the day before, and maximum electricity load of the day last week.
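As an illustration of how the eight inputs might be assembled for one forecast day, the sketch below uses only the Python standard library. The holiday periods are those listed above for 2012; the Monday = 1 weekday encoding, the ISO week number, reading "the day last week" as the day seven days earlier, and the names `is_holiday` and `build_inputs` are all assumptions, since the paper does not give these details.

```python
from datetime import date, timedelta

# Chinese legal holiday periods in 2012, as listed in the text (inclusive).
HOLIDAY_PERIODS_2012 = [
    (date(2012, 1, 1), date(2012, 1, 3)),
    (date(2012, 1, 22), date(2012, 1, 28)),
    (date(2012, 4, 2), date(2012, 4, 4)),
    (date(2012, 4, 29), date(2012, 5, 1)),
    (date(2012, 6, 22), date(2012, 6, 24)),
    (date(2012, 9, 30), date(2012, 10, 7)),
]

def is_holiday(day, periods=HOLIDAY_PERIODS_2012):
    """Binary holiday indicator: 1 inside any holiday period, else 0."""
    return int(any(start <= day <= end for start, end in periods))

def build_inputs(day, avg_temp, max_load):
    """Return the eight inputs for `day`.

    max_load maps a date to its daily maximum load (hypothetical container).
    """
    return [
        day.month,                            # month of the year
        day.day,                              # day of the month
        day.isoweekday(),                     # day of the week, Monday = 1 (assumed)
        day.isocalendar()[1],                 # week number, ISO convention (assumed)
        is_holiday(day),                      # holiday indicator
        avg_temp,                             # daily average temperature
        max_load[day - timedelta(days=1)],    # maximum load of the day before
        max_load[day - timedelta(days=7)],    # maximum load seven days earlier (assumed)
    ]
```

For example, `build_inputs(date(2012, 10, 1), 18.5, loads)` marks the day as a holiday because 1st October falls inside the last listed period.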

Mutual Information Weighting Algorithm
In order to further improve the forecasting accuracy, the contributions of the inputs to the output of KNNRW were calculated and the weight values were allocated to the inputs accordingly.
The mutual information (MI) is a measurement of the variables' mutual dependence [27][28][29][30]. Accordingly, the high mutual information indicates the high dependence, and the low mutual information indicates the low dependence.
For two given discrete variables $X$ and $Y$ with joint probability distribution $P_{XY}(x, y)$, the mutual information between $X$ and $Y$, denoted $I(X;Y)$, is defined as

$$I(X;Y) = \sum_{x}\sum_{y} P_{XY}(x, y)\,\log\frac{P_{XY}(x, y)}{P_X(x)\,P_Y(y)} \qquad (16)$$

where $P_X(x)$ and $P_Y(y)$ are the marginal probability distributions. In the case of continuous variables, (16) is replaced by

$$I(X;Y) = \int\!\!\int P_{XY}(x, y)\,\log\frac{P_{XY}(x, y)}{P_X(x)\,P_Y(y)}\,dx\,dy \qquad (17)$$

where $P_{XY}(x, y)$ is the joint probability density function of $X$ and $Y$, and $P_X(x)$ and $P_Y(y)$ are the marginal probability density functions of $X$ and $Y$, respectively. For discrete feature variables, both the joint and marginal probabilities can be estimated by tallying the samples of the categorical variables in the data. For continuous feature variables, the following Parzen window method was used to approximate $I(X;Y)$.
Given $N$ samples of a vector variable $x$, the approximate density function $\hat{P}_X(x)$ has the following form:

$$\hat{P}_X(x) = \frac{1}{N}\sum_{i=1}^{N}\delta\left(x - x^{(i)}, h\right) \qquad (18)$$

where $x^{(i)}$ is the $i$th sample, $h$ is the window width, and $\delta(\cdot)$ is the Parzen window function:

$$\delta(z, h) = \frac{1}{(2\pi)^{d/2}\,h^{d}\,|S|^{1/2}}\exp\left(-\frac{z^T S^{-1} z}{2h^2}\right) \qquad (19)$$

where $z = x - x^{(i)}$, $d$ is the dimension of the sample $x$ and $S$ is the covariance of $z$. When $d = 1$, (18) with the window (19) returns the estimated marginal density; when $d = 2$, the same estimate gives the density of the bivariate $(x, y)$, $P_{XY}(x, y)$, which is in fact the joint density of $x$ and $y$. Hence, in this paper, we used the mutual information to determine the contribution of the inputs to the output of the INNRW. First, the mutual information $MI_i$, $i = 1, \ldots, m$, between each input and the output was calculated. Then a weight $\mu_i$ was allocated to the $i$th input according to its mutual information $MI_i$. The input of KNNRW can then be expressed as the weighted vector

$$\tilde{x} = [\mu_1 x_1, \mu_2 x_2, \ldots, \mu_m x_m]^T$$

The resulting forecasting model is denoted as the improved neural networks with random weights (INNRW).
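The weighting procedure for continuous inputs can be sketched as follows. Several choices here are assumptions rather than details taken from the paper: the Gaussian form of the Parzen window, the use of the sample covariance for S, the window width `h = 0.5`, the sample-average (resubstitution) estimate of the integral, and in particular the normalization of the weights so that they sum to one. The function names are hypothetical, and discrete inputs such as the holiday indicator would instead be handled by tallying.

```python
import numpy as np

def parzen_density(points, samples, h):
    """Parzen estimate at `points` (M, d) from `samples` (N, d), Gaussian window."""
    n, d = samples.shape
    S = np.atleast_2d(np.cov(samples, rowvar=False))   # sample covariance (assumed)
    S_inv = np.linalg.inv(S)
    norm = (2.0 * np.pi) ** (d / 2) * h**d * np.sqrt(np.linalg.det(S))
    z = points[:, None, :] - samples[None, :, :]       # pairwise differences
    quad = np.einsum("mnd,de,mne->mn", z, S_inv, z)    # z^T S^-1 z for each pair
    return np.exp(-quad / (2.0 * h**2)).sum(axis=1) / (n * norm)

def mutual_information(x, y, h=0.5):
    """Sample-average estimate of I(X;Y) from Parzen densities."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    xy = np.hstack([x, y])
    p_x = parzen_density(x, x, h)       # d = 1: marginal density of x
    p_y = parzen_density(y, y, h)       # d = 1: marginal density of y
    p_xy = parzen_density(xy, xy, h)    # d = 2: joint density
    return float(np.mean(np.log(p_xy / (p_x * p_y))))

def mi_weights(X, t, h=0.5):
    """Weights proportional to each input's MI with the target (assumed rule)."""
    mi = np.array([mutual_information(X[:, i], t, h) for i in range(X.shape[1])])
    mi = np.clip(mi, 0.0, None)         # small negative estimates count as zero
    total = mi.sum()
    return mi / total if total > 0 else np.full(len(mi), 1.0 / len(mi))

def apply_weights(X, mu):
    """Scale each input column by its weight before it enters KNNRW."""
    return X * mu
```

An input that carries more information about the target therefore contributes more to the kernel distance between two samples.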

Simulation
To verify its effectiveness, the proposed model was applied to forecast the actual maximum load. The electricity load data from January 1, 2012 to November 30, 2013 from the Dalian Electricity Corporation in China, together with the temperature, the holiday indicator and some other data, were used to train the forecasting model. Daily maximum load data of the 31 days in December 2013 were used to test the performance of the forecasting model. The forecasting results were evaluated using the Mean Absolute Percentage Error (MAPE), the Maximum Error (ME) and the Forecasting Error (FE), which are computed from $L^R_i$, the actual value of the daily maximum load, $L^P_i$, the forecast value of the daily maximum load, and $n$, the number of days.
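A small sketch of the three indices is given here for orientation. MAPE follows its standard definition; the definitions used for FE (the signed relative error of each day, in percent) and ME (the largest absolute daily relative error, in percent) are assumptions of this sketch, since the paper's exact formulas are not reproduced in this text.

```python
import numpy as np

def forecasting_error(actual, forecast):
    """Per-day signed relative error in percent (assumed definition of FE)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return (actual - forecast) / actual * 100.0

def mape(actual, forecast):
    """Mean absolute percentage error over the n test days."""
    return float(np.mean(np.abs(forecasting_error(actual, forecast))))

def maximum_error(actual, forecast):
    """Largest absolute per-day relative error in percent (assumed definition of ME)."""
    return float(np.max(np.abs(forecasting_error(actual, forecast))))
```

For actual loads of 100, 200 and 400 and forecasts of 110, 190 and 400, these functions give daily errors of -10%, 5% and 0%, a MAPE of 5% and an ME of 10%.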

Simulation Experiment
First, the data sets were normalized: the inputs were normalized to [-1, 1] and the outputs to [0, 1]. The weights were then calculated with the mutual information weighting algorithm and allocated to the corresponding inputs. Second, the INNRW model was initialized, with the Gaussian kernel function used in the hidden layer and the regularization coefficient and kernel width determined by grid search. Third, the INNRW model was trained on the training samples. Fourth, the trained INNRW was used to forecast the testing samples, giving the forecasting results of the daily maximum load for the 31 days in December 2013.
Finally, the residual errors between the predicted values and the actual values were calculated.
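The normalization and grid search steps can be sketched as below. This is only an illustration under stated assumptions: min-max scaling with the training minimum and maximum, a hold-out validation set, the mean absolute error as the selection criterion, and the names `minmax_scale` and `grid_search` are not taken from the paper, which does not say how each grid point was scored.

```python
import numpy as np

def minmax_scale(values, lo, hi, v_min, v_max):
    """Map values linearly so that [v_min, v_max] becomes [lo, hi]."""
    values = np.asarray(values, dtype=float)
    return lo + (values - v_min) * (hi - lo) / (v_max - v_min)

def gaussian_kernel(A, B, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2) for every row pair of A and B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def grid_search(X_tr, t_tr, X_val, t_val, C_grid, gamma_grid):
    """Return (error, C, gamma) with the lowest validation error on the grid."""
    best = None
    for C in C_grid:
        for gamma in gamma_grid:
            omega = gaussian_kernel(X_tr, X_tr, gamma)
            alpha = np.linalg.solve(np.eye(len(X_tr)) / C + omega, t_tr)
            pred = gaussian_kernel(X_val, X_tr, gamma) @ alpha
            err = float(np.mean(np.abs(pred - t_val)))  # MAE (assumed criterion)
            if best is None or err < best[0]:
                best = (err, C, gamma)
    return best
```

Every grid point requires one N x N linear solve, which is why the search becomes time-consuming as the grid is refined.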

Experiment Results
The mutual information of the inputs and the resulting weights, computed with the mutual information weighting algorithm, are summarized in Table 1.
Based on the analysis above, the Gaussian kernel function $K(u, v) = \exp(-\gamma\|u - v\|^2)$, where $\gamma$ was the kernel width, was chosen as the kernel function in the INNRW model. Fig 4(A) illustrates the relations among MAPE, the kernel width and the regularization coefficient, while Fig 4(B) illustrates the relations among ME, the kernel width and the regularization coefficient. It can be seen from Fig 4 that both the kernel width and the regularization parameter are key parameters influencing the forecasting performance of the INNRW. The grid search method was used to optimize the two parameters. The optimal kernel width was 3.7276e+03, and the optimal regularization parameter was 1.3895. In order to further illustrate the effectiveness of the proposed method, a comparison was conducted between the INNRW method and several state-of-the-art load forecasting methods, namely the back propagation (BP) neural network, the RBF neural network, support vector regression (SVR), NNRW, the online sequential extreme learning machine (OS-ELM) and KNNRW. The forecasting results are shown in Table 2, Table 3 and Table 4. The forecasting results of the BP neural network, SVR, OS-ELM and the proposed INNRW are summarized in Table 2 and Table 3. It can be observed from Table 2 and Table 3 As can be seen from Table 4, KNNRW and the proposed INNRW obtain much better forecasting results in both MAPE and ME than the other methods. Moreover, the

Conclusions
A forecasting model based on the INNRW was proposed for short-term load forecasting. Through data pre-processing, eight features, i.e. month of the year, day of the month, day of the week, week number, holiday indicator, daily average temperature, maximum electricity load of the day before, and maximum electricity load of the day last week, were selected as the inputs of the INNRW. Then, in order to further improve the forecasting accuracy, different weights were allocated to the inputs according to their mutual information with the load values to be forecast. A novel neural network, KNNRW, which combines the universal approximation ability and fast learning speed of NNRW with the good generalization performance of SVM, was used to model the nonlinear function between the selected inputs and the maximum load. Simulation results based on the actual load data from Dalian, China, showed that the proposed method obtains smaller prediction errors than the traditional forecasting methods in both MAPE and ME. The kernel type and kernel parameters are crucial to the forecasting performance of the INNRW, and they were selected by a time-consuming grid search in this paper. Multiple kernel learning is a potential solution, since it is able to combine kernel functions of different types or with different parameters. As a result, the investigation of multiple kernel learning in the INNRW will be a subject of further research.