^{*}

The authors have declared that no competing interests exist.

Conceived and designed the experiments: CL. Performed the experiments: YL. Analyzed the data: YL CL. Wrote the paper: YL CL.

In experimental and theoretical neuroscience, research on neural plasticity has long been dominated by synaptic plasticity. Recently, neuronal intrinsic plasticity (IP) has attracted considerable attention in this area. IP is sometimes thought to be an information-maximization mechanism. However, it remains unclear how IP affects the performance of artificial neural networks in supervised learning applications. From an information-theoretic perspective, the minimum error entropy (MEE) algorithm has recently been proposed as an efficient training method. In this study, we propose a synergistic learning algorithm that combines the MEE algorithm as the synaptic plasticity rule with an information-maximization algorithm as the intrinsic plasticity rule. We consider both feedforward and recurrent neural networks and study the interactions between intrinsic and synaptic plasticity. Simulations indicate that the intrinsic plasticity rule can improve the performance of artificial neural networks trained by the MEE algorithm.

Artificial neural networks with nonlinear processing elements are designed to deal with the challenging problems of nonlinear and nonstationary signal processing. In a supervised learning problem, we are provided with a training data set containing the input,

So far, experimental and theoretical studies on neural plasticity have mostly focused on synaptic plasticity, in accordance with Hebb's idea that memories are stored in the synaptic weights and learning is the process that changes those weights. Interestingly, recent experimental results have revealed that neurons are also capable of changing their intrinsic excitability to match dramatic changes in the level of synaptic input they receive

The two plasticity mechanisms, intrinsic plasticity and synaptic plasticity, have mostly been studied separately. It is natural to ask how they would cooperate in artificial neural networks to learn complex mappings. To this end, we combine Bell and Sejnowski's information-maximization algorithm

Studying the effects of intrinsic plasticity on various neural functions and dynamics relies on modelling intrinsic plasticity. In

We apply this information-maximization rule as the intrinsic plasticity rule for artificial neural networks in this paper. Note that the original information-maximization rule in
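For the tanh activation used later in this paper, with output y = tanh(a*x + b), where x is the net synaptic input, a the gain, and b the bias, maximizing the expected log-derivative E[log|dy/dx|], and hence the output entropy, yields the batch updates Δa ∝ 1/a - 2E[xy] and Δb ∝ -2E[y]. A minimal sketch of this rule in action (the input statistics and the learning rate are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Net synaptic input with a large variance: a fixed tanh would saturate.
x = rng.normal(0.0, 3.0, size=2000)

a, b = 1.0, 0.0          # gain and bias of y = tanh(a*x + b)
eta_ip = 0.01            # IP learning rate (illustrative)

for _ in range(500):
    y = np.tanh(a * x + b)
    # Infomax gradient of E[log |dy/dx|] for the tanh nonlinearity:
    #   d/da = 1/a - 2*E[x*y],   d/db = -2*E[y]   (batch averages)
    a += eta_ip * (1.0 / a - 2.0 * np.mean(x * y))
    b += eta_ip * (-2.0 * np.mean(y))
```

After adaptation the gain has shrunk well below its initial value, so the neuron operates in the sensitive region of its activation function instead of at saturation.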

As mentioned above, the MSE criterion considers only the first two moments of the error distribution, and is thus ill-suited to nonlinear applications in which the errors are not normally distributed. Recently, the error-entropy criterion (EEC), based on ideas from information theory, has been proposed as an alternative cost function for learning

In applications, Renyi's quadratic entropy,
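Concretely, Renyi's quadratic entropy of the error, H2(e) = -log V(e), is estimated nonparametrically through the information potential V(e): a Parzen-window average of Gaussian kernels over all pairs of error samples. Minimizing the error entropy is then equivalent to maximizing V. A minimal sketch (the kernel width sigma is an illustrative assumption):

```python
import numpy as np

def quadratic_information_potential(errors, sigma=1.0):
    """Parzen estimate of V(e); Renyi's quadratic entropy is H2 = -log V.
    The pairwise kernel has width sigma*sqrt(2), the convolution of two
    sigma-width Gaussian windows."""
    e = np.asarray(errors, dtype=float)
    diff = e[:, None] - e[None, :]          # all pairwise error differences
    s2 = 2.0 * sigma ** 2
    kernel = np.exp(-diff ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    return kernel.mean()

rng = np.random.default_rng(1)
tight = rng.normal(0.0, 0.1, 300)   # concentrated errors
wide = rng.normal(0.0, 1.0, 300)    # spread-out errors
```

Concentrated errors yield a larger potential (smaller error entropy) than spread-out errors, which is exactly why gradient ascent on V drives the errors toward a tight distribution.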

We now introduce our synergistic information-theoretic learning algorithm, which is the simple combination of the IP rule of Eq. (2) and the synaptic plasticity rule of Eq. (7).

From the perspective of information maximization, intrinsic plasticity offers potential advantages in training artificial neural networks. Under traditional weight-update learning algorithms (synaptic plasticity rules), the activation functions of neurons are fixed during training. However, an invariant nonlinear activation function might be ill-matched to the input distribution. In an extreme case, the output of the neuron may constantly sit at saturation and therefore carry very little information about the input. For real-world applications, the desired output distribution of a single neuron is far from such low-information distributions. As stated in the previous section, the information-maximization algorithm (the intrinsic plasticity rule) can adjust the shape of the activation function to match the input distribution and consequently increase the mutual information between input and output. We therefore hypothesize that the intrinsic plasticity rule might be beneficial to learning in artificial neural networks.

The MEE algorithm requires several samples to accurately estimate the information potential. We therefore perform batch (epochwise) learning iterations, whereby the weights
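For a concrete picture of one batch iteration, the following sketch trains a single linear unit by gradient ascent on the quadratic information potential of its errors. This is a deliberate simplification (the paper trains full networks); the target mapping, kernel width, and learning rate are illustrative assumptions. Because entropy is invariant to a shift of the error, the output bias would conventionally be fixed afterwards so that the mean error is zero:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
x = rng.uniform(-1.0, 1.0, N)
d = 2.0 * x                         # illustrative target mapping

w, sigma, eta = 0.0, 1.0, 0.5       # weight, kernel width, learning rate

def potential_and_grad(w):
    """Information potential of the batch errors and its gradient w.r.t. w."""
    e = d - w * x
    de = e[:, None] - e[None, :]    # pairwise error differences
    dx = x[:, None] - x[None, :]
    g = np.exp(-de ** 2 / (4 * sigma ** 2)) / np.sqrt(4 * np.pi * sigma ** 2)
    # d(de)/dw = -dx, and dg/d(de) = -de/(2*sigma^2) * g, so the signs cancel:
    dV = (g * de * dx).mean() / (2 * sigma ** 2)
    return g.mean(), dV

V0, _ = potential_and_grad(w)
for epoch in range(200):            # one weight update per epoch (batch mode)
    _, dV = potential_and_grad(w)
    w += eta * dV                   # gradient ascent on the potential
V1, _ = potential_and_grad(w)
```

Each epoch re-estimates the pairwise kernel sums from all N samples, which is where the O(N^2) cost of the information-potential estimate enters.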

One may think that the effects of synaptic plasticity and intrinsic plasticity merely superpose in the learning process. In fact, we argue that they interact, which is why we call this combination “synergistic learning”. Indeed, the weight-updating procedure affects the input of the neuron and thereby influences the IP learning; the IP learning procedure affects the output of the neuron and thereby influences the weight updating.

It has been noted that, in reservoir networks, due to the incremental nature of intrinsic plasticity (the value of parameter

In order to study the performance of the proposed synergistic learning algorithm, we first choose a general class of feedforward neural networks (FNN) as an example. As illustrated in

The difference between the desired output,

On the basis of the algorithm description in

Choose a random set of small values for the

The epochwise training procedure begins with

Update the weight matrices

Update the parameters

If the stopping criterion is satisfied, the training procedure is stopped; otherwise, set
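The steps above can be sketched end to end on a toy 1-input, 5-hidden-unit network. This is not the paper's implementation: finite-difference gradients of the information potential stand in for the analytic MEE gradients, the output unit is kept linear, and the target mapping, learning rates, and kernel width are all illustrative assumptions; the IP step applies the tanh infomax update to the hidden gains and biases.

```python
import numpy as np

rng = np.random.default_rng(3)
N, H = 80, 5                          # samples, hidden neurons
x = rng.uniform(-1, 1, (N, 1))
d = np.sin(np.pi * x).ravel()         # toy target mapping

W1 = rng.uniform(0, 0.05, (H, 1))     # input-to-hidden weights (small values)
W2 = rng.uniform(0, 0.05, (1, H))     # hidden-to-output weights
a1, c1 = np.ones(H), np.zeros(H)      # IP gains and biases of the hidden layer
sigma, eta_w, eta_ip = 1.0, 0.5, 0.001

def potential(W1, W2):
    """Quadratic information potential of the (centered) batch errors."""
    h = np.tanh(a1 * (x @ W1.T) + c1)         # hidden activations, (N, H)
    e = d - (h @ W2.T).ravel()                # batch errors
    e = e - e.mean()                          # entropy ignores the error mean
    de = e[:, None] - e[None, :]
    g = np.exp(-de ** 2 / (4 * sigma ** 2)) / np.sqrt(4 * np.pi * sigma ** 2)
    return g.mean()

V0 = potential(W1, W2)
eps = 1e-5
for epoch in range(200):
    # Update the weight matrices by ascending the information potential
    # (finite differences replace the analytic MEE gradient for brevity).
    for M in (W1, W2):
        G = np.zeros_like(M)
        for idx in np.ndindex(M.shape):
            old = M[idx]
            M[idx] = old + eps
            Vp = potential(W1, W2)
            M[idx] = old - eps
            Vm = potential(W1, W2)
            M[idx] = old
            G[idx] = (Vp - Vm) / (2 * eps)
        M += eta_w * G
    # Update the IP parameters by the tanh infomax rule (batch averages).
    net1 = x @ W1.T
    h = np.tanh(a1 * net1 + c1)
    a1 += eta_ip * (1.0 / a1 - 2.0 * (net1 * h).mean(axis=0))
    c1 += eta_ip * (-2.0 * h.mean(axis=0))

V1 = potential(W1, W2)
```

The two updates alternate within each epoch, mirroring the step ordering above; a fixed epoch budget stands in for the stopping criterion.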

In this section, we continue to study the proposed synergistic learning algorithm in a general class of recurrent neural networks

The

Following the approach of

The proposed synergistic learning algorithm for the RNN is summarized as follows:

Choose a random set of small values for the

The epochwise training procedure begins with

Let

Repeat the calculation until

Update the weight matrix

Update the parameters

As one epoch ends, if the stopping criterion is satisfied, the training procedure is stopped; otherwise, set
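For reference, the recurrent dynamics assumed in this sketch are a standard discrete-time state update with per-neuron IP gains and biases, with neuron 1 read out as the output (as in the figure captions of this section); all sizes and values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 6, 50                        # number of neurons, sequence length
W = rng.uniform(0, 0.05, (n, n))    # recurrent weight matrix (small values)
w_in = rng.uniform(0, 0.05, n)      # input weights
a = np.ones(n)                      # IP gain parameters
b = np.zeros(n)                     # IP bias parameters
u = rng.uniform(-1, 1, T)           # input sequence

s = np.zeros(n)                     # network state
states = []
for t in range(T):
    net = W @ s + w_in * u[t]       # net input: recurrent drive + external input
    s = np.tanh(a * net + b)        # per-neuron IP-adapted tanh activation
    states.append(s.copy())
states = np.array(states)

y = states[:, 0]                    # neuron 1 is read out as the output
```

The state recursion is run over the whole sequence each epoch before the weight and IP parameters are updated, matching the epochwise procedure above.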

The FNN and RNN are widely applicable to a set of problems such as regression and classification. As a typical example, we test the proposed synergistic learning algorithm on the single-step prediction of time series. For comparison, we also perform simulations for the MEE algorithm alone. The time series is denoted as
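For the prediction task, the series must first be arranged into input-target pairs. Assuming "MG" refers to a Mackey-Glass series (the delay tau = 17 and the other coefficients below are the common choices, assumed here for illustration, as is the embedding length), the construction looks like:

```python
import numpy as np

# Mackey-Glass series by Euler integration with unit step:
#   x(t+1) = x(t) + 0.2*x(t-tau)/(1 + x(t-tau)^10) - 0.1*x(t)
tau, T = 17, 600
x = np.full(T, 1.2)                  # constant initial history
for t in range(tau, T - 1):
    x[t + 1] = x[t] + 0.2 * x[t - tau] / (1 + x[t - tau] ** 10) - 0.1 * x[t]

# Single-step prediction pairs: a window of m past values -> next value.
m = 5                                # embedding length (an assumption)
inputs = np.array([x[t - m:t] for t in range(m, T)])
targets = x[m:]
```

Each row of `inputs` holds the m values immediately preceding the corresponding entry of `targets`, so the network learns the one-step-ahead map.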

The

In the following simulations, the elements of the initial weight matrices

The first simulation compares the learning curves of the MEE algorithm and the synergistic algorithm. In synergistic learning, the activation functions of neurons in both the hidden layer and the output layer are adjusted by IP. In this simulation, structural parameters of the FNN are set to

The dashed lines denote the learning curves of the MEE algorithm, and the solid lines denote the learning curves of the synergistic algorithm. (A) 300-epoch learning curves for the training data set “MG”. (B) 1000-epoch learning curves of “MG”. (C) 300-epoch learning curves for the training data set “SS”. (D) 1000-epoch learning curves of “SS”.

Data set          | Training set                      | Testing set
Criterion         | Information potential | MSE       | Information potential | MSE
No IP             | 2.2632                | 0.0056670 | 2.2967                | 0.0051939
With IP           | 2.3950                | 0.0048786 | 2.4212                | 0.0044354
Improvement (MSE) |                       | 13.91%    |                       | 14.60%

Data set          | Training set                      | Testing set
Criterion         | Information potential | MSE       | Information potential | MSE
No IP             | 2.1518                | 0.0074291 | 2.6066                | 0.0019096
With IP           | 2.1905                | 0.0067886 | 2.6703                | 0.0012658
Improvement (MSE) |                       | 8.62%     |                       | 33.71%

In order to analyze the synergies between IP and synaptic plasticity in detail, input, output, and error distributions of neurons for the training set “MG” are presented. All these distributions are obtained by kernel density estimation in a single run, and results of two independent runs are similar. In order to explain the results clearly, we decompose the FNN into two parts, which are shown in
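The densities discussed here are obtained by kernel density estimation; a minimal Gaussian-kernel version (Silverman's rule-of-thumb bandwidth is an assumption, as is the sample data) looks like:

```python
import numpy as np

def kde(samples, grid, bandwidth=None):
    """Gaussian kernel density estimate of `samples` evaluated on `grid`."""
    s = np.asarray(samples, dtype=float)
    if bandwidth is None:
        # Silverman's rule of thumb for a Gaussian kernel
        bandwidth = 1.06 * s.std() * len(s) ** (-1 / 5)
    z = (grid[:, None] - s[None, :]) / bandwidth
    weights = np.exp(-0.5 * z ** 2)
    return weights.sum(axis=1) / (len(s) * bandwidth * np.sqrt(2 * np.pi))

rng = np.random.default_rng(5)
errors = rng.normal(0.0, 0.3, 500)       # e.g. a batch of network errors
grid = np.linspace(-2, 2, 400)
density = kde(errors, grid)
```

Evaluating the estimate on a fine grid gives the smooth curves shown in the distribution figures; the estimated density integrates to approximately one over the grid.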

(A) The input layer and the hidden layer of the FNN. (B) The output layer of the FNN.

Input and output distributions for the five hidden neurons with the training data set “MG” are displayed. (A) Initial input distributions for the five hidden neurons. (B) Input distributions after 1000-epoch training for the two algorithms. (C) Initial output distributions for the five hidden neurons. (D) Output distributions after 1000-epoch training for the two algorithms. In (B) and (D), the dashed lines denote the distributions obtained by the MEE algorithm, and the solid lines denote the distributions obtained by the synergistic algorithm.

Input distributions for the single output neuron and error distributions with the training data set “MG” are presented. (A) Initial input distribution. (B) Input distributions after 1000-epoch training for the two algorithms. (C) Initial error distribution. (D) Error distributions after 1000-epoch training for the two algorithms. In (B) and (D), the dashed lines denote the distributions obtained by the MEE algorithm, and the solid lines denote the distributions obtained by the synergistic algorithm.

Thus, the analysis on the results in

The training data set “MG” is used. (A) Mean of the gain parameter

The second simulation concerns the IP learning rate.

The training data set “MG” is used. The initial IP learning rates

Training results after 1000-epoch training for the case of the training data set “MG” are presented. The circle markers denote the results obtained by the MEE algorithm, and the cross markers denote the results obtained by the synergistic algorithm. (A) Results of the quadratic information potential. (B) Results of the mean square error.

As in the case of the FNN, we discuss how the recurrent neural network handles the problem of single-step prediction using the same data sets. The input of the network consists of

The values of elements in the initial weight matrix are also randomly selected as small values uniformly distributed in [0, 0.05]. Results in this section are also averaged over 10 independent runs. The MEE learning rate is set to

The first simulation compares the learning curves of the synergistic algorithm and the MEE algorithm. Structural parameters of the RNN are set to

The dashed lines denote the learning curves of the MEE algorithm, and the solid lines denote the learning curves of the synergistic algorithm. (A) 300-epoch learning curves for the training data set “MG”. (B) 1000-epoch learning curves of “MG”. (C) 300-epoch learning curves for the training data set “SS”. (D) 1000-epoch learning curves of “SS”.

Data set          | Training set                      | Testing set
Criterion         | Information potential | MSE       | Information potential | MSE
No IP             | 2.5666                | 0.0021446 | 2.5876                | 0.0019388
With IP           | 2.6653                | 0.0013974 | 2.6994                | 0.0009594
Improvement (MSE) |                       | 34.84%    |                       | 50.51%

Data set          | Training set                      | Testing set
Criterion         | Information potential | MSE       | Information potential | MSE
No IP             | 2.2362                | 0.0060375 | 2.6811                | 0.0011350
With IP           | 2.2765                | 0.0055386 | 2.7057                | 0.0009260
Improvement (MSE) |                       | 8.26%     |                       | 18.41%

The training data set “MG” is used. Neuron 1 (output neuron): (A) Initial input distribution. (B) Input distributions after 1000-epoch training for the two algorithms. (C) Initial error distribution. (D) Error distributions after 1000-epoch training for the two algorithms. Neuron 2: (E) Initial input distribution. (F) Input distributions after 1000-epoch training for the two algorithms. (G) Initial output distribution. (H) Output distributions after 1000-epoch training for the two algorithms. In (B), (D), (F), and (H), the dashed lines denote the distributions obtained by the MEE algorithm, and the solid lines denote the distributions obtained by the synergistic algorithm.

The training data set “MG” is used. (A) The gain parameter

The second simulation concerns the initial IP learning rate.

The training data set “MG” is used. The initial IP learning rates

Training results after 1000-epoch training for the case of the training data set “MG” are presented. The circle markers denote the results obtained by the MEE algorithm, and the cross markers denote the results obtained by the synergistic algorithm. (A) Results of the quadratic information potential. (B) Results of the mean square error.

In some of the above simulations, we do not show the results for the data set “SS”, since they are similar to those for the data set “MG”. All activation functions used in the above neural networks are tanh functions; thus the intrinsic plasticity rule for the tanh function is the one applied in the synergistic learning algorithm. In the case of logistic functions, similar results can be obtained.

Combining the MEE algorithm as the synaptic plasticity rule with the information-maximization algorithm as the intrinsic plasticity rule, we have proposed a synergistic information-theoretic learning algorithm for training artificial neural networks. Whereas the information-maximization algorithm can increase the mutual information of a single neuron, it cannot by itself optimize cost functions such as the EEC or the MSE. Nevertheless, simulations have shown that this information-maximization-based IP rule benefits artificial neural networks in both convergence speed and final learning result. As the IP rule adjusts the activation function of a single neuron to match its input distribution so that all output levels tend to occur equally often, the input can be encoded much more efficiently and the discriminative ability of the neuron is enhanced. We believe that the discriminative ability of a neuron plays a nontrivial role in the performance of artificial neural networks. For the FNN, the synergistic learning algorithm with IP only in the hidden layer or only in the output neuron still outperforms the MEE algorithm without IP, but is inferior to the algorithm with IP in both layers (we do not present these results in the paper).

Compared with the original algorithm, the synergistic learning algorithm incurs only a relatively small increase in computational cost, owing to the local nature of the IP mechanism and the simplicity of the information-maximization algorithm. In addition, we have used the efficient batch version of the information-maximization algorithm. In applications, a long training process is unnecessary, since the improvement is minor toward the end of training. For example, with 300 epochs of training, the IP rule is already quite effective at improving performance. Over long runs, the synergistic learning algorithm maintains good performance.

Advanced search methods for nonlinear optimization, such as conjugate gradient algorithms and the Levenberg-Marquardt algorithm, can be used to further speed up the learning process. In order to focus on the synergies between IP and synaptic plasticity and to exclude the influence of such advanced search methods on learning, we used the simple gradient descent (GD) method.

In biology, Bell and Sejnowski's information-maximization algorithm can match the statistics of naturally occurring visual contrasts to the response amplitudes of the blowfly's large monopolar cells (LMCs). The contrast-response function of the LMCs in the blowfly's compound eye approximates the cumulative probability distribution of contrast levels in natural scenes; thus the inputs are encoded so that all response levels are used with equal frequency, resulting in a uniform output distribution

As related work, several studies have combined synaptic learning algorithms with Triesch's IP rule

We gratefully acknowledge the anonymous reviewers for providing valuable comments and suggestions, which greatly improved our paper.