An Efficient Supervised Training Algorithm for Multilayer Spiking Neural Networks

The spiking neural networks (SNNs) are the third generation of neural networks and perform remarkably well in cognitive tasks such as pattern recognition. The spike emitting and information processing mechanisms found in biological cognitive systems motivate the application of the hierarchical structure and temporal encoding mechanism in spiking neural networks, which have exhibited strong computational capability. However, the hierarchical structure and temporal encoding approach require neurons to process information serially in space and time respectively, which reduce the training efficiency significantly. For training the hierarchical SNNs, most existing methods are based on the traditional back-propagation algorithm, inheriting its drawbacks of the gradient diffusion and the sensitivity on parameters. To keep the powerful computation capability of the hierarchical structure and temporal encoding mechanism, but to overcome the low efficiency of the existing algorithms, a new training algorithm, the Normalized Spiking Error Back Propagation (NSEBP) is proposed in this paper. In the feedforward calculation, the output spike times are calculated by solving the quadratic function in the spike response model instead of detecting postsynaptic voltage states at all time points in traditional algorithms. Besides, in the feedback weight modification, the computational error is propagated to previous layers by the presynaptic spike jitter instead of the gradient decent rule, which realizes the layer-wised training. Furthermore, our algorithm investigates the mathematical relation between the weight variation and voltage error change, which makes the normalization in the weight modification applicable. Adopting these strategies, our algorithm outperforms the traditional SNN multi-layer algorithms in terms of learning efficiency and parameter sensitivity, that are also demonstrated by the comprehensive experimental results in this paper.


Introduction
Increasing the level of realism in a neural simulation and improving the computational capability of artificial neural networks [1] [2], the spiking neural networks employing temporal coding mechanism is introduced as the third generation of neural networks and has achieved great success in various artificial intelligence tasks [3]- [6]. Most traditional neural networks represent real-valued analog data by the firing rate of neurons, like the first generation neural networks pioneered by the McCulloch-Pitts model [7] and the second generation by the the perceptron model [8]. However, there are substantial evidences that in biological neural systems there exist fast computations that are very likely based on spike firing events [1][2] [9]. To simulate these firing events, the third generation of neural networks, the SNNs transmit information by spike times instead of the firing rate, and have been proven computationally more powerful than networks with rate coding [10]- [13].
In the learning of SNNs with temporal encoding mechanism, the supervised training is an important biomimetic concept which could potentially improve the learning speed with the help of an instructor signal. Various supervised training algorithms of SNNs have been proposed by now, which can broadly be subdivided into two classes: training algorithms for single layer SNNs, and for multilayers.
The single layer training algorithms are introduced based on the gradient decent rule or learning windows. Regarding to the gradient decent rule, the Tempotron [14] is a classical algorithm employing the distance between the output neuron's voltage and the firing threshold as the cost function and can complete training efficiently, however, it can only complete binary classification tasks. The Chronotron [15] and spike pattern association neuron algorithm [16] try to minimize the distance between the desired and actual output spike trains by the gradient descent rule, with the distance defined by the Victor and Purpura metric [17] and the van Rossum metric [18] respectively.
A lot of algorithms based on learning windows have been proposed [19] for single layer networks. Among which, the remote supervised learning method is a classical one employing both the Spike-Timing-Dependent Plasticity (STDP) window, and the anti-STDP learning window to complete training [20]. The perceptron-based spiking neuron learning rule adopts a learning window based on the postsynaptic voltage function to instruct training [21]. The Synaptic Weight Association Training (SWAT) utilizes the STDP learning window and the Bienenstock-Cooper-Munro learning rule [22] to drive learning and achieves convergence. The precise spike driven synaptic plasticity learning rule [23] combines the Windrow-Hoff rule and the learning window of postsynaptic potential. Further algorithms adopting learning windows are introduced in [24]. These training algorithms employing learning windows are often more efficient than those with the gradient descent rule. But these single layer algorithms cannot complete training when the network structure contains hidden layers. However, electrophysiology experiments on cat's visual system and monkey striate cortex reveal that the information in biological neurons is processed hierarchically rather than by a single layer [25]- [27]. Then, training a hierarchical spiking neural network is by far the closest way to the biological system.
For multilayers learning of the SNNs, the Spike Propagation (SpikeProp) [28] is the pioneer method that defines the computational error by the distance between the actual and target firing time, and minimizes the error by the gradient descent rule. It achieves training accurately but inefficiently, and only the first spike of a neuron can be trained. Different variations of the SpikeProp, the Quick Propagation, Resilient Propagation [29] and the Multiple SpikeProp [30] [31] are proposed to improve the SpikeProp's learning performance. The Multi-layer Remote Supervised Learning Method(Multi-ReSuMe) [32] extends the ReSuMe [20] to multiple layers by the gradient decent rule, assuming that the relation between the input and output firing rates is linear. All of these existing algorithms can achieve learning, while the efficiency of them is much lower than that of the biological system [33] [34], and does not meet the requirements of the real-time applications.
To solve the low efficiency problem in the multilayer training of SNNs, the Normalized Spiking Error Back-Propagation (NSEBP) is proposed in this paper, which is motivated by the selective attention mechanism of the primate visual system [35,36] and its layer-wise feature representation method in hierarchical structures [25,26]. Different from traditional algorithms, our algorithm only selects target spike times as attention areas and ignores the states of other times. Besides, the voltage difference is employed to evaluate training errors, and the relation between the weight variation and voltage error change is uncovered, which enables the NSEBP to adjust each synaptic efficacy accurately. Moreover, the computational error is back propagated to previous layers by presynaptic spike jitter instead of the traditional gradient decent rule, which realizes layer-wise training in our algorithm. In the feedforward calculation, the analytic solutions of the spike time are calculated in the spiking response model, instead of detecting the postsynaptic voltage states at all time points. Employing these strategies, our algorithm achieves a significant improvement in training efficiency compared with the traditional training methods.

Learning Algorithm
In this section, a new algorithm for feed-froward multilayer spiking neural networks, the Normalized Spiking Error Back Propagation (NSEBP) is presented.

Neuron Model
In our study, the simplified Spike Response Model (SRM 0 ) is employed because of its simplicity and effectiveness. In the SRM 0 [37], once the jth spike is emitted, a fundamental voltage j is inspired and transmitted to its postsynaptic neuron. Each postsynaptic neuron integrates the weighted sum of all presynaptic influence j at time t as its voltage u(t), and emits a spike if its voltage u(t) reaches the threshold. The postsynaptic voltage u(t) is described in the following equations: with the Heaviside step function Specifically,t out denotes the last recent output spike of the postsynaptic neuron, w j is the weight of the presynaptic neuron emitting the jth input spike, and Zðt Àt out Þ is the refractory function to simulate the biological refractory period. Γ j is a set containing the spike time emitted by all the presynaptic neurons, u ext is the external voltage, s j ¼ t À t j in , with t j in denoting the jth firing time of the input spike train. τ 1 and τ 2 are constant parameters.
In our algorithm, the Post-Synaptic Potential (PSP) learning window is employed, which is represented in Eq (4), providing a relation of the weight modification and spike time deviation. Obviously, it only directs weight modification if the presynaptic neuron fires before the postsynaptic one.
where A 1 is a constant set to be 1 in our study, and s j ¼ t À t j in denotes the time distance between the current time t and the input time t j in .

The NSEBP Algorithm
In this section, the Normalized Spiking Error Back Propagation (NSEBP) is proposed. The feedforward calculation process is derived in the Theorem 1, and the feedback training process of this learning rule is described here for one postsynaptic neuron with several presynaptic neurons.
In the network with n layers employing the SRM 0 model, an arbitrary postsynaptic neuron o has an input spike train . . . t P in g denoting the ordered input spikes, and a target spike train Inspired by the selective attention mechanism of the primate visual system, the NSEBP only detects and trains the voltage states at target time points for neuron o, and ignore states on other non-target time.
For each postsynaptic neuron o, instead of the traditional time error, the voltage distance between the threshold ϑ and the postsynaptic voltage u(t d ) at the target time t d is employed as the network error in our algorithm described in Eq (5), which is trained to become zero in our algorithm: To train the postsynaptic voltage to ϑ, two steps are applied to our algorithm, that are the presynaptic spike jitter to back propagate error and the weight modification to complete training of the current layer. Presynaptic spike jitter. The presynaptic spike jitter is employed to back propagate error instead of the traditional gradient descent rule. It can influence the postsynaptic neuron voltage u(t d ) and realize layer-wised training. To achieve the back-propagated layer-wised learning, neurons in hidden layers also require the target spike time and training error. Then, the error in Eq (5) is allocated to n layers by the normalized parameter r and back propagated by the presynaptic spike jitter. The error assigned to the current layer err n w is err n w ¼ rerr; and to the previous n − 1 layers err n t is where r is set to 1/n in our algorithm. This error assignment is shown in Fig 1. The error err n t is back propagated to previous n − 1 layers by shifting each influential presynaptic spikes (presynaptic spikes which have influence to the postsynaptic neuron o at the current target time t d ). The error assigned to the jth influential presynaptic spike is Δu j , which is calculated by In which g jn t is an assign variable and calculated by : with A 1 and W ind (s j ) defined in Eq (4). m 1 and m 2 are the first and last indexes of the influential presynaptic spikes respectively. To achieve training of this Δu j , the time variation of the jth input spike Dt j pre is calculated by with a ¼ Àw j exp ððt j pre À t d Þ=t 2 Þ; ð11Þ b ¼ w j exp ððt j pre À t d Þ=t 1 Þ; ð12Þ where w j is the corresponding weight of the jth presynaptic spike, t d is the target spike time, τ 1 and τ 2 are model parameters defined in Eq (2), t j pre denotes the jth presynaptic spike time. If a 6 ¼ 0 and b 2 − 4ac ! 0 hold, Dt j pre is calculated by Eq (10) and the solution with the minimum absolute value is applied to our algorithm. The calculation is derived in the following theoretical derivation section.
Weight modification. To train err n w to 0, the weight modification in our algorithm for an arbitrary influential input spike j is calculated by where g n j is a parameter defined by the normalized learning window with j (s j ) calculated by Eq (2), and W ind (s j ) by Eq (4). The error and its assignment in our algorithm. The error err is assigned to two parts, among which err n w is assigned to the current layer for weight modification, and err n t is propagated to previous (n − 1) layers. doi:10.1371/journal.pone.0150329.g001 To avoid the j (s j ) in Eq (14) going to infinitesimal when s j is too large, not all presynaptic spikes but only the spikes with voltage j (s j ) > ϑ v at t d are trained, where ϑ v is the voltage threshold. Solving the same mathematical equations as Theorem 2 in the following section, the time boundaries are t 1 ¼ t d À t j pre þ t 1 ln ðð1 À ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 À 4W v p Þ=2Þ and If there is no input spike in the range [t 1 , t 2 ], S spikes in [t 1 , t 2 ] are added randomly to the presynaptic hidden neurons with a probability. The probability p i assigned spikes to neuron i is calculated by in which, m is the number of presynaptic neurons, n i is the number of spikes emitted by neuron i, and n i = 0.5 if there is no spike emitted. Consequently, the fewer spikes emitted by neuron i, the higher probability p i it possesses. This allocation approach not only solves the none input problem in the training, but also balances the spike distribution. These added spikes are regarded as the target time of previous hidden layers and trained in the previous layer. The voltage in the time scope t À t j pre 2 ½t 1 ; t 2 . The voltage j (s j ) caused by the input spike t j pre is above W v when t À t j pre 2 ½t 1 ; t 2 . The voltage of the input t j pre is set to 0 at time t if this t À t j pre is not in the interval [t 1 , t 2 ]. doi:10.1371/journal.pone.0150329.g002

Mathematical Analysis Theoretical Derivation
In this section, the theoretical derivations in the feedforward and back propagation of our algorithm are presented in Theorem 1 and Theorem 2 respectively.
In traditional methods employing the temporal encoded SNNs, the output spikes of a neuron are detected serially in time, leading to an inefficient feed forward process. In our study, the output spike times are obtained by solving the voltage function instead of traversing all time points, which are derived in the following theorem.
Supposing that for a postsynaptic neuron o, the refractory period Zðt Àt out Þ is set to ÀA 2 exp ðÀðt Àt out Þ=t 1 Þ with the constant A 2 > 0, the external interference voltage u ext = 0, and T pre ¼ ft 1 pre ; t 2 pre ; t 3 pre ; . . . t P 1 pre g is the ordered presynaptic spike train of the P 1 presynaptic input spikes, where t j pre denotes the jth presynaptic spikes with w j representing its response synapse weight. m o is the index of the first influencing spike to the postsynaptic neuron o, ϑ is the firing threshold, τ 1 and τ 2 are model parameters defined in Eq (2). With these definitions, the relation between the pre and postsynaptic spikes is obtained by solving the quadratic function, which is proved in the following theorem: If the following conditions hold: the postsynaptic output spike t out in the range ½t j pre ; t jþ1 pre Þ is solved by Proof: According to the SRM 0 model described in Since ( Zðt Àt out Þ ¼ ÀA 2 exp ðÀðt Àt out Þ=t 1 Þ; ð21Þ at time t out , we have The postsynaptic neuron will fire once its voltage reaches the threshold ϑ, then the postsynaptic output spike time t out follows According to Eq (2), Thus, we have and refer to Eqs (17), (18), (25) and (II), By (I), the solutions of Eq (26) is and for all presynaptic time, we have t i > 0, z > 0, then The result follows. The Theorem 1 proves the relation between the pre and postsynaptic spikes, which is applied to our algorithm to improve the feedforward computation efficiency.
In the feedback process of our algorithm, the error is back propagated by the presynaptic spike jitter instead of the traditional gradient decent rule, by which, the layer-wised training is applicable to our algorithm and improves the learning efficiency of our algorithm significantly. The relation of the presynaptic spike jitter and the voltage change is investigated in the following theorem.
Supposing that Δu j is the voltage variation of the postsynaptic neuron o generated by the jth presynaptic spike, t d is the current target spike time, and other variables are the same as that in Theorem 1, then the relation between the time jitter Dt j pre and Δu j is obtained by solving the quadratic function, which is proved in the following theorem: In the SRM 0 model with τ 1 = 2τ 2 , if the voltage change Δu j follows ðIÞ w j 6 ¼ 0; ðIIÞ the voltage variation Δu j of the postsynaptic neuron can be achieved by the presynaptic spike time jitter Proof: Supposing that u j is the postsynaptic voltage stimulated by the input spike t j pre at the target time t d , and the presynaptic spike jitter 4t j pre makes the voltage change 4u j . By Eq (1), we have and then Let and by Eqs (29)-(31) and (34) can be expressed by Under the condition (I), we have w j 6 ¼ 0 ) a 6 ¼ 0, and if w j > 0, for condition (II), An Efficient Training Algorithm for Multilayer Spiking Neural Networks Then the discriminant of Eq (36) is Analogously, when w j < 0, for condition (III), Then, under these conditions, Eq (36) has solutions By the property of the logarithmic function, the spike jitter Dt j pre can be obtained by Eq (35) Under condition (II), for w j > 0, Du j > Àw j j ðt d À t j pre Þ, we have j ðt d À t j pre Þ þ Du j =w j > 0, and then Analogously, by condition (III), w j < 0, Then, under these conditions, Dt j pre is solved by Eqs (35) and (43) with The result follows.

An Efficient Training Algorithm for Multilayer Spiking Neural Networks
Specially, when Δu j exceeds the boundary of Theorem 2 (II) or (III) in our algorithm, it is set to the corresponding feasible boundary in the same direction of the condition.

Convergence Analysis
In this section, the convergence of our algorithm is investigated. To guarantee the convergence of the traditional and our algorithms employing the SRM 0 model, some conditions need to be met to select target time points. These conditions for traditional algorithms and our algorithm are studied in Theorem 3 (1) and Theorem 3 (2) respectively by analyzing the voltage function and the spiking firing conditions. Theorem 3 In the network under n layers employing the SRM 0 model described in Eqs (1)-(3) with τ 1 = 2τ 2 , we have: (1) To guarantee the convergence of the traditional algorithms based on the precise spike time mechanism, a time point t m d is available as target time only if there exist input spikes in ½t m d À ðn À 1Þt 1 ln 2; t m d Þ.  (2), Taking the partial derivatives, In the traditional precise time mechanism, it is @ pre ðtÀt j in Þ @t ! 0 when emitting spikes, then we have by τ 1 = 2τ 2 , Then in traditional algorithms, the input spike t j in can only inspire output spikes in the scope ðt j in ; t j in þ t 1 ln 2 for its postsynaptic neurons. Analogously, the output spikes caused by t j in are in ðt j in ; t j in þ ðn À 1Þt 1 ln 2 after n layers. Consequently, traditional algorithms cannot get convergent at t m d if there is no input spike in ½t m d À ðn À 1Þt 1 ln 2; t m d Þ. Then, to guarantee the convergence of the traditional algorithms based on the precise spike time mechanism, a time point t m d is available as target time only if there exist input spikes in ½t m d À ðn À 1Þt 1 ln 2; t m d Þ.
(2) Our algorithm employs the primate selective attention mechanism instead of the precise spike time rule, then there is no requirement of @ pre ðt À t j in Þ=@t ! 0. When the local influence shown in Fig 2 is not applied to our algorithm, all presynaptic spikes which have an influence on t m d can be trained to complete learning. Consequently, the time scope for input spikes is ½0; t m d Þ. If the local influence shown in Fig 2 is applied to our algorithm, the time scope of output spikes generated by the input t j in after several layers is shown in Fig 3. It is obvious that in layer 1, the time scope is [t 1 , t 2 ], and in layer 2, the earliest time to fire is t 1 0 = t 1 + t 1 , and the latest firing time is t 2 0 = t 2 + t 2 . Then, for layer n − 1, we have t d satisfying Eq (56) to complete training: Obviously, if there is no input in ½t m d À nt 1 ; t m d À nt 2 , this t m d can not be trained convergently. Then, in this condition, the time points t m d is available as target time only if there exist input spikes in ½t m d À nt 1 ; t m d À nt 2 . The results follow. Theorem 3 provides conditions for encoding target times, which guarantees the convergence of different algorithms. It also indicates that the convergent condition of our algorithm is relaxed compared with traditional algorithms. When target spikes are selected following these conditions, different algorithms have different convergence properties and speed, which depend on their training mechanisms. In our algorithm, all qualified target times can be trained successfully and efficiently.
At the mth target time t m d , the convergence of our algorithm is proved in the Theorem 4 by analyzing the postsynaptic voltage. Since the interference from training other target spikes or patterns varies with different network status, the following theorem proves the convergence of our algorithm ignoring this interference. The convergent situations with various interference are investigated in the following simulations sections.
Theorem 4 In the SRM 0 model described in Eqs (1)-(3) with τ 1 = 2τ 2 , the postsynaptic voltage of our algorithm at an arbitrary target time t m d is convergent to the threshold ϑ if the condition in Theorem 3 (2) holds, ignoring the interference from training other target spikes or patterns.
Proof: There are two cases in the training: (1) the presynaptic hidden neurons emit qualified spikes.
Since our algorithm are trained layer-wisely, each layer shares the same training process and becomes convergent in the same way. Then, the convergence of our algorithm is proved in one layer with a postsynaptic neuron and several presynaptic neurons. Suppose that the voltage of the postsynaptic neuron at t m d is uðt m d Þ calculated by Eq (1), and the voltage variations generated by the presynaptic weight modification and spike jitter are Δu w and Δu t respectively. For an arbitrary postsynaptic neuron o, and to the pth presynaptic spike, the voltage variation generated by the weight modification of this spike is Du p w , which is calculated by where Δw p is the variation of its weight w p . Since @uðt m d Þ=@w p ¼ p ðt m d À t p in Þ, For all presynaptic inputs, the voltage variation generated by the weight modification is in which t p pre is the pth presynaptic spike time, and m 1 and m 2 are the first and last indexes of these presynaptic spikes. By Eqs (5), (6) and (14),  An Efficient Training Algorithm for Multilayer Spiking Neural Networks By Eq (15), If the conditions in Theorem 2 hold, according to Eq (8), the Dt p pre calculated by Eq (32) is which makes the postsynaptic voltage variation generated by the presynaptic spike jitter Du p t become g pn t err n t , with err n t and g pn t defined in Eqs (7) and (9) respectively. Under these conditions, for all presynaptic spikes, the voltage variation inspired by the presynaptic spike jitter Δu t is By Eq (9), P m 2 p¼m 1 g pn t ¼ 1, then according to Eqs (5) and (7), where r is a parameter defined in Eq (7). Consequently, the whole postsynaptic voltage variation Δu generated by both presynaptic spike jitter and weight modification is and then If the value of Du p t exceeds the boundary in Theorem 2, the corresponding feasible boundary value is set to Du p t , and the solution is obtained according to Eq (32). It is obvious that the boundary value has the same training direction with Δu, which can make err close to 0. Then our algorithm will be convergent after several leaning epochs at t m d . (2) If there is no qualified input spike in the presynaptic hidden layer, our algorithm adds spikes randomly with probability calculated by Eq (16), after which all weight modifications and spike jitters are the same as case (1), and our algorithm can get convergence.
The layer-wise training is employed in our algorithm, by which each layer shares the same training process and becomes convergent in the same way. In this analysis, the interference of other target spike trains or patterns are ignored. With the influence, our algorithm requires several more epochs to offset this interference and complete training.
The results follow.

Computational Complexity
In this section, the computational time complexity of our algorithm is studied and compared with two traditional algorithms, the SpikeProp [28] and Multi-ReSuMe [32]. Before this, the detailed pseudo-codes of the feedforward and feedback processes of our method are listed. Feedforward calculation. According to the previous description, the pseudo-code of the feedforward calculation of our method is listed below. Since in the multilayer networks, each layer shares the same feedforward calculation process, only the calculation of one layer is described in this pseudo-code.

The Feedforward Calculation of Our Algorithm
Definition: T pre : the set of presynaptic spikes, which contains spikes emitted by all presynaptic neurons ft 1 pre ; t 2 pre ; t 3 pre ; . . . ; t P pre g. T pre is sorted and has no duplicate numbers.

Initialization:
The weight matrix W is initialized randomly. Feedforward calculation: For each postsynaptic neuron: For each presynaptic spike interval t j pre to t jþ1 pre with 1 < j < P − 1: For all presynaptic spikes before t j pre , calculate parameters a and b by Eqs (17) and (18). If a and b meet the conditions in Theorem 1: Calculate the output spike time in this scope by Eq (19) and add it to the output spike train of the postsynaptic neuron.

End If End For End For
Supposing that there are M presynaptic neurons, N postsynaptic neurons, P input spikes of all these presynaptic neurons, and the time length is T. As described in the pseudo-code, our algorithm detects each input spike scope and calculates parameters a and b by all of these P input spikes in the worst case, then the time complexity of our algorithm for one layer is O (NP 2 ), which also reveals the number of operations in our method.
For traditional feedforward calculation method, all discrete time points in T are detected instead of the input spike scopes of P, then the second loop in the pseudo-code above is replaced by the time scope in T (supposing that the time interval is 1ms). For each time scope, it calculates the postsynaptic voltage by these P input spikes and determines whether the voltage is greater than the threshold. In this way, the time complexity of the traditional method in one layer is O(NTP). Since a neuron can only emits one spike in a time points, we have P T. Consequently, the time complexity of our method in the feedforward calculation is less than that of the traditional approach.
Feedback modification. Similar to the feedforward calculation, the feedback weight modifications in each layer and each output neuron of our algorithm have the same training process. Then in this part, only the training process of one layer and one output neuron is listed in the following pseudo-code.

The Feedback Modification of Our Algorithm
Definition: T pre : the set of presynaptic spikes, which contains spikes emitted by all presynaptic neurons ft 1 pre ; t 2 pre ; t 3 pre ; . . . ; t P pre g. T d : the set of target output spikes, which contains all target spikes of the postsynaptic neuron ft 1 d ; t 2 d ; t 3 d ; . . . ; t D 1 d g.

Initialization:
The weight matrix W is initialized randomly. Feedback modification: For each target spike time in T d : Calculate the weighted sum of all input spikes as the postsynaptic voltage u(t d ), and calculate the error err by Eq (5).
Step2: Calculate the presynaptic spike variation in the current layer by Eq (10).
Step3: Adjust all presynaptic weights in this layer by Eq (14).

End If End For
Assuming that there are M presynaptic neurons, N postsynaptic neurons, and the number of target spikes for all postsynaptic neurons is D, which is equal to D 1 + D 2 + . . . + D N , with D i represents the number of target spikes of the ith postsynaptic neuron. The number of input spikes is P, and the time length is T. According to the pseudo-code for the feedback modification of our algorithm, the time complexity for one layer is O(DP), which also reveals the number of operations in our algorithm.
In traditional algorithms, like the SpikeProp and Multi-ReSuMe, the postsynaptic states at all time points of T instead of target intervals T d are detected and their corresponding weights are modified by a given condition. Then the time complexity for most traditional algorithms like the SpikeProp and Mullti-ReSuMe in one layer is O(TP). Since D < T, the number of operations in our algorithm is less than traditional methods.

Training Performance
In this section, the training performance of our algorithm is investigated and compared with two classical algorithms, the SpikeProp [28] and Multi-ReSuMe [32].
A spiking network structure employing an input layer with 50 neurons, a hidden layer with 100 spiking neurons, and an output neuron is devised in our simulations, which is shown in Fig 4. The training of the multilayer neural network consists of two steps, the feedforward calculation and feedback weight modification. The efficiency of the both two steps is studied in the following parts.

Feedforward Calculation
The feedforward calculation is an important step before learning, it computes the output spikes from the input ones. In this subsection, two simulations are conducted to investigate the computational performance of our proposed method described in Theorem 1 compared with the traditional precise time calculation method. Specifically, the network structure is shown in Fig 4, and these input output spikes are generated by a homogeneous Poisson process.
The first simulation is carried out to explore the feedforward calculation performance in different time lengths from 200 ms to 2800 ms, and for each input neuron, one spike generated by a homogeneous Poisson process is emitted. To evaluate the similarity of the output spike trains calculated by our algorithm and traditional method quantitatively, the correlation-based measure C [38] is employed, with where v 1 Á v 2 is the inner product, and |v 1 |, |v 2 | are the Euclidean norms of v 1 and v 2 respectively. The v 1 and v 2 are vectors obtained by the convolution of the two spike trains using a Gaussian filter: where N i is the number of spikes in the test spike train, and t i m is the mth spike in it. σ is the standard deviation of this Gaussian filter which is set to be 1 in our study. Generally, the measure C equals to 1 for identical spike trains and decrease towards zero for loosely correlated spike trains.
The simulation results are shown in Fig 5A, which indicate that our proposed method has the same output spike trains as the traditional method, but achieves higher efficiency than it, because our method detects only time intervals of the input spikes instead of all time points in the traditional method.
In the second simulation, the performance of our proposed method is tested with the input firing rate of a homogeneous Poisson process ranging from 100 Hz to 1100 Hz with time length fixed to 1000 ms. Obviously, the higher the input firing rate, the higher the input densities, and when the firing rate is higher than 1000 Hz, the density of input spikes and all time points is similar.
Simulation results shown in Fig 5B indicate that our method has the same output spike trains as the traditional ones, and its computational time is growing with the increase of the input spike density. However, our method is still a little more efficient than the traditional method even if the input spike density is similar to that of all time intervals with a rate above 1000 Hz.

Feedback Weight Modification
In this section, the training performance of our algorithm is investigated and compared with the traditional classical multi-layer algorithms, the Spikeprop and Multi-ReSuMe. The first simulation is devised to study the learning efficiency at different time lengths of spike trains, in which there are 50 input neurons, 100 hidden neurons, and one output neuron with a network structure depicted in Fig 4. The input spike train of each input neuron is generated by a homogeneous Poisson process with r = 10 Hz, ranging the time length from 200 ms to 2800 ms. The output neuron is desired to emit only one spike in these time lengths because the SpikeProp cannot complete training for multiple target spikes. For fast convergence of traditional algorithms, the target time t d is set to t o + 5, where t o is the output firing time of the first epoch. Similar to the previous simulation, C is employed here to measure the accuracy.
The comparison results are shown in Fig 6A, in which the upper sub-figure depicts the learning accuracy of these three algorithms. It illustrates that in this simulation, our algorithm  The second simulation is conducted to test the learning efficiency of our algorithm under different firing rates, in which the input and output spike trains share the same time length of 500 ms, and the input spike train of each input neuron is generated by a homogeneous Poisson process ranging from 10 Hz to 100 Hz.
The simulation results are shown in Fig 6B, where the upper sub-figure shows the accuracy of these three algorithms, the middle and the below ones depict the learning efficiency of traditional algorithms and our algorithm respectively. Similar with the previous simulation, our algorithm achieves an approximate accuracy with the SpikeProp and Multi-ReSuMe, and improves the training efficiency significantly both in training epochs and training time.
To further explore the learning efficiency of these algorithms, the training time of one epoch is tested in various firing rates and time lengths. Firstly, the time length is fixed to 500 ms, and each input neuron emits a spike train generated by a homogeneous Poisson process with firing rate ranging from 10 Hz to 100 Hz, and the output neuron emits only one spike.
The simulation results shown in Table 1 indicate that the higher the firing rate, the more time required for these three algorithms to complete one training epoch, since more input spikes required to be trained. Besides, our algorithm consumes less time than traditional ones in various firing rates.
Secondly, in order to verify the effect of the time length on the training efficiency, another simulation is carried out where each input neuron emits two spikes and the output neuron emits one spike generated by a homogeneous Poisson process, and the maximum time length varies form 100 ms to 1000 ms.
The simulation results are shown in Table 2, which denotes that in various time lengths, our algorithm has the similar running time for training one epoch, while the time of the SpikeProp and Multi-ReSuMe increases obviously. This reveals the advantage of the selective attention mechanism which enables our NSEBP to concentrate attention on target contents and does not have to scan all time points as traditional algorithms do. It also indicates that the running time of NSEBP has no direct relation to the time length. These simulations in this section demonstrate that our algorithm achieves a significant improvement in efficiency compared with traditional algorithms.

Non-linear Spike Pattern Classification
The XOR Benchmark In this section, we perform experiments with the NSEBP on a classical example of a non-linear problem, the XOR benchmark to investigate its classification capability and the influence of  It depicts that the input 0 and 1 are encoded randomly, which is set to the spike time [1,2] and [3,4]  Our algorithm is applied to the feed-forward network described above with ϑ = 1, τ 1 = 5 and r = 0.5. Different from traditional multi-layer networks in [28,32], our algorithm requires  This training efficiency is higher than traditional algorithms that is at least 63 epochs in Multi-ReSuMe [32] and 250 cycles in Spike-Prop [28]. In the following we systematically vary the parameters of our algorithm and investigate their influence.

The Parameters
In this part, we explore the influence of the parameter τ 1 , the number of hidden neurons, the parameter r defined in Eq (6), and the threshold ϑ on the convergent epochs. 50 simulations are carried out and the average learning epoch is obtained .  Fig 9(a) shows the convergent epochs for different values of the threshold ϑ, with the number of hidden neurons fixed to 10, r = 0.5, and τ 1 = 5. It suggests that the convergence of our algorithm is insensitive to the threshold, and it can complete training in 8 epochs in various situations. Fig 9(b) depicts the convergent epoch with different numbers of hidden neurons, which indicates that in the beginning, more neurons in the hidden layer lead to less learning epochs, while when it above 30, this change is not obvious. This is mainly because more hidden neurons make more sparse representation of the input patterns in the hidden layer, which are easier to be trained because there is less interference between different patterns. Different amount of information requires different numbers of hidden neurons for a sparse representation, and if  (6), which determines the proportion of the error back-propagated to the previous layers. The simulation results demonstrate that the convergence of our algorithm has no noticeable relationship with r. To balance the load of each layer, we suggest r = 1/n when there are n layers required to be trained in our algorithm. In real world applications, r can be set to different values according to different requirements. Fig 9(d) shows the convergent epoch for different τ 1 , which indicates that our algorithm can achieve rapid convergence in various cases, and different values of τ 1 have some influence on the convergent speed, but not obvious.
Simulations in this section demonstrate that our algorithm can complete non-linear classification efficiently. Besides, simulation results shown in Fig 9 indicate that our algorithm is not sensitive to various parameters, which makes our algorithm more convenient to be applied to various applications.

Classification on the UCI Datasets
In this section, we apply NSEBP to classify both the Iris and Breast Cancer Wisconsin (BCW) datasets of the UCI [39] to investigate the capability of our algorithm over classification tasks.

Iris Dataset
The Iris dataset is firstly applied to benchmark our algorithm. It contains three classes, each with 50 samples and refers to a type of the iris plant: Iris Setosa (class 1), Iris Versicolour (class 2), and Iris Virginica (class 3) [40]. Each sample contains four attributes: sepal length (feature 1), sepal width (feature 2), petal length (feature 3), and petal width (feature 4).
To make the difference between the data apparent, the data of each feature are mapped into a high dimensional space using the population time encoding method [41]. There are 12 uniforming distributed Gaussian receptive fields in [0, 1], which distribute an input variable over 12 input neurons which are shown in Fig 10. The network structure devised for this classification task is shown in Fig 11, in the iris data set, the input layer has 48 neurons with each feature 12 inputs. The hidden layer has 4 neurons, each possesses local connections with 12 neurons of the input layer that represent one feature. In the training period, only synaptic weights from input to hidden neurons are adjusted, and the output neuron has weight 1 which is applied to decision making.
In this application, the ith sample of class c has the input spike train ti i c for four features and a target spike train td i c which is obtained by td i c ¼ ti i c þ 2t 2 ln 2, with parameter τ 2 defined in Eq (2). The local connections in our network enable each feature to train an independent subnetwork with 12 synapses. Since samples of one class have similar spike trains, encoding results reveal that there are only few different target time trains for each feature. As in the previous simulations, our algorithm chooses the class with minor voltage error. Table 3 compares the classification accuracy and convergent epoch of the NSEBP with four classical neural network classifier: Multi-ReSuMe [32], Spikeprop [28], MatlabBP, SWAT [22] for the Iris dataset on the training set and testing set. The classifier of the Multi-ReSuMe and Spikeprop are conducted with three layer SNN structures proposed in [32] and [28] respectively. The SWAT employs a SNN architecture of 13 input neurons with each connects to 16 neurons in middle layer, and the squared cosine encoding method is employed [22]. The simulation results are shown in Table 3.
The comparison results indicate that our algorithm achieves comparable or even higher accuracy compared with the traditional classifiers, while our algorithm only requires 18 epochs to complete training, instead of 2.6 Á 10 6 epochs for the MatlabBP, 1000 epochs for Spikeprop, 500 epochs for the SWAT, and 174 for multi-RuSuMe. The simulation results prove that our algorithm is the most efficient one, and outperforms the compared neural networks methods significantly.

Breast Cancer Wisconsin Dataset
The two-class Breast Cancer Wisconsin (BCW) dataset is also applied to analyze our algorithm. This dataset contains 699 samples, while 16 samples are abandoned because of missing data. Each sample has nine features obtained from a digitized image of a fine needle aspirate (FNA) of a breast mass [42].
The same network structure and training settings as these in the Iris classification simulations are employed here. There are nine features instead of four, then there are 108 input neurons and 9 hidden neurons in the network structure depicted in Fig 11. Table 4 compares the accuracy and efficiency of the NSEBP against the existing algorithms for the BCW dataset. It shows that the test data accuracy of the NSEBP is comparable to that of the other approaches, while the NSEBP only requires 16 epochs for complete training, instead of 1500 epochs for the Spikeprop, 500 epochs for SWAT, and 9.2 Á 10 6 epochs for MatlabBP. Then, the training efficiency of our algorithm is improved significantly compared with these classical neural network algorithms.
Simulation results in this section demonstrate that our algorithm achieves a higher efficiency and even a higher learning accuracy than the traditional neural network methods in the classification tasks. The selective mechanism and the presynaptic spike jitter adopted in our algorithm make the efficiency of the NSEBP to be independent with the length of the spike train, and overcome the drawbacks of low efficiency in traditional back-propagation methods. With this efficiency, the training of SNNs can meet the requirement of real world applications.

Conclusion
In this paper, an efficient multi-layer supervised learning algorithm, the NSEBP, is proposed for spiking neural networks. The accurate feedforward calculation and weight modification employing the normalized PSP learning window enables our algorithm to achieve a rapid convergence. Besides, motivated by the selective attention mechanism of the primate visual system, our algorithm only focuses on the main contents in the target spike trains and ignores neuron states at the un-target ones, which makes our algorithm to achieve a significant improvement in efficiency of training one epoch. Simulation results demonstrate that our algorithm outperforms traditional learning algorithms in learning efficiency, and is not sensitive to parameters. The classification results on the UCI data sets indicate that the generalization ability of our algorithm is a little better than the traditional backpropagation method, and similar to the Spi-keProp, but lower than the SWAT. It means that our algorithm does not make great contribution to the over fitting problem. However, the traditional methods to improve the training generalization ability can also be applied to our algorithm, such as employing an optimized network structure and better decision-making method, or better sample validate methods, that will be studied in the future work.
Our algorithm is derived from the SRM 0 model, but the same derivation process is feasible to other models when the voltage u can be expressed by an equation of time t and can be transformed to a quadratic function by the substitute method. Besides, the proposed feed forward calculation method can be applied to the existing algorithms to improve their learning performance. Consequently, employing these training strategies, the SNNs can be applied efficiently to various applications with multilayer network structure and arbitrary real-valued analog inputs.
Supporting Information S1