Computational Model of Primary Visual Cortex Combining Visual Attention for Action Recognition

Humans can easily understand other people's actions through their visual systems, while computers cannot. Therefore, a new bio-inspired computational model is proposed in this paper for automatic action recognition. The model focuses on dynamic properties of neurons and neural networks in the primary visual cortex (V1), and simulates the procedure of information processing in V1, which consists of visual perception, visual attention and representation of human action. In our model, a family of three-dimensional spatial-temporal correlative Gabor filters is used to model the dynamic properties of the classical receptive field of V1 simple cells tuned to different speeds and orientations in time, for detection of spatiotemporal information from video sequences. Based on the inhibitory effect of stimuli outside the classical receptive field caused by lateral connections of spiking neuron networks in V1, we propose a surround suppressive operator to further process spatiotemporal information. A visual attention model based on perceptual grouping is integrated into our model to filter and group different regions. Moreover, in order to represent the human action, we consider a characteristic of the neural code: the mean motion map, based on analysis of spike trains generated by spiking neurons. The experimental evaluation on some publicly available action datasets and comparison with the state-of-the-art approaches demonstrate the superior performance of the proposed model.


Introduction
It is widely accepted that humans can easily recognize and understand other people's actions in complex natural scenes. This ability is attributed to hundreds or thousands of neurons in the visual cortex of the brain and to the neural networks formed by their connections, which perceive and process motion information of human action for the action recognition task. The question is how neurons and neural networks process motion information to perform this task. Researchers have carried out many neurophysiological studies and obtained important findings that address this question. For example, visual information is processed through two distinct pathways, the dorsal stream and the ventral stream, both originating from the primary visual cortex (V1). The majority of neurons in V1 are exquisitely sensitive to the orientation of a stimulus at a given position of the visual field, and their responses to a stimulus presented in the classical receptive field (RF) are often suppressed by another stimulus simultaneously presented outside the classical RF, known as "surround suppression" [1]. Based on these properties of neurons and neural mechanisms, some biophysically plausible computational models for biological motion recognition have been developed [2]. These models essentially reproduce certain properties of visual systems and make predictions for neuroscience, but there have been relatively few reports on practical applications to human action recognition.
With the remarkable advances in the understanding of human action perception in psychophysics [3], many bio-inspired approaches to human action recognition [4], [5] have been proposed. Most of them are based on the work of M. Giese and T. Poggio [2], which puts forward a biologically plausible neural model that evaluates the two visual pathways separately in biological motion recognition. These approaches are built with a feedforward architecture and model neural mechanisms in intermediate and higher visual areas of the dorsal stream, such as the middle temporal (MT) and lateral medial superior temporal (MST) areas. However, these approaches largely ignore some properties of neurons in V1, the first area of the visual cortex, such as the space-time inseparability of the classical RF of many simple cells. This omission hampers the processing of the shape information addressed in the ventral stream and the analysis of motion information involved in the dorsal stream.
Moreover, biological motion recognition can be realized in the human visual cortex with latencies of about 150 ms or even less [6], which, considering the visual pathway latencies, may only be compatible with a very specific processing architecture and mechanism. A neural computational theory supports this mechanism: pattern motion is computed in V1, where end-stopped cells could be involved in encoding pattern motion because they respond well to line terminators (or features) moving in their preferred direction and speed [7], [8]. Network models incorporating feedback mechanisms have also been proposed to support the idea that pattern motion can be computed at the V1 stage [9]. In computer vision, Kornprobst [10] demonstrated that early visual processes in V1 could be sufficient to perform human action recognition. Although in that work the computation of pattern motion is dynamic over space and time and is confined to V1 to reduce the computational load, it does not achieve better recognition performance, since many important properties of cells in V1 are not considered. Thus, bio-inspired approaches for human action recognition based on the properties of cells in V1 still need further research.
In this paper, a new bio-inspired model is proposed for real video analysis and recognition of human actions. It focuses on three parts: 1) perceiving the spatiotemporal information by modeling properties of cells in V1, such as the spatiotemporal properties of the classical receptive field (RF) and surround suppression; 2) automatically detecting and localizing the moving object (human) in the scene with visual attention built on the spatiotemporal information; and 3) encoding spike trains automatically generated by spiking neurons for action recognition.
According to the RF properties of single neurons in V1, there are three basic RF types [11]: oriented RFs, non-oriented RFs, and non-oriented large fields. In general, cells with oriented RFs are broadly modeled with filter banks to detect information in a given direction from images or videos, such as the 2D Gabor banks in [12] and the spatiotemporal filters in [13], whereas cells with non-oriented RFs are usually not modeled for this purpose, although, by most accounts, they respond optimally to moving stimuli over a restricted range of velocities. Furthermore, for a majority of cells, the spatial structure of the RF changes as a function of time and can be characterized in the space-time domain [14]. These properties facilitate the detection of spatiotemporal information in different directions and at different speeds.
In addition, neurophysiological studies have also shown that the responses of neurons in V1 are suppressed by stimuli in the region surrounding the RF [1]. This is known as surround suppression, which is a useful mechanism for contour detection by inhibition of texture [15]. A similar mechanism has been observed in the spatiotemporal domain, where the response of such a neuron is suppressed when moving stimuli are presented in the region surrounding its classical RF. The suppression is maximal when the surround stimuli move in the same direction and at the same disparity as the preferred center stimulus [8]. An important use of surround mechanisms in the spatiotemporal domain is the detection of motion discontinuities or motion boundaries.
To recognize human actions in a cluttered visual field containing multiple moving objects, we need to automatically detect and localize each of them in practical applications. Visual attention is one of the most important mechanisms of the human visual system. It can filter out redundant visual information and detect the most salient parts of the visual field. Some research works [16], [17] have shown that visual attention is extremely helpful for action recognition. Many computational models of visual attention have been proposed. For example, a neurally plausible architecture was proposed by Koch and Ullman [18]. The method is highly sensitive to spatial features such as edges, shape and color, but insensitive to motion features. Although the models proposed in [17] and [19] regard motion features as an additional conspicuity channel, they only identify the most salient location in the image sequence and have no notion of the extent of the attended object at this location. The facilitative interaction between neurons in V1, reported in numerous studies, is one of the mechanisms that group and bind visual features to organize a meaningful higher-level structure [20]. This is beneficial for detecting moving objects.
To sum up, our goal is to build a bio-inspired model for human action recognition. In our model, spatiotemporal information of human action is detected by using the properties of neurons in V1 only, without MT; moving objects are localized by simulating the visual attention mechanism based on spatiotemporal information; and actions are represented by mean firing rates of spiking neurons. The remainder of this paper is organized as follows. First, we review research in the area of action recognition. Second, we introduce the detection of spatiotemporal information with 3D Gabor spatial-temporal filters modeling the properties of V1 cells and their center-surround interactions, and detail the computational model of visual attention and the approach for human action localization. Third, a spiking neuron model is adopted to convert spatiotemporal information into spike trains, and mean motion maps are employed as feature sets to represent and classify human actions. Finally, we present the experimental results and compare them with the approaches introduced earlier.

Related Work
For human action recognition, the typical process includes feature extraction from image sequences, image representation and action classification. Based on image representation, action recognition approaches can be divided into two categories [21]: global and local. Both have achieved some success in human action recognition, yet there are still problems to be resolved. For example, the global approaches are sensitive to noise, partial occlusions and variations [22], [23], while the local ones sometimes suffer from a heavy computational burden [24], [25] when extracting a sufficient amount of relevant interest points [26]. In recent years, some approaches have combined both global and local representations to improve recognition performance [27], [28], [29]. However, they are mainly applied to specific situations. Thus, some bio-inspired approaches have emerged to perform the task of action recognition.
The work on bio-inspired action recognition based on the feedforward architecture of the visual cortex is related to several domains, including motion-based recognition and local feature detection. In the area of local feature detection, a large number of different schemes have been developed based on visual properties and feature descriptors [4], [30], [31], [32]. In [4], Jhuang proposed a feedforward architecture modeling the dorsal visual pathway, which can be seen as an extension of the model of the ventral pathway architecture [12], according to the similar organization of the ventral and dorsal pathways [33]. Jhuang mapped the cortical architecture, essentially the primary visual cortex (V1) with simple and complex cells, but never claimed any biological relevance for the corresponding subsequent processing stages (from S2 to C3) [13]. The work in [31] is conceptually similar to Jhuang's idea, but uses different window settings. Schindler and Van Gool [30] extended Jhuang's approach [4] by combining both shape and motion responses. Because a collection of independent features is obtained in the matching stage, the approach suffers from heavy computation.
Researchers have also developed a large number of different schemes based on various combinations of visual tasks and image descriptors [5], [13]. Escobar et al. [13] also used a feedforward architecture and simulated the dorsal visual pathway to create a computational model for human action recognition, called the V1-MT model, in which the analysis of motion information is done in the V1 and MT areas [33]. The model not only combines motion-sensitive responses but also considers the connections between V1 cells and MT cells found in [34], [35], which allows it to model more complex properties such as motion contrasts. The main difference from Jhuang's approach is that it is based on the theory of Casile and Giese [36], which argues that biological motion recognition can be done from a coarse spatial localization of mid-level optic flow features. In [13], [5], the visual observation of human action is encoded as a whole with spiking neural networks and is treated as a global representation. Although Escobar's approach is biologically plausible, some key problems remain to be solved. For example, which properties of the cells in V1 should be used to detect spatiotemporal information? How are human actions detected and localized? And how is the task of human action recognition performed through early visual processing in V1? We therefore aim to provide schemes that address these issues.

Visual Perception and Information Detection
The biological visual system is very complex. Physiological and psychological studies suggest four crucial properties of biological vision: fovea-periphery distinction on the retina, oculomotor, image representation and serial processing [37]. In this paper, we propose a novel bio-inspired approach for human action recognition according to these properties. Fig 1 shows the block diagram of our approach, from the input image sequence containing human action as stimulus to its final classification. It contains four steps: 1) detecting spatiotemporal information in the form of responses of simple and complex cells in V1; 2) localizing the moving object with a computational model of visual attention that integrates spatiotemporal information sensitive to speed and direction; 3) extracting features from spike trains generated by spiking neurons with the leaky integrate-and-fire model [38], [39], and encoding them for action representation; and 4) recognizing human action with a support vector machine (SVM).

Spatiotemporal Information Detection
In V1, many simple cells are selective for speed and direction (oriented cells), and their RF profiles are essentially modeled with spatiotemporal filters. However, most existing spatiotemporal filters are non-causal and hence biologically implausible [4], [31]. To address this, we build a family of spatiotemporal filters, denoted by $g_{v,\theta,\varphi}(\mathbf{x}, t)$, to model the spatiotemporal RF profiles of simple cells similar to [40]; these filters are causal and consistent with V1 cell physiology. The spatiotemporal filter is defined in Eq (1).
where $(\bar{x}, \bar{y}) = (x\cos\theta + y\sin\theta,\; -x\sin\theta + y\cos\theta)$, $\varepsilon(t)$ is the step function, and $\mathbf{x} = (x, y)$. The parameters $v$, $\theta$ and $\varphi$ respectively represent the preferred speed, the preferred direction of motion (which is also the preferred spatial orientation), and the spatial symmetry of the filter. This filter is composed of a spatial Gaussian envelope and a temporal Gaussian envelope. The spatiotemporal RF profile is tilted toward the preferred direction of motion in space-time, which gives rise to the selectivity for moving stimuli, and is qualitatively similar to the profiles determined experimentally by DeAngelis [14]. Considering the correlation between the preferred spatial scale and the preferred speed of the spatiotemporal RF profile, we use the following equation to describe the relation between the preferred spatial wavelength $\lambda$ and the preferred speed $v$: where the constant $\lambda_0$ is the spatiotemporal period of the filter and $\sigma/\lambda = 0.56$. Thus, $v$ determines the preferred wavelength and the receptive field size: the faster the filter speed $v$, the larger the receptive field. Moreover, $\tau$ in the temporal Gaussian envelope, set to a constant of 2.75 in [40], determines the temporal decay of $g_{v,\theta,\varphi}(\mathbf{x}, t)$. However, the temporal decay is dynamic and a function of the speed, which causes different temporal correlations at different preferred speeds. We therefore compute $\tau$ using the following function: A gray-scale image sequence, $I(\mathbf{x}, t)$, is first analyzed by 3D Gabor filters corresponding to the simple cells in V1. The response $r_{v,\theta,\varphi}(\mathbf{x}, t)$ to the image sequence is computed by convolution: where $|\cdot|^{+}$ is the half-wave rectification operator. From Eq (4), the response of the filter is phase sensitive. A phase-insensitive response like that of a complex cell, called Gabor energy, can be obtained by quadrature pair summation of the responses of two filters with a phase difference of $\pi/2$ as follows: In the form of Eq (5), the application to the detection of spatiotemporal information is illustrated in Fig 2 (Second Row).
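The quadrature-pair construction described above can be illustrated with a small numerical sketch. The kernel below is not the filter of Eq (1): it is an assumed, simplified causal space-time Gabor (a Gaussian envelope drifting at speed v, a cosine carrier with phase φ, and a one-sided temporal Gaussian). The function names, the default λ, the window sizes and the square-root form of the energy are all assumptions made for illustration; only σ/λ = 0.56 and τ = 2.75 are taken from the text.

```python
import numpy as np

def st_gabor(v, theta, phi, lam=4.0, tau=2.75, size=21, frames=7):
    """Assumed causal space-time Gabor kernel, shape (frames, size, size)."""
    sigma = 0.56 * lam                                   # sigma / lambda = 0.56 (from the text)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    t = np.arange(frames, dtype=float)[:, None, None]    # lags t >= 0 only: causal
    xr = x * np.cos(theta) + y * np.sin(theta)           # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    drift = xr - v * t                                   # envelope moves at speed v
    spatial = np.exp(-(drift ** 2 + yr ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * drift / lam + phi)
    temporal = np.exp(-(t ** 2) / (2 * tau ** 2))        # temporal decay
    return spatial * carrier * temporal

def half_wave(r):
    """Half-wave rectification |.|^+."""
    return np.maximum(r, 0.0)

def response(patch, v, theta, phi):
    """Rectified convolution value at the patch centre.
    patch[k] is the (square) frame k steps in the past."""
    g = st_gabor(v, theta, phi, size=patch.shape[1], frames=patch.shape[0])
    return half_wave(np.sum(g * patch[:, ::-1, ::-1]))   # flip space for convolution

def gabor_energy(patch, v, theta):
    """Combine a quadrature pair (phases 0 and pi/2); assumed sqrt-of-squares form."""
    r0 = response(patch, v, theta, 0.0)
    r1 = response(patch, v, theta, np.pi / 2)
    return np.sqrt(r0 ** 2 + r1 ** 2)

def drifting_grating(v_s, lam=4.0, size=21, frames=7):
    """Test stimulus: grating moving along +x at speed v_s; patch[k] is k frames ago."""
    half = size // 2
    x = np.arange(-half, half + 1, dtype=float)[None, None, :]
    k = np.arange(frames, dtype=float)[:, None, None]
    row = np.cos(2 * np.pi * (x + v_s * k) / lam)        # I(x, t = -k)
    return np.broadcast_to(row, (frames, size, size)).copy()

e_pref = gabor_energy(drifting_grating(1.0), v=1.0, theta=0.0)
e_null = gabor_energy(drifting_grating(-1.0), v=1.0, theta=0.0)
print(e_pref > e_null)
```

Under these assumptions, a grating drifting in the preferred direction yields a markedly larger energy than one drifting in the opposite direction, which is the direction selectivity the filter family is meant to provide.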
Besides oriented cells, V1 also contains simple cells that are insensitive to direction (non-oriented cells). Watson et al. [41] suggested a causal temporal filter for non-oriented cells, which is consistent with electrophysiological studies and psychophysical data. The speed tuning properties are also studied by considering the responses of motion energy filters to motion stimuli at different speeds without orientation selectivity. For the sake of computation, however, the response of a non-oriented cell is approximately computed with the Gabor energy in all directions: where $N_\theta$ is the number of preferred orientations. As spatiotemporal information for a specific range of speeds at each location $\mathbf{x}$, the local Gabor energy detected in Eqs (5) and (6) is often ambiguous [9]. To stabilize and disambiguate the initial spatiotemporal information, a modified detector defined by a shift $\partial\mathbf{x} = (\partial x, \partial y)$ along a specific speed between two successive frames is used to model complex cells and compute a spatiotemporal correlation. Similar to [9], unambiguous or disambiguated motion information is computed as follows:

$$\tilde{r}_{v,\theta}(\mathbf{x}, t) = r_{v,\theta}(\mathbf{x} + \partial\mathbf{x}, t - 1) \cdot r_{v,\theta}(\mathbf{x}, t) \qquad (7)$$

The resulting activities $\tilde{r}_{v,(\theta)}(\mathbf{x}, t)$ of different directions (including non-direction) at different speeds indicate unambiguous motion at corners and line endings, ambiguous motion along contrasts and no motion for homogeneous regions, as shown in Fig 2 (Third Row).
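The correlation in Eq (7) amounts to shifting the previous response map and multiplying it pointwise with the current one. The following is a minimal sketch of that operation; the helper names and the zero padding at the image border are assumptions, and how the shift is derived from the preferred speed and direction is left to the caller.

```python
import numpy as np

def shift_zero(a, dx, dy):
    """Return b with b[y, x] = a[y + dy, x + dx], and zero where out of range."""
    h, w = a.shape
    out = np.zeros_like(a)
    ys = slice(max(0, -dy), min(h, h - dy))   # destination rows
    xs = slice(max(0, -dx), min(w, w - dx))   # destination columns
    yd = slice(max(0, dy), min(h, h + dy))    # source rows
    xd = slice(max(0, dx), min(w, w + dx))    # source columns
    out[ys, xs] = a[yd, xd]
    return out

def disambiguate(r_prev, r_cur, dx, dy):
    """Eq (7): r~(x, t) = r(x + dx, t - 1) * r(x, t)."""
    return shift_zero(r_prev, dx, dy) * r_cur

prev = np.zeros((5, 5)); prev[2, 3] = 2.0     # response at t - 1
cur = np.zeros((5, 5)); cur[2, 2] = 3.0       # response at t
print(disambiguate(prev, cur, dx=1, dy=0)[2, 2])
```

The product is non-zero only where the activity at time t is matched by activity at the shifted location in the previous frame, so isolated, uncorrelated responses are removed.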
To characterize the motion in a video scene, we compute the motion energy using 3D Gabor filters with $N_v$ different speeds and $N_o$ different directions. At each speed $v$, $N_o + 1$ responses are computed: one for each of the $N_o$ directions and one non-directional response.

Center Surround Interaction
To further process motion information, center-surround interactions are used. According to the results of some anatomical studies, the surround interactions observed in V1 [1] originate from horizontal interconnections between neurons in spiking neural networks, and they are often antagonistic for the RFs of many cells in V1. The response of such a neuron is suppressed when moving stimuli are presented in the region surrounding its classical RF.
In the purely spatial domain, a model with a 2D difference of Gaussians (DoG) function is used to compute the spatial summation properties of a center-surround cell [42]. In the spatiotemporal domain, due to RF dynamics, we define the surround suppression weighting function $w^{(k_1,k_2)}_{v,\theta}$ with the half-wave-rectified difference of two concentric Gaussian envelopes: where $\|\cdot\|_1$ denotes the $L_1$ norm and $G_{v,k,\theta}(\mathbf{x}, t)$ is similar to the RF function $g_{v,\theta,\varphi}(\mathbf{x}, t)$, but without the cosine factor, decaying with time: Moreover, the non-oriented cells also show center-surround characteristics [43]. Therefore, the non-oriented term $G_{v,k}(\mathbf{x}, t)$ is similarly defined as follows: where $\sigma' = \sigma + 0.05\sigma t$. To be consistent with the surround effect, the value of the surround weighting function should be zero inside the RF, and positive outside it but dissipating with distance. Therefore, we set $k_2 = 1$ and $k_1 = k$, $k > 1$. In order to facilitate the description of oriented and non-oriented terms, we use $w^{(k)}_{v,(\theta)}(\mathbf{x}, t)$, where the factor $\alpha$ controls the strength with which surround suppression is taken into account. The proposed inhibition scheme is a subtractive linear mechanism followed by a nonlinear half-wave rectification (results shown in Fig 2 (Fourth Row)). The inhibitory gain factor $\alpha$ is unitless and represents the transformation from excitatory current to inhibitory current in the excitatory cell. The larger and denser the motion energy $\tilde{r}_{v,(\theta)}(\mathbf{x}, t)$ in the surroundings of a point $(\mathbf{x}, t)$ is, the larger the center-surround term $\tilde{r}_{v,(\theta)}(\mathbf{x}, t) * w^{(k)}_{v,(\theta)}(\mathbf{x}, t)$ is at that point. The suppression is strongest when the stimuli in the surroundings of a point have the same direction and speed of movement as the stimulus at the point concerned.
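The suppression scheme described above (a rectified difference of two concentric Gaussians normalized by its L1 norm, followed by subtractive inhibition and half-wave rectification) can be sketched in a purely spatial setting. This is a simplified illustration, not the spatiotemporal operator of the model: the isotropic 2D Gaussians, the window size, the zero padding and all function names are assumptions.

```python
import numpy as np

def gauss2d(size, sigma):
    """Normalized isotropic 2D Gaussian on a size x size grid (size odd)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    g = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return g / (2.0 * np.pi * sigma ** 2)

def surround_weight(sigma, k=4):
    """Half-wave-rectified DoG with k1 = k, k2 = 1, divided by its L1 norm."""
    size = 2 * int(3 * k * sigma) + 1
    dog = gauss2d(size, k * sigma) - gauss2d(size, sigma)
    w = np.maximum(dog, 0.0)              # zero inside the RF, positive outside
    return w / w.sum()

def conv_same(img, ker):
    """'Same'-size 2D filtering with zero padding (ker is symmetric here)."""
    kh, kw = ker.shape
    pad = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(img.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += ker[i, j] * pad[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def suppress(r, sigma=1.0, k=4, alpha=1.0):
    """Subtractive surround inhibition followed by half-wave rectification."""
    w = surround_weight(sigma, k)
    return np.maximum(r - alpha * conv_same(r, w), 0.0)

texture = np.ones((41, 41))                       # dense, uniform motion energy
point = np.zeros((41, 41)); point[20, 20] = 1.0   # isolated response
print(suppress(texture)[20, 20], suppress(point)[20, 20])
```

In this sketch a response embedded in dense, similar activity is suppressed almost completely, while an isolated response passes unchanged, which matches the qualitative behavior described for the operator.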

Attention Model and Object Localization
Visual attention can enhance object localization and identification in a cluttered environment by giving more attention to salient locations and less attention to unimportant regions. Itti and Koch proposed a computational attention model that efficiently computes a saliency map from a given picture [44], based on the work of Koch and Ullman [18]. Although some models [17], [19] try to introduce motion features into Itti's model for moving object detection, these models have no notion of the extent of the salient moving object region. Therefore, we propose a novel attention model to localize the moving objects. In the proposed model, visual perception is implemented by the spatiotemporal information detection described in the section above. Because we only consider gray-scale video sequences, visual information is divided into two classes, intensity information and orientation information, each of which is processed in both the time (motion) and space domains, forming four processing channels. Each type of information is calculated with a similar method in the corresponding temporal and spatial channels, but spatial features are computed from the information perceived at low preferred speeds of no more than 1 ppF. The conspicuity maps can be re-used to obtain the moving object mask instead of only using the saliency map.

Perceptual Grouping
In general, the perceived visual information is scattered in space (as shown in Fig 2). To organize a meaningful higher-level object structure, we refer to the human visual ability to group and bind visual information by perceptual grouping. Perceptual grouping involves numerous mechanisms. Some computational models of perceptual grouping are based on the Gestalt principles of colinearity and proximity [45]. Others are based on the surround interaction of horizontal interconnections between neurons [46], [47].
Besides the antagonistic surround described in the section above, neurons with facilitative surround structures have also been found [1]; they show an increased response when motion is presented in their surround. This facilitative interaction is usually simulated using a butterfly filter [46]. In order to make the best use of the dynamic properties of neurons in V1 and to simplify the computational architecture, we still use the surround weighting function $w^{(k)}_{v,(\theta)}(\mathbf{x}, t)$ defined in Eq (9) to compute the facilitative weight, but the value of $\theta$ is replaced by $\theta + \pi/2$. For each location $(\mathbf{x}, t)$ in the oriented and non-oriented subbands $\{v,(\theta)\}$, the facilitative weight is computed as follows: where $n$ is the control factor for the size of the surrounding area. Evidence from neuroscience shows that the spatial interactions depend crucially on contrast, thereby allowing the visual system to register motion information efficiently and adaptively [48]. That is to say, the interactions differ for low- and high-contrast stimuli: facilitation mainly happens at low contrast and suppression occurs at high contrast [49]. They also exhibit contrast-dependent size tuning, with lower contrasts yielding larger sizes [50]. Therefore, the spatial surrounding area determined by $n$ in Eq (13) depends dynamically on the contrast of the stimuli. In a certain sense, $R^{(k)}_{v,(\theta)}$ represents the contrast of the motion stimuli in the video sequence. Therefore, according to neurophysiological data [48], $n$ is a function of $R^{(k)}_{v,(\theta)}$, defined as follows: where $z$ is a constant not greater than 2 and $R^{(n)}_{v,(\theta)}(\mathbf{x}, t)$ is normalized. The function $n(\mathbf{x}, t)$ is plotted in Fig 5. For the sake of computation and performance, we set $z = 1.6$ according to Fig 5 and round $n(\mathbf{x}, t)$ down, $n = \lfloor n(\mathbf{x}, t) \rfloor$.
Similar to [46] where $(\cdot)$ is $\theta$ for the oriented subband and $v$ for the non-oriented subband.

Saliency Map Building
To integrate all spatiotemporal information, similar to Itti's model [44], we calculate a set of intensity (non-oriented) feature maps $F_v(\mathbf{x}, t)$ for each feature dimension as follows: where we set $k \in \{2, 3, 4\}$ in the term $O^{(k)}_v(\mathbf{x}, t)$, and $\oplus$ denotes point-by-point addition across scales.
Another set of orientation feature maps is computed by a similar method as follows: Each set of computed feature maps is divided into two classes according to speed. One class includes the spatial feature maps obtained at speeds of no more than 1 ppF, and the other contains the motion feature maps. To guide the selection of attended locations, different feature maps need to be combined. The feature maps are combined into four conspicuity maps: spatial orientation $F_o$ and intensity $F$; motion orientation $M_o$ and intensity $M$: Because the modalities of these four separate maps contribute independently to the saliency map, we need to integrate them. Because of their different dynamic ranges and extraction mechanisms, a map normalization operator, $N(\cdot)$, is applied globally to promote maps. The four conspicuity maps are then normalized and summed into the saliency map (SM) $S$:
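The final combination step can be sketched as follows. The exact definition of $N(\cdot)$ is not given in the text, so the operator below is an assumed simplification in the spirit of Itti's normalization: each map is rescaled to [0, 1] and weighted by the squared difference between its maximum and its mean, so that maps with a few strong peaks are promoted over maps with uniformly high activity. The function names and the unweighted sum are likewise assumptions.

```python
import numpy as np

def normalize_map(m):
    """Assumed N(.): rescale to [0, 1], then weight by (max - mean)^2."""
    m = m - m.min()
    peak = m.max()
    if peak == 0:
        return np.zeros(m.shape, dtype=float)   # flat map contributes nothing
    m = m / peak
    return m * (1.0 - m.mean()) ** 2            # after rescaling, max is 1

def saliency(F_o, F, M_o, M):
    """Sum of the four normalized conspicuity maps."""
    return sum(normalize_map(c) for c in (F_o, F, M_o, M))

single = np.zeros((10, 10)); single[5, 5] = 1.0   # one isolated peak
busy = np.ones((10, 10)); busy[0, 0] = 0.0        # high activity almost everywhere
flat = np.zeros((10, 10))
S = saliency(single, busy, flat, flat)
print(np.unravel_index(S.argmax(), S.shape))
```

With this choice the isolated peak dominates the saliency map even though the second map has far more total activity.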

Salient Object Extraction
Although the saliency map $S$ defines the most salient location in the image, to which the attentional focus should be directed at any given time, it does not give the regions of the suspicious objects. Thus, some methods with an adaptive threshold [51] have been proposed to obtain a binary mask (BM) of the suspicious objects from the saliency map. However, these methods are only suitable for simple still images, not for complex video. Therefore, we propose a sampling method to enhance the BM. Let a window $W$ slide over the saliency map, and sum the values of all pixels in the window as the 'salient degree' of the window, defined as follows: where $S(\mathbf{x}, t)$ represents the saliency value of the pixel at position $\mathbf{x}$. The size of $W$ is determined by the RF size in our experiments. Consequently, we obtain $r$ salient degree values $S_{W_i}$, $i = 1, \dots, r$. Similar to [51], the adaptive threshold (Th) is taken as the mean value of the salient degrees: where $h(i)$ is the histogram of salient degree values and $k$ is a constant. Once the salient degree $S_{W_i}$ is greater than Th, the corresponding region is regarded as a region of interest (ROI). Finally, a morphological operation is used to obtain the BM of the objects of interest, $BM_1 = \{R_{1,1}, \dots, R_{1,q_1}\}$, where $q_1$ is the number of ROIs. Because the motion of the objects of interest is often non-rigid, each region in $BM_1$ may not contain the complete structural shape of the object. To remedy this, we reuse the conspicuity spatial intensity map to obtain a more complete BM. The same operations are performed on the conspicuity spatial intensity map ($S_1 = N(F_o) + N(F)$) to obtain a BM that includes the structural shapes of the objects, $BM_2 = \{R_{2,1}, \dots, R_{2,q_2}\}$.
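The window-based sampling can be sketched as below. The stride of the sliding window, the threshold form Th = k × mean (standing in for the histogram-based expression) and the function names are assumptions; the morphological post-processing is omitted.

```python
import numpy as np

def salient_degrees(S, win, stride):
    """Sum of saliency inside each window position, returned as a 2D grid."""
    h, w = S.shape
    rows = range(0, h - win + 1, stride)
    cols = range(0, w - win + 1, stride)
    return np.array([[S[r:r + win, c:c + win].sum() for c in cols] for r in rows])

def roi_mask(S, win=4, stride=4, k=1.0):
    """Binary mask of windows whose salient degree exceeds the adaptive threshold."""
    deg = salient_degrees(S, win, stride)
    th = k * deg.mean()                       # assumed form of the threshold
    mask = np.zeros(S.shape, dtype=bool)
    for i, j in zip(*np.nonzero(deg > th)):
        mask[i * stride:i * stride + win, j * stride:j * stride + win] = True
    return mask

S = np.zeros((16, 16))
S[4:8, 8:12] = 1.0                            # one salient block
mask = roi_mask(S)
print(mask.sum())
```

Because the decision is made on window sums rather than single pixels, small isolated saliency values are less likely to produce spurious regions.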
Then, the BM of moving objects, $BM_3 = \{R_{3,1}, \dots, R_{3,q_3}\}$, is obtained by the interaction between $BM_1$ and $BM_2$ as follows: To further refine the BM of moving objects, the conspicuity motion intensity map ($S_2 = N(M_o) + N(M)$) is reused and processed with the same operations to reduce the regions of still objects. Denote the BM obtained from the conspicuity motion intensity map as $BM_4 = \{R_{4,1}, \dots, R_{4,q_4}\}$. The final BM of moving objects, $BM = \{R_1, \dots, R_q\}$, is obtained by the interaction between $BM_3$ and $BM_4$ as follows: Fig 6 shows an example of moving object detection based on our proposed visual attention model. Fig 7 shows different results detected from the sequences with our attention model in different conditions. Although moving objects can be directly detected from the saliency map into a BM as shown in Fig 7(b), parts of still objects with high contrast are also obtained, and only parts of some moving objects are included in the BM. If the spatial and motion intensity conspicuity maps are reused in our model, the complete structure of the moving objects can be obtained and the regions of still objects are removed, as shown in Fig 7(e).

Spiking Neuron Network and Action Recognition
In the visual system, perceptual information also requires serial processing for visual tasks [37]. The rest of the proposed model is arranged into two main phases: (1) Spiking layer, which transforms the detected spatiotemporal information into spike trains through spiking neuron

Neuron Distribution
Visual attention enables a salient object to be processed within a limited area of the visual field, called the "field of attention" (FA) [52]. Therefore, the salient object, as a motion stimulus, is first mapped into the central region of the retina, called the fovea, and then mapped into the visual cortex in several steps along the visual pathway. Though the distribution of receptor cells on the retina is like a Gaussian function with a small variance around the optical axis [53], the fovea has the highest acuity and cell density. We therefore assume that the distribution of receptor cells in the fovea is uniform. Accordingly, the distribution of the V1 cells in the FA bounded area is also uniform, as shown in Fig 8. A black spot in the distribution map represents a single spiking neuron and the colored circle indicates its classical RF (CRF).
Due to the non-rigid motion and scale change of the salient object in the sequence, the size and center of the FA change with its BM. We consider the FA area as a square with sides of length $L$ and central position $\mathbf{x}_c$. The length $L$ is defined as follows: where $l_x$ and $l_y$ are the width and height of the BM bounded area, respectively. $\Delta L$ is the additional spatial extent, which is set to $n_1$ times a constant $r$, thus ensuring that the BM is completely embedded in the FA, as shown in Fig 8. In general, due to the continuous movement of the salient object in the sequence, $L(t)$ is a time-varying function. To avoid frequent changes, $L(t)$ is constrained as follows: where $t$ is the present time and $t_0$ is the last time when $L(t)$ was updated. $n_2$ is a constant factor, constrained by $n_2 < n_1$.
On the other hand, visual attention is able to track the salient object in motion and to keep it in the foveal region, which is known as smooth pursuit [17]. This makes the FA center position $\mathbf{x}_c$ almost identical to the BM geometric center $\mathbf{x}_b$. Similar to the method above, $\mathbf{x}_c$ can be determined by $\mathbf{x}_b$ as follows: where $n_3$ is another constant factor. The constraint $n_2 + n_3 < n_1$ ensures that the BM lies within the FA bounded area. In this paper, $n_1$, $n_2$ and $n_3$ are set to 7, 2 and 2, respectively. Finally, the original video streams are resized and centered to produce sequences of 120 × 120 pixels according to the FA bounded areas. The spatiotemporal information falling in the FA is further processed by V1 cells. We consider $N_v$ layers of organized V1 cells, each of which is built from V1 cells with the same spatial-temporal tuning properties. The RF of the V1 cell at the physical position $\mathbf{x}_i$ is defined by its spatial-temporal tuning properties. Each layer consists of $N_o + 1$ sub-layers, with $N_o$ different orientations and one non-orientation. At the physical position where the RF of the cells is centered, one column is formed in each layer, which has as many elements as the $N_o + 1$ orientations defined. Therefore, across all layers, there are $N_v \times (N_o + 1)$ cells along the $N_v$ layers at $\mathbf{x}_i$.

Spiking Neuron Model
A typical neuron is synaptically linked with hundreds of thousands of others. To capture functional properties and realistic dynamic behaviors, a spiking neuron is usually described by a computational model chosen for its biological plausibility and computational efficiency. Accordingly, many models have been proposed in the literature to simulate such neurons [54].
In this paper, we use the conductance-driven integrate-and-fire neuron model (IF model) [38] to simulate spiking neurons. The formula is as follows: where G_i^E(t) is the normalized excitatory conductance directly associated with the pre-synaptic neurons connected to neuron i, and G_i^I(t) is the normalized inhibitory conductance; the conductance g_L is the passive leak of the cell membrane; and I_i(t) is an external input current. When the normalized membrane potential u_i(t) ≥ u_0, spiking neuron i emits a spike and its voltage is reset to the resting potential. Because some properties of the cells in V1 are already used to detect spatiotemporal information, the first and second terms in Eq (29), corresponding to G_i^I(t) and G_i^E(t), are treated as internal currents and integrated into I_i(t) here. Eq (29) is then rewritten as Eq (30). The typical value of V_L is -70 mV.
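A minimal sketch can make the spiking mechanism concrete. Because the rewritten equation is not reproduced in this text, the sketch assumes the common leaky integrate-and-fire form du/dt = g_L(V_L - u) + I(t), integrated with a forward Euler step. Only V_L = -70 mV comes from the text; the threshold, leak, step size and input values are illustrative assumptions, and `simulate_lif` is a hypothetical name.

```python
def simulate_lif(current, dt=1.0, g_l=0.1, v_l=-70.0, u_0=-50.0):
    """Forward-Euler simulation of an assumed leaky IF neuron.

    current: list of input current samples I(t), one per time step.
    Returns the list of step indices at which the neuron spikes.
    """
    u = v_l                      # start at the resting potential
    spikes = []
    for step, i_t in enumerate(current):
        u += dt * (g_l * (v_l - u) + i_t)   # assumed leaky IF dynamics
        if u >= u_0:             # threshold crossing: emit a spike
            spikes.append(step)
            u = v_l              # reset to the resting potential
    return spikes


weak = simulate_lif([1.0] * 100)    # settles near -60 mV, below threshold
strong = simulate_lif([5.0] * 100)  # drives the neuron past threshold
```

With these illustrative values the weak input never reaches the threshold, while the strong input produces a regular spike train, so the input current alone determines the firing pattern.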

Neuron's Input
The objective of the spiking neuron model described above is to transform the analog response of the V1 cell defined in Eq (12) into a spiking response that characterizes the activity of a neuron. From Eq (30), the activity of a neuron is determined by the external input current I_i(t) of the spiking neuron and the membrane potential threshold. First, consider the input of a spiking neuron i in V1 whose center is located at x_i. Its external input current I_i(t) is associated with the analog response of the V1 cell defined in Eq (12). However, the cell is activated only within the range of its classical RF, so a computational operator over the RF in a sub-layer (e.g. same preferred motion direction and speed) is needed [55]. Thus, the input current I_i(t) of the ith neuron is modeled in Eq (31) as follows: where K_exc is an amplification factor, R_{v,(θ)}(x, t) is the V1 cell response defined in Eq (12) with k = 4, and max_i is a local maximum operator [56].
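The role of the local maximum operator can be illustrated with a short sketch. It assumes that the input current is K_exc times the largest V1 response inside a square patch around the RF center; the exact form of Eq (31) is not reproduced in this text, and the patch shape, the `radius` parameter and the function name are hypothetical.

```python
def input_current(response, center, radius, k_exc=1.0):
    """Assumed input current: K_exc times the local maximum of the
    V1 response map inside a square RF patch around `center`.

    response: 2D list (rows of floats) holding the response at one time t.
    center: (row, col) of the neuron's RF center.
    radius: half-width of the patch in pixels (hypothetical parameter).
    """
    rows, cols = len(response), len(response[0])
    r0, c0 = center
    patch = [
        response[r][c]
        for r in range(max(0, r0 - radius), min(rows, r0 + radius + 1))
        for c in range(max(0, c0 - radius), min(cols, c0 + radius + 1))
    ]
    return k_exc * max(patch)


resp = [
    [0.0, 0.1, 0.0, 0.0],
    [0.2, 0.9, 0.1, 0.0],
    [0.0, 0.3, 0.0, 0.7],
]
i_center = input_current(resp, center=(1, 1), radius=1, k_exc=2.0)
i_corner = input_current(resp, center=(0, 3), radius=1, k_exc=2.0)
```

The patch is clipped at the map border, so neurons near the edge of the FA pool over fewer positions.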

Spike Train Analysis for Action Recognition
As described above, every spiking neuron in V1 generates a series of spikes over time in response to the stimuli of a human action, called the spike train η_i(t). To recognize human action, we only need to analyze the activity of the spiking networks built from spiking neurons in V1, so that features representing the action can be extracted from the spike trains. A spike train consists of discrete events in time and can be described by the succession of emission times of a spiking neuron in V1 as η_i(t) = {…, t_i^n, …}, where t_i^n corresponds to the nth spike of the neuron with index i.
Since our main purpose is action recognition based on the proposed framework rather than spike-based coding strategies, methods based on high-level statistics of spike trains [57] are not considered in this paper. As in [13], we use the mean firing rate over time, which is one of the most general and effective methods.
For a spiking neuron, the mean firing rate over time is computed as the average number of spikes inside a temporal window, defined in Eq (32) as: where η_i(t − Δt, t) counts the number of spikes emitted by neuron i inside the glide time window Δt. Fig 9 displays the spike train of a neuron and its mean firing rate map, where Δt = 7. As Eq (32) and Fig 9 show, the estimate of the mean firing rate depends on the size of the glide time window. A wider window Δt suppresses individual spikes generated by noise stimuli and yields a smoother mean firing rate curve, but it also degrades the temporal resolution. A smaller window highlights the instantaneous firing rate, but it also emphasizes the uncertainty of the spike train in response to a dynamic stimulus. We therefore select a suitable size of the glide time window for measuring the mean firing rate according to the given vision application.
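The windowed rate estimate can be sketched as follows, assuming the rate is the number of spikes in the half-open interval (t − Δt, t] divided by Δt. The interval convention, the spike times and the function name are illustrative assumptions.

```python
def mean_firing_rate(spike_times, t, window):
    """Spikes emitted in (t - window, t], divided by the window length."""
    count = sum(1 for s in spike_times if t - window < s <= t)
    return count / window


spike_times = [2, 3, 5, 9, 10, 11, 12, 20]   # illustrative spike train
narrow = [mean_firing_rate(spike_times, t, 3) for t in range(1, 22)]
wide = [mean_firing_rate(spike_times, t, 7) for t in range(1, 22)]
```

On this toy spike train the narrow window peaks at the short burst around t = 12, while the wide window spreads the same burst over more frames and yields a lower, smoother curve, which is the trade-off discussed above.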
Another problem for rate coding stems from the fact that the firing rate distribution of real neurons is not flat but heavily skewed towards low firing rates. To effectively express the activity of a spiking neuron i in response to the stimuli of a human action as the action unfolds, a cumulative mean firing rate T_i(t, Δt) is defined as follows: where t_max is the length of the encoded subsequences.
Notably, the cumulative mean firing rate of an individual neuron is of limited use for coding an action pattern. To represent a human action, the activities of all spiking neurons in the FA should be regarded as a whole rather than considered independently. Correspondingly, we define the mean motion map M_{v,(θ)} at a preferred speed and orientation, corresponding to the input stimulus I(x, t), by where N_c is the number of V1 cells per sub-layer. Because the mean motion map includes the mean activities of all spiking neurons in the FA excited by the stimuli of a human action, and thus represents the action process, we call it the action code. Because each layer has N_o + 1 orientations (including non-orientation), N_o + 1 mean motion maps are built per layer. We therefore use all mean motion maps as feature vectors to encode human action. The feature vectors can be defined as: where N_v is the number of different speed layers. Using the V1 model, the feature vector H_I extracted from the video sequence I(x, t) is then input into a classifier for action recognition.
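How the mean motion maps combine into one feature vector can be sketched as a plain concatenation. The per-neuron cumulative mean firing rates are assumed to be precomputed, and the ordering of the maps, the dictionary layout and the function name are assumptions. The sizes used (two speeds, five sub-layers, 1600 cells per sub-layer) are the values reported in the Parameter setting and Experimental Results sections.

```python
def build_feature_vector(motion_maps, speeds, orientations):
    """Concatenate mean motion maps into one feature vector.

    motion_maps: dict mapping (speed, orientation) to a list of N_c
    per-neuron cumulative mean firing rates (assumed precomputed).
    """
    feature = []
    for v in speeds:                 # N_v speed layers
        for theta in orientations:   # N_o + 1 sub-layers per layer
            feature.extend(motion_maps[(v, theta)])
    return feature


speeds = [1, 3]                          # ppF
orientations = [0, 45, 90, 135, "none"]  # four orientations + non-orientation
n_c = 1600                               # cells per sub-layer
maps = {(v, th): [0.0] * n_c for v in speeds for th in orientations}
h_i = build_feature_vector(maps, speeds, orientations)
print(len(h_i))  # N_v * (N_o + 1) * N_c = 16000
```

The resulting vector has N_v × (N_o + 1) × N_c entries, so the feature dimension grows linearly with the number of preferred speeds.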
Classification is the final step in action recognition. A classifier is a mathematical model used to assign actions to classes, and its selection directly affects the recognition results. In this paper, we use a supervised learning method, the support vector machine (SVM), to recognize actions in the data sets.
The KTH data set consists of 150 video sequences with 25 subjects performing six types of single-person actions: walking, jogging, running, boxing, hand waving (handwave) and hand clapping (handclap). These actions are performed several times by the twenty-five subjects in four different conditions: outdoors (s1), outdoors with scale variation (s2), outdoors with different clothes (s3) and indoors with lighting variation (s4). The sequences are down-sampled to a spatial resolution of 160 × 120 pixels.
The UCF Sports data set includes diving, golf swinging, kicking, lifting, horseback riding, running, skating, swinging a baseball bat, and pole vaulting. The dataset contains over 200 video sequences at a resolution of 720 × 480 pixels. The collection represents a natural pool of actions featured in a wide range of scenes and viewpoints.

Parameter setting
Our proposed model is constructed with N_v layers of preferred speeds, and each layer is composed of five sub-layers corresponding to five orientations (0°, 45°, 90°, 135°, and a non-orientation). Because the preferred speeds at which the model runs affect the spatial-temporal frequency and the computing load, their number and values are determined from experimental results. The parameter settings are listed in Table 1. The model has a total of 5N_v sub-layers, formed by 5 orientations (including a non-orientation) and N_v different spatial-temporal tunings. Each sub-layer contains 1600 cells distributed over the whole FA. Note that the FAs generated by our attention model are resized and centered in 120 × 120 pixels, forming new FA sequences. The sizes of the receptive field patch and the surrounding area are 2σ and 8σ, respectively.
To compare the performance with other methods, we conduct experiments on all three datasets under the following three experimental setups:
• Setup 1: one sequence of a subject is selected as the testing data while the sequences of the other subjects are used as the training data, i.e. leave-one-out cross validation similar to [31].
• Setup 2: the sequences of more than one subject are used for testing and the others for training, as in [13] and [5]. We select 6 random subjects as the training set and the remaining 3 subjects as the testing set for the Weizmann dataset, and 16 subjects randomly drawn from the KTH dataset for training with the remaining 9 subjects for testing. We run all possible training sets (84) for Weizmann and 100 trials for KTH.
• Setup 3: similar to Setup 2, but with only five random trials, following the experimental protocol described in Jhuang et al. [4].
Each setup examines the ability of the proposed approach to recognize human actions in videos. The reported performance is the average over all trials. Note that this is done separately for each scene (s1, s2, s3, or s4) in the KTH dataset.
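The subject splits of Setup 2 can be enumerated with a short sketch. Choosing 6 training subjects out of the 9 Weizmann subjects gives C(9, 6) = 84 possible training sets, which matches the count quoted above. The subject numbering, the random seed and the variable names are assumptions made for illustration.

```python
from itertools import combinations
from math import comb
import random

# Weizmann: every way of choosing 6 training subjects out of 9.
weizmann_subjects = list(range(1, 10))
splits = [
    (train, tuple(s for s in weizmann_subjects if s not in train))
    for train in combinations(weizmann_subjects, 6)
]
print(len(splits), comb(9, 6))   # both equal 84

# KTH: 100 random 16/9 subject splits (seed fixed here for repeatability).
rng = random.Random(0)
kth_subjects = list(range(1, 26))
kth_trials = []
for _ in range(100):
    train = sorted(rng.sample(kth_subjects, 16))
    test = [s for s in kth_subjects if s not in train]
    kth_trials.append((train, test))
```

Splitting by subject rather than by sequence keeps every sequence of a test subject out of training, so the reported rates measure generalization to unseen people.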

Experimental Results
Extensive experiments have been carried out to verify the effectiveness of the proposed approach. The details of the experiments and the results are described below.
Frame length. First, to examine the impact of the frame length t_max of the selected subsequence on the recognition results, we apply the SVM classifier to assess the proposed model on subsequences randomly selected from all original videos of the Weizmann and KTH datasets. All tests are performed at five different speeds v (1, 2, 3, 4 and 5 ppF), with the size of the glide time window Δt = 3. The classification results with different parameter sets are shown in Fig 11, which indicates that: (1) the average recognition rates (ARRs) increase as the subsequence length t_max increases from 20 to 100; (2) the ARR on each test dataset differs across preferred speeds; (3) at each preferred speed, the ARRs differ across test datasets.
How long a subsequence is suitable for action recognition? We analyze the test results on the Weizmann dataset. Fig 11 clearly shows that the ARR increases rapidly with the frame length of the selected subsequence at the beginning. For example, the ARR on the Weizmann dataset is only 94.26% with a frame length of 20 at preferred speed v = 2 ppF, whereas it rises rapidly to 98.27% at a frame length of 40 and then remains relatively stable for lengths above 40. To better understand this phenomenon, we estimate the confusion matrices for the 81 sequences from the Weizmann dataset (see Fig 12). From a qualitative comparison of the recognition performance at frame lengths of 20 and 60, we find that the ARRs of actions are related to their characteristics, such as the average cycle (the frame length of a whole action) and its deviation (see Table 2). The ARRs of all actions improve significantly when the frame length is 60, as illustrated in Fig 12. The main reason is that the average cycle of every action is no longer than 60 frames. Naturally, the larger the frame length, the more information is encoded, which helps action recognition. Moreover, the improvement is relatively significant for actions with small deviations relative to their average cycles.
The same test is performed on the KTH dataset, and the experimental results under the four different conditions are shown in Fig 11(b)-11(e). The same conclusion can be drawn: ARRs increase with the frame length and remain relatively stable for lengths above 60 frames. This is evident in the overall ARRs under all conditions at different speeds shown in Fig 11(f). Because the computational load grows with the frame length, as a compromise the maximum frame length of the subsequences selected from the original videos is set to 60 frames for all following experiments.
Size of glide time window. Second, to evaluate the influence of the size of the glide time window Δt in Eq (33) on the recognition results, we perform the same test on the Weizmann and KTH datasets (s2, s3 and s4). The maximum frame length is 60 for all subsequences randomly selected from the original videos for training and testing, and an SVM with a Gaussian kernel is used as the classifier to discriminate each action class from the others. Fig 13 shows the experimental results for different sizes of the glide time window at different preferred speeds. The ARRs at different speeds on each dataset (including each condition) vary with the size of the glide time window. Considering the performance at all speeds used in the test, we find that the optimal window size is 3 in most cases. This also indicates that features computed with different sizes of the glide time window affect the recognition performance. The mean motion maps are easily disturbed by undesired stimuli when the window size is small, whereas the distinctiveness of the feature vectors among human actions is degraded when the window size is large. According to the average ARRs at all speeds from the results shown in Fig 13, the size of the glide time window is set to 3.
Number of the preferred speeds and their values. The experimental results shown in Figs 11 and 13 exhibit distinct recognition performance at different speeds. For example, the highest ARR on the KTH dataset (s2) is obtained at the preferred speed v = 3 ppF (Δt = 3), whereas the actions in the KTH dataset (s3) are classified more accurately at the preferred speed v = 2 ppF. Because different human actions proceed at different speeds, and the same action at different scales also appears at different speeds, the number of preferred speeds and their values used to compute action features greatly affect the recognition results.
However, detecting features at all possible speeds to evaluate the influence of preferred speeds on human action recognition is infeasible because of the huge computational cost. Moreover, choosing only one preferred speed for action recognition is not reasonable because of the complexity of actions. To obtain more accurate recognition, we need to evaluate how many and which preferred speeds should be introduced into our model to extract motion features for human action recognition in general videos. It is known that most real-world video sequences have a center-biased motion vector distribution. More than 70 to 80% of the motion vectors can be regarded as quasi-stationary, and most of the motion vectors are enclosed in the central 5 × 5 area [58]. Therefore, we opt to evaluate the performance of our model with combinations of different speeds whose values are no more than 5. For simple computation, the It is clearly seen that the different combinations in our model have an important effect on the accuracy of action recognition. For example, the recognition performance with the combination of two speeds 1+3 ppF is the best on all datasets except the KTH (s3) dataset, and is better than that of most combinations on the KTH (s3) dataset. The average recognition rate of this combination over all datasets reaches 95.16% and is the best. In view of the computational cost and the overall performance of our model on all datasets, action recognition is performed with the combination of two speeds (1 and 3 ppF) in all experiments. Table 3. Results show that our model significantly outperforms the model with the traditional 2D Gabor, especially on datasets including complex scenes, such as KTH s2 and s3.
Surround inhibition. To validate the effect of surround inhibition on our model, we use r_{v,(θ)}(x, t) in Eqs (7) and (8) as the input of the integrate-and-fire model in Eq (29), replacing R_{v,(θ)}(x, t) in Eq (31). For each training and testing set, the experiment is performed twice: once considering only the activation of the classical RF, and once considering the activation of the RF with the proposed surround inhibition. We construct a histogram of the ARRs obtained by our approach in the two cases (Fig 15). As Fig 15 shows, the ARRs with surround inhibition are much higher than those without it on the Weizmann and KTH datasets. Moreover, the ARRs without surround inhibition vary strongly, and the recognition performance depends heavily on the sequences used to construct the training set, whereas the values with surround inhibition are relatively stable.
Field of attention and center localization. The attention computational model described in the preceding section is introduced into our action recognition model. The binary mask (BM) of an action object is obtained to determine the center position and size of the FA based on our attention model. There are many ways to evaluate the performance of an attention model, in terms of correct detections, detection failures, matching area, and so on. In our case, the aim is not to emphasize the performance of action object detection itself, but its effect on action recognition performance. From another perspective, the ARRs reflect the performance of moving object detection to a certain extent.
Inaccurate detection of the action object leads to an inaccurate FA size and position, so that recognition performance decreases. For example, an overly large FA causes useless features to be encoded by neurons in V1. To evaluate the performance of our attention model and verify the effect of center localization on action recognition, we run exhaustive experiments under different conditions: BM obtained by manual or automatic methods, and FA size with a fixed value or an adaptive value determined by the binary mask of the action object. All experiments on the Weizmann and KTH datasets are performed four times. The experimental results are shown in Table 4.
These results clearly show that the recognition rates under manual BM are higher than those under automatic BM, and that the recognition rates with adaptive FA size are higher than those with fixed FA size. However, the recognition performance under automatic BM is close to that under manual BM on all datasets except KTH s3. Even though the bags and clothes of the action object in KTH s3 directly affect the detection of moving objects, resulting in lower action recognition performance, the recognition rate is still acceptable. This indicates that our attention model is effective.
Moreover, Table 4 also shows that the recognition rate on KTH s2 with adaptive FA size is much higher than that with fixed FA size. The main reason is that the proposed method, by automatically adjusting the FA size, accommodates the scale variation of the action object: the size of the action objects in KTH s2 changes greatly because of the zoom shots. This indicates that our model is robust.

Comparisons with Different Approaches
Comparison I-With Bio-inspired Approaches. The purpose of this comparison is to find which of the proposed bio-inspired approaches is more effective. It is more meaningful and fair to compare different approaches on the same dataset. Tables 5 and 6 show the performance comparisons of some bio-inspired approaches on the Weizmann and KTH datasets, respectively. On the Weizmann dataset, the best recognition rate of Escobar's approach [13], which uses the nearest Euclidean distance measure of the synchrony motion map with the triangular discrimination method, is 92.81% under Setup 2, while the best performance of Jhuang's approach [4] reaches 97.00% using SVM under Setup 3. However, we can draw further conclusions from Table 5. First, regardless of the approach, sparse features are beneficial to performance. Note that the effective sparse information is obtained by center-surround interaction. Second, comprehensive and reasonable configurations of center-surround interaction can enhance action recognition performance. For example, the approach in [5], which uses both isotropic and anisotropic surrounds, achieves more accurate recognition than the model in [59] without them. Finally, our approach obtains the highest recognition performance under the different experimental setups even though only isotropic surround interaction is adopted.
Table 6 also shows that the recognition performance of the proposed approach on the KTH dataset is superior to that of the others in the different experimental setups. The same conclusion holds for each of the four conditions in the KTH dataset. Moreover, our approach simulates only the processing procedure in V1, without MT, and uses fewer neurons than Escobar's model. The architecture of the proposed approach is simpler than those of Escobar's and Jhuang's. As a result, our model is easy to implement.
Comparison II-Compendium of Results Reported. Because of the lack of a common dataset and a standardized evaluation methodology, the development of action recognition algorithms has clearly been limited, even though a large number of papers report good recognition results on individual datasets containing various human actions. Because such quantitative comparison is genuinely difficult, comparisons among different approaches are seldom made across datasets. Here, to ensure consistency and comparability, we simply list some representative studies on the same datasets, with their approximate accuracies, in Table 7. To some extent, these approaches reflect the latest and best work in human motion or action recognition.
In Table 7, we report the experimental results on the KTH dataset. Our experimental setting is consistent with the respective settings in [4], [5], [31], [29], [60], and we train and test the proposed method with Setup 1 and Setup 3 on the entire dataset. The experimental results of our approach under Setup 2 are also provided. Table 7 shows that the performance of the proposed approach is comparable to that of the others in terms of recognition rate. Moreover, the recognition rates of our approach are relatively stable across setups on the same data set, differing by no more than 0.5%. Fig 16 shows the confusion matrices of the classification on the KTH dataset using our approach. Each column of the confusion matrix represents the instances to be classified, while each row represents the corresponding classification results. The main confusion occurs between jogging and running in the four scenarios. Distinguishing jogging from running is challenging because the two actions as performed by some subjects are very similar. We also use two cross-validation strategies, under Setup 1 and Setup 3, for the UCF Sports dataset used in computer vision. Again, our performance, shown in Table 8, reaches 90.82% accuracy, which is better than other contemporary approaches except Wu's method, which achieves at best 91.3%. These results demonstrate that our approach is a notable new representation of human action in video and is capable of robust action recognition in realistic scenarios.

Discussion and Conclusions
In this paper we propose a bio-inspired model to extract spatiotemporal features from videos for human action recognition. Our model simulates the visual information processing mechanisms of spiking neurons, and of the spiking neural networks composed of them, in the V1 cortical area. The core of our model is the detection and processing of spatiotemporal information, inspired by the perceiving and processing of visual information in V1. The dynamic properties of V1 neurons are modeled using 3D Gabor spatiotemporal filters, which detect spatial and temporal information inseparably. To further process spatiotemporal information for effective feature extraction and computation of the saliency map, we adopt center-surround interactions, namely inhibition and facilitation based on horizontal connections of neurons in V1. The visual attention model is then integrated into the proposed approach for better action recognition performance. The bio-inspired features generated by the IF neuron model are then encoded with the proposed action code based on the average activity of V1 neurons. Finally, action recognition is completed via a standard classification procedure. In summary, our model has several advantages:
1. Our model simulates only the visual information processing procedure in the V1 area, not in the MT area of the visual cortex. Our architecture is therefore simpler and easier to implement than other similar models.
2. The spatiotemporal information detected by the 3D Gabor filters, which is more plausible than that of other approaches, is more effective for action recognition than spatial and temporal information detected separately.
3. Salient moving objects are extracted by perceptual grouping and saliency computing, which can bind meaningful spatiotemporal information in the scene while filtering out the meaningless information.

4. A spiking neuron network is introduced to transform the spatiotemporal information into spikes of neurons, which is more biologically plausible and effective for the representation of spatial and motion information of the action.
Although extensive experimental results have validated the capabilities of the proposed model, further evaluation on a larger dataset, with more varied actions, subjects and scenarios, needs to be carried out. Both shape and motion information derived from actions play important roles in human motion analysis [2]. Fusing the two kinds of information is thus preferable for improving accuracy and reliability. Although there have been some attempts at this problem [30], they usually use a linear combination of shape and motion features to perform recognition. How to extract integrated features for action recognition remains challenging.
In addition, the recognition results of our model suggest that longer subsequences may be more helpful for information detection. However, in many practical applications it is impossible to observe an action for a long time, and most applications focus on short sequences. Thus, feature extraction for action recognition should be as fast as possible.
Finally, surround suppressive motion energy can be computed from a video scene based on the definition of the surround suppression weighting function, simulating the biological mechanism of center-surround suppression. The response to texture or noise at one position is inhibited by texture or noise in neighboring regions. Thus, the surround interaction mechanism decreases the response to texture without affecting the responses to motion contours, and is robust to noise. However, for a particular V1 excitatory neuron identified as the target neuron, the surround inhibition properties are known to depend on the stimulus contrast [50], with lower contrasts yielding larger summation RF sizes. To fire at lower contrast, the neuron has to integrate over a larger area to reach its firing threshold. This requires that the surround size be adjusted automatically according to local contrast. Therefore, there are still problems to be solved in the model, for instance the dynamic adjustment of summation RF sizes and the further processing of motion information in MT.
Supporting Information S1 File. The granted permission. (PDF)