A Stable Biologically Motivated Learning Mechanism for Visual Feature Extraction to Handle Facial Categorization

The brain mechanism of extracting visual features for recognizing various objects has consistently been a controversial issue in computational models of object recognition. To extract visual features, we introduce a new, biologically motivated model for facial categorization, which is an extension of the Hubel and Wiesel simple-to-complex cell hierarchy. To address the synaptic stability versus plasticity dilemma, we apply the Adaptive Resonance Theory (ART) for extracting informative intermediate level visual features during the learning process, which also makes this model stable against the destruction of previously learned information while learning new information. Such a mechanism has been suggested to be embedded within known laminar microcircuits of the cerebral cortex. To reveal the strength of the proposed visual feature learning mechanism, we show that when we use this mechanism in the training process of a well-known biologically motivated object recognition model (the HMAX model), it performs better than the HMAX model in face/non-face classification tasks. Furthermore, we demonstrate that our proposed mechanism is capable of following similar trends in performance as humans in a psychophysical experiment using a face versus non-face rapid categorization task.


Introduction
Although real-world object recognition is one of the most complex and difficult tasks, it is performed robustly and rapidly by the primate visual system. The visual system can easily adapt itself to real-world object recognition, where objects are presented in cluttered backgrounds that can vary in illumination, viewpoint, position and scale. Neurobiological evidence demonstrates that object recognition in the visual cortex is mediated by the ventral visual pathway [1], which starts from the primary visual cortex V1, continues through the extrastriate visual areas V2 and V4 to the inferotemporal cortex (IT), and then reaches the prefrontal cortex (PFC) [2][3][4]. This pathway exhibits a hierarchical structure in which the complexity of the preferred stimuli and the receptive field size of cells increase along the hierarchy [2,3]. Based on this widely accepted evidence, several models of the visual cortex have been proposed. A major breakthrough in this field came from the work of Hubel and Wiesel on the cat [5,6] and macaque primary visual cortex [7]. These studies demonstrate that processing in the visual cortex follows a hierarchical structure. Following Hubel and Wiesel's pioneering proposal of a hierarchical model for the primary visual cortex, several hierarchical object recognition models have been developed. For example, Fukushima [8] proposed the Neocognitron, a hierarchical multilayered neural network that is capable of robust visual pattern recognition through learning [9,10]. Riesenhuber and Poggio [11] proposed the HMAX model, which is based on the classical simple-to-complex cell model of Hubel and Wiesel. The HMAX model attempts to quantitatively resemble visual processing in the ventral visual pathway and exhibits a significant degree of invariance to scale and translation. Furthermore, this model outperforms some state-of-the-art computer vision systems in applications such as object recognition and scene understanding [12].
Another group of models, including the LAMINART and SMART models, does not fall into the category of object recognition models. These models try to implement details of circuits and layers of the visual cortex. The LAMINART model [13][14][15] is a model of the visual cortex that attempts to implement details of layers and circuits in the lateral geniculate nucleus (LGN), and the V1 and V2 areas of the visual cortex. The Synchronous Matching ART model (SMART) [16] implements interactions between the laminar cortical circuits and higher-order thalamic nuclei. These models are based on the adaptive resonance theory, which was developed and inspired by how the brain performs information processing [17,18].
Solving the stability-plasticity dilemma together with achieving memory stability in an evolving input environment is considered as a fundamental goal. The stability-plasticity dilemma is related to how our brain learns enormous amounts of information and can remain stable against forgetting previously learned material. The LAMINART and SMART models attempt to show how the ART mechanism may be embedded in the cerebral cortex and attempt to propose a solution to the stability-plasticity dilemma observed in the cerebral cortex.
Extracting biologically plausible visual features that can mimic visual processing in the primate brain has been a challenging goal for computational models of object recognition. For example, learning in the model proposed by Serre et al. involves a simple mechanism of selecting random patches from the training images [19]. However, random selection is not a biologically plausible approach. To select only relevant features for a given task, LeCun used a supervised back-propagation approach to learn visual features in a convolutional network [20]. Ghodrati et al. proposed a method that uses feedback from a classifier (analogous to the PFC) to extract informative visual features. Their method uses an optimization algorithm to select informative patches from a large pool of patches [21]. Masquelier et al. [22] used the spike timing-dependent plasticity (STDP) learning rule in an architecture based on the Serre et al. model. Although this is a biologically plausible approach, it is not stable because previously learned information is forgotten. Furthermore, each input must be presented several hundred times, whereas our brain is usually able to learn scenes at first glance.
In this paper, we propose a model that adds a stable visual feature learning mechanism to one of the well-known object recognition models (the HMAX model), which is based on the hierarchical model of Hubel and Wiesel. The HMAX model is a feedforward network of four layers of alternating simple and complex units (S1, C1, S2, C2). The HMAX model with our proposed feature learning mechanism, inspired by the ART system, suggests a mechanism for solving the problem of stability versus plasticity in object recognition systems. Both the ART mechanism, which is employed in our model, and the STDP rule are biologically plausible. However, the ART mechanism enables our model to learn informative features from a single presentation of the input image. This is in contrast to the STDP rule, which requires each image to be presented several hundred times.
There are some other object recognition models that have used the Adaptive Resonance Theory. For example, Woodbeck et al. [23] proposed a biologically plausible hierarchical structure which was an extension of the sparse localized features (SLF) suggested by Mutch et al. [24]. One of their contributions was that, instead of using support vector machines (SVM) for classification, they used Fuzzy ARTMAP as a biologically plausible multiclass classifier [25] which is based on the Adaptive Resonance Theory (ART). There are also some other studies that have employed Adaptive Resonance Theory to classify objects after extracting features [26,27]. However, we have adopted Adaptive Resonance Theory for selecting informative visual features before classification stage in a learning mechanism. There are also many other pattern recognition systems based on the ART mechanism [28][29][30][31][32], which do not have a hierarchical structure inspired by the primate visual cortex.
We evaluated the proposed learning mechanism in a facial categorization task and compared the results with a benchmark model of object recognition; we also compared the performance of both models with the performance obtained from a psychophysical experiment using human observers. Our results demonstrate that the proposed model has a higher classification performance than the benchmark model and resembles human responses at an acceptable level.

Materials and Methods
The stability-plasticity dilemma
Humans can memorize new faces at a glance, but this fast learning ability does not cause previously known faces to be forgotten. The ability of our learning system to memorize novel events is called plasticity. In contrast, the ability that prevents the catastrophic forgetting of previously learned information is called stability. The need to balance these two abilities, which arises in all adaptive processes of the brain, is called the stability-plasticity dilemma [18]. This dilemma hinges on the idea that human and mammalian brains are able to learn massive amounts of new information throughout their life without forgetting previously learned information.
One theory that addresses the stability-plasticity dilemma is the ART, which was proposed by Grossberg [17]. The ART is a cognitive and neural theory that attempts to provide a solution for the stability-plasticity dilemma. It proposes a top-down matching mechanism in which bottom-up signals activate top-down expectations; this attracts attention to the relevant information in the bottom-up pathway (Figure 1). The ART works with an on-center, off-surround network that amplifies the activities of the cells within the matched portion (the on-center) while suppressing the activities of irrelevant cells in the non-matched portion (the surround) (Figure 1). The top-down modulatory on-center, off-surround circuit [33][34][35][36][37] is used for the matching process in our proposed model. We used this matching process for selecting attended features and inhibiting unattended ones. The proposed model makes use of the bottom-up adaptive weights as well as the top-down expectations, which enables the attended feature patterns to be learned. If the input pattern adequately matches the top-down expectations, then these top-down expectations will reactivate relevant bottom-up pathways, thereby generating a state of feedback resonance between the bottom-up and top-down pathways. In contrast, a large mismatch can lead to hypothesis testing or searching for a new and more predictive category.
As previously described, top-down connections exist in the early layers of the visual cortex such as V1 and V2. This demonstrates that the visual cortex, unlike the classical model of Hubel and Wiesel, has not only feed-forward connections but also feedback connections, which are thought to play a key role in the stabilization of both development and learning within multiple cortical areas including the V1 and V2 areas [40]. Therefore, the feedback loop from complex cells to simple cells through a modulatory on-center, off-surround network can be thought of as an implementation of ART matching in the visual cortex.

The stability-plasticity dilemma in the visual cortex
How the visual cortex automatically develops circuits and can still remain stable is a major question for which several models have been developed, including the LAMINART model [13], which attempts to implement details of the layers and circuits of the LGN, V1, and V2 areas in the visual cortex. The Synchronous Matching ART model [16] is another example, which goes beyond the LAMINART model and implements interactions between laminar cortical circuits and higher-order thalamic nuclei. The LAMINART and SMART models are based on the adaptive resonance theory, which suggests a solution for the stability-plasticity dilemma.
Simple cells in the V1 area receive direct inputs from the LGN and also from an on-center, off-surround network [38]. Complex cells receive inputs from simple cells with the same orientation but different contrast polarities and can thus respond to both polarities. In addition to these bottom-up connections, cortical connections of the visual cortex have been shown to provide feedback to lower-level layers. For instance, active complex cells send top-down signals to simple cells through an on-center, off-surround network, and simple cells in turn activate complex cells. This feedback process is called folded feedback (see Figure 2B in [14]). The top-down signals from complex cells to simple cells, sent through an on-center, off-surround network, enable highly active complex cells to inhibit less active cells [39]. The V2 circuitry also demonstrates a similar pattern to that of V1, but on a larger spatial scale.
Figure 2. A schematic diagram of the proposed model architecture. Grayscale images are applied to the system and the outputs of S1 and then C1 are attained. Then, the S2 responses are computed using existing prototypes. Next, to compute the C2 responses, the S2 units with the maximum response for each prototype for all positions and scale bands are selected. The highest active C2 units are then selected as prototypes to represent the image (these are shown in the red box at the top of the figure). This selection is achieved by top-down expectations, which match the input image to the prototypes. A lateral subsystem (vigilance control), which uses a vigilance parameter (ρ), determines the matching degree between the prototypes and various parts of the input image. If a selected active C2 unit has a smaller response than the vigilance value, then a new prototype is extracted from the current input image and added to the existing prototypes. doi:10.1371/journal.pone.0038478.g002

The proposed model
We propose a biologically motivated object recognition model which incorporates the HMAX model and uses a stable learning method, inspired by the ART mechanism, to solve the stability-plasticity dilemma. The proposed model is generally based on the Neocognitron [9] and on HMAX (another hierarchical model based on the Neocognitron), proposed by Riesenhuber and Poggio [11]. Some parameters of the model proposed in this study, particularly those in the edge detection stage, have been adjusted to be comparable with the HMAX model in facial categorization tasks (we used the HMAX MATLAB implementation, which was freely available at http://cbcl.mit.edu/software-datasets/index.html). Furthermore, to solve the stability-plasticity dilemma, we used the ART mechanism to extract more informative features of intermediate complexity, which consequently provides a more realistic biologically inspired model.
The proposed model has a hierarchical structure and intends to emulate rapid object categorization in the visual cortex. The model consists of alternating simple and complex units. Simple (S) units correspond to the simple cells in the visual cortex, which combine their inputs according to a bell-shaped tuning function to increase selectivity. Complex (C) units correspond to the complex cells in the visual cortex, which show tolerance to a shift in the position and size of the stimuli within their receptive field. These units pool their inputs through a maximum (max) operation [11] to increase invariance (biologically plausible circuits for these two operations can be found in [41]). The proposed model consists of four layers of alternating simple and complex units (Figure 2). The S1 units take the form of the Gabor function [42] and convolve the input image to detect bars and edges. The Gabor function, which has many free parameters, agrees well with physiological data recorded from simple cell receptive fields in cat striate cortex [43]. The parameters of the Gabor function were set to match the tuning properties of simple cells in V1. The S1 units include 16 filter sizes, ranging from 7×7 to 37×37 pixels in steps of two pixels, and four orientations (0°, 45°, 90°, 135°). In total, there are 64 different S1 units. These 64 filters are then divided into eight bands, where each band contains two adjacent filter sizes [12].
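The S1 filter bank described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the MATLAB implementation used in the study. The filter sizes, orientations and band grouping follow the text; the envelope width `sigma`, the `wavelength` and the aspect ratio `gamma` are placeholder assumptions, since the tuned values are not given here.

```python
import math

SIZES = list(range(7, 38, 2))        # 7, 9, ..., 37: the 16 filter sizes
ORIENTATIONS = [0, 45, 90, 135]      # preferred orientations in degrees
BANDS = [SIZES[i:i + 2] for i in range(0, len(SIZES), 2)]  # 8 bands of 2 sizes


def gabor_kernel(size, theta, wavelength, sigma, gamma=0.3):
    """Return a size x size Gabor kernel (zero mean, unit norm) as a list of rows."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates to the preferred orientation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    # Normalize so that the filter ignores mean luminance and overall contrast.
    flat = [v for row in kernel for v in row]
    mean = sum(flat) / len(flat)
    norm = math.sqrt(sum((v - mean) ** 2 for v in flat))
    return [[(v - mean) / norm for v in row] for row in kernel]


def build_s1_bank():
    """Build the 16 sizes x 4 orientations = 64 S1 filters."""
    bank = {}
    for size in SIZES:
        sigma = 0.4 * size           # assumption: width grows linearly with size
        wavelength = sigma / 0.8     # assumption: fixed ratio of wavelength to width
        for deg in ORIENTATIONS:
            bank[(size, deg)] = gabor_kernel(size, math.radians(deg), wavelength, sigma)
    return bank
```

Calling `build_s1_bank()` returns a dictionary keyed by `(size, orientation)`, and `BANDS` lists which two filter sizes are grouped in each band.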
Each of the complex C1 units pools its inputs over a group of simple S1 units which have the same preferred orientation but slightly different positions and sizes. The index of the filter size band determines the pooling range for the C1 units. This pooling increases the invariance to changes in shift and size inside the receptive field of the units.
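As a rough sketch of this pooling step, the function below takes the S1 response maps of the two filter sizes in one band (same orientation) and applies a max operation first across the two sizes and then over local spatial neighborhoods. The names `pool` and `stride` are illustrative assumptions; in the model their values depend on the band index, and the maps are assumed to be rectified and of equal shape.

```python
def c1_pool(s1_a, s1_b, pool, stride):
    """Max-pool two same-orientation S1 maps of one band into a C1 map.

    s1_a, s1_b: 2-D lists of equal shape (responses for the two filter sizes).
    pool: side length of the square spatial pooling neighborhood.
    stride: step between neighboring pooling windows.
    """
    rows, cols = len(s1_a), len(s1_a[0])
    # Max across the two adjacent filter sizes gives tolerance to size.
    merged = [[max(s1_a[r][c], s1_b[r][c]) for c in range(cols)] for r in range(rows)]
    pooled = []
    # Max over each local window gives tolerance to position.
    for r0 in range(0, rows - pool + 1, stride):
        pooled.append([
            max(merged[r][c]
                for r in range(r0, r0 + pool)
                for c in range(c0, c0 + pool))
            for c0 in range(0, cols - pool + 1, stride)
        ])
    return pooled
```

For a band with two 4×4 maps, `c1_pool(a, b, pool=2, stride=2)` yields a 2×2 C1 map in which each entry is the strongest response in its window at either filter size.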
The next layer is S2, whose units are selective to more complex patterns than bars or edges within their receptive fields. The units of this layer receive their input from retinotopically organized C1 units in a spatial grid and in all four orientations via weighted connections (the bottom-up weights; Figure 1) that respond to specific patterns or prototypes.
The last layer of the model consists of C2 units that respond to the prototypes of the input image extracted from different locations, which increases invariance. A C2 unit has connections with S2 units of the same prototype but at different sizes and positions. Thus, the output of this layer is a vector of C2 values of size N, where N is the number of prototypes learned by the model. The C2 responses illustrate the matching between the prototypes and the input image. A high C2 response indicates that the extracted prototype is sufficiently matched by a portion of the input image and is thus suitable for representing the input image.
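The S2 and C2 computations can be illustrated with a short sketch. It assumes a Gaussian form for the bell-shaped tuning function with a hypothetical sharpness parameter `beta`, and represents each C1 patch and each prototype as a flat list of numbers; neither detail is specified in the text above.

```python
import math


def s2_response(patch, prototype, beta=1.0):
    """Bell-shaped (Gaussian) tuning: 1.0 for a perfect match, falling toward 0."""
    dist2 = sum((x - p) ** 2 for x, p in zip(patch, prototype))
    return math.exp(-beta * dist2)


def c2_vector(patches, prototypes, beta=1.0):
    """One C2 value per prototype: the best S2 match over all positions and scales.

    patches: flattened C1 patches taken from every position and scale band.
    prototypes: the N learned prototypes, flattened the same way.
    """
    return [max(s2_response(x, w, beta) for x in patches) for w in prototypes]
```

For example, `c2_vector([[0, 0], [1, 1]], [[1, 1], [0, 1]])` gives a response of 1.0 for the first prototype, which appears exactly in the image, and a lower response for the second, which is only partially matched.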
The feedback from complex cells to simple cells through the on-center, off-surround network in the V1 and V2 areas of the visual cortex leads to the excitation of related simple cells by winner complex cells and inhibits irrelevant cells. In addition to the feedback from complex cells to simple cells, the feed-forward connections between simple and complex cells create a feedback loop that yields a resonant state for relevant cells [39].
Based on this feedback loop, we simulate match learning to learn informative intermediate-level visual features from the input images. This feedback excites portions of inputs that are matched by the prototypes of the active C2 units and inhibits portions of inputs that are not matched by these prototypes (Figure 1). In contrast, if the mismatch is higher than the value of the vigilance parameter (this parameter is explained later), the existing C2 units are unable to represent the input image. In that case, new prototypes are extracted from the current input and added to the preceding C2 units. In other words, we assume that for each input image, P C2 units are sufficient to represent the image. If these P features are already available in the current pool of patches, we have an accurate representation of the input image. Otherwise, new patches are extracted and added to the pool of patches. To achieve informative prototypes for each image, we employed the match learning and reset mechanism of the ART system (Figure 2). An analogy can be seen between adding new C2 units and match-based learning, which has been suggested to be a learning mechanism in the brain. Match-based learning updates memory only when a completely new input occurs or when inputs from the external world are sufficiently close to internal expectations [16].
We presented all of the training images to the system, and the outputs of S1 and then C1 were attained. The S2 responses were then computed using the existing prototypes. Next, to compute the C2 responses, the S2 units with the maximum response for each prototype over all of the positions and scale bands were selected. We selected the P C2 units with the highest activity to represent the image (this selection was achieved by top-down expectations, which match the input image to prototypes) and compared them with a vigilance parameter to determine the matching degree between the prototypes and the input image. These selected units are shown separately at the C2 level (Figure 2). If the amount of matching is lower than the vigilance, then the prototypes do not represent the input image appropriately, and new prototypes are extracted from the current image and added to the prototype pool. Using this learning process, with a single presentation of each image of the training set, proper prototypes that represent the image are efficiently extracted.
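The learning loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: each image is assumed to be already reduced to a list of flattened C1 patches, the tuning function is assumed Gaussian with a hypothetical `beta`, and the rule for choosing which patch becomes a new prototype (the patch least well matched by the current pool) is an assumption, since the text does not specify it.

```python
import math


def s2(patch, proto, beta=1.0):
    """Gaussian match between a C1 patch and a prototype (1.0 = identical)."""
    return math.exp(-beta * sum((x - p) ** 2 for x, p in zip(patch, proto)))


def best_match(patch, prototypes, beta):
    """How well the current pool already represents this patch (0.0 if pool is empty)."""
    return max((s2(patch, w, beta) for w in prototypes), default=0.0)


def learn_prototypes(images, p_units, vigilance, beta=1.0):
    """Single-pass, vigilance-gated prototype learning.

    images: list of images, each given as a list of flattened C1 patches.
    p_units: number P of C2 units assumed sufficient to represent one image.
    vigilance: minimum C2 response for a prototype to count as a match.
    """
    prototypes = []
    for patches in images:                      # each image is presented once
        for _ in range(p_units):                # at most P new prototypes per image
            c2 = sorted((max(s2(x, w, beta) for x in patches) for w in prototypes),
                        reverse=True)[:p_units]
            if len(c2) == p_units and c2[-1] >= vigilance:
                break                           # match: existing prototypes suffice
            # Mismatch: take the least well represented patch as a new prototype.
            candidate = min(patches, key=lambda x: best_match(x, prototypes, beta))
            if best_match(candidate, prototypes, beta) >= vigilance:
                break                           # nothing novel left in this image
            prototypes.append(list(candidate))
    return prototypes
```

Existing prototypes are never modified or removed, so presenting an already learned image again leaves the pool unchanged, while a sufficiently novel image only adds to it. This is the stability property the mechanism is meant to provide.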
To control the generality of the learned features, a vigilance parameter in the model was used that is analogous to the process mediated by acetylcholine. According to the SMART model [16], a combination of nonspecific nuclei and the nucleus basalis of Meynert is proposed to play the role of the vigilance parameter in our model (see Table 1 in [16]). The vigilance parameter is set in such a way to attain the highest performance with the fewest prototypes. The selection of the vigilance parameter is highly critical in an ART network, and there is no special rule for setting the value of vigilance [44]. To determine the vigilance parameter, a group of images were randomly selected from the dataset prior to the training and testing stages. Next, from these images, the vigilance parameter was specified manually. Finally, the vigilance parameter remained fixed during both the training and testing stages for these experiments.
The classification stage: To compare our model with the HMAX model [12] in a face/non-face categorization task, we added a classification stage to the model that is similar to that of the HMAX model. Each image in the training and testing sets was passed through the layers of the model, and the responses of the C2 units were computed and saved as a vector representing the extracted features for that image. These vectors were then passed to a linear classifier (a simple linear SVM classifier) for classification.

Images dataset
To evaluate the performance of the proposed model, we used the face image category of the widely used California Institute of Technology (Caltech101) dataset [45]. This dataset consists of 101 different object classes as target images and a background folder as negative examples. We used the background dataset as distractor images. The face dataset contains face images of various people against various backgrounds in various positions. This dataset appears to be challenging for facial categorization. The numbers of images in the face and background datasets are 435 and 451, respectively. The dataset is freely available at http://www.vision.caltech.edu/Image_Datasets/Caltech101 and has been widely used by other researchers; studies that have used these face images include [22,45-48].

Classification by the Proposed Model
We designed various experiments to compare the proposed model with the HMAX model in face/non-face categorization tasks. The images were converted to grayscale and rescaled to 140 pixels in height. The width was rescaled accordingly to preserve the aspect ratio. In all experiments, the following procedure was performed:
1. Extracting C2-level features: Our stable fast learning algorithm was run on the training dataset to extract a set of C2-level features.
2. Training the SVM classifier: All of the training set images were applied one by one to the model, and the C2 responses were calculated. The C2 responses with labels (1 for positive and -1 for negative examples) were used to train a classifier (i.e., the simple linear SVM classifier). Note that the layers are fixed at this stage, and learning in the lower levels of the system is stopped.
3. Evaluating the extracted features: The performance of the classifier on the test set was evaluated.
The overall procedure was repeated 20 times, and the average performance and standard deviation (SD) were reported.
In the first experiment, we evaluated the performance of the proposed model in a face/non-face classification task. For this purpose, the datasets were randomly divided into two subsets with an equal number of images, i.e., the training and test sets. The first subset was used for extracting C2-level features and training the SVM classifier, and the second subset was used for evaluating the classification performance.
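The evaluation protocol (random half split, repeated 20 times, mean and SD reported) can be sketched as below. The `evaluate` argument is a hypothetical callback standing in for the full pipeline of feature extraction, SVM training and testing; the seeded random generator is an assumption added for reproducibility.

```python
import random
import statistics


def repeated_split_eval(samples, evaluate, runs=20, seed=0):
    """Repeat a random 50/50 train/test split and summarize the scores.

    samples: list of labeled examples.
    evaluate: function (train, test) -> performance score for one run.
    Returns (mean, sample standard deviation) over the runs.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = samples[:]            # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        scores.append(evaluate(shuffled[:half], shuffled[half:]))
    return statistics.mean(scores), statistics.stdev(scores)
```

A call such as `repeated_split_eval(data, run_pipeline)` would then return the two numbers reported for each experiment, where `run_pipeline` is whatever function trains on the first half and scores on the second.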
In the next experiment, we studied the effect of the number of training samples on the classification performance. The model was evaluated using different numbers of positive training samples.
To further study the biological plausibility of the proposed model, we compared the performance of humans in the face/non-face categorization task with that of the model.

Results
In the next two sections, we report the results of different comparisons made between the proposed model, another biologically plausible model (HMAX), and human subjects. First, the results of the proposed model are compared against the HMAX model in three different experiments. As a follow-up to these results, we compare the performance of the human subjects in a psychophysical test (rapid categorization of faces versus non-faces) with the performance of our proposed model.

Comparison with another biologically plausible model
We compared our results against another established biologically motivated object recognition model, the HMAX model. This model outperformed many machine-vision object recognition systems at several tasks [12]. We evaluated the performance of the HMAX model using the proposed visual feature learning mechanism against the standard HMAX model. For this purpose, we used the face category of Caltech101.
In the first experiment, the face and background datasets were randomly divided into two separate sets of equal size. Next, we applied our stable fast-learning algorithm to the training dataset to extract the most informative intermediate-level features from the images. The vigilance parameter in the model was determined such that the most informative features with the highest possible performance were extracted. After this stage, the prototype learning was stopped, and the obtained features were applied in the face/non-face classification task. The classification performance of these features was then computed. In the classification stage, we used a linear SVM classifier. The performance was reported as the accuracy at the equilibrium point, i.e., the point at which the false positive rate equals the miss rate. For a fair comparison, we also used the HMAX model on the same training and test sets. The classification performance was 98.5% for our model and 98% for the HMAX model. To determine whether the performance differences between the HMAX model and the proposed model were statistically significant, we used two non-parametric statistical tests, i.e., the Wilcoxon rank sum test [49] and the two-sample Kolmogorov-Smirnov test [50] (implemented in the MATLAB statistical toolbox). Under the null hypothesis, the distribution and mean of both groups are equal, so that the probability of an observation from one population (X) exceeding an observation from the second population (Y) equals the probability of an observation from Y exceeding an observation from X. Note that the distributions are the classification performances obtained over 20 independent runs.
Figure 4 (partial caption). Examples of distractors. The first row consists of noisy images, and the second row corresponds to noise-free images. (C) The psychophysical task process. A face image is presented for 20 ms, and then a blank screen is presented (ISI 10 ms). Next, a noisy mask is presented for 80 ms. Finally, the subject is asked to select "YES" or "NO" by pressing the appropriate key on a computer keyboard. doi:10.1371/journal.pone.0038478.g004
The results of the experiment on the number of training samples are shown in Figure 3A (experiments were independently performed 20 times, and the average performance and standard deviation are reported). For each training run, after the training data were given to the proposed model, sufficient informative C2-level features were extracted and subsequently used to train the classifier for the face/non-face classification task. We then ran the benchmark model using the same number of features. To make the comparison more challenging for our proposed model, we also ran the benchmark model using 1,000 features. As shown in Figure 3A and 3B, the performance of the proposed model was better than that of the benchmark model (for the same number of features) across different numbers of training examples; the proposed model also performed moderately better than the benchmark model with 1,000 features. In some cases in Figure 3A and 3B, the differences are not statistically significant. However, when fewer features are used, as can be seen in Figure 3C, the classification performance of our method is significantly better, and the p-values show that the results in this case are statistically significant.
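The equilibrium-point accuracy used in the first experiment can be computed along the following lines. This sketch assumes that the classifier produces a real-valued score per image (for example, the SVM decision value), sweeps the decision threshold over the observed scores, and reports the accuracy at the threshold where the false positive rate and the miss rate are closest to each other.

```python
def equilibrium_accuracy(scores, labels):
    """Accuracy at the threshold where false positive rate and miss rate meet.

    scores: classifier outputs, higher meaning more face-like.
    labels: +1 for positive (face) and -1 for negative (non-face) examples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    best_gap, best_acc = None, None
    for t in sorted(set(scores)):
        miss_rate = sum(s < t for s in pos) / len(pos)   # faces rejected
        fp_rate = sum(s >= t for s in neg) / len(neg)    # non-faces accepted
        gap = abs(fp_rate - miss_rate)
        if best_gap is None or gap < best_gap:
            correct = sum(s >= t for s in pos) + sum(s < t for s in neg)
            best_gap, best_acc = gap, correct / len(scores)
    return best_acc
```

With finite data the two rates may never be exactly equal, which is why the sketch picks the closest threshold rather than requiring equality.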
Figure 6. Details of the process used to compare the stability of the proposed model with the HMAX model. The first iteration can be explained as follows: first, two images (i.e., the t1 set) are randomly selected and sent to the train bag. Next, both the proposed model and the HMAX model are used to extract a pool of patches called t1 patches, which are depicted in the red-colored dashed box (box Q). Subsequently, patches are extracted from two other randomly selected images (i.e., the t2 set), and these patches are added to the pool of training patches, indicated by the pink-colored squares in box Q. The process is continued by extracting patches from all of the images in the test bag and storing them in box P (the patches are shown as pale blue-colored squares). Next, the average distances between all the training and test patches, which are shown in boxes Q and P, respectively, are computed. doi:10.1371/journal.pone.0038478.g006
In the next experiment, we compared the performance of the proposed model with the HMAX model using different numbers of features. In this case, the parameters of the proposed model were set to extract a given number of features, and these features were then used in the classification task. The datasets were randomly divided into two separate subsets of equal size (the training set and the test set). As shown in Figure 3C, when fewer features are used, our proposed model significantly outperforms the benchmark model. For example, the proposed model performed significantly better with few features (approximately 93% with only 9 features) than the benchmark model (approximately 80% with 9 features). This demonstrates that the proposed visual feature learning mechanism can strongly improve the performance of the HMAX model using very few features, in contrast to the HMAX model with randomly extracted features.
This finding illustrates that our mechanism has extracted more informative features from the input images than did the standard HMAX model. With such a biologically plausible learning mechanism, we addressed the stability-plasticity dilemma and also solved the problem of extracting redundant features.
Moreover, the proposed model suggests a stable biologically plausible learning mechanism for extracting intermediate level visual features (for more information regarding the stability of the proposed model please refer to the stability of the proposed model section).

Comparison with human
We performed a psychophysical experiment for categorizing faces versus non-faces to compare the proposed model with human observers. For this purpose, we selected the Caltech face and background datasets as positive samples and distractors, respectively. In addition, we added various levels of salt and pepper noise to these images. The noisy images were shown to human subjects using a computer screen in a random order. The human subjects were instructed to respond as fast and as accurately as possible to determine whether the image contains a human face or a distractor by pressing the ''YES'' or ''NO'' key. The results obtained from the human subjects were compared with those obtained using the proposed model on the same image set.
We used 16 human subjects in this experiment (18-36 years old) with an equal number of male and female subjects. The Stimulus Onset Asynchrony (SOA) in this test was a fixed SOA of 30 ms (20 ms image presentation followed by an Interstimulus Interval (ISI) of 10 ms). The experiment was performed in a dark room. The participants were seated 0.5 m away from the computer screen (Intel core 2 duo processor (2.66 GHz), 4 GB RAM). We used the MATLAB software with the psychophysics toolbox [51][52][53]. In the experiment, the image was presented for 20 ms, and this was followed by the presentation of a random noise mask. The mask appeared after a fixed ISI for duration of 80 ms (which corresponded to an SOA of 30 ms). Please refer to Figure 4C for additional details of the psychophysical experiment procedure.
To vary the difficulty of the task, we used five sets with an equal number of images in each set (60 faces and 60 distractors at the same level of noise in each set, 600 stimuli in total). These five sets correspond to five levels of noise (0, 20, 40, 60, and 80%; see Figure 4). The images (300 faces and 300 distractors) were randomly selected from the face and background datasets. Next, salt and pepper noise at the corresponding level was generated and superimposed on the images in each set. The images were presented in a random order at the center of the screen (256 × 300 pixels, grayscale images). Each image appeared only once to avoid image-specific learning effects. The subjects were asked to respond as quickly and as accurately as they could as to whether the image contained a human face or a distractor by pressing the "YES" or "NO" key on the computer keyboard. In addition, the subjects were alternately asked to use their left or right hand to press the "YES" vs. "NO" key. Each experiment lasted approximately 15 minutes. The remaining images of both the face and distractor datasets were used to extract C2-level features and to train the classifier in the proposed model. The training images were noise-free. Next, we evaluated the performance of the classifier on the "test" set.
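As an illustration of how such stimuli could be produced, the following minimal Python sketch superimposes salt and pepper noise on a grayscale image. It rests on assumptions not stated in the text: the noise level is read as the fraction of pixels that are corrupted, and corrupted pixels become black (0) or white (255) with equal probability. The function name `add_salt_and_pepper` is hypothetical.

```python
import random


def add_salt_and_pepper(image, noise_level, rng=None):
    """Return a noisy copy of a grayscale image given as a list of pixel rows.

    Assumption: `noise_level` (0.0 to 1.0) is the probability that any given
    pixel is replaced by pure black (0) or pure white (255).
    """
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must lie in [0, 1]")
    if rng is None:
        rng = random.Random()
    noisy = []
    for row in image:
        new_row = []
        for pixel in row:
            if rng.random() < noise_level:
                # Corrupt this pixel: "pepper" (0) or "salt" (255).
                new_row.append(rng.choice((0, 255)))
            else:
                new_row.append(pixel)
        noisy.append(new_row)
    return noisy


if __name__ == "__main__":
    gray = [[128] * 8 for _ in range(4)]  # toy 4 x 8 uniform gray image
    for level in (0.0, 0.2, 0.4, 0.6, 0.8):
        noisy = add_salt_and_pepper(gray, level, random.Random(1))
        changed = sum(p != 128 for row in noisy for p in row)
        print(f"noise level {level:.1f}: {changed} of 32 pixels corrupted")
```

Passing a seeded `random.Random` instance makes the generated stimuli reproducible across runs.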
A comparison between the average performance of the human observers (n = 16, 30 ms SOA) and the proposed model in the face/non-face classification task is shown in Figure 3D. Performance was measured using the sensitivity index d′, which combines the hit and false-alarm rates of each observer into a single standardized score. The responses of the proposed model and the human subjects were roughly similar: the model followed trends similar to those of the human subjects in this experiment. The performance of the HMAX model in this experiment is also shown in Figure 3D (green line).
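For reference, the sensitivity index d′ of signal detection theory is the difference between the z-transformed hit rate and the z-transformed false-alarm rate. The sketch below computes it from raw response counts. The log-linear correction (adding 0.5 to each count) is an assumption introduced here to keep the score finite when a rate equals 0 or 1; the text does not say whether or how such cases were handled.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Assumption: a log-linear correction (+0.5 per count) is applied so that
    rates of exactly 0 or 1 do not produce infinite z-scores.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal distribution
    return z(hit_rate) - z(fa_rate)


if __name__ == "__main__":
    # Hypothetical observer: 60 face trials and 60 distractor trials.
    print(round(d_prime(hits=50, misses=10, false_alarms=10, correct_rejections=50), 3))
```

A value of zero corresponds to chance performance, and larger values indicate better discrimination between faces and distractors.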
We also compared our results with human responses using ROC curves. The blue curve in Figure 5 was obtained by averaging the ROC curves across 10 random runs, and the upper and lower green curves are the maximum and minimum ROC curves, which correspond to the highest and lowest performance of the proposed model across the runs. Because a full ROC curve cannot be obtained from the human observers' responses, we represented the true positive and false positive rates of each subject as one of the sixteen red circles shown in Figure 5 (we magnified some important parts of the plots in Figure 5 for better visualization). The majority of the red circles were located below the maximum curve, above the minimum curve, and close to the average ROC curve, which implies that the performance of the proposed model closely resembles that of the human observers.

Stability of the proposed model
One interesting property of the proposed model is its stability, which means that after learning new features, the model is still capable of remembering previously learned ones. To examine the stability of the model, we designed an experiment that enabled the measurement of the stability of the proposed model and the comparison of its stability with the HMAX model.
To measure stability, we trained each model using n images and subsequently added m new images to both trained models. The two models were then compared to determine how well each retained the first n training images. In each iteration of this procedure, previously learned features are preserved while m new training images are presented to both models. In this step, our stable visual feature learning mechanism extracts a new patch only when the vigilance parameter determines that the patch needs to be added to the previously learned pool of patches; in this way, the new pool becomes more capable of representing the m new images. In the HMAX model, by contrast, the same number of patches is extracted randomly. Next, in the test phase, we extract new patches from all of the preceding images except for the most recent m training images. The average of the minimum distances between these two groups of patches is then computed to determine how similar the extracted patches remain to the previously extracted patches after adding the m new training images. We consider this average distance a measure for comparing the stability of the models (additional details are depicted in Figure 6). In each step, we present two new images, and new training patches are extracted from these new images.
Figure 7 compares the stability of the proposed approach with that of the HMAX model for various patch sizes. The average minimum distances between the test and training patches are reported for six patch sizes (4, 8, 12, 16, 20, and 24). In general, a downward trend in a curve indicates that the average distance decreases as new training images are added, thus confirming that the model does not forget previously learned features (the steeper the downward slope, the greater the stability).
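The average minimum distance used as the stability measure can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: patches are represented as flattened vectors of equal length, and the Euclidean distance is assumed because the text does not name the metric. The function names `euclidean` and `average_min_distance` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two flattened patches of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_min_distance(test_patches, learned_patches):
    """For each test patch, find the distance to its nearest learned patch,
    then average these minimum distances over all test patches.

    Assumption: Euclidean distance on flattened patch vectors.
    """
    if not test_patches or not learned_patches:
        raise ValueError("both patch sets must be non-empty")
    total = 0.0
    for t in test_patches:
        total += min(euclidean(t, p) for p in learned_patches)
    return total / len(test_patches)


if __name__ == "__main__":
    test = [[0.0, 0.0], [3.0, 4.0]]  # patches from earlier images
    pool = [[0.0, 0.0]]              # learned pool before an update
    print(average_min_distance(test, pool))
    pool.append([3.0, 4.0])          # pool after a useful patch is added
    print(average_min_distance(test, pool))
```

Under this reading, a pool that keeps its old patches while gaining relevant new ones can only hold or lower the measure, which matches the downward trends described for a stable model.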
To clarify, imagine that the model has learned some features from an input image. When a new image is added, the model may or may not need to learn new features: it learns new features only when the previously learned features are insufficient for describing the new input image. As a result, the average minimum distance decreases at each stage because the model does not forget previously learned features and extracts only the new features that are required. If the model were not stable, it would extract unnecessary features at every stage, which would increase the average minimum distance. As observed, the proposed model exhibited downward slopes in most cases, unlike the HMAX model, which mostly shows upward trends. In Figures 7E and 7F, both models exhibit downward trends; however, the trend in the proposed model has a steeper downward slope. The exception is patch size 4 (Figure 7A), for which the proposed model showed an upward trend, indicating that this specific patch size was not stable. This could be due to the small area that a patch of size 4 covers. Such a small patch may not cover important discriminative components of the face in an image; therefore, it may not be able to separate a face from a distractor sufficiently well (Figure 8 also illustrates that the performance was close to chance level for patch size 4).
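The idea that a new feature is stored only when the existing ones cannot describe the input can be illustrated with a simplified vigilance test. The Python sketch below is not a full ART network and is not the authors' implementation: the similarity function, the threshold semantics, and the names `similarity` and `update_pool` are assumptions chosen for illustration. What it preserves is the behavior described in the text: stored patches are never altered or discarded, and a candidate is added only when no stored patch matches it well enough.

```python
import math


def similarity(a, b):
    """Similarity in (0, 1]: 1 for identical patches, lower as they differ.

    Assumption: derived from the Euclidean distance between flattened patches.
    """
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)


def update_pool(pool, candidates, vigilance):
    """Return a new pool in which a candidate patch is appended only if its
    best match among the stored patches falls below `vigilance`.

    Previously stored patches are kept unchanged, so nothing is forgotten.
    """
    updated = list(pool)
    for candidate in candidates:
        best = max((similarity(candidate, p) for p in updated), default=0.0)
        if best < vigilance:
            updated.append(candidate)
    return updated


if __name__ == "__main__":
    pool = [[0.0, 0.0]]
    new_patches = [[0.0, 0.1], [5.0, 5.0]]  # one familiar, one novel patch
    print(update_pool(pool, new_patches, vigilance=0.5))
```

A higher vigilance value makes the match criterion stricter, so more patches are stored; a lower value yields a smaller, coarser pool.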
We also probed the relationship between the performance and stability of the model by running an experiment for each patch size separately using different numbers of training images. In this experiment, the classification performance was measured for each patch size. As shown in Figures 7 and 8, the proposed model achieved better performance when it was trained with more stable patches. This suggests a direct relationship between the stability of the proposed model and its performance.

Discussion
The most widely accepted biological evidence shows that visual processing in the brain exhibits a hierarchical structure, which starts from the primary visual cortex (V1), and then continues to the extrastriate visual areas (V2 and V4), which are next followed by the inferotemporal cortex (IT) and the prefrontal cortex (PFC).
It is thought that plasticity and learning probably occur at all stages, in particular at the level of the IT and PFC [54]. How this learning and plasticity occur in the cortex has been a major concern in computational models of the visual cortex. For example, in the model proposed by Serre and colleagues, learning occurs only between layers C1 and S2 and consists of a simple mechanism of indiscriminately selecting patches from the training images [19]. This approach leads to acceptable results, but the redundancy between features is very high. Moreover, many of the features may be irrelevant to the classification task, which increases the cost of classification and decreases the performance. In addition, random selection is not a biologically plausible approach. Apart from random selection, other approaches have been suggested, including the use of supervised backpropagation to learn the visual features in a convolutional network. Another approach used the STDP learning rule to extract intermediate-complexity visual features [22]. These features have been shown to support robust object recognition in some classification tasks. However, because the STDP rule causes previously learned information to be forgotten, this approach is unstable. Furthermore, learning with this rule requires each input to be presented several hundred times, whereas our brain is able to learn scenes at a glance. In contrast to the STDP rule, we proposed another approach for learning intermediate-level features, which is not only biologically plausible but also addresses the problems of instability, the need for repeated image presentation, and the redundancy of the extracted visual features in the HMAX model.
Whereas other models do not explain how the visual cortex remains stable against the destruction of previously learned information over time, our model applies the ART mechanism, which addresses the stability-plasticity dilemma. We showed that the proposed model is capable of learning new information without losing previously learned information. We also demonstrated that there is a direct relationship between the stability of the model and its performance: if the model is trained with more stable patches, it performs better. This mechanism was implemented in a hierarchical feed-forward model of the visual cortex and used in face categorization. We also compared our model with the HMAX model in face/non-face categorization tasks. In the 'different numbers of training images' experiment, our model performed better than the HMAX model, although the difference was not significant. In the 'different numbers of features' experiment, however, our model significantly outperformed the HMAX model, particularly with small numbers of features. The experiments using different numbers of features showed that our model extracts as few features as possible from the training images, namely the most informative ones, and still achieves acceptable performance. In contrast, the HMAX model needs to extract more features to reach similar performance. This shows that the features learned by the proposed mechanism are highly informative, which makes them capable of giving a much better representation of the input images in higher processing layers. This in turn improves the classification performance while using fewer features, as shown in Figure 3C.
To determine to what extent the proposed model can mimic the performance of human subjects, we ran the same face/non-face categorization task with human subjects in a rapid categorization psychophysical test. Our results showed that the trend in the model's performance approximately resembles the trend observed in human subjects.