Constructive Autoassociative Neural Network for Facial Recognition

Autoassociative artificial neural networks have been used in many different computer vision applications. However, it is difficult to define the most suitable neural network architecture because this definition is based on previous knowledge and depends on the problem domain. To address this problem, we propose a constructive autoassociative neural network called CANet (Constructive Autoassociative Neural Network). CANet integrates the concepts of receptive fields and autoassociative memory in a dynamic architecture that changes the configuration of the receptive fields by adding new neurons in the hidden layer, while a pruning algorithm removes neurons from the output layer. Neurons in the CANet output layer present lateral inhibitory connections that improve the recognition rate. Experiments in face recognition and facial expression recognition show that the CANet outperforms other methods presented in the literature.


Introduction
Computational models based on theories of the structure of the human brain have been applied to various problems in computer vision. Although the human brain is not yet fully understood, these models have inspired many methods used in pattern recognition, such as Artificial Neural Networks (ANNs) [1]. ANNs have been improved using concepts drawn from theories of the human brain, such as receptive and inhibitory fields (the latter also known as lateral inhibition) [2][3][4][5] and autoassociative memory [6]. Receptive field concepts have been used in neural networks for implicit feature extraction from input images, where neurons are connected to predefined regions in previous layers [7][8][9][10]. Lateral inhibition is applied to improve the stability and efficiency of the neural network, making the model less sensitive to image distortions [11][12][13][14].

Receptive and inhibitory fields
The receptive field concept refers to an area in which the presence of an appropriate stimulus produces a response in a sensitive neuron; it has been identified in the visual, auditory and somatosensory systems of the human brain [2]. Inhibitory fields, on the other hand, correspond to a region surrounding a given neuron that sends an inhibitory stimulus simultaneously with the excitatory effect of the receptive field.
The concepts of receptive and inhibitory fields have already been used in image processing to improve texture analysis and contour detection accuracy [20][21][22], while ANNs inspired by such concepts have been proposed to incorporate feature extraction in their architecture for visual pattern recognition tasks [8-10, 23, 24]. In another work, we presented a study about receptive fields and lateral inhibition and proposed a pyramidal neural network for image classification that integrates both concepts, called LIPNet [7]. This neural network obtained low error rates, fast performance and low memory consumption.

Autoassociative memory
Autoassociative Artificial Neural Networks (AANNs) have the advantage of allowing non-linear correlations [25]. AANNs are based on autoassociative memory, a type of memory in which the input pattern and the desired output are the same. Classifiers based on such memory are useful to determine whether or not a pattern is known. AANNs have been successfully applied in many computer vision tasks. Thompson et al. [26] applied AANNs to novelty detection, demonstrating that the learning goes beyond simple memorization. Cavalcanti et al. [27] applied an AANN to a face verification problem. Wang and Chen [28] proposed an autoassociative model, called EFMAM, to perform pattern recognition tasks, obtaining improvements over the non-autoassociative version of the proposed model. Rao et al. [29] performed emotion recognition in image sequences by extracting features from face regions with a five-layer AANN.
The output of an AANN is the reconstruction of the input pattern presented to it. AANNs have a bottleneck structure with fewer neurons in the hidden layers, which are responsible for data compression, than in the input and output layers. The pattern is mapped to a new feature space in the hidden layers, and the neural network then learns the inverse mapping by minimizing the distance between the input and the output pattern. AANNs have the ability to implicitly select and extract the features of the input data without any a priori knowledge or specific instruction [27]. However, this kind of representation is likely to have a high computational complexity due to the high number of neurons and synaptic connections.
AAPNet [15] is an autoassociative pyramidal neural network that achieved good results in object categorization. This neural network has a pyramidal architecture composed of receptive fields and shared weights reducing the number of synaptic connections between the neurons.

Constructive neural networks
Constructive algorithms [30] are learning methods used to adaptively adjust the architecture of ANNs. Many algorithms have been proposed to update the architecture of a neural network [31], such as:
• Constructive: add layers, neurons and connections during training to provide a minimal neural network architecture.
• Pruning: remove redundant layers, neurons and connections during training from a neural network with a larger and deeper structure.
• Constructive-Pruning: hybrid approach in which the neural network is pruned after a constructive process.
• Regularization: add a penalty term to the error function in order to discard unimportant connections during training [32].
Sharma and Chandra [31] and Kwok and Yeung [33] presented literature reviews of constructive algorithms, emphasizing two approaches: Cascade-Correlation (CC) and Dynamic Node Creation (DNC). The CC algorithm [34] creates neural networks with multiple hidden layers of one neuron each, where each new neuron is connected to all the neurons previously added. This algorithm enables the neural network to detect high-order features in the input pattern. However, the generalization ability of the neural network decreases as the number of added neurons increases, and the stimulus propagation might become very slow [33]. Extensions of CC have been proposed to allow, for example, more than one neuron in the same layer, but deciding in which layer a new neuron should be added is not trivial; algorithms such as those proposed by Ma and Khorasani [35] and Islam et al. [36] address this by limiting the number of neurons that can be added to each layer. DNC is a model proposed by Ash [37] that dynamically adds neurons to a hidden layer until the neural network output reaches the desired precision. This algorithm creates neural networks with one hidden layer, training the entire neural network every time a new neuron is added. DNC is a simple algorithm that inherits the convergence properties of the universal approximators [38] of the underlying architecture. The main disadvantage of DNC is that the search space is too large, which increases the computational cost and the convergence time [39].
The constructive algorithm for a one-hidden-layer feedforward neural network (OHL-FNN) [33] is an extension of the DNC algorithm designed to avoid its high computational cost. OHL-FNN freezes the neural network weights that have been previously trained and, when a new neuron is added, retrains only the weights affected by the insertion. Ma and Khorasani [17] used the OHL-FNN strategy for facial expression recognition, achieving a better classification rate than other neural network models with fixed structure.
In one-class classification problems with many patterns, such as facial recognition, constructive learning is particularly interesting since each neural network is evolved to learn one expression individually. The main advantages of using constructive methods are the following [31,40]:
• If the constructive algorithm is successful, the obtained neural network can be used to estimate the complexity of the learned problem.
• It is possible to incorporate domain-specific knowledge in the neural network that can be modified with the emergence of new training patterns.
• A different neural network configuration can automatically be defined for each learned pattern instead of using the same predefined neural network architecture for all known patterns.

Constructive Autoassociative Neural Network
The CANet is a model designed to implicitly extract features in a dynamic architecture that reconstructs the input image in the output layer; it follows the OHL-FNN [33] approach. It uses the concepts of receptive fields and autoassociative memory to represent a visual pattern with implicit feature extraction in a one-class classification model. To optimize the architectural configuration for each known pattern, a constructive algorithm for neural networks with one hidden layer is used; this choice was motivated by the work of Ma and Khorasani [17]. Fig. 1 presents the CANet training model. In each new training iteration, the learning algorithm evaluates whether it should change the neural network architecture to improve the reconstruction accuracy. New neurons are added to the hidden layer to bring the neural network output closer to the input image. It is important to note that each neuron in the hidden layer is connected to receptive fields at the same location in the input and output layers, without any overlap between adjacent receptive fields. Thus, since neurons in the output layer do not have a bias connection, all the neurons in the same receptive field region return the same value. To avoid noisy pixels in the reconstructed image that could impair the image classification, a pruning algorithm [18] eliminates the neurons in the output layer that present the highest error rates in the training set, reducing the computational cost of the CANet without compromising its efficacy.
In this section, we present the neural network architecture, the connectivity model, the training algorithm, the constructive and pruning algorithms and the multi-class recognition system. Table 1 presents the notation and definitions used to describe CANet. Fig. 2 presents the CANet architecture, which is composed of 2-D layers connected in cascade, i.e., the output of one layer is the input to the next one. First, the input pattern is reduced to a feature map that is smaller than the input layer. The extracted features are then used to reconstruct the input image in the output layer.

CANet Architecture
The CANet architecture is composed of three layers:
• Input layer: each neuron in this layer represents a pixel in the input image and is associated with a weight $w^{I}_{i,j}$. Thus, the images used as input to the neural network must have the same size as the input layer.
• Constructive layer: this layer is responsible for extracting the features from the input image. New neurons are added to this layer during training based on the error sensitivity of the neurons in the same receptive field of the reconstruction layer, $\delta^{R,k}_{i,j}$, where $k$ denotes a given training image.
• Reconstruction layer: responsible for returning the reconstruction of the input image using the features extracted in the constructive layer. Neurons in the same receptive field in the reconstruction layer share the same weight from the constructive layer.
The receptive field of a neuron has size $r^{u,v}_{i,j} \times r^{u,v}_{i,j}$, where $(i,j)$ denotes the position of the neuron in the constructive layer that was previously generated from the neuron $(u,v)$. The output of a neuron in the constructive layer, $y^{u,v}_{i,j}$, depends on the pixel value, $I^{k}_{m,n}$, and on the weight associated with it, $w^{I}_{m,n}$, of all the pixels in the receptive field of the neuron. The constructive neuron output and the weight associated with it, $w^{u,v}_{i,j}$, are used to reconstruct the input image along with the inhibitory stimulus sent by the neurons in the reconstruction layer, with the size of the inhibitory field given by $h^{R}$. The output of the neuron in the reconstruction layer, $y^{R}_{s,t}$, is the neural network output and is an approximation of the pixel in position $(s,t)$ of the input image.

Connectivity Model
The first layer of the CANet is the input image. The second layer is the constructive one. Each neuron in the constructive layer is associated with another neuron from which it was previously generated. The label of a neuron is given by two pairs of coordinates, the lower index $(i,j)$ and the upper index $(u,v)$: $(i,j)$ is the coordinate of the neuron after the division process, while $(u,v)$ corresponds to the lower index coordinate of the previously divided neuron. The first neuron is located in position $(1,1)$ of the constructive layer and is not associated with any neuron, which is represented by $(0,0)$. All the other neurons are generated from this neuron or from one of its descendants. Neurons in the constructive layer are connected to receptive fields with different sizes in the input and output layers. The output of each neuron in the constructive layer is obtained by applying a non-linear activation function to the weighted sum of the neurons in its receptive field. Thus, considering that $(i,j)$ is the position of a neuron expanded from the neuron in position $(u,v)$ of the constructive layer, $(m,n)$ the position of a pixel in the input layer and $b^{u,v}_{i,j}$ the bias associated with the neuron in position $(i,j)$, the output $y^{u,v}_{i,j}$ of the neuron in the constructive layer is given by
$$y^{u,v}_{i,j} = f\left(\sum_{(m,n) \in F^{u,v,I}_{i,j}} w^{I}_{m,n}\, I^{k}_{m,n} + b^{u,v}_{i,j}\right)$$
where $f$ is the activation function and $F^{u,v,I}_{i,j}$ is the receptive field in the input layer of the neuron in position $(i,j)$ in the constructive layer.
The output of a neuron in the reconstruction layer, $y^{R}_{s,t}$, depends on the output of the neuron in the constructive layer that contains it in its receptive field, represented by $y^{u,v}_{i,j}$, and on the lateral inhibition effect caused by neurons in its neighborhood. $y^{R}_{s,t}$ is calculated in three steps:
• Excitatory stimulus: for each neuron in the reconstruction layer, the excitatory stimulus is calculated from $y^{u,v}_{i,j}$ and $w^{u,v}_{i,j}$, where $w^{u,v}_{i,j}$ denotes the weight associated with the neuron in position $(i,j)$ expanded from $(u,v)$ in the constructive layer that contains the neuron $(s,t)$ of the reconstruction layer in its receptive field, $F^{u,v,R}_{i,j}$.
• Inhibitory stimulus: for each neuron in the reconstruction layer, the lateral inhibition is calculated from the neurons in its inhibitory field.
• Activation function: the excitatory and inhibitory stimuli are combined as the input to a non-linear activation function $f$ that is monotonically increasing, continuous, differentiable and bounded. In this work, $f$ is the sigmoid function.
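As a rough illustration of the connectivity described above, the following Python sketch computes the output of each constructive neuron over a non-overlapping receptive field and then assigns the resulting reconstruction value to every pixel of that field. It is only a sketch under stated assumptions: the excitatory stimulus is taken to be the product of the shared weight and the constructive neuron output, lateral inhibition is omitted entirely, and the names `Field` and `forward` as well as the toy weights are hypothetical and not taken from the original work.

```python
import math
from dataclasses import dataclass


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class Field:
    """Hypothetical container: one constructive neuron and its receptive field."""
    top: int
    left: int
    height: int
    width: int
    w_out: float  # weight shared by all reconstruction neurons in the field
    bias: float   # bias of the constructive neuron


def forward(image, w_in, fields):
    """Reconstruct `image` (a list of rows). Lateral inhibition is omitted."""
    rows, cols = len(image), len(image[0])
    output = [[0.0] * cols for _ in range(rows)]
    for fld in fields:
        cells = [(m, n)
                 for m in range(fld.top, fld.top + fld.height)
                 for n in range(fld.left, fld.left + fld.width)]
        # Constructive neuron: activation of the weighted sum over its field.
        weighted_sum = sum(w_in[m][n] * image[m][n] for m, n in cells)
        y_hidden = sigmoid(weighted_sum + fld.bias)
        # Assumption: excitatory stimulus = shared weight * hidden output.
        # With no bias and no inhibition, every pixel of the field gets the
        # same reconstructed value.
        y_out = sigmoid(fld.w_out * y_hidden)
        for m, n in cells:
            output[m][n] = y_out
    return output


# Toy 2x4 image split into two 2x2 receptive fields.
image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0]]
w_in = [[0.5] * 4 for _ in range(2)]
fields = [Field(0, 0, 2, 2, w_out=-4.0, bias=0.0),
          Field(0, 2, 2, 2, w_out=4.0, bias=0.0)]
recon = forward(image, w_in, fields)
```

In this toy case the dark left field is reconstructed with a low value and the bright right field with a high value, and all pixels inside a field share one value, which is the behaviour the text attributes to bias-free reconstruction neurons.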

CANet Training
CANet is a supervised neural network that tunes its weights during training to reduce the error between the obtained output and the input image. In each training step, the error sensitivity of each neuron in CANet is calculated first. Then, the error gradients for the weights are derived. Finally, the weights are updated in order to learn a visual pattern with a pre-defined architecture configuration. The constructive algorithm that optimizes the CANet configuration is shown in the next section. The error sensitivity $\delta$ of each neuron in the reconstruction layer for an input image $I^{k}$ is calculated in three steps:
• Image error: the difference between the obtained output $y^{R}_{s,t}$ of the neuron and the pixel intensity $I^{k}_{s,t}$.
• Sensitivity of the same layer: calculated by summing the image errors of the neurons in the reconstruction layer that contain the neuron in position $(s,t)$ in their inhibitory fields.
• Sensitivity of the neuron: calculated using $f'(s^{R,k}_{s,t})$, where $s^{R,k}_{s,t}$ is the input to the neuron $(s,t)$ in the reconstruction layer, $f'$ is the derivative of the activation function $f$ and $k$ is the index of the training image.
The error sensitivity for the neurons in the constructive layer is derived from the sensitivities of the reconstruction layer. Furthermore, the error gradient, $\partial E$, can be derived for the weights in the constructive layer, $w^{u,v}_{i,j}$, and for the biases.
Finally, the weights in the neural network must be updated following a given learning rule. In this work, Resilient Propagation (RProp) [41] is used. RProp is known as an algorithm that converges fast with high accuracy and robustness [42]. RProp updates the weights taking into account only the sign of the partial derivative over all patterns. Thus, each weight is adaptively updated according to the sign of its gradient, with a step size $\Delta^{(t)}_{i,j}$ that is adapted at each iteration $t$, where $\eta^{+} > 1$ and $0 < \eta^{-} < 1$ are the increase and decrease factors, respectively, that define the jump given in each learning step.
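The sign-based update can be sketched in Python as below. This is a minimal illustration of one common RProp variant (without weight backtracking), not the authors' implementation; the function name `rprop_step`, the default factors 1.2 and 0.5 and the step bounds are assumptions chosen for illustration only.

```python
def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def rprop_step(weights, grads, prev_grads, steps,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """Update `weights` and `steps` in place; return the gradients so they
    can be passed as `prev_grads` on the next call."""
    for i, (g, g_prev) in enumerate(zip(grads, prev_grads)):
        if g * g_prev > 0:
            # Same sign as before: increase the step size.
            steps[i] = min(steps[i] * eta_plus, step_max)
        elif g * g_prev < 0:
            # Sign changed (a minimum was jumped over): decrease the step size.
            steps[i] = max(steps[i] * eta_minus, step_min)
        # Only the sign of the gradient is used, never its magnitude.
        weights[i] -= sign(g) * steps[i]
    return list(grads)


# Toy usage: minimise E(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, step, prev = [0.0], [0.1], [0.0]
for _ in range(100):
    grad = [2.0 * (w[0] - 3.0)]
    prev = rprop_step(w, grad, prev, step)
```

Because the magnitude of the gradient is ignored, the step size alone controls how far each weight moves, which is why the increase and decrease factors define the jump taken at each learning step.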

Constructive and Pruning Algorithms
Constructive learning algorithms optimize neural network configurations for pattern classification without ad hoc choices [40]. Pruning algorithms, on the other hand, delete redundant connections in the neural network, improving the efficiency of the model without compromising its effectiveness. Algorithm 1 in Table 2 presents the hybrid constructive-pruning algorithm proposed to train the CANet. Initially, there is only one neuron in the hidden layer, with a receptive field containing all the neurons in the input and output layers; new neurons are then iteratively added to the hidden layer. With the addition of each neuron in the hidden layer, the neural network is retrained by updating only the new connections. The criterion chosen to add new neurons to the hidden layer is based on the mean error rate for the receptive fields of the hidden neurons in the output layer. At each iteration, the neuron with the highest difference between the maximum and the minimum error values, $eMax^{u,v}_{i,j} - eMin^{u,v}_{i,j}$, is expanded by dividing its receptive field into four equal-size receptive fields, so that three new neurons are added to the hidden layer.
The receptive fields are divided with the aim of connecting the neurons to homogeneous regions of the input image, and they are indexed using a quadtree model. Fig. 3 presents this structure. Initially, there is only one receptive field with the same height and width as the output layer, given by $H$ and $W$, respectively. This receptive field is divided into four receptive fields of size $\frac{H}{2} \times \frac{W}{2}$. Finally, the receptive field denoted by $F^{1,2,R}_{1,1}$ is divided into four receptive fields of size $\frac{H}{4} \times \frac{W}{4}$. After each division, the CANet is retrained. The receptive field division process repeats until the maximum number of neurons in the hidden layer, $p$, is reached. At the end of each expansion iteration, the validation error is calculated, and the constructive algorithm returns the CANet with the lowest validation error regardless of the number of hidden neurons.
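The expansion step can be illustrated with a short Python sketch. It assumes that receptive fields are stored as hypothetical `(top, left, height, width)` tuples with even dimensions and that a per-pixel error map is already available; retraining and validation after each division are left out, and all function names are illustrative rather than taken from the original algorithm.

```python
def error_spread(error_map, field):
    """Difference between the maximum and minimum error inside a field."""
    top, left, height, width = field
    values = [error_map[r][c]
              for r in range(top, top + height)
              for c in range(left, left + width)]
    return max(values) - min(values)


def split_field(field):
    """Divide a field into four equal quadrants (assumes even height/width)."""
    top, left, height, width = field
    h2, w2 = height // 2, width // 2
    return [(top, left, h2, w2), (top, left + w2, h2, w2),
            (top + h2, left, h2, w2), (top + h2, left + w2, h2, w2)]


def expand_once(fields, error_map):
    """Replace the field with the largest error spread by its four quadrants,
    i.e. one neuron becomes four (three new neurons are added)."""
    splittable = [f for f in fields if f[2] > 1 and f[3] > 1]
    if not splittable:
        return fields
    worst = max(splittable, key=lambda f: error_spread(error_map, f))
    return [f for f in fields if f != worst] + split_field(worst)


# Toy 4x4 error map in which only the top-right region is heterogeneous.
error_map = [[0.1, 0.1, 0.2, 0.9],
             [0.1, 0.1, 0.3, 0.8],
             [0.2, 0.2, 0.2, 0.2],
             [0.2, 0.2, 0.2, 0.2]]
fields = [(0, 0, 4, 4)]
fields = expand_once(fields, error_map)  # 1 field -> 4 fields
fields = expand_once(fields, error_map)  # splits the top-right quadrant
```

After the two expansions the homogeneous quadrants keep a single neuron each, while the heterogeneous top-right quadrant is covered by four smaller fields, which mirrors the quadtree indexing described in the text.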
During the CANet training, it is possible that some neurons do not learn the representation of some pixels well. Thus, after the constructive training, a pruning step is performed in the CANet reconstruction layer, and the neural network output is obtained considering only the $q$ most similar pixels between the input image and the neural network output over all training images. The pruning algorithm keeps in the reconstruction layer only the neurons that approximate the input image with the highest accuracy. Fig. 4 presents an illustration of the pruning algorithm. First, the mean error rate of each neuron $(s,t)$ in the reconstruction layer is calculated by averaging, over the $K$ images used to train the CANet, the error between the neuron output $y^{R,k}_{s,t}$ for input image $k$ and the intensity $I^{k}_{s,t}$ of the pixel in position $(s,t)$ of that image. Second, the mean error rates are sorted and the $q$ lowest mean error rates are selected. Finally, the neurons in the reconstruction layer associated with the selected rates are kept while the remaining neurons are removed.
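The three pruning steps can be sketched as follows. The sketch assumes the per-pixel error is the absolute difference between output and intensity (the exact error measure is an assumption), and the function name `prune_mask` and the toy data are hypothetical.

```python
def prune_mask(outputs, images, q):
    """Return the positions (s, t) of the q reconstruction neurons with the
    lowest mean error over all training images.

    outputs[k][s][t] is the reconstruction and images[k][s][t] the pixel
    intensity for training image k. Absolute error is an assumption.
    """
    num_images = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    mean_error = {}
    for s in range(rows):
        for t in range(cols):
            total = sum(abs(outputs[k][s][t] - images[k][s][t])
                        for k in range(num_images))
            mean_error[(s, t)] = total / num_images
    # Sort positions by mean error and keep the q best ones.
    ranked = sorted(mean_error, key=mean_error.get)
    return set(ranked[:q])


# Two toy 2x2 training images and their reconstructions.
images = [[[0.0, 1.0], [0.5, 0.5]],
          [[0.0, 1.0], [0.5, 0.5]]]
outputs = [[[0.1, 0.4], [0.5, 0.9]],
           [[0.1, 0.6], [0.5, 0.7]]]
kept = prune_mask(outputs, images, q=2)
```

Only the positions in `kept` would then take part in the comparison between an input image and its reconstruction; the other reconstruction neurons are discarded.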

Multi-class Recognition System
CANet is a neural network for one-class learning. The CANet training defines a closed decision boundary, and the distance from a pattern to such boundary is a measure of dissimilarity between the pattern and the class represented by the CANet. In multi-class tasks, a committee is created with one CANet for each class. Fig. 5 presents the multi-class recognition system of the CANet. The test image is used as input to each trained CANet, and a decider assigns the recognized class using the distances from the input image to the obtained outputs. In this work, the classification is set using the minimum operator. Therefore, the committee output is the class associated with the minimum distance obtained among all trained neural networks.
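A minimal Python sketch of the decision rule follows. It assumes a Euclidean distance between the input image and each reconstruction, which is an assumption made for illustration, and it uses stand-in callables in place of trained networks; the names `distance`, `classify` and `models` are hypothetical.

```python
import math


def distance(image, reconstruction, kept=None):
    """Euclidean distance over the kept pixel positions (all pixels if None)."""
    if kept is None:
        kept = [(s, t)
                for s in range(len(image))
                for t in range(len(image[0]))]
    return math.sqrt(sum((image[s][t] - reconstruction[s][t]) ** 2
                         for s, t in kept))


def classify(image, models):
    """`models` maps a class label to a callable returning a reconstruction.
    The recognized class is the one with the minimum distance."""
    distances = {label: distance(image, net(image))
                 for label, net in models.items()}
    return min(distances, key=distances.get), distances


# Stand-in "networks" that always return a fixed prototype image.
models = {
    "happy": lambda img: [[1.0, 1.0]],
    "sad": lambda img: [[0.0, 0.0]],
}
label, dists = classify([[0.9, 0.8]], models)
```

Each one-class network reconstructs images of its own class well and others poorly, so the smallest reconstruction distance identifies the most likely class.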

Experiments
In this section, we compare the results obtained by the CANet with other methods in the literature in order to demonstrate the effectiveness of the proposed model. Three public databases were used: ORL [43] and AR [44] databases for face recognition; and JAFFE [45] for facial expression recognition. The experiments with the JAFFE database were also performed to determine the CANet parameters that were used in the face recognition experiments with ORL and AR databases.

JAFFE Database
JAFFE (Japanese Female Facial Expression database) [45] was created to evaluate different methods for facial expression recognition [46][47][48][49]. JAFFE contains images of the six basic facial expressions plus the neutral expression, collected from 10 persons, with 3 or 4 images of each expression for each person. Experiments with the JAFFE database were performed using downsampled images, leading to a lower computational cost. The original images are cropped to 128×128 pixels to reduce the background influence. Cropped images are then scaled to 40×40 pixels and their histograms are equalized. Feature extraction methods are not used, and the pre-processing steps are independent of the image class. Fig. 6 presents some images after the pre-processing.
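The pre-processing pipeline can be sketched in plain Python as below. The crop position (centred), the interpolation method (nearest neighbour) and the 256-level integer grey scale are assumptions, since the text does not specify them, and the function names are hypothetical.

```python
def center_crop(img, size):
    """Crop a size x size window from the centre (crop position is assumed)."""
    rows, cols = len(img), len(img[0])
    top, left = (rows - size) // 2, (cols - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]


def resize_nearest(img, size):
    """Nearest-neighbour scaling to size x size (interpolation is assumed)."""
    rows, cols = len(img), len(img[0])
    return [[img[r * rows // size][c * cols // size] for c in range(size)]
            for r in range(size)]


def equalize(img, levels=256):
    """Histogram equalisation for integer grey levels in [0, levels)."""
    flat = [p for row in img for p in row]
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    denom = len(flat) - cdf_min
    if denom == 0:  # constant image: nothing to equalise
        return [row[:] for row in img]
    return [[round((cdf[p] - cdf_min) / denom * (levels - 1)) for p in row]
            for row in img]


def preprocess(img):
    """Crop to 128x128, scale to 40x40, then equalise the histogram."""
    return equalize(resize_nearest(center_crop(img, 128), 40))


# Synthetic 256x256 gradient image standing in for a database photograph.
raw = [[(r + c) // 2 for c in range(256)] for r in range(256)]
small = preprocess(raw)
```

The result is a 40×40 image, i.e. 1600 pixels, whose grey levels span the full range after equalisation; no step depends on the class of the image.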
CANet has 4 parameters that have to be experimentally determined: the maximum number of neurons in the hidden layer $p$, the number of neurons considered in the output layer $q$, the size of the inhibitory field $h^{R}$ and the lateral inhibition strength $y^{R}$. The following experiments show the influence of each parameter on CANet facial expression recognition using the first test approach, with two randomly chosen images of each expression per person for training and the remaining images for testing.
The number of neurons in the hidden layer determines how sensitive the constructive algorithm is to the error. Fig. 7 presents the facial expression recognition rate for different numbers of neurons in the hidden layer using all the output neurons ($q = 1600$), i.e., no pruning is performed, and with no lateral inhibition. The highest recognition rate of 90.1% is achieved with approximately 25% of neurons in the hidden layer in comparison with the input size; this configuration is used in the following experiments.
The number of neurons considered in the output layer by the pruning algorithm, given by $q$, determines how sensitive the CANet is to variations in the learning for different image pixels, allowing the classification to use only the pixels that best represent a class. Using only 50% of the neurons in the reconstruction layer, the facial expression recognition rate reaches 91.2%, which is 1.1 percentage points higher than the rate obtained considering all the neurons in the CANet output. We experimentally evaluated other values for $q$, but they did not improve the recognition rate. Thus, for the following experiments $q$ is set to 800.
Different configurations for lateral inhibition in the reconstruction layer of CANet are evaluated with the inhibitory field size and its strength varying from 1 to 10 and 1 to 25, respectively. Table 3 presents only the best results obtained varying the inhibitory field size, $h^{R}$, and strength, $y^{R}$. The highest recognition rate is obtained with an inhibitory field size of 6 and a strength equal to 17. A Student's t-test at the 5% significance level between CANets with and without lateral inhibition shows that the presence of inhibitory fields produces a statistically significant improvement in the results obtained by the neural network.
Table 3. Facial expression recognition rate (%) for different configurations of inhibitory field size, $h^{R}$, and inhibition strength, $y^{R}$, of the CANet in the JAFFE database.
Two approaches are used to evaluate CANet in facial expression recognition using the JAFFE database in comparison to other methods in the literature. In the first approach, we use the same methodology applied by Zhi et al. [46], which randomly chooses two images of each expression per person for training and uses the remaining images for testing. Bashyal and Venayagamoorthy [50] used a similar approach. In the second approach, called leave-one-image-out cross-validation, each image in the database is used for testing in one iteration while the remaining images are used for training. This approach was used by Cheng et al. [47]. Each approach was repeated 30 times and the average recognition rate is presented. Table 4 presents a comparison between the CANet, the AAPNet [15] and different methods with feature extraction. The first approach is used, and the feature extraction methods with which CANet is compared are the ones presented by Zhi et al. [46] and Bashyal and Venayagamoorthy [50].
The best method with feature extraction presents a facial expression recognition rate of 91.5%, while CANet presents a rate of 93.0%, indicating that the implicit feature extraction process allows the CANet to generalize without any prior feature extraction method. Also, the CANet classification rate is 3.2 percentage points higher than the one obtained by AAPNet. Table 5 presents the confusion matrix obtained with CANet.

In the second approach used to test the method, one image is used for testing at each iteration while the remaining images are used for training. The results obtained by CANet with this approach are compared with other classifiers without feature extraction. Cheng et al. [47] proposed a Gaussian classification process without any feature extraction method. Following the same approach, Cheng et al. [47] obtained a facial expression recognition rate of 93.4%. The CANet obtains a recognition rate of 99.9%. Table 6 presents the results obtained by the CANet, the AAPNet, the Gaussian process and the k-NN classifier without feature extraction using the best value calculated for the neighborhood size. The autoassociative neural networks present a recognition rate much higher than the other classifiers without feature extraction, and the CANet presents the highest recognition rate.
Table 4. Comparison between the facial expression recognition rate (%) obtained by CANet and different methods with feature extraction with the first test approach in the JAFFE database.

ORL Database
The ORL (Cambridge Olivetti Research Lab) face database [43] contains 400 different images, taken at different times from 40 people of different gender, age and race. The face images include variations in expression (such as open or closed eyes and smiling or not smiling), details (such as glasses/no glasses), rotation (up to about 20 degrees) and scale (up to about 10%). In this face recognition experiment, the main concern is to recognize thumbnail-sized face images, which require less storage memory and recognition time. Thus, all the images were subsampled to 28×23 pixels. Table 7 shows the comparison between the error rate obtained with the CANet and the results presented by Zhu et al. [51]. The CANet presented the lowest error rate in all the experiments using 3, 4 and 5 training images.
Table 6. Comparison between the facial expression recognition rate (%) obtained by CANet and different methods without feature extraction with the second test approach in the JAFFE database.

AR Database
The last experiment was conducted on the AR face database [44] using a cropped version with images from 50 males and 50 females [52], downsampled to 40×40 pixels. Twenty-six images, divided into two sessions, were captured from each person. The same experimental protocol as Mi et al. [53,54] was used: for each individual, 7 images from the first session without any occlusion were used for training, while the 7 corresponding images from the second session were used for testing. Table 8 shows that the CANet achieves the highest recognition rate in comparison with the methods presented by Mi et al. [53,54], such as Linear Regression-based Classification (LRC), Robust Linear Regression-based Classification (CLRC) and Sparse Representation-based Classification on K-Nearest Subspace (SRC-KNS).

Discussion
In this paper, we proposed a novel neural network inspired by biological concepts present in the brain, called CANet. The proposed model is a constructive autoassociative neural network that returns as output an approximation of the presented input image using a dynamic architecture. CANet combines receptive fields for implicit feature extraction with lateral inhibition and autoassociative memory for image reconstruction.
The CANet is a one-class classifier that connects the hidden neurons to homogeneous regions in both the input and output layers of the neural network. A constructive algorithm is applied to find the number of neurons in the hidden layer that minimizes the mean distance between the input images and the neural network outputs. After the CANet training, a pruning algorithm is used to keep in the output layer only the neurons with the highest accuracy in the training set, improving both the classification accuracy and the computational cost. It is important to note that the constructive algorithm improves the implicit feature extraction process performed in the hidden layer, while the pruning algorithm removes the redundancy of the output layer. Finally, the recognition system presented allows the use of the CANet in multi-class tasks. We showed that CANet outperforms other state-of-the-art algorithms in facial recognition tasks.
Improvements could be made to the CANet, such as creating an overlapped region among adjacent receptive fields and defining different shapes for the receptive and inhibitory fields. Such changes could also allow more hidden layers, creating a deeper model. The use of evolutionary techniques to find the most suitable architecture for the neural network is also of interest. Moreover, experiments with tasks other than facial recognition should also be addressed.