
Virtual sample based techniques using deep features for SSPP face recognition in unconstrained environment

  • Muhammad Tariq Siddique,

    Roles Conceptualization, Methodology, Validation, Writing – original draft

    tariqsiddique.bukc@bahria.edu.pk

    Affiliation Department of Computer Sciences, Bahria University, Karachi, Pakistan

  • Ibrahim Venkat,

    Roles Conceptualization, Project administration, Supervision, Writing – review & editing

    Affiliation School of Computing and Informatics, Universiti Teknologi Brunei, Jalan Tungku Link Gadong, Brunei-Muara, Brunei Darussalam

  • Humera Farooq,

    Roles Conceptualization, Writing – review & editing

    Affiliation Department of Computer Sciences, Bahria University, Karachi, Pakistan

  • Sharul Tajuddin,

    Roles Conceptualization, Supervision, Writing – review & editing

    Affiliation School of Computing and Informatics, Universiti Teknologi Brunei, Jalan Tungku Link Gadong, Brunei-Muara, Brunei Darussalam

  • S. H. Shah Newaz

    Roles Conceptualization, Supervision, Writing – review & editing

    Affiliation School of Computing and Informatics, Universiti Teknologi Brunei, Jalan Tungku Link Gadong, Brunei-Muara, Brunei Darussalam

Abstract

As challenging as it is to use face recognition with a Single Sample Per Person, it becomes even more difficult when face recognition based on a single sample is performed in an unconstrained environment. The unconstrained environment is normally considered irregular in facial expressions, pose, occlusion, and illumination. This degree of difficulty increases as a result of the single sample and in the presence of occlusion. Extensive research has been done on face recognition under pose and expression changes. Comparatively, less research has been reported on the occlusion problem that occurs in facial images. Occlusion may alter the appearance of facial images and cause deterioration in recognition. A robust method is required to handle the occlusion in the face image to improve the recognition performance. This study aimed to implement an effective augmentation technique that improves the performance of the Single Sample Per Person face recognition system in unconstrained environments. Virtual samples were created to expand the sample size to address the problem of a single sample. A local region-based technique was proposed to deal with occlusion by creating virtual samples. A deep neural network-based model, FaceNet, was used to extract the features and a support vector machine was used for classification. The performance of the proposed approach was evaluated, demonstrating its superiority in handling occlusion compared to that of its state-of-the-art counterparts. The proposed method achieved significant accuracy improvements, specifically 94.83% for the occlusion with sunglasses and 98% for the occlusion with scarves in the AR dataset.

1 Introduction

Face recognition systems attempt to verify a subject from a video or image using facial information [1]. These systems have characteristics that distinguish them from other biometric systems: they are contactless, nonintrusive, convenient, and scalable. These traits make them well suited to identifying and verifying individuals. Face recognition is considered a prominent approach to identification and surveillance in the field of biometrics [2].

The Single Sample Per Person (SSPP) problem is the complex problem of recognizing a person when the facial database holds only a single reference sample per subject. The goal is to recognize and identify a person from one sample despite unpredictable poses, geometric and photometric changes, facial expression, age, illumination, low resolution, dimensionality, and occlusion by accessories, makeup, and hairstyle [3]. Variations in facial appearance caused by the unconstrained environment have a greater effect than personal identity [4]. In particular, face images of different subjects can look more similar than face images of the same subject taken under varied conditions in an unconstrained environment [5]. Researchers have found the problem challenging for most current algorithms because of the limited training data and low dimensionality [6].

Because face recognition is noninvasive, the captured faces are often affected by occlusion. The most common occlusion arises when subjects wear accessories such as sunglasses, scarves, and masks that partially hide their facial regions. Objects between the camera and the face also create occlusion, which causes the loss of part of the facial information. Local region-based methods are used to solve the occlusion problem in SSPP face recognition. The works of [7,8] show the effectiveness of these methods under occlusion. These works also showed that face recognition performance is affected more by occlusion of the upper region than of the lower region.

In addition, a large training set is required to address the above-mentioned issues, especially for Deep Learning (DL) approaches. In the past decade, DL approaches have been extensively applied to enhance Face Recognition (FR) models [9], and several efforts have been made to handle the Single Sample Per Person Face Recognition (SSPP FR) problem under controlled conditions [10]. However, the existing literature shows that there is still scope to investigate and deploy machine learning approaches to handle the uncertainty encountered in unconstrained environments. Recently, a survey was conducted on the use of DL approaches for SSPP [9]. Google [11] and Facebook [12] have played an important role in demonstrating the importance of large training data sets for using DL approaches effectively for FR in an uncontrolled environment. The challenges are mainly due to the non-rigid structure of the face, the capturing conditions, and the modeling of the face recognition problem. In addition, when the environment has uncontrollable conditions, system performance decreases significantly, which severely limits the success of identification.

Local region-based augmentation works independently on different features of the face. The technique has been widely used in the past, and several studies have reported successfully treating facial expression, pose, and occlusion [6,13]. Efforts are still underway to overcome occlusion, in particular sunglasses and scarf occlusion, in an unconstrained environment. Examples are [14] and [7], which use local region information to solve the occlusion problem in an unconstrained environment. In [14], facial images were divided into non-overlapping patches, and statistical descriptors and histogram methods were used for feature extraction. In [7], after the images were divided into non-overlapping patches, eigenvalues were used to extract features in the form of matrices. In another method, the facial information was divided by bilateral symmetry, and projection matrices were used to extract the features [15].

The proposed study aims to solve the problem of a small training set by putting in place an effective augmentation technique. Choosing the right augmentation technique to enlarge the training set is expected to improve recognition performance. This led us to propose a local region-based augmentation technique to solve the occlusion problem. The proposed approach requires only a small number of augmented training samples and hence has the advantage of minimizing the training overhead of the classifier models.

2 Related work

Classical augmentation techniques are used to learn facial features as a global region for the intra-class variation and to increase the sample size for training by generating virtual samples [15–18]. On the other hand, the local region-based augmentation techniques are used to learn inter-class variations to increase the sample size virtually [6,7,14,15,19]. Similarly, deep learning and image processing techniques are also used to improve image quality and expand the training set for SSPP [20,21]. The generative techniques include Generative Adversarial Networks to produce synthetic images. For this purpose, various transformations are applied to the face images. Different existing studies based on the above-mentioned techniques are presented in Table 1 and discussed in the following subsections:

Table 1. Existing studies using different techniques to solve SSPP problem where P = Pose, E = Expression, I = Illumination, LR = Low resolution, O = Occlusion.

https://doi.org/10.1371/journal.pone.0322638.t001

2.1 Classical augmentation techniques

Augmentation techniques are used to artificially expand the volume of data available for training a model [22]. They are widely used in the SSPP FR problem to extend the single available sample into multiple samples. Augmentation techniques are broadly divided into classical techniques and generative techniques [23]. The classical techniques include geometric transformation, which covers variations such as rotation, image translation, noise injection, flipping, and cropping, and photometric transformation, which deals with light and shadow problems. Another study reported face recognition using infrared images in an unconstrained environment [24]. A detailed survey by X. Wang discussed the challenges and future directions of augmentation techniques for face datasets [3].

Recently, 3D modeling [25] has been proposed as an augmentation technique. In another study, virtual samples [17] were created using the Non-Negative Matrix Factorization (NMF) method, together with extension methods such as sliding windows, mirror transform, and bit plane. A k-LiMAPS algorithm, based on optimal direction and iterative l0-norm minimization, was proposed to substitute the Sparse Dictionary Learning (SSDL) technique; it relied on geometric transformations (translation and scaling) [6]. Addressing the same problem, another study used different photometric and geometric transformations (rotating, flipping, cropping, edge enhancement, color jittering, and addition of principal components to the images) to increase the classification performance of a CNN [26].

However, the classical techniques are unable to handle the diversity of face images in an unconstrained environment. Further, the training samples generated by the classical augmentation techniques could be easily correlated with the gallery images. This limitation degrades the classification as they could not be considered autonomous samples. In addition, the greater number of generated virtual samples maximizes training cost [6].

2.2 Generative methods

Generative methods are used to generate synthetic images to enlarge the training dataset. According to a survey paper [27], the generative methods enrich the gallery set from the available single reference face image by generating new synthetic images. The Generative Adversarial Network (GAN) has gained popularity due to its image restoration ability, especially in SSPP problems [28,29]. The method introduced by [28] (IL-GAN) is based on a GAN and a variational autoencoder; singular value decomposition was used to create a decision-maker that distinguishes between illumination levels. The method employed by [29] (SharedGAN) creates virtual samples using a GAN: the GAN is trained on a generic dataset, the learned variations are applied to the gallery set, and a Convolutional Neural Network (CNN) is used for feature extraction and classification. Further, the different variations of a contaminated dataset were addressed by a Variation Disentangling Generative Adversarial Network [30], a framework based on a generator and a discriminator that work in an adversarial way. The authors of [31] applied a GAN model to address the data imbalance problem in face recognition and proposed a large-scale system to deal with pose, expression, and illumination. In [32], the authors created samples using a generic dataset and an error coding technique: they extracted errors from the generic set and performed multilevel error coding. An identity-attribute disentanglement framework introduced by [33] separates identity-related and identity-irrelevant features to improve recognition accuracy, and employs an adversarial feature augmentation mechanism to generate diverse identity-preserving samples, enhancing model generalization. Recently, a conditional GAN was proposed to generate synthetic face images from a single real sample. The method modifies attributes such as expression, age, gender, pose, and lighting to create diverse images while preserving the original identity [34].

2.3 Local region based techniques

The local region-based methods are used to solve the occlusion problem in SSPP face recognition. Different surveys conducted on SSPP FR highlight the working and effectiveness of local region-based methods [1,35]. Several past studies proposed local region-based techniques that learn the facial information and enlarge small training data to generate robust results. To address the complex variation in an unconstrained environment, authors have proposed methods such as dictionary learning [36–40], the Fuzzy Multi-Manifold Classifier [8], patch-based methods [41–43], and sparse representation classification [44], which are based on the local region or intra-class variation. These methods also showed that face recognition performance is affected more by occlusion of the upper region than of the lower region. Local region-based methods are also considered effective, low computational cost solutions for face recognition.

Most existing local region-based techniques address different variations across different existing datasets. Recent studies that handle pose, expression, occlusion, and illumination include methods based on an auxiliary dictionary [39], eigenvalues [7], projection matrices [15], Multi-Block Color-Binarized Statistical Image Features (MB-C-BSIF) [14], and edge pixels [45]. In [14], facial images were divided into non-overlapping patches, and statistical descriptors and histogram methods were used for feature extraction. In their extended work, the authors used VGG 16 for feature extraction and improved their results [13]. In [7], eigenvalues are used to extract features in the form of matrices after the images are divided into non-overlapping patches. In another method, the facial information was divided by bilateral symmetry, and projection matrices were used to extract the features [15]. A Self-Organizing Maps (SOM)-based technique [38] achieved good accuracy compared to the existing studies; however, its use of Scale-Invariant Feature Transform (SIFT) descriptors for extracting the local features may increase the computational time and cost. A dual-feature classification approach (DF-SRC) has been proposed that improves face recognition under occlusion. It combines global features from the Discrete Wavelet Transform (DWT) and local features from Local Binary Patterns (LBP) for better representation. A sparse representation-based classification (SRC) method then reconstructs and classifies the face, making recognition more robust [46].

However, a closer look at the existing literature on SSPP problems based on local region-based methods reveals several gaps and shortcomings in the unconstrained environment. The local region-based methods provide solutions for inter-variance and intra-variance class problems in an unconstrained environment. Local region-based methods can lessen the effect of moderate facial variations. The image variations have a deep impact on local feature extraction and discriminative learning from partitioned patches. Misalignment or pose variations in probe samples cause a mismatch between the gallery and probe samples. The limitation of these methods lies in the complexity of variations. Patch-based methods are considered traditional methods that work on the local region of the faces. The main issue with patch-based methods is the presence of face misalignment that causes degradation of the recognition rate. The proposed study will use the local regions to solve the occlusion problem using the augmentation technique. The aim is to use this technique to increase the recognition performance of the SSPP FR using the Support Vector Machine (SVM) classification method.

3 Proposed augmentation techniques for SSPP in unconstrained environment

The presented approach applies augmentation techniques to generate virtual samples and so solve the problem of a small training set for SSPP. We have applied the local region-based technique, in which the key points of facial images are extracted from the eye and mouth regions. To deal with occlusion in the facial images, we propose a two-fold solution. First, the facial key points are detected and the ROI is extracted. Second, an overlay patch is applied to create an artificially occluded image that is used as a virtual sample.

The facial features are then extracted to train the classifier, and the test samples are classified. The pre-trained FaceNet model is used for normalization and face embedding. For classification purposes, we have used the SVM. The virtual sample will be used as a training set (gallery images) whereas testing will be done by the probe images. The complete methodology of the proposed study is shown in Fig 1.

Fig 1. Proposed approach for SSPP FR problem using augmentation techniques.

https://doi.org/10.1371/journal.pone.0322638.g001

3.1 Local area region-based technique

The local region-based methods are used to solve the occlusion, illumination, and pose estimation problems in SSPP FR in an unconstrained environment. Local region-based augmentation works independently on different features of the face. We extracted the facial features and regions of interest (ROI) and used this information to recognize the face. The first step is landmark detection, which localizes the important points and selects the ROI at the eye and mouth regions of the gallery images. In the next step, an overlay patch is applied to the ROI of the gallery images. In the last step, the artificially occluded images are added to the gallery to increase the sample size, creating an augmented gallery. These steps are shown in Algorithm 1:

Algorithm 1. Local region-based augmentation.

Input: I (Gallery Image)

Output: Iaug (Occluded Augmented Images)

  • Perform face detection to identify the face region
    • Rf = Facedetection(I)
  • Detect facial landmark points and extract the region of interest (ROI)
    • L,RROI = Mediapipe(Rf)
  • Apply an overlay patch P to the ROI
    • Roccluded = RROI ⊙ P
  • Create occluded images by replacing RROI in I with Roccluded
    • Iocc = I − RROI + Roccluded
    • Generate multiple occluded images
    • Iaug = {Iocc1, Iocc2, …, Ioccm}
  • Train the model using Iaug
    • Extract features
    • F = Fe(Iaug)
    • Classify features
    • y = C(F)

3.2 Facial keypoint detection

For local region-based augmentation, the landmark points detection method has been used. We have utilized the MediaPipe framework for this purpose [49,50]. The MediaPipe is developed by Google as an open-source framework and is available as a library for customization. This framework aims to provide a solution for different problems that could require an extensive computational cost. It could help design machine learning models for different objects using sensory devices. It helps to detect and track the key points of the human body known as landmark points and creates a face mesh. The face mesh is based on transfer learning and is designed to recognize the human face in three dimensions.

Fig 2A shows the landmark points provided by FaceMesh, and Fig 2B shows the landmark points on an AR dataset image. For the eye region, we used four landmark points (P1 to P4) out of the 468 landmark points. For the mouth region, we used 19 landmark points (P1 to P19). These landmark points are obtained with the reference code provided in Fig 2A. Table 2 maps the eye-region points used in this study to the referenced MediaPipe landmark points, and Table 3 lists the landmark points for the mouth region.

Fig 2. Landmark points of MediaPipe Face Mesh (A) Landmarks points (B) Landmarks points at AR dataset image.

https://doi.org/10.1371/journal.pone.0322638.g002

Table 3. Landmark points for the Mouth Region.

https://doi.org/10.1371/journal.pone.0322638.t003

The selected landmark points are shown in the first row of Fig 3. The 19 landmark points of the mouth region and their locations are shown in the bottom row of Fig 3. In addition to the labeled key points, Fig 3 shows the selected region and the overlay patch applied to it. Given the input image, the different face regions are detected according to the extreme points.

Fig 3. An example of AR dataset image for retrieving landmark points and location.

https://doi.org/10.1371/journal.pone.0322638.g003

For a gallery image, the landmark set can be described as in Eq 1:

$$L = \{\, p_i = (x_i, y_i, z_i) \mid i = 1, \dots, 468 \,\} \tag{1}$$

where $p_i$ represents the position of each landmark, $i$ indexes the landmark points from 1 to 468, and $x$, $y$, and $z$ are the coordinates. Because the face images in the proposed study are 2D, we used only the x and y coordinates and excluded the z coordinate.

For example, the overall eye region is bounded by four points: p1 represents the extreme left, p2 the extreme right, p3 the eyebrows, and p4 the area below the eye. These four coordinates give the region of interest for the eye region. Similarly, the region of interest for the mouth runs along the left and right cheeks, starting from landmark point 93 and ending at landmark point 323, with a total of 19 landmark points.
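
The bounding-region step described above can be sketched as follows. This is a minimal illustration, assuming the landmarks are available as normalized (x, y) pairs in [0, 1], which is how MediaPipe Face Mesh reports them; the function name `roi_from_landmarks`, the `margin` parameter, and the example coordinates are hypothetical and are not taken from the study.

```python
import numpy as np

def roi_from_landmarks(points_xy, image_shape, margin=0):
    """Return (left, top, right, bottom) pixel bounds enclosing the landmarks.

    points_xy   : iterable of normalized (x, y) pairs in [0, 1]; z is not used
    image_shape : (height, width) of the image
    margin      : optional padding in pixels (hypothetical parameter)
    """
    pts = np.asarray(points_xy, dtype=float)
    height, width = image_shape[:2]
    xs = pts[:, 0] * width    # normalized x -> pixel column
    ys = pts[:, 1] * height   # normalized y -> pixel row
    left = max(int(np.floor(xs.min())) - margin, 0)
    top = max(int(np.floor(ys.min())) - margin, 0)
    right = min(int(np.ceil(xs.max())) + margin, width)
    bottom = min(int(np.ceil(ys.max())) + margin, height)
    return left, top, right, bottom

# Four illustrative eye-region points: extreme left, extreme right,
# eyebrow, and below the eye (made-up coordinates).
eye_points = [(0.25, 0.375), (0.75, 0.375), (0.5, 0.3125), (0.5, 0.5)]
eye_roi = roi_from_landmarks(eye_points, (160, 160))
```

The same function applies unchanged to the 19 mouth-region points, since only the extreme coordinates determine the rectangle.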

3.3 Artificial occlusion augmentation

Artificial occlusion augmentation is a major step in the training phase after landmark point detection. The region of interest is extracted so that the specific facial part can be occluded. In the proposed study, both the eyes and the mouth are occluded to imitate sunglasses and mask or scarf occlusion of faces. To generate artificial occlusion, the gallery images of each subject have been extended. A rectangular overlay patch is used to occlude the eye and mouth regions. The overlay patch takes values between two extremes, 0 and 1, where 0 represents the dark region and 1 represents the transparent region. For the eye region, the major cause of occlusion is sunglasses, so the patch value 0 represents the sunglasses and 1 the transparent region. For the mouth region, the major cause of occlusion is a mask or scarf, so an overlay patch is placed over the mouth region in which the value 0 represents the mask or scarf and 1 represents the transparent region. After this extension, the gallery images are used to train the model as shown in Fig 4. The probe images, which contain real occlusion, are used as the testing set.
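
The overlay step can be illustrated with a short sketch. It assumes that the patch is blended with a black rectangle, so that a patch value of 0 darkens the region completely and 1 leaves it unchanged; the function names, the ROI coordinates, and the uniform placeholder image are hypothetical and are not the authors' implementation.

```python
import numpy as np

def apply_overlay(image, roi, alpha=0.0):
    """Return a copy of `image` with a dark rectangular patch blended over `roi`.

    image : float array of shape (H, W, 3) with values in [0, 1]
    roi   : (left, top, right, bottom) pixel bounds
    alpha : patch transparency; 0.0 is fully dark, 1.0 leaves the region unchanged
    """
    left, top, right, bottom = roi
    occluded = image.copy()
    # Blending with a black patch reduces to scaling the region by alpha.
    occluded[top:bottom, left:right] = alpha * image[top:bottom, left:right]
    return occluded

def augment_gallery(image, eye_roi, mouth_roi, alpha=0.0):
    """Build an augmented set: original, eyes occluded, mouth occluded."""
    return [
        image,
        apply_overlay(image, eye_roi, alpha),
        apply_overlay(image, mouth_roi, alpha),
    ]

gallery_image = np.full((160, 160, 3), 0.8)  # placeholder for a gallery face
augmented = augment_gallery(gallery_image, (40, 50, 120, 80), (45, 100, 115, 150))
```

Copying the image before writing the patch keeps the original gallery sample intact, so it can remain in the training set alongside its occluded variants.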

Fig 4. AR training dataset with original and augmented images.

https://doi.org/10.1371/journal.pone.0322638.g004

3.4 Feature extraction, face embedding, and classification

After pre-processing the images, the next step is feature extraction from the gallery and probe images. To extract the features, we have used a Deep Neural Network (DNN). There are two steps in this process: normalization and face embedding. Prior to the extraction of the face embedding, Z-score normalization is used to standardize the image, where the normalized image is obtained according to Eq 2:

$$I_{norm} = \frac{I - \mu}{\sigma} \tag{2}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the image pixel values.
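
Z-score normalization subtracts the mean of the pixel values and divides by their standard deviation. A minimal NumPy sketch is given below; the function name and the small `eps` guard against division by zero for constant images are assumptions added for illustration.

```python
import numpy as np

def zscore_normalize(image, eps=1e-8):
    """Standardize an image to zero mean and (approximately) unit variance.

    image : array of pixel values of any shape, e.g. (160, 160, 3)
    eps   : small constant avoiding division by zero (assumed, not from the study)
    """
    pixels = np.asarray(image, dtype=np.float64)
    mean = pixels.mean()   # mean over all pixels and channels
    std = pixels.std()     # standard deviation over all pixels and channels
    return (pixels - mean) / (std + eps)

rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(160, 160, 3))  # random stand-in for a face image
normalized = zscore_normalize(face)
```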

For the face embedding process, we used a pre-trained FaceNet model [11]. FaceNet is an integrated system for face recognition and is often described as a Siamese-style network suited to one-shot learning. It takes an input image of size 160x160x3 and transforms it into a 128-dimensional face embedding vector. The convolutional layers learn a mapping from facial features to this embedding. The network is trained with the triplet loss function so that the Euclidean distance between embeddings reflects face similarity. Finally, the embedded feature vectors are extracted and used for face recognition and verification [51].
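
The role of the Euclidean distance and the triplet loss can be made concrete with a small sketch. It follows the usual formulation of the triplet loss with squared Euclidean distances on unit-length embeddings; the margin of 0.2 is the value reported in the original FaceNet work, while the function names and the toy 128-dimensional vectors are illustrative assumptions, not embeddings from the study.

```python
import numpy as np

def l2_normalize(vector):
    """Scale an embedding to unit Euclidean length."""
    vector = np.asarray(vector, dtype=float)
    return vector / np.linalg.norm(vector)

def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return float(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Penalize triplets whose positive is not closer than the negative by `margin`."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + margin, 0.0)

basis = np.eye(128)
anchor = l2_normalize(basis[0])                     # toy embedding of a face
positive = l2_normalize(basis[0] + 0.1 * basis[1])  # same identity, slightly shifted
negative = l2_normalize(basis[1])                   # different identity
loss = triplet_loss(anchor, positive, negative)
```

A loss of zero means the triplet already satisfies the margin, so the same-identity pair is sufficiently closer than the different-identity pair.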

The feature classification phase leads to face image verification and recognition. Verification compares a test image to other face images to confirm the claimed identity, while recognition compares a test image with other images to determine the identity of the face among several possibilities. In both scenarios, the known face images registered in the system are called gallery images. The face images used for testing, whether registered or unregistered, are called probe images. Since our dataset is small, we have used an SVM for classification in the presented study. The features extracted from the training face images in the dataset are used to train an SVM classifier. Once trained, the SVM classifier predicts the identity of the test face images from their features. SVM is chosen for its interpretability and computational efficiency compared to other classifiers. Additionally, SVM demonstrates strong generalization capabilities, particularly in handling challenges such as the Single Sample Problem, where only limited training data is available for certain classes. Furthermore, SVM’s ability to find an optimal hyperplane for classification makes it well-suited for high-dimensional feature spaces, as often encountered in face recognition tasks. By employing kernel functions, SVM effectively maps non-linearly separable data into higher dimensions, improving classification accuracy. This flexibility, combined with its robustness against overfitting, enhances the reliability of SVM in real-world face recognition applications.
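
Training an SVM on embedding vectors can be sketched with scikit-learn as follows. The example uses synthetic 128-dimensional vectors in place of FaceNet embeddings, and the linear kernel, the number of subjects, and the noise level are assumptions chosen for illustration; the study does not state these settings.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_subjects, dim = 10, 128

# Synthetic stand-ins for per-subject embeddings (not real FaceNet output).
centers = rng.normal(size=(n_subjects, dim))
centers /= np.linalg.norm(centers, axis=1, keepdims=True)

def noisy_views(centers, views, scale):
    """Repeat each center `views` times and add Gaussian noise to mimic variation."""
    features = np.repeat(centers, views, axis=0)
    features = features + rng.normal(scale=scale, size=features.shape)
    labels = np.repeat(np.arange(len(centers)), views)
    return features, labels

# Gallery: original plus two virtual samples per subject; probe: one image each.
train_x, train_y = noisy_views(centers, views=3, scale=0.02)
probe_x, probe_y = noisy_views(centers, views=1, scale=0.02)

classifier = SVC(kernel="linear")  # kernel choice is an assumption
classifier.fit(train_x, train_y)
accuracy = float(np.mean(classifier.predict(probe_x) == probe_y))
```

With only a few samples per class, the classifier relies on the embeddings already being well separated, which is why the quality of the feature extractor matters more than the size of the training set in this setting.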

4 Experimental results

The proposed technique is used to extract the key points from the eye and mouth region as explained in detail in the methodology section.

4.1 Experimental setup

For the local region-based augmentation techniques, the AR database [52] has been used because it contains occluded images. The dataset was selected according to the following criteria:

  1. The dataset should contain face images captured in realistic cases with varied challenges in unconstrained conditions.
  2. The original images were down-sampled to the same size (160x160x3) for the experimental setup.

The AR dataset includes face images occluded by sunglasses and by a scarf. The original image resolution in the AR dataset is 768x576 pixels, which we down-sampled to 160x160 pixels. For the experiments, we selected 100 subjects with 26 images each. Of these, 50 subjects are male and 50 are female. The images of Set A in Session I, which have a neutral facial expression, are selected as the gallery images as shown in Fig 5. For testing purposes, we selected test sets H–M from Session I and U–Z from Session II as follows:

  1. S-I Occlusion with Sunglasses (H, I, and J)
  2. S-I Occlusion with Scarf (K, L, and M)
  3. S-II Occlusion with Sunglasses (U, V, and W)
  4. S-II Occlusion with Scarf (X, Y, and Z)
Fig 5. Representation of used sets from AR dataset.

https://doi.org/10.1371/journal.pone.0322638.g005

To evaluate the model, the classification results have been validated in terms of test accuracy. We have also reported misclassification errors by taking sample results from each test set. These experiments were performed on an Intel Core i7-8750H processor at 2.2 GHz with 16GB of memory. We used the Keras framework to load the FaceNet model for the experiments.
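
The accuracy and misclassification figures reported in this section follow directly from counting correct predictions. The sketch below assumes one probe image per subject in each test set, so that a set contains 100 probes; the function names and the example labels are hypothetical.

```python
def accuracy(true_labels, predicted_labels):
    """Percentage of probe images whose predicted identity is correct."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)

def misclassification_rate(true_labels, predicted_labels):
    """Percentage of probe images assigned to the wrong identity."""
    return 100.0 - accuracy(true_labels, predicted_labels)

# Illustrative set of 100 probes in which 9 subjects are misidentified.
true_ids = list(range(100))
predicted_ids = [(i + 1) % 100 if i < 9 else i for i in range(100)]

set_accuracy = accuracy(true_ids, predicted_ids)
set_error = misclassification_rate(true_ids, predicted_ids)
```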

4.2 Occlusion with sunglasses

To address occlusion, especially by sunglasses, the normal face image has been augmented with an occluded image. For this purpose, an overlay with different alpha values (0.0–0.5) has been applied over the eye area to augment the face image. Since the transparency of the overlay patch affects the accuracy, it is important to select the optimal alpha value for the experiments. Evaluating the different alpha values showed that the normal image combined with an occluded image at alpha 0.0 or 0.1 gives a higher average accuracy than the other alpha values, with the same accuracy in both cases. Combinations of several alpha values, i.e. from 0.0 to 0.2 and from 0.0 to 0.4, give higher accuracy still. These combinations were not selected, in order to avoid the additional computational cost of augmentation and training. To select the optimal alpha value, we calculated the Mean Square Error (MSE) of the alpha values (0.0–0.5) as shown in Fig 6. The MSE of each alpha value greater than zero is computed against the baseline image with an alpha value of zero. If the error is small, the image is close to the baseline image; otherwise it differs from the baseline. Hence, the alpha value of 0.0 was taken as the baseline for further experiments, and all the presented experimental results are based on it.
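
The MSE comparison against the alpha 0.0 baseline can be sketched as follows. The example assumes the overlay is a blend with a black rectangle and uses a uniform placeholder image and made-up ROI coordinates, so the error values are illustrative only and do not reproduce Fig 6.

```python
import numpy as np

def overlay(image, roi, alpha):
    """Blend a black rectangle over `roi`; alpha 0.0 is opaque, 1.0 is invisible."""
    left, top, right, bottom = roi
    out = image.copy()
    out[top:bottom, left:right] = alpha * image[top:bottom, left:right]
    return out

def mse(a, b):
    """Mean square error between two images of the same shape."""
    return float(np.mean((a - b) ** 2))

image = np.full((160, 160, 3), 0.8)   # uniform placeholder for a face image
eye_roi = (40, 50, 120, 80)           # hypothetical eye region
baseline = overlay(image, eye_roi, 0.0)

# Error of each partially transparent overlay relative to the opaque baseline.
errors = {
    alpha: mse(overlay(image, eye_roi, alpha), baseline)
    for alpha in (0.1, 0.2, 0.3, 0.4, 0.5)
}
```

Under these assumptions the error grows with alpha, because a more transparent patch lets more of the original region show through and therefore departs further from the fully dark baseline.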

Fig 6. Occluded image with alpha values 0.0–0.5 and MSE with sunglasses.

https://doi.org/10.1371/journal.pone.0322638.g006

To demonstrate the experimental results with sunglasses, we have set three different conditions (Normal + both eyes, Normal + scarf, and Normal + both eyes + scarf) as shown in Table 4. In this table, Normal refers to the gallery image of Set A. First, the probe images from H–J and U–W (termed sunglasses probe images) are tested against gallery images without any augmentation. Next, the sunglasses probe images are tested against the gallery images together with the augmented images of both eyes. Then, the sunglasses probe images are tested against the augmented images of the scarf. Finally, the sunglasses probe images are tested against the combination of eye and scarf occluded images. For Session I (Normal + both eyes), the highest accuracy is achieved by set H, i.e. 100%, whereas the lowest accuracy is obtained by set J, i.e. 96%. For Session II, set U obtained the highest accuracy, i.e. 99%, and set W the lowest, i.e. 91%. Similarly, when the Session I and II probe images are tested for Normal + both eyes + scarf occlusion, in Session I set H received the highest accuracy, i.e. 100%, and set W obtained the lowest accuracy, i.e. 94%. For Session II, the highest accuracy is achieved by set U, i.e. 98%, and set W received the lowest accuracy, i.e. 88%. The decline in accuracy for Normal + both eyes + scarf is attributed to the presence of scarf-occluded facial images in the training set.

Table 4. Experimental results of occluded sunglasses set based on the accuracy.

https://doi.org/10.1371/journal.pone.0322638.t004

4.3 Occlusion with scarf

To address occlusion by a scarf, the normal image has been combined with the occluded image. For this purpose, different alpha values (0.0–0.5) have been applied to the occluded image. For the overlay patch at the mouth, the transparency is likewise evaluated based on the alpha value. The normal image combined with the occluded image at alpha values 0.0, 0.1, and 0.2 shows a high average accuracy with 2 samples, i.e. 99.17%, 99.17%, and 99.33% respectively. Combinations of different alpha values, i.e. from 0.0 to 0.3, 0.0 to 0.4, and 0.0 to 0.5, show a higher accuracy of 99.50%. These combinations were not selected, in order to avoid the additional computational cost of augmentation and training. The Mean Square Error analysis of the alpha values is shown in Fig 7. Hence, for the scarf occlusion experiments, the alpha value of 0.0 is taken as the baseline value for further experiments.

Fig 7. Occluded image with 0.0–0.5 and MSE with scarf.

https://doi.org/10.1371/journal.pone.0322638.g007
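
The alpha sweep over the mouth patch can be illustrated with a minimal sketch. It assumes that alpha is the weight retained by the original pixels under the patch (so 0.0 yields a fully opaque patch), that the patch is a flat intensity value, and that the mouth region is a fixed rectangle. None of these details are stated in the text, and the names `overlay_patch` and `mse` are hypothetical.

```python
import numpy as np


def overlay_patch(image, box, alpha, patch_value=0.0):
    """Blend a flat patch into `image` over box = (top, bottom, left, right).

    Assumed convention: `alpha` is the weight kept by the original pixels,
    so alpha = 0.0 replaces the region with the patch entirely.
    """
    top, bottom, left, right = box
    out = np.asarray(image, dtype=np.float64).copy()
    region = out[top:bottom, left:right]
    out[top:bottom, left:right] = alpha * region + (1.0 - alpha) * patch_value
    return out


def mse(a, b):
    """Mean squared error between two images of the same shape."""
    diff = np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)
    return float(np.mean(diff ** 2))


# Toy 4x4 "face" of constant intensity; the lower half stands in for the mouth.
face = np.full((4, 4), 100.0)
mouth_box = (2, 4, 0, 4)

# MSE between the normal image and each occluded version.
alphas = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
errors = {a: mse(face, overlay_patch(face, mouth_box, a)) for a in alphas}
```

Under this assumed convention, the MSE against the normal image shrinks as alpha grows, because more of the original face shows through the patch. With the opposite convention the two blending weights would simply be swapped.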

For scarf occlusion, sets K-M of Session I and X-Z of Session II were used for the experiments, as shown in Table 5. The gallery image is the normal image. First, the probe images from sets K-M of Session I and X-Z of Session II (termed scarf probe images) were tested against the gallery images without any augmentation. Then the scarf probe images were tested against the gallery images together with the augmented both-eyes images. Next, the scarf probe images were tested against the augmented scarf images together with the normal images. Lastly, the scarf probe images were tested against the combination of eyes-occluded and scarf-occluded images along with the normal images.

Table 5. Experimental results of occluded scarf set based on the accuracy.

https://doi.org/10.1371/journal.pone.0322638.t005

The results show that, for Normal + scarf, sets K and L of Session I obtained the same accuracy (100%), while set M obtained 98%. Along the same lines, in Session II, sets X and Y achieved the higher accuracy (100%) and set Z obtained 98%. For Normal + both eyes + scarf, however, set K of Session I achieved the higher accuracy (100%) and set M the lower accuracy (96%). For Session II, set X achieved 100% accuracy, whereas set Z obtained the lower accuracy (96%). The decline in accuracy for Normal + both eyes + scarf is attributed to the presence of the sunglasses-occluded images.

4.4 Misclassification results

Table 6 illustrates the misclassification for the subsets of facial images with sunglasses and scarf occlusion. For sunglasses, the results indicate that in most of the subsets the classification rate is below 100% and above 90%. However, sets V and W show the highest misclassification, at 5% and 9%, respectively. This higher misclassification is attributed to stronger illumination combined with sunglasses occlusion. For scarf occlusion, the results indicate that in most of the subsets the classification rate is 100%. However, sets M and Z show misclassification of 2% and 3%, respectively. The misclassification results are reported for the local region-based augmentation techniques with scarf, sunglasses, and scarf + sunglasses.

Table 6. Representation of results for comparison.

https://doi.org/10.1371/journal.pone.0322638.t006
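
The misclassification rates reported here are the complement of the per-set accuracies. A minimal sketch of that relationship follows, assuming one probe image per subject in each set and using synthetic predictions; the function names are hypothetical and the identifiers are not taken from the dataset.

```python
def accuracy_percent(true_ids, predicted_ids):
    """Percentage of probe images whose predicted identity is correct."""
    if not true_ids or len(true_ids) != len(predicted_ids):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_ids, predicted_ids))
    return 100.0 * correct / len(true_ids)


def misclassification_percent(true_ids, predicted_ids):
    """Percentage of probe images assigned to the wrong identity."""
    return 100.0 - accuracy_percent(true_ids, predicted_ids)


# Synthetic probe set: 100 subjects, 9 of them assigned a wrong identity.
true_ids = list(range(100))
predicted_ids = list(true_ids)
for i in range(9):
    predicted_ids[i] = (true_ids[i] + 1) % 100

set_accuracy = accuracy_percent(true_ids, predicted_ids)
set_misclassification = misclassification_percent(true_ids, predicted_ids)
```

The synthetic counts were chosen to mirror the 91% accuracy and 9% misclassification pairing reported for set W under sunglasses occlusion; the predictions themselves are illustrative only.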

Table 7 illustrates the misclassification for the subsets of facial images with sunglasses + scarf occlusion. Here the results indicate that the highest classification rate is 100%, for sets H, K, and X, whereas the highest misclassification rate is 12%, for set W. Once again, set W of Session II remains the most challenging set.

Table 7. Misclassification with Scarf + Sunglasses.

https://doi.org/10.1371/journal.pone.0322638.t007

5 Discussion

Local region-based augmentation improves the accuracy of SSPP FR by extending the sample set with occlusions added at different regions of the face. For comparison, we set two protocols. In the first protocol, two studies with the same experimental setup, [7] and [14], were selected, i.e., 100 subjects (50 males and 50 females) with 12 sets that deal with occlusion. In the second protocol, the results obtained for occlusion and for light + occlusion are compared with different existing studies. The results are compared for occlusion with sunglasses and scarf, and for light with occlusion of sunglasses and eyes. The comparison with existing studies shows the significance of the proposed local region-based augmentation technique, as shown in Tables 8 and 9. The proposed approach achieved an accuracy of 94.83% for sunglasses occlusion and 98% for scarf occlusion. The existing study [14] shows better accuracy than the present study on sets I and J for sunglasses occlusion; however, for scarf occlusion and overall, the proposed approach achieves better accuracy than [14]. The proposed approach also shows a significant improvement in accuracy over the existing research [7] (69.83% accuracy).

Table 8. Comparative analysis of existing techniques with sunglasses occlusion.

https://doi.org/10.1371/journal.pone.0322638.t008

Table 9. Comparative analysis of existing techniques with scarf occlusion.

https://doi.org/10.1371/journal.pone.0322638.t009

As mentioned earlier, for protocol II and for a fair comparison with existing studies, we merged sets H + K, which is occlusion with sunglasses and eyes, and sets J + M, which is light with occlusion of scarf and eyes. All these sets are taken from Session I. The comparison with existing studies shows the significance of the proposed local region-based augmentation technique, as shown in Table 10. The average accuracy obtained for sets H and K was 100% in our case, whereas for sets J and M it was 97%. The comparison shows that the proposed approach is slightly less accurate than the existing studies [14] and [13]. The better accuracy of [14] and [13] for light + occlusion is due to their higher accuracy on set M of Session I compared with the proposed approach.

Table 10. Comparison with Session I (occlusion and light + occlusion)

https://doi.org/10.1371/journal.pone.0322638.t010

Occlusion of the upper face, such as covering the eyes and nose, affects the accuracy of a face recognition system more than occlusion of the lower face, such as a scarf or mask that covers the mouth or chin. This is evident from the experiments without occlusion augmentation of the gallery image. The results in Table 11 reveal that, for sunglasses occlusion, gallery images with Normal + both eyes occlusion obtained a 5.7% improvement. However, gallery images with Normal + scarf occlusion showed a decline in accuracy of 15.8%, because a different region of the face is occluded. For scarf occlusion, gallery images with the scarf and with both eyes and scarf obtained improvements of 1.67% and 0.5%, respectively, whereas gallery images with both eyes showed a decline of 8%.

Table 11. Representation of results for comparison.

https://doi.org/10.1371/journal.pone.0322638.t011
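
The improvements and declines in Table 11 can be read as changes relative to the non-augmented gallery. A minimal sketch of that bookkeeping follows, assuming the changes are differences in percentage points; the accuracy values below are hypothetical placeholders, not figures from the tables, and `accuracy_change` is a hypothetical name.

```python
def accuracy_change(baseline_acc, augmented_accs):
    """Change in accuracy (percentage points) of each gallery condition
    relative to the gallery without augmentation."""
    return {
        name: round(acc - baseline_acc, 2)
        for name, acc in augmented_accs.items()
    }


# Hypothetical accuracies (in %), used only to show the sign convention.
baseline = 90.0
augmented = {"Normal + both eyes": 95.0, "Normal + scarf": 80.0}
changes = accuracy_change(baseline, augmented)
```

A positive value indicates that augmenting the gallery helped for that probe type, and a negative value indicates that the added virtual samples reduced accuracy.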

5.1 Limitations of the study

Unlike conventional face recognition, SSPP face recognition relies on a single image taken in a controlled environment with proper lighting and resolution. This image does not include the other variations present in an unconstrained environment. Hence, some limitations arise when working with the datasets available for SSPP face recognition.

The major limitation of working with SSPP face recognition datasets is that most of them are not specifically designed for SSPP FR. The training sample also poses some challenges. Another limitation is that computer vision and CNN techniques require high-quality test images, so it is hard to work with low-quality, very dark, and occluded images. Because the data available for training were insufficient, the performance of the proposed approach was evaluated using a pre-trained model. We also found that no publicly available dataset exactly represents the Single Sample Per Person problem as it arises in real-time scenarios such as law enforcement and e-passports.

Furthermore, this study did not use large datasets or datasets based on real-time scenarios. Given the limitations of existing techniques under varied capture conditions, evaluating the efficiency of the proposed occlusion approach requires a real-time dataset captured in different scenarios.

6 Conclusion

To address the occlusion variance, a local region-based augmentation technique has been applied. Facial landmarks are used to extract the key points of the face. Based on these key points, the region of interest is extracted from a particular region. An overlay patch is applied to the region of interest to create facial images with artificial occlusion. These occluded faces are used to create virtual samples, which we have used to train the model. Although we focused on occlusions caused by sunglasses and scarves, our methodology can be directly extended to other sources of occlusion such as hats, beards, and long hair. We have shown that the extraction of landmark points from the local area can increase the accuracy of classifiers. The proposed approach requires a small number of augmented training samples and hence has the advantage of minimizing the training overhead of the classifier models.

Despite the promising results, our approach has certain limitations. The study primarily focuses on occlusions caused by sunglasses and scarves, while other types of occlusion, such as hand occlusions and dynamic occlusions in videos, were not extensively considered. Additionally, the evaluation was conducted on limited datasets, and the generalizability of the proposed method to large-scale and more diverse datasets remains unverified. Furthermore, the approach is designed for static images, and its effectiveness in handling occlusions in real-time video-based applications has not been explored. Practical deployment in real-world scenarios may also introduce computational challenges, such as processing speed and adaptability to varying environmental conditions.

Future research on single sample per person face recognition, particularly in handling occlusions and recognizing individuals, can benefit from several advanced techniques. Self-supervised learning and transformer-based models, such as Vision Transformers and Masked Autoencoders, can improve feature extraction and help the system recognize faces even when parts are obscured. Diffusion models can be used to restore missing facial details and generate additional training samples, improving robustness. Graph Neural Networks can analyze facial landmarks to focus on visible regions, making recognition more reliable.
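
The augmentation steps summarized in this section (landmark points, region of interest, overlay patch, virtual samples) can be sketched as follows. This is a minimal illustration under stated assumptions: the landmark coordinates are hard-coded in place of a landmark detector's output, the patch is a flat opaque rectangle, and the function names, margin, and image size are hypothetical. Feature extraction with FaceNet and SVM classification are omitted.

```python
import numpy as np


def region_box(points, image_shape, margin=2):
    """Bounding box (top, bottom, left, right) around (x, y) landmark points,
    padded by `margin` pixels and clipped to the image."""
    pts = np.asarray(points)
    height, width = image_shape[:2]
    left = max(int(pts[:, 0].min()) - margin, 0)
    right = min(int(pts[:, 0].max()) + margin + 1, width)
    top = max(int(pts[:, 1].min()) - margin, 0)
    bottom = min(int(pts[:, 1].max()) + margin + 1, height)
    return top, bottom, left, right


def occlude(image, box, patch_value=0):
    """Return a copy of `image` with an opaque patch drawn over `box`."""
    top, bottom, left, right = box
    out = image.copy()
    out[top:bottom, left:right] = patch_value
    return out


def virtual_samples(image, landmark_groups, margin=2):
    """Original image plus one artificially occluded copy per landmark group."""
    samples = {"normal": image}
    for name, points in landmark_groups.items():
        box = region_box(points, image.shape, margin)
        samples[name] = occlude(image, box)
    return samples


# Toy 20x20 grayscale "face" and hypothetical landmark coordinates (x, y).
face = np.full((20, 20), 200, dtype=np.uint8)
landmarks = {
    "both_eyes": [(5, 6), (14, 6)],
    "mouth": [(7, 15), (12, 15), (10, 17)],
}
gallery = virtual_samples(face, landmarks)
```

Each entry of `gallery` would then be passed through the feature extractor, so that a single enrolled image contributes several training vectors for the same identity.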

References

  1. Kumar N, Garg V. Single sample face recognition in the last decade: a survey. Int J Patt Recogn Artif Intell. 2019;33(13):1956009.
  2. Gururaj H, Soundarya B, Priya S, Shreyas J, Flammini F. A comprehensive review of face recognition techniques, trends and challenges. IEEE Access. 2024.
  3. Wang X, Wang K, Lian S. A survey on face data augmentation for the training of deep neural networks. Neur Comput Appl. 2020;1–29.
  4. Hemathilaka S, Aponso A. A comprehensive study on occlusion invariant face recognition under face mask occlusion; 2022. https://arxiv.org/abs/2201.09089
  5. Essel JK, Mensah JA, Ocran E, Asiedu L. On the search for efficient face recognition algorithm subject to multiple environmental constraints. Heliyon. 2024;10(7):e28568. pmid:38590879
  6. Cuculo V, D’Amelio A, Grossi G, Lanzarotti R, Lin J. Robust single-sample face recognition by sparsity-driven sub-dictionary learning using deep features. Sensors. 2019;19(1):146. pmid:30609846
  7. Zhang Z, Zhang L, Zhang M. Dissimilarity-based nearest neighbor classifier for single-sample face recognition. Visual Comput. 2021;37(4):673–84.
  8. Yang C, Xu J, Li Z. Fuzzy multi-manifold classifier for one-sample face identification. In: Chinese Automation Congress (CAC). IEEE; 2020. pp. 561–6.
  9. Tomar V, Kumar N, Srivastava AR. Single sample face recognition using deep learning: a survey. Artif Intell Rev. 2023;56:1063–111.
  10. Wang M, Deng W. Deep face recognition: a survey. Neurocomputing. 2021;429:215–44.
  11. Schroff F, Kalenichenko D, Philbin J. Facenet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. pp. 815–23.
  12. Taigman Y, Yang M, Ranzato M, Wolf L. Deepface: closing the gap to human-level performance in face verification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014. pp. 1701–8.
  13. Adjabi I. Combining hand-crafted and deep-learning features for single sample face recognition. In: 7th International Conference on Image and Signal Processing and their Applications (ISPA). 2022. pp. 1–6. https://doi.org/10.1109/ispa54004.2022.9786302
  14. Adjabi I, Ouahabi A, Benzaoui A, Jacques S. Multi-block color-binarized statistical images for single-sample face recognition. Sensors. 2021;21(3):728. pmid:33494516
  15. Chu Y, Zhao L, Ahmad T. Multiple feature subspaces analysis for single sample per person face recognition. Visual Comput. 2019;35(2):239–56.
  16. Abbaspoor N, Hassanpour H. Face recognition in a large dataset using a hierarchical classifier. Multim Tools Appl. 2022;81(12):16477–95.
  17. Li F, Yuan T, Zhang Y, Liu W. Face recognition in single sample per person fusing multi-scale features extraction and virtual sample generation method. Front Appl Math Statist. 2022:27.
  18. Choi SI, Lee Y, Lee M. Face recognition in SSPP problem using face relighting based on coupled bilinear model. Sensors. 2018;19(1):43.
  19. Du Q, Da F. Block dictionary learning-driven convolutional neural networks for fewshot face recognition. Visual Comput. 2021;37(4):663–72.
  20. Abdelmaksoud M, Nabil E, Farag I, Hameed HA. A novel neural network method for face recognition with a single sample per person. IEEE Access. 2020;8:102212–21.
  21. Plichoski GF, Chidambaram C, Parpinelli RS. A face recognition framework based on a pool of techniques and differential evolution. Inf Sci. 2021;543:219–41.
  22. Khan A, Fraz K. Post-training iterative hierarchical data augmentation for deep networks. Adv Neural Inf Process Syst. 2020;33:689–99.
  23. Zhuchkov A. Analyzing the effectiveness of image augmentations for face recognition from limited data. In: International Conference "Nonlinearity, Information and Robotics" (NIR). IEEE; 2021. pp. 1–6.
  24. Butt AR, Ur Rahman Z, Ul Haq A, Ahmed B, Manzoor S. Unconstrained face recognition using infrared images. Int J Image Graph. 2024;2550056.
  25. Tu H, Duoji G, Zhao Q, Wu S. Improved single sample per person face recognition via enriching intra-variation and invariant features. Appl Sci. 2020;10(2):601.
  26. Lv JJ, Shao XH, Huang JS, Zhou XD, Zhou X. Data augmentation for face recognition. Neurocomputing. 2017;230:184–96.
  27. Chen D, Liu F, Li Z. Deep learning based single sample per person face recognition: a survey. arXiv preprint. 2020. https://arxiv.org/abs/2006.11395
  28. Zhang Y, Hu C, Lu X. IL-GAN: illumination-invariant representation learning for single sample face recognition. J Visual Commun Image Represent. 2019;59:501–13.
  29. Ding Y, Tang Z, Wang F. Single-sample face recognition based on shared generative adversarial network. Mathematics. 2022;10(5):752.
  30. Pang M, Wang B, Cheung Y, Chen Y, Wen B. VD-GAN: a unified framework for joint prototype and representation learning from contaminated single sample per person. IEEE Trans Inform Forensic Secur. 2021;16:2246–59.
  31. Ding Z, Guo Y, Zhang L, Fu Y. One-shot face recognition via generative learning. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). 2018. pp. 1–7. https://doi.org/10.1109/fg.2018.00011
  32. Luan X, Wang X, Liu L, Li W. Multi-level dynamic error coding for face recognition with a contaminated single sample per person. Pattern Recognit Lett. 2023;171:38–44.
  33. Yao L, Liu F, Ou Z, Wang F, Chen D. Single sample face recognition based on identity-attribute disentanglement and adversarial feature augmentation. In: Chinese Conference on Biometric Recognition. Springer; 2023. pp. 212–22.
  34. Iqbal MA, Jadoon W, Kim SK. Synthetic image generation using conditional GAN-provided single-sample face image. Appl Sci. 2024;14(12):5049.
  35. Tan X, Chen S, Zhou Z-H, Zhang F. Face recognition from a single image per person: a survey. Pattern Recognit. 2006;39(9):1725–45.
  36. Zhu P, Yang M, Zhang L, Lee IY. Local generic representation for face recognition with single sample per person. In: Asian Conference on Computer Vision. Springer; 2014. pp. 34–50.
  37. Liu F, Yang S, Ding Y, Xu F. Single sample face recognition via BoF using multistage KNN collaborative coding. Multim Tools Appl. 2019;78(10):13297–311.
  38. Liu F, Wang F, Ding Y, Yang S. SOM-based binary coding for single sample face recognition. J Ambient Intell Humaniz Comput. 2021;1–11.
  39. Gan W, Yang H, Zeng J, Chen F. Auxiliary dictionary of diversity learning for face recognition with a single sample per person. Int J Artif Intell Tools. 2020;29(05):2050015.
  40. Liu F, Ding Y, Xu F, Ye Q. Learning low-rank regularized generic representation with block-sparse structure for single sample face recognition. IEEE Access. 2019;7:30573–87.
  41. Zhu P, Zhang L, Hu Q, Shiu SC. Multi-scale patch based collaborative representation for face recognition with margin distribution optimization. In: European Conference on Computer Vision. Springer; 2012. pp. 822–35.
  42. Liu F, Tang J, Song Y, Bi Y, Yang S. Local structure based multi-phase collaborative representation for face recognition with single sample per person. Inf Sci. 2016;346–347:198–215.
  43. Karaaba MF, Surinta O, Schomaker L, Wiering MA. Robust face identification with small sample sizes using bag of words and histogram of oriented gradients. In: International Conference on Computer Vision Theory and Applications. vol. 5. SCITEPRESS; 2016. pp. 582–9.
  44. Liu F, Tang J, Song Y, Zhang L, Tang Z. Local structure-based sparse representation for face recognition. ACM Trans Intell Syst Technol. 2015;7(1):1–20.
  45. Dang-Nguyen C, Do-Hong T. Reducing computational complexity of new modified hausdorff distance method for face recognition using local start search. Int J Electr Electron Eng Telecommun. 2021.
  46. Li C, Zhao S, Song W, Xiao K, Wang Y. Ubiquitous single-sample face recognition under occlusion based on sparse representation with dual features. J Ambient Intell Humaniz Comput. 2024;1–11.
  47. Mokhayeri F, Granger E, Bilodeau GA. Domain-specific face synthesis for video face recognition from a single sample per person. IEEE Trans Inf Forens Secur. 2018;14(3):757–72.
  48. Deng W, Hu J, Wu Z, Guo J. From one to many: pose-aware metric learning for single-sample face recognition. Pattern Recognit. 2018;77:426–37.
  49. Lugaresi C, Tang J, Nash H, McClanahan C, Uboweja E, Hays M, et al. Mediapipe: a framework for building perception pipelines. arXiv preprint. 2019. https://arxiv.org/abs/1906.08172
  50. [Web Page]; 2019. Available from: https://google.github.io/mediapipe
  51. Ming Z, Chazalon J, Luqman MM, Visani M, Burie J-C. Simple triplet loss based on intra/inter-class metric learning for face verification. In: IEEE International Conference on Computer Vision Workshops (ICCVW). 2017. pp. 1656–64. https://api.semanticscholar.org/CorpusID:34038047
  52. Martinez A, Benavente R. The AR face database: CVC technical report, 24. 1998.