
Detecting face presentation attacks in mobile devices with a patch-based CNN and a sensor-aware loss function

  • Waldir R. Almeida,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Validation, Writing – original draft, Writing – review & editing

    Affiliation Institute of Computing, University of Campinas, Campinas, São Paulo, Brazil

  • Fernanda A. Andaló,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Supervision, Validation, Writing – original draft, Writing – review & editing

    feandalo@ic.unicamp.br

    Affiliation Institute of Computing, University of Campinas, Campinas, São Paulo, Brazil

  • Rafael Padilha,

    Roles Conceptualization, Data curation, Investigation, Methodology, Writing – review & editing

    Affiliation Institute of Computing, University of Campinas, Campinas, São Paulo, Brazil

  • Gabriel Bertocco,

    Roles Data curation, Investigation, Writing – review & editing

    Affiliation Institute of Computing, University of Campinas, Campinas, São Paulo, Brazil

  • William Dias,

    Roles Data curation, Investigation, Writing – review & editing

    Affiliation Institute of Computing, University of Campinas, Campinas, São Paulo, Brazil

  • Ricardo da S. Torres,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Supervision, Validation, Writing – review & editing

    Affiliation Department of ICT and Natural Sciences, Faculty of Information Technology and Electrical Engineering, NTNU, Ålesund, Norway

  • Jacques Wainer,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Supervision, Validation, Writing – review & editing

    Affiliation Institute of Computing, University of Campinas, Campinas, São Paulo, Brazil

  • Anderson Rocha

    Roles Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing – review & editing

    Affiliation Institute of Computing, University of Campinas, Campinas, São Paulo, Brazil

Correction

17 Feb 2022: The PLOS ONE Staff (2022) Correction: Detecting face presentation attacks in mobile devices with a patch-based CNN and a sensor-aware loss function. PLOS ONE 17(2): e0264409. https://doi.org/10.1371/journal.pone.0264409 View correction

Abstract

With the widespread use of biometric authentication comes the exploitation of presentation attacks, possibly undermining the effectiveness of these technologies in real-world setups. One example takes place when an impostor, aiming at unlocking someone else’s smartphone, deceives the built-in face recognition system by presenting a printed image of the user. In this work, we study the problem of automatically detecting presentation attacks against face authentication methods, considering the use-case of fast device unlocking and the hardware constraints of mobile devices. To enrich the understanding of how a purely software-based method can be used to tackle the problem, we present a solely data-driven approach trained with multi-resolution patches and a multi-objective loss function crafted specifically for the problem. We provide a careful analysis that considers several user-disjoint and cross-factor protocols, highlighting some of the problems with current datasets and approaches. Such analysis, besides demonstrating the competitive results yielded by the proposed method, provides a better conceptual understanding of the problem. To further enhance efficacy and discriminability, we propose a method that leverages the available gallery of user data in the device and adapts the method’s decision-making process to the user’s and the device’s own characteristics. Finally, we introduce a new presentation-attack dataset tailored to the mobile-device setup, with real-world variations in lighting, including outdoors and low-light sessions, in contrast to existing public datasets.

Introduction

Smartphones have become so popular that they are almost an extension of the user’s body and mind. Most people use them as their main medium of communication, storing conversational history, pictures, passwords, and other private data. As such, these devices must be secured, so that only the owner can access the data stored therein. Face authentication is a convenient way for unlocking such devices, requiring only that the owner looks at the built-in frontal camera, as in normal usage.

However, it has become popular knowledge that face authentication systems are somewhat vulnerable to presentation attacks (PA) at the sensor level. A PA can be made by simply showing the system an image of the device owner. That requires little technical expertise, as most people’s images are readily available on the Internet, and even a laptop display could be used as an attack medium.

Over the last years, there has been increasing research interest in face presentation attack detection (PAD), but existing approaches have been shown not to generalize beyond the conditions represented in the public datasets used as benchmarks, as is evident in cross-dataset evaluations [14]. Moreover, most studies are based on similar handcrafted features and do not target the mobile-device scenario.

In this work, we focus on face PAD for modern smartphones, considering printed-photo and screen attacks. We take a data-driven approach and present training techniques targeting the PAD problem. We adapt a pre-trained architecture for PAD using multi-resolution face patches during training, making the model more robust to changes in resolution while also avoiding overfitting to specific facial features. We introduce a loss function that closely models the PAD objective, forcing genuine-access examples from the same device to be more compactly located in the learned feature space, while also reducing inter-device confusion. By using a lightweight but powerful architecture as the core of the proposed method, we ensure that inference can run with a small memory footprint and in under one second on modern smartphones.

To further improve the effectiveness of these models in real-world situations, when they are deployed on mobile devices and are presented with images from the same user, we also propose a strategy to adapt the decision boundary to the characteristics of a specific user and sensor device.

Our contributions are the following:

  • Two techniques to train deep convolutional neural networks to model the problem in a purely data-driven fashion, with RGB pixels as input.
  • A simple yet effective method for adapting a trained model by using a gallery of user data on the device, thus heightening the discriminability of the model.
  • A novel face presentation-attack detection dataset—RECOD-MPAD—that is representative of the target scenario herein, with more realistic illumination conditions.
  • An extensive study of error cases, considering multiple factor-disjoint protocols, which leads to a better understanding of the problem.

The remainder of this article is organized as follows. The background section explores important concepts and outlines several PAD methods in the literature. The proposed method section presents our approach to tackle the PAD problem, with a detailed description of the proposed method and its techniques. The datasets section describes the datasets used in the experiments, highlighting the one specifically constructed for this work. The experimental results section analyzes and validates the methods in terms of performance and comparative experiments, considering multiple factor-disjoint protocols. Finally, the last section draws conclusions and presents possible future directions of investigation.

Background

We start by looking at how genuine-access and attack images are created, discussing some general assumptions that are involved in software-based presentation-attack detection. Next, we present an overview of some relevant techniques in the literature. Finally, we expose some of the problems with the state of the art, motivating our approach.

Image acquisition and attack clues

The presentation attack detection problem consists of answering whether or not a captured biometric sample is genuine. Note that the only resource is the biometric sample itself, and the hypothesis is that we can answer the question by looking at pixels alone. To seek attack clues and understand how the problem may be solvable, we take into consideration how image data is transformed before being acquired by the camera in the user device.

While genuine samples are acquired as a single capture by directly photographing the authenticating user, in an attack event the biometric sensor actually recaptures a previously captured image of the user, which is displayed on an attack instrument (paper or screen); i.e., an “attack camera” captures and preprocesses an image of the target user face and, when that image is displayed on the attack instrument, it is further modulated by the medium’s own reproduction, geometric, and reflectance characteristics.

Each part of the recapturing process changes the data to different degrees, and it is not trivial to identify whether an image feature is due to the interfering attack camera or display medium, indicating an attack, or simply a normal variation in the user facial traits or lighting conditions during acquisition. All factors can vary arbitrarily and interact in seemingly unforeseeable ways.

In comparison to a similar genuine image, attacks often have different color distributions, due to limitations of the reproduction medium. Overall contrast is typically lower in printouts, due to soft focus and the influence of the light source on the flat surface, and higher in most electronic displays, due to the strong backlight. Printouts can have visible printing defects, while low-quality liquid-crystal displays (LCD) can suffer from varying brightness levels throughout the screen. Finally, the resampling process often generates its own aliasing artifacts. One example is the moiré pattern [5] that appears when a sensor samples images containing fine-grained regular structures, such as the pixel grid in electronic displays. Other regular artifacts can be caused by slow refresh rates in older displays or low frame rates when replaying videos.

An overview of existing methods

Over the last years, many different PAD methods have been proposed. Research interest has increased significantly since the release of the NUAA dataset [6] and the advent of the first competitions. Because of that, it is no longer feasible to give an exhaustive analysis of all published methods. It is, however, noticeable that most methods are related, and tend to be based on common assumptions and feature descriptors.

Based on liveness or motion detection.

These methods seek to detect PAs through evidence of the lack of vitality in the captured face and typically depend on motion information. The archetypal method is eye-blink detection [7], which can be effective if the attacker uses a photograph, but is easily circumvented by video replay attacks, or even by cutting holes in the printed face image and using one’s own eyes to simulate blinking [8]. Another class of methods tries to detect subtle movements of a living human face, using optical-flow estimation [9], motion magnification [10], or temporal extensions of low-level texture descriptors [11]. Some methods take advantage of motion correlations between foreground and background or other scenic clues [12]. This is likely to succeed if the attacker does not use a fixed support when performing the attack with a printout or display, but would probably fail otherwise. These methods have the disadvantage of requiring a potentially long sequence of frames to make a single prediction, and most of them can be circumvented by faking eye-blinks and carefully handling the attack instruments.

Based on physics or geometry.

Face PAs typically present the forged user representation on a flat surface, which has different reflectance properties compared to a living face. Some methods seek to detect this “flatness” or abnormal reflectance with physical or geometric motivations. One early method tries to capture depth information via Structure-from-Motion techniques [13]. Others propose to detect differences in motion between face areas via optical flow estimation [14] or by explicitly modeling 3D projective invariants [15]. Another possibility is to model local curvatures by using multiple images [16]. These methods typically require at least some user cooperation to succeed.

By assuming a simplified Lambertian reflectance model, it is also possible to model the interaction between the illuminant and the reflective surface to extract albedo and normal maps [6], which are then used as representations to discriminate genuine-access from attack samples. Although the motivation is clear, lighting in the real world is mixed and uncontrolled, so the basic assumptions do not hold in practice. Another option is to model the diffuse and specular components and try to separate the latter, which could emphasize characteristics of the attack medium surface [17].

Based on texture, noise analysis, or image quality.

These methods seek to detect artifacts left by the recapture process or estimate degradation in overall image quality. Texture characterization is typically motivated as a means of discriminating the intrinsic textural properties of attack instruments and living faces, but can also capture other types of high-frequency information. Most are based on variations of local-binary pattern (LBP) descriptors [3, 18], but temporal extensions were also proposed [11]. Other methods use a combination of low-level local descriptors [19]. Frequency-specific information can be captured by Difference-of-Gaussians (DoG) filtering [8] or through Fourier analysis [20, 21].

A more global characterization that discards content information in static-content videos to analyze noise signatures is proposed in [22], while the same type of residual information is encoded as mid-level temporal representations in [2]. Methods based on low-level texture descriptors or high-frequency information can be effective in detecting paper texture and noise patterns; however, their effectiveness is extremely dependent on the exact acquisition conditions and on the camera’s ability to resolve fine details. Moiré-like patterns are strong clues for attacks, but are not always present, making countermeasures based solely on them unreliable.

Explicit attempts at capturing image distortion artifacts can be found in [4]. Researchers have also explored generic image quality metrics directly [23, 24]. As some of these metrics require a reference image, which is not available, these works compare the probe image to an artificially degraded version of itself. The hypothesis is that the difference is greater between genuine-access images than between attack images, since the latter are assumed to be of lower quality, which is not always valid. Although attack and genuine-access samples acquired under similar conditions could potentially be separated by generic image quality metrics and statistics, existing algorithms do not take context into consideration, which makes them fragile in real-world scenarios.

Based on feature learning.

Since 2012 [25], models based on Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance on many image recognition tasks. Data-driven methods like these, which receive pixels as input and learn intermediate representations directly from data, are said to perform representation or feature learning [26]. Feature learning is underrepresented in the face PAD literature, despite its success in other visual tasks.

Menotti et al. [27] studied architecture optimization for PAD. In this strategy, many simple architectures with random convolutional filters are sampled and used as feature extractors to train a final linear classifier. They found that, although competitive, the optimized architectures could not be improved by having their parameters further adjusted, which could be partially attributed to insufficient hyperparameter tuning in the Stochastic Gradient Descent (SGD) training.

Yang et al. [28] trained a CNN based on the AlexNet architecture [25], with a classifier at the end. During pre-processing, they experimented with a few face-centered regions, including tighter face crops and regions showing more background. They reported promising results on different datasets, but the best pre-processing configuration was different in each case. As one of their conclusions, they highlight differences in background between the two datasets, making it evident that the network learned to exploit acquisition biases when too much background was used in training.

Patel et al. [29] also experimented with training deep CNNs using aligned faces and the whole frame as input. This choice is not well-founded, since it is strongly dependent on the dataset. The final system consists of a fusion scheme involving the output of the CNN and an eye-blink detector.

Atoum et al. [30] introduced a two-stream CNN for PAD, extracting local features from patches and constructing depth maps from face images. The use of patches makes the method independent of specific spatial face areas, and depth maps can be used to detect the presence of face-like depth. Differently from ours, this work considers neither multi-resolution patches to increase robustness nor a loss function custom-tailored to the PAD problem.

Jourabloo et al. [31] revisit the analysis of residual information for PAD by training a CNN to estimate recapture-related noise in order to recreate live faces. Other works formulate the PAD problem differently, aiming at creating more discriminative and generalizable representations. Liu et al. [32] argue for the importance of auxiliary supervision, instead of considering the PAD problem simply as binary classification. Li et al. [33] exploit both spatial and temporal information by considering a 3D CNN that is first trained with a cross-entropy loss and further enhanced with a generalization loss.

We highlight that our approach falls into this category; however, we specifically target the mobile-device scenario. Critically, none of the relevant related work considers modern datasets for that scenario, and, so far, strategies for using the available data and training the networks have been limited.

A critical look at the state of the art

Early methods for face PAD were mostly based on eye-blink detection and other motion clues, which require several frames to be acquired, and typically fail under video replay attacks or simple cut-photo attacks. The community then moved on to exploring potentially more generalizable clues based on texture description, but currently most of these methods are based on the same low-level descriptors and simple classifiers, and yet they were shown to fail under more challenging cross-dataset protocols [1]. Other recent methods also suffer from the same problem [24].

Inasmuch as available public datasets have been useful for comparing different approaches, and inspiring new research efforts, they are now mostly outdated, both in terms of available cameras and attacks, and in terms of methodology. The partial shift to cross-dataset evaluations has shown the limitations of methods and datasets alike. Only recently has the community started to address the specific constraints of mobile applications [4]. New datasets, such as OULU-NPU [34] and REPLAY-MOBILE [35], have appeared with accompanying modern protocols, but they still have some of the same problems as other datasets, such as static sessions with low illumination variability.

Finally, efforts in devising deep learning, or other data-driven approaches to face PAD have been limited, with most solutions based on very similar aligned-face pre-processing and training strategies, and not taking into account our constraints (image acquisition peculiarities and limited memory and processing power). To the best of our knowledge, there are no rigorous studies of such methods considering modern protocols and mobile devices. It is in this context that we propose to study the problem in a purely data-driven fashion, aiming at gaining insight into how far we can go with software-based methods in such scenario.

Proposed method

We propose a method based on training a Convolutional Neural Network (CNN) to distinguish between genuine-access and attack images. Compared to the traditional way of training a CNN with whole-face images and a cross-entropy loss, the proposed techniques change what the CNN sees as input during training and how it is optimized at the end. This makes sense from a modeling perspective, because the interaction between the input and the optimization objective is what really defines the problem, driving the learning procedure.

We take inspiration from one of the versions of the SqueezeNet architecture [36]. As a baseline, we adapt the architecture to consider presentation-attack detection, formulating the problem as a 2-class classification, and using only aligned whole-face images for training the network. From this modified architecture, we introduce our method: the use of face patches of variable resolution during training, which reduces overfitting to user-specific characteristics, and promotes the learning of more robust representations that are not tied to a single scale; and a loss function that more closely models the PAD objective, promoting the compactness of intra-device genuine examples in the learned feature space.

Convolutional neural network core architecture

We adapt the SqueezeNet v1.1 architecture [36] as our core architecture for several reasons: firstly, the network is small and fast enough to be embedded in mobile devices, in contrast to other popular architectures [37, 38]; secondly, it has a fully-convolutional structure, making interpretation of results easier and allowing flexibility in input size and alignment; finally, it was shown to be more accurate than AlexNet, which validates its potential for representing complex visual relationships.

The network is illustrated in Fig 1. Next to each arrow, we show the shape of the output tensor for a single input image of size 3 × 227 × 227. For instance, the fire module fire4 receives as input 128 activation maps of spatial dimension 28 × 28 and outputs 256 maps of size 28 × 28. Fire modules (illustrated at the top-right corner) are similar to inception modules [39] and are the main building blocks of SqueezeNet. They massively reduce the total number of parameters in 3 × 3 convolutions by first squeezing the channel dimension of the input tensor with 1 × 1 convolutions. The classification layer consists of a dropout operation [40] to reduce overfitting and a convolutional layer producing a number of feature maps that matches the number of classes in ImageNet. By averaging each class-specific map individually, the network also reduces the total number of parameters by eliminating the need for fully-connected layers. In summary, SqueezeNet v1.1 has approximately 1.2 million parameters, which can be stored in less than 5MB of memory, making it suitable to be used in mobile devices.

Fig 1. Squeezenet v1.1.

Original architecture [36] and generic micro-architectural details of a fire module.

https://doi.org/10.1371/journal.pone.0238058.g001
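As an illustration, the following is a minimal PyTorch sketch of a fire module as described above. The usage example matches the fire4 shapes quoted in the text; the squeeze/expand channel counts are illustrative, with the actual per-layer sizes following the original SqueezeNet v1.1 configuration [36].

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Fire module: a 1x1 'squeeze' convolution followed by parallel
    1x1 and 3x3 'expand' convolutions whose outputs are concatenated."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch,
                                   kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))  # shrink channels before the 3x3 convs
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# Shapes quoted in the text for fire4: 128 maps of 28x28 in,
# 256 maps of 28x28 out.
fire4 = Fire(in_ch=128, squeeze_ch=32, expand1x1_ch=128, expand3x3_ch=128)
out = fire4(torch.randn(1, 128, 28, 28))  # out.shape == (1, 256, 28, 28)
```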

Baseline: CNN training with whole-face images

The baseline we adopt herein considers training a deep CNN by using aligned whole-face images, the traditional input format in most algorithms published in the literature [27]. In this case, however, the pipeline consists mostly of the whole multi-layered network, which is trained from end to end. Using aligned whole-face images can be justified as a means of reducing unnecessary variations during training and inference, putting the data in a predictable content-domain.

Architecture.

The core architectural component is SqueezeNet (Fig 1). As it is a fully-convolutional network, all feature maps are flexible in size, but we keep the input size as 227 × 227. Since the cropped-face region in our scenario typically varies from 300 to 550 pixels in each dimension, keeping the original input size ensures that only small details are lost due to rescaling, while increasing it would make the network too slow to train and be used in mobile devices. Most PAD pipelines in the literature pre-process images to a fixed size, typically varying from 64 × 64 to 256 × 256.

Pre-processing and data augmentation.

We start from an aligned and square-cropped image of the face region, in both training and inference phases. In practice, we found that the exact alignment does not significantly impact the performance of the method. During training, we read an RGB image, rescale the aligned face region to 256 × 256, crop a 227 × 227 central region, and flip the image horizontally with probability 0.5. Other data augmentation strategies that involve random photometric distortions and normalization [25] are potentially destructive to label information, so we avoid them.

Before feeding the image into the network, we perform a simple pixel-wise transformation from the range [0.0, 1.0] to the range [−1.0, 1.0]. Basic centering is commonplace for obtaining meaningful gradients in the first iterations of training, especially if parameters are randomly initialized and the ReLU activation function is used [25, 37]. In practice, we found that this transformation does not significantly affect accuracy, but it helps to make the training procedure more stable.
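For concreteness, this whole training-time pre-processing chain could be expressed with torchvision transforms as in the following sketch; the specific transform composition is our own choice, not prescribed by the method.

```python
from torchvision import transforms

# Training-time pre-processing for the whole-face baseline: rescale the
# aligned face to 256x256, take the 227x227 central crop, mirror with
# probability 0.5, and map pixels from [0, 1] to [-1, 1].
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(227),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),                      # [0, 255] -> [0.0, 1.0]
    transforms.Normalize(mean=[0.5, 0.5, 0.5],  # (x - 0.5) / 0.5
                         std=[0.5, 0.5, 0.5]),  # -> [-1.0, 1.0]
])
```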

Training details.

Training is done via standard backpropagation [41]. Fig 2 illustrates the architecture and the training procedure. We feed the network with a preprocessed mini-batch of images containing faces and their labels. In each iteration, 64 images are randomly sampled with replacement from the training set. The probability of a single image being selected is inversely proportional to the number of samples with its label in the training set, to account for class imbalance. Each mini-batch consists of roughly 32 samples with label genuine and 32 samples with label attack.

Fig 2. Baseline (Whole-face CNN).

Architecture and training procedure.

https://doi.org/10.1371/journal.pone.0238058.g002
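The class-balanced sampling just described can be realized, for instance, with PyTorch's WeightedRandomSampler, as in the sketch below; train_labels and train_dataset are assumed names for the training annotations and dataset object.

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# train_labels (0 = genuine, 1 = attack) and train_dataset are assumed
# to be available, with one label per training image.
labels = torch.tensor(train_labels)
class_counts = torch.bincount(labels).float()  # samples per class
sample_weights = 1.0 / class_counts[labels]    # inverse class frequency

# Sampling with replacement so that each mini-batch of 64 contains
# roughly 32 genuine and 32 attack examples on average.
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels),
                                replacement=True)
loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)
```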

The 2-dimensional output corresponding to the two classes is used as input to a cross-entropy criterion, which is analogous to a traditional Softmax classifier. The function can be interpreted as normalizing the input vector into probabilities, and then measuring the mismatch between the predicted distribution and the expected distribution, in which the mass is fully concentrated in the true label. In practice, we average over the whole mini-batch, giving the following expression, where f_c(X) is the network output for class c and input X, and B is a mini-batch of training examples with labels y:

$$ L_{\text{class}} = -\frac{1}{|B|} \sum_{(X,\, y) \in B} \log \frac{e^{f_y(X)}}{\sum_{c} e^{f_c(X)}} \qquad (1) $$

After computing the loss and intermediate activations, the gradient of the loss with respect to every adjustable parameter is computed via backpropagation. Finally, for the optimization step, we use the Adam optimizer [42], an adaptive optimizer based on SGD with momentum that requires minimal hyperparameter tuning. All experiments were carried out with default Adam hyperparameters and a learning rate of 10⁻⁵. As regularization, we add to the loss function an L2 penalty (weight decay) with weight 10⁻⁴.
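A minimal sketch of this training step in PyTorch follows, where model stands for the adapted 2-class SqueezeNet and loader for the balanced data loader sketched above; the weight_decay flag realizes the L2 penalty.

```python
import torch
import torch.nn as nn

# Cross-entropy criterion of Eq (1), averaged over the mini-batch,
# and Adam with the reported learning rate and L2 weight decay.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=1e-4)

for images, targets in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), targets)  # forward pass + Eq (1)
    loss.backward()                           # gradients via backpropagation
    optimizer.step()                          # Adam update
```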

For parameter initialization, we start from pre-trained ImageNet weights for the core part of the network, which is preferable to random initialization. For the classification layer, we initialize biases to 0.0, and weights from a normal distribution with mean 0.0 and standard deviation 0.01.

Inference.

After the network is trained, it can be used to infer the label of new input images. Pre-processing is mostly as in the training phase. The detected face region is rescaled to 256 × 256 and centrally cropped to 227 × 227. The aligned and cropped whole-face image is centered in the pixel space by subtracting 0.5 from every pixel, and dividing by 0.5. In contrast to the training phase, no random mirroring is performed.

CNN training with multi-resolution patches and a multi-objective loss function

Our proposed method first models the problem as a task of distinguishing regions of arbitrary level of detail in attack images from regions of arbitrary level of detail in genuine images. We accomplish this in training by extracting patches of varying sizes from the full-resolution images, only then rescaling them to the network input format.

This approach is beneficial in different ways. Firstly, it increases the number of examples available for training, taking full advantage of the training data by not discarding information that would be lost by premature re-scaling. By forcing the network to distinguish patches at different resolutions, its robustness to blur, adverse lighting, and unseen cameras is increased. Finally, by not always receiving the whole user face, the network is encouraged not to depend on user-specific characteristics, which potentially reduces over-fitting.

The method then addresses the problem of training models to be sensitive to a wide range of attack clues and sensor device specificities. We may ponder what is the best way to account for these differences during training, or whether it is reasonable to assume that genuine samples from different devices should have similar characteristics.

By analyzing overall noise and persistent high-frequency information across classes and sensors, we can observe subtler differences between genuine and attack samples than between samples acquired by different devices. In Fig 3, we can observe that pattern noise can be more similar between genuine and attack samples from the same device than between genuine-access samples from different devices. This indicates that cameras are very distinct from each other, and a formulation that does not account for their differences may end up with a model biased towards irrelevant aspects of the dataset, instead of representing important characteristics of the problem, such as attack clues. Inspired by this observation, we propose a loss function aiming at reducing such possible biases.

Fig 3. Center-cropped noise residuals for average frames from RECOD-MPAD dataset, highlighting differences in pattern noise across sensors.

In each case, 20 frames were randomly sampled from the training set and the residual [43] was computed from the average frame. Despite belonging to different PAD classes, patterns are visually similar between (a) and (c), and between (d) and (e). On the other hand, genuine-access examples can generate patterns that look dissimilar when comparing across sensors, as in (a) compared to (d). As sensor devices are different and interact differently with attack instruments, this difference should be taken into account to train more robust data-driven models for PAD.

https://doi.org/10.1371/journal.pone.0238058.g003

We reformulate the problem by adding another term to the training objective loss function, changing the way images are used during optimization. The goal is to force genuine samples from a given device to be more compactly located in intermediate feature spaces, but farther away from attack samples of the same device. We hypothesize that this would create better manifolds by not directly confounding information from different devices, as in traditional training strategies.

More specifically, we consider a latent representation f(I) of the original input image I after it has been successively non-linearly transformed by the network layers. Consider a triplet of images I_n, I_r, I_a coming from the same device: a genuine anchor, another genuine example, and an attack example, respectively. Let n := f(I_n), r := f(I_r), and a := f(I_a), for short. Now, we can add the following loss function to the network:

$$ L_{\text{triplet}} = \max\left(0,\; \lVert n - r \rVert_2^2 - \lVert n - a \rVert_2^2 + m\right) \qquad (2) $$

where m is a margin hyperparameter, fixed beforehand, interpreted as the relative separation between attack examples and genuine examples to be enforced in the learned embedding. In general, this separation should be as large as possible, but when m ≫ 0, training becomes one-sided, since the objective reduces to separating attack examples from their anchors as much as possible. In that regime, training typically diverges due to large initial gradients, unless the learning rate is also reduced. On the other hand, the absence of such a margin (m = 0) can prevent the network from learning the “attack concept”.

By minimizing L_triplet, we enforce the notion that genuine samples from a given device should be closer in this latent space to genuine samples of the same device, but farther away from attack samples of the same device, up to a margin. In theory, this triplet loss could be used alone to optimize an embedding to directly compare pairs of images during inference [44]. But since we ultimately want the trained model to distinguish between arbitrary genuine-access images and attack images, we add the previously described cross-entropy loss to jointly enforce a classification objective:

$$ L_{\text{spoof}} = L_{\text{class}}\big(f(R \cup A)\big) + \kappa \max_{1 \le t \le T} L_{\text{triplet}}(n_t, r_t, a_t) \qquad (3) $$

where N, R, and A are the sets of genuine anchors, genuine non-anchor images, and attack images in a mini-batch, respectively, and T is the number of triplets. All images in a mini-batch are sampled from a single device. The term L_class(f(R ∪ A)) indicates that the cross-entropy loss is computed only for the 2T non-anchor images. The embedding term is the maximum triplet loss over the triplets in a mini-batch, which can be interpreted as performing an online selection of hard triplets [44]. The hyperparameter κ controls the relative weight of the triplet loss term.

Each term is a regularizer for the other objective in the spoof loss function Lspoof. The classification loss Lclass encourages finding a single decision boundary separating genuine-access from attack patterns in general, while the triplet loss Ltriplet encourages making intra-device genuine-access samples as compactly located as possible in the latent space, but farther away from attack samples of the same device.
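A sketch of how L_spoof could be computed in PyTorch is given below. The helpers model.embed (returning the pooled fire9 features) and model.classify (the dropout-plus-linear classification layer) are assumed names, not part of the original implementation; the defaults match the hyperparameters reported later (m = 1.0, κ = 0.05).

```python
import torch
import torch.nn.functional as F

def spoof_loss(model, anchors, genuine, attacks, kappa=0.05, margin=1.0):
    """Multi-objective loss of Eq (3): cross-entropy over the non-anchor
    columns plus kappa times the hardest intra-device triplet term (Eq 2).
    All three columns are sampled from a single device."""
    n = model.embed(anchors)   # 512-D pooled fire9 embeddings (assumed helper)
    r = model.embed(genuine)
    a = model.embed(attacks)

    # Eq (2) per triplet; only the hardest triplet in the mini-batch
    # contributes (online hard-triplet selection).
    triplet = F.relu((n - r).pow(2).sum(dim=1)
                     - (n - a).pow(2).sum(dim=1) + margin)
    hardest = triplet.max()

    # Eq (1) over the 2T non-anchor images: genuine -> 0, attack -> 1.
    logits = torch.cat([model.classify(r), model.classify(a)], dim=0)
    targets = torch.cat([torch.zeros(len(genuine), dtype=torch.long),
                         torch.ones(len(attacks), dtype=torch.long)])
    return F.cross_entropy(logits, targets) + kappa * hardest
```

Note that, consistently with the text, the gradient of the triplet term never flows through model.classify, so it does not influence the classification layer.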

Architecture.

The core architecture is the same as in the baseline, but we move the global-average pooling earlier in the network, directly reducing spatial correlations and the dimension of the activations of the fire9 layer. This is the representation we use to compute the triplet loss. Following this 512-D embedding, we include the usual classification layer, consisting of dropout and a linear mapping that generates class scores, optimized with the classification loss.

The network core consists mostly of stacked convolutional filters, which can be somewhat independent of scale. Different combinations of these filters, acting as more advanced feature detectors, can naturally specialize or learn to adapt to the varying resolutions and feature scales. Moreover, the final classification layer implemented as global-average pooling acts as a parameterless aggregator of the final filter responses, and is thus invariant to locality in its input representation.

Fig 4 illustrates how the network is trained with the multi-objective loss Lspoof. Mini-batches are built from triplets of images coming from the same device. Three different columns are formed, one with genuine-access anchors, one with genuine-access samples, and one with attack samples. The intra-device triplet loss Ltriplet is calculated for each triplet. In addition to that, the non-anchor columns are forwarded further into the classification layer, and their output is used to compute the usual cross-entropy loss Lclass. Our spoof loss Lspoof is the weighted sum of these two loss components, with the triplet-loss component weighted by a parameter κ. The dashed lines indicate that weights are shared between the columns, i.e., we train a single network.

Fig 4. Proposed method.

Architectural changes and training procedure.

https://doi.org/10.1371/journal.pone.0238058.g004

Pre-processing and data augmentation.

Fig 5 illustrates the construction of a mini-batch of multi-resolution patches. Pre-processing for each image in a mini-batch starts by uniformly sampling a variable α from the interval [0.08, 1], defining a percentage of the image area. The side of the square cropped region is then $S = \sqrt{\alpha W H}$, where W and H are the width and height of the full-resolution whole-face image, respectively. The smallest possible patch corresponds to an area approximately equal to the region around one of the eyes in the original aligned image, regardless of the image size. A patch of size S × S is then cropped at a randomly sampled top-left corner (i, j). The remaining pre-processing and augmentation procedure is similar to the baseline: rescaling of the patch to 227 × 227, random mirroring, and centering.
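A minimal sketch of this patch sampler, assuming PIL images and torchvision, follows; variable names mirror the notation above.

```python
import math
import random
from torchvision.transforms import functional as TF

def random_multires_patch(img):
    """Sample a square patch covering a fraction alpha in [0.08, 1] of the
    full-resolution face image, then rescale it to the 227x227 input size.
    'img' is assumed to be a PIL image of the aligned whole face."""
    W, H = img.size                        # PIL convention: (width, height)
    alpha = random.uniform(0.08, 1.0)      # fraction of the image area
    S = int(math.sqrt(alpha * W * H))      # side of the square patch
    S = min(S, W, H)                       # keep the crop inside the image
    i = random.randint(0, H - S)           # random top-left corner (i, j)
    j = random.randint(0, W - S)
    patch = TF.crop(img, i, j, S, S)
    return TF.resize(patch, [227, 227])
```

In effect, this is close to torchvision's RandomResizedCrop with scale = (0.08, 1.0) and a fixed unit aspect ratio; random mirroring and centering are applied afterwards, as in the baseline.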

As a consequence of this process, we effectively generate a much larger and variable number of examples from a single image in the dataset. Some patches will be closer to the native camera resolution and depict only part of the user face, while others will consist of most of the face, downscaled to the fixed input size. Different patches can emphasize different aspects of attack artifacts. As an effect, the trained network is expected to be more robust to variations in resolution. Moreover, the model must learn not to depend on certain combinations of facial features, which could naturally happen when training with aligned full faces.

Training details.

We build mini-batches consisting of three columns of images from the same device: 64 genuine-access anchors, 64 genuine samples, and 64 attack samples; i.e., T = 64 in Eq 3. For each mini-batch, base images are sampled without replacement. Each triplet is passed through the network to calculate the aggregate maximum triplet loss component of the spoof loss, while the non-anchor samples are forwarded to the classification layer and used to compute the cross-entropy loss component. Gradients of the total spoof loss with respect to all parameters are then computed via backpropagation. Note that the triplet loss component does not influence the gradient at the classification layer.

The margin parameter m in Eq 2 was set to 1.0, and we found little value in tweaking it, although, importantly, setting it to a large value can make optimization diverge in early iterations. The weighting parameter κ in Eq 3 was set to 0.05. Starting at 0.5 and successively reducing it in steps of 0.1 and then 0.05, we found this value to be the largest that consistently does not make training diverge in the early stages.

We use the Adam optimizer with a learning rate of 10⁻⁵ and add an additional weight decay term with weight 10⁻⁴ to the final loss function.

Inference.

Crucially, given that the network is fully convolutional and is trained to be robust to variations in feature sizes, we can do fast and effective inference using just a single image: we simply feed the network with the whole-face image.

On-device user-specific adaptation

Thus far, we have described how to train a CNN to solve the face PAD problem. In this section, we show how to further improve the effectiveness of these models in real-world situations, when they are deployed to mobile devices.

Classification models are trained with a finite training set, but are expected to work properly when presented with new data. Typically, if the operational data distribution is similar to the distribution of training data, models tend to behave well, but in practice there are no guarantees of generalization. Oftentimes, demands of the operational scenario are more specific. For example, in our case, the model will be deployed to a specific device, and will typically be presented with images from the same user.

Prior work in the literature [45, 46] proposed learning classifiers over user-specific features. This might not be appropriate for the mobile scenario, as training would have to be performed on the device. In light of this, we propose a simple strategy to adapt the decision boundary to the specific characteristics of the user and the sensor device.

During normal operation, and possibly over the course of many days, the user will have successfully authenticated multiple times, in different lighting situations. Their appearance may even have changed, due to new glasses, a haircut, or facial hair, among other aspects. If we assume that after each successful authentication the final score (probability of attack) is stored in a user gallery, after some time we will have a representative distribution of the system’s score for the genuine-access class. Alternatively, this gallery could be explicitly updated during enrollment sessions.

Our strategy takes advantage of a gallery G of user-specific genuine-access scores stored in the device during normal operation. We assume that scores are in the range [0, 1] and that a higher score stands for a higher likelihood of the input being an attack. We seek a minimal user-specific acceptance threshold ψ such that the false rejection rate FRRψ is bounded by a predefined value ϵ.

Algorithm 1 User-specific threshold estimation.

Input: score gallery G, tolerated FRR ϵ, threshold change Δ

Output: user-specific acceptance threshold ψ

ψ ← 0

while CumulativeDistribution(G, ψ) < 1 − ϵ do

  ψψ + Δ

return ψ

Algorithm 1 describes the procedure. The function CumulativeDistribution(G, ψ) returns the cumulative distribution of scores in the gallery, from 0 to ψ. For example, if ϵ = 0.05, the search starts from 0 and stops at the first threshold ψ for which no more than 5% of the genuine-access examples in the gallery would be rejected.
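A direct Python transcription of Algorithm 1 might look as follows, assuming the gallery is a list of genuine-access scores in [0, 1]:

```python
def user_specific_threshold(gallery, eps=0.05, delta=0.05):
    """Algorithm 1: find the smallest threshold psi such that at most a
    fraction eps of the gallery's genuine-access scores is rejected."""
    def cumulative_distribution(scores, psi):
        # Fraction of gallery scores that fall at or below psi.
        return sum(s <= psi for s in scores) / len(scores)

    psi = 0.0
    while cumulative_distribution(gallery, psi) < 1.0 - eps:
        psi += delta
    return psi

# With eps = 0.05, the search stops at the first psi accepting at least
# 95% of the stored genuine-access scores (illustrative gallery values).
psi = user_specific_threshold([0.02, 0.10, 0.07, 0.31, 0.05])
```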

Despite its simplicity, the described procedure effectively tightens the decision boundary, so that the number of false rejections is controlled, while false acceptance errors are possibly reduced. In this way, a model is adjusted to make the best prediction it can for attacks, constrained to a certain user-inconvenience level ϵ.

Datasets

Building representative datasets is one of the hardest aspects of machine learning, and it is particularly difficult in this case, in which we deal with biometric data and cannot account for all possible attack configurations. Furthermore, most of the existing public datasets do not fully satisfy the requirements, given our constraints and problem domain. Because of that, we focus on two datasets, RECOD-MPAD and OULU-NPU [34], which are described here in detail. Table 1 summarizes the properties of the two datasets.

For the sake of completeness, we also experiment with two commonly used datasets in the literature, Idiap REPLAY-ATTACK [47] and CASIA Face AntiSpoofing [8], noting that they do not target the mobile environment. Idiap REPLAY-ATTACK contains 1200 videos captured by a MacBook webcam, under controlled and adverse conditions. Recaptures were recorded with a Canon PowerShot camera, while the spoofing media are an iPad 1, an iPhone 3GS, and paper. CASIA Face AntiSpoofing consists of 600 videos and considers more acquisition devices than REPLAY-ATTACK, with different resolutions: a Sony NEX-5 camera and two USB cameras. Attacks are created with warped photos, cut photos, and videos.

RECOD-MPAD

RECOD-MPAD was collected with the goal of building a dataset truly representative of our fast mobile-device unlock scenario, covering as many illumination variations as possible, a characteristic lacking in public datasets. As such, it is the main benchmark for our method. The dataset is available at https://zenodo.org/record/3749309.

As the dataset consists of face images, all volunteers who agreed to participate were required to sign a term of consent allowing the use of their images for research purposes. The term was approved by the Research Ethics Committee of the University of Campinas under the number 53035216.6.0000.5404. Following common practices, data is anonymized in the sense that no additional information that could lead to the identification of the subjects is stored with the images. Specifically, subjects who appear in figures in this article have given additional written informed consent (as outlined in the PLOS consent form) to publish their image.

We used two acquisition devices: a Moto G5 smartphone released in 2017 (device 1) and a Moto X Style XT1572 smartphone from late 2015 (device 2). They are equipped with modern frontal cameras that differ from one another.

We designed five illumination scenarios for capturing the genuine-access videos:

  • Session 1: Outdoors, direct sunlight on a sunny day.
  • Session 2: Outdoors, in a shadow (diffuse lighting).
  • Session 3: Indoors, artificial top light.
  • Session 4: Indoors, natural lateral light (window or door).
  • Session 5: Indoors, lights off (noisy).

In contrast to the static sessions in previous datasets, each user was instructed to slowly rotate around their own axis during the roughly 10-second capture, further increasing variability from one frame to another.

The recaptures are divided into display and printed-photo attacks. For display attacks, we chose two monitors of different sizes as the attack medium—a large 42-inch monitor (D1) and a 17-inch monitor (D2); for printed-photo attacks, we extracted two frames from each of the original videos and printed them on A4-sized paper with a single printer. The first printout (P1) was recaptured in a scenario with diffuse lighting, while the second (P2) was recaptured in slightly dimmer and noisier conditions. Each recapturing session was done with the same corresponding acquisition device, following what would happen in a real-world setup. Fig 6 depicts some examples of genuine-access cropped frames and Fig 7 shows some recapture examples. To construct the official protocols, we extracted 64 equally-spaced frames from each video. These frames were then fed into the face landmark detector of the DLib toolkit (http://dlib.net/) and the localized eye centers were saved. For frames in which detection failed, we manually annotated the eye centers. The final number of frames is 143,997, covering 45 users, 2 sensor devices, 5 sessions or illumination scenarios, and 4 attack types. Among the 45 users, 30 are men, 14 wear glasses, 13 have a beard, and ages range from 18 to 50 years. The frames are divided into 3 user-disjoint subsets:

  • A training set containing 76,798 frames from 24 users;
  • A validation or development set containing 19,200 frames from 6 users;
  • A test set containing 47,999 frames from 15 users.
Fig 6. RECOD-MPAD: Variations between genuine-access sessions and acquisition devices.

From left to right: sessions 1 to 5. Top row: acquisition device 1. Bottom row: acquisition device 2.

https://doi.org/10.1371/journal.pone.0238058.g006

Fig 7. RECOD-MPAD: Examples of display and printed-photo attacks.

From left to right: genuine, print 1, print 2, display 1, display 2. Top row: session 3. Bottom row: session 2.

https://doi.org/10.1371/journal.pone.0238058.g007

Using the locations of the eyes, we align and crop each image to a square region by applying a similarity transform, so that the line connecting the two eye centers is made horizontal and the eyes occupy standard positions in the cropped face region. The transformation closely preserves the original resolution by mapping the eyes so that the distance in pixels between them is roughly the same as in the unaligned image.
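A sketch of this alignment with scikit-image is given below; only the preserved inter-eye distance is prescribed by the text, while the relative target positions of the eyes in the crop are illustrative assumptions.

```python
import numpy as np
from skimage import transform as tf

def align_face(img, left_eye, right_eye):
    """Map the eye centers onto a horizontal line at standard positions
    in a square crop, preserving the inter-eye distance in pixels.
    Eye coordinates are (x, y); the target positions (30%/70% of the
    width, 40% of the height) are illustrative assumptions."""
    d = np.linalg.norm(np.subtract(right_eye, left_eye))
    size = int(round(d / 0.4))                  # eyes span 40% of the crop
    dst = np.array([[0.3 * size, 0.4 * size],   # left-eye target
                    [0.7 * size, 0.4 * size]])  # right-eye target
    T = tf.SimilarityTransform()
    T.estimate(np.array([left_eye, right_eye], dtype=float), dst)
    return tf.warp(img, T.inverse, output_shape=(size, size))
```

Note that the distance between the two target positions is 0.4 × size = d, so the inter-eye distance, and hence the resolution, is roughly unchanged.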

OULU-NPU

The OULU-NPU dataset [34] was released for a face PAD competition targeting mobile devices [48]. It is based on three static indoor sessions, and it includes genuine-access and attack videos taken with six different smartphone cameras.

Genuine-access videos were captured with the front cameras of six smartphones. However, only two have normal fixed-focus frontal cameras; three of them have auto-focusing capabilities, which dramatically influences close-distance recaptures. If a camera can properly focus on the attack surface, the result is a more detailed depiction of the original scene, but this can also emphasize details of the attack surface itself, resulting in aliasing or moiré artifacts.

In general, for each camera and user, three 5-second indoor videos were captured. Attacks were based on photos (printed on A3 paper using two different printers) and videos (recaptured on two different monitors).

Given that the videos are mostly static, we extracted only 1 in every 7 available frames, which results in 17 to 21 frames per video. We manually localized eyes whenever annotations were missing. The remaining pre-processing is exactly the same as for RECOD-MPAD.

As, in this work, we generate scores for static frames, video scores for the OULU-NPU protocols are computed as the average score over the predicted frames, making comparison with other methods possible and fair.

Experimental results

We report our results using a set of metrics commonly considered in biometric presentation attack evaluation, namely, Attack Presentation Classification Error Rate (APCER), Bona fide Presentation Classification Error Rate (BPCER), Average Classification Error Rate (ACER), Half-Total Error Rate (HTER), and Equal Error Rate (EER) [49].

APCERθ and BPCERθ are analogous to False Acceptance Rate (FAR) and False Rejection Rate (FRR), respectively, for a given acceptance threshold θ. For the OULU-NPU evaluations, however, APCER also takes into account the attack potential in the worst-case scenario, i.e., APCER is the highest FAR computed for each presentation attack instrument (e.g., print and display) separately. ACER is the average of APCER and BPCER. Finally, EER is defined as the point where FAR is equal to FRR, and HTERθ is the average of FAR and FRR, with acceptance threshold θ.
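As a reference, the threshold-dependent metrics can be computed as in the following sketch, where scores are the predicted attack probabilities; for the OULU-NPU evaluations, APCER would instead be taken as the maximum over the per-instrument APCERs.

```python
import numpy as np

def pad_metrics(scores, labels, theta=0.5):
    """APCER/BPCER/ACER at acceptance threshold theta. A score is the
    predicted probability of attack; label 1 = attack, 0 = genuine.
    A sample is accepted when its score falls below theta."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    attack, genuine = scores[labels == 1], scores[labels == 0]
    apcer = float(np.mean(attack < theta))    # attacks wrongly accepted (FAR)
    bpcer = float(np.mean(genuine >= theta))  # genuine wrongly rejected (FRR)
    acer = (apcer + bpcer) / 2.0              # equals HTER at theta here
    return apcer, bpcer, acer
```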

In order to validate our proposal in RECOD-MPAD and OULU-NPU, we consider three baseline methods:

  • Whole-face CNN: it consists of a convolutional neural network trained for the face PAD problem with whole-face images.
  • Pre-trained CNN: it is a simplified version of the whole-face CNN, in which only the classification layer is learned, using the pre-trained frozen core as a feature extractor.
  • Color-LBP: a handcrafted method [50] that combines texture and color characterizations, and was shown to outperform other popular texture-based methods.

For experiments using REPLAY-ATTACK and CASIA datasets, we consider as baseline a recent face presentation attack detection method based on unsupervised domain adaptation [51].

Results on RECOD-MPAD

Intra-scenario.

The proposed method and the baseline methods were trained with all available frames in the training set. Validation data were used in an initial experimentation phase to find suitable hyperparameters, which were then fixed. Validation data were also used to monitor training curves, and the final model is selected as the one that achieved the smallest HTER0.5 in the validation set, after a certain number of epochs. Our method was trained for at most 300 epochs, while the other CNN baselines were trained for at most 100 epochs, as there was no observable benefit in continuing training beyond that point.

Table 2 presents the results, showing how well the tested methods generalize to new users, assuming similar acquisition and attack conditions. Our method outperformed the baselines by a large margin. Note that even a CNN whose core representations have been learned on another task can reach performance figures similar to those of the handcrafted baseline. This can also be validated by analyzing the trade-off between APCER and BPCER in the test set (Fig 8): our method outperforms the others for nearly all error values.

Fig 8. Detection Error Trade-off (DET) for the methods evaluated on RECOD-MPAD.

Our method outperforms the others for nearly all error values.

https://doi.org/10.1371/journal.pone.0238058.g008

Fig 9 illustrates some error cases, with corresponding heatmaps showing which regions of the image are activated more strongly by the network. Although the examples show wrong prediction cases, we can observe, in the highlighted areas, that the network justifiably suggests clues of the opposite class. For instance, in Fig 9(c), the strong highlight caused by direct sunlight is marked as an attack clue, which makes sense. It suggests that errors are often interpretable and that the network is indeed learning useful representations that can discriminate attack and genuine frames.

Fig 9. Examples of error cases with corresponding heatmaps from the layer preceding global average pooling.

A brighter red hue stands for a locally higher likelihood of attack. These are difficult error cases, but some patterns indicate that the networks have learned useful features. In (a), the network correctly identifies the strong reflection as an attack clue, while the shadow areas in (b) contribute to the correct label. In (c), the strong highlight and loss of contrast caused by direct sunlight contributes to the wrong prediction, while in (d) the network gets confused by the reflection on the glasses and the overall blurriness, all of which are strong attack clues.

https://doi.org/10.1371/journal.pone.0238058.g009

Cross-scenario.

In the cross-scenario experiments, our goal is to test how well models trained on only a limited number of scenarios can generalize to new conditions. This is one of the most overlooked aspects of PAD evaluation.

We start from the same user-disjoint subsets defined above. For the cross-session experiments, we create a total of 5 cross-session sub-protocols by using a leave-one-session-out strategy. For the cross-attack experiments, we filter the original subsets to create four sub-protocols: we select one type of display attack and one type of print attack to be part of the training and validation sets, while the test set is left with only the two remaining attacks. For the cross-device experiments, we create two sub-protocols: each one has only frames from one device in the training and validation sets, and only frames from the other device in the test set.

Table 3 presents the results for the cross protocols. In general, the proposed method is comparable to or outperforms baselines when facing unseen conditions. In the cross-session protocol, our method is comparable to the Whole-face CNN baseline, with a much superior performance in the extreme low-light indoor scenario (5). In the cross-attack protocol, models trained with attacks from the larger display demonstrated acceptable generalization when predicting unseen attacks performed by the smaller monitor. Attacks performed with a smaller monitor and recaptured with a fixed-focus sensor tend to have limited resolution, due to soft focus at closer distances. For the cross-device protocol, the immediate observation is that models trained only with device 2 generalize relatively better than the other way around. Similarly to the situation with the cross-attack protocol, the most obvious difference is in the amount of detail (resolution) across different cameras and attack surfaces.

On-device user-specific adaptation.

Using RECOD-MPAD, we also tested the proposed on-device user-specific adaptation. Table 4 illustrates the benefits of using this technique. Two sessions are used to simulate a gallery, while errors are calculated considering the remaining three sessions. For this experiment, a different cut-off is learned for each of the 15 users in the test set, but we aggregate errors for convenience. Algorithm 1 is used with parameters ϵ = 0.05 and Δ = 0.05.

By learning user-specific thresholds based on the recorded genuine-access scores from a user, we can obtain relative error reductions of up to about 30%, even when presented with new attack types. The efficacy of the procedure could be even higher in practice, where the gallery could contain a more representative sample of illumination situations, instead of only the two sessions used in the experimental protocol.

Results on OULU-NPU

The OULU-NPU dataset was used in an international competition with 13 participating teams [48], which enables us to compare our method to the state of the art.

Table 5 summarizes the results for all protocols in the competition, each one designed to test a different cross scenario. Protocol IV is the hardest in the competition, as the test set includes unseen sessions, attacks, and user cameras. We compare our method first with the whole-face CNN baseline, and then with the other competitors.

In the first (cross-session) and second (cross-attack) protocols, our method performs similarly to the 3rd-place entry, which utilizes Inception-V3 [52], a network that performs more operations than SqueezeNet and has a larger model size. In the third protocol (cross-device), our method does not perform well, as the dataset was constructed with cameras that are very different from one another. In the fourth and most challenging protocol (cross-*), the proposed method was better on average and, more importantly, behind only the first place.

It is worth noting that the proposed method consistently outperforms all other methods on the validation set, i.e., in an intra-dataset scenario. We also obtain an excellent result in the most challenging protocol, in which all factors are unseen during training.

Results on REPLAY-ATTACK and CASIA

To enable comparison with other baselines in the prior art, we experiment with two widely known datasets: Idiap REPLAY-ATTACK [47] and CASIA Face AntiSpoofing [8]. This also allows evaluating our method in a cross-dataset scenario, i.e., training on one dataset and testing on another, to assess generalization.

We compare our method to a recent state-of-the-art domain adaptation (DA) framework proposed for presentation attack detection [51]. The authors incorporated several features into the framework; here we report only the two that yielded the best results: CoALBP HSV [3] and a deep-learning (DL) based feature [28]. For the cross-dataset scenario, the authors considered several domain adaptation techniques with outlier removal; we report results for the Kernel Subspace Alignment (KSA) method, which provided the best results in general.

Table 6 summarizes the results. Our method provides better results in all cases, except for the cross-dataset experiment with training on REPLAY-ATTACK and testing on CASIA. In this specific setup, the training dataset contains only one capturing device (a webcam), which hinders the application of our method. We also highlight that neither the domain adaptation framework nor these datasets were designed for the mobile scenario, and they do not take its peculiarities into account. Nevertheless, as expected, the error rates are higher in the cross-dataset scenario for all methods, as capturing conditions differ strongly between the training and test sets.

Table 6. Results for CASIA (EER), REPLAY-ATTACK (HTER0.5), and cross-dataset scenarios (HTER0.5).

https://doi.org/10.1371/journal.pone.0238058.t006

Mobile implementation

The proposed method was implemented for Android using TensorFlow [53]. The implementation was tested on the previously considered devices: a Moto G5 smartphone released in 2017 (device 1) and a Moto X Style XT1572 smartphone from late 2015 (device 2). Each has an 8-core 2.0 GHz CPU.

On device 1, the method runs in 197.75 ± 34.57 ms; on device 2, the running time is 233.85 ± 28.55 ms, computed over 20 independent runs. Peak memory usage is 50 MB. These figures do not include face detection and alignment, whose cost is negligible in comparison to the forward pass.
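The on-device figures were collected with the Android build. As a desktop-side illustration of the same measurement protocol, the sketch below times 20 forward passes of a converted model; the model file name and input shape are placeholders, and TensorFlow Lite here merely stands in for the mobile runtime used in our implementation.

    import time
    import numpy as np
    import tensorflow as tf

    # Illustrative latency measurement: mean and standard deviation over
    # 20 independent forward passes, mirroring the numbers reported above.
    # 'model.tflite' is a placeholder path for a converted network.
    interpreter = tf.lite.Interpreter(model_path='model.tflite')
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]

    timings_ms = []
    for _ in range(20):
        frame = np.random.rand(*inp['shape']).astype(np.float32)
        start = time.perf_counter()
        interpreter.set_tensor(inp['index'], frame)
        interpreter.invoke()
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    print(f'{np.mean(timings_ms):.2f} ± {np.std(timings_ms):.2f} ms')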

The proposed method runs on modern smartphones in under one second, and the small amount of required memory ensures that the authentication step does not interfere with other applications.

Conclusions

In this work, we proposed a new method to train a CNN to model the face PAD problem in a completely data-driven way. This allowed us to gain insight into the problem definition itself, instead of being tied to specific handcrafted features. We focused on the constraints of the mobile-device scenario, with its data acquisition peculiarities and hardware limitations.

The novel formulation seeks to improve upon the traditional interpretation of the problem as binary classification of aligned faces. The use of patches of varying resolution during training, besides increasing the number of available examples, forces the model to be robust to changes in resolution and avoids overfitting to specific facial features. We also proposed a multi-objective loss function, specifically designed for the problem, to encourage genuine-access examples from the same device to be more compactly located in the learned feature space, while also reducing inter-device confusion, an issue that had not been addressed in the literature.
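A minimal sketch of one plausible instantiation of such a loss follows; the exact formulation appears earlier in the paper, and the names, the weighting factor alpha, and the restriction of the compactness term to genuine examples are assumptions made here for illustration (the additional inter-device term is omitted for brevity).

    import tensorflow as tf

    def sensor_aware_loss(logits, labels, embeddings, device_ids,
                          centers, alpha=0.5):
        """Illustrative sketch, not the paper's exact formula: standard
        cross-entropy plus a compactness term that pulls the embedding of
        each genuine-access example (label 0) towards the centroid of its
        capturing device. `centers` holds one learned centroid per device."""
        ce = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels, logits=logits))
        genuine = tf.cast(tf.equal(labels, 0), tf.float32)
        own_centers = tf.gather(centers, device_ids)
        dist = tf.reduce_sum(tf.square(embeddings - own_centers), axis=1)
        compactness = tf.reduce_sum(genuine * dist) / (tf.reduce_sum(genuine) + 1e-8)
        return ce + alpha * compactness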

To train the models and evaluate the solutions, we introduced a new dataset, RECOD-MPAD, with unique characteristics, including low-light and outdoor lighting scenarios, and with much higher intra- and inter-session variability than existing datasets. It also defines challenging factor-disjoint protocols, an aspect often overlooked in prior evaluation setups.

Our approach proved superior or comparable to handcrafted methods and other CNN baselines. This suggests that, contrary to popular belief, complex deep learning models can generalize better than handcrafted alternatives, even when trained with arguably limited amounts of data. The proposed method was also evaluated on a public dataset that was part of a competition with 13 competing solutions, against which it compares favorably.

Crucially, the trained architecture has very small memory requirements and can make predictions within a fraction of a second on modern smartphones. This validates the potential of data-driven approaches for presentation attack detection in mobile environments.

Apart from being applicable to low-end devices that lack advanced sensors, our method could be combined with or extended to other input types, including depth information, thereby taking advantage of new technologies. Furthermore, the proposed loss function is generic enough to be applied to other PAD problems, forcing genuine-access examples from the same capturing device to be more compactly located in the learned feature space, while reducing inter-device confusion.

In future investigations, we intend to experiment with other CNN architectures and analyze the impact of multi-resolution patch inputs and the spoof loss function when training them. We also want to study the use of multi-resolution patches during inference, instead of whole-face images. By augmenting the inference for an input with its multi-resolution patches, we can compute a fusion score that should, in principle, be more robust; a sketch of this idea follows. In that case, we will need to further investigate the trade-off between mobile processing time and the improvements brought by the fusion process.
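A minimal sketch of the envisioned inference-time fusion, assuming hypothetical patch sizes and a generic scoring callable (none of these choices are prescribed by our current method):

    import numpy as np

    def fused_score(face, model, patch_sizes=(96, 128, 160), stride=32):
        """Average attack scores over overlapping multi-resolution patches
        instead of scoring the whole aligned face once. `model` is any
        callable mapping an HxWx3 crop to a scalar attack score."""
        h, w = face.shape[:2]
        scores = []
        for size in patch_sizes:
            if size > h or size > w:
                continue  # skip resolutions larger than the face crop
            for y in range(0, h - size + 1, stride):
                for x in range(0, w - size + 1, stride):
                    scores.append(model(face[y:y + size, x:x + size]))
        return float(np.mean(scores)) if scores else None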

References

  1. de F Pereira T, Anjos A, Martino JMD, Marcel S. Can face anti-spoofing countermeasures work in a real world scenario? In: IAPR Int. Conf. Biometrics; 2013. p. 1–8.
  2. Pinto A, Pedrini H, Schwartz WR, Rocha A. Face spoofing detection through visual codebooks of spectral temporal cubes. IEEE Trans Image Process. 2015;24(12):4726–4740.
  3. Boulkenafet Z, Komulainen J, Hadid A. Face spoofing detection using colour texture analysis. IEEE Trans Inf Forensics Security. 2016;11(8):1818–1830.
  4. Wen D, Han H, Jain AK. Face spoof detection with image distortion analysis. IEEE Trans Inf Forensics Security. 2015;10(4):746–761.
  5. Amidror I. The Theory of the Moiré Phenomenon—Volume I: Periodic Layers, Second Edition. vol. 38 of Computational Imaging and Vision. Springer; 2009.
  6. Tan X, Li Y, Liu J, Jiang L. Face liveness detection from a single image with sparse low rank bilinear discriminative model. In: Eur. Conf. Comput. Vision. vol. 6316. Springer, Berlin, Heidelberg; 2010. p. 504–517.
  7. Pan G, Sun L, Wu Z, Wang Y. Monocular camera-based face liveness detection by combining eyeblink and scene context. Telecommun Syst. 2011;47(3–4):215–225.
  8. Zhang Z, Yan J, Liu S, Lei Z, Yi D, Li SZ. A face antispoofing database with diverse attacks. In: IAPR Int. Conf. Biometrics; 2012. p. 26–31.
  9. Bao W, Li H, Li N, Jiang W. A liveness detection method for face recognition based on optical flow field. In: Int. Conf. Image Anal. Signal Process.; 2009. p. 233–236.
  10. Bharadwaj S, Dhamecha TI, Vatsa M, Singh R. Computationally efficient face spoofing detection with motion magnification. In: IEEE Conf. Comput. Vision Pattern Recognition Workshops; 2013. p. 105–110.
  11. de F Pereira T, Komulainen J, Anjos A, Martino JMD, Hadid A, Pietikäinen M, et al. Face liveness detection using dynamic texture. EURASIP J Image Video Process. 2014;2014(2).
  12. Anjos A, Chakka MM, Marcel S. Motion-based counter-measures to photo attacks in face recognition. IET Biometrics. 2013;3(3):147–158.
  13. Choudhury T, Clarkson B, Jebara T, Pentland A. Multimodal person recognition using unconstrained audio and video. In: Int. Conf. Audio- Video-Based Biometric Person Authentication; 1999. p. 176–181.
  14. Kollreider K, Fronthaler H, Bigun J. Evaluating liveness by face images and the structure tensor. In: IEEE Workshop Automat. Identification Advanced Technol.; 2005. p. 75–80.
  15. Marsico MD, Nappi M, Riccio D, Dugelay JL. Moving face spoofing detection via 3D projective invariants. In: IAPR Int. Conf. Biometrics; 2012. p. 73–78.
  16. Lagorio A, Tistarelli M, Cadoni M, Fookes C, Sridharan S. Liveness detection based on 3D face shape analysis. In: Int. Workshop Biometrics Forensics; 2013. p. 1–4.
  17. Bai J, Ng TT, Gao X, Shi YQ. Is physics-based liveness detection truly possible with a single image? In: IEEE Int. Symp. Circuits Syst.; 2010. p. 3425–3428.
  18. Komulainen J, Hadid A, Pietikäinen M, Anjos A, Marcel S. Complementary countermeasures for detecting scenic face spoofing attacks. In: IAPR Int. Conf. Biometrics; 2013. p. 1–7.
  19. Gragnaniello D, Poggi G, Sansone C, Verdoliva L. An investigation of local descriptors for biometric spoofing detection. IEEE Trans Inf Forensics Security. 2015;10(4):849–863.
  20. Li J, Wang Y, Tan T, Jain AK. Live face detection based on the analysis of Fourier spectra. In: Biometric Technol. Human Identification. vol. 5404; 2004. p. 296–304.
  21. Kim G, Eum S, Suhre JK, Kim DI, Park KR, Kim J. Face liveness detection based on texture and frequency analyses. In: IAPR Int. Conf. Biometrics; 2012. p. 67–72.
  22. Pinto A, Schwartz WR, Pedrini H, de R Rocha A. Using visual rhythms for detecting video-based facial spoof attacks. IEEE Trans Inf Forensics Security. 2015;10(5):1025–1038.
  23. Galbally J, Marcel S, Fierrez J. Image quality assessment for fake biometric detection: Application to iris, fingerprint, and face recognition. IEEE Trans Image Process. 2014;23(2):710–724.
  24. Arashloo SR, Kittler J, Christmas W. An anomaly detection approach to face spoofing detection: A new formulation and evaluation protocol. IEEE Access. 2017;5:13868–13882.
  25. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances Neural Inform. Process. Syst.; 2012. p. 1097–1105.
  26. Bengio Y, Courville A, Vincent P. Representation learning: A review and new perspectives. IEEE Trans Pattern Anal Mach Intell. 2013;35(8):1798–1828.
  27. Menotti D, Chiachia G, Pinto A, Schwartz WR, Pedrini H, Falcão AX, et al. Deep representations for iris, face, and fingerprint spoofing detection. IEEE Trans Inf Forensics Security. 2015;10(4):864–879.
  28. Yang J, Lei Z, Li SZ. Learn convolutional neural network for face anti-spoofing. arXiv preprint arXiv:1408.5601. 2014.
  29. Patel K, Han H, Jain AK. Cross-database face antispoofing with robust feature representation. In: Chin. Conf. Biometric Recognition; 2016. p. 611–619.
  30. Atoum Y, Liu Y, Jourabloo A, Liu X. Face anti-spoofing using patch and depth-based CNNs. In: IEEE Int. Joint Conf. Biometrics; 2017. p. 319–328.
  31. Jourabloo A, Liu Y, Liu X. Face de-spoofing: Anti-spoofing via noise modeling. In: Eur. Conf. Comput. Vision; 2018. p. 290–306.
  32. Liu Y, Jourabloo A, Liu X. Learning deep models for face anti-spoofing: Binary or auxiliary supervision. In: IEEE Conf. Comput. Vision Pattern Recognition; 2018. p. 389–398.
  33. Li H, He P, Wang S, Rocha A, Jiang X, Kot AC. Learning generalized deep feature representation for face anti-spoofing. IEEE Trans Inf Forensics Security. 2018;13(10):2639–2652.
  34. Boulkenafet Z, Komulainen J, Li L, Feng X, Hadid A. OULU-NPU: A mobile face presentation attack database with real-world variations. In: IEEE Int. Conf. Automat. Face Gesture Recognition; 2017. p. 612–618.
  35. Costa-Pazo A, Bhattacharjee S, Vazquez-Fernandez E, Marcel S. The REPLAY-MOBILE face presentation-attack database. In: IEEE Int. Conf. Biometrics Special Interest Group; 2016. p. 1–7.
  36. Iandola FN, Han S, Moskewicz MW, Ashraf K, Dally WJ, Keutzer K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360. 2016.
  37. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 2014.
  38. Parkhi OM, Vedaldi A, Zisserman A. Deep face recognition. In: Brit. Machine Vision Conf.; 2015. p. 41.1–41.12.
  39. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: IEEE Conf. Comput. Vision Pattern Recognition; 2015. p. 1–9.
  40. Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: A simple way to prevent neural networks from overfitting. J Machine Learning Res. 2014;15(1):1929–1958.
  41. Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323(6088):533–536.
  42. Kingma D, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.
  43. Lukas J, Fridrich J, Goljan M. Digital camera identification from sensor pattern noise. IEEE Trans Inf Forensics Security. 2006;1:205–214.
  44. Schroff F, Kalenichenko D, Philbin J. FaceNet: A unified embedding for face recognition and clustering. In: IEEE Conf. Comput. Vision Pattern Recognition; 2015. p. 815–823.
  45. Yang J, Lei Z, Yi D, Li SZ. Person-specific face antispoofing with subject domain adaptation. IEEE Trans Inf Forensics Security. 2015;10(4):797–809.
  46. Chingovska I, Dos Anjos AR. On the use of client identity information for face antispoofing. IEEE Trans Inf Forensics Security. 2015;10(4):787–796.
  47. Chingovska I, Anjos A, Marcel S. On the effectiveness of local binary patterns in face anti-spoofing. In: IEEE Int. Conf. Biometrics Special Interest Group; 2012. p. 1–7.
  48. Boulkenafet Z, Komulainen J, et al. A competition on generalized software-based face presentation attack detection in mobile scenarios. In: IEEE Int. Joint Conf. Biometrics; 2017. p. 688–696.
  49. ISO/IEC JTC 1/SC 37 Biometrics. Information technology – Biometric presentation attack detection – Part 3: Testing and reporting. International Organization for Standardization. 2017.
  50. Boulkenafet Z, Komulainen J, Hadid A. Face anti-spoofing based on color texture analysis. In: IEEE Int. Conf. Image Process.; 2015. p. 2636–2640.
  51. Li H, Li W, Cao H, Wang S, Huang F, Kot AC. Unsupervised domain adaptation for face anti-spoofing. IEEE Trans Inf Forensics Security. 2018;13(7):1794–1809.
  52. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: IEEE Conf. Comput. Vision Pattern Recognition; 2016. p. 2818–2826.
  53. Abadi M, Agarwal A, et al. TensorFlow: Large-scale machine learning on heterogeneous systems. arXiv preprint arXiv:1603.04467. 2016.