How Well Do Computer-Generated Faces Tap Face Expertise?

Kate Crookes; Louise Ewing; Ju-dith Gildenhuys; Nadine Kloth; William G. Hayward; Matt Oxner; Stephen Pond; Gillian Rhodes

doi:10.1371/journal.pone.0141353

Abstract

The use of computer-generated (CG) stimuli in face processing research is proliferating due to the ease with which faces can be generated, standardised and manipulated. However there has been surprisingly little research into whether CG faces are processed in the same way as photographs of real faces. The present study assessed how well CG faces tap face identity expertise by investigating whether two indicators of face expertise are reduced for CG faces when compared to face photographs. These indicators were accuracy for identification of own-race faces and the other-race effect (ORE)–the well-established finding that own-race faces are recognised more accurately than other-race faces. In Experiment 1 Caucasian and Asian participants completed a recognition memory task for own- and other-race real and CG faces. Overall accuracy for own-race faces was dramatically reduced for CG compared to real faces and the ORE was significantly and substantially attenuated for CG faces. Experiment 2 investigated perceptual discrimination for own- and other-race real and CG faces with Caucasian and Asian participants. Here again, accuracy for own-race faces was significantly reduced for CG compared to real faces. However the ORE was not affected by format. Together these results signal that CG faces of the type tested here do not fully tap face expertise. Technological advancement may, in the future, produce CG faces that are equivalent to real photographs. Until then caution is advised when interpreting results obtained using CG faces.

Citation: Crookes K, Ewing L, Gildenhuys J-d, Kloth N, Hayward WG, Oxner M, et al. (2015) How Well Do Computer-Generated Faces Tap Face Expertise? PLoS ONE 10(11): e0141353. https://doi.org/10.1371/journal.pone.0141353

Editor: Alexandra Key, Vanderbilt University, UNITED STATES

Received: April 29, 2015; Accepted: October 6, 2015; Published: November 4, 2015

Copyright: © 2015 Crookes et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Data Availability: All data are available from the University of Western Australia Research Data Online database (https://researchdataonline.research.uwa.edu.au/handle/123456789/2021).

Funding: This research was supported by the Australian Research Council (ARC) Centre of Excellence in Cognition and its Disorders (CE110001021); ARC Professorial Fellowship to Rhodes (DP0877379); ARC Discovery Outstanding Researcher Award to Rhodes (DP130102300); and Hong Kong Research Council (HKU744911H) to Hayward.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Advances in technology have seen an increase in the use of computer-generated (CG) stimuli in face processing research in recent years. Artificial faces with a very human-like appearance can now be generated by a number of software programs with ease (either ‘from scratch’ or by inputting real photographs to be converted into 3-D head models). Different facial characteristics can be specified or varied when generating these faces including sex, age, ethnicity and attractiveness. Once generated, the faces can then be easily manipulated for facial expression and viewpoint. CG faces are also highly standardised in terms of lighting conditions, extra-facial information, size and image quality. All these factors make CG faces very appealing to face processing researchers, particularly given the limitations that existing databases of face photographs often impose on experimental design and the cost and time required to generate new photographic databases. However little is known about the validity of the CG faces being used in research, and it remains unclear whether, as stimuli, they are equivalent to photographs of real faces.

Humans are generally considered face experts, demonstrating remarkable abilities to extract a range of social information from faces. Despite little evidence regarding their validity, CG faces are being used to address important questions in face processing research. Examples include charting the developmental trajectory of face identity recognition [1], exploring the origins of race effects on face recognition [2], identifying the perceptual underpinnings of social judgements from faces such as trustworthiness [3], mapping the structure of face-space [4], investigating the types of faces for which there is special sensitivity to spacing between features in upright faces [5], and examining the category selectivity of neural responses to faces [6]. Results from such studies are being used to inform our understanding of how faces are processed and to develop and refine theories of face processing. It is therefore critically important to know the extent to which the CG faces being used in these studies truly allow for the demonstration of face expertise.

As the above examples attest, CG faces are being used to study a broad range of face processing abilities. The present study focussed on one aspect of face processing: namely the expert processing of face identity. Given the similarity between faces, our ability to efficiently discriminate between identities and accurately recognise many hundreds of familiar individuals is truly remarkable. It is generally agreed that this ability is supported by specialised face processing mechanisms (for review see [7]). However, this expertise is sensitive to deviations from the types of faces we are used to dealing with. For example people tend to demonstrate greater expertise, in the form of greater recognition accuracy, for faces of their own race than for faces from other races (for review see [8]). CG faces may represent another category of faces with which we are less expert.

CG faces that are currently being used in research are commonly generated by a program called FaceGen. They are remarkably human-like (see Fig 1 for examples), but they are certainly distinguishable from, and less familiar than, real photographs, which could potentially reduce their ability to engage face expertise. The most notable difference between these CG faces and face photographs is that the CG faces appear to lack fine-grained surface texture information and imperfections that are usually present in photographic face stimuli. This gives the impression of these faces being somewhat artificial and unreal. These CG faces may also lack animacy—the perception that a face belongs to a living being with a mind [9, 10]. Recent studies have found that behavioural and neural responses to faces are highly sensitive to animacy [10–13]. These clear differences raise the question of whether CG faces are processed in the same way as real faces and can allow for the full demonstration of face expertise.

Download:

Fig 1. Example Caucasian and Asian stimuli in the three formats used in Experiment 1: Real, CG_R, CG_A.

A slight view change was included between study and test (i.e., study faces = front view, test faces = 5° left or right). Note the same identities are depicted in the Real and CG_R conditions.

https://doi.org/10.1371/journal.pone.0141353.g001

Beyond giving the CG faces an unnatural appearance, the visible loss of surface texture information in these CG faces may also indicate a more primary issue with such stimuli. It could be that these faces lack vital information that is used by our face processing systems to recognise and discriminate faces. There is evidence that surface texture information is important for face recognition (e.g., [14, 15]). Similarly structural shape information (e.g., [14]) and information in certain frequency bands (e.g., [16–18]) have also been shown to be vital for recognition. If such diagnostic information is impoverished or absent in CG stimuli then they may not fully tap expert face recognition mechanisms.

In the area of identity recognition three studies have addressed the question of whether CG faces are processed like real faces. All three used FaceGen to generate stimuli. Matheson and McMullen [19] found that three hallmarks of face identity expertise—the other-race effect (ORE), the inversion effect and the reduction in the inversion effect for other-race compared to own-race CG faces—were present for randomly generated CG faces. These three effects reflect the fact that people tend to have the greatest expertise for, and therefore tend to be most accurate at recognising, upright own-race faces. They are less accurate at recognising faces presented upside-down (e.g., [20]) or faces from another race [8]. Given that people are less expert with other-race faces than own-race faces, inversion also has less of an effect on other-race faces [21–24]. Having demonstrated these key effects, Matheson and McMullen [19] concluded that CG faces are processed in a similar way to photographs of real faces and are therefore suitable for use in face research. Critically, however, methodological issues with their study permit other possible interpretations of the observed patterns.

First, the major critique, which affects the interpretation of all three results, is that a real face (i.e., photograph) condition was not included in the experiment. Therefore, it is impossible to know whether the observed effects were of the same strength as those that would be observed for real faces. The results do suggest qualitatively similar processing of real and CG faces. However to conclude that CG faces are equivalent to real photographs and allow for the full demonstration of face expertise it is necessary to not only qualitatively demonstrate key ‘expertise effects’ but also to show that these effects are quantitatively as strong for CG faces as for real faces. Second, only Caucasian participants were tested. This leaves open the possibility that rather than being an expertise effect, the observed ORE might have reflected stimulus effects, that is, the African American faces created for the study might have simply been more difficult to recognise than the Caucasian faces used. To rule out this possibility it is necessary to demonstrate the (reversed) ORE for the same stimuli with a group of African American participants. The results of Matheson and McMullen’s [19] study therefore do not provide clear evidence of full expert processing of CG faces. Additionally, Matheson and McMullen [19] did note that overall performance was particularly poor on their task (i.e., d' < 1.5). Their interpretation of this result was that the lack of distinctive elements (e.g., skin imperfections) in the CG faces poses a challenge to the visual system. This poor performance may in fact indicate a failure of CG faces to fully engage face expertise.

Papesh and Goldinger [4] also addressed the question of whether an ORE is observed in recognition memory for CG faces in the course of validating stimuli for another study. Here, performance for CG and real faces was directly compared, but the CG faces had been created from the real photographs rather than randomly generated. There was no indication that the ORE was any smaller for CG than real faces. Papesh and Goldinger [4] took this as evidence that CG faces were appropriate substitutes for real face photographs. However, as in Matheson and McMullen [19], only one race of participants was tested, leaving open the possibility that difference between own- and other-race faces was a stimulus effect rather than an expertise effect. Memory accuracy was also numerically poorer for CG than real faces, which might again potentially reflect a lack of expertise for CG faces. Therefore the critical question of whether CG faces fully quantitatively recruit expert face processing mechanisms remains open.

More recently Balas and Pacella [25] compared performance for photographs of real faces to CG versions of the same faces on a recognition memory and a face matching task. They reported that recognition memory accuracy was significantly worse for CG faces compared to real faces and concluded that CG faces are harder to remember. On the face matching task performance was also significantly poorer for CG compared to real faces but this effect was very small (<2%). Importantly, counter to the notion of diminished expert processing for CG faces, Balas and Pacella [25] found no reduction in the size of the inversion effect for the discrimination (inversion was not tested for the memory task) of these stimuli compared to real faces. However accuracy on this task was exceptionally good even in the inverted conditions (approximately 90%), which may have precluded identification of a larger inversion effect in the real faces condition.

There are a number of potential indicators of face identity expertise. The present study investigated two important markers that have been identified in the previous literature: accuracy of own-race face recognition and the ORE. These markers of expertise were tested with regards to both recognition memory (Experiment 1) and perceptual discrimination (Experiment 2). We compared performance with CG faces to that with photographs of real faces in order to detect any potential reduction in own-race accuracy or in the ORE. We also tested both Asian and Caucasian participants with Asian and Caucasian face stimuli to rule out differences between face sets as the source of any ORE. Importantly, unlike the previous three studies [4, 19, 25] which all used identical images at study and test, here we included a viewpoint change to ensure we were testing higher-level face recognition rather than low-level image matching.

If CG faces are equivalent to real photographs and allow participants to fully demonstrate their face expertise, then we expect no differences between CG and real faces in either own-race face accuracy or the magnitude of the ORE. However, if CG faces fail to fully recruit face expertise, then we expect to observe a reduction in own-race face accuracy and reduction of the ORE for CG faces compared to real photographs. This prediction for the ORE is based on the idea that any reduction in expertise would have the greatest effect on the faces we are most expert with—own-race faces. It may even be the case that CG faces do not recruit face expertise at all, in which case we would expect the ORE to be eliminated.

Experiment 1: Recognition memory

In Experiment 1 we tested old/new recognition memory for Caucasian and Asian faces presented in three formats: Real face photographs, CG-Real (CG_R) faces and CG-Artificial (CG_A) faces. The two CG formats were chosen because they represent the two types of CG faces that have been used in previous studies. CG_A faces were randomly generated by the software. This type of CG face is the most common in the literature (e.g., [2, 19, 26–28]). CG_R faces were generated by importing the photographs from the Real condition into the software to produce CG versions of the Real faces, (e.g., [4, 25]). Including both CG formats provides a thorough test of the usefulness of CG face stimuli. CG_R faces may also provide a fairer comparison to the Real faces than the arbitrarily generated CG_A faces. Assuming 100% fidelity in the conversion process the CG_R faces should be matched to the Real faces for within-set heterogeneity. As can be seen in Fig 1 the CG_R faces retain some of the imperfections of the Real faces, but still lack some fine-grained texture information and may give a weaker impression of animacy. Texture information was not applied to the CG_A faces as this has not been routinely done in previous studies. The CG_A faces therefore have a uniformly smooth appearance.

To recap, if CG faces do not fully recruit face expertise then we expect recognition of own-race faces to be less accurate for CG than Real faces and the ORE in recognition memory to be reduced for CG faces.

Method

Ethics Statement

The study was approved by the Human Research Ethics Committee at the University of Western Australia and the University of Hong Kong. All participants provided written consent prior to their participation in the project.

Participants

Caucasian participants were 36 students (17 male; Age: Mean = 20.5, SD = 3.9) at the University of Western Australia. Asian participants were 35 students or staff (10 male; Age: Mean = 20.5, SD = 1.9) at the University of Hong Kong. Participants received either course credit (Caucasian participants) or HK$40 (approximately US$5) for the 40 minute experiment.

Stimuli

There were three different formats of face stimuli: Real, CG_R, CG_A. Each format consisted of 80 young adult males (40 Caucasian and 40 Asian) with neutral expressions (see Fig 1 for examples). There were two versions of each face, one in front view and one facing 5° left or right.

Real faces.

The 40 Caucasian faces were photographs taken at the University of Western Australia. The 40 Asian faces were all ethnically Chinese and photographed in Hong Kong [29].

CG_R faces.

CG versions of each of the faces from the Real condition were created using the “Photofit” function of FaceGen Modeller 3.5.3. This process involved digitally placing markers at landmark points (e.g., bridge of nose, corner of mouth) on the front and profile views of the original faces (11 markers for front view, 9 markers for profiles). These points were then used to import the face into FaceGen integrating information from the front and profile views to create a 3D model of the head from which 2D images were exported.

CG_A faces.

FaceGen was also used to randomly generate a set of 40 Caucasian and a set of 40 East Asian young adult (approx. 25), male faces. The gender and age settings were locked at the same levels for all faces across both races of face. All faces had the same lighting conditions and no texture information was applied. Controllers for expressions, muscle modifiers (e.g., brow position) and phonemes (i.e., mouth shape) were set to zero (i.e., neutral expression).

The stimuli were edited and standardised using Adobe Photoshop CS3. All the faces were resized to have an inter-pupil distance of 80 pixels. Hair (and, in the CG conditions, bald head) information was masked with a black oval. Chin shape and some neck information was retained but all clothing was masked. None of the faces had facial hair and the stimuli were edited to remove any obvious distinguishing marks (i.e., blemishes, scars, moles). All faces were presented in colour. At the viewing distance of approximately 50 cm stimuli subtended a visual angle of approximately 5.4° × 6.6°.

Real and CG_R formats consisted of the same face identities, which were split into two sets of 20 faces for each race (i.e., Set A and Set B) such that Set A contained the same identities in each format. Each participant saw one set (e.g., Set A) in the Real condition and the other set (e.g., Set B) in the CG_R condition. Assignment of the face sets to conditions was counterbalanced across participants. For consistency, the faces in the CG_A condition were also split into two sets (e.g., Set C and Set D). Half the participants saw one set (e.g., Set C) and the other half saw the other set (e.g., Set D).

To ensure that the task tapped face recognition rather than picture recognition there was a slight viewpoint change between study and test. At study all faces were presented in front view. At test the faces were shown facing 5° to the left or right. Of the old faces, half faced to the right and half to the left. Similarly half of the new faces faced to the right and half to the left.

Procedure

Stimuli were presented using SuperLab 4.0 (Cedrus Corporation, California) on 21.5 inch iMac computers. Both face format (Real, CG_R, CG_A) and race of face (Caucasian, Asian) were blocked and manipulated within participants. Each participant, therefore, completed 6 study-test cycles. The three format blocks for each race of face were completed consecutively (e.g., Asian Real, Asian CG_R, Asian CG_A,Caucasian Real, Caucasian CG_R, Caucasian CG_A) with the race of face that was completed first counterbalanced across participants. Order of format blocks was counterbalanced across participants according to a Latin square.

Participants were informed that the task would test their memory for faces. They were instructed to concentrate on the study faces carefully because they would see different versions of the faces at test. Within blocks, each study phase was initiated by the participant via a key press. In each study phase, 10 front-view faces were presented sequentially in the centre of the screen for 3000ms each. Each face was followed by a blank screen for 500ms. The order in which the faces were presented was randomised for each participant. The same study faces were then presented a second time (3000ms each), in a different random order. Immediately following the study phase, participants initiated the test phase with a key press. In the test phase, 20 faces (10 “old” studied faces, 10 “new” unstudied faces) were presented sequentially and remained on-screen until response. Participants pressed labelled keyboard keys to indicate whether they thought each face was “old” or “new”. Responses immediately triggered the next trial. Test faces appeared in a different random order for each participant. To familiarise participants with the procedure they first completed a practice block consisting of 6 study and 12 test faces. Practice faces were characters from television cartoon The Simpsons.

Following the memory experiment participants completed a racial background and contact questionnaire adapted from Hancock and Rhodes [21]. Participants reported their ethnicity and rated their agreement with 7 statements about each race (e.g., “I know lots of Asian [Caucasian] people”) on a 6-point scale (1 = very strongly disagree; 6 = very strongly agree). Finally participants’ experience with CG faces was assessed using a questionnaire developed for this study. Participants rated their level of agreement with 5 statements (e.g., “I play video and/or computer games that contain computer generated faces) using the same 6-point scale as in the race questionnaire.

Results and Discussion

Contact

Self-reported contact with own-race, other-race and CG faces was calculated as the mean of the contact ratings for each type of face (see Table 1). As expected both groups of participants reported significantly greater contact with own-race than other-race faces: Caucasian participants, t(35) = 9.85, p <.001,Cohen’sd= 1.64; Asian participants, t(34) = 7.23, p <.001,Cohen’s d = 1.22. There was no difference between the Caucasian and Asian participants in reported experience with CG faces, t(64.9) = 0.43 p = .672,Cohen’s d = 0.10.

Download:

Table 1. Experiment 1: Mean (SD) self-reported contact with own-race, other-race and CG faces.

https://doi.org/10.1371/journal.pone.0141353.t001

Recognition accuracy

Accuracy was measured for each condition using the signal detection measure d' (see Table 2), calculated according to the standard formula d' = z(hits)–z(false alarms). We defined hits as correctly responding “old” to studied items and false alarms as incorrectly responding “old” to unstudied items. Hit and false alarm rates of 0 and 1 were replaced using the conventional formulas 1/(2N) and 1–1/(2N) respectively, where N is the maximum number of hits or false alarms [30]. Proportions of hits and false alarms are available in the supplementary materials (Table A in S1 File). The following analyses address the questions of whether own-race face recognition and the ORE are reduced for CG faces in turn.

Download:

Table 2. Mean (SD) face recognition accuracy (d') as a function of participant race, race of face and face format.

https://doi.org/10.1371/journal.pone.0141353.t002

Are own-race CG faces recognised less accurately than real photographs?

Analysis of d' for own-race faces, the condition for which all participants should be experts, showed that CG faces were recognised less accurately than Real faces (see Fig 2). A mixed model ANOVA with format (Real, CG_R, CG_A) as the within participants factor and participant race (Caucasian, Asian) as the between participants factor revealed a significant main effect of format, F(2,138) = 38.01, MSE = 0.48, p <.001,η_p² = .36.Real faces were recognised significantly more accurately than CG_R, t(70) = 5.08, p <.001,Cohen’s d = 0.60, and CG_A faces, t(70) = 7.65, p <.001,Cohen’s d = 0.91. CG_R faces were also recognised more accurately than CG_A faces, t(70) = 4.53, p <.001,Cohen’s d = 0.54. Reduced accuracy for the CG_R and CG_A faces is consistent with reduced expertise for CG faces. Finally, there was no main effect of participant race, F(1,69) = 0.01, MSE = 0.98, p = .944,η_p² = .00,and no interaction, F(2,138) = 0.63, MSE = 0.48, p = .534,η_p² = .01.

Download:

Fig 2. Experiment 1: recognition accuracy for own-race faces in the three face format conditions collapsed across race of participant.

Error bars show ± 1 SEM. *** = p < .001.

https://doi.org/10.1371/journal.pone.0141353.g002

Is the ORE reduced for CG faces?

To compare the size of the ORE across format conditions an ORE score was calculated as d' own-race minus d' other-race. The ORE was reduced or eliminated for CG faces compared to Real faces (see Fig 3). As shown in Fig 3 this reduction was particularly evident in the CG_A condition, that is, for the type of CG face most widely used in face processing research. To confirm the observed differences in ORE a mixed model ANOVA was performed on the ORE scores with format (Real, CG_R, CG_A) as a within participants factor and participant race (Caucasian, Asian) as a between participants factor. There was a significant main effect of both format, F(2,138) = 6.33, MSE = 0.81, p = .002,η_p² = .08,and participant race, F(1,69) = 4.14, MSE = 0.82, p = .046,η_p² = .06. These effects were qualified by a significant format x participant race interaction, F(2,138) = 3.42, MSE = 0.81, p = .036,η_p² = .05.

Download:

Fig 3. Experiment 1: Other-race effect (d' own-race minus d' other-race) as a function of format for A. Asian participants and B.Caucasian participants.

Results of one sample significance tests of the ORE are shown at the base of the bars. Error bars show ± 1 SEM. *** = p < .001,** = p <.01,+= p = .07.

https://doi.org/10.1371/journal.pone.0141353.g003

To explore this interaction we conducted separate one-way ANOVAs for each race of participant. For Asian participants (Fig 3A) there was a significant main effect of format, F(2,68) = 7.07, MSE = 0.71, p = .002,η_p² = .17.Compared to Real faces the ORE was significantly smaller in both the CG formats: Real vs. CG_R, t(34) = 3.46, p = .001,Cohen’s d = 0.58; Real vs. CG_A, t(34) = 2.85, p = .007,Cohen’s d = 0.48. The size of the ORE was not significantly different between the two CG conditions, t(34) = 0.63, p = .533,Cohen’s d = 0.11. These results for Asian participants thus provide evidence that the ORE is reduced for CG compared to Real faces and suggest that CG faces do not fully recruit face expertise.

For Caucasian participants (Fig 3B) the main effect of format was only marginally significant, F(2,70) = 3.10, MSE = 0.89, p = .051,η_p² = .08. This result suggests that the size of the ORE was not smaller for CG compared to real faces for Caucasian participants.

If CG faces fail to recruit face expertise at all then we would expect the ORE to be absent for CG faces. To test this prediction one-sample t tests comparing the ORE to zero were performed for each format condition separately for each race of participant. Firstly we confirmed that the ORE was significant in the real condition for both Asian participants, t(34) = 7.39, p < .001,Cohen’s d = 1.25, and Caucasian participants, t(35) = 3.10, p = .004,Cohen’s d = 0.52. In the CG_R condition the ORE was significantly different from zero for the Caucasian participants, t(35) = 3.02, p = .005, Cohen’s d = 0.50, but was not for Asian participants, t(34) = 1.85, p = .073, Cohen’s d = 0.31. In the CG_A condition the reverse was true, the ORE was significantly different from zero for the Asian participants, t(34) = 2.88, p = .007, Cohen’s d = 0.49, but not for the Caucasian participants, t(35) = 0.05, p = .96, Cohen’s d = 0.01. These results provide evidence that the ORE was eliminated for CG faces in some cases but not in others.

Overall poor own-race face recognition and reductions in the ORE signal that CG faces may not fully recruit face expertise. This was particularly the case for CG_A faces, which are the more common class of CG faces used in face processing research.

Experiment 2: Perceptual discrimination

In Experiment 2 we investigated whether the reduced own-race accuracy and reduced ORE for CG faces observed in Experiment 1 are restricted to recognition memory or also extend to perceptual discrimination. We used a simultaneous matching task in which participants had to match a target presented at the top of the screen to the same face identity in an array of 10 faces presented below the target [31]. On half the trials the target was not present in the array. A perceptual matching task including target absent trials was used to increase the difficulty of the task and because this task yields a clear ORE [22].

We compared matching performance for Real and CG_R faces. Given that these two formats contain the same identities the assignment of the same arrays to the Real or CG_R format could be counterbalanced across participants. The heterogeneity within the arrays has the potential to greatly affect accuracy on this task and could not be controlled at all in the CG_A format. If CG faces fail to fully recruit face expertise we expect to see reduced accuracy and a reduced ORE in the CG compared to the Real face condition.