Human-algorithm teaming in face recognition: How algorithm outcomes cognitively bias human decision-making

In face recognition applications, humans often team with algorithms, reviewing algorithm results to make an identity decision. However, few studies have explicitly measured how algorithms influence human face matching performance. One study that did examine this interaction found a concerning deterioration of human accuracy in the presence of algorithm errors. We conducted an experiment to examine how prior face identity decisions influence subsequent human judgements about face similarity. 376 volunteers were asked to rate the similarity of face pairs along a scale. Volunteers performing the task were told that they were reviewing identity decisions made by different sources, either a computer or human, or were told to make their own judgement without prior information. Replicating past results, we found that prior identity decisions, presented as labels, influenced volunteers’ own identity judgements. We extend these results as follows. First, we show that the influence of identity decision labels was independent of indicated decision source (human or computer) despite volunteers’ greater distrust of human identification ability. Second, applying a signal detection theory framework, we show that prior identity decision labels did not reduce volunteers’ attention to the face pair. Discrimination performance was the same with and without the labels. Instead, prior identity decision labels altered volunteers’ internal criterion used to judge a face pair as “matching” or “non-matching”. This shifted volunteers’ face pair similarity judgements by a full step along the response scale. Our work shows how human face matching is affected by prior identity decision labels and we discuss how this may limit the total accuracy of human-algorithm teams performing face matching tasks.

We performed a repeated measures ANOVA, examining the main effects of survey variant and demographic groups (between subjects) and prior identity information (within subjects) on the average accuracy with which subjects performed the face recognition task using a response threshold of 0.5 (ACC0.5). We found significant main effects of gender (F(1,214) = 13.2, p = 0.0004) and age (F(1,214) = 4.3, p = 0.04) such that accuracy of female subjects (ACC0.5 = 0.77) was greater than accuracy of male subjects (ACC0.5 = 0.71) and accuracy of younger subjects (ACC0.5 = 0.76) was greater than accuracy of older subjects (ACC0.5 = 0.72). We found no effect of race on accuracy.
There was no significant interaction between gender or age and the source of prior identity information or whether the prior identity information was a match or no match. These data are shown graphically in Figures S1, S2, and S3 below.
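As a rough illustration of how ACC0.5 and the related rates can be derived from similarity ratings, the sketch below thresholds each rating at 0.5. It assumes ratings have been rescaled to the range 0 to 1 and that a rating strictly above the threshold counts as a "same" response; the function name `rates_at_threshold` and the rescaling are hypothetical choices for illustration, not details taken from the study.

```python
def rates_at_threshold(ratings, is_same, threshold=0.5):
    """Return (accuracy, TPR, FPR) for similarity ratings at a response threshold.

    ratings  : similarity ratings, assumed rescaled to [0, 1]
    is_same  : ground truth, True when the face pair shows the same person
    threshold: ratings strictly above this value count as a "same" response
               (the strict inequality is an assumption of this sketch)
    """
    tp = fp = tn = fn = 0
    for rating, same in zip(ratings, is_same):
        said_same = rating > threshold
        if same and said_same:
            tp += 1
        elif same:
            fn += 1
        elif said_same:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Rates are undefined when a class has no trials.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return accuracy, tpr, fpr


# Toy data: three mated pairs followed by three non-mated pairs.
acc, tpr, fpr = rates_at_threshold(
    [0.9, 0.6, 0.4, 0.7, 0.2, 0.1],
    [True, True, True, False, False, False],
)
```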

Shifts in criterion associated with prior identity information are consistent across demographic groups
We next examined whether the main effects described in our study were consistent across demographic groups. We performed repeated measures ANOVAs, examining the main effects of survey variant and demographic groups (between subjects) and prior identity information (within subjects) on the true positive rates (TPR) and the false positive rates (FPR) of the subjects using a response threshold of 0.5. In signal detection theory, these two rates determine the decision criterion such that increases in TPR and FPR are associated with a more permissive criterion and decreases in TPR and FPR are associated with a stricter criterion.
The consistency of the effects of prior identity information on subjects' decision criterion across demographic categories is summarized in Figure S4. Overall, criterion values increased (became more conservative) given "different" prior identity information and decreased (became more permissive) given "same" prior identity information, consistently for all demographic groups examined. Criterion values in the absence of prior identity information ("none") were generally intermediate.
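The relationship between TPR, FPR, and the decision criterion can be sketched with the standard equal-variance Gaussian signal detection model. The formulas below (sensitivity d' = z(TPR) - z(FPR) and criterion c = -(z(TPR) + z(FPR)) / 2) are the textbook definitions and are assumed here; the text above does not state the exact computation used in the study, and the function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and standard deviation 1


def sensitivity(tpr, fpr):
    """d' under the equal-variance Gaussian model: z(TPR) - z(FPR)."""
    return _STD_NORMAL.inv_cdf(tpr) - _STD_NORMAL.inv_cdf(fpr)


def criterion(tpr, fpr):
    """Criterion c: -(z(TPR) + z(FPR)) / 2.

    Positive values indicate a stricter (more conservative) criterion,
    negative values a more permissive one. Rates of exactly 0 or 1 are not
    handled here; inv_cdf raises an error for them, so they would need a
    correction before use.
    """
    return -0.5 * (_STD_NORMAL.inv_cdf(tpr) + _STD_NORMAL.inv_cdf(fpr))


# Illustrative (made-up) rates: both TPR and FPR rise in the first case and
# fall in the second, while sensitivity stays the same.
permissive = (0.90, 0.35)
strict = (0.65, 0.10)
c_permissive = criterion(*permissive)    # negative: more permissive
c_strict = criterion(*strict)            # positive: more conservative
d_permissive = sensitivity(*permissive)
d_strict = sensitivity(*strict)          # same d' as the permissive case
```

In this toy example the two rate pairs differ only in criterion, which mirrors the pattern described above: TPR and FPR move together while discrimination is unchanged.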

Other Race Effect
A well-known phenomenon in face perception is the "Other Race Effect", whereby subjects perform better on face discrimination tasks when the face pairs are of the same race as the subject than when the faces are of a different race (Walker and Tanaka 2003; Phillips et al. 2011). We examined whether subjects' performance on face pairs included in our face matching task was, in fact, better when the race of the subjects matched the race of the face pair.
Our face matching task included face image pairs from the Glasgow Face Matching Task (GFMT) as well as new face image pairs from the NIST Multiple Encounters Dataset (MEDS) of individuals labeled as 'Black'. The images selected from MEDS were of face pairs with no emotion and were cropped and converted to grayscale to match GFMT stimuli. Demographic information was not explicitly provided for face pairs sourced from the GFMT dataset; however, this dataset is widely understood to contain predominantly 'White' faces. We therefore grouped our subjects into those who self-identified as 'Black' and all others, 'Not Black' (Subject Race), and compared the performance of these groups for 'Black Faces' from the MEDS dataset and 'Not Black Faces' from the GFMT dataset (Other Race).
We performed a repeated measures ANOVA, examining the main effects of subject demographics (between subjects) and the race of the face pairs (within subjects) on the overall accuracy with which all subjects performed all face matching tasks using a response threshold of 0.5 (ACC0.5). As previously shown in Figure S1, we found significant main effects of gender (F(1,338) = 19.6, p < 0.0001) and age (F(1,338) = 5.9, p < 0.016), but no effect of race on ACC0.5. Additionally, there was a clear main effect of face pair race (F(1,339) = 226.8, p < 0.0001) such that face pairs created from the MEDS dataset (ACC0.5 = 0.86) were easier to discriminate than face pairs from GFMT (ACC0.5 = 0.68). This is not unexpected since MEDS face pairs were selected using a computer algorithm to determine similarity whereas GFMT face pairs were selected based on human raters. Importantly, we found a significant interaction between the race of the subject and the race of the face pairs (F(1,339) = 19.1, p < 0.0001) such that Black subjects performed better on Black face pairs (ACC0.5 = 0.88) relative to other subjects (ACC0.5 = 0.83) and worse on face pairs that were not Black (ACC0.5 = 0.66) relative to other subjects (ACC0.5 = 0.71). This shows that the other race effect was present in our data.
Figure S5. Improved accuracy for same-race face pairs. The left-hand plot shows the average accuracy (ACC0.5) for subjects on Black faces from the MEDS dataset. The right-hand plot shows accuracy for faces from the GFMT dataset. Note the relatively higher ACC0.5 of Black subjects for Black faces from the MEDS dataset and a reversal for the GFMT dataset. Numbers above each bar correspond to sample size. Error bars are 95% bootstrap confidence intervals.
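The Figure S5 caption reports 95% bootstrap confidence intervals. One common way to obtain such intervals is the percentile bootstrap over per-subject accuracies, sketched below. This is an assumption for illustration: the text does not say which bootstrap variant, number of resamples, or resampling unit was used, and the function name, defaults, and toy values are hypothetical.

```python
import random


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    values: per-subject accuracies (e.g. ACC0.5) within one group
    n_boot: number of bootstrap resamples (arbitrary choice for this sketch)
    alpha : 0.05 gives a 95% interval
    seed  : fixed so the sketch is reproducible
    """
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_boot):
        # Resample subjects with replacement and record the resampled mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Made-up per-subject accuracies for a single group, for illustration only.
group_acc = [0.60, 0.70, 0.80, 0.90, 0.75, 0.65, 0.85, 0.70]
ci_low, ci_high = bootstrap_ci(group_acc)
```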

SUPPLEMENTARY DISCUSSION
We observed no broad effects on accuracy based on subject race despite the presence of the other race effect in our data. This is likely because face pairs of both races were present within each survey variant, which balanced performance between race groups. Taken together with the lack of any effect of prior identity information and subject race on accuracy (Figure S3) and our robust observation that prior identity information affected response criterion independent of subject race (Figure S4), these findings suggest that the other race effect and the novel response bias introduced by prior identity information that we demonstrate in this work may be mediated by separate mechanisms. Indeed, the other race effect is thought to be due to perceptual learning (Walker and Tanaka 2003), which improves the neural processing of some stimuli over others, consistent with an improved sensitivity (Sagi 2011). The influence of prior identity information, on the other hand, appears restricted to the response criterion and independent of sensitivity. Thus, our study shows for our population that the effect of prior identity information on performance was distinct from both long-term perceptual enhancement and short-term attentional mechanisms.