The influence of facial variability
Real-life identification tasks rarely involve comparing a person to a photo ID taken that same day. Further, identity screeners see people from all over the world who do not share a high degree of visual similarity with one another. Consequently, many identity-matching studies underscore the need to control both within-person and between-person variability in ways that map more closely onto these real-world conditions (Burton, 2013; Megreya, Sandford, & Burton, 2013). Although constraining experimental materials might increase internal validity, it might also produce outcomes that drastically underestimate the facial variability encountered outside the laboratory and, therefore, limit generalizability.
The Glasgow Unfamiliar Face Database (GUFD; Experiment 1) includes images taken on the same day, which limits within-person variability, of mostly (if not exclusively) young, light-skinned individuals, which limits between-person variability. This database’s limited facial variability likely affected participants’ criteria based on information at the item level (i.e., considering only the image pair presented on screen) and the series level (i.e., considering information across successive trials). At the item level, low within-person variability may have contributed to the significantly higher accuracy for match than mismatch trials overall. It stands to reason that images taken approximately 15 min apart would bear a striking resemblance to one another across a variety of both noticeable (e.g., hairstyle) and subtle (e.g., skin luminance) visual cues. Therefore, the low degree of within-person variability increases the number of matching cues across images. At the series level, such unrealistically high similarity between match pairs likely made mismatch cues more distinctive by contrast (i.e., a more obvious difference in the context of similarity; e.g., Hunt, 2006). Mismatches may have popped out more than would be expected in a more variable image set, thereby reducing participants’ tolerance for perceptual differences as they calibrated their expectations for natural day-to-day variation in a person’s appearance (see also Menon, White, & Kemp, 2015a, for a targeted manipulation of expected identity variation that produced findings consistent with our rationale).
Previous identity-matching studies in the low-prevalence literature confirm that stimulus variability can alter the interpretation of results. For example, Bindemann, Avetisyan, and Blackwell (2010) also used the GUFD to compare identity-matching performance across five experiments. Mismatch prevalence was varied in the first 49 of 50 trials, such that participants saw either 24 mismatches (high prevalence) or 0 mismatches (low prevalence). When the authors compared performance on a final critical mismatch trial, several variations of the study confirmed that the high mismatch prevalence group committed more mismatch errors. On the face of it, these results suggest that low mismatch prevalence (2%) did not produce the classic LPE. However, we caution against that interpretation on the basis of how the mismatch trials were selected. The authors strategically selected critical mismatch image pairs with higher similarity ratings (M = .56) than the noncritical mismatch image pairs (M = .20). In other words, participants in the high-prevalence group saw more obvious mismatches, which increased their likelihood of missing a less obvious mismatch.
More recent work has added to our understanding of the LPE by using face databases containing images taken multiple days (if not years, in some instances) apart (Papesh & Goldinger, 2014; Papesh et al., 2018), and another recent study (Susa, Michael, Dessenberger, & Meissner, 2019) also found the LPE. This latter study used images specifically designed to test cross-race influences and therefore captured a wider degree of within- and between-person variability. Accordingly, we considered it important to use an image set that better represents real-world facial variability. Experiments 2 and 3 drew on a racially diverse database with multiple images of each identity taken with different cameras at different times.
The influence of differing prevalence rates
Although our low mismatch prevalence group did experience fewer mismatch trials overall, a 20% mismatch prevalence rate is still far higher than in real-world settings. Although no exact figure exists, one could estimate that only a very small percentage (i.e., < 1%) of passengers present a fake ID, making it considerably rarer than we have accounted for here. Nevertheless, participants in both the high and low mismatch prevalence groups were sensitive to the imbalance, which suggests that prevalence effects follow a continuous function. Put another way, participants made more errors on whichever type of trial was relatively rarer (either 20% mismatches or 20% matches). Therefore, if participants were sensitive enough to modify their decisions in response to an 80/20 imbalance, then our results likely underestimate the errors one could expect in a typical identity-screening scenario. Studies outside the facial recognition literature support this interpretation by confirming that a greater degree of imbalance increases the magnitude of the LPE (Mitroff & Biggs, 2014; Wolfe & Van Wert, 2010). Although an ultra-rare prevalence condition would introduce its own set of problems in this particular task, Experiments 2 and 3 adopted a greater degree of imbalance (i.e., 90/10 and 10/90) to compare against an equal-prevalence group.
The influence of feedback
Although feedback in real-world settings is rare, it is naturally skewed toward combating erroneous rejections of genuine IDs rather than missed fake IDs, for two primary reasons. First, most IDs that screeners check are genuine and presented by their rightful owner (a fact also responsible for the LPE). Second, screeners in many settings (e.g., airport security, liquor store cashiers) may never be made aware that they accepted a fake ID because they are unlikely to encounter that individual again. However, they are made aware when they erroneously reject an authentic ID if the individual is able to provide alternative means of identification. Therefore, investigating the effect of feedback is crucial when considering empirically driven training regimens designed to reduce the LPE, particularly if the LPE persists under these real-world conditions. Because the facial-identification LPE literature is fractured both in its use of feedback and in its findings, some overview of the efficacy of feedback interventions on learning is in order before we proceed to our specific predictions.
The position that feedback improves performance dates back in psychological science to the earliest days of behaviorism (e.g., Thorndike’s (1927) Law of Effect). Many behaviorists eschewed theory, so the benefits of feedback were often taken at face value. Kluger and DeNisi’s (1996) Feedback Intervention Theory provides a framework from which we can make predictions relevant to security screening. Feedback Intervention Theory presupposes that five components are required for feedback to modify performance at a given task. First, a gap exists between the performance upon which feedback is given and the “standard” (i.e., the desired level of performance). Second, the various goals related to task performance are organized hierarchically. Third, feedback can only regulate future behavior when the gap between current performance and the standard receives the individual’s attention. Fourth, attention moderates the ranking that a particular standard has in the goal hierarchy. Fifth and finally, feedback interventions affect behavioral outcomes by shifting attention within this hierarchy, thereby reordering the various goals.
The success or failure of a particular feedback intervention relies on the specific aspects of the task to which feedback draws attention. Given this, Kluger and DeNisi (1996) concluded that feedback exerts its greatest influence when tasks are sufficiently challenging yet concrete (i.e., tasks that are too easy, too difficult, or too nebulous are unlikely to benefit), and when it focuses attention toward cues related to the task’s standard rather than to the individual (e.g., mere praise or admonishment may alter feelings of self-efficacy, but they do not necessarily affect performance).
Central to the current studies is the differential attention paid to match and mismatch cues observable in faces presented side by side, and how feedback affects where these cues lie within the matching task’s goal hierarchy. Both cue types are dichotomous within and between face-identity images (i.e., a facial feature can only match or mismatch). The observer, then, must decide whether match cues outweigh mismatch cues when judging whether two face images belong to the same identity. According to Feedback Intervention Theory, feedback would operate in a face-matching task by shifting attention within the goal hierarchy toward cues that are most likely to match between face images belonging to the same identity, thereby making the visual system more sensitive to within-person variability.
Indeed, multiple studies demonstrate that feedback improves unfamiliar face-matching performance (Alenezi & Bindemann, 2013; White, Kemp, et al., 2014). However, the opposite influence of feedback has also been argued (Papesh et al., 2018). To date, the facial-identification paradigms that failed to find an effect of low mismatch prevalence (Bindemann et al., 2010; Stephens, Semmler, & Sauer, 2017) did not incorporate trial-by-trial feedback. Under such conditions, Feedback Intervention Theory predicts that the absence of feedback would fail to draw attention to the imbalanced trial types, leaving the cue hierarchy unaltered. Participants may not even be explicitly aware of the different mismatch prevalence rates and may simply assume they are performing the task successfully. To test this possibility more directly, we explored feedback more fully with our modified paradigm, using image sets with high facial variability.
Predictions
If more realistic facial variability (as would be the case with a greater lapse of time between images of a diverse group of people) also increases the difficulty of the task and exacerbates the LPE, then discriminability should decrease when mismatches are either infrequent or frequent (compared with when matches and mismatches are balanced). We may also see evidence of criterion shifting, as a greater degree of within-person variability might interact with prevalence to shift the criterion even more liberally under low mismatch prevalence, and even more conservatively under high mismatch prevalence, than with more similar-looking match pairs. In contrast, if a wider degree of between-person and within-person variability does not interact with mismatch prevalence, then we expect to replicate the criterion shifting seen in Experiment 1 but not necessarily see differences in discriminability by mismatch prevalence.
With regard to our predictions about feedback, Feedback Intervention Theory predicts that varying mismatch prevalence will interact with the effect of feedback on mismatch accuracy in a fairly straightforward way: low mismatch prevalence within a set of trials yields fewer opportunities to make mismatch errors and, therefore, fewer opportunities to modify the standard cue hierarchy toward attending to mismatch cues. This pattern should hold, and reduce discriminability, under either imbalanced mismatch prevalence rate. When mismatch prevalence is low, feedback will increase the weight of match cues in the hierarchy, resulting in imbalanced performance favoring match trials (but reduced overall discriminability). When mismatch prevalence is high, feedback will increase the weight of mismatch cues in the hierarchy, resulting in imbalanced performance favoring mismatch trials (but reduced overall discriminability).
Method
Participants
Undergraduate students (N = 83) participated in the experiment (Mage = 24.1 years; 62 female) in exchange for partial course credit. Power analyses confirmed the sufficiency of this sample size for all omnibus tests (i.e., 1 − β > .88). Self-reported race reflected a diverse sample (7 Black/African American, 17 White/Caucasian, 54 Hispanic/Latino, 1 Asian/Pacific Islander, and 3 other; 1 participant did not respond). All participants reported normal or corrected-to-normal vision.
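As a rough illustration of how such a power estimate could be checked, the sketch below computes power for the between-subjects omnibus F-test using statsmodels; the assumed Cohen's f is hypothetical and not taken from the original power analysis, which may have used different software and assumptions.

```python
# Rough power check for the between-subjects omnibus F-test across the
# 3 x 3 = 9 between-participants cells; the effect size is a hypothetical
# assumption, not a value from the original analysis.
from statsmodels.stats.power import FTestAnovaPower

N = 83            # total sample size reported above
k_groups = 9      # 3 mismatch prevalence x 3 feedback cells
effect_f = 0.45   # hypothetical Cohen's f (large effect)

power = FTestAnovaPower().solve_power(
    effect_size=effect_f, nobs=N, alpha=.05, k_groups=k_groups
)
print(f"Estimated power: {power:.2f}")
```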
Materials
For Experiments 2 and 3, we used a face database containing a complete collection of images for each of 100 unique identities between the ages of 18 and 30 years, with ethnic/racial categories aligned with the 2010 U.S. Census (Selfies for Science; Weatherford, Ottoson, Cocherell, & Erickson, 2016; used with permission). To systematically control non-face image properties, acceptable photographs were cropped to a standard size and minor artificial features (e.g., earrings) were naturalistically removed using Adobe Photoshop CS7. Front-facing static images for each identity included (1) a high-resolution image taken with a neutral expression in front of a blue background, (2) a student ID photograph taken on a different day with a different camera, and (3) a participant-submitted ambient facial image (i.e., a selfie) that included the full face, had no filters or digital alterations, and was taken at least 1 year prior to the high-resolution controlled image. To create plausible mismatch trials, identities were paired using similarity ratings provided by an independent group of raters. Match and mismatch identities were fully counterbalanced, and no image repeated across trials.
Design and procedure
The experiment used a 3 (mismatch prevalence: high, 90%; medium, 50%; or low, 10%) × 3 (feedback: error-only, full, or none) between-participants factorial design. Participants made 100 untimed decisions about whether a target image (a high-resolution controlled image) depicted the same person as an image embedded in an ID card (a student ID image). The procedure was identical to Experiment 1 with the exception of the feedback manipulation. In the error-only feedback condition, participants viewed penalty screens as described in Experiment 1. In the full-feedback condition, participants viewed a 2.5-s feedback screen after every trial. In the no-feedback condition, participants viewed a 2.5-s black inter-stimulus interval screen after each trial.
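To make the prevalence manipulation concrete, the sketch below illustrates how a 100-trial sequence could be constructed for each mismatch prevalence group; the function name and randomization details are our own illustration, not the authors' stimulus-assignment procedure.

```python
# Illustrative construction of a 100-trial sequence for one participant,
# given the stated mismatch prevalence rates; randomization details are
# assumptions for the sake of the example.
import random

def build_trial_sequence(prevalence, n_trials=100, seed=None):
    """Return a shuffled list of trial types for a mismatch prevalence group."""
    mismatch_rates = {"high": 0.90, "medium": 0.50, "low": 0.10}
    n_mismatch = round(mismatch_rates[prevalence] * n_trials)
    trials = ["mismatch"] * n_mismatch + ["match"] * (n_trials - n_mismatch)
    random.Random(seed).shuffle(trials)
    return trials

# Example: a low-prevalence participant sees 10 mismatch and 90 match trials.
sequence = build_trial_sequence("low", seed=1)
print(sequence.count("mismatch"), sequence.count("match"))  # -> 10 90
```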
Results
Accuracy
We analyzed our data using a 2 (match type: match, mismatch; within participants) × 3 (mismatch prevalence: high, medium, low; between participants) × 3 (feedback: error-only, full, none; between participants) mixed-design ANOVA. For accuracy (Fig. 3), there was a main effect of match type, F(1, 74) = 20.35, p < .001, ηp² = .216, and a main effect of mismatch prevalence, F(2, 74) = 6.99, p = .002, ηp² = .159, but no main effect of feedback, F(2, 74) = .33, p = .717. However, these main effects were qualified by a three-way interaction, F(4, 74) = 9.36, p < .001, ηp² = .336 (figure available in “Additional file 1”). Tests of the simple main effect of mismatch prevalence on errors within each level of feedback revealed a simple main effect within error-only feedback, F(2, 74) = 3.14, p = .049, ηp² = .078, and within full feedback, F(2, 74) = 3.39, p = .039, ηp² = .084. The no-feedback condition yielded no simple main effect of mismatch prevalence.
These results followed the same pattern as observed in Experiment 1. In line with the predictions of Feedback Intervention Theory, feedback in the imbalanced conditions improved performance on the frequent trial type (i.e., mismatch trials under high mismatch prevalence and match trials under low mismatch prevalence) at the expense of the rare trial type.
Signal detection measures
As with Experiment 1, we also considered the criterion-shift explanation of the LPE. Signal detection measures are represented graphically in Fig. 1. For d′, a between-subjects ANOVA revealed a main effect of mismatch prevalence, F(2, 74) = 5.16, p = .008, ηp² = .122, but no main effect of feedback, F(2, 77) = 1.21, p = .304, ηp² = .032. The interaction between feedback and mismatch prevalence did not reach significance. For C, a between-subjects ANOVA revealed a main effect of mismatch prevalence, F(2, 74) = 20.98, p < .001, ηp² = .362, but no main effect of feedback, F(2, 74) = .572, p = .567, ηp² = .015. However, these main effects were qualified by an interaction between mismatch prevalence and feedback, F(4, 74) = 9.61, p < .001, ηp² = .342. Simple main effects tests of mismatch prevalence within each level of feedback revealed a simple main effect of mismatch prevalence within error-only feedback, F(2, 74) = 18.24, p < .001, ηp² = .330, and also within full feedback, F(2, 74) = 21.90, p < .001, ηp² = .372. The no-feedback condition yielded no simple effect of mismatch prevalence.
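For readers unfamiliar with these measures, the sketch below computes d′ and C from hit and false-alarm rates; the response coding (treating “match” responses on match trials as hits) is our assumption, and the example rates are hypothetical rather than data from the experiment.

```python
# Minimal computation of d-prime and criterion C from hit and false-alarm
# rates; the coding of "match" responses as hits is an assumption, and the
# example rates are hypothetical.
from scipy.stats import norm

def sdt_measures(hit_rate, fa_rate):
    """Return (d-prime, criterion C) given hit and false-alarm rates."""
    z_hit, z_fa = norm.ppf(hit_rate), norm.ppf(fa_rate)
    d_prime = z_hit - z_fa              # sensitivity
    criterion = -0.5 * (z_hit + z_fa)   # under this coding, negative C = liberal bias toward "match"
    return d_prime, criterion

print(sdt_measures(hit_rate=0.85, fa_rate=0.30))  # approx (1.56, -0.26)
```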
Area under the curve
Due to the inordinately large number of comparisons in our complete design crossing all mismatch prevalence and feedback variations, we collapsed across feedback conditions to allow a more straightforward comparison with Experiment 1’s results. As seen in Fig. 4, overall discriminability was reduced compared with the high performance in Experiment 1. For the comparison between low and medium mismatch prevalence, the area spanned by low prevalence (pAUC = .42) was smaller than that for medium prevalence (pAUC = .48), D = 4.21, p < .001. For the comparison between low and high mismatch prevalence, low prevalence spanned a smaller area (pAUC = .34) than high prevalence (pAUC = .39), D = 2.38, p = .006. The medium and high conditions did not differ, p > .2.
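As a rough illustration of the pAUC measure, the sketch below computes the area under an ROC curve over a restricted false-alarm range from simulated trial-level scores; the scoring variable, the false-alarm bound, and the libraries are assumptions, and the significance test behind the reported D statistics is not reproduced here.

```python
# Illustrative partial AUC (pAUC) over a restricted false-alarm range using
# simulated scores; the false-alarm bound and scoring variable are assumptions
# and do not reproduce the reported analysis.
import numpy as np
from sklearn.metrics import roc_curve

def partial_auc(y_true, y_score, max_fpr=0.5):
    """Area under the ROC curve between FPR = 0 and FPR = max_fpr."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    grid = np.linspace(0, max_fpr, 200)        # evaluation points up to the bound
    tpr_interp = np.interp(grid, fpr, tpr)     # interpolate the ROC at those points
    # Trapezoidal integration over the restricted range.
    return float(np.sum(np.diff(grid) * (tpr_interp[1:] + tpr_interp[:-1]) / 2))

# Simulated example: 50 match trials (label 1) and 50 mismatch trials (label 0).
rng = np.random.default_rng(0)
y_true = np.repeat([1, 0], 50)
y_score = np.concatenate([rng.normal(1.0, 1.0, 50), rng.normal(0.0, 1.0, 50)])
print(round(partial_auc(y_true, y_score), 3))
```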
Discussion
Similar to Experiment 1, we found support for criterion shifting. Unlike Experiment 1, Experiment 2’s paradigm produced differences not only in criterion but also in discriminability by mismatch prevalence. These findings align well with other recent LPE studies in the facial-identification literature (e.g., Papesh et al., 2018; Papesh & Goldinger, 2014; Susa et al., 2019) and perhaps explain the lack of an effect in others (e.g., Bindemann et al., 2010; Stephens et al., 2017). These latter studies and others like them used image sets with low between-person variability (e.g., the Glasgow Face Matching Test; Burton et al., 2010). By contrast, we used an image set with more realistic variability and greater external validity. With a greater span of time between the two comparison images, match trials were likely less strikingly obvious.
Additionally, we found that feedback interacted with prevalence to produce differences in both discriminability and criterion. As Feedback Intervention Theory predicts, an imbalance in trial types shifts the ranking of cues in the hierarchy when feedback (either error-only or full) emphasizes it. In response to imbalanced prevalence rates, participants shifted their criterion to be either more liberal when fake IDs were rare or more conservative when they were frequent.
Having established two different patterns of effects with two different stimulus sets, we designed Experiment 3 to replicate and extend these results using the most visually variable image set in the series. Ambient images may be the most representative of how individuals present themselves during identification screenings. If this even greater challenge between images follows the pattern of results from Experiment 2, we can have greater assurance about the patterns of behavior that might emerge in real-world settings. If, however, Experiment 3’s results look more like Experiment 1’s, then we would expect criterion shifting, but not reduced discriminability, across prevalence and feedback conditions. Either outcome would be informative for future research and policy recommendations.