When experience does not promote expertise: security professionals fail to detect low prevalence fake IDs

Weatherford, Dawn R.; Roberson, Devin; Erickson, William Blake

doi:10.1186/s41235-021-00288-z

Original article
Open access
Published: 01 April 2021

Dawn R. Weatherford ORCID: orcid.org/0000-0002-1145-1064¹,
Devin Roberson¹ &
William Blake Erickson¹

Cognitive Research: Principles and Implications volume 6, Article number: 25 (2021)

Abstract

Professional screeners frequently verify photograph IDs in such industries as professional security, bartending, and sales of age-restricted materials. Moreover, security screening is a vital tool for law enforcement in the search for missing or wanted persons. Nevertheless, previous research demonstrates that novice participants fail to spot fake IDs when they are rare (i.e., the low prevalence effect; LPE). To address whether this phenomenon also occurs with professional screeners, we conducted three experiments. Experiment 1 compared security professionals and non-professionals. Experiment 2 compared bar-security professionals, access-security professionals, and non-professionals. Finally, Experiment 3 added a newly created Professional Identity Training Questionnaire to determine whether and how aspects of professionals’ employment predict ID-matching accuracy. Across all three experiments, all participants were susceptible to the LPE regardless of professional status. Neither length/type of professional experience nor length/type of training experience affected ID verification performance. We discuss task performance and survey responses with aims to acknowledge and address this potential problem in real-world screening scenarios.

Background

In response to various worldwide security concerns at borders and other ports of entry, professional screeners commonly restrict access of goods and services to authorized individuals who present an authentic ID card. However, a potential traveler may produce stolen, borrowed, or inauthentic documents. Professional screeners need to maintain safety by identifying such imposters, while still allowing lawful passengers through. Although technological advancements such as automatic face recognition systems (e.g., Taigman et al., 2014) and various methods of biometric scanning may seem like attractive alternatives to replace human screeners, such technologies face many of the same challenges as human recognizers (e.g., O’Toole et al., 2012; Tran et al., 2017), while also raising concerns about ethics, transparency, and accountability (Drozdowski et al., 2020). Therefore, the bulk of imposter detection duties has been and is being performed by humans.

Even under optimal viewing conditions, ID matching performance has a surprisingly high number of errors (e.g., Burton, 2013). Errors further increase with additional real-world challenges such as time pressure (e.g., Bindemann et al., 2016) and vigilance (e.g., Alenezi et al., 2015). Among a host of challenges to successful ID screening, the Low Prevalence Effect (LPE; e.g., Wolfe et al., 2007) also increases error rates. As a well-known cognitive phenomenon, the LPE has been demonstrated for infrequent targets such as objects in a visual array (e.g., Fleck & Mitroff, 2007), weapons in luggage X-rays (Wolfe et al., 2013), and most relevant to the current investigation, fake IDs (Papesh et al., 2018; Papesh & Goldinger, 2014; Susa et al., 2019; Weatherford & Schein, 2015; Weatherford et al., 2020; cf. Bindemann et al., 2016). For instance, Weatherford et al. (2020) tested untrained participants’ ability to make identity judgments about a target face presented beside an ID card. Fake IDs appeared in either 10%, 50%, or 90% of all trials. Consistent with the LPE, participants inaccurately accepted more fake IDs (i.e., mismatch errors) in the 10% prevalence condition. What’s more, performance feedback only exacerbated the effect.

Of theoretical interest, research suggests that the LPE is caused by early search termination (i.e., making a decision before exhausting all available cues) and/or criterion shifting (i.e., response criterion, as defined by signal detection theory, shifts over the course of trials). Although work outside of facial identification has demonstrated that early search termination might be corrected by allowing participants time to reconsider rash decisions (e.g., Fleck & Mitroff, 2007), ID matching tasks have consistently failed to find any evidence for early search termination (Papesh & Goldinger, 2014; Weatherford et al., 2020). In contrast, facial studies more strongly favor criterion shifting that biases acceptance of fake IDs as authentic even with additional consideration time, warnings about errors, bursts of high-prevalence mismatch trials, and other manipulations designed to combat criterion shifting. In other words, the LPE is an exceptionally stubborn source of errors for which we need a greater theoretical understanding.

Although LPE tasks in the laboratory (e.g., Papesh & Goldinger, 2014; Weatherford et al., 2020) may be cautiously generalized to other real-world ID matching tasks, no studies have directly tested whether the LPE emerges in an ID matching task with a professional sample. It is quite possible that this phenomenon is an artifact of untrained participants performing tasks with which they may have little to no experience. Although it is difficult to estimate exactly how often professionals are presented with fake IDs, security screeners are likely far more accustomed to seeing an authentic ID card presented by its rightful owner, and that expectation may influence performance on an ID matching task.

Therefore, the current studies report data from professional samples on an ecologically valid variant of a routine task with which they have extensive experience. We hope that the results will illuminate whether professionals commit LPE errors in a facial identification task.

Mapping a laboratory-based task onto a professional-security setting

To increase ecological validity and reduce the influence of potential confounds associated with laboratory-based designs, we modeled our ID matching task to reflect essential aspects of a professional-security setting.

Realistic document images

During an ID check in the real world, a screener is responsible for satisfying two goals. One goal involves authenticating the individual’s documents by visually scanning for particular security features such as expiration dates, ghost images, and black-light responsive materials. Many different agencies focus exclusively on this aspect of the screening process. Continually updated patents reflect that improvements in this area focus on combating the passage of fraudulent documents as a measure to heighten security and stop criminal behavior. The steps needed to complete this goal are relatively straightforward, and screeners are sometimes provided with tools (e.g., black-light scanners) to aid detection of fake IDs. Therefore, we reduced participants’ attention to this goal by presenting standardized document images for which security features (e.g., expiration dates, ghost images) were either removed or held constant to reduce possible suspicion of fraudulent documents.

A second goal involves authenticating an individual’s identity by comparing the facial image to the person presenting it. The steps needed to complete this goal are far more nebulous, but equally, if not more, important. A successful screener needs to catch not only fraudulent documents (e.g., presented after the printed expiration date), but also fraudulent identities (i.e., presented by an imposter). As the more abstract of the two tasks, this second goal of facial image comparison has been the subject of extensive cognitive investigations to reveal the underlying mechanisms that predict (or fail to predict) success.

Realistic facial images

Research suggests that although humans possess superior facial recognition skills with familiar faces (e.g., Kramer et al., 2018; Young & Burton, 2018), this expertise does not completely extend to unfamiliar faces (Abudarham et al., 2019; Burton, 2013; Dunn et al., 2018) that would be typical for a security-screening scenario. Although both processes share some perceptual characteristics (e.g., featural analysis, perceptual sensitivity; Abudarham et al., 2019), many of the hallmarks of familiar face processing revolve around more well-developed conceptual/associationistic processing that increases the screener’s ability to generalize from one image of someone to many others. In other words, facial image comparison only plays a crucial role when screeners need to authenticate the documents of unknown individuals.

Many agencies that produce IDs attempt to reduce or eliminate perceptual variations that impair unfamiliar face comparison (e.g., Hancock et al., 2000). These variations include suboptimal lighting/shadows (e.g., Braje et al., 1998), distance (e.g., Lampinen et al., 2014), pose (e.g., Hancock et al., 2000), image quality (e.g., Bruce et al., 2001), and partial face coverage (e.g., Davies & Flin, 1984). Accordingly, all facial images used in this series of experiments depict a frontal pose of each participant under adequate lighting without any obstructions (e.g., sunglasses, facial scarves). Further, images were of similar size and quality to those used in typical screening scenarios.

Realistic target comparison images

Even when facial images are optimized, the screening task remains challenging because a person may look markedly different from their own ID photograph. Changes over time (e.g., in hair, weight, or cosmetic alterations) increase within-person variability. Recently, Weatherford et al. (2020) investigated the relationship between within-person variability and the LPE. Over the course of three experiments, the authors replicated the classic LPE, such that participants in the low mismatch prevalence condition demonstrated lower mismatch accuracy than those in medium and high mismatch prevalence conditions. As the within-person variability in each experiment was increased through the use of photographs that were captured further apart in time (i.e., from same day images in Experiment 1 to images taken at least one year apart in Experiment 3), the LPE only became more pronounced. Therefore, we removed the possibility that the task would be trivially simple by using target comparison images/videos taken in a different context, with a different camera, and that varied in time by at least six months from the facial image on the document.

How professional screening experiences affect the LPE

After establishing ecologically valid measures, the current study extends beyond student samples reported in previous studies (e.g., Papesh & Goldinger, 2014; Papesh et al., 2018; Susa et al., 2019; Weatherford et al., 2020) by recruiting ID screening professionals. Aside from training, which has shown mixed results in improving ID screening accuracy (e.g., Towler et al., 2014, 2019), professionals’ routine experience with ID screening may affect ID matching accuracy. As no published work has examined the relationship between professional experience and the LPE, two theoretically supported, yet divergent, possibilities may predict facial image comparison performance.

Professional experience might reduce LPE errors

In a practical sense, the best-case scenario would involve professional improvements in facial comparison performance alongside reductions in mismatch LPE errors. If experience promotes expertise, then a wide variety of security professionals should outperform naive, untrained participants on both the task itself (i.e., overall accuracy), and the detection of fake IDs (i.e., mismatch error rate). In the present studies, we focused on facial reviewers (i.e., individuals who perform a high volume of routine facial comparison tasks within a relatively short period of time throughout their entire shift; see also FISWG, 2011), as these professionals perform the bulk of identification matching operations designed to ensure public safety. Occupations such as police officers, security guards, border patrol agents, and the like would fit within this category.

Concerning overall accuracy, some studies have demonstrated that professional experience improves matching performance (i.e., overall performance on match and mismatch trials). These studies typically appeal to benefits by way of quantitative (e.g., years of employment) and qualitative (e.g., type and amount of identity comparisons made across a variety of contexts) aspects of professional experience. One illustrative example of improved performance involved the comparison of passport issuance officers to novices in a photograph-to-photograph ID matching task. Using a relatively large sample size (n = 204), Towler et al. (2019; Experiment 2) found that professionals outperformed novices on a photograph-to-photograph task using unfamiliar faces, with benefits modestly increasing in line with the difficulty of the tasks. Similar findings using other photograph-to-photograph comparisons have also been observed (e.g., Phillips et al., 2018).

Professional experience might not affect, or might exacerbate, LPE errors

While the best-case scenario is attractive, the larger body of evidence supports the possibility of a non-significant finding. More specifically, several studies assessing facial reviewers’ skills (e.g., White et al., 2014; Papesh et al., 2018; Heyer et al., 2018; Wirth & Carbon, 2017) were not as successful: professional experience did not improve task performance. If the current studies fit within this body of work, then we might predict no benefit of professional experience.

One last theoretical possibility is that professional experience will exacerbate LPE errors. In the field, it is reasonable to expect that professionals see a low prevalence of fake IDs. This low prevalence likely shifts their criterion, thereby tempting a screener to misattribute a high variability between an ID and its presenter as within-person variability (i.e., the differences seen between two images of the same person) instead of between-person variability (i.e., the similarities seen between two images of different people). In other words, when individuals seeking access to restricted goods and services present an ID with large differences (e.g., glasses, aging effects, cosmetics), the most practical approach would be to assume that variability is a natural by-product of comparing an individual to an image from up to ten years ago. If fraudulent documents and identities are rare, this heuristic would serve to reduce false alarms (i.e., rejecting an authentic ID). To the extent that professional experience transfers to other identity matching tasks, and those heuristics and assumptions remain intact, professionals may actually commit more LPE errors than untrained novices. Given that several studies have found criterion shifting even within a short experimental period (e.g., Papesh et al., 2018; Susa et al., 2019; Weatherford et al., 2020), it is entirely reasonable that routine, professional experiences could negatively transfer to this new task.

The current studies

In three studies, we manipulated the ratio of fake to authentic IDs across participants and expected to replicate previous findings with our non-professional participants. In the low mismatch condition (i.e., 90% matches to 10% mismatches), we predicted that non-professionals would identify fake IDs less often. Likewise in the low match condition (i.e., 10% matches to 90% mismatches), we predicted that non-professionals would identify authentic IDs less often. We expected that these differences, compared to a balanced condition (50% matches to 50% mismatches), would be largely driven by criterion shifting. In other words, participants in the imbalanced conditions would either relax their criteria for a “match” decision under the low mismatch prevalence or constrain their criteria for a “match” decision under low match conditions.

To foreshadow our findings, all of our predictions were confirmed for non-professionals. Furthermore, despite possible differences to task performance that might be expected with professional security experience, professional status did not affect our pattern of results. Participants in both types of imbalanced groups (low mismatch and low match) produced the typical pattern of the LPE. Even though feedback may have heightened awareness of the imbalanced base rates (a point we revisit in the General Discussion), the data consistently support no differences in ID verification performance across professional status, type of security occupation, and self-reports of previous employment and training histories.

Experiment 1

Method

Participants

Recruitment

All participants were recruited through a university subject pool or a Qualtrics panel (www.Qualtrics.com). Qualtrics recruiting ensured that participants’ self-reported age, gender, race/ethnicity, and income generally reflected established patterns reported in the 2010 US Census. For specific qualifications, such as professional security experience, Qualtrics panels use targeted marketing to attract participants from specific partners with whom they establish relationships and subsequently confirm the availability of participants with the requisite qualifications for inclusion. Understandably, many of the professionals might have been hesitant to provide the name of their employer. To protect their identities and encourage honest responding, we did not solicit that specific information.

Sample

The sample consisted of N = 78 individuals. Non-professionals (N = 54; Mage = 37.22 years, SD = 14.53; 4 failed to report; 38 females) participated in exchange for partial course credit or $6.00 (all compensation rates reported in USD),^{Footnote 1} depending upon recruiting source. Professionals (N = 24; Mage = 48.3 years, SD = 13.79; 4 failed to report; 7 females) reported at least 1 year of professional ID card screening experience and participated in exchange for between $20 and $30. Both groups represented a diverse sample of participants (59.0% White/Caucasian, 26.9% Hispanic/Latinx, 11.5% Black/African-American, 0% Asian/Pacific Islander, and 2.6% other).^{Footnote 2}

Power analysis

To determine the appropriate sample size for our design, we referred to the documented effect sizes in Weatherford et al. (2020). To capture the interactive effect of mismatch prevalence and feedback, we used effect sizes from Experiment 2, which used the same stimuli as we adopted here: η²p = 0.16 (with subsequent interpretation of values above 0.25 being large, 0.09 being medium, and 0.01 being small), which converts to f = 0.44 (with 0.40 considered large, 0.25 considered medium, and 0.10 considered small; Cohen, 1988). We conducted an a priori power analysis in G*Power (Faul et al., 2007) using the statistical test for “ANOVA: Fixed effects, special, main effects and interactions” with a desired power (1 − β probability) of 0.90 and a numerator df of 2. G*Power indicated a required sample size of n = 70. Thus, our sample size of n = 78 exceeded the sample size necessary to achieve sufficient power.
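The conversion from partial eta squared to Cohen's f reported above can be checked with the standard relation f = sqrt(η²p / (1 − η²p)). The following is a minimal sketch of that arithmetic only; the function name is ours and is not part of G*Power or of the authors' materials.

```python
import math


def cohens_f_from_partial_eta_squared(eta_sq_p: float) -> float:
    """Convert partial eta squared to Cohen's f: f = sqrt(eta^2 / (1 - eta^2))."""
    if not 0.0 <= eta_sq_p < 1.0:
        raise ValueError("partial eta squared must lie in [0, 1)")
    return math.sqrt(eta_sq_p / (1.0 - eta_sq_p))


# Effect size taken from the text (Weatherford et al., 2020, Experiment 2).
f = cohens_f_from_partial_eta_squared(0.16)
print(round(f, 2))  # 0.44
```

The result agrees with the value of f = 0.44 entered into the power analysis.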

Materials

Identity matching task

Participants viewed 200 unique images of 100 identities from a facial database (Weatherford et al., 2016) that approximated the same racial/ethnic and gender composition as the 2010 US Census. Each trial displayed a static target image (296 × 296 pixels) beside another static image embedded in one of six state ID templates (200 × 120 pixels) on a black background (see Fig. 1). All static images were taken in a controlled environment. The target image was taken in front of a blue background, and the state ID image was taken in front of a white background. The two images were taken with different cameras, in different settings, between six months and five years apart.

All image pairs were the same as described in Weatherford et al. (2020), Experiment 2. Among 100 identity pairs, we randomly chose 10 to serve as low-prevalence targets for the imbalanced conditions (i.e., presented as mismatches for the low mismatch prevalence condition or matches for the low match prevalence condition). For the balanced prevalence condition, we randomly split the 100 pairs across two different versions (i.e., the 50 identity pairs that served as mismatches in one counterbalance variation were displayed as matches in the other counterbalance variation). All identities were fully counterbalanced across all conditions, such that no participant viewed the same image more than once (for complete counterbalancing legends, see OSF website listed below).

Design and procedure

All participants accessed the ID matching task and provided informed consent online through Qualtrics. Participants were randomly assigned to one of three prevalence conditions (low mismatch/10% mismatches; balanced/50% mismatches; or low match/90% mismatches) and one of three feedback conditions (full feedback, error-only feedback, or no feedback). On each of the 100 trials, participants responded to the question “Are these photographs of the same person?” on a 1–6 scale, with 1 indicating Definitely No and 6 indicating Definitely Yes. After making the decision, participants in the full feedback condition saw one of four feedback statements. After a correct decision, they viewed “You rejected a fake ID” for mismatch trials or “You accepted an authentic ID” for match trials. After an incorrect decision, they viewed “You accepted a fake ID” for mismatch trials or “You rejected an authentic ID” for match trials. Participants in the error-only feedback condition received the feedback statements only after incorrect decisions. Participants in the no feedback condition did not see any statements about their performance. After 100 trials, participants provided demographic information, were debriefed, and were thanked for their time.
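The feedback rule described above can be sketched as a small function. This is an illustration under stated assumptions, not the authors' Qualtrics logic: the function name and the condition labels ("full", "error", "none") are hypothetical, and we assume that ratings 4–6 count as a "same person" decision when scoring a trial, mirroring the yes/no split used later in the analysis.

```python
from typing import Optional


def feedback_message(rating: int, is_match_trial: bool, condition: str) -> Optional[str]:
    """Return the feedback statement for one trial, or None if none is shown.

    rating: 1 (Definitely No) to 6 (Definitely Yes).
    condition: "full", "error", or "none" (hypothetical labels).
    """
    if not 1 <= rating <= 6:
        raise ValueError("rating must be between 1 and 6")
    if condition not in ("full", "error", "none"):
        raise ValueError("unknown feedback condition")

    # Assumption: ratings 4-6 are treated as accepting the ID as authentic.
    accepted = rating >= 4
    correct = accepted == is_match_trial

    if condition == "none":
        return None
    if condition == "error" and correct:
        return None  # error-only feedback is silent after correct decisions

    if is_match_trial:
        return "You accepted an authentic ID" if accepted else "You rejected an authentic ID"
    return "You accepted a fake ID" if accepted else "You rejected a fake ID"


# A rating of 5 on a mismatch trial is an error in every condition.
print(feedback_message(5, is_match_trial=False, condition="error"))  # You accepted a fake ID
```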

Results

Our first experiment supports the conclusion that professionals exhibit the LPE, following the same patterns as non-professionals, regardless of what type of trial is infrequent. In other words, when fake IDs were infrequent, mismatch errors rose. When authentic IDs were infrequent, match errors rose. Further, the data support a criterion shift explanation, as evidenced in Fig. 2 (see also Weatherford et al., 2020).

Analysis of discriminability and response criterion

We first binarized our data into match/mismatch decisions by collapsing across the 1–6 options (see Stanislaw & Todorov for comparisons of rating tasks with yes/no tasks^{Footnote 3}). We coded yes responses (i.e., Definitely Yes, Probably Yes, and Maybe Yes) to the question, “Are these photographs of the same person?” as match decisions. We coded no responses (i.e., Definitely No, Probably No, and Maybe No) as mismatch decisions. Therefore, correct responses to match and mismatch trials were coded as hits and correct rejections, respectively. Incorrect responses to match and mismatch trials were coded as misses and false alarms, respectively (although see Papesh, 2014, for an alternative coding approach that treats successful identification of mismatches as “hits”). To account for extreme performance levels (e.g., hit or false alarm rates of zero), rates of 1 were replaced by 1 − 2/N and rates of 0 were replaced by 2/N, where N represents the number of trials of that type. For ease of comparison, the complementary error rates for misses and false alarms are depicted in Fig. 2.
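The coding scheme in this paragraph can be sketched as follows. This is an illustrative reconstruction rather than the authors' analysis script: the function names are ours, ratings 4–6 are assumed to correspond to the three "yes" options, and the extreme-rate replacement follows the values as stated in the text (2/N and 1 − 2/N). Other sources commonly use 1/(2N) and 1 − 1/(2N) instead.

```python
def classify_trial(rating: int, is_match_trial: bool) -> str:
    """Code one binarized response as a signal detection outcome."""
    if not 1 <= rating <= 6:
        raise ValueError("rating must be between 1 and 6")
    said_match = rating >= 4  # assumption: 4-6 = Maybe/Probably/Definitely Yes
    if is_match_trial:
        return "hit" if said_match else "miss"
    return "false_alarm" if said_match else "correct_rejection"


def corrected_rate(count: int, n_trials: int) -> float:
    """Proportion of trials, with rates of 0 or 1 replaced as stated in the text."""
    if n_trials <= 0 or not 0 <= count <= n_trials:
        raise ValueError("need 0 <= count <= n_trials and n_trials > 0")
    if count == 0:
        return 2.0 / n_trials
    if count == n_trials:
        return 1.0 - 2.0 / n_trials
    return count / n_trials


# Hypothetical participant in the low mismatch condition (90 matches, 10 mismatches)
# who accepted every fake ID: the raw false alarm rate of 1 is pulled back from the ceiling.
print(classify_trial(5, is_match_trial=False))  # false_alarm
print(round(corrected_rate(10, 10), 2))         # 0.8
```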

We then analyzed signal detection measures of discriminability (d′) and response criterion (C). For d′, higher values are interpreted as a greater ability to distinguish between match and mismatch trials. For C, a value of zero represents no bias toward either match or mismatch decisions. However, lower or negative C values are interpreted as a greater likelihood of making a match decision. In contrast, higher C values are interpreted as a greater likelihood of making a mismatch decision.
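For readers unfamiliar with these measures, the sketch below computes d′ and C from a hit rate and a false alarm rate using the standard equal-variance signal detection formulas, d′ = z(H) − z(FA) and C = −[z(H) + z(FA)] / 2. The text does not print the formulas, so treating these as the ones used is an assumption; they do reproduce the sign convention described above, in which negative C indicates a bias toward match decisions.

```python
from statistics import NormalDist
from typing import Tuple


def dprime_and_criterion(hit_rate: float, fa_rate: float) -> Tuple[float, float]:
    """Return (d', C) for rates strictly between 0 and 1.

    Rates of exactly 0 or 1 must be corrected beforehand, because the
    inverse normal CDF is undefined at those values.
    """
    z = NormalDist().inv_cdf  # standard normal quantile function
    z_hit = z(hit_rate)
    z_fa = z(fa_rate)
    d_prime = z_hit - z_fa
    criterion = -0.5 * (z_hit + z_fa)
    return d_prime, criterion


# Hypothetical screener who accepts most authentic IDs but also many fake ones.
d_prime, criterion = dprime_and_criterion(hit_rate=0.95, fa_rate=0.60)
print(d_prime > 0)    # True: match and mismatch trials are still distinguished
print(criterion < 0)  # True: responses are biased toward "match"
```

A symmetric performer (e.g., a hit rate of 0.80 and a false alarm rate of 0.20) yields C = 0, the no-bias point described in the text.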

We conducted a 3 (Prevalence: Low Mismatch, Balanced, Low Match) × 3 (Feedback: None, Error, Full) × 2 (Group: Professional, Non-Professional) factorial analysis of variance on discriminability measure d′. As can be seen in Fig. 2, the ANOVA revealed that only Prevalence significantly affected discriminability, F(2, 58) = 5.61, p = 0.006, η²p = 0.160. Tukey's Bonferroni-corrected HSD found that Balanced prevalence yielded greater discriminability than Low Match prevalence (p = 0.02) and Low Mismatch prevalence (p = 0.003), which were not significantly different from one another. No other factors yielded significant main effects or interactions.

We next conducted a 3 (Prevalence: Low Mismatch, Balanced, Low Match) × 3 (Feedback: None, Error, Full) × 2 (Group: Professional, Non-Professional) factorial analysis of variance on response criterion measure C. As predicted, Prevalence significantly affected criterion, F(2, 58) = 14.78, p < 0.001, η²p = 0.334. Tukey's Bonferroni-corrected HSD found that criterion under Low Match prevalence was significantly higher than Balanced (p < 0.001) and Low Mismatch prevalence (p < 0.001), which were not significantly different from one another. Criterion for professionals was marginally higher than non-professionals, F(1, 58) = 2.91, p = 0.09. These effects were accompanied by a significant Prevalence × Feedback interaction, F(4, 58) = 4.63, p = 0.003, η²p = 0.239. Planned follow-up comparisons found this interaction was driven by significant univariate effects of Feedback at Low Mismatch prevalence [F(2, 59) = 5.47, p = 0.007, η²p = 0.157], but not other levels of prevalence.

Discussion

Although these initial results seem promising, we sought to replicate and extend them to a modified paradigm with more externally valid materials. We also differentiated between different types of professional groups, as opposed to simply treating all professional experiences similarly.

Regarding the paradigm modifications, Experiments 2 and 3 were altered by replacing a) the static target image with a video of a person in a frontal pose rotating their head from side to side, and b) the state-based driver license templates with US passports. First, we justified the choice to incorporate video targets by relying upon previous research comparing still to moving images (e.g., Pike et al., 1997; Zhao & Bülthoff, 2017). Videos create a richer encoding experience that allows a screener to consider additional cues that are unavailable in a static image. These additional cues might then become diagnostic evidence to support a more informed match or mismatch decision (e.g., Pilz et al., 2006). At a theoretical level, the extant literature in facial identification and recognition supports both qualitative and quantitative differences that confer advantages in processing for moving versus still images (Lander et al., 1999; O’Toole et al., 2002; Thornton & Kourtzi, 2002; Yovel & O’Toole, 2016).

Secondly, we justified the switch to US passports to reduce additional attentional demands and perceived salience brought about by an inconsistent ID template. Using a standardized ID template with blurred typeface implicitly encourages participants to ignore any irrelevant details (e.g., expiration date, name, birthdate; see Goal 1 of document authentication as described in Introduction) and instead devote attention to facial comparison (e.g., comparisons of target video and ID image; see Goal 2 of identity authentication as described in Introduction).

Our choice to diversify our professional pool was similarly motivated by both theoretical and applied rationales. At a theoretical level, we anticipated that different types of professional experience would involve a different quantity of experiences with fake identification cards. In line with logic laid out in Papesh and Goldinger (2014), we anticipated that bar door security professionals would likely encounter a higher percentage of fake IDs (in terms of fraudulent documents and/or fraudulent identities) than access security professionals. If fake IDs are more prevalent in their occupational experiences, bar door security may be less susceptible to mismatch LPE errors. At an applied level, we anticipated that different types of professional experience would also involve a different quality of experiences with identification checking more generally. Although this is much more challenging to typify, we anticipated that these two types of professionals complete their identification checking tasks in, perhaps, meaningfully different ways.

Experiment 2

Experiment 2 compared performance of non-professionals, bar door security professionals, and access security professionals on 100 identification trials between a target video and a US passport. We predicted that if the relative percentage of fraudulent ID cards in their professional experience conferred an advantage, then bar door security professionals (but not non-professionals or access security professionals) would commit fewer mismatch LPE errors. However, if these professional experiences did not transfer to this identification task, then we would predict no significant differences between professionals of either type and non-professionals.