You shall not pass: how facial variability and feedback affect the detection of low-prevalence fake IDs

Weatherford, Dawn R.; Erickson, William Blake; Thomas, Jasmyne; Walker, Mary E.; Schein, Barret

doi:10.1186/s41235-019-0204-1

Original article
Open access
Published: 28 January 2020

Dawn R. Weatherford ORCID: orcid.org/0000-0002-1145-1064^1,2,
William Blake Erickson¹,
Jasmyne Thomas¹,
Mary E. Walker¹ &
Barret Schein²

Cognitive Research: Principles and Implications volume 5, Article number: 3 (2020)

Abstract

In many real-world settings, individuals rarely present another person’s ID, which increases the likelihood that a screener will fail to detect it. Three experiments examined how within-person variability (i.e., differences between two images of the same person) and feedback may have influenced criterion shifting, thought to be one of the sources of the low-prevalence effect (LPE). Participants made identity judgments of a target face and an ID under either high, medium, or low mismatch prevalence. Feedback appeared after every trial, only error trials, or no trials. Experiment 1 used two controlled images taken on the same day. Experiment 2 used two controlled images taken at least 6 months apart. Experiment 3 used one controlled and one ambient image taken at least 1 year apart. Importantly, receiver operating characteristic curves revealed that feedback and greater within-person variability exacerbated the LPE by affecting both criterion and discriminability. These results carry implications for many real-world settings, such as border crossings and airports, where identity screening plays a major role in securing public safety.

Significance statement

Determining an unfamiliar person’s identity is critically important to a wide variety of security-related occupations such as transportation-security screeners, border patrol agents, police officers, and other security personnel. These personnel typically compare a photo identification card (i.e., ID) to a live person before permitting access to restricted goods, services, and areas. Acceptable forms of ID are produced by a variety of agencies that embed features such as light-sensitive strips, ghost images, and material properties to help a screener distinguish a genuine from a fake ID. However, a genuine ID can still be presented by a person who is not pictured on the card. Such an ID is still considered fake, but screeners need to detect the mismatched identities in order to reject it. Our research focuses on a screener’s ability to detect a fake ID of this kind when such IDs are rare. We explore how response formats (e.g., yes/no decisions compared to confidence-based decisions), real-world concerns (e.g., the degree of control in manipulating within-person and between-person variability), and possible interventions (e.g., feedback) may alter the magnitude of the effect.

Unfortunately, research indicates that detecting the identity of an unfamiliar person is more difficult than it may seem (e.g., Kemp, Towell, & Pike, 1997; Robertson, Noyes, Dowsett, Jenkins, & Burton, 2016; White, Burton, Jenkins, & Kemp, 2014). Errors arise because two images of the same person can vary widely based on differences in age, hairstyles, weight, and a number of other factors. Similarly, two images of different people can look incredibly similar. Thus, determining a person’s identity requires visually searching the two different images (e.g., photo ID and live person) for two different types of cues. Observers must be able to distinguish between match cues that signal a single identity (i.e., within-person variability; e.g., Burton, 2013) and mismatch cues that signal two different identities (i.e., between-person variability; e.g., Jenkins, White, Van Montfort, & Burton, 2011).

Much like other complex visual search tasks, research shows that if one type of target—in this instance a genuine ID or fake ID—is infrequent, then an observer will often fail to identify it (Hout, Walenchok, Goldinger, & Wolfe, 2015; Rich et al., 2008; Wolfe et al., 2007). This low-prevalence effect (LPE) decreases the successful identification of weapons in real-world baggage-screening scenarios (e.g., Lau & Huang, 2010) and abnormalities during radiological screenings (e.g., Drew, Võ, & Wolfe, 2013) because both weapons and abnormalities appear less often during these searches than high-prevalence items such as aerosol cans or tumors.

Extending to work with faces, Papesh and colleagues (Papesh & Goldinger, 2014; Papesh, Heisick, & Warner, 2018) found that participants failed to detect identity mismatches when they were rare. In a series of studies, participants viewed image pairs of a target face displayed beside an ID card. For each pair, participants made untimed yes/no decisions about whether the two images represented the same person. Errors persisted on mismatch trials when mismatch prevalence was low, despite warning participants after incorrect decisions, directing participants to avoid errors through careful deliberation, and allowing participants to reconsider their initial decisions.

The low-prevalence effect (LPE)

Although a complete explanation of the LPE is still a matter of debate, the relatively robust literature in object-identification search tasks (e.g., weapons, tumors) provides an important theoretical foundation for its origins. Studies have primarily investigated whether the LPE is driven by early visual search termination (i.e., making an identification decision before exhaustively searching an entire visual array) or criterion shifting (i.e., visually fixating upon the correct cue, but determining that it does not sufficiently exceed the threshold to be identified as such).

If these same mechanisms are applied to facial-identification tasks, the LPE for fake IDs can be explained as a failure to identify mismatch cues due to the wide within-person variability between IDs and the individuals presenting them. In other words, when presented with a high frequency of genuine IDs, the evidence for a mismatch decision must be sufficiently high in order to identify the ID as fake. In the absence of strong and unambiguous visual cues that signal a mismatch (e.g., the person presenting the ID is of a different race than the person in the photo), observers decide that two facial images belong to the same person. The findings of Papesh and colleagues (Papesh et al., 2018; Papesh & Goldinger, 2014) suggest that the LPE exerts its influence by creating a context that emphasizes cue search for identity matches. Therefore, participants fail to notice the diagnostic cues that signal between-person variability on mismatch trials because they terminated their search too quickly and/or attended only to match cues. These different search strategies resulted in shorter reaction times on inaccurate mismatch trials. However, these initial investigations into how the LPE affects facial identification are limited in ways that the current studies aim to explore.

The current research

The current studies contribute to this important area by more closely investigating factors that influence real-world security personnel and may affect criterion shifting in a serial decision-making task. First, within-person variability can affect the degree to which an individual resembles themselves over a lapse of time. Because ID photos can be valid for up to 10 years (e.g., United States passport documents), a wide variety of facial changes likely reduce the ability to adequately differentiate between an imposter presenting someone else’s ID and a legitimate person who has just changed substantially since their image was taken. In the current studies, we more strongly account for the degree to which image pairs look similar by representing different degrees of within-person variability. As a starting point, Experiment 1 used two controlled images that were taken on the same day with different cameras. To increase realism, Experiment 2 used two controlled images taken at least 6 months apart. Finally, Experiment 3 approximated the most realistic within-person variability by using one controlled image and one ambient image taken at least 1 year apart. Attention was also paid to ensuring sufficient between-person variability to approximate real-world settings, where an imposter presenting someone else’s ID must be at least adequately convincing to be believable. To represent convincing degrees of between-person variability, we created high-similarity mismatches (described in “Materials”) by pairing identities rated by an independent group of participants.

Second, feedback may influence the degree of the LPE in this face-matching task. Performance feedback in real-world security settings is delivered in a variety of ways. For instance, a screener might receive feedback by way of external validation (e.g., a screened individual is able to produce alternative forms of ID when prompted) or external information (e.g., a supervisor or confederate completes a random screening check for quality control). Although decision feedback is relatively rare compared to the vast majority of decisions that receive no additional scrutiny, it remains important to explore as a straightforward and plausible intervention strategy aimed at affecting criterion shifting. Further, professional identity screeners very typically receive feedback during their initial training period (Towler et al., 2019). Predictions about feedback are mixed, with some evidence suggesting its use as effective (Alenezi, Bindemann, Fysh, & Johnston, 2015; White, Kemp, Jenkins, & Burton, 2014) and others suggesting its use as ineffective or even detrimental (Papesh et al., 2018; Wolfe et al., 2007). Therefore, we again approached Experiment 1 as a means to replicate previous findings by providing feedback only in the case of errors (Papesh & Goldinger, 2014). Afterwards, Experiments 2 and 3 manipulated feedback more fully.

In order to consider the influences of these real-world factors, all three experiments adopted a variant of the traditional paradigm adapted from Papesh and Goldinger (2014), wherein participants made several decisions about whether a target face matched an ID. However, instead of yes/no judgments, participants made identity decisions on a 1–6 scale that allowed us to build receiver operating characteristic (ROC) curves (described in “Results”) and calculate discriminability and criterion. We predicted that, if the LPE exerts its influence, then both discriminability and criterion would be affected under low mismatch prevalence. However, the LPE may be reduced or nearly eliminated when image pairs represent low within-person variability (Experiment 1) compared to higher within-person variability (Experiment 2 and Experiment 3). In terms of feedback, we remained agnostic, as some evidence (Alenezi & Bindemann, 2013; White, Kemp, et al., 2014) would predict that feedback will increase discriminability and criterion (i.e., combat criterion shifts and decrease the likelihood of early search termination), whereas other evidence (e.g., Papesh et al., 2018) would predict that feedback will decrease discriminability and criterion (as a function of drawing attention to the low mismatch prevalence, thereby exacerbating the effect).

Experiment 1

Method

Participants

Undergraduate students (N = 91; M_age = 19 years; 68 female) participated in the experiment in exchange for partial course credit. Power analyses confirmed the sufficiency of this sample size for all omnibus tests (i.e., 1 − β > .95). Self-reported race reflected a diverse sample (15 Black/African American, 70 White/Caucasian, 1 Hispanic/Latino, 4 Asian/Asian-American/Pacific Islander, and 1). All participants reported normal or corrected-to-normal vision.

Materials

One hundred and forty image pairs were selected for use in the experiment. Each match pair contained two different front-facing photographs of the same person, taken on the same day with two different cameras (Glasgow Unfamiliar Face Database, http://www.facevar.com/downloads). Adapting Papesh and Goldinger (2014), each trial presented a target face (approximately 5 in. by 5 in.) beside an ID card (approximately 2.25 in. by 1.5 in.; see Fig. 1). EPrime presented images to participants on a 22-inch monitor such that target identities occupied the larger portion of the left side of the screen and ID card identities were embedded within one of several prototypical ID card images on the right side of the screen.

Mismatch identity pairs displayed two photographs of two different people (see Fig. 1, right column). Mismatch identities were paired using reported similarity ratings (all between 0.3 and 0.6; M = 0.40, SD = .09; see Bruce et al., 1999; Burton, White, & McNeill, 2010).^{Footnote 1} Match and mismatch identities were fully counterbalanced and no images repeated across trials, such that each identity was equally likely to appear beside another photograph of themselves as beside a photograph of another person.

Design and procedure

After providing informed consent, participants made 140 untimed identity decisions under either high (80%), medium (50%), or low (20%) mismatch prevalence. Participants answered “Are these images of the same person?” by selecting a number on a 1–6 scale (1 = definitely no, 6 = definitely yes). To replicate the experimental conditions of Papesh and Goldinger (2014), participants viewed a 2-s penalty screen following incorrect decisions on match trials and a 4-s penalty screen following incorrect decisions on mismatch trials.^{Footnote 2} After completing all trials, participants provided demographic information and were debriefed.

Results

To allow more direct comparison with findings derived from yes/no judgments in previous studies, we first collapsed the response scale, with responses 1–3 coded as correct for mismatch trials and responses 4–6 coded as correct for match trials. These collapsed values were used for the accuracy and signal detection analyses. Having established that connection with the literature, we then considered the full range of responses to construct ROC curves that more completely explore discriminability across all levels of confidence.
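The collapsing step can be illustrated with a minimal sketch. The function names and the "match"/"mismatch" labels below are hypothetical and chosen only for illustration; they are not taken from the study materials.

```python
def is_correct(rating: int, trial_type: str) -> bool:
    """Collapse a 1-6 rating into a binary decision and score it.

    Ratings 1-3 are treated as "different person" and 4-6 as "same person".
    trial_type is "match" or "mismatch" (hypothetical labels).
    """
    if rating not in range(1, 7):
        raise ValueError("rating must be an integer from 1 to 6")
    if trial_type not in ("match", "mismatch"):
        raise ValueError("trial_type must be 'match' or 'mismatch'")
    said_same = rating >= 4
    return said_same == (trial_type == "match")


def accuracy(trials):
    """Proportion correct over an iterable of (rating, trial_type) pairs."""
    scored = [is_correct(rating, trial_type) for rating, trial_type in trials]
    return sum(scored) / len(scored)
```

For example, a rating of 2 on a mismatch trial counts as correct, whereas a rating of 4 on the same trial counts as an error.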

Accuracy

We analyzed accuracy using a 2 (Match type, within-subjects: match, mismatch) × 3 (Mismatch prevalence, between-subjects: 80%, 50%, 20%) mixed analysis of variance (ANOVA). Unless otherwise stated, we set alpha at .05 and corrected for Type 1 error inflation across all statistical tests using Bonferroni post-hoc analyses. We found a main effect of match type, F(1, 88) = 17.768, p < .001, η_p² = .168, and no main effect of prevalence, F(2, 88) = .99, p = .375, η_p² = .022. However, the main effect of match type was qualified by an interaction between match type and mismatch prevalence, F(2, 88) = 7.33, p = .001, η_p² = .143. As can be seen in Fig. 2a, planned follow-up analyses indicated that participants were more accurate for match trials (M = .89, SD = .09) than mismatch trials (M = .82, SD = .11) across all mismatch prevalence conditions (a trend we address when considering the influence of facial variability below); however, accuracy followed a mirror effect across different prevalence rates. Error rate analysis followed the same pattern and magnitude of results.
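As a quick consistency check on the effect sizes reported above, partial eta squared can be recovered from an F statistic and its degrees of freedom with the standard identity η_p² = (F × df_effect) / (F × df_effect + df_error). The article does not say how its values were computed, so the sketch below is only an illustration of that identity, with a hypothetical function name.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared from an F ratio and its degrees of freedom."""
    if f_value < 0 or df_effect <= 0 or df_error <= 0:
        raise ValueError("F must be non-negative and both df values positive")
    effect = f_value * df_effect
    return effect / (effect + df_error)


# The three accuracy effects reported in the paragraph above.
for label, f_value, df1, df2 in [
    ("match type", 17.768, 1, 88),
    ("prevalence", 0.99, 2, 88),
    ("interaction", 7.33, 2, 88),
]:
    print(label, round(partial_eta_squared(f_value, df1, df2), 3))
```

Under this identity, the three accuracy effects round to .168, .022, and .143, in line with the values reported in the text.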

Signal detection measures

We also analyzed performance using signal detection measures, sensitivity (d’):

$$ {d}^{\prime }=z(Hits)-z\left( False\ alarms\right) $$

(1)

where higher values of d’ indicate superior recognition memory while accounting for response bias. To account for extreme performance levels (e.g., hit or false alarm rates of zero), extreme values were replaced by 1 − 2/N for rates of 1 and by 2/N for rates of 0, where N represents the number of trials of that type. We also calculated response criterion (C):

$$ C=\frac{-1\left(z\left( False\ alarms\right)+z(Hits)\right)}{2} $$

(2)
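A minimal sketch of these two measures follows, using the conventional definitions d’ = z(Hits) − z(False alarms) and C = −[z(Hits) + z(False alarms)]/2. Several details are assumptions made for illustration rather than facts from the article: a hit is taken to be a “same person” response on a match trial, a false alarm is a “same person” response on a mismatch trial, and the extreme-rate replacement uses the 2/N and 1 − 2/N values given in the text (another common convention is 1/(2N) and 1 − 1/(2N)). The function names are hypothetical.

```python
from statistics import NormalDist

# Inverse of the standard normal CDF, i.e., the z-transform of a proportion.
_z = NormalDist().inv_cdf


def corrected_rate(count: int, n: int) -> float:
    """Return count / n, replacing rates of exactly 0 or 1.

    Assumption: replacement values follow the text (2/N and 1 - 2/N).
    """
    if n <= 2:
        raise ValueError("n must exceed 2 for the replacement values to be valid")
    if not 0 <= count <= n:
        raise ValueError("count must lie between 0 and n")
    rate = count / n
    if rate == 0.0:
        return 2 / n
    if rate == 1.0:
        return 1 - 2 / n
    return rate


def d_prime_and_criterion(hits, n_match, false_alarms, n_mismatch):
    """Return (d', C) from raw counts of hits and false alarms."""
    hit_rate = corrected_rate(hits, n_match)
    fa_rate = corrected_rate(false_alarms, n_mismatch)
    d_prime = _z(hit_rate) - _z(fa_rate)
    criterion = -(_z(hit_rate) + _z(fa_rate)) / 2
    return d_prime, criterion
```

With these definitions, a negative C reflects a liberal tendency to respond “same person”, and a positive C reflects a conservative one.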

Figure 1 displays all data by prevalence and feedback conditions. For d’, a between-subjects ANOVA revealed no main effect of mismatch prevalence, F(2, 88) = .19, p = .831, η_p² = .004. For C, a between-subjects ANOVA revealed a main effect of mismatch prevalence, F(2, 74) = 9.89, p < .001, η_p² = .194. As predicted by the criterion-shift explanation of the LPE, mismatch prevalence affected criterion in a linear fashion, with the low mismatch prevalence group demonstrating a more liberal criterion than the high- and medium-prevalence conditions.

Area under the curve

A further analysis that can illuminate the effect of mismatch prevalence rates on criterion shifting simultaneously considers discriminability across a range of criterion values. To this end, we calculated area under the curve (AUC). The cumulative proportions of “1”, “2”, “3”, “4”, “5”, and “6” responses made by each participant within each prevalence condition, from 6 (the highest criterion level) to 2 (the lowest criterion level), were calculated for each pair type (match or mismatch) and plotted in ROC space, yielding three curves. The space is arranged such that the cumulative proportions for match pairs are plotted along the vertical axis and those for mismatch pairs along the horizontal axis, each running from 0 at the origin to 1 at its maximum. Thus, a diagonal from coordinate 0,0 to 1,1 indicates chance performance. Coordinates above this diagonal indicate accuracy above chance; coordinates below this diagonal indicate accuracy below chance. As can be seen from Fig. 3, accuracy was generally high in each mismatch prevalence condition as indicated by each curve bowing toward the upper left of the ROC space.
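The construction of such a curve from confidence ratings can be sketched as follows. This is an illustrative reading of the procedure rather than the authors’ analysis code: each criterion level k from 6 down to 2 contributes one point, whose coordinates are the proportions of mismatch and match trials that received a rating of k or higher. The full AUC is then approximated with the trapezoidal rule after anchoring the curve at 0,0 and 1,1. The function names are hypothetical.

```python
def roc_points(match_ratings, mismatch_ratings):
    """Return (x, y) ROC points for criterion levels 6 down to 2.

    x: proportion of mismatch trials rated at or above the criterion.
    y: proportion of match trials rated at or above the criterion.
    """
    points = []
    for criterion in range(6, 1, -1):
        y = sum(r >= criterion for r in match_ratings) / len(match_ratings)
        x = sum(r >= criterion for r in mismatch_ratings) / len(mismatch_ratings)
        points.append((x, y))
    return points


def trapezoid_auc(points):
    """Full AUC by the trapezoidal rule, anchored at (0, 0) and (1, 1)."""
    curve = [(0.0, 0.0)] + sorted(points) + [(1.0, 1.0)]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

Identical rating distributions for match and mismatch pairs place every point on the diagonal and give an area of .50, whereas perfectly separated distributions give an area of 1.00.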

Next, we computed the partial area under the curve (pAUC) scores for each ROC curve and conducted pairwise comparisons among each mismatch prevalence condition (see Table 1). Scores for full AUCs typically range from .50 (chance performance) to 1.00 (perfect performance). Although several methods of calculating these scores exist, most involve extrapolating the left-most and right-most data points of each ROC curve to the 0,0 and 1,1 coordinates on the plot. This approach jeopardizes interpretations based on comparing two ROCs that do not perfectly overlap along the x-axis, as different degrees of extrapolation are needed for each. Therefore, for each ROC curve comparison, we compared only those portions where the two curves do overlap. We used the pROC package (Robin et al., 2011) in R to compute pAUC scores for the curves corresponding to each prevalence level within sub-portions of the aggregate ROCs that overlapped along the x-axis of the ROC plots. In addition, because there were three comparisons in total, the alpha level for significance decisions was adjusted to .017 (i.e., .05/3).
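The idea of restricting a comparison to the overlapping x-range can be sketched in plain Python. This is not the pROC implementation and it does not compute the D statistic; it only illustrates, under stated assumptions, how an overlap window and a trapezoidal partial area might be obtained. The sketch assumes each curve has strictly increasing x values, and all function names are hypothetical.

```python
def overlap_bounds(curve_a, curve_b):
    """Return the (lower, upper) x-range covered by both ROC curves."""
    lower = max(min(x for x, _ in curve_a), min(x for x, _ in curve_b))
    upper = min(max(x for x, _ in curve_a), max(x for x, _ in curve_b))
    if lower >= upper:
        raise ValueError("the two curves do not overlap along the x-axis")
    return lower, upper


def _height_at(points, x):
    """Linearly interpolate the curve height at x (points sorted by x)."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the range spanned by the curve")


def partial_auc(points, lower, upper):
    """Trapezoidal area under the curve between x = lower and x = upper."""
    pts = sorted(points)
    xs = [x for x, _ in pts]
    if any(b <= a for a, b in zip(xs, xs[1:])):
        raise ValueError("this sketch assumes strictly increasing x values")
    if not (xs[0] <= lower < upper <= xs[-1]):
        raise ValueError("bounds must lie within the curve's x-range")
    clipped = [(lower, _height_at(pts, lower))]
    clipped += [(x, y) for x, y in pts if lower < x < upper]
    clipped.append((upper, _height_at(pts, upper)))
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(clipped, clipped[1:])
    )


# Bonferroni-adjusted alpha for three pairwise comparisons.
ALPHA_PER_COMPARISON = 0.05 / 3
```

Because the area is taken over only part of the x-axis, a pAUC is bounded by the width of the window and can fall well below the .50 to 1.00 range of a full AUC.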

Table 1 Lower and upper receiver operating characteristic (ROC) curve overlap boundaries used for each partial area under the curve (pAUC) analysis, including D value for each comparison

For the comparison between low mismatch prevalence and medium mismatch prevalence, the area spanned by the low condition (pAUC = .44) was less than the area spanned by the medium condition (pAUC = .47), D = −5.85, p < .001. For the medium and high conditions, the area spanned by medium mismatch prevalence (pAUC = .45) was greater than that spanned by high mismatch prevalence (pAUC = .44), D = 3.47, p < .001. Comparisons between the low and high conditions yielded no differences in pAUCs.

Discussion

Experiment 1 confirmed our initial predictions and replicated the findings of Papesh and Goldinger (2014) that low mismatch prevalence decreased accuracy when we collapsed across the range of 1–6 judgments to replicate the yes/no paradigms that were previously adopted (e.g., Papesh et al., 2018; Papesh & Goldinger, 2014). Data replicated the classic mirror effect across the three mismatch prevalence rates, with match accuracy increasing from high to low mismatch prevalence and mismatch accuracy decreasing from high to low mismatch prevalence. Next, we explored criterion and discriminability using ROC curves. This measure, previously unavailable with the yes/no format of other work, confirmed that prevalence rates affected criterion. However, this more sensitive instrument revealed a much smaller difference in discriminability by prevalence; participants’ overall discriminability was very high, regardless of mismatch prevalence condition.

The results of the ROC curves are promising for translation to real-world identification screening tasks. However, some marked differences between our design and the conditions of real-world security scenarios are worthy of consideration before making any claims about generalizability. Therefore, Experiments 2 and 3 examined the magnitude of the LPE across three additional differences that should theoretically affect criterion.

Experiment 2

The influence of facial variability

Real-life identification tasks rarely involve comparing a person to a photo ID that was taken on the same day. Further, identity screeners see people from all over the world who do not share such a high degree of visual similarity. Consequently, many identity-matching studies underscore the need to control for both within-person and between-person variability in a way that more strongly maps onto these real-world conditions (Burton, 2013; Megreya, Sandford, & Burton, 2013). Although constraining experimental materials might increase internal validity, it also might produce outcomes that drastically underestimate externally valid facial variability and, therefore, mask potential generalizability.

The Glasgow Unfamiliar Face Database (GUFD) (Experiment 1) includes images taken on the same day, which limits within-person variability, of mostly (if not exclusively) young, light-skinned individuals, which limits between-person variability. This database’s limited facial variability likely affected participants’ criteria based upon information at the item-level (i.e., considering only information about the presented image pair on screen) and series-level (i.e., considering information across successive trials). At the item-level, low within-person variability may have contributed to the significantly higher accuracy for our match than mismatch trials overall. It stands to reason that images taken approximately 15 min apart would bear a striking resemblance to one another in terms of a variety of both noticeable (e.g., hairstyle) and subtle (e.g., skin luminance) visual cues. Therefore, the low degree of within-person variability increases matched cues across images. At the series-level, such unrealistically high similarity between match pairs likely made mismatch cues more distinctive in contrast (i.e., a more obvious difference in the context of similarity; e.g., Hunt, 2006). Mismatches may have popped out more so than would have been expected in a more variable image set, and, therefore, reduced participants’ tolerance for perceptual differences as they calibrated their expectations for natural variations in a person’s appearance from day to day (see also, Menon, White, & Kemp, 2015a for a targeted approach to manipulating expected identity variation that produced findings that align with our rationale).

Previous identity-matching studies in the low-prevalence literature confirm that stimulus variability can alter the interpretation of results. For example, Bindemann, Avetisyan, and Blackwell (2010) also used the GUFD to compare identity-matching performance across five different experiments. Mismatch prevalence was varied in the first 49 of 50 trials, such that participants saw either 24 mismatches (high) or 0 mismatches (low). When the authors compared performance on a final critical mismatch trial, several variations of the study confirmed that the high mismatch prevalence group committed more mismatch errors. On the face of it, these results suggest that low mismatch prevalence (2%) did not produce the classic LPE. However, we caution against that interpretation on the basis of how mismatch trials were selected. The authors strategically selected the critical mismatch image pairs with higher similarity ratings (M = .56) than noncritical mismatch image pairs (M = .20). In other words, participants in the high-prevalence group saw more obvious mismatches that increased their likelihood of missing a less obvious mismatch.

More recent work has added to our understanding of the LPE by using face databases containing images taken multiple days (if not years in some instances) apart (Papesh & Goldinger, 2014; Papesh et al., 2018), and another recent study (Susa, Michael, Dessenberger, & Meissner, 2019) also found the LPE. This latter study used images specifically designed to test cross-race influences and, therefore, portrayed a wider degree of within- and between-person variability. Therefore, we considered it important to utilize an image set that more strongly represents real-world facial variability. Experiments 2 and 3 included a racially diverse database with multiple images of each identity taken with different cameras at different times.

The influence of differing prevalence rates

Although our low mismatch prevalence group did experience fewer mismatch trials overall, a 20% mismatch prevalence rate is still far greater than that in real-world settings. Although no exact figure exists, one could estimate that a very small percentage (i.e., < 1%) of passengers present a fake ID, making it quite a bit rarer than we have accounted for here. Nevertheless, participants in both the high and low mismatch prevalence groups were sensitive to the imbalance, which suggests that prevalence effects follow a continuous function. Put another way, participants made more errors on whichever type of trial was relatively rarer (either 20% mismatches or 20% matches). Therefore, if participants were sensitive enough to modify their decisions in response to differences between 80/20 prevalence rates, then our results likely underestimate the errors one could expect in a typical identity-screening scenario. Studies outside of the facial recognition literature support this interpretation by confirming that a greater degree of imbalance (Mitroff & Biggs, 2014; Wolfe & Van Wert, 2010) increases the magnitude of the LPE. Although an ultra-rare prevalence condition in this particular task would introduce its own set of problems, Experiments 2 and 3 adopted a greater degree of imbalance (i.e., 90/10 and 10/90) for comparison with a balanced prevalence group.

The influence of feedback

Although feedback in real-world settings is rare, it is naturally skewed toward combating mismatch errors for two primary reasons. First, most IDs that screeners check are genuine and presented by their rightful owner (a fact also responsible for the LPE). Second, screeners in many settings (e.g., airport security, liquor store cashiers) may never be made aware that they accepted a fake ID because they are unlikely to encounter that individual again. However, they are made aware when they erroneously reject an authentic ID if the individual is able to provide alternative means of identification. Therefore, investigating the effect of feedback is crucial when considering empirically driven training regimens designed to reduce the LPE, particularly if it persists under these real-world situations. Because the LPE literature with facial-identification tasks is fractured in both its use of feedback and its findings, some overview of the efficacy of feedback interventions on learning is in order before we proceed with our specific predictions.

The position that feedback improves performance dates back in psychological science to the earliest days of behaviorism (e.g., Thorndike’s (1927) Law of Effect). Many behaviorists eschewed theory, so the benefits of feedback were often taken at face value. Kluger and DeNisi’s (1996) Feedback Intervention Theory provides a framework from which we can make predictions relevant to security screening. Feedback Intervention Theory presupposes that five components are required for feedback to modify performance at a given task. First, a gap exists between the performance upon which feedback is given and the “standard” (i.e., the desired level of performance). Second, the various goals related to task performance are organized hierarchically. Third, feedback can only regulate future behavior when the gap between current performance and the standard receives the individual’s attention. Fourth, attention moderates the ranking that a particular standard has in the goal hierarchy. Fifth and finally, feedback interventions affect behavioral outcomes by shifting attention within this hierarchy, thereby reordering the various goals.

The success or failure of a particular feedback intervention relies on the specific aspects of the task to which feedback draws attention. Given this, Kluger and DeNisi (1996) concluded that feedback exerts its greatest influence when tasks are sufficiently challenging, yet concrete (i.e., tasks that are too easy, difficult, or nebulous are unlikely to benefit), and when it focuses attention towards cues related to the task’s standard rather than to the individual (e.g., mere praise or admonishment may alter feelings of self-efficacy, but they do not necessarily affect performance).

Central to the current studies is the differential attention paid to match and mismatch cues observable in faces presented side-by-side, and how feedback affects where these cues lie within the matching task’s goal hierarchy. Both cue types are shared within and between face-identity images dichotomously (i.e., facial features can only match or mismatch). The observer, then, must decide whether match cues outweigh mismatch cues when deciding whether two face images belong to the same identity. According to Feedback Intervention Theory, feedback would operate in a face-matching task by shifting attention within the goal hierarchy to cues that are most likely to match between face images belonging to the same identity. Therefore, it would make the visual system more sensitive to within-person variability.

Indeed, multiple studies demonstrate that feedback improves unfamiliar face-matching performance (Alenezi & Bindemann, 2013; White, Kemp, et al., 2014). However, the opposite influence of feedback has also been argued (Papesh et al., 2018). To date, the facial-identification paradigms that failed to find an effect of low mismatch prevalence (Bindemann et al., 2010; Stephens, Semmler, & Sauer, 2017) did not incorporate trial-by-trial feedback. Under such conditions, Feedback Intervention Theory would predict that the absence of feedback would not draw attention to the imbalanced trial types, thus not altering the cue hierarchy. Participants may not even be explicitly aware of the different mismatch prevalence rates and assume successful task performance. To more directly test this possibility, we more fully explored feedback with our modified paradigm using image sets with high facial variability.

Predictions

If more realistic facial variability (as would be the case with a greater lapse in time between images of a diverse group of people) also increases the difficulty of the task and exacerbates the LPE, then discriminability should decrease when mismatches are either infrequent or frequent (compared to when matches and mismatches are balanced). We may also see evidence of criterion shifting, as a greater degree of within-person variability might interact with prevalence to shift criterion even more liberally under low mismatch prevalence and conservatively under high mismatch prevalence than with more similar-looking match pairs. In contrast, if a wider degree of between-person and within-person variability does not interact with mismatch prevalence, then we expect to replicate the criterion shifting seen in Experiment 1, but not necessarily see differences in discriminability by mismatch prevalence.

With regard to our predictions about feedback, Feedback Intervention Theory would predict that varying mismatch prevalence will interact with the effect of feedback on mismatch accuracy in fairly straightforward ways: Low mismatch prevalence within a set of trials will yield fewer opportunities to make mismatch errors, and, therefore, fewer opportunities to modify the standard cue hierarchy toward attending to mismatch cues. This pattern should hold, and reduce discriminability, under either of the imbalanced mismatch prevalence rates. When mismatch prevalence is low, feedback will increase the weight of match cues in the hierarchy, resulting in imbalanced performance favoring match trials (but overall reduced discriminability). When mismatch prevalence is higher, feedback will increase the weight of mismatch cues in the hierarchy, resulting in imbalanced performance favoring mismatch trials (but overall reduced discriminability).