In our everyday lives, we effortlessly process constantly changing facial information. The commonly accepted notion that this processing is highly proficient underlies a vast body of research and, for the most part, coincides with our own personal experiences. However, an increasing number of studies are challenging the idea that face processing is generally highly proficient. Extremely high performance levels are observed in particular for the faces of individuals encountered repeatedly in everyday life, that is, those who are personally familiar to us (for a review see Ramon & Gobbini, 2018). For unfamiliar faces, on the other hand, our ability to process facial identity is comparatively more prone to error (e.g., Bruce, Henderson, Newman, & Burton, 2001; Burton, Wilson, Cowan, & Bruce, 1999; Hancock, Bruce, & Burton, 2000; Jenkins & Burton, 2011; Jenkins, White, Van Montfort, & Burton, 2011; Megreya & Burton, 2006; Ramon & Gobbini, 2018; Ramon & Van Belle, 2016; Young & Burton, 2017) and varies considerably between individuals (e.g., Bate & Dudfield, 2019; Bate, Portch, Mestry, & Bennetts, 2019; Bruce, Bindemann, & Lander, 2018; Fysh, 2018; White, Kemp, Jenkins, Matheson, & Burton, 2014).
Decades of research have focused on normal face processing skills, as well as on individuals exhibiting deficient face processing abilities (i.e., developmental prosopagnosia; for a recent review see Geskin & Behrmann, 2018). More recently, however, increasing interest has been directed toward individuals with remarkable face processing abilities, so-called super-recognizers (SRs; e.g., Ramon, Bobak, & White, 2019a, 2019b). Understanding superior face processing skills is important from both a fundamental scientific and an applied perspective. Theoretically, this work has led to face processing being considered as a spectrum rather than as a dichotomy between normal and dysfunctional abilities (Russell, Duchaine, & Nakayama, 2009). From a practical perspective, investigation of the abilities of SRs can provide valuable information for the optimization of automatic face processing or deployment of personnel in security-critical settings (Ramon et al., 2019a, 2019b), such as criminal investigation (Ramon, 2018a).
However, a fundamental challenge continues to exist: How do we reliably identify high performing individuals within the general population? Addressing this issue requires consideration of aspects that have been expressed by scientists and practitioners alike (cf. Bate et al., 2019; Moreton, Pike, & Havard, 2019; Ramon et al., 2019a, 2019b; Robertson & Bindemann, 2019). Two major factors are: 1) assessment of different subprocesses or processing levels in face cognition (Ramon et al., 2019a; Ramon & Gobbini, 2018); and 2) the degree to which this process-dependent behavior observed experimentally translates into extremely varied and constantly changing applied settings (cf. e.g., Bate et al., 2019; Ramon, 2018a). These two factors are discussed below.
Assessment of face perception versus recognition: procedural considerations
Concerning the first factor, distinct subprocesses involved in face cognition can be investigated through specific tasks that are (ideally) designed to capture them in a reliable manner. In the context of unfamiliar face processing, the most commonly assessed processes include face perception and face recognition. These are separate processes that require careful terminological distinction (cf. Ramon, 2018a; Ramon et al., 2019a; Ramon & Gobbini, 2018) and should not be used interchangeably or considered analogous to other processes (such as face identification; see Noyes, Hill, & O’Toole, 2018; Phillips et al., 2018).
Tests of face perception can involve simultaneous matching (or discrimination) among image pairs (e.g., Fysh, 2018; Ramon & Rossion, 2010), triplets (e.g., Barton, Press, Keenan, & O’Connor, 2002; Busigny, Joubert, Felician, Ceccaldi, & Rossion, 2010), 1-to-n matching scenarios (e.g., Bruce et al., 1999; Pachai, Sekuler, Bennett, Schyns, & Ramon, 2017; Rezlescu, Pitcher, & Duchaine, 2012), or can comprise delayed matching (e.g., Ramon, Busigny, & Rossion, 2010; Ramon & Van Belle, 2016). In the context of such laboratory-based face perception tests, consideration of both accuracy and reaction times (RTs) is necessary to accurately characterize individual face processing abilities. This is especially important for tests that 1) involve behavioral decisions recorded for individual trials with long or unlimited duration, and/or 2) are insufficiently calibrated for individual scores to clear two standard deviations from the control mean (e.g., the Glasgow Face Matching Test (GFMT); Bate et al., 2018; Bobak et al., 2016a; Burton, White, & McNeill, 2010; Fysh, 2018; Fysh & Bindemann, 2018; Robertson et al., 2016; White, Phillips, Hahn, Hill, & O’Toole, 2015). To illustrate, time-consuming piecemeal matching strategies may enable normal accuracy even in highly impaired clinical populations, such as individuals suffering from prosopagnosia (cf. Marotta, McKeeff, & Behrmann, 2002; White, Rivolta, Burton, Al-Janabi, & Palermo, 2017).
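The two concerns raised above can be made concrete with a small numerical sketch. It is an illustration under stated assumptions, not an analysis from the studies cited: the control scores are hypothetical, the calibration concern is read here as a ceiling problem (one possible reading), and the inverse efficiency score (mean RT divided by proportion correct) is swapped in as one common way of combining speed and accuracy, not a measure this paper prescribes.

```python
from statistics import mean, stdev

# Hypothetical control scores on a 40-item matching test (illustrative only,
# not data from any study cited in the text).
CONTROL_SCORES = [32, 34, 36, 38, 40]
MAX_SCORE = 40


def z_score(score, control_scores):
    """Distance of a score from the control mean, in control-sample SDs."""
    return (score - mean(control_scores)) / stdev(control_scores)


def inverse_efficiency(mean_rt_ms, proportion_correct):
    """Mean RT divided by proportion correct; higher means less efficient."""
    if not 0 < proportion_correct <= 1:
        raise ValueError("proportion_correct must be in (0, 1]")
    return mean_rt_ms / proportion_correct


# When controls cluster near ceiling, even a perfect score may stay below +2 SD.
ceiling_z = z_score(MAX_SCORE, CONTROL_SCORES)
can_clear_2sd = ceiling_z >= 2.0

# Two hypothetical observers with identical accuracy but very different speed.
fast = inverse_efficiency(mean_rt_ms=2000, proportion_correct=0.95)
slow = inverse_efficiency(mean_rt_ms=8000, proportion_correct=0.95)

print(f"z of a perfect score: {ceiling_z:.2f}; clears +2 SD: {can_clear_2sd}")
print(f"inverse efficiency, fast vs slow: {fast:.0f} ms vs {slow:.0f} ms")
```

With these made-up numbers, accuracy alone cannot separate the two observers, whereas a measure that also takes RT into account can; and no score on the hypothetical test can exceed the control mean by two standard deviations.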
Tests of face recognition involve intentional learning of previously unfamiliar identities, which are later recognized as “old” among novel ones (e.g., Bate et al., 2018; Bobak et al., 2016b; Russell et al., 2009). In the context of such face recognition tests, where encoding duration is classically predetermined, longer decision times cannot compensate for an inefficiently encoded face stimulus. Given the decreased likelihood of speed–accuracy trade-offs in face recognition tests, RTs are oftentimes not considered (e.g., Blais, Jack, Scheepers, Fiset, & Caldara, 2008).
To summarize, while face perception is usually assessed via matching of faces presented simultaneously, face recognition involves distinguishing experimentally learned identities from entirely novel ones. Importantly, observers can differ in their unique abilities exhibited across these distinguishable subprocesses (Fysh, 2018). Consequently, a comprehensive empirical understanding of the face processing abilities of individuals requires assessment across various levels of processing. In applied settings, on the other hand, this may not be required as, depending on the area of intended deployment, assessment of an ability confined to a specific type of task might be entirely sufficient (Bate et al., 2019; Moreton et al., 2019; Robertson & Bindemann, 2019).
Relationship between performance in the laboratory and the real world
The second factor concerns the relationship between experimentally observed process-dependent behavior and skills that are relevant in extremely varied and constantly changing real-life settings (Moreton et al., 2019; Ramon et al., 2019a, 2019b). For the most part, empirical studies aim to characterize different aspects of face processing in a highly controlled manner. For instance, psychophysical studies of identity matching have been conducted to better understand the specific contribution of certain controllable factors, for example low spatial frequency information or orientation tuning (Pachai, Sekuler, & Bennett, 2013; Pachai et al., 2017; Watier & Collin, 2009; see also Papinutto, Lao, Ramon, Caldara, & Miellet, 2017). Notwithstanding the theoretical value of evidence provided through such rigorous experiments, the degree to which they characterize or reflect real-world proficiency in facial identity processing remains unknown (Ramon et al., 2019a, 2019b).
Indeed, other tests have been designed with the intention of assessing behavior that is pertinent to real life. One frequently used paradigm involves simultaneous identity matching in a two-alternative forced-choice (2AFC) scenario. This is thought to parallel identity verification at, for example, a passport control point where naturally occurring variation in ambient images is known to affect performance (Burton et al., 1999; Megreya & Burton, 2006). Unfortunately, some of the tests used to identify SRs (e.g., the GFMT) lack sensitivity, making them inappropriate for identifying highly proficient face processing abilities (Bobak et al., 2016b; Ramon et al., 2019a, 2019b). This situation is further exacerbated by two additional aspects. First, 2AFC paradigms, including for example the Expertise in Facial Comparison Test (EFCT) and the Person Identification Challenge Test (PICT) (White et al., 2015), have been used to further probe individual differences in face matching (Phillips et al., 2018). Although developed to assess performance under challenging situations (White et al., 2015), the EFCT and PICT involve face matching under optimal viewing conditions, in other words using full-frontal face images, a necessary requirement to meet the original goal of comparing human and machine performance (Phillips & O’Toole, 2014). Additionally, as they require observers to make simple same/different decisions, they involve a constant probability of a correct response by chance on every trial. Second, and most importantly, the “pedestrian notion” of speed–accuracy trade-offs (Heitz, 2014; see also Luce, 1986) is commonly not considered; observers can obtain high test scores at the expense of prolonged RTs, which are not reported when the EFCT and PICT are used (e.g., Phillips et al., 2018; for a similar approach adopted in the context of pairwise face-matching tests see also Bobak et al., 2016c; Robertson et al., 2016).
The solution: standardization of challenging and ecologically valuable laboratory tests
In the context of identifying individuals who could provide a substantial contribution in applied settings, a growing body of literature has expressed the need for more ecological and challenging assessment of face processing abilities (Bate et al., 2018, 2019; Lander, Bruce, & Bindemann, 2018; Moreton et al., 2019; Ramon et al., 2019a; Robertson & Bindemann, 2019). We aim to contribute to filling this void concerning the assessment of facial identity processing under both realistic and challenging conditions. To this end, we tested a large group of individuals of all ages (N = 252) with two previously reported tests of facial identity matching: the Yearbook Test (YBT) and the Facial Identity Card Sorting Test (FICST). These tests tap into invariance of facial representations by measuring facial identity matching ability across two dimensions that are pertinent to real-life challenges. The YBT (for examples see Fig. 1b) captures identity matching across significant age-dependent changes in facial appearance (i.e., 25 years; Bruck et al., 1991). The FICST (see Fig. 1a) probes the ability to tell together and tell apart identities across superficial image variations (lighting, make-up, hairstyle, etc.; Andrews, Jenkins, Cursiter, & Burton, 2015; Jenkins et al., 2011). Because of the nature of the tasks and face stimuli, these two tests mimic real-world challenges in face processing. The YBT, for instance, resembles the situation of encountering acquaintances or friends after considerable time periods. In a policing context, it might translate into scenarios where comparison images of alleged criminals in a line-up are dated and experts or witnesses are required to disregard age-related changes. The task and face stimuli used in the FICST, on the other hand, could resemble a situation where police officers are required to determine whether footage from multiple crime scenes depicts the same individual, or whether image material viewed in the context of child abduction or abuse depicts the same victim(s). In this case, face processing would be challenging not only due to variation in image quality, but also due to potential disguises or the time elapsed between image acquisitions (Megreya, Sandford, & Burton, 2013). Additionally, image ambiance is inherently larger in the FICST due to the presence of 20 images per identity (versus two in the PICT and EFCT), and in both the FICST and YBT faces are depicted in greyscale and from different viewpoints (as opposed to full-frontal color images in the EFCT and PICT; see Fig. 1). Finally, instead of requiring simple same/different decisions between two faces as in the EFCT and PICT, the FICST and YBT involve a substantially greater number of possible responses. In the FICST, participants are blind to the number of identities portrayed in the 40 pictures they are supposed to sort. In the YBT, observers have to match five target pictures portraying young adults with the corresponding five probe images, which are presented along with five distractors (Bruck et al., 1991). Consequently, the FICST and YBT include a broader decisional space, which has been experimentally shown to increase task difficulty (Ramon, Sokhn, & Caldara, 2019; Ramon, Sokhn, Lao, & Caldara, 2018), and they also resemble more challenging real-world situations beyond identity verification.
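The difference in decisional space can be illustrated with a simple guessing model. The sketch below is a simplification and makes two assumptions that are not taken from the test descriptions: same/different trials are treated as balanced, so a blind guess is correct with probability 1/2, and a guessing observer in the YBT layout is assumed to assign each of the five targets to a distinct one of the ten probe images uniformly at random. It captures only chance performance and says nothing about perceptual difficulty.

```python
from fractions import Fraction
from math import perm


def p_all_correct_2afc(n_trials):
    """Chance that blind guessing is correct on every same/different trial."""
    return Fraction(1, 2) ** n_trials


def p_all_correct_matching(n_targets, n_probes):
    """Chance that a random one-to-one assignment of targets to distinct
    probes is correct for every target (assumed guessing model)."""
    return Fraction(1, perm(n_probes, n_targets))


# Five decisions in each format: 2AFC versus 5 targets among 10 probes.
two_afc = p_all_correct_2afc(5)
matching = p_all_correct_matching(n_targets=5, n_probes=10)
per_target = Fraction(1, 10)  # chance for any single target under this model

print(f"2AFC, five trials all correct by guessing: {two_afc}")
print(f"Matching, all five targets correct by guessing: {matching}")
print(f"Matching, single target correct by guessing: {per_target}")
```

Under these assumptions, guessing all five decisions correctly is far less likely in the matching layout than in five same/different trials, which is one way to read the claim that a broader decisional space makes high scores harder to obtain by chance.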
In light of the aforementioned considerations, the goal of this study was twofold. We aimed to provide normative data for the YBT and FICST, which in our opinion are more ecologically meaningful and challenging tests of unfamiliar face matching (i.e., perception) than the EFCT and PICT. Beyond this, we sought to determine the relationship between performance on the YBT and FICST, as well as the EFCT and PICT, and performance on the most commonly used tool to assess face cognition, the long form of the Cambridge Face Memory Test (CFMT+; Russell et al., 2009). Note that, although the CFMT+ is a test of face recognition, it remains the most commonly used means to identify superior face processing skills (cf. Bobak et al., 2016b; Noyes, Phillips, & O’Toole, 2017; Ramon et al., 2019a).
To anticipate our findings, and in line with previous work, our results demonstrate that facial identity matching is considerably impacted by superficial image variations and age-related changes in appearance. Importantly, across all tests of face perception reported, which rely on simultaneous matching of facial identity displayed in natural images (i.e., involve no memory component), the EFCT and PICT were the least challenging. Based on these observations, we advocate for increased use of more challenging measures that involve manipulations pertinent to real-life settings, such as the YBT and FICST. In these tests, time-consuming strategies are less effective for achieving high accuracy than in 2AFC scenarios. The conditions under which tools are developed (cf. EFCT and PICT; probing face matching under ideal visual conditions as required by dated automatic face processing solutions) have to be carefully considered in combination with the real-life roles in which they are deployed (for example, identity verification encountered by passport control officers, doormen, or cashiers).