Face masks and fake masks: the effect of real and superimposed masks on face matching with super-recognisers, typical observers, and algorithms

Ritchie, Kay L.; Carragher, Daniel J.; Davis, Josh P.; Read, Katie; Jenkins, Ryan E.; Noyes, Eilidh; Gray, Katie L. H.; Hancock, Peter J. B.

doi:10.1186/s41235-024-00532-2

Original article
Open access
Published: 02 February 2024

Kay L. Ritchie (ORCID: orcid.org/0000-0002-1348-760X)¹,
Daniel J. Carragher²,³,
Josh P. Davis⁴,
Katie Read⁴,
Ryan E. Jenkins⁴,
Eilidh Noyes⁵,
Katie L. H. Gray⁶ &
Peter J. B. Hancock²

Cognitive Research: Principles and Implications, volume 9, Article number: 5 (2024)

Abstract

Mask wearing has been required in various settings since the outbreak of COVID-19, and research has shown that identity judgements are difficult for faces wearing masks. To date, however, the majority of experiments on face identification with masked faces tested humans and computer algorithms using images with superimposed masks rather than images of people wearing real face coverings. In three experiments we test humans (control participants and super-recognisers) and algorithms with images showing different types of face coverings. In all experiments we tested matching concealed or unconcealed faces to an unconcealed reference image, and we found a consistent decrease in face matching accuracy with masked compared to unconcealed faces. In Experiment 1, typical human observers were most accurate at face matching with unconcealed images, and poorer for three different types of superimposed mask conditions. In Experiment 2, we tested both typical observers and super-recognisers with superimposed and real face masks, and found that performance was poorer for real compared to superimposed masks. The same pattern was observed in Experiment 3 with algorithms. Our results highlight the importance of testing both humans and algorithms with real face masks, as using only superimposed masks may underestimate their detrimental effect on face identification.

Introduction

Unfamiliar face matching

While humans are very good at recognising the faces of familiar people (e.g. Bruce, 1986; Bruce et al., 2001; Burton et al., 1999), we are far poorer at recognising unfamiliar people. In a typical face matching task, participants are shown two images and are asked to judge whether they depict the same person or two different people. Unfamiliar face matching performance has been shown to be poor both in the laboratory (Clutterbuck & Johnston, 2002, 2004; Megreya & Burton, 2008; Ritchie et al., 2015, 2021, 2023; Sandford & Ritchie, 2021), and in live tasks matching a physically present unfamiliar person to a photograph (Davis & Valentine, 2009; Kemp et al., 1997; Megreya & Burton, 2008; Ritchie et al., 2020). Unfamiliar face matching performance is poor even in people who are employed to make identity decisions from images, such as checkout assistants (Kemp et al., 1997), passport officers (White et al., 2014), and police officers (Burton et al., 1999).

The addition of everyday paraphernalia such as glasses and sunglasses to one image in the pair has been shown to reduce face matching accuracy (Graham & Ritchie, 2019; Kramer & Ritchie, 2016; Noyes et al., 2021). Face masks have also been shown to impair face identification (Fitousi et al., 2021; Freud et al., 2020, 2021) and face matching (Carragher & Hancock, 2020; Dhamecha et al., 2014; Estudillo et al., 2021; Noyes et al., 2021), with masks causing more of a reduction in accuracy than sunglasses (Noyes et al., 2021). It is not clear, however, precisely why face masks cause an impairment to face matching performance. The current study seeks to shed light on the mechanisms underlying this effect by testing face matching using different types of lower face occlusions.

Super-recognisers

Although unfamiliar face matching is generally poor, some people are able to perform with far higher accuracy than the general population. First described as having exceptional face memory (Russell et al., 2009), these people are referred to as super-recognisers (see Noyes et al., 2017 for a review). Although there are individual differences between super-recognisers, at the group level they perform with consistently higher accuracy than control participants (Bobak et al., 2016a, 2016b; Davis et al., 2019; Noyes et al., 2018; Phillips et al., 2018). A recent study showed that super-recognisers are also more accurate than control participants at face matching with images of people wearing face masks (Noyes et al., 2021). The current study extends this work by testing both control participants and super-recognisers with different types of face coverings.

Algorithms

In recent years, there has been a rapid improvement in the performance of facial recognition algorithms through the use of ‘Deep Convolutional Neural Networks’ (DCNNs; e.g. Cao et al., 2018; Kemelmacher-Shlizerman et al., 2016; Taigman et al., 2014). One study tested algorithms made in 2015, 2016 and 2017 and showed a monotonic increase in performance from the oldest (68% accurate) to the newest (96% accurate; Phillips et al., 2018). Face masks present a new challenge for algorithm face identification. A recent competition that received 18 submissions found that eight did not meet the baseline criterion for verification errors (Boutros et al., 2021). The National Institute of Standards and Technology (NIST) in the USA runs a regular Face Recognition Vendor Test (FRVT), which is a standard test of facial recognition algorithms. The FRVT has consistently reported improvements in algorithm face identification, with algorithms achieving higher accuracy than humans (NIST, 2022a). NIST now also runs an ‘FRVT Face Mask Effects’ test looking specifically at algorithm identification from masked faces. Algorithms are presented with faces bearing superimposed masks and are tasked with identifying the person from a database of unmasked images (NIST, 2022b). Updates to the test show that some developers have adapted their algorithms to better cope with face masks, although the shape, colour, and coverage of the different masks used in the test affect some algorithms’ ability both to detect the face in the first place, and then to correctly identify the person pictured (Ngan et al., 2022).

Types of face coverings

While some previous studies of human face identification ability with face masks have used images of people wearing real masks (Dhamecha et al., 2014; Fitousi et al., 2021; Noyes et al., 2021), the majority have used pre-existing images with masks superimposed onto them (Carragher & Hancock, 2020; Estudillo et al., 2021; Freud et al., 2020, 2021). Some recent computer vision research has used real face masks (e.g. Jeevan et al., 2022; Lionnie et al., 2021), but the NIST FRVT Face Mask Effects test uses superimposed masks as the test images (Ngan et al., 2022).

It is not clear whether superimposed and real face masks produce different deficits in either human or computer face matching performance, and this difference is important for both theoretical understanding of face perception, and for understanding the impact of masks in applied face recognition practice. We have previously argued that one study using real face masks (Noyes et al., 2021) found a smaller reduction in face matching accuracy than a study using superimposed face masks (Carragher & Hancock, 2020) because it is possible that some elements of the person’s real face shape are still available to the viewer in real mask images but are covered in superimposed mask images. Although we predominantly use face texture to recognise other people (e.g. Burton et al., 2005), some element of face shape information may be useful (Rogers et al., 2022). Alternatively, it is possible that real face masks introduce extra texture information which may be more disruptive for face processing than superimposed masks, and the previously observed differences in findings (Carragher & Hancock, 2020; Noyes et al., 2021) were simply due to different task demands and methodologies.

The current studies

It is not clear exactly why face masks cause such a marked impairment in human face matching performance. One possibility is that masks cover facial features that are useful for identification (Towler et al., 2017). But previous research suggests that the upper half of the face, which remains visible when wearing a face covering, tends to be more useful for identification than the lower half (Fisher & Cox, 1975; McKelvie, 1976). Alternatively, covering the features of the lower face might interfere with the holistic processes that are used in face recognition (Maurer et al., 2002; Tanaka & Farah, 1993). In support of this possibility, Freud et al. (2020) report that holistic processing is impaired for faces wearing a face mask (see also Stajduhar et al., 2021). However, face matching can be aided by featural comparisons (Towler et al., 2017; White et al., 2015), which can occur without holistic processing (Towler et al., 2021). Recent research has shown that featural comparisons can lead to modest improvements in masked face matching performance (Carragher et al., 2022). The final possibility considered here is that the face mask serves as a source of distraction by attracting attention to the mask and away from the visible facial features.

In Experiment 1, we compare human unfamiliar face matching with different types of superimposed lower face occlusions. In Experiment 2, we compare unfamiliar face matching by control participants and super-recognisers with superimposed and real face masks, and in Experiment 3, we test algorithm performance with the real and superimposed masks.

Experiment 1: face matching with different types of superimposed lower face occlusions

This experiment was designed to investigate whether different types of superimposed face masks modulate the degree of impairment caused to unfamiliar face matching performance. In a within-participants design, observers completed a matching task in which one face in each pair was always presented unmasked, while the other face was selected from the following mask conditions: control (unmasked), fitted mask (the mask closely followed the shape of the face), loose mask (the mask occluded a large square shape, including the neck) and top half only (the entire lower half of the image was removed). First, we expect that performance will be higher for the control condition than all others, replicating the basic finding that face masks impair matching performance (Carragher & Hancock, 2020; Noyes et al., 2021). Comparisons between the mask conditions could potentially reveal the mechanism by which masks impair face matching performance. Higher accuracy in the fitted mask condition compared to the loose mask condition would suggest that observers can extract information about facial shape from the mask. Alternatively, significantly better performance in the top half only condition compared to the two mask conditions (fitted, loose) would suggest that masks are a source of attentional distraction. Finally, no difference between the three manipulated conditions (fitted mask, loose mask, top only) would be consistent with two different explanations: either that face masks impair matching performance because they cover facial features that are important for identification, or that they impede holistic processing. These final possibilities are inextricably linked because covering facial features will, by definition, also interfere with holistic processing.

Method

Participants

From a convenience sample of volunteers recruited via email and social media, we received complete data from 79 participants (22 male, 57 female; mean age: 34 years; SD: 16 years; range: 18–67 years). All participants were naïve to the aims of the study. This research was approved by the General University Ethics Panel at the University of Stirling, and all participants gave informed consent.

Stimuli

The face masks in the current study were plain colour patches that were fitted to the faces automatically using custom written code (see Fig. 1). Automatically located landmark points were fine-tuned manually. The same landmark points below the eyes and over the bridge of the nose were used to establish the top of the mask in each mask condition (fitted, loose, top only). The fitted mask was created by filling the landmark points that follow the shape of the jaw with a plain pale blue patch (RGB 143, 205, 205), which is most similar to the FRVT Face Mask Effects’ ‘wide, medium coverage’ mask (Ngan et al., 2022). The loose masks were created by extending the occlusion 10 pixels down below the bottom of the jaw, square below the widest point at the ears. The top only condition was created by cropping the image below the top of the mask.
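The three occlusion conditions described above can be illustrated with a minimal sketch. It is not the authors' custom code, which is not given in the text: the landmark coordinates, the function names, the white fill for the removed region, and the choice to extend the top edge horizontally to the image borders in the top only condition are all assumptions made for illustration. Only the mask colour (RGB 143, 205, 205) and the 10-pixel extension come from the description.

```python
# Illustrative sketch only. Landmarks, helper names, and the handling of the
# "top only" condition are hypothetical; only MASK_RGB and the 10 px
# extension are taken from the stimulus description.

MASK_RGB = (143, 205, 205)   # plain pale blue patch reported in the text
LOOSE_EXTENSION_PX = 10      # loose mask extends 10 px below the jaw
BLANK_RGB = (255, 255, 255)  # assumed fill for the removed lower region


def point_in_polygon(x, y, polygon):
    """Ray-casting test: is the point (x, y) inside the polygon?"""
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y):  # edge spans the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def fill_polygon(image, polygon, colour):
    """Return a copy of image with pixels inside polygon set to colour."""
    out = [row[:] for row in image]
    for y, row in enumerate(out):
        for x in range(len(row)):
            if point_in_polygon(x + 0.5, y + 0.5, polygon):
                row[x] = colour
    return out


def fitted_mask(image, top_edge, jaw_line):
    """Fill the region between the top-of-mask and jaw landmarks."""
    return fill_polygon(image, top_edge + jaw_line[::-1], MASK_RGB)


def loose_mask(image, top_edge, jaw_line):
    """Square the occlusion off below the widest point, 10 px past the jaw."""
    xs = [x for x, _ in top_edge + jaw_line]
    bottom = max(y for _, y in jaw_line) + LOOSE_EXTENSION_PX
    polygon = top_edge + [(max(xs), bottom), (min(xs), bottom)]
    return fill_polygon(image, polygon, MASK_RGB)


def top_only(image, top_edge):
    """Blank everything below the top of the mask (assumed interpretation)."""
    height, width = len(image), len(image[0])
    polygon = ([(0, top_edge[0][1])] + top_edge +
               [(width, top_edge[-1][1]), (width, height), (0, height)])
    return fill_polygon(image, polygon, BLANK_RGB)


# Toy 20 x 40 px "image" with hypothetical landmarks, listed left to right.
image = [[(0, 0, 0)] * 20 for _ in range(40)]
top_edge = [(4, 10), (10, 8), (16, 10)]
jaw_line = [(4, 14), (10, 22), (16, 14)]
fitted = fitted_mask(image, top_edge, jaw_line)
loose = loose_mask(image, top_edge, jaw_line)
cropped = top_only(image, top_edge)
```

All three functions share the same `top_edge` landmarks, mirroring the statement that the top of the mask was identical across the fitted, loose, and top only conditions.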

The faces for the current experiment came from two separate face matching tests. Half of the trials were the unfamiliar face pairs from the Stirling Famous Face Matching Task created by Carragher and Hancock (2020), making this the Stirling Unfamiliar Face Matching Task (SUFMT). These face pairs are images of amateur models that were downloaded from various online sources. The SUFMT consists of 40 image pairs, of which 20 are identity matches. The match and mismatch trials are evenly split for face sex. Each image only appears once within the SUFMT. The remaining trials came from the short version of the Kent Face Matching Test (KFMT; Fysh & Bindemann, 2018). The KFMT also consists of 40 trials, of which 20 are matches and 20 are mismatches. Each image pair consists of one smaller image that is typical of a student ID card, and one larger high-quality portrait image. The KFMT also consists of male and female face pairs. Thus, the experiment consisted of 80 trials in total, of which 40 were identity matches.

Trials from the SUFMT and KFMT were intermixed and randomised. Because all participants completed the same two tasks, we did not compare performance between the two tests. Allocation of trial pairs to mask conditions (control, fitted mask, loose mask, top only) was randomised between participants, such that all pairs were presented in each mask condition across participants. All participants completed 20 trials of each mask condition, of which 10 were match trials and 10 were mismatch trials. Face pairs in the fitted mask, loose mask and top only conditions consisted of one full-view face and one altered face. This image arrangement is consistent with the scenario in which a masked individual presents an official photo-ID document for inspection. In the KFMT, the smaller ID image was always unmasked, while the larger image was shown in each mask condition. All images were presented in colour. Images from the SUFMT were 420 × 595 px in size. Images from the KFMT were presented in their original sizes (Fysh & Bindemann, 2018); small (142 × 192 px), large (283 × 332 px).
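One way to satisfy the allocation constraints in this paragraph (20 trials per mask condition for each participant, 10 match and 10 mismatch, with every pair appearing in every condition across participants) is a simple rotation scheme. The sketch below is an assumption about how such counterbalancing could be done, not a description of the authors' procedure; the pair identifiers, function names, and seed are hypothetical.

```python
import random

CONDITIONS = ["control", "fitted mask", "loose mask", "top only"]


def allocate(match_ids, mismatch_ids, participant_index, seed=0):
    """Assign each face pair to a mask condition for one participant.

    Pairs are shuffled once (fixed seed) into four equal groups per trial
    type, and the group-to-condition mapping rotates with the participant
    index, so any four consecutive participants cover every pairing.
    """
    rng = random.Random(seed)  # fixed seed: same base grouping for everyone
    allocation = {}
    for ids in (match_ids, mismatch_ids):
        shuffled = list(ids)
        rng.shuffle(shuffled)
        group_size = len(shuffled) // len(CONDITIONS)
        for position, pair_id in enumerate(shuffled):
            group = position // group_size
            rotated = (group + participant_index) % len(CONDITIONS)
            allocation[pair_id] = CONDITIONS[rotated]
    return allocation


def presentation_order(allocation, rng):
    """Intermix all trials in a fresh random order for one participant."""
    trials = list(allocation.items())
    rng.shuffle(trials)
    return trials


# Hypothetical identifiers for the 40 match and 40 mismatch pairs.
match_ids = [f"M{i}" for i in range(40)]
mismatch_ids = [f"X{i}" for i in range(40)]
first_participant = allocate(match_ids, mismatch_ids, participant_index=0)
order = presentation_order(first_participant, random.Random(1))
```

The rotation guarantees balance by construction; fully independent random allocation per participant would only approximate it.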

Procedure

Participants completed the experiment on their personal computers via a web link. The experiment was run using Qualtrics survey software. Participants were informed that their task was to decide whether the two simultaneously presented images showed the same person or two different people. Responses were made using a 6-choice scale, which conveyed the identification decision (“Same”, “Different”) and confidence (“Certain”, “Think”, “Guess”). There was no time limit to give a response. All trial types were intermixed and presented in a random order in a single experimental block that consisted of all 80 trials. The experiment took approximately 15 min (M = 899 s, SD = 363 s) to complete.

Analysis

We analysed the data using signal detection measures of sensitivity (d′) and response bias (criterion). Sensitivity measures how well participants can discriminate match pairs from mismatches, with higher values indicating better performance (Macmillan & Creelman, 2004). Criterion is a measure of response bias, which shows whether participants had an overall tendency to report that pairs were a match (“same”) or mismatch (“different”). Positive criterion values indicate a bias to respond “different” across all trials (i.e. a conservative criterion), whereas negative values signal a “match” response bias (i.e. a liberal criterion). To calculate both measures, we collapsed across the confidence component of our scale, leaving only “same” and “different” responses (e.g. “Certainly Same”, “Think Same” and “Guess Same” were counted as “same”). These simplified responses correspond to hits (correctly responding “same” on a match trial) and false alarms (incorrectly responding “same” on a mismatch trial), which are used to calculate both d′ and criterion (Macmillan & Creelman, 2004; Stanislaw & Todorov, 1999). In both Experiments 1 and 2, we corrected hit rates of 1 using the formula 1 − 1/(2N) and false alarm rates of 0 using the formula 1/(2N), where N is the number of match (or mismatch) trials in each condition. The number of trials was the same in each condition in each experiment, giving a maximum d′ value of 3.29. In addition to traditional frequentist hypothesis testing, we included Bayes factors calculated in JASP (JASP Team, 2020) with default prior width, which allowed us to quantify the extent to which the data support the alternative hypothesis (BF₁₀). We interpret BFs of less than 3.0 as anecdotal evidence of the alternative hypothesis (e.g. Jeffreys, 1961).
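The signal detection calculation described above can be sketched as follows. This is a minimal illustration using the standard formulas d′ = z(H) − z(F) and c = −[z(H) + z(F)]/2, not the authors' analysis script; the function names and response strings are hypothetical. The text only describes correcting hit rates of 1 and false alarm rates of 0, so applying the same bounds symmetrically (to hit rates of 0 and false alarm rates of 1) is an assumption.

```python
from statistics import NormalDist

Z = NormalDist().inv_cdf  # inverse of the standard normal CDF


def collapse(response):
    """Drop the confidence component: 'Think Same' -> 'same'."""
    return "same" if response.strip().lower().endswith("same") else "different"


def count_responses(trials):
    """Count hits and false alarms from (is_match, response) tuples."""
    hits = sum(1 for is_match, r in trials
               if is_match and collapse(r) == "same")
    false_alarms = sum(1 for is_match, r in trials
                       if not is_match and collapse(r) == "same")
    n_match = sum(1 for is_match, _ in trials if is_match)
    return hits, false_alarms, n_match, len(trials) - n_match


def corrected_rate(count, n):
    """Clamp a proportion to [1/(2N), 1 - 1/(2N)].

    The text describes the upper bound for hits and the lower bound for
    false alarms; using both bounds for both rates is an assumption.
    """
    return min(max(count / n, 1 / (2 * n)), 1 - 1 / (2 * n))


def sdt_measures(hits, false_alarms, n_match, n_mismatch):
    """Return (d_prime, criterion); positive criterion = 'different' bias."""
    z_hit = Z(corrected_rate(hits, n_match))
    z_fa = Z(corrected_rate(false_alarms, n_mismatch))
    return z_hit - z_fa, -(z_hit + z_fa) / 2


# A perfect observer on 10 match and 10 mismatch trials in one condition.
perfect = ([(True, "Certainly Same")] * 10 +
           [(False, "Guess Different")] * 10)
d_prime, criterion = sdt_measures(*count_responses(perfect))
```

With 10 match and 10 mismatch trials per condition, the corrected rates for a perfect observer are 0.95 and 0.05, which is where the stated ceiling of d′ = 3.29 comes from.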

Results and discussion

All data for all experiments is available at https://osf.io/qgxhs/?view_only=6c6e8368c49d4d4fb634ada0671a7972

We present descriptive statistics here for ease of reading; a full analysis of accuracy as defined by per cent correct can be found in Additional file 1. In Experiment 1, face matching accuracy (per cent correct out of 20 trials per condition) ranged as follows: control (no concealment), 40% to 95% (M = 69%, SD = 11%); fitted mask, 35% to 85% (M = 62%, SD = 10%); loose mask, 30% to 85% (M = 61%, SD = 12%); and top only, 30% to 90% (M = 60%, SD = 12%).

Sensitivity

Our main analysis uses signal detection theory, as is common in the literature. A repeated measures ANOVA revealed a significant effect of mask condition on d′, F(3, 234) = 13.55, p < 0.001, η_p² = 0.15, BF₁₀ > 1000 (see Fig. 2). Bonferroni corrected post-hoc comparisons showed that sensitivity was significantly higher in the control condition compared to all other conditions (all ps < 0.001, all BF₁₀ > 400), which did not differ from each other (all ps > 0.999, all BF₁₀ < 1). The pattern of results is the same when the results are analysed using per cent correct, for both overall accuracy (collapsing across match and mismatch trials), and for match trials. However, there was no effect of mask condition on accuracy for mismatch trials (see Additional file 1: Sect. 1).

Criterion

There was a non-significant effect of mask condition on response bias, F(3, 234) = 2.12, p = 0.098, η_p² = 0.03, BF₁₀ = 0.22.
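As a consistency check on the reported statistics, partial eta squared can be recovered from an F ratio and its degrees of freedom using the standard identity η_p² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the two ANOVAs reported for Experiment 1; the function names are hypothetical, and the degrees-of-freedom helper assumes a one-way repeated measures design with no sphericity correction.

```python
def rm_anova_df(n_participants, n_levels):
    """Degrees of freedom for a one-way repeated measures ANOVA
    (uncorrected, i.e. assuming sphericity)."""
    df_effect = n_levels - 1
    df_error = (n_levels - 1) * (n_participants - 1)
    return df_effect, df_error


def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from F and its degrees of freedom."""
    return f_value * df_effect / (f_value * df_effect + df_error)


# 79 participants and 4 mask conditions give F(3, 234), as reported.
df_effect, df_error = rm_anova_df(n_participants=79, n_levels=4)

sensitivity_eta = partial_eta_squared(13.55, df_effect, df_error)
criterion_eta = partial_eta_squared(2.12, df_effect, df_error)
print(round(sensitivity_eta, 2), round(criterion_eta, 2))
```

Both values round to the effect sizes given in the text (0.15 for sensitivity and 0.03 for criterion).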

Sensitivity was highest in the control condition and fell significantly for the three mask conditions, which did not differ from each other. These results suggest that the shape of the superimposed mask does not influence the degree of impairment to matching performance. Our findings suggest that masks impair performance either because they occlude facial features that carry identity information, or because they disrupt holistic processing. However, this experiment only examined the effect of superimposed masks. It is possible that real masks introduce extra information, either attracting attention to the mask, or adding additional spurious texture information to the face. Therefore, it is possible that images of faces wearing real face masks may lead to reduced face matching ability compared to superimposed masks. Alternatively, as we have previously suggested (Noyes et al., 2021), it is possible that real face masks might preserve some information about face shape, which could be useful for identification (see Rogers et al., 2022). Therefore, in the following experiment we tested unfamiliar face matching with real and superimposed face masks.

Experiment 2: face matching with real and superimposed masks

This experiment tested both typical participants and super-recognisers. Both sets of participants were recruited from a large database of participants used in previous research (e.g. Belanova et al., 2021; Noyes et al., 2021; Satchell et al., 2019). Importantly, none of the participants who took part in this study had taken part in our previous test of masked face matching (Noyes et al., 2021). Here we aimed to examine the effect of real and superimposed masks on typical participants’ and super-recognisers’ unfamiliar face matching performance.

Previous research using super-recognisers has tended to assess their ability using two tests: the Glasgow Face Matching Test: short version (GFMT, Burton et al., 2010) and the Cambridge Face Memory Test: Extended (CFMT+, Russell et al., 2009). The GFMT has recently been criticised for being a relatively easy test (e.g. Ramon, 2021), so here we add a third test to the initial recruitment battery, the Kent Face Matching Test (KFMT, Fysh & Bindemann, 2018), which is a more difficult test of face matching than the GFMT.

Our super-recognisers are defined as individuals scoring 100% (40 out of 40) on the GFMT (Burton et al., 2010), 93% or above (95 or more out of 102) on the CFMT+ (Russell et al., 2009) and 82.5% or above (33 or more out of 40) on the KFMT (Fysh & Bindemann, 2018). Fewer than 5% of people achieve perfect performance on the GFMT (Burton et al., 2010), while an estimated 2% score 95 or above on the CFMT+ (Bobak et al., 2016a, 2016b; Russell et al., 2009), and average performance on the KFMT is 66.22%, taking the mean of performance reported in three studies (Fysh, 2018; Fysh & Bindemann, 2018; Gentry & Bindemann, 2019).

During the original database recruitment process, many participants did not meet the criteria to be classed as super-recognisers. Typical-ability participants were invited from this second group who had previously scored within approximately 1 standard deviation of the normal population mean on the GFMT (i.e. 28–36: Burton et al., 2010), CFMT+ (i.e. 58–83: Bobak et al., 2016a, 2016b) and the KFMT (i.e. 24–29: Fysh, 2018; Fysh & Bindemann, 2018; Gentry & Bindemann, 2019).
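The group definitions above amount to a set of score thresholds on three tests, which can be written out as a short sketch. The function name, the argument names, and the "unclassified" label are hypothetical, and the sketch assumes that a participant must meet the stated range on all three tests to enter either group; the cut-offs themselves are those given in the text (raw scores out of 40, 102, and 40).

```python
def classify(gfmt, cfmt_plus, kfmt):
    """Classify a participant from raw GFMT (/40), CFMT+ (/102) and
    KFMT (/40) scores, assuming all three criteria must be met."""
    if gfmt == 40 and cfmt_plus >= 95 and kfmt >= 33:
        return "super-recogniser"
    if 28 <= gfmt <= 36 and 58 <= cfmt_plus <= 83 and 24 <= kfmt <= 29:
        return "typical"
    return "unclassified"  # hypothetical label: eligible for neither group


# Hypothetical score profiles for illustration.
examples = {
    "A": classify(gfmt=40, cfmt_plus=97, kfmt=35),
    "B": classify(gfmt=33, cfmt_plus=70, kfmt=26),
    "C": classify(gfmt=38, cfmt_plus=90, kfmt=31),
}
```

Profile C falls in the gap between the two ranges, which illustrates that the criteria deliberately leave intermediate scorers out of both groups.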