To assess participants’ ability to distinguish hyper-realistic masks from real faces, we constructed a computer-based 2AFC task in which participants viewed pairs of on-screen images (one face and one mask), and indicated via key press which of the two images showed the mask. For comparison, we also included low-realism masks that were easy to detect. We expected that reaction times would be markedly slower in the high-realism condition than in the low-realism condition. Our main interest was whether the high-realism masks cleaved with the low-realism masks or with the real faces.
To test for other-race effects, we also presented equal numbers of own-race and other-race trials. The standard perceptual explanation of the other-race effect is that viewers become attuned to the variability that surrounds them, and remain relatively insensitive to variability outside of this range (e.g. O’Toole et al., 1994). These differences in perceptual experience lead to more efficient perceptual discrimination for own-race faces than for other-race faces. Although these effects are usually demonstrated using identification tasks, the same argument also applies to distinguishing hyper-realistic masks from real faces. We thus predicted shorter response latencies for own-race faces than for other-race faces in this task.
Method
Ethics statement
Ethical approval for the experiment in this study was obtained from the departmental ethics committees at the University of York (approval number Id215) and Kyoto University (approval number 28-N-3). Participants provided written informed consent to participate.
Participants
Volunteers (N = 120) took part in exchange for a small payment or course credit. These were 60 members of the volunteer panel at the University of York (39 female, 21 male; mean age 23 years, age range 18–39 years) and 60 members of the volunteer panel at Kyoto University (27 female, 33 male; mean age 22 years, age range 18–50 years). Testing took place on site at Kyoto University, Japan and the University of York, UK.
Materials and design
Three types of photographic image were used to construct the stimulus pairs: high-realism masks, low-realism masks, and real faces. To allow a fully crossed design, we collected an equal number of Asian and Caucasian images for each category. To ensure that we sampled real-world image variability, we used ambient images throughout (Jenkins et al., 2011). In the high-realism condition, a real face was paired with a hyper-realistic silicone mask. In the low-realism condition, a real face was paired with a non-realistic party mask.
High-realism mask images
To collect images of high-realism masks, we entered the search terms “realistic masks”, “hyper-realistic masks” and “realistic silicone masks” into Google Images. We selected images that (1) exceeded 150 pixels in height, (2) showed the mask in roughly frontal aspect, (3) showed the eye region without occlusions, and (4) included eyebrows made with real human hair. We used the same criteria to search the websites of mask manufacturers (e.g. RealFlesh Masks, SPFX, CFX) and topical forums on social media (e.g. Silicone Mask Sickos, Silicone Mask Addicts). For each of the Asian and Caucasian image sets, we gathered 37 hyper-realistic mask images that met the inclusion criteria (74 high-realism mask images in total).
Low-realism mask images
For comparison, we collected 74 images of low-realism masks by combining the search terms “Caucasian” and “Asian” with terms such as “Halloween”, “party”, “mask”, “masquerade”, “face-mask”, and “party mask” in Google Images, and selecting the first images that met the inclusion criteria 1–3 above. For low-realism mask images, race referred to the mask wearer, and was apparent from the parts of the face that were not occluded, and from the image source.
Real-face images
We also collected 148 real-face images to pair with the 74 high-realism and 74 low-realism mask images (148 mask images in total). To ensure that the demographic distribution among our real-face images was similar to that portrayed by the high-realism masks, we combined the search terms “Caucasian” and “Asian” with the terms “young male”, “old male”, “young female”, and “old female” in Google Images. We then accepted images that met criteria 1–4 until the distribution of faces across these categories was the same as for the high-realism mask images. All photos were cropped to show the head region only and resized for presentation to 540 pixels high × 385 pixels wide (see Fig. 2).
To create the stimulus displays, we paired each real-face image with a mask image from either the high-realism or the low-realism set. On each trial, the mask was equally likely to appear on the left or right side of the display. Stimuli always paired two images showing the same race (i.e. both Asian or both Caucasian). Within these constraints, image pairings were randomized separately for each participant, such that each participant saw each image exactly once, but judged different image combinations. In both the UK group and the Japan group, participants were randomly assigned to either the own-race or the other-race condition.
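As an illustration only, the pairing constraints described above can be sketched in Python. The function name `build_trials`, the file names, and the dictionary fields are hypothetical, and the sketch assumes that any real face may be paired with any mask of the same race; the software actually used to run the experiment is not described here.

```python
import random

def build_trials(real_faces, high_masks, low_masks, rng):
    """Pair every mask with a distinct real face and choose a side for the mask."""
    masks = [(m, "high") for m in high_masks] + [(m, "low") for m in low_masks]
    if len(real_faces) != len(masks):
        raise ValueError("need exactly one real face per mask")
    faces = list(real_faces)
    rng.shuffle(faces)  # fresh face-mask pairing for each participant
    trials = []
    for face, (mask, realism) in zip(faces, masks):
        mask_side = rng.choice(["left", "right"])  # mask equally likely on either side
        trials.append({
            "face": face,
            "mask": mask,
            "realism": realism,
            "mask_side": mask_side,
            # "Z" answers left and "M" answers right, as in the Procedure
            "correct_key": "Z" if mask_side == "left" else "M",
        })
    rng.shuffle(trials)  # random trial order
    return trials

# One race set (placeholder file names): 74 real faces, 37 + 37 masks.
rng = random.Random(1)
faces = [f"face_{i:02d}.jpg" for i in range(74)]
high = [f"high_mask_{i:02d}.jpg" for i in range(37)]
low = [f"low_mask_{i:02d}.jpg" for i in range(37)]
trials = build_trials(faces, high, low, rng)
```

Because the images passed in all come from a single race set, the same-race constraint is satisfied by construction, and each image appears exactly once per participant.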
Procedure
Participants were instructed that each stimulus pair contained one real face and one mask, and that the task was to indicate via key press which image showed the mask. Each trial began with an image pair presented at the centre of the screen for 500 ms with the caption “Who is wearing the mask?” immediately below, and response options “Z” and “M” below the left and right images respectively (see Fig. 2). After 500 ms, the images were removed, and the question and response options remained onscreen until response. Participants pressed “Z” for the left image, or “M” for the right image as quickly and accurately as possible, and the response initiated the next trial. Each participant saw three practice trials followed by 74 recorded trials in a random order. The entire experiment took approximately 10 min to complete.
Results
Reaction time and error data are summarized in Fig. 3.
Reaction times
Participants’ mean correct reaction times (RTs) were analysed by 2 × 2 mixed analysis of variance (ANOVA) with the within-subjects factor of mask type (high-realism, low-realism), and the between-subjects factor of race (own-race, other-race).
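For readers who wish to reproduce the aggregation step, the following minimal sketch shows how mean correct RTs might be computed for each participant and mask type before entry into the ANOVA. The record fields (`participant`, `mask_type`, `rt_ms`, `correct`) and the example values are hypothetical and are not data from this experiment.

```python
from collections import defaultdict
from statistics import mean

def mean_correct_rt(trials):
    """Map (participant, mask_type) to the mean RT in ms over correct trials only."""
    buckets = defaultdict(list)
    for t in trials:
        if t["correct"]:  # error trials are excluded from the RT analysis
            buckets[(t["participant"], t["mask_type"])].append(t["rt_ms"])
    return {cell: mean(rts) for cell, rts in buckets.items()}

# Toy records for one hypothetical participant.
example_trials = [
    {"participant": "p01", "mask_type": "high", "rt_ms": 1300, "correct": True},
    {"participant": "p01", "mask_type": "high", "rt_ms": 1500, "correct": False},
    {"participant": "p01", "mask_type": "high", "rt_ms": 1100, "correct": True},
    {"participant": "p01", "mask_type": "low", "rt_ms": 900, "correct": True},
]
cell_means = mean_correct_rt(example_trials)
```

Each participant then contributes one mean per level of the within-subjects factor, with race condition attached as the between-subjects factor.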
As expected, there was a significant main effect of mask type, with slower responses for high-realism trials (mean (M) = 1258 ms, SE = 40.8, CI = 1178–1339) than for low-realism trials (M = 921 ms, SE = 29.3, CI = 857–971) (F (1,118) = 204.6, p < .001, partial η2 = 0.63, Cohen’s d = 2.61).
There was also a significant main effect of race, with slower RTs in the other-race condition (M = 1197 ms, SE = 103.5, CI = 994–1399) than in the own-race condition (M = 976 ms, SE = 76.6, CI = 826–1125) (F (1,118) = 11.97, p < .001, partial η2 = 0.09, Cohen’s d = 0.63). The interaction between mask type and race was not significant (F (1,118) = 3.60, p = .06, partial η2 = 0.03, Cohen’s d = 0.35). For consistent reporting of effects across experiments, we also analysed simple main effects.
Simple main effects confirmed that there was a significant effect of mask type for both own-race (F (1,118) = 76.96, p < .001, partial η2 = 0.40, Cohen’s d = 1.63) and other-race faces (F (1,118) = 131.26, p < .001, partial η2 = 0.53, Cohen’s d = 2.12). The effect of race was also present in both the high-realism condition (F (1,118) = 11.62, p = .001, partial η2 = 0.09, Cohen’s d = 0.63) and the low-realism condition (F (1,118) = 9.61, p = .002, partial η2 = 0.08, Cohen’s d = 0.59).
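The effect sizes reported in this section can be related to the F statistics with a short calculation. Partial η2 follows from F and its degrees of freedom by the standard identity η2 = (F × df_effect) / (F × df_effect + df_error). The conversion to Cohen’s d shown below, d = 2√(η2 / (1 − η2)), is an assumption: it is not stated in the text, although the reported values appear consistent with it when applied to the rounded η2. The function names are illustrative.

```python
import math

def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

def cohens_d_from_eta_squared(eta_sq):
    """Assumed conversion from (partial) eta squared to Cohen's d."""
    return 2 * math.sqrt(eta_sq / (1 - eta_sq))

# Main effect of mask type on RT: F(1,118) = 204.6
eta_mask = round(partial_eta_squared(204.6, 1, 118), 2)
d_mask = round(cohens_d_from_eta_squared(eta_mask), 2)

# Main effect of race on RT: F(1,118) = 11.97
eta_race = round(partial_eta_squared(11.97, 1, 118), 2)
d_race = round(cohens_d_from_eta_squared(eta_race), 2)
```

Under these assumptions the calculation returns 0.63 and 2.61 for mask type, and 0.09 and 0.63 for race, matching the values reported above.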
Errors
Mean percentage correct scores were likewise analysed by 2 × 2 mixed ANOVA with the within-subjects factor of mask type (high-realism, low-realism), and the between-subjects factor of race (own-race, other-race).
This analysis revealed a significant main effect of mask type, with lower accuracy for high-realism trials (M = 66.2%, SE = 1.2, CI = 63.8–68.8) than for low-realism trials (M = 97.7%, SE = 0.4, CI = 96.9–98.6) (F (1,118) = 635.8, p < .001, partial η2 = 0.84, Cohen’s d = 4.58).
There was no significant main effect of race on accuracy (own-race: M = 83.0%, SE = 0.8, CI = 81.5–84.6; other-race: M = 80.9%, SE = 0.9, CI = 79.2–82.5) (F (1,118) = 2.69, p = .104, partial η2 = 0.02, Cohen’s d = 0.28), and no significant interaction between mask type and race (F (1,118) = 3.44, p = .066, partial η2 = 0.03, Cohen’s d = 0.35).
Simple main effects confirmed that there was a significant effect of mask type in both the own-race condition (F (1,118) = 272.85, p < .001, partial η2 = 0.70) and the other-race condition (F (1,118) = 366.33, p < .001, partial η2 = 0.76). Despite the numerical trend, there was no significant effect of race in the high-realism condition (F (1,118) = 3.45, p = .066, partial η2 = 0.03, Cohen’s d = 0.35), nor in the low-realism condition (F (1,118) = 0.02, p = .880, partial η2 < .001, Cohen’s d < .001).
Owing to the ceiling effect in the low-realism condition, we also compared own-race and other-race conditions with a separate Mann–Whitney test for each mask type. We found no significant effect of race for the high-realism condition (U = 1466, p = .079) or the low-realism condition (U = 1670, p = .437).
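For reference, the Mann–Whitney U statistic can be computed from pooled ranks as in the sketch below. This is a plain-Python illustration using midranks for ties. It assumes that the reported U is the smaller of the two possible U values, which the text does not specify, and the example scores are invented.

```python
def mann_whitney_u(x, y):
    """Return the smaller of the two U statistics, using midranks for tied values."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    u_y = n1 * n2 - u_x
    return min(u_x, u_y)

# Invented accuracy scores (percent correct) for two small groups.
own_race = [70.0, 65.0, 80.0]
other_race = [60.0, 65.0, 55.0]
u_example = mann_whitney_u(own_race, other_race)
```

With 60 participants per group, U ranges from 0 to 3600 and has an expected value of 1800 under the null hypothesis, which gives some context for the values of 1466 and 1670 reported above.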
Given the high error rate in the high-realism condition, we next examined the distribution of errors across images. The purpose of this analysis was to establish whether errors were driven by a particular subset of images, or were instead distributed across the entire set. Figure 4 shows the results of this analysis. All of the high-realism mask images attracted some errors, and most attracted many errors. In other words, errors were not driven by a particular subset of images. Rather, they were distributed across the entire set.
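A per-image error tally of the kind described here could be computed as in the following sketch. The field names (`mask_image`, `correct`) and the example records are hypothetical.

```python
from collections import Counter

def error_rate_by_image(trials):
    """Map each mask image to the proportion of its presentations that drew an error."""
    shown = Counter()
    errors = Counter()
    for t in trials:
        shown[t["mask_image"]] += 1
        if not t["correct"]:
            errors[t["mask_image"]] += 1
    return {image: errors[image] / shown[image] for image in shown}

# Toy records: two hypothetical mask images, each judged by two participants.
image_trials = [
    {"mask_image": "high_mask_01.jpg", "correct": False},
    {"mask_image": "high_mask_01.jpg", "correct": True},
    {"mask_image": "high_mask_02.jpg", "correct": True},
    {"mask_image": "high_mask_02.jpg", "correct": True},
]
rates = error_rate_by_image(image_trials)
```

If errors were driven by a small subset of images, most rates would sit near zero with a few high outliers; a spread of non-zero rates across all images corresponds to the pattern described above.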
Discussion
Analysis of RTs showed that 2AFC discrimination of masks from real faces was indeed slower for high-realism masks than for low-realism masks (~ 300 ms RT cost). As it turned out, the more interesting effect was in the error data. Participants performed almost perfectly in the low-realism condition (98% accuracy). That is perhaps not surprising, given the simplicity of the task. However, accuracy in the high-realism condition was just 66%, in the context of chance performance being 50%. An error in this 2AFC task is striking, as it requires the observer to choose the real face over the alternative, when the alternative is a mask. The implication is not merely that the hyper-realistic masks looked human. In some cases, they appeared more human than human in this task. That was the judgement in one-third of the high-realism trials.
We also observed an effect of race in reaction times (~ 200 ms cost), though not in the accuracy data. If reliable, this is an intriguing finding, as it potentially extends the classic other-race effect from identification tasks to the very different task of differentiating real faces from synthetic faces (masks).
One aspect of our experiment that complicates interpretation is the limited exposure duration for the stimuli (500 ms). Limiting stimulus duration is standard practice when the task would otherwise be too easy (Bogacz et al., 2006). As it turned out, the high-realism condition was not too easy. In the next experiment, we removed this time limit.