More human than human: a Turing test for photographed faces

Sanders, Jet Gabrielle; Ueda, Yoshiyuki; Yoshikawa, Sakiko; Jenkins, Rob

doi:10.1186/s41235-019-0197-9

Original article
Open access
Published: 21 November 2019

Jet Gabrielle Sanders¹,² (ORCID: orcid.org/0000-0002-9951-2799),
Yoshiyuki Ueda³,
Sakiko Yoshikawa³ &
Rob Jenkins¹

Cognitive Research: Principles and Implications volume 4, Article number: 43 (2019)

Abstract

Background

Recent experimental work has shown that hyper-realistic face masks can pass for real faces during live viewing. However, live viewing embeds the perceptual task (mask detection) in a powerful social context that may influence respondents’ behaviour. To remove this social context, we assessed viewers’ ability to distinguish photos of hyper-realistic masks from photos of real faces in a computerised two-alternative forced choice (2AFC) procedure.

Results

In experiment 1 (N = 120), we observed an error rate of 33% when viewing time was restricted to 500 ms. In experiment 2 (N = 120), we observed an error rate of 20% when viewing time was unlimited. In both experiments we saw a significant performance cost for other-race comparisons relative to own-race comparisons.

Conclusions

We conclude that viewers could not reliably distinguish hyper-realistic face masks from real faces in photographic presentations. As well as its theoretical interest, failure to detect synthetic faces has important implications for security and crime prevention, which often rely on facial appearance and personal identity being related.

Significance

Forensic identification often relies on comparison of facial images (photographs or video stills) by human viewers. There are now dozens of criminal cases in which perpetrators have used hyper-realistic face masks to transform their appearance (e.g. change in apparent age, sex, or race). Facial disguise is not a new problem, but the level of realism that is achievable with these masks does raise new questions. With conventional disguises (e.g. balaclava or domino mask), it is generally clear that captured images do not show the person’s actual appearance. With hyper-realistic face masks, the situation is very different. Beyond a certain level of realism, viewers might think that captured images show the wearer’s real face. An error of that type can set an investigation down the wrong path, as numerous recent cases have shown (e.g. searching for a suspect of the wrong race). All of these implications hinge on whether or not the masks are truly realistic. Here we address this question by developing a Turing Test for photographed faces.

Background

Technologies often imitate natural objects, giving rise to artificial diamonds, artificial flowers, artificial fur, and countless other artefacts. How are we to judge the success of such imitations? In 1950, Alan Turing proposed an influential answer for the specific case of artificial intelligence: an imitation is successful when we cannot distinguish it from the real thing (Turing, 1950). In his original argument, Turing imagined a human evaluator engaged in natural language conversations with a real human and a computer designed to generate human-like responses. The evaluator would be informed that one of the two partners is a computer, and asked to determine which one. To focus the evaluation on quality of thought rather than quality of speech, the dialogue would be mediated by text only (e.g. keyboard and screen). If the evaluator cannot reliably distinguish the computer from the human, the computer is said to pass the test.

As a target of imitation, intelligent conversation is enormously complex. No current machine appears close to passing the Turing test. However, the logic of the test itself is straightforward, and provides a means for assessing the maturity of imitation technologies generally: given the imitation alongside the real thing, can an observer tell which is which?

Here we bring this logic to bear on a much more tightly circumscribed imitation technology - artificial faces (see Fig. 1). The past decade has seen increasing interest in the realism of computer-generated faces (Holmes, Banks, & Farid, 2016; Nightingale, Wade, & Watson, 2017). Our concern is artificial face images of a very different kind, specifically, unretouched photos of artificial faces in the real world. Images in this category differ from digital images in at least two important ways. First, digitally generated or manipulated images are not snapshots of the physical environment. They only exist in print and on screen, and that limits the ways in which viewers can encounter them. Our focus is physical artefacts that exist in the real world and are caught on camera. Second, digital image manipulation has been a part of mainstream media for a generation. As such, the level of public understanding that images may be “photoshopped” is high. One consequence of this development is that photorealistic images carry less evidential weight than they once did - all images are suspect in this sense (see Kasra, Shen, & O’Brien, 2018). Since the real world cannot be photoshopped in the same way, physical artefacts are more protected from this slide in credibility.

Artificial faces in the real world may not be intended to pass for genuine faces, even when they strive for realism in some respect. A marble bust might capture the proportions of a real face, but none of the movement; a robotic head might capture some facial movement, but remain disembodied. Hyper-realistic silicone masks differ from these examples in that they are worn by a real person, and so are seen in the context of a real body. Moreover, they are constructed from a flexible material, so they relay the wearer’s rigid and non-rigid head movements - at least at the gross scale (e.g. head turns; opening and closing of the mouth). These characteristics set hyper-realistic masks apart from other artificial faces, as they allow them to be fully embedded in natural social situations (see Fig. 2 for examples).

These natural social situations place unusual demands on imitation technologies, as humans tend to be especially attuned to social stimuli. Face perception offers abundant evidence of such tuning. For example, humans are predisposed to detect face-like patterns (Robertson, Jenkins, & Burton, 2017), and this tendency is present from early infancy (Morton & Johnson, 1991). Faces capture our attention (Langton, Law, Burton, & Schweinberger, 2008; Theeuwes & Van der Stigchel, 2006), and having captured attention, tend to retain it (Bindemann, Burton, Hooge, Jenkins, & De Haan, 2005). While viewing a face, we make inferences about the mind behind it, including emotional state from facial expression (Ekman & Friesen, 1971; Ueda & Yoshikawa, 2018; Young et al., 1997) and direction of attention from eye gaze (Baron-Cohen, Wheelwright, Hill, Raste, & Plumb, 2001; Friesen & Kingstone, 1998). We also use faces to identify individual people (Burton, Bruce, & Hancock, 1999; Burton, Jenkins, & Schweinberger, 2011), which can trigger retrieval of personal information from memory (Bruce & Young, 1986). All of these processes require high sensitivity to subtleties of facial appearance. There is even some evidence that these processes can become tuned to specific populations through social exposure. For example, children tend to be better at recognising young faces than old faces (and vice versa; Anastasi & Rhodes, 2005; Neil, Cappagli, Karaminis, Jenkins, & Pellicano, 2016); Japanese viewers tend to be better at recognising East Asian faces than Western faces (and vice versa; O’Toole, Deffenbacher, Valentin, & Abdi, 1994). Perhaps most relevant for the current study, discrimination between faces and non-face objects can be accomplished rapidly and accurately. Using saccadic reaction times, Crouzet, Kirchner, and Thorpe (2010) found that viewers could differentiate images of faces versus vehicles at 90% accuracy in under 150 milliseconds - significantly faster than discriminations that did not involve faces. The findings of Crouzet et al. (2010) were based on images from different categories. Nevertheless, they provide an interesting baseline against which to compare the more nuanced discriminations investigated here.

Taken together, these findings suggest that faces may be particularly difficult objects to imitate. Faces attract the glare of attention, and details of their appearance convey socially significant information. Even so, there is some evidence that hyper-realistic silicone masks can pass for real faces, at least in certain situations. In a previous study (Sanders et al., 2017), passers-by consistently failed to notice that a live confederate was wearing a hyper-realistic mask, and showed little evidence of having detected the mask covertly. Out of 160 participants in the critical condition, only two spontaneously reported the mask, and only a further three reported the mask following prompting. These low detection rates are consistent with the idea that hyper-realistic masks successfully imitate real faces. However, several aspects of the experimental procedure complicate this interpretation. For example, masks were not mentioned during the main phase of data collection, and participants had no reason to expect to see a mask. It is possible that participants might have detected the masks more often had they been expecting them. Moreover, responses were collected in a live social setting. It is possible that respondents were reluctant to inspect or to discuss the appearance of a person who was physically present (albeit out of earshot) - and especially reluctant to declare that person’s face to be artificial.

These matters of interpretation arise in part from our approach to testing, which prioritised ecological validity over experimental control. Here we adopt the complementary approach of two-alternative forced choice testing (2AFC), which strikes the opposite balance (see Bogacz, Brown, Moehlis, Holmes, & Cohen, 2006 for a review). The 2AFC method originated in psychophysical research (Fechner, 1860/1966), where it was developed to measure quantities such as perceptual acuity. Our application is closer in spirit to the Turing test, in that our main interest concerns the realism of artificial stimuli.

In 2AFC testing, the participant is presented with two stimuli, one of which is the target, and is forced to choose which is the correct stimulus. This contrasts with the tasks that we used previously (Sanders et al., 2017; Sanders & Jenkins, 2018), in which participants viewed individual stimuli, and made categorical judgements. There are several reasons why the proposed 2AFC testing should sharpen observers’ ability to distinguish hyper-realistic masks from real faces. First, the task instructions ensure that participants are aware in advance that masks will be presented. Second, social influence is minimised, as the task is computer based. Third, the task always involves two stimuli at a time: one is always a mask and the other is always a real face. Thus, even when participants are uncertain whether one of the images is the target, they can still solve the task indirectly if they are certain about the other image.
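The third point can be made concrete with a toy probability model. The sketch below is not part of the study. It assumes, purely for illustration, that an observer becomes certain about any single image independently with probability p_certain, that such certainty is always correct, and that the observer otherwise guesses at 50%. The function and parameter names are hypothetical.

```python
def single_image_accuracy(p_certain: float) -> float:
    """Expected accuracy when judging one image at a time.

    Assumption (illustrative only): the observer is certain, and correct,
    with probability p_certain, and otherwise guesses at chance (0.5).
    """
    return p_certain + (1.0 - p_certain) * 0.5


def two_afc_accuracy(p_certain: float) -> float:
    """Expected accuracy in a 2AFC trial with one mask and one real face.

    Assumption (illustrative only): certainty about each image arises
    independently. Being certain about EITHER image settles the trial,
    because the other image must belong to the other category.
    """
    p_neither = (1.0 - p_certain) ** 2      # uncertain about both images
    return (1.0 - p_neither) + p_neither * 0.5


if __name__ == "__main__":
    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(p, single_image_accuracy(p), two_afc_accuracy(p))
```

Under these assumptions, the 2AFC accuracy is never lower than the single-image accuracy, which is the sense in which paired presentation should favour the observer.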

To test for other-race effects in this task, we collected data in both the UK and Japan. Although other-race effects are most strongly associated with identity-based tasks, such as face recognition (Meissner & Brigham, 2001) and face matching (Megreya, White, & Burton, 2011), our question here is whether they can also arise when distinguishing real faces from other face-like stimuli (Robertson et al., 2017) - a task more akin to face detection. The live viewing study by Sanders et al. (2017) could not address this point fully, as in naturalistic settings, the base-rate probabilities of encountering own-race and other-race faces are not well matched. Moreover, participants had no insight into the probability of a mask being present, even in the laboratory-based experiments. The 2AFC task gets around these limitations by allowing us to present own-race and other-race items equally often. We expect that equating background probabilities in this way will allow us to reach a more definitive answer.

Experiment 1

To assess participants’ ability to distinguish hyper-realistic masks from real faces, we constructed a computer-based 2AFC task in which participants viewed pairs of on-screen images (one face and one mask), and indicated via key press which of the two images showed the mask. For comparison, we also included low-realism masks that were easy to detect. We expected that reaction times would be markedly slower in the high-realism condition than in the low-realism condition. Our main interest was whether the high-realism masks cleaved with the low-realism masks or with the real faces.

To test for other-race effects, we also presented equal numbers of own-race and other-race trials. The standard perceptual explanation of the other-race effect is that viewers become attuned to the variability that surrounds them, and remain relatively insensitive to variability outside of this range (e.g. O’Toole et al., 1994). These differences in perceptual experience lead to more efficient perceptual discrimination for own-race faces than for other-race faces. Although these effects are usually demonstrated using identification tasks, the same argument also applies to distinguishing hyper-realistic masks from real faces. We thus predicted shorter response latencies for own-race faces than for other-race faces in this task.

Method

Ethics statement

Ethical approval for the experiment in this study was obtained from the departmental ethics committees at the University of York (approval number Id215) and Kyoto University (approval number 28-N-3). Participants provided written informed consent to participate.

Participants

Volunteers (N = 120) took part in exchange for a small payment or course credit. These were 60 members of the volunteer panel at the University of York (39 female, 21 male; mean age 23 years, age range 18–39 years) and 60 members of the volunteer panel at Kyoto University (27 female, 33 male; mean age 22 years, age range 18–50 years). Testing took place on site at Kyoto University, Japan and the University of York, UK.

Materials and design

Three types of photographic image were used to construct the stimulus pairs - high-realism masks, low-realism masks, and real faces. To allow a fully crossed design, we collected an equal number of Asian and Caucasian images for each category. To ensure that we sampled real-world image variability, we used ambient images throughout (Jenkins et al., 2011). In the high-realism condition, a real face was paired with a hyper-realistic silicone mask. In the low-realism condition, a real face was paired with a non-realistic party mask.

High-realism mask images

To collect images of high-realism masks, we entered the search terms “realistic masks”, “hyper-realistic masks” and “realistic silicone masks” into Google Images. We selected images that (1) exceeded 150 pixels in height, (2) showed the mask in roughly frontal aspect, (3) showed the eye region without occlusions, and (4) included eyebrows made with real human hair. We used the same criteria to search the websites of mask manufacturers (e.g. RealFlesh Masks, SPFX, CFX) and topical forums on social media (e.g. Silicone Mask Sickos, Silicone Mask Addicts). For each of the Asian and Caucasian image sets, we gathered 37 hyper-realistic mask images that met the inclusion criteria (74 high-realism mask images in total).
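The four inclusion criteria can be expressed as a simple predicate. This is a minimal sketch, not the authors' procedure: the record type and its field names are hypothetical, and the three boolean fields stand in for judgements that were presumably made by eye rather than computed.

```python
from dataclasses import dataclass

MIN_HEIGHT_PX = 150  # criterion 1: image height must exceed this value


@dataclass
class CandidateImage:
    """Hypothetical record describing one candidate image."""
    height_px: int
    roughly_frontal: bool       # criterion 2 (assumed to be judged by eye)
    eyes_unoccluded: bool       # criterion 3 (assumed to be judged by eye)
    real_hair_eyebrows: bool    # criterion 4 (assumed to be judged by eye)


def meets_criteria(img: CandidateImage, require_eyebrows: bool = True) -> bool:
    """Return True if the image satisfies criteria 1-3, plus 4 if required."""
    basic = (
        img.height_px > MIN_HEIGHT_PX
        and img.roughly_frontal
        and img.eyes_unoccluded
    )
    return basic and (img.real_hair_eyebrows or not require_eyebrows)


example = CandidateImage(height_px=300, roughly_frontal=True,
                         eyes_unoccluded=True, real_hair_eyebrows=False)
print(meets_criteria(example))                          # all four criteria
print(meets_criteria(example, require_eyebrows=False))  # criteria 1-3 only
```

The require_eyebrows switch is included only so that the same check can be reused where criterion 4 does not apply.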

Low-realism mask images

For comparison, we collected 74 images of low-realism masks by combining the search terms “Caucasian” and “Asian” with terms such as “Halloween”, “party”, “mask”, “masquerade”, “face-mask”, and “party mask” in Google Images, and selecting the first images that met the inclusion criteria 1–3 above. For low-realism mask images, race referred to the mask wearer, and was apparent from the parts of the face that were not occluded, and from the image source.

Real-face images

We also collected 148 real-face images to pair with the 74 high-realism and 74 low-realism mask images (148 mask images in total). To ensure that the demographic distribution among our real-face images was similar to that portrayed by the high-realism masks, we combined the search terms “Caucasian” and “Asian” with the terms “young male”, “old male”, “young female”, and “old female” in Google Images. We then accepted images that met criteria 1–4 until the distribution of faces across these categories was the same as for the high-realism mask images. All photos were cropped to show the head region only and resized for presentation to 540 pixels high × 385 pixels wide (see Fig. 2).

To create the stimulus displays, we paired each real-face image with a mask image from either the high-realism or the low-realism set. On each trial, the mask was equally likely to appear on the left or right side of the display. Stimuli always paired two images showing the same race (i.e. both Asian or both Caucasian). Within these constraints, image pairings were randomized separately for each participant, such that each participant saw each image exactly once, but judged different image combinations. In both the UK group and the Japan group, participants were randomly assigned to either the own-race or the other-race condition.
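The pairing constraints described above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the authors' experiment code: it assumes that each participant sees the images of a single race (74 real faces, 37 high-realism masks, and 37 low-realism masks), and the function name, dictionary keys, and file-name placeholders are all hypothetical.

```python
import random


def build_trials(real_faces, high_masks, low_masks, rng):
    """Pair every real face with exactly one mask for a single participant.

    Faces are shuffled before pairing, so different participants judge
    different face-mask combinations while seeing each image exactly once.
    """
    masks = [(m, "high") for m in high_masks] + [(m, "low") for m in low_masks]
    if len(real_faces) != len(masks):
        raise ValueError("need exactly one real face per mask image")

    faces = list(real_faces)
    rng.shuffle(faces)                      # random face-mask pairings

    trials = []
    for face, (mask, realism) in zip(faces, masks):
        trials.append({
            "face": face,
            "mask": mask,
            "realism": realism,
            # the mask is equally likely to appear on either side
            "mask_side": rng.choice(["left", "right"]),
        })
    rng.shuffle(trials)                     # random trial order
    return trials


# Placeholder stimulus names for one race set (hypothetical).
real_faces = [f"face_{i:02d}" for i in range(74)]
high_masks = [f"high_mask_{i:02d}" for i in range(37)]
low_masks = [f"low_mask_{i:02d}" for i in range(37)]

trials = build_trials(real_faces, high_masks, low_masks, random.Random(0))
print(len(trials), trials[0])
```

Passing a separate random.Random instance per participant keeps each participant's randomisation independent and reproducible.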

Procedure

Participants were instructed that each stimulus pair contained one real face and one mask, and that the task was to indicate via key press which image showed the mask. Each trial began with an image pair presented at the centre of the screen for 500 ms with the caption “Who is wearing the mask?” immediately below, and response options “Z” and “M” below the left and right images respectively (see Fig. 2). After 500 ms, the images were removed, and the question and response options remained onscreen until response. Participants pressed “Z” for the left image, or “M” for the right image as quickly and accurately as possible, and the response initiated the next trial. Each participant saw three practice trials followed by 74 recorded trials in a random order. The entire experiment took approximately 10 min to complete.
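The response mapping and the error-rate measure can be sketched as follows. This is a minimal illustration, not the software used in the study: the constant and function names are hypothetical, and the example responses are invented solely to show the calculation.

```python
STIMULUS_DURATION_MS = 500                  # images shown for 500 ms per trial
KEY_TO_SIDE = {"z": "left", "m": "right"}   # "Z" = left image, "M" = right image


def is_correct(key: str, mask_side: str) -> bool:
    """A response is correct when the chosen side is the side showing the mask."""
    return KEY_TO_SIDE[key.lower()] == mask_side


def error_rate(responses) -> float:
    """Proportion of incorrect responses.

    `responses` is a sequence of (key_pressed, mask_side) tuples.
    """
    if not responses:
        raise ValueError("no responses to score")
    errors = sum(not is_correct(key, side) for key, side in responses)
    return errors / len(responses)


# Invented example: four trials, two of them answered incorrectly.
example_responses = [("Z", "left"), ("M", "left"), ("M", "right"), ("Z", "right")]
print(error_rate(example_responses))
```

Because every trial contains exactly one mask, an observer who guesses at random would be expected to produce an error rate of 50% on this measure.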