Face search in CCTV surveillance

Mileva, Mila; Burton, A. Mike

doi:10.1186/s41235-019-0193-0

Original article
Open access
Published: 23 September 2019

Cognitive Research: Principles and Implications volume 4, Article number: 37 (2019)

Abstract

Background

We present a series of experiments on visual search in a highly complex environment, security closed-circuit television (CCTV). Using real surveillance footage from a large city transport hub, we ask viewers to search for target individuals. Search targets are presented in a number of ways, using naturally occurring images including their passports and photo ID, social media and custody images/videos. Our aim is to establish general principles for search efficiency within this realistic context.

Results

Across four studies we find that providing multiple photos of the search target consistently improves performance. Three different photos of the target, taken at different times, give substantial performance improvements by comparison to a single target. By contrast, providing targets in moving videos or with biographical context does not lead to improvements in search accuracy.

Conclusions

We discuss the multiple-image advantage in relation to a growing understanding of the importance of within-person variability in face recognition.

Significance

In many countries, closed-circuit television (CCTV) surveillance is common in public spaces. The availability of CCTV footage has brought about significant changes in policing and across judicial systems. While finding a person of interest can be vital for public safety, it is also a task of great visual complexity that requires sustained attention, good identity detection and recognition skills and other cognitive resources. Here, we aimed to establish whether there are any general psychological principles for understanding the accuracy of search in this noisy, real-world setting. We asked participants to look for target individuals in real surveillance footage from a city rail station. The search target photos were also real, being passport photos, custody images or social media images. This way we bridged the gap between laboratory-based experiments and real-life CCTV search. We focused on the role of within-person variability (i.e. how different images of the same person can often look very different, and how this is incorporated into visual representations) and demonstrated its benefits for finding target identities in CCTV footage, a task that is conducted by security officers around the world every day.

Background

Visual search is typically studied in highly artificial, but tightly-controlled visual environments, for example asking viewers to find a particular letter among a set of distractors (Duncan & Humphreys, 1989; Treisman & Gelade, 1980). This fundamental approach can elicit general principles, such as the importance of target salience and the effects of multiple distractors. However, it is difficult to apply the results directly to everyday visual search such as finding one’s bag at an airport or looking for a friend at a station (Clark, Cain, & Mitroff, 2015).

A number of search experiments have been performed with real scenes, and some with specialist displays such as airport baggage or medical radiology. From these it is possible to make general observations demonstrating the effects of scene context (e.g. Seidl-Rathkopf, Turk-Browne, & Kastner, 2015; Wolfe, Alvarez, Rosenholtz, Kuzmova, & Sherman, 2011); searcher vigilance (e.g. Warm, Finomore, Vidulich, & Funke, 2015); target prevalence (e.g. Menneer, Donnelly, Godwin, & Cave, 2010; Wolfe et al., 2007); target-distractor similarity (Alexander & Zelinsky, 2011; Duncan & Humphreys, 1989; Pashler, 1987) and individual differences (e.g. Muhl-Richardson et al., 2018). Furthermore, while most experiments are conducted with static stimuli, it has also been established that attention can follow moving objects within a scene (for example as measured by inhibition of return to projected future object locations, Tipper, Driver, & Weaver, 1991; Tipper, Jordan, & Weaver, 1999).

Despite this wealth of research, rather little is known about the mechanics of an everyday search task that is not only commonplace, but often security critical. In the present study, we examined the problem of trying to find a target person in real CCTV recordings of a busy rail station. The search targets were previously unknown to those watching the CCTV, and searchers also had access to the types of images available to police and security agencies, e.g. passports, driving licences and custody images. CCTV quality was not always high, ambient lighting conditions were changeable and the level of crowding was highly variable. All these factors combine to make this a very difficult search task. Nevertheless, we aimed to establish whether it is possible to discern some general principles about search in this noisy, visual environment. In the experiments subsequently described we showed photos of a target person alongside video clips from CCTV. We ask whether particular display manipulations lead to more efficient search: is it beneficial to show multiple images of the target or perhaps moving images of the target?

Historically, the appeal of CCTV surveillance stems from its comparison to eyewitness testimony, which has been the focus of a substantial amount of forensic and applied research (Ellis, Shepherd, & Davies, 1975; Wells, 1993; Wells & Olson, 2003). Eyewitness testimony is known to be highly error-prone, and methods used to enhance memory of faces, while sometimes resulting in small improvements, have not delivered a means of overcoming this problem. CCTV footage, however, can eliminate some of these problems as it provides a permanent record of events and all those involved in them. This apparent benefit has therefore motivated the widespread installation of CCTV cameras and has enhanced their use and impact in court in many jurisdictions (Farrington, Gill, Waples, & Argomaniz, 2007; Welsh & Farrington, 2009). Nevertheless, there is now substantial evidence that unfamiliar face matching (i.e. deciding whether two different, simultaneously presented images belong to the same identity or not) is a surprisingly difficult process (Megreya & Burton, 2006, 2008). This is likely to affect the type of visual search examined here, as it is now clear that face matching is difficult even in optimal conditions (e.g. images taken only minutes apart in good lighting and similar pose, with unlimited time for viewers to examine the images and make their response; Bruce et al., 1999; Burton, White, & McNeill, 2010).

Similar findings have been reported in studies of pair-wise face matching using poorer-quality stimuli such as CCTV images (Bruce, Henderson, Newman, & Burton, 2001; Henderson, Bruce, & Burton, 2001), CCTV footage (Burton, Wilson, Cowan, & Bruce, 1999; Keval & Sasse, 2008) and even live recognition (Davis & Valentine, 2009; Kemp, Towell, & Pike, 1997). Henderson et al. (2001), for example, used CCTV (of comparable quality to the footage available in most high-street banks) and broadcast-quality footage of a mock bank raid. They explored the recognition rates of unfamiliar participants who were asked to compare stills from the footage with high-quality targets in an eight-image line up or in a one-to-one matching task. The error rate was high regardless of number of distractors, and accuracy ranged from 29% with CCTV images to 64% with stills from broadcast-quality footage.

Taking this a step further, Burton et al. (1999) presented three separate groups of participants (students familiar with and students unfamiliar with the individuals in the images shown, and police officers) with short (2–3 s) CCTV video clips and then asked them to match these people to high-quality images. Results showed generally very poor performance by police officers and unfamiliar students, but near-ceiling performance by students who were familiar with the people shown. The findings highlight the importance of familiarity and raise many concerns about the use of such video footage by unfamiliar viewers. Indeed, there is now evidence that matching a live person to short CCTV footage, a situation simulating real-life juror decisions, is also associated with very high error rates (Davis & Valentine, 2009).

Overall, these studies raise concerns about the use of CCTV footage to judge identity. However, it is possible that such studies are, in fact, overestimating participants’ performance. While the CCTV footage in most published experiments captures only one person walking or performing some choreographed actions, most CCTV cameras are installed in busy locations such as train stations or airports with many different people passing by at any time. This could have important implications for recognition accuracy, especially for the number of potential misidentifications.

Within-person variability

Our daily experience of person recognition is very different for familiar and unfamiliar faces. Unfamiliar recognition typically relies on a single exposure, often a single image (e.g. matching a traveller to their passport), whereas familiar recognition (e.g. recognising a friend) benefits from the experience of a person’s appearance across a range of situations and circumstances. It has been argued that the accumulation of idiosyncratic within-person variability underlies the process of familiarisation and is responsible for our expertise in familiar face recognition (Burton, Jenkins, Hancock, & White, 2005; Burton, Jenkins, & Schweinberger, 2011; Jenkins, White, Van Montfort, & Burton, 2011; Young & Burton, 2017b). A number of studies have already demonstrated that providing participants with multiple images leads to better learning and discrimination. Bindemann and Sandford (2011), for example, showed participants either 1 or 3 identity cards and asked them to find their target in an array of 30 other images. They showed a surprisingly large range of performance (46–67%) depending on which ID card was used in the single-image condition. More importantly, being able to see all three ID cards at the same time led to significantly better identification (85%). Similar results have been reported in matching tasks using single or multiple images of the target individuals (Dowsett, Sandford, & Burton, 2016; White, Burton, Jenkins, & Kemp, 2014).

There are two mechanisms that could be responsible for the benefit of using multiple images in unfamiliar face recognition: exposure to many different images of the same person could help us construct a more complete and accurate representation of the target identity (as argued by Burton et al., 2005 and Jenkins et al., 2011) or allow us to select a closest-match image, which is then used to make the matching decision. In an attempt to distinguish between these two processes, Menon, White, and Kemp (2015) compared matching performance with a single image, multiple similar-looking images (low variability) or multiple varied images (high variability) of the same person. Recognition accuracy was significantly higher in the multiple-image conditions and, critically, there was a significant benefit for images with high rather than low variability. They also showed that no single image in the multiple condition was solely responsible for the increase in accuracy, suggesting that the observed benefit relied on the combination of images rather than on the single closest-match image.

Dynamic versus static presentation

Another key component of everyday identity recognition is movement. Comparing the experience of seeing someone’s face move and simply looking at their photograph triggers the intuition that we can extract a greater amount and range of identifying information in the former case. Despite this intuitive advantage for dynamic faces, the current literature is inconsistent and inconclusive, with some studies showing clear benefits for recognising dynamically learned faces (Butcher, Lander, Fang, & Costen, 2011; Lander & Bruce, 2003; Lander & Chuang, 2005; Schiff, Banka, & de Bordes Galdi, 1986) and some showing no improvement at all (Bruce et al., 2001; Darling, Valentine, & Memon, 2008; Knight & Johnston, 1997; Shepherd, Ellis, & Davies, 1982), while others report that using moving-face stimuli could even lead to a significant detriment in performance (Christie & Bruce, 1998; Lander, Humphreys, & Bruce, 2004).

The most stable and replicated benefit of movement involves familiar, rather than unfamiliar, face recognition. A number of studies have shown higher rates of recognition and confidence when presented with dynamic rather than static images of known identities, particularly in low-quality visual displays that would otherwise make recognition difficult (Bennetts, Butcher, Lander, Udale, & Bate, 2015; Butcher & Lander, 2017; Lander & Bruce, 2000; Lander & Chuang, 2005). Pike, Kemp, Towell, and Phillips (1997) report similar findings in a recognition task where identities were initially learned through dynamic videos, multiple stills or a single still, and the memory for these identities was then tested in an old/new procedure. Results indicated better performance for dynamically learned faces compared to both multiple and single stills. Similar findings have been reported by Lander and Davies (2007); however, they only find a motion advantage when both the learning and test stimuli are moving. There is also some evidence that using a video of a moving face as a prime produces faster recognition time than a still; however, this advantage of motion has not been seen to improve accuracy (Pilz, Thornton, & Bülthoff, 2006; Thornton & Kourtzi, 2002).

In contrast to work on familiar faces, a large number of studies on unfamiliar face recognition fail to find an advantage of dynamically presented faces using a variety of tasks, including matching (Bruce et al., 1999), recognition memory (Christie & Bruce, 1998), familiarity decision (Knight & Johnston, 1997) and forensically relevant recall measures based on eyewitness testimony (Havard, Memon, Clifford, & Gabbert, 2010; Shepherd et al., 1982). Christie and Bruce (1998) further explored different types of movement (rigid, head nods and shakes versus non-rigid, speaking and emotional expressions) as well as test stimulus modality (still versus dynamic sequences). They report no advantage of motion regardless of movement type and of whether memory was tested through a still or a video. In fact, they found a benefit of learning faces from a still image compared to a subtle rigid movement when still images were also used at test. Such a detriment in recognition performance was also reported by Lander et al. (2004) who compared the accuracy of a patient with prosopagnosia (patient HJA) and two groups of controls (age-matched and undergraduate students) in a no-delay recognition task. While HJA showed a consistent improvement in accuracy when faces displayed a rigid or non-rigid movement, both control groups performed significantly better with still rather than moving faces.

Overview of experiments

In the following series of studies, we examine visual search for a target in real CCTV taken from a large city transport hub. Viewers are asked to find target individuals in these complex changeable scenes, and their search is based on photos gathered from a range of sources including passports, driving licences, custody images and social media. In each of the experiments, viewers have unlimited time to make their decisions (target present or absent), and can pause, rewind or slow the CCTV, just as in operational contexts. We aimed to establish general principles for estimating and improving the efficiency of search in these contexts. To do so, we manipulated the information presented alongside CCTV clips. Across the experiments this comprised a single photo, multiple photos or videos of the target person. Multiple photos and video seem to provide the searcher with more information about the target, but does this extra information help, and if so how? If multiple photos allow a searcher to extract key information about the idiosyncratic variability of that person’s face, does a video support even greater generalisation? Finally, we ask whether providing biographical information about the target person supports more efficient search, perhaps via motivational or depth of processing effects.

The experiments make use of a comprehensive multimedia database, which includes 17 h of CCTV footage from a busy rail station in two formats: standard definition (SD, 720 × 576 pixels, interlaced, 25 frames/second (fps)) and high definition (HD, 1920 × 1080 pixels, non-interlaced, 5–10 fps). Both of these formats are in routine use; for example, both are admissible as evidence in UK courts. Volunteers travelled through the rail station, and had their images captured as part of the routine CCTV surveillance. They also donated images in a number of forms, including personal ID (e.g. passports, driving licences, membership cards); social media images; high-quality custody images (compliant with both UK and Interpol arrest standards) and high-quality (1080p) video recordings of the volunteer moving their head from side to side, up and down and reading from a prepared script. The number and type of images available for each target individual varied considerably, a constraint that contributes to the design of specific experiments described subsequently.

In each of our studies, participants were presented with images of these target identities together with short, 2-min CCTV clips. Their task was simply to identify those targets in the CCTV videos. In study 1, participants were shown either one or three different images of the target person, alongside the CCTV. Based on findings from face learning and matching studies (Dowsett et al., 2016; Jenkins et al., 2011), we expected a boost in performance with exposure to additional target images. In study 2 we extended the number of images available to 16 for each search target, allowing viewers access to a large range of variability for each target. In study 3 we directly compared performance across the two levels of CCTV format (resolution) available. We also provided viewers with the option to use moving images of the search target alongside the CCTV. Finally, in study 4 we presented participants with additional semantic information by embedding target images in wanted or missing person posters.

Study 1: search with one or three images of the target person

Overview

Our first study explored the role of within-person variability in CCTV identification. In each trial, participants were presented with either one or three images of a target person and searched for that person in a 2-min CCTV clip. Previous research on matching static images suggests that performance is improved when viewers are able to base their judgements on multiple images of the same person (e.g. Bindemann & Sandford, 2011; Dowsett et al., 2016). However, performance on visual search tasks is known to be severely impaired when viewers have multiple targets (e.g. Menneer, Cave, & Donnelly, 2009; Stroud, Menneer, Cave, & Donnelly, 2012). In the CCTV search task, viewers may attempt to integrate multiple photos of the target, leading to improved performance, or they may try to match each of the individual target photos, perhaps leading to reduced performance. In fact, results showed high error rates, both when targets were present and absent. More importantly, being exposed to multiple images of the same person brought about a significant improvement in accuracy.

Method

Participants

A total of 50 participants (7 men, mean age = 21.2, range = 18–43 years) completed the face search task. All were students who received either course credit or payment. All participants had normal or corrected-to-normal vision and provided informed consent prior to participation. A sensitivity power analysis in G*Power (Erdfelder, Faul, & Buchner, 1996) indicated that with the present sample, alpha of .05 and 80% power, the minimum detectable effect is 0.17 (η_p² = 0.027). The experiment was approved by the ethics committee of the Psychology Department at the University of York.
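As an illustration of how the two effect-size figures quoted above relate, the following minimal sketch converts between Cohen's f and partial eta squared. It assumes that the value 0.17 is Cohen's f, the effect-size metric conventionally used for ANOVA power analyses; the source does not state this explicitly, and the function names are hypothetical.

```python
import math


def f_to_partial_eta_squared(f: float) -> float:
    """Convert Cohen's f to partial eta squared: eta_p^2 = f^2 / (1 + f^2)."""
    return f ** 2 / (1 + f ** 2)


def partial_eta_squared_to_f(eta_p2: float) -> float:
    """Convert partial eta squared to Cohen's f: f = sqrt(eta_p^2 / (1 - eta_p^2))."""
    return math.sqrt(eta_p2 / (1 - eta_p2))


# ASSUMPTION: the reported "0.17" is Cohen's f; the text does not say so explicitly.
f_from_reported_eta = partial_eta_squared_to_f(0.027)
eta_from_rounded_f = f_to_partial_eta_squared(0.17)

print(f"f implied by eta_p^2 = 0.027: {f_from_reported_eta:.3f}")
print(f"eta_p^2 implied by f = 0.17: {eta_from_rounded_f:.4f}")
```

Under that assumption, η_p² = 0.027 corresponds to an f of roughly 0.17, in line with the reported figures. Converting the rounded value f = 0.17 back gives roughly 0.028, so the small discrepancy is plausibly a matter of rounding.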

Design

The study used a 2 (number of search images, 1 vs 3) × 2 (trial type, present vs absent) within-subjects design. Participants completed 14 trials, each with a different target identity. Half the trials had one search image and half had three. For each participant, the target was present on half the trials. Stimuli were counterbalanced across the experiment, such that each target person appeared equally often in present and absent trials. Trial order presentation was randomised individually for each participant.

Materials

We used images and CCTV footage capturing 14 target identities (8 male) encompassing a range of ages (20–49 years) and ethnicities. All search images were taken from official identity documents (passport, driving licence or national identity card) and membership cards (e.g. library or travel cards). Some target images were presented in colour and others in greyscale, as per the original document from which they had been taken. Many of the images included watermarks. We collected three images per identity for multiple-image trials and used one of them (either a passport or driving licence photograph) in single-image trials.

CCTV footage was taken at a busy city rail station. Each 2-min clip was presented in greyscale, original HD quality (1920 × 1080 pixels, no interlacing, and a frame rate of 5–10 fps). Figure 1 shows a mock-up of a trial.

Procedure

Participants completed the face search task while seated at a computer screen. Each trial showed a target face (one or three images) and a CCTV clip (see Fig. 1). Their task was to find the target person in the CCTV video. Participants were informed that the person they were looking for would be present in some and absent in other trials, but they were not aware of the prevalence (which was 50%). Participants had control of the CCTV video, and could choose to pause, rewind or jump forward as they wished. There was no time limit, and participants terminated a trial by completing a response sheet, recording “not present” or a frame number in which the target appeared. For “present” responses, participants also used a mouse click to indicate the person chosen.

Each participant completed two practice trials in order to familiarise themselves with the procedure. They then completed 14 experimental trials, in an independently randomised order. Screen recordings were taken to establish accuracy (e.g. identification of the correct person in a “present” trial) and to allow subsequent analysis of participants’ strategies.

Results and discussion

Recognition accuracy

Mean identification accuracy across conditions is presented in Fig. 2. Within-subjects analysis of variance (ANOVA) (2 (image number, 1 vs 3) × 2 (trial type, present vs absent)) revealed significant main effects of image number (F (1, 49) = 4.40, p < .05, η_p² = 0.08) and trial type (F (1, 49) = 13.03, p < .001, η_p² = 0.21). There was no significant interaction (F < 1). Further analysis is presented in Additional file 1, which gives an analysis of response time data (Additional file 1: Figure S1), a detailed breakdown of error-types (Additional file 1: Figure S5) and a by-item analysis, suggesting that these effects are not driven by specific targets (Additional file 1: Figure S9).
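The reported effect sizes can be related to the F statistics through the standard identity η_p² = (F × df_effect) / (F × df_effect + df_error). The sketch below is only an arithmetic check on the values reported above, not a reanalysis of the data; the function name is hypothetical.

```python
def partial_eta_squared(f_stat: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_stat * df_effect) / (f_stat * df_effect + df_error)


# F values and degrees of freedom as reported for study 1.
reported = {
    "image number": (4.40, 1, 49),
    "trial type": (13.03, 1, 49),
}

for effect, (f_stat, df_effect, df_error) in reported.items():
    eta_p2 = partial_eta_squared(f_stat, df_effect, df_error)
    print(f"{effect}: eta_p^2 = {eta_p2:.2f}")
```

Both values round to the effect sizes given in the text (0.08 and 0.21).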

Our results show that searching for a target in CCTV footage is a highly error-prone task. Note that for target-absent trials, there is very poor accuracy, with participants’ performance at 57% when using a single search photo (see Fig. 2). Nevertheless, it is interesting to observe that presenting multiple photos of the search target improves performance in both target-present and target-absent trials. The extra information available allows viewers to make more accurate identifications, and more accurate rejections when searching for people in complex moving scenes. This is consistent with earlier work on face matching from static photos (Bindemann & Sandford, 2011; Dowsett et al., 2016), but it is particularly interesting to observe in this difficult visual search task. The result contrasts starkly with evidence showing that visual search for multiple objects is much more difficult than search for an individual target (Menneer et al., 2009; Stroud et al., 2012). This large cost also occurs when trying to match multiple faces rather than individuals (Megreya & Burton, 2006). However, in the present experiment, participants are not searching for multiple targets, but for one target represented by multiple photos. They appear to be able to exploit this redundancy to improve performance, in a way that is consistent with extraction of within-person variability, known to help in face familiarisation (Andrews, Jenkins, Cursiter, & Burton, 2015; Jenkins et al., 2011).

Search strategies

As well as overall accuracy, we were able to observe some aspects of participant behaviour from screen recordings of each trial. Because 11.7% of trials were not recorded owing to technical failure, the following summary statistics are based on 618 recordings (317 target-present and 301 target-absent trials). We observed five different strategies: (1) watching the whole video once before making a target-absent decision (18.1% of all trials, target present and target absent); (2) watching the whole video more than once before making a target-absent decision (20.1% of all trials); (3) watching the whole video first, then going back to suspected targets and making an identification (30.1% of all trials); (4) making an identification during the video but continuing to watch until the end (11.2% of all trials) and (5) making an identification during the video and then terminating the trial without watching the remainder of the clip (20.5% of all trials). No participants made a target-absent decision without watching the CCTV video through at least once.
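The trial counts in this summary follow from the design (50 participants, each completing 14 experimental trials). The short sketch below is a bookkeeping check only, with variable names chosen for illustration: it recomputes the proportion of unrecorded trials and confirms that the five strategy categories account for all recorded trials.

```python
# Figures taken from the Method and Search strategies sections.
participants = 50
trials_per_participant = 14
recorded_present = 317
recorded_absent = 301

total_trials = participants * trials_per_participant
recorded_trials = recorded_present + recorded_absent
missing_pct = 100 * (total_trials - recorded_trials) / total_trials

# Share of recorded trials (in %) for each of the five observed strategies.
strategy_pct = {
    "watched once, then target absent": 18.1,
    "watched more than once, then target absent": 20.1,
    "watched whole clip, then returned to suspects": 30.1,
    "identified during clip, watched to end": 11.2,
    "identified during clip, stopped early": 20.5,
}
strategy_total = round(sum(strategy_pct.values()), 1)

print(f"Recorded trials: {recorded_trials} of {total_trials}")
print(f"Unrecorded trials: {missing_pct:.1f}%")
print(f"Strategy shares sum to: {strategy_total}%")
```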

In trials where participants made a target-absent decision, watching the CCTV footage more than once led to better performance (80% accuracy) than watching the video only once (70.5% accuracy). In trials where participants made a target-present decision, highest performance was achieved when participants identified a target during the clip and did not continue to watch the whole video (66.9% accuracy), possibly reflecting participants’ confidence in their identification. This was closely followed by making an identification during the clip but watching the whole video until the end (65.2% accuracy). Worst performance was associated with watching the whole video first and then going back to suspected targets (51.6% accuracy). The number of unique misidentifications varied greatly across the target identities.