When security personnel watch surveillance videos or monitor crowds on the street, they must divide their attention among multiple items such as pedestrians, vehicles, or cyclists. In this sort of task, they are not simply tracking the positions of a set of items; they are looking for classes of events: for example, a suspicious action like a person leaving a bag behind. Little is known about how people perform in this kind of sustained-monitoring task, in which they must detect an event, whenever it occurs, while monitoring a dynamic scene.
Clearly, unless the event itself summons attention, how well observers can detect an event in a dynamic scene depends strongly on how many items they are able to monitor. The ability to divide attention between multiple moving objects has been studied extensively using the multiple object tracking (MOT) task (Pylyshyn & Storm, 1988), in which observers track a set of identical targets moving among identical distractors. Observers are typically asked to track the relevant subset of targets for several seconds. At the end of that time, they might be asked to indicate the positions of the tracked objects or to declare whether a marked item was or was not part of the tracked set. Studies have shown that people can accurately track about four items (Cavanagh & Alvarez, 2005; Pylyshyn & Storm, 1988), with variation between observers (Oksama & Hyönä, 2004) and with the limit changing somewhat with stimulus parameters (Bettencourt & Somers, 2009).
Performance in these experiments, however, mainly reveals a limit on selective attention to otherwise identical items. In the type of event-monitoring task described here, each item in the display could be unique. Therefore, the questions are different: Did a unique item change? Did two different items interact? There is a limited body of research on tracking unique items. Early studies showed that the featural properties of tracked targets are not encoded during MOT (Pylyshyn, 2004; Scholl, Pylyshyn, & Franconeri, 1999). Oksama and Hyönä (2004) asked observers to track visually distinct line-drawing targets (multiple identity tracking, MIT). At the end of each trial, one of the tracked targets was probed and observers were asked to identify it. They found that the targets' identities could be accessed during position tracking. That is, observers did know, at least to some extent, which target moved where. Similar results have been reported for tracking different faces (Ren, Chen, Liu, & Fu, 2009), identities (Horowitz et al., 2007), and color features (Makovski & Jiang, 2009a, 2009b). During identity tracking, the capacity for localizing individuated targets is around two (Botterill, Allen, & McGeorge, 2011; Horowitz et al., 2007), much smaller than the capacity in position tracking. However, it remains unclear whether the reduced capacity in MIT arises because identity tracking must compete with position tracking for a common attentional resource (Cohen, Pinto, Howe, & Horowitz, 2011), or whether identity and location tracking are governed by two different systems, each with its own limit (Botterill et al., 2011; Oksama & Hyönä, 2016).
Thus, there are clear limits on the capacity to track objects, whether or not they are unique. What about detecting a change in an object or an interaction between objects? Even in a static scene, the evidence suggests that monitoring multiple items for events is powerfully limited. Wolfe, Reinecke, and Brawn (2006) asked observers to indicate whether any dot in a monitored set changed its color from red to green or vice versa. The task was trivial if the color switch was the only visual transient in an otherwise static display. However, if a luminance change occurred simultaneously with the color change, observers were close to chance in deciding whether the luminance change was or was not accompanied by a color change. This result does not bode well for the ability to monitor a dynamic scene for the occurrence of an event.
Wolfe et al. (2006) estimated the capacity to monitor a static set of dots to be between zero and four items, covering the same range as found in MOT and MIT and in measures of visual working memory (VWM) capacity (Irwin, 1992; Luck & Vogel, 1997; Wolfe et al., 2006). Indeed, the VWM limitation could be a common limit on all sustained-monitoring tasks. Under many circumstances, detection of change is severely capacity limited (Simons & Rensink, 2005). In the classic version of change blindness, large changes in a scene can be missed if an event, like a blank screen between the original and changed scenes, masks the transients produced by the change (Rensink, O'Regan, & Clark, 1997). Under those circumstances, the location of the change is unknown. In the experiments discussed here, observers look for changes in a small, designated subset of the simple stimuli on the screen.
There are only a few studies of change detection during MOT. Bahrami (2003) asked observers to track a set of targets among distractors while reporting whether there was any color or shape change among them. Observers were able to track the targets and detect the critical change if it occurred in plain view, in the absence of a mud splash to mask the change transient. However, detection was impaired when the change transient was masked by mud splashes, even if the change occurred on a tracked target. Others have reported that the features of objects are often not encoded during MOT (Pylyshyn, Haladjian, King, & Reilly, 2008; Scholl & Pylyshyn, 1999). It has been suggested that two different systems might be at work during tracking: one encoding the positions of the tracked objects, the other encoding features and object identity (Horowitz et al., 2007; Oksama & Hyönä, 2016). These systems might still compete for the same attentional resource (Cohen et al., 2011). Thus, if the ability to detect an event among tracked objects shares resources with tracking, performance in event detection might be better when the tracking load is low.
On the other hand, other phenomena suggest that event monitoring could have a much higher capacity than tracking. Suppose that event detection is similar to a recognition memory task, in which observers must distinguish items that have been seen before from novel ones. Observers can memorize thousands of specific images and distinguish old from new with good accuracy (Brady, Konkle, & Alvarez, 2011; Brady, Konkle, Alvarez, & Oliva, 2008; Shepard, 1967; Standing, 1973; Standing, Conezio, & Haber, 1970). In a visual search setting, Cunningham and Wolfe (2014) asked observers to identify the new object in a visual display. The new item on one trial became an old item for all subsequent trials. Observers could monitor search displays for the new item even while holding a set of hundreds of old items in memory. Thus, event detection in a sustained-monitoring task might not be limited in the same way that tracking of identical circles is.
The goal of the current study is to measure the capacity for detecting events in a sustained-monitoring task. That is, how many items can be monitored at the same time such that an event is successfully detected when it happens to one of those items? If observers are monitoring a set of otherwise identical objects, waiting for an event to occur, it seems likely that the task will be limited by MOT capacity. However, if the items, like individuals in a crowd, are unique, it might be possible in principle to scan through a large number of memorized, unique items, looking for the new event.
To investigate these questions, we used two types of events. In one case, the event was an isolated change occurring to a single item (e.g., the letter T becomes the letter L, as in Experiment 1). In the second case, two items interacted with each other, analogous to two people swapping bags (as in Experiment 4). To anticipate our results, in all of the variants reported here, observers showed a very limited capacity to monitor for events (capacity K = 2–3 items).
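For concreteness, capacity estimates of this kind are conventionally derived from detection accuracy using a Cowan-style formula; the expression below is a sketch of that standard estimator, offered as an assumption for readers unfamiliar with the measure rather than as the specific computation used in these experiments:

K = N × (H − F),

where N is the number of monitored items, H is the hit rate, and F is the false-alarm rate. For example, with N = 6 monitored items, a hit rate of 0.60, and a false-alarm rate of 0.10, K = 6 × (0.60 − 0.10) = 3 items, a value in the range reported here.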