When security personnel watch surveillance videos or monitor crowds on the street, they must divide their attention among multiple items such as pedestrians, vehicles, or cyclists. In this sort of task, they are not simply tracking the positions of a set of items; they are looking for classes of events: for example, a suspicious action like a person leaving a bag behind. Little is known about how people perform in this kind of sustained-monitoring task, in which they must detect an event, whenever it occurs, while monitoring a dynamic scene.
Clearly, unless the event itself summons attention, how well observers can detect an event in a dynamic scene depends strongly on how many items they are able to monitor. The ability to divide attention between multiple moving objects has been studied extensively using the multiple object tracking (MOT) task (Pylyshyn & Storm, 1988), in which observers track a set of identical targets moving among identical distractors. Observers are typically asked to track the relevant subset of targets for several seconds. At the end of that time, they might be asked to indicate the positions of the tracked objects or to declare whether a marked item was or was not part of the tracked set. Studies have shown that people can accurately track about four items (Cavanagh & Alvarez, 2005; Pylyshyn & Storm, 1988), with variation between observers (Oksama & Hyönä, 2004) and with the limit changing somewhat with stimulus parameters (Bettencourt & Somers, 2009).
Performance in these experiments, however, mainly reveals a limit on selective attention to otherwise identical items. In the type of event-monitoring task described here, each item in the display could be unique. Therefore, the questions are different: Did a unique item change? Did two different items interact? There is a limited body of research on tracking unique items. Early studies showed that the featural properties of tracked targets are not encoded during MOT (Pylyshyn, 2004; Scholl, Pylyshyn, & Franconeri, 1999). Oksama and Hyönä (2004) asked observers to track visually distinct line-drawing targets (multiple identity tracking, MIT). At the end of each trial, one of the tracked targets was probed and observers were asked to identify it. They found that the targets' identities could be accessed during position tracking. That is, observers did know, at least to some extent, which target moved where. Similar results have been reported for tracking different faces (Ren, Chen, Liu, & Fu, 2009), identities (Horowitz et al., 2007), and color features (Makovski & Jiang, 2009a, 2009b). During identity tracking, the capacity for localizing individuated targets is around two (Botterill, Allen, & McGeorge, 2011; Horowitz et al., 2007), much smaller than the capacity in position tracking. However, it remains unclear whether the reduced capacity in MIT arises because identity tracking must compete with position tracking for a common attentional resource (Cohen, Pinto, Howe, & Horowitz, 2011), or whether identity and location tracking are governed by two different systems, each with its own limit (Botterill et al., 2011; Oksama & Hyönä, 2016).
Thus, there are clear limits on the capacity to track objects, whether or not they are unique. What about detecting a change in an object or an interaction between objects? Even in a static scene, the evidence suggests that monitoring multiple items for events is powerfully limited. Wolfe, Reinecke, and Brawn (2006) asked observers to indicate whether any dot in a monitored set changed its color from red to green or vice versa. The task was trivial if the color switch was the only visual transient in an otherwise static display. However, if a luminance change occurred simultaneously with the color change, observers were close to chance in deciding whether the luminance change was or was not accompanied by a color change. This result does not bode well for the ability to monitor a dynamic scene for the occurrence of an event.
Wolfe et al. (2006) estimated the capacity to monitor a static set of dots to be between zero and four items, covering the same range as found in MOT and MIT and in measures of visual working memory (VWM) capacity (Irwin, 1992; Luck & Vogel, 1997; Wolfe et al., 2006). Indeed, the VWM limitation could be a common limit on all sustained-monitoring tasks. Under many circumstances, detection of change is severely capacity limited (Simons & Rensink, 2005). In the classic version of change blindness, large changes in a scene can be missed if an event, like a blank screen between the original and changed scenes, masks the transients produced by the change (Rensink, O'Regan, & Clark, 1997). Under those circumstances, the location of the change is unknown. In the experiments discussed here, observers look for changes in a small, designated subset of the simple stimuli on the screen.
There are only a few studies of change detection during MOT. Bahrami (2003) asked observers to track a set of targets among distractors while reporting whether there was any color or shape change among them. Observers were able to track the targets and detect the critical change if it occurred in plain view, in the absence of a mud splash to mask the change transient. However, detection was impaired when the change transient was masked by mud splashes, even if the change occurred on a tracked target. Others have reported that the features of objects are often not encoded during MOT (Pylyshyn, Haladjian, King, & Reilly, 2008; Scholl & Pylyshyn, 1999). It has been suggested that two different systems might be at work during tracking: one encoding the positions of the tracked objects, the other encoding features and object identity (Horowitz et al., 2007; Oksama & Hyönä, 2016). These systems might still compete for the same attentional resource (Cohen et al., 2011). Thus, if the ability to detect an event among tracked objects shares resources with tracking, performance in event detection might be better when the tracking load is low.
On the other hand, other phenomena suggest that event monitoring could have a much higher capacity than tracking. Suppose that event detection is similar to a recognition memory task, in which observers must distinguish items that have been seen before from novel ones. Observers can memorize thousands of specific images and distinguish old from new with good accuracy (Brady, Konkle, & Alvarez, 2011; Brady, Konkle, Alvarez, & Oliva, 2008; Shepard, 1967; Standing, 1973; Standing, Conezio, & Haber, 1970). In a visual search setting, Cunningham and Wolfe (2014) asked observers to identify the new object in a visual display. The new item on one trial became an old item for all subsequent trials. Observers could monitor search displays for the new item even while holding a set of hundreds of old items in memory. Thus, event detection in a sustained-monitoring task might not be limited in the same way that tracking of identical circles is.
The goal of the current study is to measure the capacity for detecting events in a sustained-monitoring task. That is, how many items can be monitored at the same time such that an event is successfully detected when it happens to one of those items? If observers are monitoring a set of otherwise identical objects, waiting for an event to occur, it seems likely that the task will be limited by MOT capacity. However, if the items, like individuals in a crowd, are unique, it might be possible in principle to scan through a large number of memorized, unique items, looking for the new event.
To investigate these questions, we used two types of events. In one case, the event was an isolated change occurring to a single item (e.g., the letter T becomes the letter L, as in Experiment 1). In the second case, two items interacted with each other, analogous to two people swapping bags (as in Experiment 4). To anticipate our results, in all of the variants reported here, observers showed a very limited capacity to monitor for events (capacity K = 2–3 items).
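For concreteness, capacity estimates of this kind are conventionally derived from detection accuracy using a Cowan-style formula; the expression below is a sketch of that standard estimator, offered as an assumption for readers unfamiliar with the measure rather than as the specific computation used in these experiments:

K = N × (H − F),

where N is the number of monitored items, H is the hit rate, and F is the false-alarm rate. For example, with N = 6 monitored items, a hit rate of 0.60, and a false-alarm rate of 0.10, K = 6 × (0.60 − 0.10) = 3 items, a value in the range reported here.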