What is the role of the film viewer? The effects of narrative comprehension and viewing task on gaze control in film

Hutson, John P.; Smith, Tim J.; Magliano, Joseph P.; Loschky, Lester C.

doi:10.1186/s41235-017-0080-5

Original article
Open access
Published: 22 November 2017


Cognitive Research: Principles and Implications volume 2, Article number: 46 (2017)


Abstract

Film is ubiquitous, but the processes that guide viewers’ attention while viewing film narratives are poorly understood. In fact, many film theorists and practitioners disagree on whether the film stimulus (bottom-up) or the viewer (top-down) is more important in determining how we watch movies. Reading research has shown a strong connection between eye movements and comprehension, and scene perception studies have shown strong effects of viewing tasks on eye movements, but such idiosyncratic top-down control of gaze in film would be anathema to the universal control mainstream filmmakers typically aim for. Thus, in two experiments we tested whether the eye movements and comprehension relationship similarly held in a classic film example, the famous opening scene of Orson Welles’ Touch of Evil (Welles & Zugsmith, Touch of Evil, 1958). Comprehension differences were compared with more volitionally controlled task-based effects on eye movements. To investigate the effects of comprehension on eye movements during film viewing, we manipulated viewers’ comprehension by starting participants at different points in a film, and then tracked their eyes. Overall, the manipulation created large differences in comprehension, but only produced modest differences in eye movements. To amplify top-down effects on eye movements, a task manipulation was designed to prioritize peripheral scene features: a map task. This task manipulation created large differences in eye movements when compared to participants freely viewing the clip for comprehension. Thus, to allow for strong, volitional top-down control of eye movements in film, task manipulations need to make features that are important to narrative comprehension irrelevant to the viewing task. 
The evidence provided by this experimental case study suggests that filmmakers’ belief in their ability to create systematic gaze behavior across viewers is confirmed, but that this does not indicate universally similar comprehension of the film narrative.

Significance

Film, television, and video are ubiquitous, and viewers of these media generally have similar narrative experiences despite the complexity of the audiovisual stimuli and large individual differences across viewers. One potential reason for this is the filmmaking techniques for creating highly systematic viewing experiences that filmmakers have intuitively developed and believe to be highly effective. However, these intuitions have rarely been empirically validated. Does film work the way filmmakers think it does? Highly produced mainstream films have been empirically shown to guide viewers to look at the same places at the same time, and the association between gaze location and bottom-up visual salience has been reliably computationally modeled. But the contribution of online top-down cognitive factors, such as comprehension and viewing task, which are known to have large effects on eye movements during reading and static scene viewing, is poorly understood for films. This is of critical importance, because although where a person looks and their understanding are highly correlated, if a film viewer has little control over where they look, this relationship may be weaker. Our study shows that when viewers watch mainstream movies their visual attention is only modestly affected by differences in narrative comprehension. However, conscious control of one’s attention by a task at odds with comprehending the narrative more strongly guides attention. These results further our understanding of how filmmakers’ and viewers’ goals both shape viewer experience.

Background

“…and all of us go into a kind of lock step where, if we were watching a tennis match, you’d see that perfect synchronicity of heads going left-right, left-right. The same thing in a movie theatre, when the movie is working and the audience is galvanised, almost hypnotised, all watching the same things, all knowing where to look at the exact same time…it’s a wonderful thing. There is nothing greater than that.” (Spielberg, 2013).

“If a million people see my movie, I hope they see a million different movies.” (Tarantino, 1995)

Watching movies and videos is a ubiquitous activity around the world. Such highly produced videos are very complex stimuli. Yet, they are produced by professional filmmakers for broad entertainment audiences using popular techniques believed to ease viewing and comprehension processes (Smith, 2012), and people seem to comprehend them with little difficulty. Nevertheless, to date little research has been done to explain why this may be the case (Smith, Levin, & Cutting, 2012). One stage of the film comprehension process that might be critical to this apparent effortlessness is how the eyes move across the screen. As demonstrated by the Steven Spielberg quote above, filmmakers believe they have the power to make their audience look exactly where they want them to irrespective of who the audience is. This belief in the primacy of the audiovisual stimulus for guiding attention and, presumably, subsequent comprehension reflects the tone of practical filmmaking guides (Katz, 1991), accounts of the “rules” used in film construction (Bordwell & Thompson, 2001), and filmmakers’ reflections on their craft (Murch, 2001; Reisz & Millar, 1953). The strong assumption that viewers passively receive a film’s meaning, and that individual differences such as age, gender, race, and sexuality do not impact this reception, was also shared by classic film theories including the Structuralist, Auteurist, Formalist, Marxist, and Psychoanalytic theories (see Stam and Miller (1999) for review).

However, these theories and the view of mainstream filmmakers like Spielberg are out of line with currently dominant film theories. Since the 1960s, advances in film theory have mirrored movements in other areas of the arts to increase the prominence of the individual viewer in theorizing about the act of meaning making (i.e., comprehension). Theoretical movements including Cultural Studies (and its specialist foci such as Feminist and Queer Film Theory), Reception Studies, and Cognitive Film Studies (Bordwell & Carroll, 1996) have acknowledged that films do not have a single meaning but can be read differently by different people, depending on their desires, ideology, and social differences (Hall, 1980). According to these newer film theories, the film experience is similar to a discourse in which, while the viewer is usually passively seated, they are cognitively active in selecting, encoding, and constructing their part of the filmic discourse (Tseng & Bateman, 2013). This view is sometimes also shared by filmmakers who push the boundaries of film, such as Quentin Tarantino (see his earlier quote made when discussing his multi-threaded narrative film Pulp Fiction, 1994).

The mental activity of viewers is typically inaccessible to film theorists (except through introspection; Brown, 2015), but a key piece of physical evidence that does exist is how viewers move their eyes on the screen. Furthermore, we can assess their film understanding through their verbal responses to questions. To what extent do film viewers’ eye movements and measured film comprehension support either Spielberg’s belief that he has absolute control over viewers’ gaze and comprehension of a film, or contemporary film theories’ and Tarantino’s assertion that each viewer sees and understands a different film? Empirical disciplines, such as cognitive science, can play a role in exploring the insights of filmmakers and in helping to resolve this debate (Bortolussi & Dixon, 2003; Sanford & Emmott, 2012). In doing so, we can develop and refine theories of how we make sense of media and the role of eye movements in that process.

Eye movements in scenes

While watching a film, viewers typically move their eyes two to five times per second in order to extract information from it, and those eye movements are likely related to viewers’ understanding of the film they are watching (Eisenstein, 1948; Jesionowski, 1982; Murch, 2001; Smith, 2012). Causally, this relationship can go in two directions: attention guiding comprehension, or what is being comprehended guiding attention. There is a long literature on “bottom-up” features that guide attention in scenes (e.g., color, edges, and motion; Itti, 2005; Mital, Smith, Hill, & Henderson, 2010). In film, these bottom-up features have been shown to have such a strong effect on attention that they lead people to look at the same places at the same times, which has been termed “attentional synchrony” in film (Dorr, Martinetz, Gegenfurtner, & Barth, 2010; Smith, 2013; Smith et al., 2012). Similar synchrony in brain activity has been shown using functional magnetic resonance imaging (fMRI; Hasson et al., 2008; Shepherd, Steckenfinger, Hasson, & Ghazanfar, 2010) and electroencephalography (EEG; Dmochowski, Sajda, Dias, & Parra, 2012). Such bottom-up influences contrast with “top-down” factors such as the viewer’s task, individual differences, preferences, and the viewer’s active mental model of the scene. The current study asks what role top-down comprehension processes play during film viewing. Although hardly any previous research has addressed this question directly (but see Lahnakoski et al., 2014; Loschky, Larson, Magliano, & Smith, 2015), two well-developed lines of research are highly relevant: research on eye movements and reading comprehension and research on eye movements in static and dynamic scenes.

Top-down and bottom-up effects on eye movements

At a broad level, comprehension processes for narrative content have been studied in the realm of text (see McNamara & Magliano, 2009, for review), and readers’ eye movements have been shown to differ based on their comprehension (reviewed by Rayner, 1998). Importantly, such comprehension effects on eye movements occur at both the local and global levels (Rayner & Morris, 1990; Rayner, Raney, & Pollatsek, 1995). Examples of this relationship at the local level are eye movements associated with the processing of anaphoric references (e.g., identifying the character that is referred to with the pronoun “he”), which typically involve sentences that occur close together in a text (Ehrlich & Rayner, 1983), and generating elaborative inferences that are closely associated with the semantic content of specific sentences (O’Brien, Shank, Myers, & Rayner, 1988). Examples at the global level include the fact that, as the overall difficulty of a text increases, readers tend to make more eye movements (Rayner, Chace, Slattery, & Ashby, 2006), that information presented ironically produces more regressive eye movements (Kaakinen, Olkoniemi, Kinnari, & Hyönä, 2014), and that reading times get faster as one progresses through a text (in part due to repetitions of concepts; Rayner et al., 1995). Findings such as these are the basis of the “eye-mind hypothesis” (Just & Carpenter, 1980; Reichle, Pollatsek, Fisher, & Rayner, 1998; Reilly & Radach, 2006) that eye movements are driven by online cognitive processes (e.g., fixation is maintained longer for words that need more processing).

At one level, it is reasonable to assume that comprehension processes are similar for text and film (e.g., Magliano, Loschky, Clinton, & Larson, 2013), and as such one would expect a connection between each movie viewer’s comprehension and their eye movements. More specifically, when movie viewers have different information incorporated into their event models for a narrative (Zwaan & Radvansky, 1998), it seems reasonable that they would attend to different aspects of the film stimulus in order to update their event model, as has been shown for reading text (Anderson & Pichert, 1978).

Similar top-down effects on attention are found during scene viewing. When viewing static scenes, eye movements can be affected by volitional and mandatory top-down processes (Baluch & Itti, 2011). Volitional top-down processes are things like the goal or task of the viewer (Henderson, 2007; reviewed in Henderson & Hollingworth, 1998). Mandatory top-down processes are learned biases that guide attention without any intention to do so (Baluch & Itti, 2011). In the lab, mandatory top-down effects are often implicitly trained during complex search tasks (Baluch & Itti, 2010; Chun & Jiang, 1998). Similar processes could also be argued to occur naturally in scene searches in which context and cognitive relevance have been shown to guide visual search (Eckstein, Drescher, & Shimozaki, 2006; Henderson, Malcolm, & Schandl, 2009; Torralba, Oliva, Castelhano, & Henderson, 2006), and generally in the tendency to fixate faces in scenes (Birmingham, Bischof, & Kingstone, 2008) and the speaker in a scene (Coutrot & Guyader, 2014; Ho, Foulsham, & Kingstone, 2015; Vo, Smith, Mital, & Henderson, 2012). The same is true when watching video clips, in which the where and when of viewer attention on a screen is influenced by both volitional processes such as the goals of the viewer (Henderson et al., 2009) and more mandatory processes such as who is speaking (Coutrot & Guyader, 2014; Ho et al., 2015; Vo et al., 2012). Alternatively, bottom-up features of scenes are also known to have strong effects on visual attention (Itti, 2005; Mital et al., 2010), but may not affect the interpretation of the scene (Latif, Gehmacher, Castelhano, & Munhall, 2014). When watching films, the role of bottom-up features appears to be very strong, such that when viewing highly produced Hollywood film trailers, people tend to look at the same places at the same times, known as “attentional synchrony” (Dorr et al., 2010; Hasson et al., 2008; Itti, 2005; Mital et al., 2010). 
This is very different from static scenes and “natural videos” (i.e., those lacking a narrative or any filmmaking techniques) in which viewers show lower attentional synchrony (i.e., they may look at similar points of interest, but not at the same time) (Dorr et al., 2010; Mannan, Ruddock, & Wooding, 1997).

However, there may be differences in how information is extracted across media (Magliano et al., 2013; Magliano, Higgs, & Clinton, in press). For example, Loughlin, Grossnickle, Dinsmore, and Alexander (2015) showed that visual search is prominent in processing art, but that these processes are not central to making sense of text-based narratives. It may be that narrative film has properties that affect eye movements during comprehension in such a way that the nature of the eye–mind connection differs from how it is manifested in text comprehension.

Film narrative is unique

Differences between the linguistic and visual modalities of narrative representation need to be accounted for when researching comprehension in visual narratives (Magliano et al., 2013). For example, written text is composed of distinct words arranged in lines and paragraphs on a page, and readers typically fixate every content word (noun, verb, adjective, and adverb) in a line, progressing from left to right (in English). In contrast, films are composed of moving images within a frame, but there are no stated rules for how film viewers should watch them, though filmmakers follow numerous conventions in creating them (Smith, 2012). Also, film shots are typically viewed serially from beginning to end, unless a solitary film viewer uses a remote control with pause and rewind functions. This is in contrast with reading in which the reader controls their pace of reading and can vary the amount of time they allocate to processing a piece of information (i.e., fixation/dwell duration) and make regressive eye movements back to previously read words.

Similarly, the highly produced nature of film contains several features that exert strong bottom-up control and increase attentional synchrony (Dorr et al., 2010; Smith, 2013). Importantly, these features are used based on the practical film theory that they guide viewer attention (Eisenstein, 1948; Murch, 2001; Spielberg, 2013). The bottom-up features include motion (Mital et al., 2010), editing (Wang, Freeman, Merriam, Hasson, & Heeger, 2012), and lighting (Cutting, Brunick, DeLong, Iricinschi, & Candan, 2011; Murch, 2001). Additionally, filmmakers often compose highly produced dynamic scenes to include few points of interest, or construct them such that the bottom-up features guide attention to a single point of interest (Cutting, 2015). Compared to highly produced film, the visual features of both static text and static scenes have relatively weak bottom-up features. Potentially due to the weak bottom-up visual features, many studies have shown strong top-down effects on eye movements in text reading (Hyönä & Lorch, 2004; Rayner et al., 1995; Wiley & Rayner, 2000) and static scene viewing (DeAngelus & Pelz, 2009; Yarbus, 1967). All the above differences between films, reading, and other types of scene viewing suggest that a simple analogy between how viewers process each is likely to be wrong.

Comprehension and eye movements in film

The few studies that have tested top-down effects on eye movements in film have what may appear to be contradictory effects. Lahnakoski et al. (2014) found that giving viewers an explicit task to take a certain perspective (interior decorator or detective) can have a top-down effect on eye movements. Alternatively, to test the same research question as the current study, how comprehension processes affect eye movements, Loschky, Larson, Magliano, and Smith (2015) presented participants with a scene from the James Bond film Moonraker (Broccoli & Gilbert, 1979) and had them start viewing the clip earlier (Context condition) or later (No-context condition). They found that participants had large differences in comprehension due to their context condition, but there were relatively weak effects of comprehension on eye movements. The lack of a top-down effect on eye movements despite large comprehension differences was termed the “Tyranny of Film”. Put differently, the Tyranny of Film is the presence of gaze similarity between groups regardless of comprehension differences between viewers, where gaze similarity refers to groups having the same amount of gaze clustering on the same location(s) in the scene (specific details of the gaze similarity analysis are below).

The few eye-movement differences in Loschky et al. (2015) occurred during a single shot of the clip that was essentially a static image that allowed participant gaze to explore the image. In other words, the static nature of the scene may have allowed for eye-movement differences similar to those found in previous experiments using static scenes (DeAngelus & Pelz, 2009; Smith & Mital, 2013; Yarbus, 1967). Nonetheless, the lack of eye-movement differences throughout the rest of the film clip was striking given the large effects typically found during static scene viewing, and the effects found for perspective taking (Lahnakoski et al., 2014) and location-based viewing tasks (Smith & Mital, 2013). Taya, Windridge, and Osman (2012) provide evidence of a case in which top-down effects are lacking during free viewing. Similarly, Wang et al. (2012) used a scrambling manipulation with narrative film sequences, which is known to reduce narrative comprehension and memory for texts and picture stories (Gernsbacher, Varner, & Faust, 1990; Larson, Wallace, McQuade, Badke, & Loschky, 2011; Thorndyke, 1977), yet Wang et al. (2012) found very few effects on eye movements except looking at the “most important object” immediately after each cut. This raises the critically important question addressed in the current study, namely, why and when is there a general dissociation between eye movements and film comprehension, which fails to support the eye–mind hypothesis in film viewing?

Overview of the present study

Filmmakers and theorists have long debated the degree to which viewers are active in their consumption of film (see Stam & Miller, 1999 for review). Previous empirical work has found that, in highly composed and rapidly edited films, there seem to be minimal opportunities for top-down impact on narrative processing and gaze (Carmi & Itti, 2006; Hasson et al., 2008; Loschky et al., 2015; Mital et al., 2010; Smith & Mital, 2013). Consistent with these previous works, the present study strategically used a “found film” clip that best illustrated the phenomenon of interest and built an experimental paradigm around it. In this case, based on the prior research described above, we wanted a found film that did not conform to the features of a typical, highly produced narrative film (Bordwell, 2002) so that we could create a strong test of top-down comprehension effects on eye movements in film.

We developed selection criteria for a clip based both on its bottom-up features and what it afforded in terms of top-down manipulations. First, the clip needed to lack specific bottom-up features that create attentional synchrony, which should therefore enhance the opportunity for top-down processes to differentially guide viewers’ eye movements while watching the clip. Many film sequences show only a single primary object of interest in each shot, which limits the opportunities for attention to be shifted to different screen locations. A film segment with many different things to look at could reduce the degree of attentional synchrony as different people may look at different things in the film frame. Second, each time there is a film cut (i.e., a switch between camera shots) there is a sudden decrease and then increase in attentional synchrony as viewers search for and then find the point of central interest in the new shot (Carmi & Itti, 2006). A film sequence lacking any cuts for long periods of time (i.e., a “long-take”) would remove the “resetting” after each cut (Mital et al., 2010; Wang et al., 2012). We chose one of the most famous long-takes in film history, the opening scene of Orson Welles’ Touch of Evil (Welles & Zugsmith, 1958). This long (3 minutes and 12 seconds) single shot depicts events at a Mexico–USA border crossing in the 1950s. Using a combination of deep-focus, wide framing, and a continuous camera movement that takes in much background action, this shot is a much-discussed example of the type of filmic composition film theorist Andre Bazin viewed as the “ideal” way for cinema to capture reality (Bazin, 1967). Bazin stated that by not cutting but instead choosing to depict the action in a single long-take, directors like Orson Welles (here and in Citizen Kane, 1941) evoke:

a more active mental attitude on the part of the spectator and a more positive contribution on his part to the action in progress. While analytical montage [i.e., including many cuts and shots] only calls for him to follow his guide, to let his attention follow along smoothly with that of the director who will choose what he should see, here he is called upon to exercise at least a minimum personal choice. It is from his attention and his will that the meaning of the image in part derives. (Bazin, 1967, pp. 35–36)

This Bazin quote (from the perspective of film theory) directly supports our prediction that the opening shot of Touch of Evil should provide an ideal opportunity to find top-down influences of viewers’ comprehension on their gaze.

To create top-down comprehension effects on gaze, it may also be necessary to require the viewer to acquire information from different regions within the scene. For example, Taya et al. (2012) found that both experts and novices tend to have high gaze similarity while watching a tennis match. One likely reason for this is that regardless of expertise, there was only one primary thing to watch—namely the ball (and, to a lesser extent, the player whose court the ball was in). Similarly, in the Moonraker clip used in Loschky et al. (2015), there was usually only a single primary object of interest in each shot, thus theoretically increasing the degree of attentional synchrony. For the current study, choosing the Touch of Evil clip that has multiple objects in the frame that could be relevant to the narrative at any given moment allows for a top-down manipulation that would require viewers to look in different places (Bazin, 1967).

The narrative content of the opening shot of Touch of Evil (https://youtu.be/vIUBoj8CqF8) allows for just such a manipulation of comprehension at the event model level (Kintsch, 1998; van Dijk & Kintsch, 1983). The clip opens on a close-up of someone setting a time bomb (Fig. 1). The time bomb is then placed into the trunk of a car, after which a couple unknowingly gets into the car and drives off, as the camera follows them. About halfway through the clip a second couple walking down the street is introduced (played by Charlton Heston and Janet Leigh, who are mentioned in the quote below), and the camera begins to follow them with the car always lurking around them. Importantly, after the bomb is put into the car, it is never seen again for the remainder of the clip. Many film critics have argued that this creates a very suspenseful experience for viewers as they wait for the time bomb to explode (Comito, 1986; D’Angelo, 2012; Stubbs, 1985), and it has been theorized to specifically guide attention to the car:

Our knowledge of the impending explosion makes us hyper-aware of the car’s location, especially in relation to Heston and Leigh (even though we don’t yet know anything about their characters), and Welles expertly teases this instinctive anxiety by allowing it to occasionally leave the frame, getting a few feet ahead of our heroes before being stopped by traffic or passing goats. (D’Angelo, 2012)

The camera does not … move about in order to concentrate and guide our attention. Rather, it seems teasingly to withhold from us what we want to see, what we know--from what we’ve been permitted to see and also from other movies we’ve seen--must be coming. (Comito, 1986, p. 8)

Thus, from a filmmaker’s perspective, it is the knowledge of the bomb that makes the clip so suspenseful; without the bomb, it is just a mundane shot of people and cars on a street. There is theoretical support for these filmmaker intuitions. The presence of the bomb should create a token (i.e., “[bomb]”) in the viewer’s event model that is associated with the car (e.g., Radvansky, 2005), which should be reactivated every time the car is in the frame, including its causal implications for subsequent narrative events (i.e., it creates imminent danger for anyone near it) (e.g., Myers & O’Brien, 1998). Conversely, if a viewer did not see the bomb put in the car, the car would have no particularly salient causal connections in the unfolding narrative, other than as a means of transportation for characters that may or may not be of relevance to the narrative, and would simply be a part of the backgrounded events weakly represented in the event model. Thus, theories of comprehension would say viewers should pay greater attention to the car when the bomb is part of their event model than when it is not. Our comprehension manipulation plays on the power of the bomb to create suspense and to make the car with the bomb an integral component of the scene. Overall, knowledge of the bomb and an impending explosion is at a fairly global level of comprehension, but it gives different levels of importance to local features of the scene (e.g., the car).

To manipulate knowledge of the bomb, we used the “jumped-in-the-middle” paradigm developed by Loschky et al. (2015). This manipulation creates the common experience of coming into a television program or film part way through and then trying to comprehend what is happening. Context group participants saw the bomb being placed in the car at the beginning of the scene, whereas No-context group participants did not. At the end of the clip, all participants were asked to predict what would happen next, which provided a basis for demonstrating that the manipulation of context affected comprehension. Eye movements were the primary data of interest.

We assessed both gaze similarity (i.e., the similarity in participant fixation locations on a frame-by-frame basis) and the extent to which participants fixated on the car with the bomb in the two context conditions. Specifically, the “Event Model” hypothesis (Loschky et al., 2015) predicts there will be greater gaze similarity within conditions (Context and No-context) than between them. This could be the result of, for example, a greater likelihood of viewers fixating on the car in the Context condition than in the No-context condition, as predicted by D’Angelo (2012) above and other film theorists (Comito, 1986; Stubbs, 1985). Overall, Context condition participants looking at the car more would likely result in less exploratory behavior, which would be seen in longer fixation durations and shorter saccades. Alternatively, according to the Tyranny of Film hypothesis (Loschky et al., 2015), the attentional synchrony created by the bottom-up features of this masterfully produced film (created by virtuoso filmmaker Orson Welles) would limit the impact of top-down comprehension on eye movements. Specifically, if everyone is looking at the same places at the same times, there would be no room for differences in eye movements due to differences in viewers’ comprehension. As such, this hypothesis predicts comparable levels of gaze similarity and number of fixations on the car with the bomb in the two context conditions. Importantly, in comparison to the James Bond Moonraker clip used in Loschky et al. (2015), the comparatively weaker bottom-up features of the Touch of Evil clip should theoretically reduce the attentional synchrony. Nevertheless, it is conceivable that Welles used all the other filmmaking techniques at his disposal (mise-en-scène [i.e., staging], camera framing [i.e., what is shown on camera], camera movement, etc.) to masterfully guide viewers’ attention.
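
The within-versus-between contrast predicted by the Event Model hypothesis can be illustrated with a deliberately simplified sketch. This is not the gaze similarity analysis used in the study. The distance-based measure, the function names, and the toy gaze coordinates are all assumptions introduced here purely for illustration. For a single frame, the sketch compares the mean pairwise distance between gaze points within each group to the mean distance between points from different groups.

```python
from itertools import combinations, product
from math import dist
from statistics import mean


def mean_within_distance(points):
    """Mean pairwise Euclidean distance among one group's gaze points (needs >= 2 points)."""
    return mean(dist(a, b) for a, b in combinations(points, 2))


def mean_between_distance(group_a, group_b):
    """Mean Euclidean distance over all cross-group pairs of gaze points."""
    return mean(dist(a, b) for a, b in product(group_a, group_b))


def within_between_gap(group_a, group_b):
    """Between-group distance minus the average within-group distance.

    A positive value means gaze is more tightly clustered within groups
    than across groups on this frame (hypothetical measure, illustration only).
    """
    within = mean([mean_within_distance(group_a), mean_within_distance(group_b)])
    return mean_between_distance(group_a, group_b) - within


# Toy single-frame gaze positions in pixels (invented, not data from the study).
context = [(500, 300), (510, 310), (495, 305)]
no_context = [(200, 400), (210, 390), (205, 410)]

print(within_between_gap(context, no_context))
```

Under this toy measure, a clearly positive gap on many frames would be in the direction the Event Model hypothesis predicts, whereas a gap near zero would be in the direction the Tyranny of Film hypothesis predicts.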

This found film approach is valuable in situations where it is difficult to equate stimuli on the features of interest, which is the case when using naturalistic films that vary dramatically in visual cinematic features that can affect attention and eye movements. Additionally, it is intended to demonstrate a phenomenon that already exists in the world that should subsequently be studied in a more controlled manner, likely with multiple experimenter-generated video clips. Importantly, the present study was carried out as a replication and extension of Loschky et al. (2015), thus adding greater generalizability to that work and providing new and deeper insights.

Experiment 1: context and eye movements

Methods

Experiment 1 tested the effect of the comprehension differences on viewers’ eye movements while watching the film clip.

Participants

Eighty-four participants (61 females; mean age = 18.6 years; standard deviation (SD) = 1.4) were pseudo-randomly assigned to one of two viewing conditions for the opening scene of Touch of Evil (Context, n = 42; No-context, n = 42). The Kansas State University Institutional Review Board approved all experiments in the study. The study was determined to pose minimal risk to the participants and informed consent was deemed unnecessary (i.e., exempt under the criteria set forth in the Federal Policy for the Protection of Human Subjects). All participants received course credit for their participation, and all analyses were performed on de-identified data.

Stimuli

Two clips from the opening scene of Orson Welles’ Touch of Evil were used (Welles & Zugsmith, 1958). The Context version shows a bomb being placed in a car trunk at the beginning and runs for 3:12. The No-context version omits the first 18 seconds when the bomb is placed in the car and runs for 2:54. Both clips end with a close-up of the walking couple kissing. An initial experiment (Hutson, Magliano, Smith, & Loschky, Working memory span and film comprehension: Effects on high-level inference generation, in preparation; not presented here) found that presenting the clip with audio created the largest effect between the Context and No-context conditions in inference generation. Thus, audio was presented with the film clip.

Both clips were presented at a frame rate of 30 frames per second (fps) and a resolution of 1080 × 720 pixels. The video clips were shown on a 17” ViewSonic Graphics Series CRT monitor (Model G90fb). A chin and forehead rest set a fixed viewing distance of 60.96 cm. The screen subtended 21.42° × 16.10° of visual angle.
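The relation between viewing distance, physical extent, and visual angle used above can be sketched as follows. This is an illustrative calculation only: the function names are hypothetical, and because the physical image size is not stated in the text, it is derived here from the reported angles rather than assumed.

```python
import math

def visual_angle_deg(size_cm, distance_cm):
    """Visual angle (degrees) subtended by an extent of size_cm at distance_cm."""
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))

def size_from_angle_cm(angle_deg, distance_cm):
    """Inverse relation: physical extent (cm) that subtends angle_deg at distance_cm."""
    return 2 * distance_cm * math.tan(math.radians(angle_deg) / 2)

VIEWING_DISTANCE_CM = 60.96  # viewing distance reported in the text

# Physical image extent implied by the reported 21.42 x 16.10 degrees
width_cm = size_from_angle_cm(21.42, VIEWING_DISTANCE_CM)
height_cm = size_from_angle_cm(16.10, VIEWING_DISTANCE_CM)
```

Under these assumptions the reported angles correspond to an image roughly 23 cm wide and 17 cm high at the stated viewing distance.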

Eye tracking was done using an EyeLink 1000 eye tracker (SR Research), which samples eye position 1000 times per second (1000 Hz). Based on the SR Research guidelines, an average spatial accuracy of 0.5° of visual angle and a maximum error of 1° or better were obtained for all calibrations.^{Footnote 1}

Procedure

All participants were told that they would be shown a video clip while their eyes were tracked. Participants went through a nine-point calibration routine, after which the experiment began. An eye-movement trigger was used to ensure that the video started at the beginning of a fixation. To start a trial, while the participant was looking at the central fixation point, they pressed a button which moved the fixation point 13.65° to the right of center. Once the participant fixated the new point, it moved back to the center. During the saccade (velocity > 30°/s) back to the center, the video began to play. In this way, any saccadic inhibition (which increases the current fixation duration), caused by the motion transient due to the sudden onset of the video clip, was masked by the viewer’s own eye movement (Reingold & Stampe, 2000, 2002). Participants then watched the video, uninterrupted, until the moment when the couple kisses (3:12 into the Context condition and 2:54 into the No-context condition). At the end of the video all participants were asked, “What will happen next?” and responses were collected using the computer keyboard. The next question asked was, “Have you seen this movie before?” The keyboard was used to indicate “Yes” or “No.” If a participant responded “Yes” they were asked the follow-up question, “What was the name of the movie?” No participants indicated having seen the movie before.

Data analysis

To identify whether participants’ predictive inferences at the end of the clip were influenced by having the bomb in their event model, we had two research assistants code each inference, with coders blind to the condition from which each response was taken. The coding of the inference was dichotomous (1 = the participant mentioned something related to the bomb, 0 = the participant did not). The coders had a high level of inter-rater reliability (Cohen’s Kappa = 0.954, p < 0.001). Any remaining discrepancies between the two coders were resolved through discussion. After coding, the four participant groups were Context + Inference (n = 33), Context + No-inference (n = 9), No-context + Inference (n = 1), and No-context + No-inference (n = 41).
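For readers unfamiliar with the reliability statistic, Cohen’s kappa compares the observed agreement between two coders with the agreement expected by chance from their marginal code frequencies. A minimal sketch follows; the function name and the ten example codes are made up for illustration and are not the study’s data.

```python
def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two equal-length lists of categorical codes."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("code lists must be non-empty and equal in length")
    n = len(coder_a)
    categories = set(coder_a) | set(coder_b)
    # Proportion of items on which the two coders agree
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance from each coder's marginal proportions
    expected = sum(
        (coder_a.count(c) / n) * (coder_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both coders used a single identical category throughout
    return (observed - expected) / (1 - expected)

# Hypothetical codes for ten responses (1 = bomb-related inference, 0 = not)
coder_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
coder_2 = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
kappa = cohens_kappa(coder_1, coder_2)
```

In this toy example the coders agree on nine of ten responses, chance agreement is 0.5, and kappa is therefore 0.8.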

In this and all following experiments, Bayes factors (BF₀₁ reported) (Rouder, Morey, Speckman, & Province, 2012; Rouder, Speckman, Sun, Morey, & Iverson, 2009; Wetzels & Wagenmakers, 2012) were calculated for tests that did not reject the null to identify the level of evidence for the null, which would support the Tyranny of Film hypothesis. Values over 1 offer some evidence for the null, over 3 is substantial evidence, and over 10 is strong evidence for the null.
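The verbal labels attached to BF₀₁ values above can be written as a small lookup. This sketch only encodes the thresholds stated in the text; the function name and label wording are illustrative.

```python
def describe_bf01(bf01):
    """Map a Bayes factor in favour of the null (BF01) to the labels used in the text."""
    if bf01 > 10:
        return "strong evidence for the null"
    if bf01 > 3:
        return "substantial evidence for the null"
    if bf01 > 1:
        return "some evidence for the null"
    return "no evidence for the null"

def bf10_from_bf01(bf01):
    """BF10 (evidence for the alternative) is the reciprocal of BF01."""
    return 1.0 / bf01

# Two BF01 values reported for experiment 1
labels = {bf: describe_bf01(bf) for bf in (5.48, 1.25)}
```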

An important consideration when analyzing eye-movement data in videos is that there may be smooth pursuit eye movements, which are low-velocity eye movements during which visual information still reaches the visual cortex. Unfortunately, there are still no reliable methods for parsing eye-movement data to differentiate between fixations and smooth pursuits (Larsson, Nyström, Ardö, Åström, & Stridh, 2016). Nevertheless, this issue was addressed by rerunning analyses that didn’t already account for smooth pursuit with a cleaning procedure to remove potential smooth pursuits. The cleaning procedure quantified the maximum linear displacements during intersaccadic intervals (the period when the eye-tracker eye-movement parser estimated that the eyes were in a fixation due to their velocity being lower than the saccade threshold). The change in eye location during the intersaccadic interval was identified first by calculating the Euclidean distance in pixels between the x,y location of where the saccade before the intersaccadic interval ended and the location of where the next saccade began. This pixel value was then converted to the degrees of visual angle that the eyes moved during each intersaccadic interval. The majority of these intersaccadic intervals are likely to be fixations and exhibit low displacement. However, potential smooth pursuits, by definition, require displacement along with a moving target and can therefore be excluded by removing all intersaccadic intervals with displacements greater than 1° of visual angle. Surprisingly, this resulted in about 30% of all previously identified fixations being cleaned from the data set, regardless of experiment or condition. This is a noticeably higher proportion of potential pursuits than has previously been reported for video viewing (e.g., 2.8% of all data; Smith & Mital, 2013). 
However, despite the large number of previously identified fixations being removed from the data, the effects found for the remaining fixations (with low displacement; unlikely to be pursuit periods) were unchanged. Below, we report results both with and without the intersaccadic cleaning to remove potential smooth pursuit eye movements from fixations.
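The cleaning procedure described above can be sketched as follows. This is not the authors’ code: the data layout and function names are hypothetical, and the pixels-per-degree value is an assumption obtained by dividing the reported horizontal resolution by the reported horizontal visual angle (a linear approximation that presumes the video filled the reported screen extent).

```python
import math

# Assumption: linear mapping of 1080 px onto 21.42 degrees of visual angle
PX_PER_DEG = 1080 / 21.42

def displacement_deg(start_xy, end_xy, px_per_deg=PX_PER_DEG):
    """Euclidean displacement (degrees) across one intersaccadic interval.

    start_xy: where the preceding saccade ended (pixels)
    end_xy:   where the following saccade began (pixels)
    """
    dx = end_xy[0] - start_xy[0]
    dy = end_xy[1] - start_xy[1]
    return math.hypot(dx, dy) / px_per_deg

def remove_potential_pursuits(intervals, max_disp_deg=1.0):
    """Keep only intervals whose displacement does not exceed max_disp_deg."""
    return [
        iv for iv in intervals
        if displacement_deg(iv["start_xy"], iv["end_xy"]) <= max_disp_deg
    ]

# Two made-up intervals: a stable fixation and a likely pursuit
intervals = [
    {"start_xy": (500, 300), "end_xy": (506, 308)},  # 10 px, about 0.2 deg
    {"start_xy": (500, 300), "end_xy": (620, 300)},  # 120 px, about 2.4 deg
]
cleaned = remove_potential_pursuits(intervals)
```

Only the first interval survives the 1° criterion; the second is treated as a potential smooth pursuit and dropped.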

Results

Overview

As will be described in detail below, the results of experiment 1 showed that although there were large differences in participant comprehension based on the context manipulation, the only eye-movement effect was that Context participants who made the inference made longer saccades than No-context participants. Bayes factors indicated that all other effects (fixation durations, gaze similarity, and region of interest) supported the null hypothesis. Thus, experiment 1 mostly supported the Tyranny of Film hypothesis.

Predictive inference

A chi-square test was used to identify whether there was a comprehension difference between the Context and No-context conditions. The expected difference between context conditions was found, with 80% of participants in the Context condition making a bomb-relevant inference compared to only one participant from the No-context condition doing so (χ² (1, N = 85) = 51.59, p < 0.001). There were also qualitative differences in the predictive inferences generated. Specifically, instead of predicting that the bomb would explode, killing the couple in the car, other innocent bystanders, and possibly the walking couple, a common predictive inference among those in the No-context condition was that the couple in the car would have dinner with the walking couple. Thus, the results indicate viewers in the two conditions had radically different event models (i.e., comprehension) of the narrative in the film clip based on the context they were given.^{Footnote 2}
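The logic of the chi-square test of independence can be illustrated with a short sketch. It uses the four group sizes listed in the Data analysis section (33, 9, 1, and 41; N = 84), so it is not expected to reproduce the reported statistic exactly, which is based on N = 85. The function name is hypothetical, and no continuity correction is applied because the text does not say whether one was used.

```python
def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if condition and inference were independent
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Rows: Context, No-context; columns: inference, no inference
counts = [[33, 9], [1, 41]]
chi2 = chi_square_2x2(counts)
```

With these counts the statistic comes out at about 50.6 on one degree of freedom, the same order of magnitude as the reported value.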

Eye movements

Fixation durations and saccade lengths

Fixation durations and saccade lengths can be very sensitive to manipulations of comprehension in reading at both local and global levels (Rayner, 1998), what is currently being fixated in scenes (Henderson & Pierce, 2008; Henderson & Smith, 2009), and manipulations of task in dynamic scenes (Smith & Mital, 2013). The Event Model hypothesis thus predicts there should be effects of our comprehension manipulation on these basic eye-movement metrics in the current study. Specifically, based on the logic that Context condition participants will hold knowledge of the bomb in their event model, the Event Model hypothesis would predict that when the car with the bomb is on the screen they should have tighter gaze on the car. This should result in shorter saccades and longer fixations. The inclusion of these measures should give a fuller picture of the eye-movement results to help interpret the effects for gaze similarity and region of interest below.

All eye-movement data were first cleaned by removing the longest and shortest 1% of fixation durations and saccade lengths for each participant. We then compared the mean fixation durations and saccade lengths between the Context and No-context groups for the shared viewing period. There were no significant differences in fixation duration between the two conditions. In the Context condition, the average fixation duration was slightly descriptively longer than the No-context condition (Table 1), but not significantly different (t (82) = 0.438, p = 0.662; intersaccadic interval 1° cleaning, t (82) = 0.888, p = 0.377). There was substantial evidence for the null hypothesis (BF₀₁ = 5.48). The effect was the same when only participants in the Context condition who made the inference were compared to the No-context condition (t (73) = 0.318, p = 0.751; intersaccadic cleaning, t (73) = 0.772, p = 0.443). There was again substantial evidence for the null hypothesis (BF₀₁ = 5.33). The average saccade length for the Context group was descriptively longer than for the No-context group, and marginally significant (t (82) = 1.848, p = 0.068, d = 0.41; intersaccadic cleaning, t (82) = 1.892, p = 0.062). The Bayes factor only showed anecdotal evidence for the null hypothesis (BF₀₁ = 1.25). When only Context condition participants who made the inference were included the effect of condition and inference on saccade lengths was significant (t (73) = 2.089, p = 0.040; intersaccadic cleaning, t (73) = 2.168, p = 0.033). Thus, Table 1 shows participants in the context condition who made the inference had longer saccade lengths compared to the No-context condition, though this was a small-to-medium effect (d = 0.489).
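The per-participant trimming and the effect size reported above can be sketched as follows. Both functions are illustrative: the text does not specify how the 1% cut-offs were computed, so dropping the k smallest and k largest values (k = 1% of the sample, rounded down) is an assumption, as is the use of the pooled-SD form of Cohen’s d.

```python
import math
import statistics

def trim_extremes(values, proportion=0.01):
    """Drop the shortest and longest `proportion` of values (returned sorted)."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number removed from each tail
    return ordered[k:len(ordered) - k] if k else ordered

def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Made-up data: 200 fixation durations (ms) for one participant
durations = list(range(100, 300))
trimmed = trim_extremes(durations)

# Made-up per-participant mean saccade lengths (degrees) for two groups
d = cohens_d([4.0, 5.0, 6.0], [3.0, 4.0, 5.0])
```

The trimmed list is returned in sorted order, which is sufficient for computing per-participant means but discards the original temporal order.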

Table 1 Experiment 1 and 2 fixation duration and saccade length descriptive statistics


Longer saccade lengths usually show greater exploration of a scene (Pannasch, Helmert, Roth, Herbold, & Walter, 2008; Smith & Mital, 2013), which makes it surprising the Context group that made the inference would explore more. They have the best understanding of the narrative presented, and many of them maintained the bomb in their event model throughout the clip. This should create suspense that would guide their eye movements towards the car with the bomb, which should theoretically result in shorter saccade lengths. Bezdek et al. (2015) showed in an fMRI study that suspense in film narrows attentional focus. A potential alternative explanation is that the Context participants who made the inference did explore the scene more to look for potential effects of a bomb explosion. Also, there is the possibility that, due to their relatively good comprehension for the narrative, Context participants that made the inference were under less cognitive load to maintain the narrative, which gave them the opportunity to explore the screen more. This may be similar to a person watching a film for the second or third time, and noticing things they hadn’t in previous viewings because they don’t have to follow the narrative as closely. Nevertheless, this difference in saccade lengths was a relatively small effect, so the above interpretations must be made cautiously.

Gaze similarity

Data pre-processing

Comparing the spatiotemporal distribution of gaze between viewing conditions in dynamic media is more difficult than in static scenes as traditional scanpath comparison methods assume that the stimulus does not change during viewing (e.g., Scanmatch; Cristino, Mathot, Theeuwes & Gilchrist, 2010). Instead, a method based on comparison of gaze heatmaps between conditions on a frame-by-frame basis can be used. (Note that, for this reason, it is not necessary to clean out potential smooth pursuit eye movements, since the analysis simply calculates each viewer’s mean gaze position during that 1/30th of a second [i.e., 33 ms].) Gaze heatmaps represent the probabilistic spatial distribution of raw gaze points within a viewing condition and can be compared across conditions to statistically confirm qualitatively observable changes in heatmaps over time (e.g., tightening of gaze clusters = moments of attentional synchrony) and differences between heatmaps (i.e., when gaze similarity is high or low). Such heatmap comparison methods have become the standard in dynamic scene viewing research (Peters, Iyer, Itti, & Koch, 2005; Dorr et al., 2010; Caldara & Miellet, 2011; Loschky et al., 2015). The method used here is an adaptation of the Normalized Scanpath Saliency (NSS) first proposed by Peters et al. (2005) and extended to video by Dorr et al. (2010) (for full details of the method and equations, see Dorr et al. (2010) and Loschky et al. (2015)). Our gaze similarity derivation of NSS is preferable over alternative methods of comparing the distribution of gaze between two groups. For example, the alternative of averaging the separate Pearson correlations of gaze X and Y coordinates would ignore the 2D nature of the data at a particular moment, and also produce similarly high correlations whether each distribution is tightly clustered or not. 
The gaze similarity measure also allows us to use inferential statistics to identify moments when the gaze distributions between two groups differ in time, allowing for direct tests of the Event Model hypothesis.

The NSS method was modified for the analysis here in two critical ways. First, to calculate inter-observer similarity within the reference condition (in this case, the Context condition since it is the originally intended viewing condition by the filmmaker), a probability map is created by down-sampling the raw eye-tracking data to 33 Hz (from 1000 Hz) to express raw eye fixation X/Y coordinates per video frame, and exclude saccades and blinks (periods when visual encoding is absent). A 2D circular Gaussian (1.2° SD; roughly equivalent to the fovea) is then plotted around each raw gaze location and temporally averaged over a 225-ms moving time window (to roughly approximate the duration of an average fixation) for all but one participant within the Context condition. These Gaussians are summed and normalized relative to the mean and SD of these values across the entire Context condition, to see how the similarity fluctuates over time (z-score similarity = (Raw values − Mean)/SD). The gaze location of the remaining participant is then sampled from this distribution (i.e., a z-score is calculated for this participant) to identify how their gaze fits within the distribution at that moment. This “leave-one-out” procedure is repeated for all participants within the Context condition until each participant has a z-scored value (referred to as “gaze similarity” here). These values express both 1) how each individual gaze location fits within the group at that moment, and 2) how the average gaze similarity across all participants at that moment differs from other times in the video: a z-score close to zero indicates average synchrony, negative values indicate less synchrony than the mean (i.e., more variance), and positive values indicate more synchrony.

Second, the method is extended to allow gaze from different viewing conditions (e.g., No-context) to be sampled from a reference distribution (Context). For each gaze point in the No-context condition, the probability that it belongs to the Context condition’s distribution is identified by sampling the value at that location from the Context’s probability distribution (this time leave-one-out is not used as the gaze does not belong to the same distribution so cannot be sampled twice). The resulting raw NSS values for No-context are then normalized to the reference condition. Importantly, if the two distributions are identical, the average z-scored similarity for both distributions will fluctuate together, expressing more (positive z-score) or less (negative z-score) gaze similarity together over time (Fig. 2). However, as the similarity score is derived from the reference distribution, if the two distributions differ significantly, we cannot know if this is because the comparison distribution has more versus less gaze clustering than the reference distribution. We can only say that the comparison distribution differs more from the reference distribution than the reference distribution differs from itself. For example, both gaze distributions could be tightly clustered but in different, non-overlapping parts of the screen or the comparison distribution could be more spread out and only partly overlap with the tight reference distribution. Both situations would result in significant differences between the distributions.
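The two modifications described above (leave-one-out similarity within the reference condition, and sampling a comparison condition from the reference distribution) can be sketched in simplified form. This is not the published implementation (see Dorr et al., 2010, and Loschky et al., 2015, for that): it evaluates the Gaussian density only at the sampled gaze locations instead of building full probability maps, averages instead of sums the Gaussians so that leave-one-out and full-group densities are on the same scale, and omits the 225-ms temporal window. All function names and the toy coordinates are hypothetical.

```python
import math
import statistics

SIGMA_DEG = 1.2  # SD of the circular Gaussian, as stated in the text

def density_at(point, others, sigma=SIGMA_DEG):
    """Mean of circular Gaussians centred on `others`, evaluated at `point`."""
    total = sum(
        math.exp(-((point[0] - ox) ** 2 + (point[1] - oy) ** 2) / (2 * sigma ** 2))
        for ox, oy in others
    )
    return total / len(others)

def reference_raw(ref_frames):
    """Leave-one-out raw scores; ref_frames[f] is a list of (x, y) in degrees."""
    return [
        [density_at(g, frame[:i] + frame[i + 1:]) for i, g in enumerate(frame)]
        for frame in ref_frames
    ]

def comparison_raw(ref_frames, cmp_frames):
    """Comparison gaze sampled from the full reference distribution, per frame."""
    return [
        [density_at(g, ref) for g in cmp_frame]
        for ref, cmp_frame in zip(ref_frames, cmp_frames)
    ]

def gaze_similarity(ref_frames, cmp_frames):
    """Z-score both conditions against the reference condition's mean and SD."""
    ref = reference_raw(ref_frames)
    flat = [v for frame in ref for v in frame]
    mean, sd = statistics.mean(flat), statistics.pstdev(flat)
    def z(raw):
        return [[(v - mean) / sd for v in frame] for frame in raw]
    return z(ref), z(comparison_raw(ref_frames, cmp_frames))

# Toy data: frame 0 is tightly clustered, frame 1 is dispersed
context = [[(0, 0), (0.5, 0), (0, 0.5)], [(0, 0), (5, 5), (-5, 5)]]
no_context = [[(0.2, 0.2), (8, 8)], [(0, 0), (5, 5)]]
z_context, z_no_context = gaze_similarity(context, no_context)
```

In the toy data, the tightly clustered frame yields positive z-scores for the reference group and the dispersed frame yields negative ones, and a comparison gaze point far from the reference cluster receives a lower score than one inside it.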

To address this ambiguity of the gaze similarity measure, we also include three other gaze measures in our analyses. Specifically, we include two common measures of attentional synchrony, sum-weighted gaze covariance and number of gaze clusters (Mital et al., 2010), to determine the degree to which gaze in each group was tightly clustered. (We report these later for all experiments and groups in Fig. 8.) We also report region of interest analyses to determine the degree to which gaze goes to specific regions in each condition. By including all four gaze measures (gaze similarity, sum-weighted gaze covariance, number of gaze clusters, and the car region of interest analysis), we can disambiguate the sources of differences in gaze similarity.

Additionally, a shuffled baseline was created to demonstrate the gaze similarity that would be observed if eye movements were randomly distributed during the clip (given the constraints of normal eye movements within that clip). The shuffled baseline started with the Context group. Because the gaze similarity values are calculated based on each participant’s gaze location on each film frame, the order of frames for each participant (and thus the order of their eye movements across frames) was shuffled. In other words, for the first frame of the film, instead of having each participant’s first fixation the shuffled baseline may have one participant’s first fixation, a second participant’s 356th fixation, and another’s 22nd, etc. This new gaze distribution was then compared to the reference distribution (i.e., Context) in the same method described above. This created a chance baseline of gaze similarity which represents a floor value of similarity for each participant at every frame against which all other conditions can be compared. Importantly, since the shuffled baseline was created from eye movements from the same participants in the same clip, any biases within the clip (e.g., center bias or certain characters staying on a specific side of the scene) would be in the baseline as well. These frame-by-frame gaze similarity z-scores for each participant and condition (including the shuffled baseline) were then analyzed to identify differences in gaze similarity across conditions.
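The shuffling step of the baseline can be sketched as follows. The data layout (a dictionary mapping each participant to a list of per-frame gaze positions), the function name, and the fixed random seed are assumptions made for illustration.

```python
import random

def shuffle_frames_per_participant(gaze_by_participant, seed=0):
    """Return a copy in which each participant's frame order is independently shuffled."""
    rng = random.Random(seed)  # fixed seed so the baseline is reproducible
    shuffled = {}
    for participant, frames in gaze_by_participant.items():
        permuted = list(frames)  # copy so the original data stay untouched
        rng.shuffle(permuted)
        shuffled[participant] = permuted
    return shuffled

# Made-up gaze positions (degrees) for two participants over five frames
gaze = {
    "p01": [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)],
    "p02": [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)],
}
baseline = shuffle_frames_per_participant(gaze)
```

Because each participant’s own gaze positions are only reordered, spatial biases such as center bias are preserved in the baseline while the link between gaze and the current frame content is broken.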

Gaze similarity results

As shown in Fig. 2, the first gaze similarity analysis compared the Context and No-context conditions across the entire viewing period shared by both conditions (i.e., the 2 minutes and 54 seconds of film, starting just after the bomb was placed in the car).

Qualitatively, looking at Fig. 2 one can see that gaze similarity scores between the Context and No-context groups generally overlap throughout the film clip, indicating that, regardless of context condition, viewers likely did not differ in their overall gaze similarity. A t-test of mean gaze similarity between groups averaged across all frames supported this qualitative assessment (t (80) = 1.081, p = 0.283; d = 0.241), indicating that knowledge of the bomb did not have an effect on overall viewer gaze similarity. Additionally, the Bayes factor showed substantial evidence for the null (BF₀₁ = 3.45), namely support for the Tyranny of Film hypothesis. The results are similar when participants who did not make the inference in the Context condition are removed from the analysis (t (71) = 0.592, p = 0.556). Next, we included the shuffled baseline for comparison. As shown in Fig. 2, when the experimental groups’ gaze similarity is above the shuffled baseline, it indicates that gaze is more clustered than would be predicted by chance, possibly due to either the bottom-up features of the film or all viewers’ mental models systematically guiding their eye movements. This qualitative assessment was confirmed by adding the shuffled baseline to the ANOVA for condition, which produced a significant effect (F (2, 122) = 73.727, p < 0.001, ηp² = 0.551). Bonferroni corrected pairwise comparisons indicated that both the Context (mean (M) = −0.001, SD = 0.267) and No-context (M = −0.067, SD = 0.290) conditions had significantly greater gaze similarity than the shuffled baseline (M = −0.561, SD = 0.034). Importantly, Fig. 2 shows that this quantitative difference in gaze similarity from chance was highly systematic. For example, Fig. 2b shows the time point with the lowest gaze similarity, which shows a busy street scene with many people, goats, cars, and building signs to look at. Conversely, Fig. 2c shows the moment of highest gaze similarity, which is when the walking couple kiss, at the center of the screen, and there is nothing else of interest to look at (i.e., only a non-descript architectural background). Therefore, one cannot attribute the null effect of context on gaze similarity to the gaze similarity measure being insensitive to variations in attentional synchrony. On the contrary, Fig. 2, and comparisons with the shuffled baseline, shows that the gaze similarity measure was very sensitive to moments when one would predict to find lesser (Fig. 2b) or greater (Fig. 2c) attentional synchrony.

Region of interest

Data pre-processing

Dynamic regions of interest were created for the clip to test whether either condition looked more at the car with the bomb in it. To create the dynamic region of interest for the car, we used Gazeatron (Vo et al., 2012) to identify the rectangular x,y pixel coordinates for the car on the screen for each frame (at 30 frames/s). These pixel coordinates were then exported and combined with the raw fixation report from EyeLink DataViewer (SR Research). This was used to calculate the cumulative dwell time and mean number of fixations for each participant in the car region of interest. One-second time bins were used, and fixations were counted for the time bin they ended in (i.e., if a fixation went across time bins, it was only counted for the time bin it was in when the next saccade was generated).
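The binning rule described above (a fixation is counted in the one-second bin in which it ended) can be sketched as follows. This is not the Gazeatron or DataViewer workflow: the data layout and function names are hypothetical, and testing each fixation against the car rectangle of the frame on which the fixation ended is an assumption, since the text does not say which frame’s rectangle was used.

```python
from collections import Counter

def in_rect(x, y, rect):
    """True if (x, y) lies inside rect = (left, top, right, bottom) in pixels."""
    left, top, right, bottom = rect
    return left <= x <= right and top <= y <= bottom

def car_fixations_per_bin(fixations, car_rect_by_frame, fps=30, bin_s=1.0):
    """Count fixations on the car, assigned to the time bin in which they ended."""
    counts = Counter()
    for fix in fixations:
        frame = int(fix["end_s"] * fps)       # frame on which the fixation ended
        rect = car_rect_by_frame.get(frame)   # None when the car is off-screen
        if rect is not None and in_rect(fix["x"], fix["y"], rect):
            counts[int(fix["end_s"] // bin_s)] += 1
    return counts

# Made-up data: the car occupies the same rectangle for the first 2 s (60 frames)
car_rects = {frame: (100, 100, 300, 200) for frame in range(60)}
fixations = [
    {"end_s": 0.4, "x": 150, "y": 150},  # on the car, bin 0
    {"end_s": 1.2, "x": 150, "y": 150},  # on the car, bin 1
    {"end_s": 1.5, "x": 500, "y": 500},  # off the car
    {"end_s": 2.5, "x": 150, "y": 150},  # car no longer on screen
]
counts = car_fixations_per_bin(fixations, car_rects)
```

Dividing such counts by the total number of fixations in each bin would give the proportion of fixations on the car.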

Region of interest results

While gaze similarity is a metric that indicates the co-occurrence of eye movements in space and time, it does not indicate the features of a scene that are being attended to. The region of interest analysis remedies this by indicating how much a specific object in a scene, here the car with the bomb in it, is attended to. The Event Model hypothesis predicted that the car with the bomb would be of greater importance to participants in the Context condition, because they are aware of the potential destructive causal effects the car could have on nearby persons, places, and things.

As illustrated in Fig. 3, as with the gaze similarity analysis, fixations on the car by viewers in the two context conditions were compared for the shared viewing time from the start time of the No-context condition when both conditions were seeing the exact same information. The region of interest was used to calculate the mean number of fixations when the car was present on the screen within 30-frame (1 s) time bins.

As with the gaze similarity analyses, Fig. 3 shows the lines for the two context conditions are mostly overlapping, indicating that, regardless of the context condition, participants fixated the car at the same time points throughout the clip. A t-test comparing the proportion of fixations on the car between each condition was not statistically significant (t (82) = 1.73, p = 0.087; d = 0.382), and the Bayes factor showed anecdotal evidence for the null, namely the Tyranny of Film hypothesis (BF₀₁ = 1.52). The non-significant trend was for the No-context condition to have a higher proportion of fixations (M = 0.098, SD = 0.045) on the car than the Context group (M = 0.082, SD = 0.036), which is in the opposite direction of what was predicted. The result was the same when only those members of the Context condition who made the inference were included (t (73) = 1.434, p = 0.156; d = 0.316), which also showed anecdotal evidence for the null (BF₀₁ = 2.21). Overall, this indicates that the viewers without knowledge of the bomb fixated the car at a similar rate to those with knowledge of it.

Discussion

The context manipulation had an impact on comprehension in that participants in the Context condition were more likely to predict that the car would explode than those in the No-context condition. However, context had only one modest effect on eye movements. Saccade lengths were slightly longer for participants in the Context condition that made the inference about the bomb. This could potentially be argued to support the hypothesis that Context participants explored the scene more. However, all other eye-movement measures showed no effect. This included gaze similarity, which should pick up on one condition exploring the scene more than the other. These results therefore mostly support the Tyranny of Film hypothesis (Loschky et al., 2015). This is despite the fact that, as shown in Fig. 8, the Touch of Evil film clip produced less gaze clustering than the Moonraker clip (Loschky et al., 2015), which we had predicted due to the Touch of Evil clip lacking cuts but including numerous objects in the film frame to look at. This stands in contrast to the lack of effect of strong comprehension differences on viewers’ eye movements. Thus, our predicted reduction of gaze similarity in the Touch of Evil film clip in comparison to the Moonraker film clip was nevertheless not enough to allow a strong difference in comprehension between context conditions to produce meaningful differences in eye movements.

One potential problem with experiment 1 was that the No-context condition and the Context condition both show the car at the beginning of the scene, and in particular a couple getting into the car. First mentioned entities have a special status in event models for text narratives (e.g., Gernsbacher, 1990). As such, the car was likely prominent in the event models for both the Context and No-context conditions, which may have led to similar eye movements in both conditions, regardless of whether they had knowledge of the bomb.

Experiments 2a and 2b: eye-tracking with new No-context condition and map task

To address the potential first-mention issue (Gernsbacher, 1990), namely that viewers in the No-context condition may have treated the characters in the car as protagonists and therefore paid close attention to the car even though they did not know it contained a bomb, a new No-context condition was used. In the new No-context condition, viewers began watching the clip only after the walking couple entered the street and the car was off-screen (Fig. 1; image 5 marked “No-context: Exp. 2”). Thus, viewers in the new No-context condition should treat the walking couple as the protagonists. Viewers in the Context condition started watching the clip from the beginning as before.

Experiment 2b was conducted to demonstrate that the Tyranny of Film effect can be broken when there are strong endogenous factors affecting attention. Previous research has shown higher level cognition has large online effects on eye movements during scene viewing (Foulsham & Underwood, 2007; Henderson, Brockmole, Castelhano, & Mack, 2007; Henderson, Shinkareva, Wang, Luke, & Olejarczyk, 2013). These effects have also been shown more recently in “natural film” (i.e., unedited real-world video), such as trying to determine the location depicted in a video (Smith & Mital, 2013), and in edited narrative film, such as taking different film viewing perspectives (Lahnakoski et al., 2014). We therefore predicted viewer eye movements in the Touch of Evil clip would similarly be affected by a cognitive task designed to be at odds with understanding the narrative: a map-drawing task that involved creating a map of the narrative spatial environment. That is, eye movements for viewers under instruction to draw a map would be different than those of participants processing the film clip in order to comprehend it. Moreover, those eye movements under the map task would be at regions of the screen important to the task, but not important to the narrative content.