Exploring website gist through rapid serial visual presentation

Owens, Justin W.; Chaparro, Barbara S.; Palmer, Evan M.

doi:10.1186/s41235-019-0192-1

Original article
Open access
Published: 20 November 2019

Justin W. Owens^1,2,
Barbara S. Chaparro^1,3 &
Evan M. Palmer ORCID: orcid.org/0000-0003-2791-4390^1,4

Cognitive Research: Principles and Implications volume 4, Article number: 44 (2019)

Abstract

Background

Users can make judgments about web pages in a glance. Little research has explored what semantic information can be extracted from a web page within a single fixation or what mental representations users have of web pages, but the scene perception literature provides a framework for understanding how viewers can extract and represent diverse semantic information from scenes in a glance. The purpose of this research was (1) to explore whether semantic information about a web page could be extracted within a single fixation and (2) to explore the effects of size and resolution on extracting this information. Using a rapid serial visual presentation (RSVP) paradigm, Experiment 1 explored whether certain semantic categories of websites (i.e., news, search, shopping, and social networks/blogs) could be detected within an RSVP stream of web page stimuli. Natural scenes, which have been shown to be detectable within a single fixation in the literature, served as a baseline for comparison. Experiment 2 examined the effects of stimulus size and resolution on observers’ ability to detect the presence of website categories using similar methods.

Results

Findings from this research demonstrate that users have conceptual models of websites that allow detection of web pages from a fixation’s worth of stimulus exposure, when provided additional time for processing. For website categories other than search, detection performance decreased significantly when web elements were no longer discernible due to decreases in size and/or resolution. The implications of this research are that website conceptual models rely more on page elements and less on the spatial relationship between these elements.

Conclusions

Participants could detect websites accurately when the websites were displayed for less than a fixation and when the participants were allowed additional processing time. Subjective comments and stimulus onset asynchrony data suggested that participants likely relied on local features for the detection of website targets for several website categories. This notion was supported when the size and/or resolution of stimuli were decreased to the extent that web elements were indistinguishable. This demonstrates that schemas or conceptualizations of websites provided information sufficient to detect websites from approximately 140 ms of stimulus exposure.

Significance

In the 25+ years since the advent of the MOSAIC web browser, the web has become an integral part of modern society. Research has demonstrated that people have expectations about website layouts and can judge a web page’s usability, trustworthiness, and visual appeal in a glance. While we understand that users have expectations about websites, and that they can rapidly categorize websites, little is known about the underlying cognitive mechanisms for visually processing websites or forming layout expectations.

Literature on how websites are perceived, attended to, and classified mirrors various aspects of the scene perception literature, at least on the surface. Extensive research on scene perception covers topics including scene gist, scene classification, scene processing, and visual search. Due to these shared aspects, scene perception theories and methodologies, specifically those related to scene gist, could be used to explore the cognitive and perceptual processing of websites.

The results of this study demonstrate that website schemas or conceptualizations provided sufficient information to distinguish between different types of websites under rapid serial visual presentation. This suggests quick and efficient website perception may utilize a combination of gist-like and diagnostic feature processing.

Background

In 1993, the MOSAIC web browser ushered in the internet age, exposing modern culture to web pages, a new form of stimuli. In the 25+ years since, web pages have become integral to society. In 2016, 87% of US adults were online, up from 52% in 2000 (Pew Research, 2013). Given the tremendous contact with this relatively new class of stimuli, we wondered whether people could accurately detect website categories (e.g., news, shopping, search, social media) with exposure durations equivalent to a single glance.

Jahanian, Keshvari, and Rosenholtz (2018) established that participants could accurately categorize web pages with only 120 ms of stimulus exposure. These short presentations were sufficient for accurate detection of ads on the web pages and localization of the navigation menu. While participants used web page text as an information source during the task, they still performed above chance in a classification task when the text was inverted and reflected. Thus, Jahanian et al. (2018) demonstrated that participants can rapidly extract important featural information from a web page within a single glance and accurately categorize it. The present research extends the work of Jahanian et al. (2018) by testing participants’ ability to categorize web pages using an RSVP procedure, using different web page categories, more web page stimuli per category, and different stimulus sizes and resolutions.

Web users have well-established expectations for website layout and formatting (Bernard, 2001, 2003; Bernard & Sheshadri, 2004; Di Nocera, Capponi, & Ferlazzo, 2004; Granka, Hembrooke, & Gay, 2006; Owens, Chaparro, & Palmer, 2011; Owens, Palmer, & Chaparro, 2014; Roth, Schmutz, Pauwels, Bargas-Avila, & Opwis, 2010; Shaikh, Chaparro, & Joshi, 2006; Shaikh & Lenz, 2006). For instance, users expect navigation on the left or top of a website, and advertising on the top or right (Bernard, 2001; Bernard & Sheshadri, 2004; Shaikh et al., 2006; Shaikh & Lenz, 2006). Such layout expectations are cross-cultural (Bernard & Sheshadri, 2004; Shaikh et al., 2006), exist for specific types of websites (Roth et al., 2010), and are affected by users’ experience and expertise (Di Nocera et al., 2004; Roth et al., 2010). While users rely on website layout conventions, they can adapt to violations of these conventions, despite the decreased usability (McCarthy, Sasse, & Riegelsberger, 2004; Owens et al., 2014; Santa-Maria & Dyson, 2008; Tzanidou, Petre, Minocha, & Grayson, 2005).

Few studies have investigated what information can be derived from web pages in a single glance. With the exception of Jahanian et al. (2018), previous researchers mainly focused on subjective user impressions, such as visual aesthetics, trustworthiness, and perceived usability (Albert, Gribbons, & Almadas, 2009; Jiang, Wang, Tan, & Yu, 2016; Lindgaard, Dudek, Sen, Sumegi, & Noonan, 2011; Lindgaard, Fernandes, Dudek, & Brown, 2006; Thielsch & Hirschfeld, 2012; Tuch, Presslaber, Stocklin, Opwis, & Bargas-Avila, 2012). For instance, judgments about web page aesthetics are largely consistent across 50 ms, 500 ms, and unlimited exposure durations (Lindgaard et al., 2006, 2011). With exposures as low as 17 ms, aesthetic judgments have been shown to correlate with web page prototypicality and visual complexity (Tuch et al., 2012). With only slightly longer display durations (i.e., 50 ms), trust and perceived usability could be reliably rated (Albert et al., 2009; Lindgaard et al., 2011). Such 50-ms exposure durations are substantially shorter than the average fixation duration of 200–250 ms (Rayner, 2009).

The studies reviewed above raise the question: how can these types of judgments occur reliably within a single glance? Additionally, since users seem to have well-defined conventions for websites, how are website layouts perceived and represented cognitively? With little work exploring perceptual and cognitive representations of web pages and how quickly they can be accessed, we explored these questions with well-established methodologies from the scene perception literature.

Scene perception and gist

A theme common to scene perception literature has been how easily participants recognize visual scenes from very brief exposure durations. In general, global information of a scene is processed first, followed by local information (Navon, 1977). In the case of an outdoor scene, this is analogous to processing the forest before the trees. It seems that observers rely on global information to classify scenes, and work has provided additional detail about the sorts of holistic, global scene information that might be important for recognition.

Several scene perception theories incorporate processing of global information for the recognition of objects within a scene, including the perceptual schema model, the priming model, and contextual guidance model (Friedman, 1979; Henderson & Hollingworth, 1999; Oliva & Torralba, 2007; Torralba, Oliva, Castelhano, & Henderson, 2006). Such holistic representations have also been integrated into other vision theories, such as newer versions of Wolfe’s Guided Search model (Wolfe, Võ, Evans, & Greene, 2011) and the spatial envelope theory (Oliva & Torralba, 2001).

Changes to global statistics can affect scene perception (Joubert, Rousselet, Fabre-Thorpe, & Fize, 2009). Some research has suggested that superordinate categories (Rosch, 1978) have less bias than basic categories (Loschky & Larson, 2010). When scenes share global properties or lack distinct global properties, correctly distinguishing between scenes becomes more difficult (Greene & Oliva, 2009b; Loschky & Larson, 2010).

Observers’ ability to rapidly extract the “gist” of a scene has also been researched extensively. Seminal gist research found that participants detected rapidly presented target scenes above chance after being prompted with just a verbal description or image (Potter, 1975, 1976). Research describes scene gist as the extracted meaning of a scene occurring within a single fixation, possibly with little-to-no attention, based on global processing of visual information (Fei-Fei, Iyer, Koch, & Perona, 2007; Fei-Fei, VanRullen, Koch, & Perona, 2002; Greene & Oliva, 2009a, 2009b; Intraub, 1980, 1981; Larson & Loschky, 2009; Oliva, 2005; Potter, 1975, 1976). Information contained within scene gist may consist of a semantic label, a limited number of objects, and the spatial layout of objects (Oliva & Torralba, 2006).

Participants performed better when they were prompted with pictures than with text descriptors (Potter, 1976). Moreover, when prompts carry more specific information (e.g., butterfly vs. animal), performance typically increases (Intraub, 1981). Longer displays of scenes also result in richer descriptions of the scenes or features detected (Fei-Fei et al., 2007; Intraub, 1981; Loftus, Nelson, & Kallman, 1983). Fei-Fei et al. (2007) found a rich variety of information, including object identities and scene classifications, could be derived from a scene in 107 ms. Other research has shown that objects can be recognized as quickly as 100 ms (Liu, Agam, Madsen, & Kreiman, 2009). In fact, gist can be extracted in the absence of fine detail, from degraded scenes, when objects are difficult to process (Larson & Loschky, 2009; Meng & Potter, 2008; Oliva & Torralba, 2007; Potter, 1975, 1976; Rousselet, Joubert, & Fabre-Thorpe, 2005; Torralba, 2009), or even if multiple scenes are presented simultaneously (Potter & Fox, 2009).

Oliva (2005) proposed that gist occurs in conceptual and perceptual forms. Perceptual gist represents the depiction of the scene defined by its global features, while conceptual gist is the scene’s semantic meaning extracted during the cognitive processing that occurs after viewing the scene. Perceptual gist influences conceptual gist.

Extraction of semantic information from scenes is robust, even with visually degraded scenes. For instance, scenes can be detected and recognized even when they are partially occluded (Meng & Potter, 2008), inverted (Diamond & Carey, 1986; Epstein, Higgins, Parker, Aguirre, & Cooperman, 2006; Evans & Treisman, 2005; Harding & Bloj, 2010; Kelley, Chun, & Chua, 2003; Meng & Potter, 2008; Shore & Klein, 2000), have had color removed (Meng & Potter, 2008; Rousselet et al., 2005), contain object inconsistencies (Biederman, Mezzanotte, & Rabinowitz, 1982; Davenport, 2007; Davenport & Potter, 2004), and even when the scenes are low resolution or poor quality (Loschky, Hansen, Sethi, & Pydimarri, 2010; Oliva & Schyns, 1997; Torralba, 2009). The recognition and detection of scenes in such scenarios has been attributed to the semantic information derived from the scene.

Scene gist methods

During studies exploring scene gist, viewers typically see a prompt, followed by a scene stimulus for up to a few hundred milliseconds, and then a masking stimulus. Masks stop perceptual processing of a stimulus, allowing for more accurate estimates of requisite scene processing time (Potter, 1976). Another approach has been to display a prompt followed by an RSVP stream of scenes instead of a single scene and mask. The rapid presentation of stimuli, one after the next, effectively halts perceptual and conceptual processing of the previously presented stimuli (Intraub, 1984; Potter, 1976). Following the display sequence, viewers are asked whether any stimulus in the stream matched the prompt shown at the beginning of the task. With either approach, stimuli are often displayed for durations ranging from 10 to several hundred milliseconds, and participants typically achieve above-chance performance detecting targets provided by the prompt.

Detecting gist requires exposure durations shorter than a single fixation (Potter, 1975, 1976). This can be accomplished using sufficiently short display durations with appropriate inter-stimulus intervals (ISIs), which remove stimuli from the screen without masking and allow for continued processing. Loftus, Shimamura, and Johnson (1985) noted that performance using unmasked stimuli was equivalent to approximately 100 ms of additional exposure time, due to the prolonged sensory presence of the stimuli in the visual icon (Neisser, 1967; Sperling, 1960). Potter and Fox (2009) found that participants readily detected targets regardless of whether RSVP streams incorporated ISIs, but demonstrated that when ISIs were present, performance was relatively worse. During recognition tasks, participants performed similarly regardless of whether RSVP streams incorporated ISIs.

Website categories

Web pages are complex documents, consisting of a variety of elements arranged spatially within a single page. Some previous classification attempts have relied on groups of elements or the type of elements found within a web page, sometimes in combination with previous personal experiences (Crowston & Williams, 2000; Dillon & Gushrowski, 2000), while other attempts have focused on automation and examining the hierarchy and the occurrences of types of text within a web page (Rehm, 2002; Santini, 2006). These methods have typically resulted in researchers or automation creating genres, but not users. Jahanian et al. (2018) developed ten web page categories for their study: art place, blog, company, computer game, helpline, news, online tutorials, shopping, society, and tourism. The authors derived categories based on considerations of web page use, which were validated in a pilot study.

The web evolves over time, which has interesting implications for classification schemes. Santini (2007) noted that some types of websites may be emerging or may simply be unknown. For instance, before blogs were a mainstay on the Internet, they were considered an emerging genre. Similarly, Crowston and Williams (2000) found that a large portion of the genres they identified were previously unknown. Both sets of authors argued that web pages may be classified into multiple genres. For example, market research from NM Incite found that of the largest social networking websites, three were actually blogs (Nielsen, 2012): Blogger, WordPress, and Tumblr. In one study, participants examined the home pages of individual US state websites by placing them into groups, and then examining them in terms of form, function, and content over time (Ryan, Field, & Olfman, 2002). The importance of these dimensions shifted over time. However, none of these classification methods address whether websites can be classified into a taxonomy similar to that of scenes.

Web page gist

For gist processing of websites to occur, they would need to have characteristic spatial structure that the human visual system could learn and harness for rapid categorization, as in scenes. As reviewed above, web pages can be categorized into genres, and may evolve over time (Crowston & Williams, 2000; Santini, 2007). Web designs tend to follow a certain structure, often with navigation regions on the left side and top of the page, content in the middle, and advertising regions on the right (Bernard, 2001, 2003; Bernard & Sheshadri, 2004; Di Nocera et al., 2004; Granka et al., 2006; Owens et al., 2011, 2014; Roth et al., 2010; Shaikh et al., 2006; Shaikh & Lenz, 2006). People expect web pages to follow these layout conventions and may react negatively when the conventions are violated (Owens et al., 2014). People exhibit a phenomenon called “banner blindness” where they will ignore areas of websites where ads are most expected, even if they know that relevant information may be located there (Benway, 1998; Owens et al., 2011, 2014). This suggests that habitual ignoring of web page regions may be based on the spatial structure of the website, rather than the visual characteristics of the elements.

Thus, it seems that there is evidence that people develop gist-like representations of web pages (Jahanian et al., 2018). We believe the human visual system is tuned to statistical regularities in the world and exploits those regularities to guide behavior whenever possible (e.g., Turk-Browne, Jungé, & Scholl, 2005). To further determine whether there is indeed gist processing of web pages at a glance, we employed an RSVP paradigm, as described in the present study’s experiments.

The current work

Given that humans can recognize scenes in a glance, we wondered whether they could recognize different types of websites in a glance. To investigate these issues, we first had participants classify website screenshots into multiple categories as part of a pilot study. A sample of 271 participants recruited from Wichita State University and Mechanical Turk classified 132 de-branded websites into one of nine categories: news, search, shopping, social networks, blogs, maps, corporate, general knowledge websites, or none. Social networking web pages were classified as both blogs and social networks, so they were combined into a single blogs/social networks category. Web pages with over 80% agreement for a single category, but no more than 20% agreement for a second category, were selected for the study (see Table 1 for category agreement results). These categories had websites that participants would likely have experience using, but also represent some of the oldest or largest website categories found on the Internet.
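
The selection rule described above (over 80% agreement on one category, no more than 20% on a second) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `select_pages`, the dictionary layout, and the example agreement values are hypothetical and are not taken from the pilot data. It also assumes that agreement shares for a page may sum to more than 1, as would happen if raters could assign more than one category.

```python
def select_pages(agreement, top_min=0.80, second_max=0.20):
    """Keep pages whose top category exceeds top_min while the
    runner-up category stays at or below second_max.

    agreement maps a page id to {category: share of raters} (hypothetical layout).
    Returns {page id: winning category}.
    """
    selected = {}
    for page, shares in agreement.items():
        # Rank categories from highest to lowest agreement share.
        ranked = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
        top_category, top_share = ranked[0]
        second_share = ranked[1][1] if len(ranked) > 1 else 0.0
        if top_share > top_min and second_share <= second_max:
            selected[page] = top_category
    return selected


# Made-up agreement shares, for illustration only.
example = {
    "page_a": {"news": 0.95, "blogs": 0.05},
    "page_b": {"news": 0.85, "blogs": 0.30},   # second category too high
    "page_c": {"search": 0.70, "maps": 0.10},  # top category too low
}
print(select_pages(example))
```

With these invented values, only `page_a` passes both criteria.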

Table 1 Participant Agreement For Website Classification

After determining website categories, we tested users’ ability to detect a specific type of website within an RSVP stream of other websites in Experiment 1. In Experiment 2, we explored the effects of size and blur on observers’ ability to rapidly detect websites in an RSVP stream.

Experiment 1

Experiment 1 was conducted to determine whether specific web pages could be detected with above chance accuracy during an RSVP task, when displayed for less than a fixation (≤ 140 ms). Comparisons were provided by having participants detect upright and inverted natural scenes in separate conditions. Upright natural scenes provided a best case comparison for gist perception and scene-related performance, while inverted natural scenes provided a degraded performance comparison by interrupting the extraction of a scene’s semantic meaning.

Inverted scenes have several advantages. Features such as spatial structure, color, and luminance remain consistent regardless of orientation, but the change in orientation tends to interfere with perception of semantic scene categories. Using inversion (180-degree rotation) as a method of degrading scenes has had mixed results. Inversion reduces detection of scene targets during RSVP tasks (Evans & Treisman, 2005) and detection of changes in a scene (Kelley et al., 2003; Shore & Klein, 2000), but not detection of animals and humans (Rousselet et al., 2003). Inversion has been shown to reduce performance when used in combination with occlusion (Meng & Potter, 2008) and changes in luminance (Harding & Bloj, 2010), but not significantly when combined with gray scaling (Nandakumar & Malik, 2009) and jumbling scenes (Zimmermann, Schnier, & Lappe, 2010).

In this study, a staircase procedure was used to estimate stimulus onset asynchrony (SOA) durations necessary for several levels of performance for upright scenes, inverted scenes, and web page targets in an RSVP task. SOAs represent the amount of time elapsed between the onset of two stimuli. We developed three hypotheses:

H₁: Participants will be able to discriminate between categories for both scenes and web pages based on stimulus exposures lasting less than a fixation, but participants will have worse performance for web pages than for scenes. We felt that web page perception would be worse because participants have seen more scenes than web pages during their lifetime.
H₂: The SOA necessary to reach desired accuracy thresholds will be lower for categories with higher agreement.
H₃: Performance will be better for upright scenes than inverted scenes, but performance for inverted scenes will be similar to or better than performance for web pages. As with H₁, we felt that participants would have more experience with scenes as a whole during their lifetime.

Methods

Participants

Twenty-two college students from Wichita State University participated for course credit. All participants provided informed consent and the study was approved by the Wichita State University Institutional Review Board. Two participants did not complete the study and another was omitted from analysis due to poor overall performance (z = − 2.41, M = 865 ms). Of the remaining 19 participants (M = 21.16 years, SD = 3.67 years; 7 males, 12 females), four self-reported using the Internet 1–10 h per week, while 15 self-reported using the Internet 11 h or more per week. The most common self-reported reasons for using the Internet included e-mail, entertainment, education, and social networking.

Apparatus

Participants viewed stimuli on a 22-in. CRT monitor with an 85 Hz refresh rate and 1400 × 1050 pixel (px) resolution, paired with a 2 GHz Apple Mac Pro computer running Matlab and RSVP software using PsychToolbox (Brainard, 1997; Kleiner, Brainard, & Pelli, 2007; Pelli, 1997) and QUEST (Pelli, 1987; Watson & Pelli, 1983).

Materials

Websites and visual scenes were selected as stimuli for the study. All visual stimuli were presented at 512 × 386 px, which subtended 13.69° by 10.34° at a 60 cm distance. As described above, website stimuli were selected from a pilot study in which participants classified screenshots of web pages (in the same resolution as presented in the current study) into one or more categories, yielding 276 screenshots per category (the entire website stimulus set can be downloaded from https://scholarworks.sjsu.edu/psych_pub/28/). Each of the website categories selected for this study had classification agreement scores of above 94%. Due to a configuration error, only 172 screenshots for the social networks/blogs category were used in Experiment 1. See Fig. 1 for examples of the website stimuli.
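
As a consistency check on the stimulus geometry, the reported visual angles can be converted back to physical extents at the 60 cm viewing distance using the standard relation size = 2 · d · tan(θ/2). The sketch below is illustrative only: the function names are hypothetical, and the implied pixel pitch is derived from the reported numbers rather than stated in the study.

```python
import math


def extent_cm(angle_deg, distance_cm):
    """Physical extent that subtends angle_deg at distance_cm."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


def visual_angle_deg(size_cm, distance_cm):
    """Visual angle subtended by size_cm at distance_cm."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))


DISTANCE_CM = 60.0
width_cm = extent_cm(13.69, DISTANCE_CM)   # horizontal extent of the stimulus
height_cm = extent_cm(10.34, DISTANCE_CM)  # vertical extent of the stimulus

# Implied pixel pitch along each axis (derived, not reported in the study).
pitch_w = width_cm / 512
pitch_h = height_cm / 386

print(f"width  = {width_cm:.2f} cm, pitch = {pitch_w:.4f} cm/px")
print(f"height = {height_cm:.2f} cm, pitch = {pitch_h:.4f} cm/px")
```

If the two pitch values agree closely, the reported pixel dimensions and visual angles are mutually consistent for square pixels.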

For the natural scene stimuli, four categories were selected: beaches, mountains, deserts, and forests. For target and distractor categories, 268 and 284 scenes were selected, respectively. Stimuli were downloaded from the SUN database (Xiao, Hays, Ehinger, Oliva, & Torralba, 2010) and Google Image Search. The scene stimuli were validated through pilot testing. Inverted versions of the natural scenes were also created. See Fig. 2 for examples of outdoor scenes.

Procedure

Participants were screened for normal color vision and normal or corrected-to-normal visual acuity. The researcher described the experiment procedure and provided descriptions of the stimuli. Participants then were seated at a chinrest where the RSVP program provided instructions on the task and how to respond using the keyboard.

The experiment consisted of multiple RSVP trials. Each trial consisted of a brief written description of the target, followed by a fixation point, blank screen, multiple stimuli presented in succession, and a prompt inquiring whether the target category was present in the RSVP stream. See Fig. 3 for a schematic of the trial.

For each website or natural scene category, QUEST statistical-based adaptive staircases (Watson & Pelli, 1983) were initialized for three accuracy thresholds indicating above-chance performance (i.e., 60%, 75%, and 90%). Participants completed two practice and 40 experimental trials for each category/threshold condition, equating to 24 practice and 480 experimental trials in total. Half of these trials were target present, while the other half were target absent. The trial order was randomized.
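
The trial counts follow from four categories crossed with three thresholds. A minimal sketch of such a schedule is shown below for the four website categories. It assumes that target-present and target-absent trials are balanced within each category/threshold condition, which the text does not specify (it states only that half of all trials were target present). The names `build_schedule`, `CATEGORIES`, and `THRESHOLDS` are hypothetical.

```python
import random
from itertools import product

CATEGORIES = ["news", "search", "shopping", "social networks/blogs"]
THRESHOLDS = [0.60, 0.75, 0.90]
PRACTICE_PER_CONDITION = 2
EXPERIMENTAL_PER_CONDITION = 40


def build_schedule(rng):
    """Return a shuffled list of experimental trials.

    Assumption: present/absent trials are balanced within each condition.
    """
    trials = []
    half = EXPERIMENTAL_PER_CONDITION // 2
    for category, threshold in product(CATEGORIES, THRESHOLDS):
        for present in [True] * half + [False] * half:
            trials.append({
                "category": category,
                "threshold": threshold,
                "target_present": present,
            })
    rng.shuffle(trials)  # randomize trial order
    return trials


schedule = build_schedule(random.Random(1))
n_conditions = len(CATEGORIES) * len(THRESHOLDS)
practice_total = PRACTICE_PER_CONDITION * n_conditions
print(n_conditions, practice_total, len(schedule))
```

Twelve conditions at 2 practice and 40 experimental trials each reproduce the 24 practice and 480 experimental trials reported.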

Each trial consisted of 15 randomly selected stimuli from three nontarget categories. Nontarget stimuli had an equal chance of being seen multiple times during the study, but targets were only seen once. In each trial, nontargets were selected without replacement, but between trials, nontargets were selected with replacement. During a target-present trial, a single stimulus from the target category was selected without replacement and placed randomly in the RSVP stream between, but not including, the first and last positions.
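
One way to assemble a single RSVP stream under these rules is sketched below. Several details are assumptions rather than the study's implementation: the function name `build_stream` is hypothetical, the nontarget pool is treated as one list already pooled across the three nontarget categories, and the target is assumed to replace a nontarget so the stream length stays at 15 (the text does not say whether the target is added to or substituted into the stream).

```python
import random


def build_stream(nontargets, target_pool, target_present, rng, stream_len=15):
    """Return (stream, target_index); target_index is None when no target is shown.

    nontargets:  list of nontarget stimulus ids, reusable across trials.
    target_pool: list of unused target ids; a chosen target is removed so that
                 each target is seen only once in the session.
    """
    # Within a trial, nontargets are drawn without replacement.
    stream = rng.sample(nontargets, stream_len)
    if not target_present:
        return stream, None
    # Across trials, targets are drawn without replacement.
    target = target_pool.pop(rng.randrange(len(target_pool)))
    # Interior positions only: never the first or last item in the stream.
    position = rng.randrange(1, stream_len - 1)
    stream[position] = target  # assumption: target replaces a nontarget
    return stream, position


rng = random.Random(0)
nontargets = [f"nontarget_{i}" for i in range(60)]   # placeholder ids
targets = [f"target_{i}" for i in range(5)]          # placeholder ids
stream, position = build_stream(nontargets, targets, True, rng)
print(position, stream[position], len(targets))
```

Because the nontarget list is passed in unchanged on every call, nontargets can recur between trials, while the shrinking target pool keeps each target unique.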

For each RSVP trial, the SOA was calculated using the QUEST algorithm. Each QUEST staircase was initialized with several parameters, including the minimum SOA, the maximum SOA, mean, and standard deviation. The minimum SOA was set to one screen refresh, the maximum SOA to one second, the mean set as the median of the range (505.9 ms), and the standard deviation was set as one second.
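
The staircase parameters above can be reproduced from the 85 Hz refresh rate reported in the Apparatus section. The short calculation below is only an arithmetic check; the variable names are hypothetical and it does not call any QUEST routine.

```python
REFRESH_HZ = 85.0

refresh_ms = 1000.0 / REFRESH_HZ      # duration of one screen refresh
min_soa_ms = refresh_ms               # minimum SOA: one screen refresh
max_soa_ms = 1000.0                   # maximum SOA: one second
prior_mean_ms = (min_soa_ms + max_soa_ms) / 2.0  # midpoint of the SOA range
prior_sd_ms = 1000.0                  # standard deviation: one second

print(f"refresh    = {refresh_ms:.2f} ms")
print(f"prior mean = {prior_mean_ms:.1f} ms")  # about 505.9 ms
```

The midpoint of the range from one refresh to one second matches the 505.9 ms value given in the text.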

Each stimulus was displayed for 140 ms or less during the RSVP stream. If the requested SOA exceeded this presentation duration, an ISI followed the stimulus presentation to make up the rest of the time (e.g., a 600-ms SOA would be 140 ms exposure plus 460 ms ISI). A participant’s progression through a staircase was halted when the display duration requested by the QUEST algorithm exceeded 976.5 ms (one second minus two screen refreshes) on ten consecutive trials. The number of trials remaining in the halted staircase was considered a measure of poor performance for that condition.
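
The exposure/ISI split and the halting rule can be expressed as two small functions. This is a sketch under stated assumptions, not the study's code: `split_soa` and `should_halt` are hypothetical names, and the ceiling is computed from the 85 Hz refresh rate rather than hard-coded.

```python
REFRESH_MS = 1000.0 / 85.0                 # one screen refresh at 85 Hz
HALT_CEILING_MS = 1000.0 - 2 * REFRESH_MS  # one second minus two refreshes


def split_soa(soa_ms, max_exposure_ms=140.0):
    """Split a requested SOA into (exposure, ISI), both in ms.

    The stimulus is shown for at most max_exposure_ms; any remaining
    time is filled by a blank inter-stimulus interval.
    """
    exposure = min(soa_ms, max_exposure_ms)
    return exposure, soa_ms - exposure


def should_halt(requested_soas_ms, ceiling_ms=HALT_CEILING_MS, run_length=10):
    """True if the requested SOA exceeded ceiling_ms on run_length
    consecutive trials anywhere in the sequence."""
    run = 0
    for soa in requested_soas_ms:
        run = run + 1 if soa > ceiling_ms else 0
        if run >= run_length:
            return True
    return False


print(split_soa(600.0))            # the 600-ms example from the text
print(round(HALT_CEILING_MS, 1))   # about 976.5 ms
```

A single trial at or below the ceiling resets the run, so only ten requests in a row above it stop the staircase.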