A fundamental constraint on performance in any visual task is the information available at a glance. Searching for a person in a natural scene will be more difficult if we cannot detect them at a glance. Being able to at least extract target-relevant features at a glance will speed up search; otherwise, we may have to execute a number of eye movements to find the target. Our ability to quickly get the gist of a scene can likewise aid search if it provides layout information, identifying likely locations to find a person.
In the context of navigating a web page (e.g., see Fig. 1), much of the pertinent information clearly comes in the form of text that may not be readable at a glance. However, if a user can quickly determine that a web page is a blog, and thus, say, an unreliable source of information about drug interactions, she can quickly navigate to another page. (Of course, in viewing a page of search results, a savvy user might also realize that a given link points to a blog based on its URL. However, this cue might not always be so readily available.) To the extent that a user can determine the layout of the page at a glance, she can easily direct her attention to a paragraph of interest, click on a button relevant to her goals, or look through the menu for more choices. At the other extreme, if the user cannot get much information out of a page at a glance, she may be forced to read a significant amount of the text, or otherwise scan the page, a slow and perhaps frustrating process. The user may decide simply to navigate to a new page that is more easily comprehensible, an undesirable outcome from the point of view of the page designer, the owner of the page, and any companies with ads on that page. Understanding the information available at a glance constrains models of perception, visual representation, and attention, and informs our understanding of usability and design.
A considerable amount of information about a stimulus, such as a scene or display, is available in a single fixation. This summary information has been termed the "gist." Colloquially, the gist is "the sentence one uses to describe a stimulus." This is often operationalized as "the perceived contents of a scene given a certain amount of viewing time" (Fei-Fei, Iyer, Koch, & Perona, 2007), typically a single fixation (Fei-Fei et al., 2007; Oliva, 2005). We take the gist to mean the information available at a glance, i.e., in a single fixation. Such a fixation can last between 100 and 300 milliseconds (ms) (Harris, Hainline, Abramov, Lemerise, & Camenzuli, 1988; Pieters & Wedel, 2012; Wedel & Pieters, 2000), while typical fixations fall in the range of 200–250 ms (Rayner & Castelhano, 2007).
At a glance, participants can identify the category of a natural scene (e.g., beach vs. forest, indoor vs. outdoor, parking lot vs. downtown) (Ehinger & Rosenholtz, 2016; Greene & Oliva, 2009b; Joubert, Rousselet, Fize, & Fabre-Thorpe, 2007; Rousselet, Joubert, & Fabre-Thorpe, 2005), and how much room there is to navigate (Greene & Oliva, 2009a). They can determine whether a given object is present, such as an animal (Kirchner & Thorpe, 2006; Li, VanRullen, Koch, & Perona, 2002; Thorpe, Fize, & Marlot, 1996), a vehicle (VanRullen & Thorpe, 2001), or a human face (Crouzet, Kirchner, & Thorpe, 2010). They can reliably distinguish between cities (e.g., Paris vs. Los Angeles) and tell what kind of intersection lies ahead (Ehinger & Rosenholtz, 2016). Furthermore, experiments in which participants freely report what they perceived in the scene, as opposed to merely carrying out a pre-defined task, have revealed rich perception of low- and mid-level properties, such as the colors and textures present (Fei-Fei et al., 2007).
In addition to the extensive research on natural scenes, much of vision research has (effectively) studied vision at a glance using artificial, psychophysics-style stimuli (e.g., Gabors, simple 2D/3D shapes, synthetic textures). Many experiments studying basic visual abilities use short display times, typically only long enough for a single fixation. This includes studies of texture segmentation (Julesz, 1981; Rosenholtz & Wagemans, 2014; Treisman, 1985), popout search (Treisman & Sato, 1990), crowding (Levi, 2008), ensemble/set perception (Whitney, Haberman, & Sweeny, 2014), numerosity judgments (Feigenson, Dehaene, & Spelke, 2004), dual-task performance (VanRullen, Reddy, & Koch, 2004), iconic memory (Sperling, 1960), and perceptual organization in general (Wagemans, 2015). Experimenters use short display times not only to explicitly study at-a-glance perception, but also to study preattentive processing or to avoid complicating factors such as fixation location. Human vision research, however, rarely extends this work to information visualizations, computer displays, and user interfaces, all of which have scene-like qualities and are practically relevant, despite being artificially designed. The goal of our research is to bridge this gap between natural and artificial stimuli by studying at-a-glance perception of web pages.
Research on human vision arguably suggests that perception of artificial stimuli is poorer than that of natural scenes. Synthetic stimuli and tasks appear to be more affected by attentional limitations than natural stimuli and tasks (Li et al., 2002). Researchers have suggested several explanations for this apparent difference. Our visual systems evolved to process natural stimuli (Geisler, 2008). There appear to be brain areas devoted to processing stimuli like natural scenes (Epstein & Kanwisher, 1998); this specificity of neural organization may provide an advantage in processing those natural stimuli (VanRullen et al., 2004). In addition, web pages are often quite text-heavy; much of this text is unlikely to be readable at a glance, perhaps further impairing the ability to classify a web page at a glance. One clearly cannot generalize from extracting the gist of a natural scene, or of psychophysics-style stimuli, to the gist of diversely designed artifacts such as web pages. Given their novelty and pervasiveness, web pages are "real-world" stimuli that require rigorous psychophysical investigation.
Beyond contributing to theories of human vision, understanding web page gist is relevant for design and usability. We can learn from web pages that are easy to comprehend at a glance, in order to design pages that give users quick access to relevant information. For this reason, researchers in the HCI (Human-Computer Interaction) community have begun to study perception of web pages at a glance. However, to our knowledge, all of these studies involved subjective judgments, e.g., "is this web page aesthetically pleasing," or "does this web page appear to have high or low usability?" Researchers have found that participants form subjective impressions of the appeal of a web page in the first 50 ms of viewing, and respond consistently when shown the same stimulus later (Lindgaard, Fernandes, Dudek, & Brown, 2006). Furthermore, first impressions of visual appeal based on short (50 ms) exposures correlate well with judgments based on longer viewing times (500 ms, and up to 10 s) (Tractinsky, Cokhavi, Kirschenbaum, & Sharfi, 2006). Users also make consistent subjective ratings of the trustworthiness and perceived usability of web pages after only 50 ms of viewing (Lindgaard, Dudek, Sen, Sumegi, & Noonan, 2011). Inspired by human vision research suggesting that low spatial frequencies suffice to communicate the layout of a natural scene (see Oliva & Torralba, 2006), Thielsch and Hirschfeld (2010, 2012) found high correlations between judgments of aesthetics made on low-pass filtered web page screenshots and those made on the original screenshots. Perceived usability, on the other hand, correlated better with judgments based on high-pass filtered stimuli. Of course, just because observers can consistently make certain subjective judgments at a glance does not imply that they can perform the tasks of interest in this paper. Instead of studying subjective judgments, we ask observers to perform objective tasks with web pages at a glance.
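To make the filtering manipulation concrete: a low-pass version of a screenshot preserves coarse layout (large blobs of light and dark), while a high-pass version preserves edges and fine detail such as text strokes. The sketch below is our illustration of one standard way to produce such stimuli with a Gaussian filter, not the procedure of the cited studies; the file name and cutoff (sigma) are illustrative assumptions.

```python
# Minimal sketch of producing low- and high-pass versions of a web page
# screenshot. Illustration only, not the cited studies' procedure;
# "screenshot.png" and sigma are placeholder assumptions.
import numpy as np
from PIL import Image
from scipy import ndimage

img = np.asarray(Image.open("screenshot.png").convert("L"), dtype=float)

sigma = 8.0  # illustrative cutoff: larger sigma removes finer detail
low_pass = ndimage.gaussian_filter(img, sigma)  # coarse layout only
high_pass = img - low_pass                      # edges and fine detail

# Save the results; the high-pass residual is re-centered on mid-gray
# so that both positive and negative values are visible.
Image.fromarray(np.clip(low_pass, 0, 255).astype(np.uint8)).save("low_pass.png")
Image.fromarray(np.clip(high_pass + 128, 0, 255).astype(np.uint8)).save("high_pass.png")
```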
We perform several experiments to investigate what can be seen on a web page in a single 120 ms fixation. Display times of this magnitude are typical for similar studies with natural scenes (Fei-Fei et al., 2007). In Experiment 1, we ask whether observers can rapidly ascertain the category of a web page. This is a new question in the human vision literature. Common wisdom in HCI suggests that a user cannot acquire much semantic information, such as the category of a web page or the meaning of any text, in a presentation time of less than 500 ms (e.g., Lindgaard et al., 2006). However, researchers have not actually tested this hypothesis.
In Experiment 2, we ask whether ads are detectable at a glance. This is an object detection task like the animal/no-animal task in scene perception studies (Kirchner & Thorpe, 2006; Li et al., 2002; Thorpe et al., 1996). However, since detection depends greatly both on the signal to be detected and on the background against which it appears, one cannot infer from easy animal detection that ads will be easy to detect. In particular, designers may use different strategies for ad design. Some designs aim for ads to have a salient appearance, visually distinct from the rest of the web page, while others deliberately disguise the ad, effectively creating a camouflaged object. In previous work, Pieters and Wedel (2012) showed that observers can distinguish between ads and editorial articles in magazines with high accuracy (up to 85% on average) in only 100 ms. Furthermore, observers could discriminate between types of ads (i.e., for cars, financial services, food, or skincare products) at rates of 95% correct for "typical" ads and 53% correct for "atypical" ads. That study differs from the present one in several ways. We study ads embedded in web pages, as opposed to isolated full-page magazine ads. Our task is to detect these embedded ads, rather than to categorize them. In addition, the ad style in magazines tends to be quite different from that in web pages.
Finally, in Experiment 3, we ask how well a user can locate the menu bar. A menu bar is essentially defined by the horizontal or vertical alignment of its elements; menu items form either a row or a column, respectively (Fig. 10). In addition, many menu bars contain menu items with similar colors and other features, and/or those items may be contained within a rectangular box. As a result, one can think of menu localization as, in the first instance, a question of what perceptual organization (alignment, similarity, and/or containment) one can perceive in a web page at a glance. Considerable work has demonstrated that observers can perform perceptual organization tasks in brief presentations (van der Helm, 2014). However, much of this work uses fairly simple and homogeneous displays, leaving open the question of what observers can perceive in web pages at a glance. Perhaps more relevant is work suggesting that observers can estimate the 3D layout of a natural scene at a glance (Greene & Oliva, 2009a, 2009b), though clearly both the task and the stimuli differ greatly from detecting a web page menu.
Our particular set of tasks parallels tasks in the scene gist literature. We have a semantic task (categorization), similar to scene categorization (Biederman, 1981); an object detection task (ad detection), similar to object detection in scenes (Thorpe et al., 1996); and a layout-related task (menu localization), similar to 3D layout estimation (Oliva & Torralba, 2001). Can observers assess these mid- to high-level properties in web pages, as they can in natural scenes?