Experiments 1 and 2 show an advantage for hat graphs over bar graphs: people were faster to identify the advertisement that produced the biggest increase in performance. Speed can be indicative of the ease with which a graph is processed, but there are other relevant components of graph comprehension, including accuracy and bias.
One motivation for hat graphs as a replacement for bar graphs is that hat graphs are not restricted to a y-axis range that starts at zero. For bar graphs, the top of the bar and the length of the bar both signify the value for that condition. Consequently, bar graphs must start at zero to avoid a conflict between the two indicators that could produce misleading impressions (Pandey et al., 2015; Pennington & Tuttle, 2009). This restriction does not apply to hat graphs because there is only one indicator for the condition’s value. Thus, one benefit of hat graphs is the ability to control the range of the y-axis to better convey the magnitude of the effect: small effects should look small, and big effects should look big. One way to achieve this is to set the range of the y-axis to 1.5 SDs (Witt, in press). Experiment 3 tested whether this theoretical advantage translates into empirical benefits: hat graphs were plotted with the 1.5 SD range, whereas bar graphs were plotted with the y-axis starting at zero.
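As an illustration (not taken from the original article), the 1.5 SD rule can be computed in R, the environment referenced below; the example values mirror the stimuli of this experiment:

```r
# Hypothetical sketch of the 1.5 SD rule for a y-axis range (Witt, in press):
# center the axis on the grand mean and span 1.5 standard deviations.
# Example values: group means of 60 and 65, SD = 10.
means <- c(massed = 60, spaced = 65)
sd_sim <- 10
grand_mean <- mean(means)                      # 62.5
ylim <- grand_mean + c(-0.75, 0.75) * sd_sim   # total span = 1.5 * SD
ylim                                           # 55 70
```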
Method
Participants
Twenty-three participants completed the experiment in exchange for course credit. The first 13 completed the experiment with the two graph types presented in separate blocks; the last 10 completed it with the two graph types intermixed, because the intermixed presentation is more akin to what people are likely to experience.
Stimuli and apparatus
The experimental set-up was the same as in experiments 1 and 2. The stimuli were 80 unique graphs: 40 hat graphs and 40 bar graphs. Each graph depicted two means based on simulated data. The data were simulated to mimic scores on a memory test from 0 to 100 after engaging in one of two study styles. Massed refers to studying everything at once, as in cramming just before the exam. Spaced refers to dispersing study sessions across time. The simulated mean for the massed study style was 60. The simulated mean for the spaced study style was set to one of four values (60, 63, 65, or 68). The simulated SD was 10 for both groups, so the differences between the two groups correspond to four effect sizes as measured with Cohen’s d (0, 0.3, 0.5, and 0.8). These values coincide with the naming conventions for a null, small, medium, and big effect. For each effect size, means for both groups were simulated based on a sample size of 100 per group, and each simulation was run 10 times for a total of 40 unique data sets. The realized effect size for each data set was compared with the intended effect size, and the data set was discarded and replaced if the two differed by more than 0.05 SDs. One bar graph and one hat graph were created for each data set, with error bars corresponding to 95% confidence intervals (see Fig. 11). For the bar graph, the y-axis started at 0 and extended 4% beyond the top of the range necessary to see both error bars (as is the default in R). For the hat graph, the restriction of a baseline at zero does not apply, so the range of the y-axis was set to 1.5 SDs based on the recommendations of Witt (in press): from the grand mean of both groups minus 7.5 to the grand mean plus 7.5, for a total range of 15, which is 1.5 times the simulated SD of 10.
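The simulation procedure can be sketched in R as follows. This is a hedged reconstruction, not the original code; the function name and the resample-until-accepted loop are illustrative assumptions consistent with the description above:

```r
# Illustrative reconstruction of the data simulation (not the original code).
# Draw two groups of n = 100, compute the realized Cohen's d, and resample
# until it falls within 0.05 SDs of the intended effect size.
simulate_dataset <- function(target_d, n = 100, sd = 10, base_mean = 60) {
  repeat {
    massed <- rnorm(n, mean = base_mean, sd = sd)
    spaced <- rnorm(n, mean = base_mean + target_d * sd, sd = sd)
    pooled_sd <- sqrt((var(massed) + var(spaced)) / 2)
    realized_d <- (mean(spaced) - mean(massed)) / pooled_sd
    if (abs(realized_d - target_d) <= 0.05) {
      return(data.frame(massed = mean(massed), spaced = mean(spaced)))
    }
  }
}

set.seed(1)
targets <- c(0, 0.3, 0.5, 0.8)   # null, small, medium, big
# 10 simulations per effect size = 40 unique data sets
datasets <- do.call(rbind, lapply(rep(targets, each = 10), simulate_dataset))
```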
Procedure
Initial instructions explained the two study styles and that participants would make a judgment about whether study style affected final test performance. They were to judge whether study style had no effect, a small effect, a medium effect, or a big effect, and press the corresponding number (1–4, respectively). They were then given an overview of each type of graph showing what corresponded to the mean for each group and what the error bars signified. The two graph types were presented in different blocks for the first group of participants (n = 13), so instructions for each graph type preceded the corresponding block. For the second group, the graph types were intermixed within block, so instructions for both graph types were presented at the beginning.
Each trial began with a blank screen with a fixation cross at the middle for 500 ms. Then the graph appeared and remained until participants estimated the magnitude of the effect depicted in the graph. There was no time limit, and no feedback was given. Responses were followed by a blank screen presented for 500 ms. For participants in the blocked condition, each block consisted of all 40 unique graphs of one type, in randomized order. Participants completed two blocks with one graph type, then two blocks with the other graph type, for a total of 160 trials. For participants in the intermixed condition, each block consisted of all 80 unique images (40 of each graph type, in randomized order), and participants completed three blocks for a total of 240 trials, as there was sufficient time to complete all trials within the 30-min session.
Results and discussion
The data were analyzed using a linear mixed model with Satterthwaite’s approximation for degrees of freedom. The dependent variable was estimated effect size, centered from the original 1–4 response scale by subtracting 3. One independent factor was depicted effect size. Although four effect sizes were used in the experiment, following Witt (in press), only the non-null effects were included in the analysis: sensitivity differs when discriminating a null from a non-null effect versus discriminating among non-null effects, and it is the latter that is typically of greater interest. The three depicted effect sizes were converted to the same scale as the response and centered by subtracting 3 (−1, 0, and 1 for 0.3, 0.5, and 0.8, respectively). The other independent factors were graph type (bar graph vs. hat graph) and block type (blocked vs. intermixed). Depicted effect size and graph type were within-subject factors, and block type was a between-subject factor. Random effects for participant included intercepts, main effects for each within-subject factor, and their interaction. These random coefficients were initially examined for outliers; one participant showed no sensitivity to effect size in either condition and was excluded. The model was then re-run but did not converge, so the interaction term was removed from the random-effects structure to achieve convergence.
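A minimal sketch of this model in R, assuming lme4/lmerTest and a data frame `d` with placeholder column names (none of which come from the original analysis script):

```r
# Hypothetical sketch of the reported mixed model. lmerTest supplies
# Satterthwaite degrees of freedom for the t tests. Placeholder columns:
#   est_c      = estimated effect size, centered (response 1-4 minus 3)
#   depicted_c = depicted effect size on the response scale, centered
#   graph      = bar vs. hat (within subject)
#   block      = blocked vs. intermixed (between subject)
library(lmerTest)

# Full random-effects structure, as initially specified:
m_full <- lmer(est_c ~ depicted_c * graph * block +
                 (1 + depicted_c * graph | participant), data = d)

# The reported model drops the random interaction term to achieve convergence:
m <- lmer(est_c ~ depicted_c * graph * block +
            (1 + depicted_c + graph | participant), data = d)
summary(m)  # Satterthwaite-based t tests and p values
```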
With this experimental design, sensitivity is measured as the slope: a slope of 1 indicates perfect sensitivity, and a slope of 0 indicates no sensitivity. Sensitivity was 0.52 (SE = 0.06) for the bar graphs and 0.70 (SE = 0.04) for the hat graphs. This difference in sensitivity of 0.18 corresponds to a 35% improvement for hat graphs with the standardized axes over bar graphs with the baseline at zero, d = 0.28, t = 6.28, p < .001, estimate = 0.18, SE = 0.03 (see Fig. 12). Separate linear models were also run for each participant and each graph type, and the slopes were extracted as the measure of sensitivity. Of the 22 participants, 21 showed higher sensitivity with the hat graph than with the bar graph (see Fig. 13).
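The per-participant analysis can be sketched as separate ordinary regressions, again with the placeholder names from the previous sketch:

```r
# Hypothetical sketch: one lm() per participant per graph type; the slope on
# depicted effect size is that participant's sensitivity (1 = perfect, 0 = none).
cells <- expand.grid(participant = unique(d$participant),
                     graph = c("bar", "hat"),
                     stringsAsFactors = FALSE)
cells$slope <- mapply(function(p, g) {
  sub <- d[d$participant == p & d$graph == g, ]
  coef(lm(est_c ~ depicted_c, data = sub))["depicted_c"]
}, cells$participant, cells$graph)

# Compare each participant's sensitivity across the two graph types:
with(cells, tapply(slope, list(participant, graph), mean))
```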
Sensitivity was 0.22 higher when the graph types were blocked than when they were intermixed, d = 0.33, t = 2.46, p = .022, estimate = 0.22, SE = 0.09. This makes sense given that blocked participants did not have to rapidly switch between graph types when interpreting effect size. The interaction of block type with graph type on sensitivity was negligible, d = 0, t = 0.01, p = .99 (see Fig. 14). In both block types, sensitivity was greater with the hat graphs than with the bar graphs (blocked: d = 0.30, t = 6.20, p < .001, estimate = 0.18, SE = 0.03; intermixed: d = 0.26, t = 6.49, p < .001, estimate = 0.19, SE = 0.03).
In addition to sensitivity, the data also reveal biases associated with the two graph types. Bias was measured by the intercepts. Graph type had a medium effect on bias, d = 0.61, t = 4.33, p < .001, estimate = 0.40, SE = 0.09. Separate linear mixed models were run for each graph type to assess its bias, and the intercepts were transformed into percent overestimation scores. When reading bar graphs, participants underestimated the depicted effect size by 16%, d = 0.68, t = −4.73, p < .001, estimate = −0.74, SE = 0.10. When reading hat graphs, the bias was a negligible 2% underestimation, d = 0.11, t = −0.79, p = .44, estimate = −0.07, SE = 0.09. Thus, hat graphs not only improve sensitivity to effect size but also reduce bias in estimating it. The difference stems from the y-axis: bar graphs must start at zero, which biases effects to appear smaller, whereas hat graphs are free to start the y-axis at another value. Figure 15 shows differences in bias for each participant.
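A sketch of the per-graph-type bias models, continuing with the same placeholder names; the transformation of the intercepts into percent overestimation scores is not described in enough detail to reproduce here:

```r
# Hypothetical sketch: bias assessed as the fixed intercept of a separate
# mixed model per graph type; negative intercepts indicate underestimation.
# Assumes lmerTest is loaded and d is the data frame from the earlier sketch.
m_bar <- lmer(est_c ~ depicted_c + (1 + depicted_c | participant),
              data = subset(d, graph == "bar"))
m_hat <- lmer(est_c ~ depicted_c + (1 + depicted_c | participant),
              data = subset(d, graph == "hat"))
fixef(m_bar)["(Intercept)"]
fixef(m_hat)["(Intercept)"]
```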
It is also worth noting that error bars were included in both the hat graph and bar graph stimuli in this experiment. That the hat graphs still improved sensitivity and decreased bias despite the presence of error bars further supports the claim of their advantage, at least when bar graphs are forced to have a baseline at zero. One of the advantages of hat graphs is that they do not require a baseline at zero, whereas the rule for bar graphs is a baseline starting at zero. When each graph type follows its respective rule, hat graphs lead to increased sensitivity to the size of the depicted effect and to reduced bias.