To summarize, when our participants were offered two options, either restudy or taking a test, they chose restudying on the majority of trials. When they were allowed to request hints during test trials, however, they preferred testing over restudy by a sizeable margin. In other words, difficult test trials were like black coffee (revolting) while test trials with hints were like coffee with milk and sweetener (heavenly). In terms of learning, we found that participants learned more from any kind of test trials than they did from restudy trials, but learning was not affected by hints.
These results suggest that making retrieval easier by giving hints might be an effective way to increase learning. The point was not to change the learning efficiency of retrieval or restudy trials; it was to make people prefer to study in a more efficient way. Retrieval is considered a desirable difficulty because it makes learning more difficult in the short term but enhances learning in the long term (e.g., Bjork, 1994; Bjork & Bjork, 2011). Unfortunately, desirable difficulties are not always desirable to the learner, because learners typically, but incorrectly, assume that poor short-term performance is equivalent to poor learning (for reviews, see Bjork, Dunlosky, & Kornell, 2013; Soderstrom & Bjork, 2015). Retrieval with hints appears to be a rare case of ‘desirable easiness’: like a desirable difficulty, it produced the long-term learning benefits of retrieval, but unlike a desirable difficulty, the short-term easiness brought on by the hints made learners find retrieval desirable as well.
Practical recommendations
Based on these findings, we recommend giving students the option to get hints when they are testing themselves. Doing so will make them choose testing more often, which should increase their learning, and it will also make learning more fun, which might increase their motivation to study. We envision instructors making more use of hints in worksheets, questions at the end of textbook chapters, flashcards, and a variety of digital study aids such as Quizlet. Students themselves might also benefit from finding ways to give themselves hints as they test themselves.
These recommendations need to be qualified in more than one way, however. First, hints are probably most important when learners would otherwise fail to answer most of the test questions. Thus, hints might be most useful when learners are just starting to learn the material or when the material is very difficult. When learners can get the answers right without hints, they will probably choose test trials whether or not hints are available, so the hints might not hurt, but they might not help either.
On a related note, there is a danger in making hints easy in the wrong way. In Experiment 4, it was possible to guess the answer based on the hint, even without remembering having studied the word pairs. In this case, participants tended to learn less from hint trials than from test trials without hints. In short, hints that make the answer guessable (e.g., king-q__en) could impair learning (compared to test trials) and may need to be avoided. For this reason, further research is needed to verify that hints have the same effects with authentic educational materials as they do with word pairs.
Another important issue is that of the retention interval. In the present experiments, final test performance was assessed after a brief delay (2 min). Research has shown that testing is particularly effective across longer retention intervals, at least when feedback is not given (see Toppino & Cohen, 2009). However, this finding is not relevant here because feedback was given in the present study. When feedback is given, the benefits of tests (compared to restudy) do not increase as the retention interval increases (Kornell, Bjork, & Garcia, 2011). Thus, the effects shown in this paper would likely have been similar with longer retention intervals, although only further research can confirm this.
Finally, we gave our participants the option to use hints, but we did not force this option on them. We do not know whether this choice affected the results, since we never tried forcing the use of hints, but we suspect that participants enjoy learning more when they get to choose how they want to study. Furthermore, Tullis, Fiechter, and Benjamin (2018) showed that although tests are more effective than presentation trials overall, forcing participants to take a test does not enhance learning when they do not want to be tested.
A tale of two approaches to behavior modification
Here, we will highlight differences between two ways of getting students to test themselves: telling students that testing is a valuable way to learn, which we chose not to do, versus making students want to test themselves, as we did here. A great deal of ink has been spilled in the campaign to educate students and teachers about the benefits of tests. There are high-profile research articles (e.g., Dunlosky, Rawson, Marsh, Nathan, & Willingham, 2013; Pashler et al., 2007), newspaper articles (e.g., Carey, 2010), and books (Boser, 2017; Brown, Roediger, & McDaniel, 2014; Carey, 2015; Rhodes, Cleary, & Delosh, 2020). These are all undoubtedly good things. One big advantage of this approach—that is, spreading the word about the benefits of retrieval—is reach. Books and articles can reach a lot of people. Students can then, hopefully, use their newfound knowledge to make better use of their own study time, by testing themselves and so forth.
However, making students want to test themselves, rather than telling them that they should test themselves, has two advantages. First, a student’s beliefs do not always match their decisions; sometimes students choose to study in ways that they do not think are the most efficient (Kornell & Son, 2009). In other words, changing a student’s beliefs about the benefits of testing might not change how they choose to study.
There is a related problem with telling people that testing is good for them: we argue that students already think testing is good for them. We hypothesize that the reason they choose restudy, instead of testing, is not that they think testing is bad. Rather, it is that they are trying to avoid failure.
Although we did not directly measure whether avoidance of failure is what caused our participants to use hints, our data are consistent with this interpretation. First, self-testing was the popular choice only when hints were available (i.e., when the likelihood of getting the answer correct increased). Second, we analyzed initial test performance across experiments. If participants wanted to avoid failure, and chose to test themselves on items on which they expected to get the answer right, then they should show higher initial test performance when they were allowed to choose which items to test themselves on than when they were forced to self-test (as they were in Experiments 3a, 3b, and 4). The data support this hypothesis. For instance, in Experiment 2 (self-test by choice), initial test performance in the zero-letter condition was around 24%, but it was only 12% and 10% in Experiments 3a and 3b, respectively. Additionally, two-letter performance in Experiment 1 was approximately 48%, compared to 23% and 16% in Experiments 3a and 3b, respectively. These numbers are consistent with the notion that learners choose to test themselves when the odds of being correct are higher, although these comparisons should be interpreted with caution because they are cross-experimental, which is not ideal.
Future research can investigate situational factors that affect test avoidance. For example, people might be affected by the stakes of the situation. Someone who avoids self-testing when the stakes are high (e.g., a medical student treating a patient) might be comfortable failing during a low-stakes quiz (e.g., a medical student at a pub quiz). Researchers have argued that medical students should employ retrieval practice as a study strategy (Larsen, Butler, & Roediger III, 2008), but these students may be reluctant to do so due to fear of failure. It has been shown that low-stakes quizzing can reduce test anxiety (Agarwal, D’Antonio, Roediger III, McDermott, & McDaniel, 2014), but that benefit is irrelevant if students avoid self-testing altogether. Interestingly, testing with hints may reduce test anxiety by alleviating the fear of failure associated with testing, and could be a useful option for those suffering from test anxiety. In summary, avoidance of failure can plausibly explain why students have been shown to prefer restudy in some situations.
If failure is what students are trying to avoid, then the best way to make them do more self-testing might be to convince them that they should embrace failure. Telling students something they already know—that they should test themselves sometimes—will not have much impact on learning.
Second, and more important, telling people what is best for them does not necessarily change their behavior for very long. Saying you should test yourself without making it fun is like saying you should eat your spinach without making it taste good. For a student, it will probably mean self-testing requires willpower and self-control. Because it is difficult to maintain self-control in the long term (e.g., Hoch & Loewenstein, 1991; Mischel, Shoda, & Rodriguez, 1989), we tried to remove self-control from the equation. We hoped our learners would test themselves not because it was the right thing to do, but because they wanted to. In other words, we tried to use hints to make the spinach taste good.
Theoretical implications
We think these results have three sets of theoretical implications: people like to be tested; any test trial that meets two simple criteria will be equally effective; and retrieval effort might not affect learning. We will discuss each of these in turn.
First, it is often claimed that, when people study, they prefer restudy over self-testing (Geller et al., 2018; Hartwig & Dunlosky, 2012; Karpicke et al., 2009; Karpicke & Roediger, 2008; Kornell & Bjork, 2007; Kornell & Son, 2009; Morehead et al., 2016; Wissman et al., 2012). We think this idea has been painted with too broad a brush. General statements about whether people appreciate the value of testing, or want to be tested, or choose testing, are bound to be inaccurate because people’s preferences for restudy versus testing depend on the circumstances. For example, one must surely consider the kind of material being studied (e.g., people probably like to test themselves more when studying vocabulary than when reading a novel), but that is beyond the scope of this article. The circumstance we focused on was the person’s chance of getting the answer correct.
When we looked more specifically, our data suggested that, at least with simple word pairs, testing was more popular than restudy. That is, most people’s favorite option was to take a test that they could get right; this option was more popular than a test they could not get right or a restudy trial. Consistent with previous research, restudy trials were more popular than tests that the participants could not get right, but this comparison leaves out the participants’ favorite option. Therefore, we disagree with the idea that people underestimate the value of testing, or avoid testing themselves when they study. They do dislike and avoid something, but it is being wrong, not taking tests.
The second main theoretical implication of our results has to do with the finding from Experiments 3a and 3b that hints did not diminish the benefits of retrieval. This finding fits with Kornell and Vaughn’s (2016) two-stage model of learning from retrieval. The stages in this model are a legitimate retrieval attempt (stage 1) followed by exposure to the correct answer (stage 2). The model predicts that the full benefit of retrieval will be obtained any time these two conditions are met. In other words, if retrieval would provide a 10 percentage point boost in learning for a given word pair at a given time, then those 10 points will be obtained if there is a legitimate retrieval attempt followed by a chance to fully process the correct answer, regardless of other factors that might be at play. Previous research has supported this claim. One study showed that whether the retrieval attempt was successful or not did not affect learning (Kornell, Klein, & Rawson, 2015). Another showed that the amount of time one spends trying to retrieve an answer (i.e., the duration of stage 1) did not affect learning (Vaughn, Hausman, & Kornell, 2017).
Experiments 3 and 4 provided crucial additional support for this model by showing that hints, as long as they are not guessable, also did not affect learning. In Experiment 3, both stage 1 and stage 2 occurred in all of the retrieval conditions and, as predicted, the full benefit of retrieval was obtained regardless of whether there was no hint, a two-letter hint, or a four-letter hint. In Experiment 4, a legitimate retrieval attempt was not required, because participants could guess the answer based on semantic knowledge, so stage 1 did not always occur under the hint conditions. As predicted, the full benefit of retrieval was not obtained under the hint conditions. In short, Experiments 3 and 4 support the two-stage model by adding to the list of factors (retrieval success, retrieval duration, and now retrieval difficulty) that do not affect how much one learns from retrieval.
The third theoretical implication of these results has to do with the retrieval effort hypothesis (Pyc & Rawson, 2009; also see Bjork & Allen, 1970). According to this hypothesis, retrieval effort leads to learning, such that a difficult, high-effort retrieval produces more learning than does a relatively easy, low-effort retrieval (assuming the retrieval attempt is successful). A study by Pyc and Rawson (2009) supported this hypothesis. Their participants learned Swahili-English word pairs with either short practice lags (e.g., six intervening items) or long practice lags (e.g., 34 intervening items). Longer practice lags increased the amount of effort required during the retrieval attempts in the study phase. As predicted by the retrieval effort hypothesis, participants did better on the final test in the condition with longer practice lags.
Although Pyc and Rawson’s (2009) data are consistent with the retrieval effort hypothesis, there is an alternative explanation of their data. They showed that participants learned more from longer lags than shorter lags. They explained this finding based on retrieval effort, but it can also be seen as a spacing (or “lag”) effect, and a great deal of research has shown that longer lags lead to more learning than shorter lags (for a review, see Cepeda, Pashler, Vul, Wixted, & Rohrer, 2006). Moreover, lag effects can be explained by factors other than retrieval effort. For example, perhaps the benefit of the longer lags in Pyc and Rawson’s data came from a difference in accessibility (which was lower in the longer-lag condition), not from a difference in retrieval effort (Bjork & Bjork, 1992). In other words, retrieval effort was correlated with learning, but it might not have caused learning. There is a third variable in this correlation, memory accessibility, that is known to influence learning, and accessibility and retrieval effort were confounded. Therefore, it is possible that retrieval effort did not affect learning directly but only appeared to: low accessibility in the long-lag condition may have caused both retrieval effort and learning to be high.
To overcome this third variable problem, and truly test whether retrieval effort has a causal impact on learning, accessibility needs to be held constant while retrieval effort is manipulated. The methodology of Experiments 3 and 4 achieved this; hints made retrieval less difficult and less effortful, but they did not affect accessibility. If retrieval effort has a causal effect on learning, then hints should have affected the amount participants learned. No such effect materialized in Experiments 3a or 3b. (This effect did occur in Experiment 4 but, as we have already explained, the effects in Experiment 4 can be explained based on guessability.)
In short, Experiments 3a and 3b may be the strongest test yet of the retrieval effort hypothesis, but this hypothesis was not supported. Thus, perhaps an amended version of the retrieval effort hypothesis is in order—retrieval effort can be positively correlated with learning, but retrieval effort per se might not cause learning. Further research is needed to look at these possibilities.