The range of confidence scales does not affect the relationship between confidence and accuracy in recognition memory

Tekin, Eylul; Roediger, Henry L.

doi:10.1186/s41235-017-0086-z

Original article
Open access
Published: 20 December 2017


Cognitive Research: Principles and Implications volume 2, Article number: 49 (2017)


Abstract

Researchers use a wide range of confidence scales when measuring the relationship between confidence and accuracy in reports from memory, with the highest number usually representing the greatest confidence (e.g., 4-point, 20-point, and 100-point scales). The assumption seems to be that the range of the scale has little bearing on the confidence-accuracy relationship. In two old/new recognition experiments, we directly investigated this assumption using word lists (Experiment 1) and faces (Experiment 2) by employing 4-, 5-, 20-, and 100-point scales. Using confidence-accuracy characteristic (CAC) plots, we asked whether confidence ratings would yield similar CAC plots, indicating comparability in use of the scales. For the comparisons, we divided 100-point and 20-point scales into bins of either four or five and asked, for example, whether confidence ratings of 4, 16–20, and 76–100 would yield similar values. The results show that, for both types of material, the different scales yield similar CAC plots. Notably, when subjects express high confidence, regardless of which scale they use, they are likely to be very accurate (even though they studied 100 words and 50 faces in each list in 2 experiments). The scales seem convertible from one to the other, and choice of scale range probably does not affect research into the relationship between confidence and accuracy. High confidence indicates high accuracy in recognition in the present experiments.

Significance

Confidence ratings are collected routinely in many types of research, including psychophysics and perception, decision making, recognition memory, eyewitness memory, and many metacognition experiments. Outside the laboratory, confidence is measured in settings such as eyewitness identification and surveys for consumer products, among others. A wide variety of confidence scales are used, ranging from simple 2-point scales (sure-unsure) to increasingly fine-grained scales ranging up to 100-point scales (where 100 is the highest confidence and 1 is guessing). Very little evidence exists to answer the question whether certain types of confidence scale are better than other types of scales. We report two recognition memory experiments using words and faces as the study materials, and we show that four scales that varied over a wide range of values (1–4, 1–5, 1–20, and 1–100) are generally comparable in their sensitivity in recognition decisions. This outcome will be reassuring to anyone who uses confidence scales. In addition, we obtained a very strong relationship between confidence and accuracy in our experiments—about as high as in eyewitness experiments—even though we had subjects study many words or faces. As in eyewitness experiments with a single tested face, our studies show that high confidence indicates high accuracy, even in experiments with many events to be remembered.

Background

Psychologists have long wrestled with the issue of how confidence and accuracy of memories are related. The first experiment we can find asking (and answering) this question was published over 100 years ago. Dallenbach (1913) showed “observers” (using the terminology of the day) complex pictures for 1 minute each with instructions to remember them. He tested them 5, 15, or 45 days later. One test involved asking his subjects questions about the pictures; if they provided an answer, he asked them to rate their confidence on a 3-point scale defined verbally as “slightly sure, fairly certain, or absolutely certain.” Dallenbach showed that forgetting occurred over time, which is no surprise, and he also found that confidence of responses was related to their accuracy. He concluded, “The degree of certainty of the observer’s replies bears a direct relation to the fidelity of the answer” (p. 335).

The question posed by Dallenbach in 1913 has been addressed in hundreds of experiments in the intervening century, and the relationship can be examined in many different ways, such as across subjects (Are subjects who are highly confident also highly accurate?), across events or items (Are events that are accurately remembered also confidently remembered?), within individuals (the relationship between confidence and accuracy for different events for the same person), among others (see Roediger, Wixted, & DeSoto, 2012, for a review). Depending on the way the question is posed and the type of analysis used, researchers have obtained every imaginable answer: strong positive correlations between confidence and accuracy, null relationships, and even negative correlations (e.g., DeSoto & Roediger, 2014; Koriat, 2008; Sampaio & Brewer, 2009). Despite the array of findings in the literature, the field is making good progress in understanding confidence-accuracy relationships in memory. Several reviews provide emerging principles that help resolve the confidence-accuracy puzzle (Koriat, 2012; Roediger & DeSoto, 2015; Wixted, Mickes, Clark, Gronlund, & Roediger, 2015; Wixted & Wells, 2017).

The aim of the present experiments was to examine a neglected factor in considering confidence-accuracy relationships: the range of the confidence scale. In reviewing the various literatures on confidence and accuracy, we found that the type of confidence scale used varies tremendously, and rarely does a researcher defend the confidence scale used (and then the defense amounts mostly to a personal preference). Most experiments on confidence-accuracy relationships use some form of recognition test, although, of course, analyses can be applied to recall, as in Dallenbach’s study (1913), in which he used cued recall. In recognition procedures, typically subjects view one or more events and then take a recognition test in which the studied event is mixed with unstudied events. Subjects are asked to pick the previously studied (“old”) item and then rate their confidence. In some procedures, they are also asked to rate their confidence in items they call unstudied (“new”). Confidence scales vary widely. A yes/no or old/new decision by itself amounts to a 2-point scale; for ratings collected after a yes/old judgment, researchers have used 3-point scales (Dallenbach, 1913), 5-point scales (Read, Yuille, & Tollestrup, 1992), 6-point scales (Perfect, 2004), 7-point scales (Brewer & Sampaio, 2012), 9-point scales (Robinson & Johnson, 1996), 20-point scales (Mickes, Hwe, Wais, & Wixted, 2011), and 100-point scales (DeSoto & Roediger, 2014). As noted, the general assumption seems to be that various scales are used in much the same way, because few researchers explain why they used a particular scale or include two or more scales in their research to examine whether their findings are generalizable across scale types. We examined the issue directly in two experiments, and we review the evidence that is already available on the issue of how the type of scale may affect the relationship between confidence and accuracy.

Previous research on decision making in recognition memory addressed whether more decision options led to greater decision noise. Malmberg and Xu (2006) used a 4-point recognition scale (4 points from “sure yes” at 4 to “sure no” at 1) and Benjamin, Diaz, and Wee (2009) manipulated the set size of options in the recognition test by presenting one, two, or four words in each set. They defined accuracy as discriminability and calculated discrimination of targets from lures using ROCs. The researchers in both of these studies concluded that the ROC functions were influenced not just by stimulus noise (as they should be) but also by decision noise; as the number of decision options increased, the recognition measures became less trustworthy.

To directly test this claim, Benjamin, Tullis, and Lee (2013) conducted a recognition experiment with words and manipulated the range of the scale for the recognition decision between subjects. Subjects provided recognition judgments using only two-value (i.e., binary yes/no) or four- or eight-value scales. On the four- and eight-value scales, the lowest value was labeled “sure no,” whereas the highest value was labeled “sure yes.” Benjamin et al. concluded that the more alternatives given, the poorer the performance: “Rating scales with more options led to lower estimates of recognition than did scales with fewer options” (p. 1601) (but see Kellen, Klauer, & Singmann, 2012). However, one important difference between the procedure in this experiment and that in most confidence-accuracy research is that, in the latter research, subjects first make a binary yes/no recognition decision and then rate their confidence in that decision on a scale. Thus, in Benjamin et al.’s (2013) terms, the initial judgment is always on a binary scale. Still, this research does provide a reason to expect that in other settings subjects will not use widely varying confidence scales in the same way.

Other results suggest that scale differences in recognition memory experiments may not matter. In two recognition memory experiments, Mickes, Wixted, and Wais (2007) used 20-point or 99-point rating scales to assess confidence for all items. The idea behind switching from a 20-point scale to a 99-point scale was to see if subjects would use more fine-grained readings at the high end of the 99-point scale. However, for the 99-point scale, the results revealed that “subjects often supplied ratings at intervals of 5 on the scale, which means that, for them, this was effectively a 20-point scale” (p. 863). Even though the comparison of 99-point and 20-point scales was not the main purpose of their study, Mickes et al. (2007) showed that 20- and 99-point scales yielded similar confidence-accuracy distributions. Of course, both these scales are relatively large, and many researchers use narrower scales (e.g., 1–4), so one can wonder if the conclusion would hold over a wider variety of scales.

More directly relevant to our present project, Dodson and Dobolyi (2015) compared nine confidence scales using lineup identifications as recognition tests. They employed verbal and numeric scales (e.g., ranging from 0 to 100 or from “not at all confident” to “completely confident”) and different numbers of points identified on a 100-point scale (e.g., numeric 6 points, 0–100: 0, 20, 40, 60, 80, or 100). They also manipulated whether the 100-point scale started at 0 or 50 (e.g., numeric 6 points, 50–100: 50, 60, 70, 80, 90, or 100) and whether they gave labels only for end points on verbal scales (e.g., using 6 points but only with “not at all confident” and “completely confident” labels on the end points). Thus, for verbal scales, they had 6 points with each point labeled, 11 points with each one labeled, 6 points with only the end points labeled, and 11 points with only the end points labeled. For numeric scales, they had 6 points with 0–100, 6 points with 50–100, 11 points with 0–100, and 11 points with 50–100. They also used a continuous numeric scale ranging from 0 to 100 with a slider, and thus overall they used nine different confidence scales. They analyzed the results derived from these scales using confidence-accuracy calibration measures as well as correlational measures. They showed that the confidence-accuracy relationship was generally the same with all types of scales. Of course, in some sense, all their measures were variations on using a 100-point confidence scale.

The prior research is a bit mixed on the question whether various confidence scales provide the same estimate of the relationship between confidence and accuracy. Our experiments address this same issue, but in a different manner from past research. We compared subjects’ use of 4-, 5-, 20-, and 100-point scales in recognizing words (Experiment 1) and faces (Experiment 2) using confidence-accuracy characteristic (CAC) plots (Mickes, 2015). These plots permit us to ask questions such as, “Is 5 on a 5-point scale equivalent to 17–20 on a 20-point scale and to 81–100 on a 100-point scale in terms of accuracy?” Of course, we can ask this question for all points on the confidence scale (“Is a 2 on a 4-point scale equivalent to 6–10 ratings on a 20-point scale and 26–50 on a 100-point scale?”). One essential difference between the present study and that of Dodson and Dobolyi (2015) is that we used confidence scales over a wide range (4-, 5-, 20-, and 100-point scales) rather than carving up a 100-point scale in different ways. At issue is whether subjects will use these widely different confidence scales in the same way or in different ways. This issue is of practical significance because both researchers and police departments want to use the most sensitive type of scale.

The present experiments addressed three primary questions: First, do different ranges of confidence scales yield similar confidence-accuracy relationships? Second, do the highest points of each scale yield similar accuracy rates? The reasoning behind the second question was that the highest point on confidence scales with more points (i.e., 20- and 100-point scales) may provide higher accuracy than confidence scales with fewer points (i.e., 4- and 5-point scales). Third, what do CAC plots reveal for experiments in which many items are used (100 words in our first experiment and 50 faces in our second experiment)? CAC plots have thus far been employed only in eyewitness identification experiments, which are almost always one-item (one crime and lineup) experiments. CAC plots in these eyewitness experiments show that, on an initial identification from a lineup, high confidence indicates high accuracy (Wixted et al., 2015; Wixted & Wells, 2017). However, this outcome may break down when large numbers of targets are used, owing to interference among items. Thus, the nature of CAC plots in experiments with many words or faces is an empirical issue that the present experiments help to resolve.

Experiment 1

In Experiment 1, subjects sequentially studied two different sets of 100 words and were tested on 200 words (100 targets, 100 lures) after each study phase. The lures were primary associates of the targets to make the tests difficult. After each old/new decision, different groups of subjects gave confidence judgments using a 4-, 5-, 20-, or 100-point scale.

Methods

Subjects

Subjects were 96 Washington University undergraduate students who participated for payment or course credit in groups of one to five. Data from two subjects were lost because of a programming error, and these subjects were replaced. Subjects were randomly assigned to one of the four confidence scales, with 24 subjects in each condition. The study was approved by the Washington University Institutional Review Board.

Design and materials

The experiment used a between-subjects design that manipulated only one variable: the type of confidence scale used on the yes/no recognition test. Four different confidence scales were used, and the recognition tests differed only in terms of the range of the confidence scale. After subjects judged a test item to be old or new, they rated their confidence on a scale of 1–4, 1–5, 1–20, or 1–100, with labels at each end of the scale ranging from “not confident at all” on the low end to “totally confident” on the high end. Thus, four groups of subjects were tested.

Word sets were used as materials for the present experiment. Two hundred associated word pairs (thus 400 words) were selected from among the Nelson, McEvoy, and Schreiber (2004) norms, with each associate being one of the three strongest associates of its target word (e.g., table–chair). The words had concreteness levels greater than 3.5 of 7 according to Nelson et al. (2004). The logarithm of HAL frequency in the English Lexicon Project (Balota et al., 2007) was used to check word frequency, which ranged from 5.98 to 13.67. The two items in each pair were counterbalanced across study and test phases. For example, for half of the subjects, when table was presented during the study phase, chair served as the lure during the test phase; for the other half, chair served as the target and table as the lure. Thus, all 400 words appeared as both targets and lures across subjects. Each study list consisted of 100 words presented in random order (different for each subject), and the recognition test consisted of 200 words (targets and their primary associates), also presented in random order.
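The counterbalancing scheme can be sketched as follows. This is only an illustration under stated assumptions: the function name build_lists, the use of subject parity to split subjects into halves, and the seeding are hypothetical and are not taken from the authors' software. The pair table–chair comes from the text; salt–pepper is a made-up second pair.

```python
import random

def build_lists(pairs, subject_index, seed=0):
    """Return (study_list, test_list) for one subject and one word set.

    pairs: list of (word_a, word_b) associates.
    Assumption: even-indexed subjects study word_a and see word_b as the
    lure; odd-indexed subjects get the reverse assignment.
    """
    rng = random.Random(seed + subject_index)  # a different order per subject
    if subject_index % 2 == 0:
        targets = [a for a, _ in pairs]
        lures = [b for _, b in pairs]
    else:
        targets = [b for _, b in pairs]
        lures = [a for a, _ in pairs]
    study = targets[:]
    rng.shuffle(study)
    # The test mixes every target ("old") with its associate ("new").
    test = [(w, "old") for w in targets] + [(w, "new") for w in lures]
    rng.shuffle(test)
    return study, test

pairs = [("table", "chair"), ("salt", "pepper")]
study_even, test_even = build_lists(pairs, subject_index=0)
study_odd, test_odd = build_lists(pairs, subject_index=1)
```

With 100 pairs per set, this yields a 100-word study list and a 200-word test, and each word serves as a target for one half of the subjects and as a lure for the other half.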

Two filler tasks, a president recognition test (Roediger & DeSoto, 2016) and a survey about the events in Ferguson, Missouri, in 2014, were used in the experiment between study of each list and its test. The filler tasks were counterbalanced across the first and second lists. The tasks are tests used in other research in our laboratory and permit an assessment of undergraduate knowledge of presidents and the events surrounding Michael Brown’s death in Ferguson. These tasks were selected because they should provide general, not specific, interference in remembering lists of words.

Procedure

After subjects were given a consent form that included general information about the experiment, they were told they would be presented with words one at a time and be asked to remember them for a later memory test. The experiment consisted of two halves, and each half had three phases: study of the list, a distractor task, and a recognition test. During the study phase, 100 words were presented in the middle of the computer screen for 2 seconds each, with a 500-millisecond blank screen between words, for an effective study duration of 2.5 seconds. After the study phase, subjects completed one of the 10-minute filler tasks described above. During the recognition phase, 200 words (100 previously studied words and 100 related lures) were presented one at a time to the subjects. For each word, subjects responded whether they had seen the word during the study phase by clicking “old” or “new” on the screen. After making this decision, they were asked to make a confidence judgment about their answer on the given scale. They were informed that the highest point on the scale indicated “totally confident” and the lowest point indicated “not confident at all.”

Subjects rated confidence on 4-point, 5-point, 20-point, and 100-point scales (ranging from 1 to the highest point of the given scale). We selected these scales so that they would be easily converted to one another for comparison. That is, both 20-point and 100-point scales can be divided into four and five bins to be compared with 4-point and 5-point scales. The recognition test was self-paced, and subjects typed in a number (1–4, 1–5, 1–20, or 1–100) to indicate confidence. They were required to make a confidence judgment before moving to the next test item. After completing this procedure for 200 words, subjects took a 5-minute break and then started the second study phase with a different set of 100 words. Other than the new set of material and the alternative filler task, other aspects of the procedure were the same as in the first half of the experiment. After the subjects completed the second round, they were debriefed. The experiment lasted for 60–90 minutes, depending on subjects’ pace of responding.

Results

The top section of Table 1 provides the hit rates, false alarm rates, and d′ for the four different rating scale conditions. To examine whether hit and false alarm rates differed between the first and second phases of the experiment, we conducted two separate 2 (phase 1 vs. phase 2) × 4 (scales) analyses of variance (ANOVAs) for hit and false alarm rates. For both measures, neither the main effect of phase nor the main effect of scale type was reliable: for hits, F(1,92) = .82, BF₀₁ = 6.35, and F(3,92) = 1.06, BF₀₁ > 100, respectively; for false alarms, F(1,92) = 1.53, BF₀₁ = 4.41, and F(3,92) = 1.70, BF₀₁ = 70.60, respectively (ps > .05). Hence, Table 1 presents the data collapsed across the two phases, and we used these combined data for all analyses. For d′ scores, one-way between-subjects ANOVA revealed no main effect of the type of scale: F(3,92) = .60, p = .619, ηₚ² = .02, BF₀₁ > 100.
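The text does not state how d′ was computed. A common choice is the equal-variance signal detection estimate, d′ = z(hit rate) − z(false alarm rate), and the sketch below assumes that definition. The function name d_prime and the rates in the usage line are illustrative; they are not values from Table 1.

```python
from statistics import NormalDist

def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Equal-variance sensitivity: z(hit rate) - z(false alarm rate)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)

# Made-up rates for illustration only.
example = d_prime(0.80, 0.20)
print(round(example, 2))
```

Rates of exactly 0 or 1 have no finite z score (inv_cdf raises StatisticsError), so some correction would be needed in that case; the source does not say whether one was applied.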

Table 1 Hit rates, false alarm rates and sensitivity scores for Experiments 1 and 2


Comparison of hits across confidence scales

For each bin, accuracy is computed by using the following formula: Proportion correct = number of hits/(number of hits + number of false alarms). To investigate the relationship of accuracy across the groups using the four scales, we analyzed the data by converting the 20- and 100-point scales into bins that permitted comparison. We used four bins for the 4-point scale and five bins for the 5-point scale. That is, for comparison with the 4-point scale, we binned data from subjects using the 20-point scale into bins that contained the number of responses made from 1–5, 6–10, 11–15, and 16–20 on the scale. Similarly, for the 100-point scale, we binned the data into bins of 1–25, 26–50, 51–75, and 76–100. We used the same analytic approach for the 20- and 100-point data for comparison with the 5-point scale. With this analysis, for example, we compared accuracy at the 5-point confidence level on a 5-point scale with 81–100 and 17–20 ranges on 100- and 20-point scales, respectively. Subjects used ratings in the lower confidence bins relatively rarely, so fewer observations were obtained in these bins. Therefore, the lowest two confidence bins were combined for further analyses. The number of observations per confidence bin is provided in Appendix 1.
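The binning and the accuracy formula above can be sketched in a few lines. This is a minimal illustration assuming equal-width bins; the function names rating_to_bin and proportion_correct are hypothetical and do not come from the authors' analysis code, and the counts in the last line are made up.

```python
def rating_to_bin(rating: int, scale_max: int, n_bins: int) -> int:
    """Map a rating in 1..scale_max onto equal-width bins numbered 1..n_bins."""
    if not 1 <= rating <= scale_max:
        raise ValueError("rating lies outside the scale")
    if scale_max % n_bins != 0:
        raise ValueError("scale does not divide evenly into bins")
    width = scale_max // n_bins
    return (rating - 1) // width + 1

def proportion_correct(hits: int, false_alarms: int) -> float:
    """Accuracy for "old" responses in one bin: hits / (hits + false alarms)."""
    total = hits + false_alarms
    return hits / total if total else float("nan")

# Top bin when comparing with the 4-point scale: 16-20 and 76-100.
assert rating_to_bin(16, 20, 4) == 4 and rating_to_bin(76, 100, 4) == 4
# Top bin when comparing with the 5-point scale: 17-20 and 81-100.
assert rating_to_bin(17, 20, 5) == 5 and rating_to_bin(81, 100, 5) == 5

print(proportion_correct(hits=45, false_alarms=5))  # made-up counts
```

Integer division keeps the bin edges exact, so a rating of 25 on the 100-point scale falls in bin 1 and a rating of 26 falls in bin 2, matching the 1–25 and 26–50 ranges in the text.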

Figs. 1 and 2 show these comparisons for four confidence bins and five confidence bins, respectively, for hits. As shown in both figures, accuracy increased steadily as a function of confidence, and the scale type did not lead to any difference in the increased accuracy with confidence. In Fig. 1 (left panel), mean accuracy ratios for the bins from 1–2 to 4 were .46, .61, and .83. For the 5-point scale, the corresponding values were .44, .53, .64, and .88 (Fig. 2, left panel). Obviously, if subjects are more confident, they are also more accurate. This outcome occurred despite our making the recognition test difficult by using primary associates as lures.

Two two-way repeated-measures ANOVAs were conducted, with confidence bins serving as the within-subjects factor and type of rating scale as the between-subjects factor. First, for the comparison of 100-, 20-, and 4-point scales, a 3 (confidence bins) × 3 (scales) ANOVA was conducted, which revealed a main effect of confidence bins, F(1.77,122.18) = 147.00, p < .001, ηₚ² = .68, and a main effect of the type of scale, F(2,69) = 3.41, p = .039, ηₚ² = .09, but no interaction, F(3.54,122.18) = .41, p = .778, ηₚ² = .01. The pairwise comparisons with the Šidák correction revealed that, overall, the 4-point group (mean .68, SE .02) showed higher accuracy than the 100-point group (mean .60, SE .02), p = .033. Second, a 4 (confidence bins) × 3 (scales) ANOVA was conducted for comparison of the 100-, 20-, and 5-point scales, revealing a main effect of confidence bins, F(2.27,156.83) = 167.29, p < .001, ηₚ² = .71, but no main effect of type of scale, F(2,69) = 1.87, p = .162, ηₚ² = .05, BF₀₁ = 10.72. The interaction was not reliable, F(4.55,156.83) = .71, p = .601, ηₚ² = .02. The results of Experiment 1 revealed that higher confidence led to higher accuracy. In addition, subjects using the 100-point scale were less accurate than subjects using the 4-point scale. This is interesting because the two groups did not differ in their overall hit and false alarm rates. Moreover, this pattern did not emerge for 5-point comparisons. We examined this issue again in Experiment 2.

Comparison of hits at the most confident point of each scale

The previous analysis showed no consistent pattern for points at the highest range of confidence (i.e., bin 4 or 5, depending on the range of the scales). However, perhaps differences would appear if we had considered only the highest possible point in each scale type, such as the proportion correct for ratings of 4, of 5, of 20, and of 100 for the four different scale types. We hypothesized that accuracy would be highest when subjects could give 100 on a 100-point scale relative to, say, 4 on a 4-point scale, owing to the finer grain of the 100-point scale. Hence, we compared proportion correct for the last points of each scale; thus, for the 100- and 20-point scales, hits arising from only the 100- and 20-point ratings were included in the comparison. The logic behind the comparison was that in wide-range scales, the highest point at the highest end of confidence (e.g., 100 at the 81–100 bin) might yield higher accuracy than the highest point in narrow-range scales (e.g., 4 or 5 points). The number of ratings for the most confident response (4, 5, 20, or 100) sharply decreased from 4-point scales to 100-point scales (see Appendix 2). Still, we can ask if accuracy increased across scales at the most confident point, and the logic above leads to the prediction that accuracy should be higher for subjects using 20- and 100-point scales.

The mean proportions correct for the highest confidence rating were as follows: for ratings of 4 (mean .87, SD .16), of 5 (mean .93, SD .10), of 20 (mean .92, SD .13), and of 100 (mean .94, SD .11). A one-way between-subjects ANOVA was conducted across the four scale conditions and revealed no effect of scale type, F(3,90) = 1.46, p = .230, ηₚ² = .05, BF₀₁ = 97.24, which surprised us, given the much larger numbers of observations in the four and five bins for the coarser-grained scales (4 and 5; see Appendix 2).

Comparison of correct rejections across confidence scales

When subjects correctly rejected an unstudied item by picking “new,” they also made a confidence judgment on this correct response. Thus, we can also assess the relationship between confidence and accuracy for correct rejections using CAC plots. We first examined whether the groups differed from one another in correct rejection rates through one-way between-subjects ANOVA, and no difference was found among groups, F(3,92) = 1.68, p = .173, ηₚ² = .05. Correct rejection rates for the 4-, 5-, 20-, and 100-point confidence scales were .67, .73, .64, and .63, respectively. Then, for each bin, accuracy was computed by using the following formula: proportion correct = number of correct rejections/(number of correct rejections + number of misses). As with analyses of hits, we combined the lowest two confidence bins because of the low number of observations. The number of observations per bin is provided in Appendix 3.
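For "new" responses, the same tally can be run at the trial level, with the two lowest bins merged as described above. The sketch below is only an illustration: the trial format (studied, said_old, confidence bin), the function name, and the sample trials are assumptions, not the authors' data or code.

```python
from collections import defaultdict

def cac_for_new_responses(trials):
    """Proportion correct per confidence bin for "new" decisions.

    trials: iterable of (studied, said_old, conf_bin) tuples, where
    conf_bin is already binned (1..4 or 1..5). Bins 1 and 2 are merged.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for studied, said_old, conf_bin in trials:
        if said_old:
            continue  # only "new" decisions enter this analysis
        label = "1-2" if conf_bin in (1, 2) else str(conf_bin)
        total[label] += 1
        if not studied:
            correct[label] += 1  # correct rejection; otherwise a miss
    return {label: correct[label] / total[label] for label in total}

# Made-up trials for illustration only.
trials = [
    (False, False, 1),  # correct rejection, low confidence
    (True, False, 2),   # miss, low confidence
    (False, False, 4),  # correct rejection, high confidence
    (False, False, 4),  # correct rejection, high confidence
    (True, False, 4),   # miss, high confidence
    (True, True, 4),    # hit, ignored here
]
print(cac_for_new_responses(trials))
```

Bins with no "new" responses are simply absent from the result, which avoids dividing by zero.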

We investigated the relationship between correct rejections and confidence in the same way we investigated the relationship between confidence and hits, dividing 100-point and 20-point scales into bins of five or four. Figs. 3 and 4 (left panels) show that probability of correct rejections increased with increasing confidence, and the scale type did not create much difference in terms of correct rejections. For the comparison of 100-, 20-, and 4-point scales for the data in Fig. 3 (left panel), a 3 (confidence bins) × 3 (scales) ANOVA again revealed a main effect of confidence bins, F(1.51,102.39) = 28.13, p < .001, ηₚ² = .29, but no effect of scale type, F(2,68) = 1.11, p = .337, ηₚ² = .03, BF₀₁ = 22.72, and no interaction, F(3.01,102.39) = 1.50, p = .220, ηₚ² = .04.

For the data in Fig. 4 (left panel), a 4 (confidence bins) × 3 (scales) ANOVA for the comparison of 100-, 20-, and 5-point scales indicated a main effect of confidence bins, F(2.05,129.40) = 37.42, p < .001, ηₚ² = .37, but no main effect of the type of scale, F(2,63) = 2.18, p = .122, ηₚ² = .07, BF₀₁ = 7.25, with no reliable interaction, F(4.11,129.40) = .88, p = .481, ηₚ² = .03.

Comparison of correct rejections at last point of each scale

We compared the accuracy of correct rejections for the last point of each scale as we did with hits. A one-way between-subjects ANOVA was conducted between 100 (mean .82, SD .29), 20 (mean .84, SD .25), 5 (mean .85, SD .19), and 4 (mean .77, SD .25) points, which revealed no main effect of scale type, indicating that accuracy for correct rejections at the highest point did not differ from one another as a function of scale type, F(3,70) = .44, p = .727, ηₚ² = .02, BF₀₁ > 100.

Comparison of confidence-accuracy relationship between hits and correct rejections

A comparison of the data in Figs. 1 and 2 (hits) with data in Figs. 3 and 4 (correct rejections) indicates that the confidence-accuracy relationship appears steeper for hits than for correct rejections. The data are shown in Fig. 5 (left panel) for the 5-point scale, thus collapsing across the data in the left panels of Figs. 2 (hits) and 4 (correct rejections). We conducted a 2 (hits, correct rejections) × 4 (confidence bins) ANOVA and obtained a main effect of level of confidence, F(2.17,136.49) = 153.37, p < .001, ηₚ² = .71, and a reliable interaction, F(2.51,158.18) = 39.51, p < .001, ηₚ² = .39. Overall, the proportion correct for correct rejections (mean .70, SE .02) was higher than the proportion correct for hits (mean .62, SE .01), F(1,63) = 29.44, p < .001, ηₚ² = .32. The post hoc pairwise comparisons, though, revealed that the interaction was driven by a crossover between hits and correct rejections at the highest end of the confidence scales. The proportions of correct rejections were higher than proportions of hits at the first (mean .61, SE .02 vs. mean .44, SE .02), second (mean .69, SE .02 vs. mean .53, SE .02), and third bins (mean .73, SE .02 vs. mean .64, SE .02), ps < .001. Yet, at the fourth (highest) bin, the proportion of hits (mean .88, SE .02) was higher than the proportion of correct rejections (mean .80, SE .03, p < .001). Hence, the confidence-accuracy relationship for hits is indeed steeper than that for correct rejections. The same pattern occurred for the 4-point confidence scale.

Before discussing the results, we attempted to replicate them using faces as the study material with the same basic design as in Experiment 1.

Experiment 2

In this experiment, we switched to faces as the to-be-remembered material, because previous literature suggested that confidence and accuracy might change according to the type of material (Roediger et al., 2012). This might be one reason for the differences observed in the confidence-accuracy relationship between list-learning and eyewitness situations. Thus, we aimed to replicate (or not) the findings from Experiment 1 with faces. Would the various confidence scales be used similarly with faces as they are with words?