Using simulated data, Lampinen (2016) argued that ROCs derived from lineup and showup identification procedures produce different estimates of witnesses’ ability to discriminate guilty from innocent suspects even if their true level of discrimination accuracy does not vary. This point, if correct, would raise questions about ROC-based comparisons of the accuracy of lineup and showup identifications (Gronlund et al., 2012; Mickes, 2015; Wetmore et al., 2015). However, we will show that Lampinen reached an erroneous conclusion because he simulated data using relatively liberal criterion locations and inappropriately applied signal detection equations from a different decision task. Lampinen’s second criticism of ROCs is that they encourage researchers to compare discrimination accuracy at different levels of witness confidence, which “is not a reasonable or scientifically valid way to compare two conditions” (Lampinen, 2016, p. 28). As we will argue, this claim fails to recognize the most valuable contribution of ROCs, which is exactly that they eliminate the need to worry about witness confidence because the same decision accuracy (d′) is reflected at every confidence level.
1. In contrast to Lampinen’s (2016) claim, ROCs provide the best measures of underlying discrimination performance and may be compared across lineups of different lengths (including showups).
Lampinen (2016) argued that witnesses’ ability to distinguish guilty from innocent suspects appeared to differ between showup and lineup procedures when assessed with ROCs, so “pAUC analyses do not provide a valid way of comparing identification procedures” (Lampinen, 2016, p. 26). As evidence, he offered a series of simulations in which true discrimination accuracy (d′) was equated for two different identification tasks: showups and six-photograph lineup identifications. In both cases, Lampinen assumed an underlying representation based on signal detection theory (Macmillan & Creelman, 2005). Specifically, memory strength values were sampled from Gaussian strength distributions with a mean of 0 and a standard deviation of 1 for fillers, and a mean of d′ (set at 0.5, 1, 1.5, or 2; see Footnote 3) and a standard deviation of 1 (for equal variance simulations) or 1.2 (for unequal variance) for guilty suspects.
To simulate the showup procedure, Lampinen randomly selected a single value from either the guilty suspect distribution, to represent a TP showup, or the foil distribution, to represent a TA showup. The sampled strength was compared with a fixed set of criterion values (0.18, 0.23, 0.27, 0.39, 0.67, 1.15, and 1.53) to assign a confidence level for the response. For example, a sampled strength of 1.0 would be assigned a confidence level of 6 because it falls in the interval between the fifth and sixth criterion locations. Any sampled strength greater than 0.18, the lowest criterion, was assumed to result in a positive identification.
Simulation of the six-photograph lineup procedure was similar, except that each simulated lineup involved six sampled strengths, either one from the guilty suspect distribution and five from the filler distribution (for a TP lineup) or all six from the filler distribution (for a TA lineup). In either case, the highest sampled strength determined the confidence rating. If that strength was from the guilty suspect distribution, then a “hit” resulted. If the lineup included only fillers, then any strength above the lowest criterion (0.18) was treated as a false alarm. Because there are six opportunities for a filler to exceed the criterion in a TA lineup, the resulting response rates were divided by 6.
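The showup and lineup procedures just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Lampinen’s actual code: the function names, the NumPy random generator, and the trial counts are hypothetical choices made here. Only the distribution parameters, the seven criterion values, and the division of the lineup false alarm rate by the lineup size are taken from the description above.

```python
import numpy as np

# Criterion locations reported for Lampinen's simulations.
CRITERIA = np.array([0.18, 0.23, 0.27, 0.39, 0.67, 1.15, 1.53])

def simulate_showup(d_prime, n_trials, rng, sd_guilty=1.0):
    """Sample one strength per trial for TP and TA showups (hypothetical helper)."""
    tp = rng.normal(d_prime, sd_guilty, n_trials)  # guilty suspect shown
    ta = rng.normal(0.0, 1.0, n_trials)            # innocent person shown
    return tp, ta

def simulate_lineup(d_prime, n_trials, rng, size=6, sd_guilty=1.0):
    """Sample `size` strengths per trial for TP and TA lineups (hypothetical helper)."""
    guilty = rng.normal(d_prime, sd_guilty, n_trials)
    best_filler = rng.normal(0.0, 1.0, (n_trials, size - 1)).max(axis=1)
    tp_max = np.maximum(guilty, best_filler)       # strongest item in TP lineup
    suspect_is_max = guilty > best_filler          # True when the suspect is chosen
    ta_max = rng.normal(0.0, 1.0, (n_trials, size)).max(axis=1)
    return tp_max, suspect_is_max, ta_max

def showup_rates(tp, ta, criteria=CRITERIA):
    """Cumulative hit and false alarm rates at each criterion."""
    hits = np.array([(tp > c).mean() for c in criteria])
    fas = np.array([(ta > c).mean() for c in criteria])
    return hits, fas

def lineup_rates(tp_max, suspect_is_max, ta_max, size=6, criteria=CRITERIA):
    """Hit: suspect is the strongest item and exceeds the criterion.
    False alarm: any filler exceeds the criterion, divided by lineup size."""
    hits = np.array([((tp_max > c) & suspect_is_max).mean() for c in criteria])
    fas = np.array([(ta_max > c).mean() / size for c in criteria])
    return hits, fas
```

Calling `showup_rates(*simulate_showup(1.5, 10000, rng))` and `lineup_rates(*simulate_lineup(1.5, 10000, rng))` with `rng = np.random.default_rng()` yields one operating point per criterion for each procedure, which is the kind of snapshot of the ROC that fixed criteria provide.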
Lampinen (2016) showed that the simulated ROC for the showup procedure fell above the simulated ROC for the lineup procedure for each of the true d′ values he considered. Because the AUC is a measure of subjects’ ability to discriminate between two types of stimuli (i.e., guilty and innocent suspects; Green & Swets, 1966; Macmillan & Creelman, 2005), he concluded that the estimated accuracy was higher for the showup method. As a consequence, his simulations seem to suggest that the AUC fails to provide a good measure of eyewitnesses’ ability to discriminate guilty from innocent suspects. We will show, using our own simulations, that ROCs based on a choice from among a larger set of options (e.g., a six-person lineup rather than a showup) do have lower hit and false alarm rates at every criterion location, though for reasons quite different from those offered by Lampinen. However, the difference in the AUCs (which Lampinen did not report) is inconsequential compared with the difference in AUCs when true discrimination accuracy (d′) actually varies.
We performed simulations similar to Lampinen’s so that the AUCs could be calculated. To extend and generalize his analyses, we varied the criterion locations continuously for 10,000 different simulated trials of each type, thus mapping out the full theoretical ROC. By comparison, Lampinen assumed a particular set of fixed criterion locations, which necessarily provides only a snapshot of the ROC. In addition, we assumed several different lineup sizes (2, 3, or 6). Although lineups with only two or three photographs are not used in standard police procedures, their inclusion in our simulations does provide some insight into the relationship between lineup size and form of the ROC.
Our simulated showup and lineup ROCs are shown as the dashed curves in Fig. 1. These curves were generated using one of the combinations of parameter values from Lampinen’s study, namely a true d′ of 1.5 and equal variance distributions. (Other values of true d′ yielded similar results.) Notice that these four ROCs are visually indistinguishable at the lowest false alarm rates (up to about 0.10) that reflect the highest confidence levels. This part of the ROC was not shown in any of Lampinen’s simulations, because even the most conservative criterion he selected was relatively liberal and thus produced high hit and false alarm rates overall. The operating points in Lampinen’s simulations are marked with red open circles on the showup ROC in Fig. 1.
The highest-confidence identification decisions are those that are most valuable to the legal system because they carry the greatest probative value (Wixted, Mickes, Dunn, Clark, & Wells, 2016). High-confidence identifications are much more likely to be correct than low-confidence identifications (e.g., Carlson, Dias, Weatherford, & Carlson, in press; Juslin, Olsson, & Winman, 1996; Mickes, 2015; Palmer, Brewer, Weber, & Nagesh, 2013; Wixted et al., 2016), and jurors are more likely to believe confident witnesses (e.g., Cutler, Penrod, & Stuve, 1988). Thus, the highest-confidence and most important regions of the showup and six-person lineup ROCs are visually indistinguishable when true accuracy (d′) is equated. Lampinen’s selection of particular, relatively liberal decision criteria obscured this similarity.
At lower levels of confidence, or more liberal response biases, it becomes apparent that the number of response options affects both the hit and false alarm rates. The false alarm rate is limited for the obvious reason that selection of any of the fillers in a TA lineup is an error; chance response rates are properly limited to 1/N, where N is the lineup size. The basis for the decrease in the hit rate with increasing lineup size is perhaps less obvious. Because witnesses select the photograph that is most familiar (assuming it exceeds some criterion), the greater the number of fillers in a TP lineup, the more likely it is that one of them will have greater familiarity than the guilty suspect just by chance. In that case, the witness would choose a filler instead of the guilty suspect, thus reducing the hit rate. A consequence of these two effects is that the very same evidence values in memory, used in conjunction with identical criterion locations, can nonetheless result in hit and false alarm rates that are reduced as the number of photographs in the lineup increases. For example, the blue circles in Fig. 1 show the hit and false alarm rates that result when witnesses have a true d′ value of 1.5 and use a decision criterion of 1.00 for both showup and six-person lineup decisions: Their decisions appear more conservative in the lineup task simply as a consequence of the number of photographs presented (both memory and the decision process are identical).
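The effect of lineup size at a fixed criterion can also be worked out numerically rather than by simulation. The sketch below is an illustration under stated assumptions: it uses equal-variance Gaussian distributions with a true d′ of 1.5 and a criterion of 1.00 (the values named above), it follows the convention described earlier of dividing the lineup false alarm rate by the lineup size, and the function names, the integration limit, and the step count are hypothetical choices made here.

```python
from statistics import NormalDist

STD = NormalDist()  # strength distribution for fillers and innocent suspects

def showup_rates(d_prime, criterion):
    """Hit and false alarm rates when a single strength is compared to the criterion."""
    hit = 1.0 - NormalDist(d_prime, 1.0).cdf(criterion)
    fa = 1.0 - STD.cdf(criterion)
    return hit, fa

def lineup_rates(d_prime, criterion, size=6, upper=10.0, steps=20000):
    """Hit and false alarm rates for a lineup of `size` photographs."""
    # False alarm: at least one of `size` fillers exceeds the criterion,
    # divided by lineup size (assumed convention).
    fa = (1.0 - STD.cdf(criterion) ** size) / size
    # Hit: the guilty suspect exceeds the criterion AND every filler.
    # Integrate guilty_pdf(x) * P(all size-1 fillers < x) with the midpoint rule.
    guilty = NormalDist(d_prime, 1.0)
    width = (upper - criterion) / steps
    hit = 0.0
    for i in range(steps):
        x = criterion + (i + 0.5) * width
        hit += guilty.pdf(x) * STD.cdf(x) ** (size - 1) * width
    return hit, fa

print("showup:", showup_rates(1.5, 1.0))
print("six-person lineup:", lineup_rates(1.5, 1.0))
```

Both the hit rate and the false alarm rate come out lower for the six-person lineup than for the showup, even though the memory distributions and the criterion are identical in the two calls.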
To estimate the area under the curve for each of the ROCs shown in Fig. 1, we used the R package pROC (Xavier et al., 2011). Because the maximum false alarm rate is limited by the lineup size, we estimated the pAUC for the showup and six-person lineup ROCs (those compared visually by Lampinen) over a false alarm rate range from 0 to either 0.10 (only the highest-confidence responses) or 0.16 (essentially the full ROC for the six-person lineup). To obtain confidence intervals on these area estimates, we repeated this process using 2,000 bootstrapped samples for each identification procedure and true accuracy level (d′), selecting the pAUCs at the 2.5th and 97.5th percentiles as the lower and upper bounds of the 95% confidence intervals.
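The logic of a partial AUC with a percentile bootstrap interval can be illustrated as follows. This is a sketch only and does not reproduce pROC: it is written in Python with NumPy, it uses a plain trapezoidal rule on showup-style data (one strength per trial), and every function name, the criterion grid, and the default seed are assumptions introduced here for illustration.

```python
import numpy as np

def showup_roc(tp, ta, criteria):
    """False alarm and hit rates at each criterion for single-strength trials."""
    hit = np.array([(tp > c).mean() for c in criteria])
    fa = np.array([(ta > c).mean() for c in criteria])
    return fa, hit

def partial_auc(fa, hit, fa_max):
    """Trapezoidal area under the ROC between FA = 0 and FA = fa_max."""
    fa = np.asarray(fa, dtype=float)
    hit = np.asarray(hit, dtype=float)
    order = np.lexsort((hit, fa))                  # sort by FA, then by hit rate
    fa = np.concatenate(([0.0], fa[order]))        # anchor the curve at (0, 0)
    hit = np.concatenate(([0.0], hit[order]))
    keep = fa < fa_max
    # Close the region with a linearly interpolated point at fa_max
    # (np.interp holds the last hit rate constant beyond the largest FA).
    grid = np.concatenate((fa[keep], [fa_max]))
    heights = np.concatenate((hit[keep], [np.interp(fa_max, fa, hit)]))
    return float(np.sum(np.diff(grid) * (heights[1:] + heights[:-1]) / 2.0))

def bootstrap_pauc(tp, ta, criteria, fa_max, n_boot=2000, seed=0):
    """Percentile 95% interval: resample trials with replacement, recompute pAUC."""
    rng = np.random.default_rng(seed)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        tp_b = rng.choice(tp, size=tp.size, replace=True)
        ta_b = rng.choice(ta, size=ta.size, replace=True)
        estimates[b] = partial_auc(*showup_roc(tp_b, ta_b, criteria), fa_max)
    return np.percentile(estimates, [2.5, 97.5])
```

For lineup data, the same `partial_auc` function could be applied to hit and false alarm rates computed with the lineup decision rule; only the step that turns sampled strengths into ROC points would change.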
Figure 2 shows that the difference in pAUCs for showups and six-person lineups is trivial when true accuracy (d′) is equated (see Footnote 4). The differences typically appear in the fourth decimal place (i.e., less than the size of the smaller gray square in Fig. 1). In contrast, when true accuracy (d′) varies, the pAUCs for both showup and six-person lineup identifications capture those changes quite readily. Figure 2 shows that with a true accuracy difference of 1.0 d′ units, pAUC changes are obvious for both showup and lineup decisions (i.e., approximately twice the size of the larger gray square in Fig. 1). Thus, contrary to Lampinen’s conclusion, we observe that the AUCs obtained from showup and six-person lineup identifications may be safely compared empirically: They yield indistinguishable estimates of discrimination accuracy when true accuracy (d′) is held constant and change appropriately (and quite similarly) when true accuracy varies (see Footnote 5).
Of course, these simulations assume witnesses are equally good at discriminating guilty from innocent suspects in showup and lineup tasks. On one hand, they show that, theoretically, if discrimination accuracy (d′) is the same, the ROCs are highly unlikely to suggest that either of the procedures results in higher-accuracy decisions than the other. On the other hand, the simulations also show that real differences in discrimination accuracy (d′) can be detected with ROCs, and to the same degree for both showups and lineups. Whether accuracy is actually the same in these two identification procedures is an empirical question, not a theoretical one. Empirical comparisons of ROCs for showup and lineup identifications have consistently revealed that showups result in significantly lower decision accuracy than lineup procedures, as measured with pAUC (Gronlund et al., 2012; Mickes, 2015; Wetmore et al., 2015). Our simulations suggest that the most appropriate conclusion to draw from these studies is that showup identification accuracy is inferior to lineup identification accuracy. (See Wixted and Mickes [2014] for a possible theoretical explanation of that difference.)
The other published ROC comparisons of eyewitness identifications are safe from Lampinen’s criticism as well. Lampinen’s basic claim was that varying the length of the lineup (from six to one) affected estimated but not true witness accuracy, but the primary application of ROCs to eyewitness identification decisions has been to compare sequential and simultaneous presentation of the same lineup photographs. The consistent finding, that pAUC for simultaneous lineups is equal to or greater than for sequential procedures (Carlson & Carlson, 2014; Dobolyi & Dodson, 2013; Gronlund et al., 2012; Mickes et al., 2012), is not in any way challenged by Lampinen’s arguments, because the same lineup length (indeed, the same lineup) was used for both procedures. Thus, the ROC data indicate that, if anything, there is a simultaneous superiority effect; we will revisit this issue in the final section of this paper, where we discuss the implications of ROC data for theoretical developments in eyewitness identifications.
Finally, note that the “proof” offered by Lampinen (2016, Appendix) of the relationship between estimated discrimination accuracy from lineups and showups is irrelevant because that math applies for a different task, namely a two-alternative forced choice (2AFC) task. In a 2AFC task, participants are shown a target and a lure and must choose the target. The decision is usually modeled as one of taking a difference between the two (independent) memory strengths, and thus the distribution of interest becomes $N(d', \sqrt{1^2 + s^2})$, where $s$ is the standard deviation of the guilty suspect distribution (see Footnote 6). The resulting ROC, again for a true d′ of 1.5 and s = 1, is shown as the solid curve in Fig. 1. This is not the ROC simulated by Lampinen, nor is it the same ROC that occurs for lineups of size 2 (see Fig. 1). Participants in an eyewitness identification task have the option of rejecting the lineup entirely, which may change the underlying task from one of comparison (as in 2AFC) to one in which the subject must simply identify the strongest item that exceeds a minimum criterion (DeCarlo, 2013). Comparisons of ROCs from different tasks should involve careful consideration of the decision processes and memory evidence involved, as we will show in the final section of this paper.
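For readers who want the intermediate step, the 2AFC difference distribution can be derived as follows. The symbols $X_T$ (target strength) and $X_L$ (lure strength) are introduced here for illustration, the second argument of $N(\cdot,\cdot)$ is a standard deviation as in the text, and the final line assumes an unbiased observer who simply picks the stronger item.

```latex
\begin{align*}
X_T &\sim N(d', s), \qquad X_L \sim N(0, 1), \qquad X_T \text{ and } X_L \text{ independent} \\
\operatorname{E}[X_T - X_L] &= d' - 0 = d' \\
\operatorname{Var}[X_T - X_L] &= s^2 + 1^2
  \;\Rightarrow\; X_T - X_L \sim N\!\left(d', \sqrt{1^2 + s^2}\right) \\
P(\text{correct}) &= P(X_T - X_L > 0) = \Phi\!\left(\frac{d'}{\sqrt{1^2 + s^2}}\right)
\end{align*}
```

With $d' = 1.5$ and $s = 1$, the last expression is $\Phi(1.5/\sqrt{2}) \approx 0.86$.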
To summarize, Lampinen’s first criticism misses the mark in several important ways. It is irrelevant to the comparison of sequential and simultaneous lineups that has dominated the eyewitness ROC literature, and it reaches the unfounded conclusion that estimated accuracy (pAUC) differs systematically for showups and lineups. Our analyses demonstrate that the two paradigms yield essentially identical estimates of performance when the true accuracy level (d′) is equated. Thus, Lampinen’s argument does not change either the empirical conclusion that simultaneous lineups yield equal or greater AUCs than sequential lineups (Carlson & Carlson, 2014; Dobolyi & Dodson, 2013; Gronlund et al., 2012; Mickes et al., 2012), or that showups yield lower-accuracy identifications than lineups of either type (Gronlund et al., 2012; Mickes, 2015; Wetmore et al., 2015). Our comparison of showup and six-person lineup ROCs confirms that these ROC analyses provide accurate information about relative performance.
2. Lampinen (2016) claims ROCs invite inappropriate comparison of accuracy at different levels of response bias. In truth, ROCs separate bias from discrimination accuracy.
Lampinen’s (2016) second major claim is that ROCs invite comparison of memory accuracy across different levels of confidence. As an example, he selects a particular false alarm rate, say, 0.167 (see Lampinen, 2016, Fig. 6), and then observes that the hit rates vary for different simulated identification procedures, suggesting that different estimates of accuracy (say, d′) would be observed for lineup and showup identifications. Finally, Lampinen notes that if response bias differs across identification procedures, then the operating points being compared may reflect different degrees of witness confidence. Conversely, Levi (2016) worries that witnesses given different identification procedures may respond with the same confidence level despite having different discrimination accuracy (d′).
These criticisms completely miss the value of ROCs, namely that each ROC reflects the same accuracy (d′) at every point. Because of this property of ROCs, one can readily see both accuracy and response bias effects simultaneously: Curves higher in the space reflect higher decision accuracy, d′ (though possibly not meaningfully so, as our simulations demonstrate), and points toward the lower-left end of the curve reflect more conservative responses. One does not need to compute a single-point measure of accuracy, such as d′, at a given false alarm rate to compare accuracy across two conditions. Indeed, one should not do so, because d′ is confounded with response bias whenever the underlying evidence distributions have unequal variance, as is consistently observed in recognition memory judgments (e.g., Ratcliff, Gronlund, & Sheu, 1992). This mistake has led to substantial interpretation errors in a variety of memory experiments (e.g., Dougal & Rotello, 2007; Evans, Rotello, Li, & Rayner, 2009; Verde & Rotello, 2003).
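The confound between single-point d′ and response bias under unequal variance can be checked with a short calculation. The sketch below is illustrative only: it assumes a guilty suspect distribution with a mean of 1.5 and a standard deviation of 1.2 (the unequal-variance value mentioned for Lampinen’s simulations) and an innocent suspect distribution that is standard normal, and the function name is a hypothetical choice. The single-point estimate is the usual d′ = z(H) − z(F).

```python
from statistics import NormalDist

STD = NormalDist()  # innocent suspect distribution (mean 0, SD 1)

def single_point_d_prime(criterion, mean_guilty=1.5, sd_guilty=1.2):
    """Estimate d' = z(H) - z(F) from the one operating point at `criterion`."""
    hit = 1.0 - NormalDist(mean_guilty, sd_guilty).cdf(criterion)
    fa = 1.0 - STD.cdf(criterion)
    return STD.inv_cdf(hit) - STD.inv_cdf(fa)

for criterion in (0.0, 0.5, 1.0, 1.5):
    unequal = single_point_d_prime(criterion)
    equal = single_point_d_prime(criterion, sd_guilty=1.0)
    print(f"criterion={criterion:.1f}  unequal-variance d'={unequal:.3f}  "
          f"equal-variance d'={equal:.3f}")
```

With unequal variances the estimate drifts as the criterion moves, although the underlying distributions never change; with equal variances it stays at 1.5 for every criterion.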
It is important to understand that every “single-point” measure of decision accuracy, be it d′, percent correct, diagnosticity, or something else, has an associated theoretical ROC. The theoretical ROC for a given measure simply connects all combinations of hit and false alarm rates that yield the same accuracy value according to that measure, regardless of differences in response bias. However, the various single-point measures of accuracy each predict a different ROC form, which means that researchers who ignore differences in response bias across conditions may easily (and erroneously) conclude that accuracy differs if they select an inappropriate accuracy measure (see Dube, Rotello, & Heit, 2010, and Rotello, Masson, & Verde, 2008, for detailed explanations). For example, for a constant value of diagnosticity, D = H / F, it is easy to see that the theoretical ROC, H = D × F, is a line with an intercept of 0 and a slope equal to the diagnosticity value itself. Rotello, Heit, and Dubé (2015, Fig. 1) plotted several such theoretical ROCs for diagnosticity and compared them with the empirical ROCs reported in a study of eyewitness identifications (Mickes et al., 2012, Experiment 1b). The empirical ROCs were curved, not linear, meaning that different empirical response biases would yield different estimated diagnosticity despite actually representing the same underlying discrimination accuracy (d′). For this reason, diagnosticity is not an appropriate measure of eyewitness identification accuracy (nor for any other task of which we are aware; see Swets, 1986a).
When the empirical and theoretical ROCs do not match, as for the empirical identification ROCs and the diagnosticity measure, a different accuracy measure must be selected; otherwise, estimated accuracy and response bias are confounded (Rotello et al., 2008; Swets, 1986b; Wixted & Mickes, 2012). For this reason, Lampinen’s claim that ROCs and probative value measures such as diagnosticity provide redundant information (p. 32; see also Levi, 2016, p. 45) is simply incorrect unless response bias is the measure of interest. As Fig. 3 shows, diagnosticity varies systematically with response bias but is uncorrelated with any particular AUC. Importantly, the greatest variability in diagnosticity estimates occurs for the most conservative response biases (i.e., those with larger positive values of the signal detection measure c) that yield the high-confidence responses that are most useful to the legal system (e.g., Wixted et al., 2016).
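The dependence of diagnosticity on response bias can be seen by walking along a single curved ROC. The following sketch is an illustration under stated assumptions, not an analysis of the data in Fig. 3: it uses an equal-variance Gaussian model with a true d′ of 1.5, defines diagnosticity as H / F and the bias measure as c = −0.5 [z(H) + z(F)], and uses hypothetical function names and criterion values.

```python
from statistics import NormalDist

STD = NormalDist()

def roc_point(criterion, d_prime=1.5):
    """Hit and false alarm rates for an equal-variance Gaussian model."""
    hit = 1.0 - NormalDist(d_prime, 1.0).cdf(criterion)
    fa = 1.0 - STD.cdf(criterion)
    return hit, fa

def summarize(criterion, d_prime=1.5):
    """Return d', the bias measure c, and diagnosticity at one operating point."""
    hit, fa = roc_point(criterion, d_prime)
    z_hit, z_fa = STD.inv_cdf(hit), STD.inv_cdf(fa)
    return {
        "d_prime": z_hit - z_fa,         # constant along this ROC
        "c": -0.5 * (z_hit + z_fa),      # larger values = more conservative
        "diagnosticity": hit / fa,       # changes with response bias
    }

for criterion in (0.0, 0.5, 1.0, 1.5, 2.0):
    s = summarize(criterion)
    print(f"c={s['c']:+.2f}  d'={s['d_prime']:.2f}  "
          f"diagnosticity={s['diagnosticity']:.2f}")
```

Every row has the same d′, yet diagnosticity grows as c becomes more conservative, which is why points on one curved ROC cannot lie on a single line H = D × F.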
In summary, ROCs do not “invite” inappropriate comparisons of performance across different response biases. Instead, they make explicit if and how response bias differs across empirical conditions, and they yield a measure of response accuracy (AUC) that is independent of response bias. In contrast, single-point measures, such as d′, percent correct, and diagnosticity, both obscure and are almost invariably confounded with differences in response bias.