Details of the experimental data, SDT model and model-fitting procedure, and the simulations are provided in the “Methods” section. Here we focus on the results and only briefly describe the methods. We note, however, that closed-form solutions, rather than simulations, were used for the SDT model (e.g., Cohen et al., in press).
Experimental data
Full data set
The SDT model described previously was fit simultaneously to the data from 13 eyewitness lineup experiments, involving a total of 10,137 identification decisions. As discussed previously, the rationale for mixing data across a large number of studies was to approximate the huge variability across real witnesses, who view different crimes under different viewing conditions and who vary substantially in their individual characteristics. We do not claim that our combined data set is a close analog to real witness identification data, but it is certainly a much closer analog than data from any single experiment in which these estimator variables are all tightly controlled (or manipulated) across subject-witnesses. All of the 13 experiments collected confidence ratings from witnesses and used a simultaneous lineup procedure, i.e., all photos were shown at the same time. Critically, the model was applied to collapsed data, i.e., data in which the guilt or innocence of the suspect is not known. Thus, the main question is whether, without this information, the model can recover the base rate of guilty suspects present in the data by estimating a value for the pg parameter that is near the true P(guilty).
The results are shown in Fig. 3. First, consider the left panel. The number in the lower right shows the sample size of the original data set. The green circle plots the estimated base rate for the experimental data on the y-axis against the actual experimental base rate on the x-axis. The model does an excellent job of recovering the actual experimental base rate.
Next, we bootstrapped samples from the original data set to generate data sets with a known range of base rates. That is, samples of trials in which the suspect was known to be guilty were combined with samples of trials in which the suspect was known to be innocent in different proportions, i.e., .20, .35, .50, .65, and .80 guilty suspects. Each sampled data set comprised 1000 identification attempts, so, for example, a .20 base-rate data set would have responses from 200 guilty-suspect lineups and 800 innocent-suspect lineups. We refer to data generated in this way as sampled data and the base rates of sampled data as sampled base rates. The SDT model was then used to estimate the sampled base rate. This process was repeated 1000 times. The gray circles represent the estimated sampled base rate for each of the 1000 sampled data sets, with the sampled base rates jittered on the x-axis for visibility. For each sampled base rate, the median estimated base rate is marked with a red x and the 10th and 90th quantiles are marked with red lines.
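For concreteness, the resampling step can be sketched as follows. This is a minimal illustration rather than the code used for the reported analyses; the arrays `guilty_trials` and `innocent_trials` (holding the observed response codes from guilty- and innocent-suspect lineups) and the function name are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2016)

def sample_known_base_rate(guilty_trials, innocent_trials, base_rate, n=1000):
    """Bootstrap a data set of n identification attempts with a known
    proportion (base_rate) of lineups that contain the guilty suspect."""
    n_guilty = int(round(base_rate * n))
    guilty = rng.choice(guilty_trials, size=n_guilty, replace=True)
    innocent = rng.choice(innocent_trials, size=n - n_guilty, replace=True)
    # The guilt labels are discarded before fitting, so the model sees
    # only the collapsed responses, as described in the text.
    return np.concatenate([guilty, innocent])
```

For a .20 base rate, this draws 200 responses from guilty-suspect lineups and 800 from innocent-suspect lineups, matching the example above.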
Consistent with Wixted et al. (2016), the estimated base rate tracks the actual sampled base rate well. There are some caveats, however. First, there is considerable variability in the estimated base rate for each actual sampled base rate. The middle 80% of the estimated base rates, i.e., the distance between the 10th and 90th quantiles, spans a range of approximately 0.15 to 0.20. Although this result indicates that the model’s estimated base rate provides a fair approximation to the true value, it is an open question whether the observed level of variability in those estimates is acceptable. This variability will also change with sample size, and readers should note that the displayed estimates are based on a large sample (1000 witnesses). Second, there is a slight but clear bias such that low sampled base rates tend to be overestimated. To a lesser extent, high sampled base rates tend to be underestimated. Base-rate estimation is best for sampled values near 0.5. While we might be tempted to take comfort from the fact that most experimental base rates are also near 0.5, there is typically no need to estimate experimental base rates; rather, our goal is to accurately estimate the currently unknown base rate of police lineups that include the guilty suspect, which may vary considerably.
The middle panel of Fig. 3 shows the fit of the model to the original data. The green o’s are the experimental data and red x’s are the model predictions. The labels L, M, and H show the proportion of trials resulting in low-, medium-, and high-confidence suspect identifications. Similarly, the labels l, m, and h reflect the proportion of trials yielding low-, medium-, and high-confidence filler identifications, and the point labeled R shows the proportion of lineups that were rejected by the witness (i.e., a “no ID” decision was made). The fit is excellent, with the model accurately predicting performance in every response category.
The bootstrapped samples shown in the left panel of Fig. 3 provide one way of evaluating how well the SDT model estimates the sampled base rate. A different way of assessing the model is to consider how well it fits the data when a particular base rate is assumed. Concretely, suppose that the true base rate in a particular data set is 0.50. If the SDT model provides a good description of the data, it should provide a good fit (indicated by a small value of G2) when the base-rate parameter pg is fixed at a value close to 0.50, and a relatively poor fit (i.e., large G2) when pg is set to a value that is far from 0.50. Observing such a pattern would provide support for the conclusion that the model’s estimated base rate is constrained by the data. On the other hand, if changes in the model’s base-rate parameter do not result in substantial changes in the fit statistic, then that would imply that the model’s estimated base rate is not sufficiently informed by the data and should not be trusted.
The right panel of Fig. 3 shows the results of this sensitivity analysis. The upper section of the right panel shows the fit measure, G2, that is observed when the SDT model is fit to the experimental data under a wide range of assumptions about the value of pg. Specifically, we fit the experimental data many times, using a different fixed value of pg (ranging from 0.01 to 0.99) for each fit. The x-axis shows the fixed value of the model base rate pg and the black curve shows the corresponding values of G2 (where lower values are better). For this large data set, with a true base rate of approximately 0.52, the model’s fit profile is quite good. The parabolic shape of the black curve shows that the model only fits these data well when the model’s base-rate parameter is in a relatively narrow range around the true base rate in the data (shown with the green vertical line). This parabolic shape shows that it is not a coincidence that the model estimates the base rate in the data quite accurately when the base-rate parameter is free to vary (as in the results shown in the left panel of Fig. 3). This estimated value is shown with the red vertical line.
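The structure of this sensitivity analysis can be sketched as follows, assuming a hypothetical fitting routine `fit_sdt(data, pg_fixed=...)` that returns an object with the G² value of the fit (our own placeholder, not the authors’ code).

```python
import numpy as np

def g2_profile(data, fit_sdt, grid=np.linspace(0.01, 0.99, 99)):
    """Fit the SDT model once per grid value, holding the base-rate
    parameter pg fixed, and record the resulting G^2 fit statistic.
    Plotting G^2 against the grid gives a curve like the black curve
    in the upper section of the right panel of Fig. 3."""
    return {round(pg, 2): fit_sdt(data, pg_fixed=pg).g2 for pg in grid}
```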
Another advantage of this approach is that we can compare the fit of the model when the true base rate is used as the value of pg with the fit that results when pg is unconstrained. That is, we can compute the relative likelihood, lr, of the model with the best-fitting (estimated) base rate versus the model with the actual base rate. This value, shown in green in the lower-left corner of the upper section of the right panel, indicates that the best-fitting base rate is 1.9 times more likely than the actual base rate. In this context, that means that fixing the model base rate to the actual base rate does not meaningfully change the fit of the model.
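Because G² is a deviance (twice the difference in log-likelihood from the saturated model), the relative likelihood can be recovered from the difference between the two G² values; a minimal sketch, assuming both fits are evaluated against the same data and saturated model:

```python
import math

def relative_likelihood(g2_fixed, g2_free):
    """Likelihood of the model with the freely estimated (best-fitting)
    base rate relative to the model with pg fixed at the actual base rate.
    Both G^2 values are assumed to be computed against the same saturated
    model, so the shared constants cancel."""
    return math.exp((g2_fixed - g2_free) / 2.0)
```

A value near 1 would mean that fixing pg at the actual base rate costs essentially nothing in fit; the reported value of 1.9 is small enough that, as noted in the text, the fit is not meaningfully changed.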
Finally, the lower section of the right panel of Fig. 3 assesses how well the SDT model can estimate base rates when the data are known to be generated by the model, assuring that the model assumptions are met. This section of the figure was generated as follows. First, the model was fit to the full data set with pg fixed at the actual data base rate. Second, data were simulated from the model using the parameters estimated from the first step and pg was again fixed at the actual base rate. The sample size and lineup sizes for the simulation were the same as in the data. Third, the model was fit to this simulated data. Steps 2 and 3 were repeated 1000 times. The lower section of the right panel of Fig. 3 provides the frequency with which a given base rate, pg, was recovered. The gray curve shows a kernel density estimation of the distribution of estimated base rates. The peak of the distribution is at the generating base rate, provided by the green vertical line. Critically, the estimated base rate from the full data set, the red line, is still relatively likely.
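The three-step parametric bootstrap can be outlined as follows, again using hypothetical `fit_sdt` and `simulate_sdt` helpers as placeholders for the actual fitting and simulation code.

```python
def parametric_bootstrap(data, actual_pg, fit_sdt, simulate_sdt, n_reps=1000):
    """Step 1: fit the model with pg fixed at the actual base rate.
    Steps 2-3 (repeated n_reps times): simulate data from the fitted
    parameters, then refit with pg free and record its estimate."""
    fitted = fit_sdt(data, pg_fixed=actual_pg)                       # step 1
    estimates = []
    for _ in range(n_reps):
        simulated = simulate_sdt(fitted.params, n_trials=len(data))  # step 2
        estimates.append(fit_sdt(simulated).params["pg"])            # step 3
    return estimates  # the distribution shown in the lower right section
```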
In summary, all of these different ways of evaluating the ability of the SDT model to fit these data suggest that it does a good job, although there is quite a bit of variability in the estimated base rate even with a sample that includes many identification attempts (N = 1000 trials, left panel of Fig. 3).
A more familiar way of looking at eyewitness identification decisions is to use a calibration curve (Juslin, Olsson, & Winman, 1996) or confidence-accuracy characteristic (Mickes, 2015), which plots the accuracy of the identifications as a function of witness confidence. These curves can be generated for both empirical data and model predictions, as shown in Fig. 4. Each column shows the calibration curves for a sampled base rate. The model predictions are averaged across the results from all simulations at that base rate. We begin by discussing the top row, which shows the proportion of correct responses given either a rejection (rej) or a low (sL), medium (sM), or high (sH) confidence suspect identification. For example, a value of 0.60 for sL means that 60% of the low-confidence suspect choices were to guilty suspects and 40% were to innocent suspects. A value of 0.70 for rej means that 70% of the rejected lineups included an innocent suspect, and so were correctly rejected; the other 30% included a guilty suspect who was missed. These results illustrate a consequence of the overestimation of the base rate for low sampled base rates (left panel of Fig. 3). Specifically, for low sampled base rates, the model predicts that performance for suspect identifications will be more accurate than is actually observed. As the sampled base rate increases, this difference between the data and the model’s predictions is reduced; however, the model slightly over-predicts the accuracy of rejected lineups.
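For the empirical curves, where guilt status is known, the points in the top row can be computed directly from the trial-level data. A sketch, assuming a pandas data frame with illustrative column names (`response`, `confidence`, `guilty`) of our own choosing:

```python
import pandas as pd

def suspect_calibration(df):
    """Proportion of suspect identifications that were of the guilty suspect,
    by confidence level (the sL, sM, sH points), plus the proportion of
    rejected lineups that contained an innocent suspect (the rej point)."""
    suspect_ids = df[df["response"] == "suspect"]
    id_accuracy = suspect_ids.groupby("confidence")["guilty"].mean()
    rejections = df[df["response"] == "reject"]
    rejection_accuracy = 1.0 - rejections["guilty"].mean()
    return id_accuracy, rejection_accuracy
```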
These graphs also reflect the ability of the SDT model to recover PPG. Recall that PPG is defined as the probability that the suspect is guilty, given that the suspect is identified. That is exactly the information the sL, sM, and sH points in the top row of Fig. 4 provide, for individual confidence levels. For comparison to previous work, those graphs were generated by averaging across predictions. To get a sense of the variability inherent in the estimated PPG, we repeated this analysis for each individual sampled data set. The results are provided in Fig. 5. There are three main results. First, the model captures the relative PPG values across both base rates (panels) and low (black circles), medium (red triangles), and high (green crosses) confidence levels. Second, however, and consistent with Fig. 4, the model over-predicts PPG, especially for low base rates. Third, there is considerable variability across sampled data sets, again especially for low base rates, suggesting that estimated PPG is a relatively imprecise measure of actual PPG. This result is important because it suggests that, although estimation of base rates is fairly good, the PPG estimated by the model is substantially higher than the PPG in the data, especially for low base rates similar to that estimated by Wixted et al. (2016) for the Houston data. That is, the PPG estimated from the model provides an inflated sense of the guilt of an identified suspect.
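In model terms, the PPG implied by a fitted model is simply Bayes’ rule applied to the estimated base rate and the model-predicted suspect-identification rates (the same relation holds within each confidence level by conditioning on that level):

$$
\mathrm{PPG} = P(\text{guilty} \mid \text{suspect ID}) =
\frac{p_g \, P(\text{suspect ID} \mid \text{guilty})}
     {p_g \, P(\text{suspect ID} \mid \text{guilty}) + (1 - p_g) \, P(\text{suspect ID} \mid \text{innocent})}
$$

Overestimating p_g, or under-predicting the rate of innocent-suspect identifications, therefore inflates the implied PPG, consistent with the over-prediction of PPG at low base rates noted above.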
Returning to Fig. 4, the bottom row plots the probability of a suspect identification given a low- (L), medium- (M), or high- (H) confidence response (for either a suspect or filler identification). For example, a value of 0.30 for low-confidence responses means that 30% of the low-confidence responses were suspect identifications and 70% were not. Here, the model does an excellent job throughout.
Individual experiments
The previous results show that, when applied to a data set drawn from a range of experimental procedures, sample sizes, lineup sizes, and manipulations that simulate different estimator variables, the model does a fairly good job of estimating the actual base rate, but with systematic biases in some cases and high variability in the estimates. To get a sense of how robust these results are, we next applied the SDT model to the data from each of the 13 experiments individually, repeating the same analyses that we reported for the combined data set.
Overall, these individual experiment model fits were excellent. That is, the SDT model did an excellent job in accounting for the response proportions for each identification category. These data, along with the calibration curves and sensitivity analyses for each experiment, are provided on Open Science Framework (OSF; https://osf.io/3qz5n).
The key results of these model fits are the base-rate estimates for the bootstrapped samples drawn for each experiment; these are shown in Fig. 6. The results are quite variable. For some experiments, the model did a very good job of estimating the base rate. For example, the data from Brewer and Wells (2006), Carlson et al. (2016), Mickes (2015) Experiments 1 and 2, and Rotello et al. (2015) Experiment 3 produced very good to excellent mean estimated base rates (though often with high variability). Of particular note is the superb base-rate estimation for the Palmer et al. (2013) study, which is the same data set used by Wixted et al. (2016). The base-rate estimates for other studies were good, albeit biased to varying degrees, including Mickes et al. (2017) Experiment 1 and Rotello et al. (2015) Experiment 2. Yet other studies showed a good correlation between actual and estimated base rates, but with an extreme bias; these include Carlson et al. (2017), Kneller and Harvey (2016), and Wetmore et al. (2015). The estimated base rates for Rotello et al. (2015) Experiment 1 and Gronlund et al. (2009) were very poor.
Mediocre to poor base-rate estimation in some of these studies can be easily explained. In Rotello et al. Experiment 1, the suspect identification rates were quite low and did not vary much as a function of confidence. The Kneller and Harvey (2016) study involved only 120 participants, and in two of their three conditions no suspect identifications were made with high confidence. These results suggest that it is important to obtain a good sample of suspect identifications and responses at all confidence levels. The Wetmore et al. (2015) study, like Gronlund et al. (2009), included biased lineups that strongly encouraged selection of the suspect, which results in an inflated estimate of the base rate of guilty suspects. We suspect that this is one main reason that the model overestimates the base rate for this experiment, and we explore this explanation further in the next section.
Experimental conditions
Estimation of base rates for the individual experiment data shows highly variable performance. For the majority of studies, base-rate estimation was good to excellent. There were studies, however, for which base-rate estimation was very poor. In some cases, as discussed previously, an explanation is readily available. In others, it is not readily apparent why the model performs so poorly. To generate a more fine-grained view of where the model performs well and where it performs poorly, we now turn to a condition-by-condition analysis of the data.
A summary of the conditions used is provided in Table 2 of the “Methods” section. The full set of results, the model fits (which were all very good to excellent), sensitivity analysis, and calibration curves for each condition are provided on OSF. Here, we discuss the conditions from two experiments that proved to be especially informative regarding the conditions under which the model performed poorly.
Some of the experimental manipulations are expected to strongly influence identification rates and the distribution of confidence. For example, biased or unfair lineups in which the fillers are dissimilar to the suspects, as in Gronlund et al. (2009), tend to inflate suspect identification rates and confidence levels. The presence of a visible weapon, as in Carlson et al. (2017), is expected to draw attention away from the perpetrator, thus reducing guilty-suspect identification rates and lowering confidence. Indeed, estimation of base rate was especially poor in exactly these conditions.
The Gronlund et al. (2009) study manipulated a number of factors, including the fairness of the lineup and the memory strength of the suspect. There were three levels of fairness, fair (F), intermediate (I), or biased (B), which were generated by manipulating the similarity of the fillers and suspect. In addition, guilty suspects could be represented by a photo that was a better or worse match to the way that they appeared in the witnessed crime, resulting in a strong (GS) or weak (GW) memory strength. Naturally, the GS conditions resulted in more suspect identifications than the GW conditions. Similarly, innocent suspects could be strong (IS) by virtue of offering a good match to the perpetrator, or else weak (IW); there were more IS than IW suspect identifications. Interestingly, the Gronlund et al. data reveal that the GW suspects were identified less often than either the IS or IW suspects; in that case, the perpetrator elicited lower memory strength than the innocent suspect. Because the innocent suspects had never been seen before, these conditions could be viewed as manipulations of selection bias rather than memory strength. The condition-by-condition Gronlund et al. (2009) results are provided in Fig. 7.
The Gronlund et al. (2009) results are nuanced. Estimation was good, albeit noisy, for the fair (no-bias) condition with a weak innocent suspect (FIW). This condition corresponds to a standard lineup in which the weak innocent suspect was essentially just another filler. All of the other conditions, however, show strong deviations from accurate base-rate estimation. Base-rate estimation is at or near ceiling for all of the biased (B) conditions because in those conditions the suspect was identified with high probability and high confidence, regardless of guilt or innocence; the model interprets this response pattern as reflecting a high base rate of lineups containing guilty suspects. One way of understanding this outcome is to contrast the effective size of the lineup (E’; Tredoux, 1998) and the lineup size assumed by the model, which always reflects the actual number of photos shown. To the extent that these two values differ, the model is misspecified for the data. Whereas the model assumed a lineup size of six, for the biased lineups from Gronlund et al. (2009), the effective lineup size E’ was always less than two.
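For readers unfamiliar with the measure, one common formulation of Tredoux’s (1998) effective size is the inverse of the sum of squared choosing proportions across lineup members. A minimal sketch, with an illustrative function name and example counts of our own choosing:

```python
def effective_size(choice_counts):
    """Effective lineup size (E'), computed as the inverse Simpson index of
    the choosing proportions. A perfectly fair six-person lineup yields 6;
    a lineup in which one member draws nearly all choices yields close to 1."""
    total = sum(choice_counts)
    proportions = [count / total for count in choice_counts]
    return 1.0 / sum(p * p for p in proportions)

# Example: one lineup member attracts half of all identifications.
# effective_size([50, 10, 10, 10, 10, 10]) ~= 3.3
```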
Another striking misestimation of the sampled base rate is evident in instances of the Gronlund et al. (2009) data in which the guilty suspect is a poor match to the perpetrator (GW). In that case, the model’s estimates of base rate are actually negatively correlated with the sampled base rate. This failure of the model occurs because the GW suspect is selected less often than either the IW or IS suspects, which means that there are fewer suspect identifications in the data as the sampled base rate increases; the model interprets this low suspect identification rate as reflective of the base rate. The combined effect of (intermediate) biased lineups and guilty suspects that are a poor match to the perpetrator is visible in the IGW condition, which shows overestimation of the base rate overall due to the relative dissimilarity of the filler photos to the perpetrator, as well as the negative relationship between estimated and sampled base rates that stems from inclusion of the GW suspect.
In Carlson et al. (2017) a weapon was either shown (S), present but concealed (C), or not shown (N). The condition-by-condition Carlson et al. (2017) results are provided in Fig. 8. In all conditions, the model’s estimate of the base rate is too low, and the degree of underestimation varies systematically with the participants’ awareness of the weapon. When there is no weapon (N), the bias is smallest. When the weapon is visible (S), the estimates are very strongly biased, reflecting the relatively low probability that the suspect is identified. Presence of a concealed weapon results in moderate bias.
Up to this point, we have highlighted the ways in which the conditions of two specific studies were particularly challenging for the model, so that the resulting misestimation of the base rates could be understood more easily. It turns out, however, that there is a systematic pattern in the data across all experimental conditions that predicts poor base-rate estimation. Figure 9 shows the distribution of identification responses for all 47 experimental conditions from Table 2. The blue points show the responses from all conditions in which PPG was underestimated. The red points show the results from all of the conditions in which PPG was overestimated. Although, for exposition, the data in the top panel of Fig. 9 are separated by innocent-suspect and guilty-suspect trials, recall that the model was fit to the collapsed data, which are shown in the bottom panel. The difference between medium-confidence suspect identifications (top panel, right-hand M) and low-confidence filler identifications (top panel, right-hand l) in innocent-suspect lineups accounts for 52% of the variability in the model PPG misspecifications (r = .72, p = .04). The reason that these particular response rates are challenging for the model is that fillers and innocent suspects are assumed to be sampled from the same distribution (see Fig. 2). This means that the model is forced to predict that the conditional distribution of confidence levels is the same for both of these lineup members; when the data show more identifications of innocent suspects than fillers, which tends to happen with moderate confidence, the model resolves this conflict by incorrectly concluding that those suspect identifications are to guilty suspects and thus overestimates the base rate of lineups that include guilty suspects. In contrast, when the data show more filler identifications than innocent-suspect selections, the model resolves this conflict by concluding that there are fewer lineups containing guilty suspects and thus underestimates the base rate. A shift of criterion cannot completely account for this pattern because, for example, an increase in the medium-confidence response region will simultaneously increase the probability of medium-confidence suspect and filler responses, a pattern which does not occur. A similar pattern emerges for both the guilty-suspect trials and the collapsed data.
Best practices
One of the strengths of the previous analyses is that they show how well the SDT model performs under a wide range of situations. Many of the experimental conditions, however, were specifically designed to deviate from best practices for lineups, for example, through the inclusion of biased instructions or fillers that are dissimilar to the suspect. Thus, they may not provide a good indication of how well the model would perform under ideal conditions (e.g., fair lineups, good witness instructions, double-blind administration). Indeed, the previous section illustrates that the model fails to capture performance in exactly these problematic conditions. It is, therefore, informative to examine model performance when these less-than-ideal conditions are removed from the data set. We call the remaining data the best-practices data set. See the “Methods” section for details. We repeated our previous analyses on these data; the base-rate estimation, model fit, and sensitivity results are provided in Fig. 10, and the calibration curves are shown in Fig. 11.
The model fit is excellent. The bias to overestimate the proportion of guilty suspects for low base rates is now gone. Indeed, the model is very well calibrated at the lower end of the scale. There is, however, still a tendency to underestimate the proportion of guilty suspects at the higher end of the scale, and the estimates are still quite variable considering the large sample size (1000 witnesses). Furthermore, the deviations between the empirical and predicted calibration curves are greatly reduced. The only exception is a tendency to overestimate the proportion of correctly rejected lineups, especially at high base rates. A comparison of the actual sampled PPG and model-predicted PPG is provided in Fig. 12. When restricted to these conditions, although there is still considerable variability across sampled data sets, the model does a very good to excellent job of estimating PPG at all base rates and all levels of confidence.
Simulations
We have shown that the SDT model does a good job overall in estimating the true base rate of lineups that include the guilty suspect from collapsed data. Next, we consider the theoretical limits of this performance level using parameter recovery simulations. We generated a large number of data sets with the SDT model and then fit each simulated data set with the same model, allowing all five parameter values to vary freely (again, note that we fit an equal-variance model, σg = 1). At issue in this analysis is the ability of the SDT model to accurately recover the parameters that were used to generate the data. The generating parameters for each simulated data set were randomly sampled from highly variable distributions, so the model was challenged to accurately estimate the base-rate parameter, pg, against a background of random variation in all of the other model parameters. If the model succeeds, then we can conclude that there are data patterns that specify base rate in a way that cannot be mimicked by any combination of the other parameters. We used simulated data sets comprising 100, 500, or 1000 identification attempts to assess the effect of sample size. Full simulation details can be found in the “Methods” section. The code is provided on OSF.
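The data-generating side of these simulations can be sketched with a few lines of code. This is a minimal equal-variance implementation consistent with the model description (filler and innocent-suspect strengths drawn from a standard normal, the guilty suspect’s strength shifted by μg, and the strongest lineup member compared against the confidence criteria); the function and parameter names are our own illustrative choices, not the authors’ code.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_lineups(pg, mu_g, criteria, n_trials=1000, lineup_size=6):
    """Simulate collapsed lineup responses from an equal-variance SDT model.

    pg       : base rate of lineups containing the guilty suspect
    mu_g     : mean memory strength of guilty suspects (all others ~ N(0, 1))
    criteria : increasing criteria [c1, c2, c3]; the region in which the
               maximum strength falls gives low/medium/high confidence
    """
    counts = {k: 0 for k in ["sL", "sM", "sH", "fL", "fM", "fH", "rej"]}
    for _ in range(n_trials):
        strengths = rng.normal(0.0, 1.0, lineup_size)
        if rng.random() < pg:
            strengths[0] += mu_g                  # position 0 is the suspect
        best = int(np.argmax(strengths))
        region = int(np.searchsorted(criteria, strengths[best], side="right"))
        if region == 0:
            counts["rej"] += 1                    # below c1: no identification
        else:
            member = "s" if best == 0 else "f"    # suspect or filler pick
            counts[member + "LMH"[region - 1]] += 1
    # Guilt status is never recorded, mimicking the collapsed data that the
    # model is fit to; only pg governs how often position 0 is a target.
    return counts
```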
Figure 13 shows the ability of the model to recover three key parameters: the distance between target- and lure-strength distributions (μg), the identification criterion (c1), and the base rate of lineups with guilty suspects (pg). Each plot shows results for 250 fits to simulated data with the parameter value that generated the simulated data on the x-axis and the parameter value estimated in fits of those simulated data on the y-axis. In each case, the points tend to be concentrated along the positive diagonal, indicating accurate parameter recovery. Recovery for all parameters improves with larger samples, as expected. Notably, base-rate estimation is quite accurate with 1000 identification attempts per data set, with a strong majority of estimates within 5 percentage points of the true generating value. Our empirical results with 1000 identification attempts were noticeably more variable than the simulation results, suggesting that estimates based on real data are subject to additional uncertainty introduced by violations of the model’s assumptions, as discussed previously. Estimation sometimes failed for the μg parameter such that μg went to the maximum value allowed in the estimation program (which was arbitrarily set to 5). This failed estimation was most common at the smaller sample sizes.
The results in Fig. 13 show that the model is capable of distinguishing changes in base rate from changes in the other parameters, but it is not immediately obvious how the model does so. The latter question is especially mysterious given that collapsing the data seems to hide all information about base rates. We found that two aspects of the data are critical for this seemingly magical feat: the relative proportion of filler and suspect identifications and the distribution of responses across different confidence levels in each of these response categories. Although suspect identifications cannot be directly classified as correct or incorrect without knowing the guilt status of each suspect, filler identifications are known to be incorrect because the fillers are chosen from a pool of people known to be innocent. Thus, the model can gauge the extent to which things are “going well” – that is, a high proportion of suspects are guilty and witnesses are often successful in recognizing the true culprit – by evaluating the proportion of filler identifications (the lower the better).
How, then, does the model distinguish the two processes that might lead to troublingly high rates of filler selections: a low base rate and poor witness memory? That is where the distribution across confidence comes in. Figure 14 demonstrates the role of confidence in estimating base rates. The circles are model predictions with a baseline parameter set, and the pluses and triangles are predictions generated by changing either the memory-strength parameter (μg) or the base-rate parameter (pg), respectively, to increase the proportion of collapsed suspect identifications by the same amount. Boosting suspect identifications by improving memory strength also strongly increases the confidence for collapsed suspect identifications. The boost in collapsed suspect picks is produced because witnesses with a better memory are more likely to recognize guilty suspects in the subset of lineups that have them, and stronger memory also increases confidence in these guilty-suspect picks. Thus, increasing memory strength results in fewer simulated suspect identifications being made with low and medium confidence. This effect occurs because the target-strength distribution shifts to the right with a higher μg value (i.e., increased memory strength), which also means that a greater proportion of this distribution falls in the highest confidence region. In contrast, increasing collapsed suspect identifications by increasing the base rate of guilty suspects has a flatter response profile across the confidence levels, showing a less dramatic increase in high-confidence suspect identifications compared to a memory change, and no decreases in responses made with medium or low confidence. In this scenario, the change in collapsed suspect picks is not driven by a change in responses to guilty or innocent suspects, but by a higher-level change in how these two trial types are proportionally mixed. Showing witnesses more lineups with guilty suspects means that there are more trials likely to produce a suspect identification, but does not mean that witnesses will be more confident when they do identify a guilty suspect.
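This contrast can be explored with the simulator sketched above: hold the criteria fixed and increase either μg or pg from a common baseline, then compare the confidence profiles of the suspect and filler identifications. The specific parameter values here are arbitrary illustrations, not those used to generate Fig. 14.

```python
baseline  = simulate_lineups(pg=0.5, mu_g=1.5, criteria=[1.0, 1.75, 2.5], n_trials=200_000)
stronger  = simulate_lineups(pg=0.5, mu_g=2.5, criteria=[1.0, 1.75, 2.5], n_trials=200_000)
higher_pg = simulate_lineups(pg=0.8, mu_g=1.5, criteria=[1.0, 1.75, 2.5], n_trials=200_000)

for label, run in [("baseline", baseline), ("higher mu_g", stronger), ("higher pg", higher_pg)]:
    suspect_profile = [run["sL"], run["sM"], run["sH"]]   # low/medium/high suspect IDs
    filler_profile = [run["fL"], run["fM"], run["fH"]]    # low/medium/high filler IDs
    print(label, suspect_profile, filler_profile)
```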
Interestingly, the difference in response distributions due to changes in base rate and memory strength cannot be eliminated by allowing the confidence criteria to vary. The critical reason is that changing the confidence criteria also changes the confidence distribution for filler identifications. Figure 14 shows that the distinct confidence profiles for base rate and memory changes occur only for suspect selections and not filler identifications. Thus, changing the confidence criteria cannot make a base-rate change look like a shift in memory strength because the filler identification decisions would be distorted in detectable ways.
We also used the simulations to assess the theoretical ability of the model to estimate PPG. Figure 15 shows a scatterplot of the actual PPG for each data set of 1000 simulated witnesses and the estimated PPG generated by fitting the model to each data set. Similar to the parameter recovery results, the points closely follow the positive diagonal, indicating excellent recovery of PPG. Again, these results are cleaner than the analyses that bootstrapped data from real data sets, suggesting that the analyses of real data are subject to additional uncertainty based on violations of model assumptions.