A major goal of eyewitness research is to develop procedures that maximize correct identifications and minimize incorrect identifications by eyewitnesses. The sequential lineup has been proposed as one such procedure (Lindsay & Wells, 1985). In contrast to the more traditional simultaneous lineup, in which all items are presented to the eyewitness at the same time, items in the sequential lineup are presented one at a time. Past research had suggested that the sequential lineup is superior to the simultaneous lineup because it leads to a reduced number of incorrect identifications without affecting the number of correct identifications (e.g. Wells, Memon, & Penrod, 2006), suggesting that memory for the perpetrator is expressed more efficiently in the sequential lineup. However, recent studies have drawn the opposite conclusion, finding that simultaneous presentation is superior (e.g., Clark, 2012; Mickes, Flowe, & Wixted, 2012). This raises the question of whether memory for the perpetrator is greater in the sequential lineup compared to the simultaneous lineup or vice versa. In order to answer this question, we argue that it is necessary to apply formal models specific to each procedure in order to measure underlying memory strength and response bias. Our aim in this paper is to develop such models and to apply them to both existing and new data to answer the question of whether memory is the same or different between simultaneous and sequential lineups.

### The sequential lineup

Lineups are typically presented simultaneously, with all lineup items shown at the same time in a single array. A witness may either identify an item as the target (i.e., corresponding to their memory of the perpetrator) or reject the lineup, indicating that no item is a suitable match. In a sequential lineup, as originally proposed by Lindsay and Wells (1985), each lineup item is presented one at a time and, for each item, the witness is asked to judge if it matches their memory of the target by making a “yes/no” judgement. If the witness responds “yes”, the procedure terminates and the remaining lineup items (if any) are not shown. If they respond “no”, they are shown the next lineup item if there is one. The lineup is implicitly rejected if the witness responds “no” to all available lineup members. Variations of this procedure have also been proposed, which do not enforce the immediate stopping rule. These alternatives may permit witnesses to see remaining lineup members after an identification is made (Wilson, Donnelly, Christenfeld, & Wixted, 2019), require witnesses to view all lineup members before making an identification, or allow (or require) witnesses to make a second pass (a second “lap”) through the procedure (Horry, Brewer, Weber, & Palmer, 2015; Seale-Carlisle, Wetmore, Flowe, & Mickes, 2019).

Lindsay and Wells (1985) originally proposed the sequential lineup based on a theoretical distinction between absolute and relative judgement strategies (Wells, 1984). A relative judgement is said to occur when a witness selects the lineup item most similar to their memory of the target *relative* to the other items. Such a strategy would tend to lead to a high false positive rate because there is a basis for identification even when memory of the perpetrator is poor or the target is not a member of the lineup. An absolute judgement is said to occur when an identification judgement does not depend on the similarity of other lineup items to the witness’ memory of the target. Such a strategy would tend to lead to lower false positive rates because witnesses have a basis to reject the lineup when memory of the target is poor or if the target is not present. Lindsay and Wells (1985) suggested that the sequential lineup would encourage an absolute decision strategy by removing the opportunity to compare lineup items. Consistent with this, Lindsay and Wells (1985) found that sequential presentation led to significantly fewer innocent suspect identifications than simultaneous presentation, accompanied by a relatively small reduction in target identifications. This pattern of results, termed the sequential superiority effect, has been identified in many subsequent studies and in two meta-analyses (Steblay, Dysart, Fulero, & Lindsay, 2001; Steblay, Dysart, & Wells, 2011). Based on this evidence, researchers have successfully advocated a policy shift toward sequential presentation, which has led to its adoption in various forms in 30% of US jurisdictions and in Canada and the United Kingdom (Police Executive Research Forum, 2013; Seale-Carlisle & Mickes, 2016).

### Diagnostic feature detection theory

The interpretation of the sequential superiority effect has recently been challenged by Wixted and Mickes (2014). They have proposed the diagnostic feature detection theory (DFDT) of lineup identification, which predicts a memory advantage for simultaneous lineups compared to sequential lineups. According to this theory, correct identification (and rejection) of a lineup is based on identifying diagnostic features of the different lineup items. A diagnostic feature is one that is uniquely shared by a lineup item and the witness’ memory of the target which, if identified, would support a correct identification. A non-diagnostic feature is one that is shared by all lineup items (e.g. hair colour) which, even if it matches the witness’ memory of the target, cannot support a correct identification. Wixted and Mickes (2014) argued that because a witness is better able to compare the features of different lineup items in a simultaneous lineup, they are better able to identify features that are diagnostic and to discount those that are not.

The absolute/relative judgement account proposed by Lindsay and Wells (1985) and DFDT make opposite predictions about the relative merits of simultaneous and sequential lineups - both cannot be correct. This has led to a re-evaluation of the sequential superiority effect and a re-examination of how eyewitness performance is measured. Specifically, researchers have argued that much of the early research on the sequential lineup obscured potential shortcomings of the sequential procedure by treating the accompanying small reduction in perpetrator identifications as inconsequential (Clark, 2012; Moreland & Clark, 2016). In addition, recent research employing receiver operating characteristic (ROC) analysis, derived from signal detection theory, has found evidence that simultaneous presentation may, in fact, outperform sequential presentation (e.g. Carlson & Carlson, 2014; Dobolyi & Dodson, 2013). We discuss each of these issues in turn.

### Measuring identification performance

In many earlier studies of the sequential superiority effect, eyewitness performance was measured using the diagnosticity ratio, defined as the ratio of the correct target identification (TID) rate to the incorrect innocent suspect identification (SID) rate, i.e. the false positive rate. A TID is made when the witness correctly identifies the target in the lineup. An SID is made when the target is not a member of the lineup and the witness incorrectly identifies the innocent suspect. On this measure, a lineup procedure that reliably generates a higher diagnosticity ratio is to be preferred to one that does not.

An alternative performance measure is based on signal detection theory (Wixted & Mickes, 2012, 2015a, 2015b) and proposes that performance should be judged in terms of the level of correct identifications that can be obtained for a given level of incorrect suspect identifications. This is termed empirical discriminability, and maximizing it jointly addresses the two kinds of identification error discussed previously (Wixted & Mickes, 2018). Empirical discriminability can be measured by constructing an ROC curve. In the context of lineup tasks, this is a plot of TID rates against SID rates at different levels of response bias - the general willingness of a decision-maker to make an identification. In perceptual research, different levels of response bias are achieved by varying payoffs that differentially weight correct and false positive responses, leading decision-makers to be biased towards one kind of response over another. In many recognition memory experiments, post-decision confidence ratings are used as a proxy for different levels of response bias. These may be recorded on a Likert scale or a 0–100% scale with the number of bins set by the researcher.

Figure 1 displays ROCs for two hypothetical show-up procedures. A show-up is a lineup consisting of only one item. These ROC curves have the same form as found in laboratory-based yes-no recognition memory tasks, extending from the extreme lower left to the extreme upper right. The two curves in Fig. 1 differ in empirical discriminability, which is greater for the curve that is closer to the top-left corner. This curve, corresponding to Procedure B in this example, always has a higher correct identification rate for any given incorrect identification rate. If empirical discriminability is zero, the ROC curve falls on the main diagonal indicating chance performance. Following this logic, empirical discriminability can be measured by calculating the area under the ROC curve (AUC). The greater the AUC, the greater the empirical discriminability. The AUC measure is independent of response bias because any combination of correct and incorrect identification rates on the same ROC curve is associated with the same AUC. Accordingly, because Procedure B has greater AUC than Procedure A, it has greater empirical discriminability.

Each point on the ROC curve corresponds to a different response bias and is associated with a given diagnosticity ratio. It is here that the contrast between empirical discriminability and the diagnosticity ratio becomes apparent - the same ratio can be found on different ROC curves corresponding to different levels of discriminability (Gronlund, Wixted, & Mickes, 2014; Rotello, Heit, & Dubé, 2015). This feature is shown in Fig. 1 by the set of dashed lines each of which corresponds to a different diagnosticity ratio (1.0, 1.5, 2.5, 5.0, or 10.0). As can be seen, these lines intersect each of the two ROC curves at different points showing that, all else being equal, the more conservative the response bias (associated with lower false positive rates), the larger the diagnosticity ratio. It is clear from this that the diagnosticity ratio is simply a measure of response bias, independent of empirical discriminability.
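
This contrast can be made concrete with a small sketch. The code below traces a single ROC curve for a hypothetical equal-variance show-up model (an illustrative underlying discriminability of *d* = 1.5, not a value estimated from any study) and shows that the diagnosticity ratio changes along the curve while the AUC is one fixed number:

```python
import numpy as np
from math import erf, sqrt

def Phi(v):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(v / sqrt(2.0)))

def showup_roc(d, criteria):
    """TID and SID rates for a show-up under an equal-variance SDT model.

    With a single lineup item, the TID rate is P(target familiarity >= c)
    and the SID rate is P(foil familiarity >= c)."""
    tid = np.array([1.0 - Phi(c - d) for c in criteria])
    sid = np.array([1.0 - Phi(c) for c in criteria])
    return tid, sid

criteria = np.linspace(-3.0, 3.0, 121)   # lenient -> conservative
tid, sid = showup_roc(1.5, criteria)

# The diagnosticity ratio (TID rate / SID rate) grows as the response
# bias becomes more conservative ...
ratios = tid / sid

# ... while the area under this single ROC curve is one fixed number
# (trapezoidal rule; points ordered by increasing SID rate).
order = np.argsort(sid)
t, s = tid[order], sid[order]
auc = float(np.sum((s[1:] - s[:-1]) * (t[1:] + t[:-1]) / 2.0))
```

Here `ratios` rises steadily as the criterion moves rightward, whereas `auc` is the same whichever point on the curve a procedure happens to operate at - the relationship illustrated by the dashed iso-ratio lines in Fig. 1.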

### Task dependence of ROC curves

Empirical discriminability provides an objective criterion against which different lineup procedures may be compared. On this view, any procedure that leads to a higher correct identification rate for any given false positive rate is to be preferred (Wixted & Mickes, 2012). However, DFDT is concerned with *underlying* discriminability, i.e. memory strength (Wixted & Mickes, 2018). It proposes that the feature detection mechanism facilitated by simultaneous presentation leads to greater underlying discriminability compared to sequential presentation, and that this explains the superior empirical discriminability of simultaneous presentation observed in some studies using ROC analysis (e.g. Carlson & Carlson, 2014). ROC analysis may be uninformative with respect to underlying discriminability when the procedures being compared have different structural characteristics. In this case, the shapes of the ROC curves and the resulting empirical discriminability associated with each procedure may differ substantially even when underlying discriminability is the same (Rotello & Chen, 2016; Stephens, Dunn, & Hayes, 2019).

A dissociation between empirical and underlying discriminability due to structural features of a task is illustrated in Fig. 2a. This shows a family of hypothetical ROC curves derived from lineups of different sizes. These curves were generated using SDT-MAX, a signal detection theory (SDT) model of the simultaneous lineup that we define later (the relevant formulas are given in Additional file 1). This model is based on a signal detection framework in which there is a normal distribution of familiarity values for the target item and another normal distribution for foil items, including the innocent suspect. For each lineup size, although underlying discriminability (i.e. the difference between the familiarity distributions of the target and foils) is the same, the shape and termination point of each ROC curve is different. Each curve terminates at a different point because, under the most lenient response bias (i.e. always select a lineup member), there is a 1/*n* chance of choosing the innocent suspect, where *n* is the lineup size. Thus, because *n* differs between the curves, each must terminate at a different point corresponding to a false positive rate of 1/*n*.
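
The 1/*n* termination point can be checked with a brief Monte Carlo sketch. The snippet below assumes an equal-variance max-familiarity decision rule, no designated innocent suspect, and an arbitrary illustrative *d* = 1.5; it is an illustration of the structural property, not the model fits reported later:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_rule_roc_point(n, d, c, trials=100_000):
    """One ROC point for an n-item simultaneous lineup under a
    max-familiarity rule with equal variance and no designated suspect.

    Target-present: one draw from N(d, 1) plus n-1 foils from N(0, 1).
    Target-absent: n foil draws; any identification counts as a suspect
    identification with probability 1/n."""
    target = rng.normal(d, 1.0, trials)
    foils = rng.normal(0.0, 1.0, (trials, n - 1))
    tid_rate = np.mean((target >= c) & (target >= foils.max(axis=1)))
    ta = rng.normal(0.0, 1.0, (trials, n))
    sid_rate = np.mean(ta.max(axis=1) >= c) / n
    return tid_rate, sid_rate

# With the criterion pushed far to the left (always choose someone),
# the SID rate pins to 1/n - exactly where each curve in Fig. 2a stops.
endpoints = {n: max_rule_roc_point(n, d=1.5, c=-10.0)[1] for n in (2, 6, 12)}
```

Because every target-absent lineup produces an identification under the most lenient bias, `endpoints[n]` equals 1/*n* for each lineup size, so larger lineups necessarily truncate their ROC curves at smaller false positive rates.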

Because the ROC curves in Fig. 2a were all generated from the same underlying signal detection model, the differences are due to a structural characteristic of the lineup task - specifically the lineup size. This means that differences in empirical discriminability between these tasks do not indicate differences in underlying discriminability (which is the same for each curve).

From the foregoing, it should come as no surprise that structural characteristics of the sequential lineup also change the shape of the ROC curve. In this case, it is not the size of the lineup that is critical, but the minimum level of evidence required to make an identification. Figure 2b shows a set of ROC curves for a sequential lineup of size 6, each constructed with a different minimum level of evidence. The ROC curves shown by thin solid lines in Fig. 2b illustrate different choices for the minimum level of evidence expressed in terms of a decision criterion on the familiarity axis. The value of this criterion is indicated at the end of each corresponding ROC curve. A large value indicates a conservative response bias for which a relatively high level of familiarity is required for a lineup item to trigger identification. A small value indicates a lenient response bias for which a relatively low level of familiarity is sufficient to trigger identification. Each of these ROC curves terminates at a different point. In the limit, when the minimum evidence is very low, the ROC curve terminates on the main diagonal (indicated by the dotted line in Fig. 2b). The ROC curve shown by the thick solid line corresponds to the situation in which each witness has a different level of minimum evidence. It encloses the set of confidence-based ROC curves and is clearly non-monotonic. Rotello and Chen (2016) observed a similar shaped curve in their simulations of the sequential lineup, as did Wilson et al. (2019) in empirical sequential lineup data.

Figure 2b also shows the ROC curve generated from a simultaneous lineup of size 6 as shown in Fig. 2a (by the curve labelled 6). Altogether, these curves show that even when underlying discriminability is held constant, the shapes of ROCs and the corresponding empirical discriminability values differ to a considerable degree. It is therefore important to distinguish two research questions. One question is about empirical discriminability - for any given false identification rate, which procedure leads to higher correct identification rates? The ROC curves shown in Fig. 2a and b suggest that simultaneous lineups are preferred to sequential lineups and, within the class of simultaneous lineups, smaller lineup sizes are preferred to larger lineup sizes. Empirical research also supports this conclusion, at least with respect to simultaneous, as compared to sequential lineups (Carlson & Carlson, 2014; Dobolyi & Dodson, 2013; Experiment 1a Mickes et al., 2012; Neuschatz et al., 2016), although this has not always been found (Flowe, Smith, Karoglu, Onwuegbusi, & Rai, 2016; Gronlund et al., 2012; Experiment 1b and 2 Mickes et al., 2012; Sučić, Tokić, & Ivešić, 2015).

The second question bears on DFDT and concerns underlying discriminability - which eyewitness test procedure reveals higher levels of memory strength? ROC curves and the AUC cannot be used to answer this question. As shown above, they may not reflect underlying discriminability across different lineup procedures. In order to measure underlying discriminability, it is necessary to use a formal model to measure the parameter of interest. In this section we outline two models of the simultaneous lineup task based on signal detection theory (SDT-MAX and SDT-INT) and develop a comparable model of the simple stopping rule version of the sequential lineup task (SDT-SEQ). We then apply these models to extant and new data to estimate memory strength across the two procedures.

### Unequal variance signal detection model

The starting point for all the lineup models we consider is the unequal variance signal detection (UVSD) model. The UVSD model accounts well for data in laboratory-based recognition memory tests (Jang, Wixted, & Huber, 2009; Mickes, Wixted, & Wais, 2007) and can be extended to account for lineup tasks. In a typical eyewitness experiment, a participant views a simulated crime conducted by a perpetrator and is subsequently shown an *n*-item lineup. In a target present (TP) lineup, one item is the *target* (a picture of the perpetrator) and the remaining items are foils or fillers (pictures of other people). In a target absent (TA) lineup, one item may be designated as the innocent *suspect* with the remaining items being foils. The participant is required to judge whether the lineup contains the target and, if they believe it does, to identify the corresponding item. We assume that each lineup item is associated with a familiarity value that reflects its similarity to the participant’s memory of the perpetrator. Each familiarity value is considered a random draw from one of several distributions - a target distribution if the item is a target, an innocent suspect distribution if it is an innocent suspect^{Footnote 1}, and a foil distribution if it is a foil. In order for the models to be testable, we assume that each distribution is Gaussian. Consistent with most signal detection models, the foil distribution is assigned a mean of zero and a standard deviation of one. The target distribution has mean *d*_{t} and standard deviation *s*_{t}, both of which can be estimated from the data. Because *s*_{t} may not equal one, the model is called the *unequal variance* signal detection model. In addition, because the innocent suspect may be distinct from the remaining foils, the suspect distribution has mean *d*_{s} and standard deviation *s*_{s}.
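
The generative assumptions above can be sketched in a few lines. The parameter values here (*d*_{t} = 1.5, *s*_{t} = 1.25, *n* = 6) are purely illustrative, not estimates from any dataset, and for simplicity no distinct innocent suspect distribution is drawn:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative UVSD parameter values (not estimates from any dataset)
d_t, s_t = 1.5, 1.25   # mean and SD of the target familiarity distribution
n = 6                  # lineup size

def lineup_familiarity(target_present):
    """Familiarity values for one n-item lineup under the UVSD model.

    Foils are drawn from N(0, 1); in a target-present lineup one item
    (position 0 here; positions are exchangeable) is drawn from
    N(d_t, s_t) instead."""
    x = rng.normal(0.0, 1.0, n)
    if target_present:
        x[0] = rng.normal(d_t, s_t)
    return x

tp_lineup = lineup_familiarity(True)    # one target-present draw
ta_lineup = lineup_familiarity(False)   # one target-absent draw
```

Every lineup model considered below starts from familiarity vectors of exactly this form and differs only in the decision rule applied to them.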

A lineup can be considered as a combination of a detection question, “Is the target present?”, and an identification question, “If so, which item is the target?” (Duncan, 2006). While the answer to the identification question is relatively straightforward - always choose the lineup member associated with the greatest familiarity - the answer to the detection question is less clear-cut. This leads to different models based on different decision rules. Although there is a wide range of possible decision rules, we consider two in particular, which we call SDT-MAX and SDT-INT. In the SDT-MAX model, the decision rule is to compare the familiarity value of the most familiar lineup item (the maximum) to a response criterion. In the SDT-INT model, the decision rule is to compare the *sum* of the familiarity values of the lineup items to a response criterion. For both of these models, if the relevant value exceeds the criterion, the most familiar item is identified as the target. The sequential lineup requires a different treatment. Because the witness does not see all the lineup items until the end, and may not see all of them if they make an identification before reaching the end, it is not possible before that point to compute the maximum, the sum, or any other function of the familiarity values of the entire lineup. For this reason, we developed a dedicated model of the sequential lineup, which we call SDT-SEQ.

### SDT-MAX

SDT-MAX, also known as the independent observations model (Duncan, 2006; Wixted, Vul, Mickes, & Wilson, 2018), is perhaps the simplest model of the simultaneous lineup. In this model, identification decisions are made with respect to a set of *k* decision criteria, *C* = {*c*_{1}, …, *c*_{k}} such that *c*_{1} < *c*_{2} < … < *c*_{k}, that define a set of *k* + 1 confidence levels. Let *X* = {*x*_{1}, …, *x*_{n}} be the set of familiarity values associated with each of *n* lineup items. Let *x*_{m} = max(*X*) be the maximum familiarity value associated with item *m*. The decision rule is this: if *x*_{m} < *c*_{1} then reject the lineup, otherwise choose lineup item *m* with confidence level *l* where *c*_{l} is the largest element of the set, {*c*_{i} ∈ *C* : *x*_{m} ≥ *c*_{i}}.
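
The decision rule can be transcribed directly. The sketch below is a minimal implementation of the rule as stated, with familiarity values and criteria supplied by the caller:

```python
def sdt_max_decision(x, criteria):
    """Apply the SDT-MAX decision rule.

    x        -- familiarity values of the n lineup items
    criteria -- confidence criteria, sorted ascending (c1 < ... < ck)

    Returns (item_index, confidence_level); (None, 0) signals a rejection.
    Confidence level l means the maximum familiarity reached criterion
    c_l but not c_{l+1}."""
    m = max(range(len(x)), key=lambda i: x[i])   # most familiar item
    if x[m] < criteria[0]:
        return None, 0                           # reject the lineup
    level = max(l for l, c in enumerate(criteria, start=1) if x[m] >= c)
    return m, level
```

For example, with criteria [0.5, 1.0, 2.0], familiarities [0.2, 1.8, -0.5] yield an identification of item 1 at confidence level 2, because the maximum (1.8) exceeds *c*_{2} but not *c*_{3}.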

As detailed in Additional file 1, we derive general formulas for the probability of a correct identification and the probability of a false identification under the SDT-MAX model. We summarize these below under the assumption that all the underlying distributions are Gaussian. Let *ϕ*(*x*; *μ*, *σ*) be the normal probability density function and let Φ(*x*; *μ*, *σ*) be the normal cumulative distribution function evaluated at *x* ∈ ℝ. Recall that the foil distribution takes the form of the standard normal distribution with *µ* = 0 and *σ* = 1. In this case, we write *ϕ*(*x*; 0, 1) = *ϕ*(*x*) and Φ(*x*; 0, 1) = Φ(*x*). Let *c* ∈ *C* be a decision criterion and let *P*_{TID}(*c*) be the probability of a correct target identification with confidence greater than or equal to *c*. Then

$$ {P}_{TID}(c)={\int}_c^{\infty}\phi \left(x;{d}_t,{s}_t\right)\kern0.1em \Phi {(x)}^{n-1} dx. $$

Similarly, let *P*_{SID}(*c*) be the probability of an incorrect suspect identification with confidence greater than or equal to *c*. Then, if there is a designated innocent suspect,

$$ {P}_{SID}(c)={\int}_c^{\infty}\phi \left(x;{d}_s,{s}_s\right)\kern0.1em \Phi {(x)}^{n-1} dx, $$

otherwise,

$$ {P}_{SID}(c)=\frac{1}{n}\left(1-\Phi {(c)}^n\right). $$
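
These expressions are straightforward to evaluate numerically. The sketch below integrates the *P*_{TID} formula by the trapezoidal rule and checks it against a Monte Carlo simulation of the max rule, using illustrative parameter values (*n* = 6, *d*_{t} = 1.5, *s*_{t} = 1, *c* = 0.5):

```python
import numpy as np
from math import erf, sqrt, pi

Phi = np.vectorize(lambda v: 0.5 * (1.0 + erf(v / sqrt(2.0))))  # normal CDF

def phi(x, mu, sigma):
    """Normal probability density function."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

def p_tid_max(c, d_t, s_t, n):
    """P_TID(c) for SDT-MAX: integrate phi(x; d_t, s_t) * Phi(x)^(n-1)
    over x >= c (trapezoidal rule on a fine grid)."""
    x = np.linspace(c, c + 20.0, 4001)
    y = phi(x, d_t, s_t) * Phi(x) ** (n - 1)
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

def p_sid_max_no_suspect(c, n):
    """P_SID(c) with no designated innocent suspect."""
    return float((1.0 - Phi(c) ** n) / n)

# Monte Carlo check: a TID occurs when the target's familiarity exceeds
# c and also exceeds every foil's familiarity (the Phi(x)^(n-1) term).
rng = np.random.default_rng(2)
N, n, d_t, s_t, c = 200_000, 6, 1.5, 1.0, 0.5
target = rng.normal(d_t, s_t, N)
foils = rng.normal(0.0, 1.0, (N, n - 1))
mc = np.mean((target >= c) & (target >= foils.max(axis=1)))
```

The Φ(*x*)^{*n*−1} term is simply the probability that all *n* − 1 foils fall below the target's familiarity *x*, which is why the simulated and integrated values agree.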

### SDT-INT

Let \( \mathrm{sum}(X) \) be the sum of the familiarity values of all the lineup items. The decision rule is this: if sum(*X*) < *c*_{1} then reject the lineup, otherwise choose lineup item *m* (the most familiar item) with confidence level *l*, where *c*_{l} is the largest element of the set {*c*_{i} ∈ *C* : sum(*X*) ≥ *c*_{i}}.

The equations for the probability of a correct identification and probability of a false identification under the SDT-INT model are summarized below (see Additional file 1 for details).

$$ {\displaystyle \begin{array}{rl}{P}_{TID}(c)& =\Pr \left(\mathrm{sum}(X)\ge c\mid m=t\right)\cdot \Pr \left(m=t\right)\\ {}& \approx {\int}_{-\infty}^{\infty}\left(1-\Phi \left(c-x;\left(n-1\right){\mu}_x,\sqrt{\left(n-1\right)}{\sigma}_x\right)\right)\phi \left(x;{d}_t,{s}_t\right)\Phi {(x)}^{n-1} dx\end{array}} $$

where *t* is the position of the target item and *μ*_{x} and *σ*_{x} are the mean and standard deviation, respectively, of the standard normal distribution truncated at the upper limit of *x*. The equation is not exact because it assumes that the sum of truncated distributions is approximately normal (by the Central Limit Theorem). Similarly, if there is a designated innocent suspect, then

$$ {P}_{SID}(c)\approx {\int}_{-\infty}^{\infty}\left(1-\Phi \left(c-x;\left(n-1\right){\mu}_x,\sqrt{\left(n-1\right)}{\sigma}_x\right)\right)\phi \left(x;{d}_s,{s}_s\right)\Phi {(x)}^{n-1} dx, $$

otherwise,

$$ {P}_{SID}(c)=\frac{1}{n}\left(1-\Phi \left(c;0,\sqrt{n}\right)\right). $$
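
The last expression is exact and easy to verify by simulation, since the sum of *n* independent standard normal familiarity values is itself normal with mean 0 and standard deviation √*n*. The sketch below checks it for illustrative values *n* = 6 and *c* = 2 (the two approximate integrals above are not checked here):

```python
import numpy as np
from math import erf, sqrt

def Phi(v):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(v / sqrt(2.0)))

rng = np.random.default_rng(3)
n, c, N = 6, 2.0, 400_000

# Target-absent lineups, no designated suspect: the lineup is chosen
# when the summed familiarity reaches c, and the resulting identification
# lands on any given position (hence on the suspect) 1/n of the time.
ta = rng.normal(0.0, 1.0, (N, n))
sid_mc = np.mean(ta.sum(axis=1) >= c) / n

# Exact value: sum of n standard normals is N(0, sqrt(n)), so
# P_SID(c) = (1/n) * (1 - Phi(c; 0, sqrt(n))).
sid_exact = (1.0 - Phi(c / sqrt(n))) / n
```

The factor 1/*n* appears because, under the INT rule, the summed evidence determines *whether* an identification is made, while the identified item is the most familiar one, which is equally likely to be any position in a target-absent lineup.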

### SDT-SEQ

Our model for sequential presentation is also based on the UVSD framework and incorporates the “first-above-criterion” decision rule where presentation of the lineup items is terminated as soon as an identification is made. As detailed in Additional file 1, we derive the following equations for the probability of a correct identification and probability of a false identification under the SDT-SEQ model. Let *p*_{i} be the probability that the item in lineup position *i* is a target. Then

$$ {P}_{TID}(c)=\left(1-\Phi \left(c;{d}_t,{s}_t\right)\right)\sum \limits_{i=1}^n{p}_i\Phi {\left({c}_1\right)}^{i-1}. $$

If there is a designated innocent suspect, let *q*_{i} be the probability that the lineup item at position *i* is the suspect. Then,

$$ {P}_{SID}(c)=\left(1-\Phi \left(c;{d}_s,{s}_s\right)\right)\sum \limits_{i=1}^n{q}_i\Phi {\left({c}_1\right)}^{i-1}, $$

otherwise,

$$ {P}_{SID}(c)=\frac{1}{n}\left(1-\Phi (c)\right)\sum \limits_{i=1}^n\Phi {\left({c}_1\right)}^{i-1}. $$
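
The Φ(*c*_{1})^{*i*−1} term captures the requirement that every item shown before position *i* is rejected (falls below the first criterion). The sketch below simulates the first-above-criterion stopping rule with a single criterion (*c*_{1} = *c*), a uniformly placed target (*p*_{i} = 1/*n*), and illustrative parameters, and compares the result to the *P*_{TID} formula:

```python
import numpy as np
from math import erf, sqrt

def Phi(v):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(v / sqrt(2.0)))

rng = np.random.default_rng(4)
n, d_t, s_t, c, N = 6, 1.5, 1.0, 1.0, 200_000

# Target position is uniform over the n positions (p_i = 1/n)
pos = rng.integers(0, n, N)
x = rng.normal(0.0, 1.0, (N, n))                 # foil familiarities
x[np.arange(N), pos] = rng.normal(d_t, s_t, N)   # insert the target

# First-above-criterion stopping rule: the witness identifies the first
# item whose familiarity reaches c. A TID therefore requires every foil
# shown before the target to fall below c and the target to reach c.
over = x >= c
first = np.argmax(over, axis=1)                  # first over-criterion item
tid_mc = np.mean(over.any(axis=1) & (first == pos))

# SDT-SEQ formula with a single criterion (c_1 = c) and p_i = 1/n
tid_exact = (1.0 - Phi((c - d_t) / s_t)) * sum(Phi(c) ** i for i in range(n)) / n
```

Note that, unlike the simultaneous models, nothing here depends on the familiarity of items shown after the identification, reflecting the stopping rule.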

### Palmer and Brewer (2012) database

Palmer and Brewer (2012) conducted an extensive analysis of previously published studies that compared simultaneous and stopping-rule sequential lineups under the same conditions. They fit a signal detection model, equivalent to the SDT-INT model described previously, to data from 22 previous studies. Their aim was to determine whether underlying discriminability, response bias, or both differed between sequential and simultaneous lineups. Their analysis revealed that, across the datasets, the two presentation methods did not differ in terms of underlying discriminability but that the sequential procedure was associated with more conservative responding.

While the finding of equal underlying discriminability is not consistent with DFDT, the difference in response criteria was consistent with the view that a sequential lineup produces a higher diagnosticity ratio. It is now widely accepted that sequential presentation leads to more conservative responding than simultaneous presentation (Clark, 2012; Clark, Moreland, & Gronlund, 2014; Wells, 2014; Wixted & Mickes, 2014). The apparent success of the modelling approach employed by Palmer and Brewer (2012) has also led researchers to use SDT-INT to examine other aspects of the sequential lineup (Carlson, Carlson, Weatherford, Tucker, & Bednarz, 2016; Horry et al., 2015; Horry, Palmer, & Brewer, 2012).

However, there are aspects of the Palmer and Brewer (2012) approach that challenge the validity of their conclusions. First, and most critically, the SDT-INT model was fit to data from both simultaneous and sequential lineups. No attempt was made to model the unique task demands of sequential presentation. It is therefore unknown whether the same results would be found if a more appropriate model were used, such as SDT-SEQ as described previously. Second, the SDT-INT model does not exhaust the set of decision rules for simultaneous lineups (Wixted et al., 2018). A different decision rule, such as SDT-MAX, may lead to different results. Third, Palmer and Brewer (2012) fit the SDT-INT model using an inefficient and potentially inaccurate manual grid search of parameter space. Finally, because confidence judgements were not available, it was only possible to fit an equal variance signal detection model in which *s*_{t} = *s*_{s} = 1. If this is not an appropriate model of their data, the results may be distorted.

### Summary and aims

The aim of the present paper was to compare simultaneous and sequential lineups in order to test the central prediction of DFDT that simultaneous presentation is associated with greater underlying discriminability than sequential presentation. To do this, we first re-analysed the corpus of simultaneous and sequential data from Palmer and Brewer (2012), addressing the previously described problems in their analysis. Principally, we fit a model of the sequential lineup, SDT-SEQ, specifically developed for this task, and two models of the simultaneous lineup - the SDT-INT model as used by Palmer and Brewer (2012) and the alternative SDT-MAX model. Second, we fit each model using an efficient optimisation procedure that leads to more accurate solutions. Third, we conducted a new experiment from which we obtained confidence judgements, enabling us to fit models based on the assumption of unequal variances.

### Predictions

Predictions were preregistered on the Open Science Framework, available at https://osf.io/xwp9d/. DFDT predicts that simultaneous presentation should lead to greater underlying discriminability than sequential presentation. Specifically, this means that the estimate of *d*_{t} (or the difference *d*_{t} – *d*_{s} if there is a designated suspect) should be greater for simultaneous lineups. Based on the conclusions reached by Palmer and Brewer (2012), sequential presentation is predicted to lead to more conservative responding than simultaneous presentation. This means that the estimate of *c*_{1} (and possibly other criteria) should be greater for sequential lineups.

### Model cross fit

We have described three models that we propose to fit to data. This is motivated in part by the idea that there are differences between the models that determine how well they fit different kinds of data. This means that if data are simulated from a model, while this model should fit the data well, other models should fit relatively poorly. In order to investigate this question, we conducted a cross fitting and parameter recovery analysis. First, we randomly generated 100 sets of parameter values for a 6-item lineup and then used each of these to generate 100 simulated datasets from each model. To avoid issues with low cell counts, we set the number of TP and TA lineups to 10,000, giving 20,000 simulated observations for each dataset. We then fit each model to its own sets of data and to those generated by the other models, recording the *χ*^{2} value, *p* value and parameter estimates from each fit. Further detail on the simulation process and expanded results are available in Additional file 2.
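
The generate-and-fit logic can be illustrated in miniature. The sketch below is a simplified stand-in for the actual fitting procedure (it uses SDT-SEQ with a single criterion, equal variance, no designated suspect, and a coarse grid search rather than the optimiser used for the reported fits): it simulates one dataset of 10,000 TP and 10,000 TA lineups from known parameters and recovers them by minimizing the Pearson *χ*^{2} statistic:

```python
import numpy as np
from math import erf, sqrt

def Phi(v):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(v / sqrt(2.0)))

def seq_probs(d, c, n=6):
    """Response-category probabilities for SDT-SEQ with a single
    criterion, equal variance (s_t = 1), and a uniformly placed target.

    Returns TP probabilities (TID, foil ID, rejection) and TA
    probabilities (any ID, rejection); no designated innocent suspect."""
    tid = (1.0 - Phi(c - d)) * sum(Phi(c) ** i for i in range(n)) / n
    tp_rej = Phi(c - d) * Phi(c) ** (n - 1)      # every item falls below c
    ta_id = 1.0 - Phi(c) ** n
    return (np.array([tid, 1.0 - tid - tp_rej, tp_rej]),
            np.array([ta_id, 1.0 - ta_id]))

# Simulate one dataset from known "true" parameter values
rng = np.random.default_rng(5)
d_true, c_true, n_tp, n_ta = 1.5, 1.0, 10_000, 10_000
obs_tp = rng.multinomial(n_tp, seq_probs(d_true, c_true)[0])
obs_ta = rng.multinomial(n_ta, seq_probs(d_true, c_true)[1])

def chisq(d, c):
    """Pearson chi-square between observed and expected category counts."""
    e_tp, e_ta = seq_probs(d, c)
    expected = np.concatenate([e_tp * n_tp, e_ta * n_ta])
    observed = np.concatenate([obs_tp, obs_ta])
    return float(np.sum((observed - expected) ** 2 / expected))

# Coarse grid search over (d, c) - a stand-in for a proper optimiser
d_hat, c_hat = min(((d, c) for d in np.linspace(0.5, 2.5, 81)
                           for c in np.linspace(0.0, 2.0, 81)),
                   key=lambda p: chisq(*p))
```

Fitting a *different* model's probability function to `obs_tp` and `obs_ta` in the same way is the cross-fitting step: the minimized *χ*^{2} is then typically large enough for the mismatched model to be rejected.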

Figure 3 shows the proportion of datasets where the model could be rejected at *p* < .05. It shows that when a model is fit to data generated by any other model, it is highly likely to be rejected. In other words, the models are, in principle, distinct - given sufficient statistical power, if the data are consistent with one model then they should be poorly fit by any of the remaining models.

### Parameter recovery

We measured parameter recovery by examining the correlation between generating and recovered parameter values for each model fit. Scatterplots and tables of correlations are available in Additional file 2. We were interested in two aspects of this analysis. First, it is desirable for the correlation value to be close to 1 when the models are fit to their own data. Second, it is also important to understand how well the models recover the correct parameter values when fit to data they did not generate as, in some cases, they may fit well but recover incorrect parameter estimates.

When fit to their own data, the models generally recover their own parameters well, with *r* ≥ .90 between generating and recovered parameter values. SDT-MAX recovers the generating parameters perfectly when fit to its own data, but both SDT-SEQ and, to a lesser extent, SDT-INT produce a small number of outlying estimates, which lower the correlation coefficients. These outliers are most likely due to local minima, which can be avoided by starting the parameter search from different initial values. It is evident from the scatterplots in Additional file 2 that recovery is close to perfect once these outliers are excluded.

When SDT-MAX and SDT-INT are fit to data generated by SDT-SEQ, recovery of *d*_{t} is poor. This suggests that if SDT-SEQ is a good representation of the sequential lineup task, then fitting SDT-MAX or SDT-INT to sequential lineup data may lead to inaccurate estimates of *d*_{t}. Recovery of *s*_{t} was poor for all models when fit to data they did not generate, while recovery of the decision criteria (*c*_{1}, …, *c*_{5}) was generally good for all fits, with *r* ≥ .80.