
The impact of AI errors in a human-in-the-loop process

Abstract

Automated decision-making is becoming increasingly common in the public sector. As a result, political institutions recommend the presence of humans in these decision-making processes as a safeguard against potentially erroneous or biased algorithmic decisions. However, the scientific literature on human-in-the-loop performance is not conclusive about the benefits and risks of such human presence, nor does it clarify which aspects of this human–computer interaction may influence the final decision. In two experiments, we simulate an automated decision-making process in which participants judge multiple defendants in relation to various crimes, and we manipulate the time at which participants receive support from a purported automated artificial intelligence system (before or after they make their judgments). Our results show that human judgment is affected when participants receive incorrect algorithmic support, particularly when they receive it before providing their own judgment, resulting in reduced accuracy. The data and materials for these experiments are freely available at the Open Science Framework: https://osf.io/b6p4z/. Experiment 2 was preregistered.

Background

The presence of artificial intelligence algorithms and automated systems in public sector decisions (Araujo et al., 2020; Eubanks, 2018; O’Neil, 2016), such as social assistance (Civio, 2022; De-Arteaga et al., 2020; López-Ossorio et al., 2016), justice (Casacuberta & Guersenzvaig, 2018; Larson et al., 2016; Martínez-Garay, 2016; Niiler, 2019), health (Obermeyer et al., 2019; Raghu et al., 2019), and education (Alon-Barkat & Busuioc, 2022; Duncan et al., 2020), is becoming increasingly common.

Thus, many countries already use automated decision support systems, which are often based on artificial intelligence (Solans et al., 2022). In the judicial context, which is the focus of this article, examples include the United States (Berkman Klein Center, 2022), the United Kingdom (Ministry of Justice, 2013), China (Wei, 2019), Estonia (Niiler, 2019), Argentina (Ministerio Público Fiscal de la Ciudad Autónoma de Buenos Aires, 2020), Poland (Ministerstwo Sprawiedliwości, 2021), and Spain (Capdevila et al., 2015; Valdivia et al., 2022). In these cases, the system usually does not make the decision fully autonomously, but rather supports the human decision process (Araujo et al., 2020) in different ways, such as gathering and summarizing the information needed for the decision, or recommending a particular decision (Binns & Veale, 2021). This practice of introducing humans into an automated decision process is known in the literature as a human-in-the-loop process. The idea is that this human presence should guarantee a better final decision, because humans supervise the system and can intervene appropriately to prevent and/or mitigate errors that the automated system might make (Ponce, 2022; Portela & Álvarez, 2022).

Existing legislation and policy recommendations on automated decision systems (often designated by a variety of terms such as algorithm, artificial intelligence system, AI technology, or robot; European Commission, 2019) emphasize the right of citizens not to be subject to a fully automated decision, and point to the importance of the human presence in the process as a safeguard and protection against a possible erroneous or biased algorithmic decision (Green, 2022; Portela & Álvarez, 2022).

However, this approach is not without difficulties. Achieving appropriate interaction between humans and automated systems is complex because it requires, among other things, that the humans involved in the process have the skills, experience, motivation, and time to interpret and critically manage the information provided by the system (Ponce, 2022; Portela & Álvarez, 2022), can understand how these systems work, and are able to disagree with the automated system’s decision (Green, 2022) in the event of a conflict between human judgment and algorithmic recommendation (Valdivia et al., 2022).

There is a large body of empirical evidence questioning the human ability to disagree with or override an automated decision. In fact, for more than two decades, the scientific literature has pointed to a human tendency to use the information provided by support systems as a shortcut to avoid searching for or processing other relevant information. In this way, people demonstrate compliance with the system’s decision or delegate their decision to the system. Excessive human compliance when the system’s assessment is erroneous is often referred to in the engineering and artificial intelligence fields as automation bias (Cummings, 2004; Lyell & Coiera, 2017; Mosier & Manzey, 2019; Parasuraman & Mustapha, 1996), and this effect has been documented in domains as diverse as aviation, healthcare, military, and process control (see meta-analysis by Mosier & Manzey, 2019). An example of this automation bias is found in Lyell et al. (2017). In this study on drug prescribing using an automated decision support system, the researchers found that when the system erroneously indicated that a drug was not appropriate for a patient, prescribing errors increased by 56.9%.

In the judicial system, however, recent work reports a less consistent and less robust effect of this automation bias. On the one hand, there are cases of algorithm implementation that suggest excessive human compliance with system decisions, as in the case of RisCanvi, the system used for assessing the risk of recidivism of inmates in Catalonia, Spain. According to Saura and Aragó (2021), government officials using RisCanvi disagree with the algorithm only 3.2% of the time. This is so even though, as shown in the most recent general report on the performance of this system, RisCanvi has a positive predictive capacity of 18%; that is, only about two inmates out of ten classified as high risk end up confirming the system’s prediction and reoffending (Capdevila et al., 2015; Martínez-Garay, 2016). However, this information about the poor predictive capacity of RisCanvi is not visible when the system is used, and is therefore likely to be unknown to the government officials who use it.

On the other hand, several empirical studies using similar forensic AI models seem to suggest the opposite result (Green & Chen, 2019a, 2019b, 2021; Grgic-Hlaca et al., 2019; Portela et al., 2022; Skeem et al., 2020). For example, Grgic-Hlaca et al. (2019) conducted an experiment in which participants first had to predict, without AI support, whether some defendants would reoffend within two years. The researchers then showed the participants the recidivism prediction estimated by a computer program, and the participants were asked to indicate their prediction again. The researchers also showed the participants the accuracy rate of the computer program (68%). Only in a minority of cases did the participants adjust their prediction after seeing the computer’s estimate, showing a low level of automation bias. According to the authors, the system’s reported 32% error rate probably contributed to the low bias observed. In addition, and as we will discuss below, the fact that the algorithmic prediction was presented after, rather than before, the participants provided their judgments may also have been a critical factor in the low compliance observed in this study.

Moreover, in a related experiment, Green and Chen (2019a) manipulated the race of the defendant to assess how this data affected compliance with algorithmic support. They found that automation bias increased when the algorithm predicted a high risk of recidivism in cases where the defendant was black, and a low risk of recidivism in cases where the defendant was white. That is, participants agreed with the algorithm when it confirmed their own prejudices.

Quite possibly, these contradictions about the impact of automated system support on human-in-the-loop processes are due in part to the wide disparity in the methodological procedures used. Existing work on this topic evaluates the role of different human decision-makers performing different tasks, in different countries, in different domains, and with very different decision processes. Studies also vary in terms of whether or not participants are informed about the predictive accuracy of the algorithm; what the algorithm is called (decision support system, computer program, algorithm, or artificial intelligence among others); whether or not the system provides erroneous support; whether or not an explanation of the criteria followed by the system is provided; whether or not participants receive feedback on how accurate their decision was; and at what point participants receive support from the system, whether before or after making their own judgment.

Therefore, as we have already pointed out, we believe that the differences in the results obtained in studies of automation bias in human-in-the-loop processes may be due in part to the variety of methodological procedures employed. Moreover, not all of those studies are a true reflection of the actual human decision-making processes used in the public sector, so they may not all have the same ecological validity from an applied perspective. For example, in actual implementations of automated decision systems in the public sector, system support is typically provided at the beginning of the decision process. Specifically, this process usually follows this sequence (Chong et al., 2022; Solans et al., 2022): First, the system evaluates the available information and shows its assessment; then, the human is given just two options: to validate or to modify the system’s assessment. This sequence implies that human decision makers never explicitly emit their own judgments, but merely validate or modify the system’s assessments, and that the system support is received at the beginning of the process, establishing an order in the presentation of information that is likely to influence the processing of decision-relevant information by the human decision-maker (Marquardson & Grimes, 2018) and to affect compliance and accuracy.

One example of this possible influence would be the anchoring bias (Rastogi et al., 2022). Anchoring bias is the tendency to over-rely on a piece of information we initially receive (the anchor), so that we then adjust our final judgment from that starting point (Epley & Gilovich, 2006; Tversky & Kahneman, 1974). In the case of human-in-the-loop processes, system support (e.g., suggesting a high, medium, or low risk of recidivism for an inmate) is presented before humans make their decisions. This human decision, as noted above, is usually limited to government officials confirming or modifying the assessment previously made by the system. Thus, this AI support could act as an anchor that conditions the human decision maker, whose final decision would be merely an adjustment to the system’s assessment.

Therefore, we conducted two experiments designed to test whether manipulating the time at which system support is presented in a human-in-the-loop process can help to increase the accuracy of the final decision and reduce excessive compliance (i.e., automation bias) when the system makes errors. In order to recreate a decision process that is as close as possible to the real processes implemented in the public sector, our two experiments, set in the field of justice, simulated the RisCanvi system (Soler, 2013). As mentioned above, this system predicts the recidivism risk of inmates in Catalonia, Spain, and we use it merely as an example, because it includes the features common to the other systems described above: a specific sequence in the decision process (first system support, then validation or modification by the human decision-maker) and a very simple interactive interface consisting of just two buttons, one to validate and one to modify the assessment of the system. Thus, in this type of human-in-the-loop process, human decision makers do not explicitly emit their own judgment. Instead, they only confirm or modify the assessment previously received from the system. We believe that this probably favors human compliance with the AI assessment, which would act as an anchor for the human decision. As previously mentioned, people using RisCanvi agree with the algorithm 96.8% of the time (Saura & Aragó, 2021), even though the algorithm has only 18% positive predictive power (Capdevila et al., 2015; Martínez-Garay, 2016).

There are few studies of human-in-the-loop processes that have manipulated the time at which the system support is received, or that have presented algorithmic support at different points in the decision process (Buçinca et al., 2021; Echterhoff et al., 2022; Green & Chen, 2019b; Rastogi et al., 2022; Vicente & Matute, 2023) to study whether this support can cause an anchoring effect on the human decision or in any way affect the final decision. For example, Green and Chen (2019b) conducted an experiment in which their participants had to indicate, on a scale of 0 to 100, the probability that several inmates would fail to appear in court or would be arrested before trial. In one of the experimental conditions, the participants gave their judgment before the algorithmic assessment was shown. This assessment was sometimes incorrect, simulating the performance of real-world systems such as the COMPAS algorithm (Angwin et al., 2016). In this condition, in which the participants emitted their judgment before and after receiving the algorithmic support, the highest accuracy was obtained, as compared to other conditions in which the algorithmic assessment was provided before the participants’ judgment, or not displayed at all.

In another context, Buçinca et al. (2021) conducted an experiment in which participants had to identify the highest carbohydrate ingredient in a food dish in order to replace it with another dish with less carbohydrate but similar taste. They found that participants’ accuracy and compliance were affected by the moment at which they received erroneous support from an AI. The performance of participants who emitted their judgment before seeing the incorrect AI assessment was better than that of participants who saw the AI assessment first. In addition, the former group was also less compliant than the group who received the erroneous AI support before making the decision. Although none of the groups that received the incorrect AI support completely avoided the automation bias, the authors suggest that asking participants to emit their judgments before seeing the incorrect AI assessment may act as a cognitive forcing function. This cognitive forcing function would force users of decision support systems to think more analytically and disrupt the fast and heuristic reasoning that may lead them to show compliance (Lambe et al., 2016).

We believe that understanding the impact of human-AI interaction on automated decision processes can lead to more accurate decisions and less automation bias. This matters because the current lack of conclusive evidence in this area is not slowing down the implementation of these automated decision support systems in the public sector, which is a concern. Therefore, as mentioned above, we conducted two experiments, inspired by real-world AI decision support systems such as RisCanvi, in which we manipulated the time at which algorithmic assessments are received in a human-in-the-loop process. Our purpose was to test whether this manipulation could contribute to improving collaborative human-AI decision-making by helping to reduce compliance when the system errs and to increase decision accuracy. That is, our purpose was not to test whether humans are more or less accurate than automated systems in making their assessments. Our aim was to evaluate the standard sequence of decision-making in human-in-the-loop processes, which, as noted above, consists of the system first showing its assessment, and then humans merely confirming or modifying that assessment, without at any time explicitly emitting their own judgment. We believe that such a sequence may favor an anchoring bias that could affect the accuracy of decisions, even to the point of leading to excessive compliance when the system errs. If that were the case, changing the time at which the AI support is provided and, as suggested by Lambe et al. (2016), forcing the human to explicitly emit a judgment before receiving the system’s assessment, should be a good strategy to reduce the bias, and thus increase accuracy and reduce compliance.

Experiment 1

This experiment simulates a human-in-the-loop process in which participants receive erroneous support from an AI system to decide the guilt of several defendants. Our purpose was to test whether forcing participants to explicitly make their judgment before they receive the system’s erroneous assessment could improve the accuracy of the decision, as compared to when the biased AI support is the first step. Thus, we hypothesized that asking human decision-makers to emit their judgment before receiving the algorithmic assessment would improve the accuracy of their judgment and reduce their compliance with the incorrect AI support, that is, reduce their automation bias.

Method

Participants

We recruited a sample of 150 participants (36.6% women, 62.7% men, 0.7% non-binary), aged 18 years or older (M = 33.2, SD = 11.4), through the Prolific Academic platform. Since our experiment, which was conducted online, was inspired by RisCanvi, the automated decision system used in Spain, we recruited a sample from this country. To do so, we used the “Nationality: Spain” and “First language: Spanish” filters on Prolific. Although, to conduct an experiment as similar as possible to the real-world decision process, it would have been appropriate to use a sample of government officials from the penitentiary and judicial field, we opted for a sample of laypeople. This decision, in addition to facilitating recruitment, was supported by previous work on automated decision systems in justice, which reports that the behavior of laypeople and professionals does not differ (Green & Chen, 2021).

The sensitivity analysis for the sample size showed that we had a power of 80% to detect small to medium-sized effects (w = 0.22). The online program randomly assigned each participant to one of two experimental groups: AIsupport→Judgment (n = 76), or Judgment→AIsupport (n = 74).
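As an illustration, this kind of sensitivity analysis can be approximated with standard power tools. The following is a minimal sketch, assuming the analysis targets a chi-squared test with one degree of freedom (a 2 × 2 contingency table) at α = 0.05; depending on the exact tool and rounding (e.g., G*Power vs. statsmodels), the result may differ slightly from the reported w = 0.22.

```python
# A sketch of the sensitivity analysis, assuming a chi-squared test with
# df = 1 (2 x 2 contingency table), alpha = .05, N = 150, and power = .80.
from statsmodels.stats.power import GofChisquarePower

analysis = GofChisquarePower()
w = analysis.solve_power(effect_size=None, nobs=150, alpha=0.05,
                         power=0.80, n_bins=2)  # n_bins - 1 = df = 1
print(f"Smallest detectable effect size at 80% power: w = {w:.3f}")
```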

Design and procedure

After providing some basic demographic information (age and gender), all participants read the same instructions. The instructions told them that their task was to assess the probability that several defendants were guilty, based on witness testimonies. We also told them that they would count on the support of an Artificial Intelligence system. Next, we asked participants about their degree of confidence, both in their own ability and in the AI system, given that these two factors could affect the acceptance or rejection of algorithmic support, as noted by some researchers (Chong et al., 2022; Green & Chen, 2019a). Thus, participants had to indicate how confident they were that they would perform the task properly, and that the artificial intelligence system would adequately assess the defendants’ guilt. Then, the experiment proper began.

The experiment consisted of three trials for each participant, and each trial consisted of three steps. Table 1 shows a summary of the three steps in each trial. In Step 0, the computer presented the participants with a criminal case to be judged and the testimonies associated with it. In order to use standardized materials, we used the criminal cases of the ForenPsy 1.0 normative bank of testimonies developed by Álvarez et al. (2023). This bank includes the description of three criminal cases (homicide, threats, and trespassing) with 15 testimonies each. In the study by Álvarez et al., the 45 testimonies were rated by a sample of anonymous participants, who estimated the degree of innocence or guilt that each testimony suggested about each of the three defendants.

Table 1 Design Summary of Experiments 1 and 2, Showing the Steps in each Trial

Thus, Step 0 in each trial consisted of a description of one of the three criminal cases of ForenPsy 1.0 (Álvarez et al., 2023), along with seven testimonies that suggested either innocence or guilt (see Fig. 1). Five of the seven testimonies clearly pointed to one of the verdicts (innocence or guilt), and the other two pointed in the same direction but were somewhat ambiguous, according to the ForenPsy calibration. The introduction of these two more ambiguous testimonies was intended to add realism to the trials. Some participants viewed the seven testimonies that suggested innocence, while others viewed the seven testimonies that indicated guilt. The type of testimonies that each participant received (i.e., innocence or guilt) was randomized. In addition, the order of presentation of each case and each testimony was randomized for each trial.

Fig. 1

Example of a Screenshot of Step 0 (Description of a Criminal Case and Witness Testimonies), in Experiment 1 (Translated from Spanish). Note. In Experiment 1, each criminal case showed seven testimonies (of innocence in the example), while in Experiment 2, each case showed five testimonies

Our main experimental manipulation took place in Steps 1 and 2 of each trial. The order in which these two steps occurred was reversed between the two groups (see Table 1). In the AIsupport→Judgment group, during Step 1, the participants were shown the probability of guilt estimated by a (fictitious) artificial intelligence system (Footnote 1). The AI assessment of the defendant’s guilt that was shown to the participants could only take two values: high probability of guilt or low probability of guilt (see Fig. 2), so it could be either congruent or incongruent with the verdict suggested by the testimonies presented during Step 0. In the first two trials, the system assessment was always correct, that is, it was congruent with the previously presented testimonies, which suggested either innocence or guilt according to the ForenPsy calibration. It was only in the last trial (the incorrect trial hereafter) that the system assessment was erroneous. In this trial the system always suggested the opposite verdict to that suggested by the testimonies. For example, if the testimonies presented had been rated in ForenPsy 1.0 as indicating innocence, the system suggested a high probability of guilt; and if the testimonies had been rated in ForenPsy 1.0 as suggesting guilt, the system indicated a low probability of guilt. Thus, the accuracy of our fictitious system was 66% (one error out of three), a rate that we did not share with participants, because government officials who use these systems usually do not receive this information either. This accuracy level is very similar to that reported for comparable systems, such as RisCanvi (Capdevila et al., 2015) and COMPAS (Angwin et al., 2016).

Fig. 2

Step 1A shows an example of a Screenshot of Step 1 (Confirm or Modify AI’s Assessment) in the AIsupport→Judgment Group, in Experiments 1 and 2 (Translated from Spanish). Note. Step 1B was only shown to participants who clicked the Modify button. In that case, the change from 1A to 1B did not involve a screen change; the new information was displayed below the Confirm and Modify buttons on the same screen. The Judgment→AIsupport group viewed these screenshots in Step 2

On the same screen where the system’s assessment was shown, the participants in the AIsupport→Judgment group had to choose between confirming or modifying the assessment of the AI system, by clicking on the corresponding button. If participants chose to confirm the AI assessment, they proceeded directly to Step 2. If they chose to modify it, a selectable list appeared below the buttons and the participants could modify the AI’s assessment by choosing one of three options: high, medium, or low probability of guilt (see Fig. 2). The reason for using a three-option response during Step 1, rather than a more continuous and sensitive scale, was that we wanted a measure as similar as possible to those commonly used by real-life decision support systems implemented in the judicial system (i.e., high, moderate, or low risk of recidivism). Moreover, we chose this more realistic three-point scale, instead of simplifying it to the two options that the AI assessment could show (high or low probability of guilt), in order to analyze whether participants who did not comply with the incorrect AI support were accurate in their decision (i.e., their verdict was congruent with the one indicated by the testimonies) or were undecided between the AI support and the verdict suggested by the testimonies, and therefore inaccurate (i.e., they chose the medium probability of guilt). After modifying the AI assessment, these participants also proceeded to Step 2.

In Step 2, the participants in the AIsupport→Judgment group were told that they had to indicate their final judgment on the defendant’s guilt. This final judgment was provided using a selectable list that was identical to the one used when the participants chose to modify the AI’s assessment during Step 1 (i.e., high, medium, or low probability of guilt). This step may seem repetitive, but it was added in order to (a) obtain at least one personal judgment from all participants (i.e., even from those choosing just to confirm the AI assessment in the previous step), and (b) equate the number of times that the participants in both groups were asked to emit their judgment (see Table 1). That is, it was important that both groups completed an identical type and number of steps, so that the only differing factor would be the time at which the AI assessment was shown. Thus, if differences were observed when participants emitted their personal judgments (without the AI support being present), these differences could only be attributed to one group having already received the AI support in the previous phase. To make this request seem more natural, the Step 2 instructions informed participants in this group that the assessment they were to make was the one that definitively closed the case (see Fig. 3).

Fig. 3

Example of Screenshots in Step 2 (Judgment Without AI Support) in the AIsupport→Judgment Group, in Experiment 1. Note. The Judgment→AIsupport group viewed this screenshot in Step 1. In that case, the phrase “To close this case definitively” did not appear on this screen, but on the corresponding Step 2 screen

In the Judgment→AIsupport group, the only difference was that the order of Steps 1 and 2 was reversed. That is, during Step 1, the participants in this group emitted their personal judgment about the probability of the defendant’s guilt without the support of the AI, using the same three-point scale (high, medium, or low probability of guilt) as the other group. This was designed as a proposed improvement to the usual decision sequence of human-in-the-loop processes, which do not explicitly ask humans to emit their judgment before receiving the AI assessment. We expected that forcing these participants to emit this judgment in a step prior to seeing the incorrect AI assessment might improve accuracy and reduce compliance in their final verdict. Next, in Step 2, the system assessment was shown and participants in this group had the opportunity to confirm or modify it, as did the participants of the other group during Step 1. If they decided to modify it, they were again shown the same three-point scale as in the previous step so that they could modify the AI assessment according to their own criteria.

The absence of a real automated system and the use of the ForenPsy 1.0 set of testimonies allowed us to define and control in detail the appearance, format, and errors of the supposed algorithmic support system. Thus, we controlled when the AI assessment was correct (the AI support was congruent with the testimonies) and when it was incorrect (the suggestion of the AI support system was contrary to that of the testimonies).

Once participants completed the three steps for each of the three criminal cases, all participants were asked again about their self-confidence and their trust in the system, using the same questions as those used at the beginning of the experiment. They were also asked whether their job or studies were related to technology or the area of justice (Footnote 2). When they finished, the participants were briefly informed about the real purpose of the study in a final debriefing stage.

Results and discussion

Judgment accuracy without AI support present in the incorrect trial

We first analyze accuracy when participants emitted their personal judgment in the incorrect trial, while the AI support was not present. It is important to keep in mind that the step in which the participants indicated their judgment without the AI support being present differed as a function of group (see Table 1). While participants in the Judgment→AIsupport group assessed the defendants’ guilt on their own in Step 1, that is, before receiving the incorrect AI support in the next step, participants in the AIsupport→Judgment group did so in Step 2, that is, after having seen the AI assessment in Step 1. This allowed us to test whether judging a criminal case without having seen the incorrect AI assessment at any time, compared to having seen it in a previous step, resulted in a more accurate judgment.

As we expected, participants in the Judgment→AIsupport group (i.e., the group that judged the defendant without having seen the incorrect AI assessment) were more accurate in the incorrect trial than participants in the AIsupport→Judgment group. This can be seen in Fig. 4. A chi-squared test analyzing whether or not the participants were correct in their judgment confirmed that the difference between groups was statistically significant, χ2 (1) = 12.95, p < 0.001, Cramer’s V = 0.29. Of all participants in the Judgment→AIsupport group, 66.2% (49 out of 74) provided accurate judgments, compared to 36.8% of the participants in the AIsupport→Judgment group (28 out of 76). Thus, it appears that, as expected, emitting their personal judgment before seeing the incorrect AI assessment led to higher accuracy, as participants in the Judgment→AIsupport group were more accurate than participants in the AIsupport→Judgment group.
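For illustration, the reported test can be reconstructed from the cell counts given above. The following is a minimal sketch using scipy; the choice of the uncorrected Pearson chi-square (correction=False) is our assumption, based on the reported statistic.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Judgment->AIsupport, AIsupport->Judgment; columns: accurate, inaccurate
observed = np.array([[49, 74 - 49],
                     [28, 76 - 28]])

# correction=False gives the uncorrected Pearson chi-square
chi2, p, dof, expected = chi2_contingency(observed, correction=False)

# Cramer's V for a 2 x 2 table
n = observed.sum()
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}, Cramer's V = {cramers_v:.2f}")
```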

Fig. 4

Percentage of Participants with Correct Assessments in Each Group by Type of Trial (Incorrect or Correct), in Experiment 1. Note. On the correct trials, the testimonies and the AI assessment that participants viewed in Steps 1 and 2 (depending on the group) were congruent; on the incorrect trial, they were incongruent. Accuracy data for the correct trials represent the percentage of participants whose judgments were accurate on both correct trials

Judgment accuracy without AI support present in the correct trials

Next, we analyze accuracy when participants made their personal judgments (without AI support present at that time) in the correct trials, that is, the two trials in which the AI suggested the same verdict as the testimonies. Again, in these two trials, participants made their own personal judgments either in Step 1, that is, without having seen the correct AI assessment (Judgment→AIsupport group), or in Step 2, that is, after having seen the correct AI assessment in the previous step (AIsupport→Judgment group).

In order to perform this analysis, we classified participants as having been accurate in the correct trials if they made the correct assessments in both trials. Contrary to what happened in the incorrect trial, we found that participants in the AIsupport→Judgment group were generally more accurate than those in the Judgment→AIsupport group. According to the chi-squared test, the association between accuracy and group was statistically significant, χ2 (1) = 11.8, p < 0.001, Cramer’s V = 0.28. In the AIsupport→Judgment group, 63.2% of participants (48 out of 76) were accurate in both correct cases, while only 35.1% of participants in the Judgment→AIsupport group (26 out of 74) were accurate in both cases (see Fig. 4). Thus, it appears that having received the correct AI assessment in a step prior to emitting their personal judgment, as was the case in the AIsupport→Judgment group, led these participants to more accurate judgments in the correct trials.

Compliance with AI assessment in the incorrect trial

Next, we analyze the compliance of participants (i.e., automation bias in this case) when they received support from the AI and this assessment was incorrect. We classified participants as showing compliance in the incorrect trial if they clicked the button to confirm the AI’s assessment, or if, despite clicking the button to modify it, they finally selected the same probability of guilt as the AI had suggested (see Fig. 2). This could happen in different steps depending on the group: Step 1 in the AIsupport→Judgment group, and Step 2 in the Judgment→AIsupport group. We expected lower compliance from the participants in the Judgment→AIsupport group as compared to the other group, because these participants had already judged the case by themselves in the previous step. Thus, we expected that their previous judgment would prevent the anchoring effect that may occur when the AI assessment is presented first, and could even serve, as suggested by Lambe et al. (2016), as a cognitive forcing function. In sum, we expected that this manipulation would make it easier for them to detect the error in the AI assessment and reduce the possible automation bias.
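The compliance coding rule described above amounts to a simple classification of each response. The following is a minimal sketch of that rule; the column names (button, final_choice, ai_assessment) are hypothetical and not part of the original materials.

```python
import pandas as pd

def is_compliant(row: pd.Series) -> bool:
    """Compliance on the incorrect trial: the participant confirms the AI
    assessment, or modifies it but ends up selecting the same value."""
    if row["button"] == "confirm":
        return True
    return row["button"] == "modify" and row["final_choice"] == row["ai_assessment"]

# Toy example with three hypothetical responses to an incorrect trial
responses = pd.DataFrame({
    "button":        ["confirm", "modify", "modify"],
    "final_choice":  [None,      "low",    "high"],
    "ai_assessment": ["high",    "high",   "high"],
})
responses["compliant"] = responses.apply(is_compliant, axis=1)
print(responses)
```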

We constructed a contingency table to analyze the relationship between participants’ compliance in the incorrect trial and group, and found that only 25 of the 150 participants validated the erroneous assessment of the AI; that is, only 16.7% showed excessive compliance with the AI. Of these 25 participants, 10 belonged to the Judgment→AIsupport group and 15 to the AIsupport→Judgment group. This number is so small that it does not allow further analysis of this compliance effect (Footnote 3).

Next, we analyze whether this lack of compliance implies that participants were actually accurate in their judgments when confirming or modifying the incorrect AI assessment, and whether there is a difference in this accuracy between groups. Thus, we compared their performance during the step in which the AI support was present and they were simply asked to confirm or modify the AI assessment. That is, we compared Step 1 in the AIsupport→Judgment group against Step 2 in the Judgment→AIsupport group. This allows us to test whether our proposal to force an explicit judgment at the beginning of the process (as occurs in the Judgment→AIsupport group) improves on the accuracy of the standard sequence of human-in-the-loop processes, which does not ask for an explicit human judgment before the AI support is presented (as mimicked in the AIsupport→Judgment group). We are interested in this comparison, rather than in comparing the final decision between groups at Step 2, because our proposal is not simply to introduce an explicit judgment at some point in the process, but to force it at the beginning, as opposed to the usual practice of presenting the AI support at the beginning.

According to the chi-squared test, the association between accuracy and group was not statistically significant, χ2 (1) = 1.29, p = 0.256, Cramer’s V = 0.09. In the AIsupport→Judgment group, 34.2% of participants (26 out of 76) were accurate in their decision, while 43.2% of participants (32 out of 74) were accurate in the Judgment→AIsupport group. It appears that although forcing judgment at the beginning of the process in the Judgment→AIsupport group produced more accurate decisions when the incorrect AI assessment had not yet been seen, receiving this incorrect support in the next step impaired participants’ final verdict, reducing their accuracy and aligning it with the levels of the AIsupport→Judgment group.

It should be noted that, in order to simulate a real-life human-in-the-loop decision process in our experiment, we used a three-level scale to request the participants’ assessments of guilt, in a way similar to that used by AI decision support algorithms in judicial systems. We consider the contribution of this experiment to be valuable precisely because we have attempted to simulate a real decision process with AI-human interaction. However, we were aware that such a procedural decision implied choosing a scale with low sensitivity, so we decided to use a more sensitive 0–100 scale in our next experiment. Thus, we conducted a new experiment in which we modified some of the previous procedural decisions, seeking greater robustness in the results at the cost of a small reduction in the ecological validity of the experiment.

Experiment 2

The aim of this experiment is to replicate the results of Experiment 1, and to obtain more robust results and generalize them to a larger sample. To this end, we made three main modifications to the previous experiment. First, we changed the scale on which participants made their assessments from the three-point scale used in Experiment 1 (which simulated real-world AI decision support systems) to a more standard 0 to 100 scale commonly used in psychological research (see Fig. 5). This change allows for more sensitive measurements and also facilitates the use of more robust statistical analyses.

Fig. 5

Example of a Screenshot in Step 2 (Judgment Without AI Support Present) in the AIsupport→Judgment Group, in Experiment 2. Note. To indicate when the steps of each case were completed, in Experiment 2 we changed the label of the "Save" button from Experiment 1 to "Save and move on to the next case". The Judgment→AIsupport group viewed this screenshot in Step 1. In that case, the phrase “To close this case definitively” of the instructions did not appear on this screen, but on the Step 2 screen

Second, we increased the number of criminal cases to be judged from three to nine, so that we could also increase sensitivity in this way. This also allows us to compare the participants’ judgments in three incorrect cases rather than just one, and in six correct cases rather than two. The accuracy rate of the AI was maintained at a realistic level of 66%. In addition, we eliminated the ambiguous testimonies from the cases presented and used only the five testimonies of each case that pointed most clearly to either innocence or guilt. We decided to eliminate ambiguity because, if participants were misled by the erroneous AI support even when the materials were very easy and the verdict was obvious, then we would have clear evidence of a serious problem, with people following AI errors even in cases that they could easily resolve on their own.

Finally, we expanded the sample of participants not only in number (260 participants in this experiment) but also in diversity: although we maintained Spanish as the language of the study, we opened participation to people from any country. This experiment was preregistered at https://aspredicted.org/ph9br.pdf.

As in Experiment 1, we expected that participants who receive the erroneous AI support at the beginning of the process (group AIsupport→Judgment) would show more compliance and lower accuracy than those who emit their own judgment before receiving the incorrect AI support (group Judgment→AIsupport). In addition, we expected higher accuracy in the judgments of participants in the Judgment→AIsupport group on incorrect trials as compared to the judgments of the AIsupport→Judgment group.

Method

Participants

We recruited a sample of 260 participants (42.3% women, 54.6% men, 3.1% non-binary), aged 18 years or older (M = 30.7, SD = 9.23), through the Prolific Academic platform. We used Prolific’s internal selection service to recruit this specific sample: participants over the age of 18 who had not previously participated in other experiments conducted by our research team on the Prolific platform, with Spanish as their first language, but from any country. Thus, the most represented nationalities were Mexican (40% of participants), Spanish (37.7%), and Chilean (8.5%), but there were also participants from Italy, USA, Venezuela, Peru, and Colombia, among others.

The sensitivity analysis for the sample size showed that we had a power of 80% to detect small effects (d = 0.10). As in the previous experiment, participants were randomly assigned to one of two experimental groups: AIsupport→Judgment (n = 132), or Judgment→AIsupport (n = 128).

Design and procedure

The design and procedure were very similar to those of the previous experiment. First of all, participants read the instructions and provided their age and gender. This time we did not ask them about their confidence in their own abilities and in the system’s abilities, neither at the beginning nor at the end of the study, because these measures did not affect the results in Experiment 1.

Then we presented the cases to be judged, with some changes from the previous experiment, which we describe next. This time, each participant viewed nine trials (nine criminal cases) instead of three. In order to use the ForenPsy 1.0 standardized materials already developed and tested by Álvarez et al. (2023), we had used only three criminal cases (two correct and one incorrect) in Experiment 1. However, to gain sensitivity in Experiment 2, we decided to increase the number of cases to nine. These nine trials were presented grouped by offense type (three for homicide, three for threats, and three for trespassing), with both the offense groupings and the trials within each grouping presented in random order. Of the nine trials, three were the same cases used in Experiment 1, based on the ForenPsy 1.0 standardized set (Álvarez et al., 2023). The other six trials were created using the ChatGPT-4 large language model (OpenAI, 2023) and then edited for clarity and consistency.

Thus, in each of the nine trials, Step 0 consisted of a cover story describing a criminal case and five testimonies that clearly indicated either an innocent or a guilty verdict for the defendant. To make the verdict more obvious to participants, and thus to better control when the AI assessment would be incorrect, we eliminated the two ambiguous testimonies per case that we had used in Experiment 1, using only five testimonies per case in this experiment. In order to weight the testimonies of the six new cases created for this experiment, we conducted a previous study with a sample of 52 volunteers (Footnote 4) to calibrate and select the testimonies that most clearly indicated a verdict of innocence or guilt.

As in Experiment 1, the proportion of cases with erroneous AI support was 33%; that is, three of the nine cases (one of homicide, one of threats, and one of trespassing) were incorrect. Although the order in which the trials for each crime were presented was randomized, we forced the first trial that each participant saw to never be an incorrect trial, in order to build some confidence in the AI system.
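To make these ordering constraints concrete, the following is a minimal sketch of how such a trial sequence could be generated (offense blocks in random order, trials shuffled within each block, one incorrect case per offense, and a correct case always shown first); the structure and names are illustrative and not the actual experimental code.

```python
import random

def build_trial_sequence() -> list[dict]:
    """Nine trials grouped by offense type; one incorrect case per offense;
    the very first trial a participant sees is never an incorrect one."""
    offenses = ["homicide", "threats", "trespassing"]
    random.shuffle(offenses)                     # randomize block order
    sequence = []
    for idx, offense in enumerate(offenses):
        block = [{"offense": offense, "ai_correct": False},
                 {"offense": offense, "ai_correct": True},
                 {"offense": offense, "ai_correct": True}]
        random.shuffle(block)                    # randomize trials within block
        if idx == 0:
            while not block[0]["ai_correct"]:    # first trial must be correct
                random.shuffle(block)
        sequence.extend(block)
    return sequence

print(build_trial_sequence())
```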

The experimental manipulation took place in Steps 1 and 2. In the AIsupport→Judgment group, in Step 1, participants viewed the AI system’s assessment (low or high probability of guilt, as in Experiment 1). In the incorrect cases, the AI assessment was incongruent with the previously presented testimonies and suggested the opposite verdict. As in Experiment 1, participants were asked on the same screen whether they wanted to confirm the AI assessment or modify it. If they decided to modify it, they could use the same three-point scale as in the previous experiment (high, medium, or low probability of guilt). Next, in Step 2, participants were asked to indicate their final assessment. In this experiment, this judgment was indicated on a more standard 0–100 judgment scale (where 0 was low probability of guilt and 100 was high probability of guilt; see Fig. 5).

In the Judgment→AIsupport group, the order of Steps 1 and 2 was reversed, with participants emitting their judgment on the defendant’s probability of guilt on the 0–100 point scale in Step 1, and then viewing the AI assessment and confirming or modifying it as their final verdict in Step 2. Finally, all participants indicated whether their work or studies were related to technology or justice, and were debriefed and informed about the true purpose of the study.

Results and discussion

Judgment accuracy without AI support present in the incorrect trials

We first analyze the difference in judgments between groups when participants judged the defendants in the incorrect trials (three incorrect cases, one for each of the three types of crime). Since we used a 0–100 scale to measure the participants’ judgments in this experiment, we focused on the difference in judgments between groups to analyze accuracy.

Figure 6 summarizes the results. As can be seen in this figure, the mean accuracy of judgments in the Judgment→AIsupport group is higher than in the AIsupport→Judgment group, regardless of whether the participants viewed testimonies pointing to innocence or guilt. These impressions were confirmed by a 2 (testimonies: innocence, guilt) × 2 (group: AIsupport→Judgment, Judgment→AIsupport) mixed ANOVA with mean judgments in the incorrect trials as the dependent variable (Footnote 5). This ANOVA showed a main effect of testimonies, F(1, 188) = 636.4, p < 0.001, η2p = 0.772, and no main effect of group, F(1, 188) = 0.104, p = 0.747, η2p = 0.001. However, and as we expected, we observed a Testimonies × Group interaction, F(1, 188) = 16.3, p < 0.001, η2p = 0.080.

Fig. 6

Mean Judgment of Guilt in Incorrect Trials, by Type of Testimonies and Group, in Experiment 2. Note. Error bars 95% CI

Subsequent post-hoc comparisons (Tukey correction) showed that, as we expected, participants in the Judgment→AIsupport group judged defendants in the incorrect cases as less guilty when the testimonies indicated innocence (M = 18.1, SD = 15.2, t(369) = −3.26, p = 0.007, d = −0.53) than did participants in the AIsupport→Judgment group (M = 28.2, SD = 22.5). In addition, when the testimonies indicated guilt, the Judgment→AIsupport group judged the defendants in the incorrect cases to be more guilty (M = 79.8, SD = 15.7, t(369) = 2.83, p = 0.025, d = 0.48) than the AIsupport→Judgment group (M = 70.8, SD = 21.0). These results indicate that participants in the Judgment→AIsupport group were more accurate in their judgments. Thus, it seems that those participants who emitted their judgments after having received the incorrect AI support were negatively influenced by it, and their accuracy was reduced.
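For readers who wish to reproduce this type of analysis, the following is a minimal sketch of a 2 × 2 mixed ANOVA in pingouin, run here on randomly generated placeholder data with hypothetical column names; the reported analysis used the actual judgments, and the Tukey-corrected post-hoc comparisons were computed separately.

```python
import numpy as np
import pandas as pd
import pingouin as pg

# Placeholder long-format data: one mean judgment (0-100) per participant
# and per testimony type (innocence / guilt) in the incorrect trials.
rng = np.random.default_rng(0)
n_per_group = 95
participants = np.arange(2 * n_per_group)
judgments = pd.DataFrame({
    "participant": np.repeat(participants, 2),
    "group": np.repeat(["AIsupport-Judgment", "Judgment-AIsupport"],
                       2 * n_per_group),
    "testimonies": np.tile(["innocence", "guilt"], 2 * n_per_group),
    "judgment": rng.uniform(0, 100, size=4 * n_per_group),
})

# Mixed ANOVA: testimonies within subjects, group between subjects,
# partial eta squared as the effect size (as reported in the text)
aov = pg.mixed_anova(data=judgments, dv="judgment", within="testimonies",
                     between="group", subject="participant", effsize="np2")
print(aov)
```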

Judgment accuracy without AI support present in the correct trials

Next, we analyzed the participants’ judgments in the correct trials (six of the nine cases used in this experiment). In Experiment 1, we had found the opposite result in the correct trials as compared to the incorrect trial: participants in the AIsupport→Judgment group were more accurate in these trials than those in the Judgment→AIsupport group. Because we changed the judgment scale in Experiment 2 in order to have a more sensitive measure, we were now able to conduct a 2 (testimonies: innocence, guilt) × 2 (group: AIsupport→Judgment, Judgment→AIsupport) mixed ANOVA with the mean judgments in the correct trials as the dependent variable. We found a main effect of testimonies, F(1, 249) = 2453.4, p < 0.001, η2p = 0.908; no main effect of group, F(1, 249) = 0.43, p = 0.515, η2p = 0.002; and a Testimonies × Group interaction, F(1, 249) = 5.28, p = 0.022, η2p = 0.021. In order to look for potential between-group differences when the testimonies indicated innocence and when they indicated guilt, we conducted post-hoc comparisons with Tukey correction. These showed no statistically significant differences between groups when the testimonies indicated innocence (t(491) = 1.28, p = 0.575, d = 0.16; Judgment→AIsupport group, M = 18.9, SD = 12.8; AIsupport→Judgment group, M = 16.9, SD = 11.5), nor when the testimonies indicated guilt (t(491) = −2.15, p = 0.139, d = −0.23; Judgment→AIsupport group, M = 77.8, SD = 16.1; AIsupport→Judgment group, M = 81.3, SD = 13.8). Thus, we found no differences between groups when participants made their judgments without the AI support being present in the correct trials. It seems that the between-group difference observed in the correct trials of Experiment 1 disappears when more cases and a more standardized and sensitive scale are used.

Compliance with AI assessment in the incorrect trials

We also analyzed the compliance of the participants with the incorrect AI assessment. We classified participants as compliant with the AI assessment if they showed compliance in at least one of the three incorrect trials. We expected the Judgment→AIsupport group to be less compliant than the AIsupport→Judgment group. Although we had not found this difference between groups in Experiment 1 due to the low compliance rate in both groups in that study, we expected to observe this result in Experiment 2, as we increased the sample size and the number of incorrect trials.

We conducted a chi-square test to analyze the difference between groups in participants’ compliance. We did not find a statistically significant difference in compliance between groups, χ2 (1) = 0.37, p = 0.545, Cramer’s V = 0.04. In the AIsupport→Judgment group, 24.2% of participants showed compliance on at least one of the three incorrect trials (32 participants out of 132); a very similar percentage was observed in the Judgment→AIsupport group, where 21.1% of the participants showed compliance (27 participants out of 128).

Finally, although we found no statistical differences between the groups in compliance, we must point out that the erroneous AI support again negatively affected decision accuracy in this experiment. We compared the percentage of participants who were accurate on the three cases in which they received the incorrect AI assessment (at Step 1 in the AIsupport→Judgment group and at Step 2 in the Judgment→AIsupport group), and we found that only 29.5% of participants in the AIsupport→Judgment group (39 out of 132) and 31.3% of participants in the Judgment→AIsupport group (40 out of 128) were accurate. There were no significant differences in accuracy between groups in their AI-assisted step, χ2 (1) = 0.09, p = 0.765, Cramer’s V = 0.02. Again, even though participants in the Judgment→AIsupport group emitted more accurate judgments than the other group during the step in which the AI support was absent (which for them took place at the beginning of the task), when they received the incorrect AI support, the accuracy of their decisions was impaired to the level of that of the other group.

General discussion

Even though there is no clear consensus on whether automated decision systems with human-in-the-loop processes contribute to better decision-making, their use is increasing in many different fields, including the public sector. The present research was designed to better understand how an automated decision system can impact decisions in the legal context. Our experiments suggest that human judgment can be affected when AI support is received.

In cases in which the AI assessment is incorrect, it seems that the human verdict will be more accurate if it is emitted before receiving the erroneous AI support. In both Experiments 1 and 2, we found that when participants emitted their judgment before receiving the incorrect AI support (Step 1 in the Judgment→AIsupport group), their judgment was closer to the one indicated by the testimonies than when participants emitted their judgment after receiving the erroneous AI support (Step 2 in the AIsupport→Judgment group).

In cases in which the AI assessment is correct, we found that although receiving the correct AI support in a previous step appeared to make the judgments in the AIsupport→Judgment group more accurate than those in the Judgment→AIsupport group in Experiment 1, this difference was not statistically significant in Experiment 2, which included a larger and more heterogeneous sample, a larger number of trials, and a more sensitive scale to measure the judgments than Experiment 1. This suggests that correct AI support in a human-in-the-loop process is not as beneficial as it may seem, whereas incorrect AI support is critical because it increases human error. Thus, our experiments show a possible anchoring effect of incorrect AI support on the human decision. Receiving the incorrect support at the beginning of the process impaired the subsequent explicit judgment of participants in the AIsupport→Judgment group.

It should also be noted that we did not find excessive compliance of participants with the erroneous AI support in either experiment. It is possible that participants were aware of the incongruence between the verdict indicated by the testimonies and the incorrect assessment of the AI, which protected them from automation bias. This result, which is in line with more recent work on this topic (De-Arteaga et al., 2020; Grgic-Hlaca et al., 2019; Portela et al., 2022), contrasts with the high compliance that some automated systems, such as RisCanvi, show outside the laboratory (above 95%; Saura & Aragó, 2021). Future research efforts are necessary to understand such discrepancies and their impact. Importantly, this lack of compliance did not imply a more accurate decision. In both groups, in the step where they had to confirm or modify the incorrect assessment of the AI, the accuracy rate was low, even for the group that had previously been forced to emit their judgment and had done so accurately (i.e., the Judgment→AIsupport group). These results suggest that while it may seem advisable for people involved in human-in-the-loop processes to explicitly report their judgments at the beginning, rather than merely supervising the AI and confirming or modifying the AI’s assessments, AI errors are very likely to compromise their final decision even when those AI errors occur after an accurate human judgment. Therefore, it is probably better that the human judgment comes first, and that the AI, rather than the human, provides a second opinion, in order to detect (and warn of) potential human errors. But then again, we will still need a third party (an external human auditor, or a human committee) that has the last word and is able to critically analyze any potential discrepancies in this human-AI collaboration.

It is important to note that our experiments aimed to simulate real-world systems, such as RisCanvi, in order to recreate the standard process of this type of automated system. They were not experiments to evaluate RisCanvi. In fact, RisCanvi involves more human intervention in the decision process (e.g., a human is required to select the information to be considered by the system, and when a government official decides to modify the risk estimated by the system, this requires final validation by a different person; Portela & Álvarez, 2022) than other well-known automated decision systems (such as COMPAS or the automated border control system of Frontex; Portela & Álvarez, 2022). These differences in the amount and purpose of human intervention in the various existing automated decision systems can affect the accuracy of the decisions made using those systems, as well as human compliance with the system’s support.

In our experiments, we might have been able to obtain higher levels of compliance from participants by, for example, introducing more ambiguous testimonies into the cases so that the verdict in those cases would be less obvious. However, our goal was not to achieve a high level of compliance. We decided to eliminate ambiguity because, if participants were misled by the erroneous AI support even when the materials were very easy and the verdict was obvious, then we would have clear evidence of a serious problem, with people following AI errors even in cases that they could easily resolve on their own.

The present experiments show that certain details of the interaction between humans and AI, such as the timing of the AI assessment and whether or not humans are asked to emit their judgments explicitly, can have an important impact on decisions. Interestingly, these details are not usually considered when the convenience of implementing these systems is evaluated, or when these algorithms are audited, internally or externally, since in those cases the focus is usually placed on the technical performance of the algorithm itself rather than on its interaction with the people who use it (Buçinca et al., 2021; Green, 2022). Our research highlights the need to consider not only the technical aspects but, most importantly, the human-AI interaction when evaluating or auditing these systems. Aspects such as when government officials receive the AI assessment, whether they make their own judgment before seeing that assessment, or whether they are aware of, for example, the error rate of the system can have a very large impact on the decisions made in human-in-the-loop processes. Indeed, these aspects determine whether the AI support primes human decisions in the direction favored by the AI (when the AI acts first), or whether it is limited to the role of providing a second opinion after the human has already emitted a judgment. Our experiments also suggest that it is important for the humans involved in these processes to have the skills, experience, and time to interpret and manage the information provided by the system (Ponce, 2022; Portela & Álvarez, 2022), to be informed about the error rate of the systems they supervise, and to have the ability (and the incentives) to be critical and to disagree with system decisions when necessary. As has already been shown, the influence of algorithmic support and recommendations on human decisions, both public and private, is often underestimated (Agudo & Matute, 2021).

Although in this research we have used accuracy as a measure to assess the human-in-the-loop process, decisions in the field of justice must also be fair, correctable, and ethical, among other things, in addition to being accurate (Green & Chen, 2019b). Thus, all of these aspects need to be considered as well, and not just the predictive accuracy of the system, when determining how beneficial it would be to deploy automated decision systems in the public domain. In this regard, we believe that a critical analysis of the convenience of establishing human-in-the-loop processes is necessary: not because we believe that it is better to make decisions in a fully automated way, but because we consider that the role of the human in an automated process hides certain pitfalls that need to be investigated in detail.

First, although human oversight is proposed as a safeguard in high-risk automated decisions (European Commission, 2019), critical issues such as those mentioned above (the experience of these humans, the time they have available, their responsibility and agency, their motivation and incentives, etc.) can turn human-in-the-loop processes into "quasi-automated" processes in which the human contributes almost nothing (Wagner, 2019) and which may even provide a false sense of security (Ponce, 2022), as our results suggest. Indeed, it is worth noting that when these systems are accurate, their success is usually celebrated by emphasizing the crucial role of the algorithm; when they err, however, it is the humans who are blamed for their lack of oversight or their automation bias.

Importantly, it has been noted that it is more difficult for humans to supervise and judge the accuracy of an algorithm's prediction than to make their own assessments and predictions (Green, 2022). Thus, as noted above, rather than having humans supervise AI decisions, a better strategy could be to let humans make the decisions while using AI tools to provide a second opinion and to alert humans to possible errors the AI may detect, and then to have a human auditor or a human committee analyze any potential human-AI discrepancies.

Our results contribute to an increasing amount of scientific evidence suggesting that, before implementing these systems, it is necessary to consider which decisions it makes sense to automate and how to do it. In doing so, it should be taken into account, for example, whether it is a decision where the accuracy of automated systems is far superior to human accuracy, or whether, on the contrary, it is a decision where a wide variety of criteria must be taken into account (as we have mentioned, judicial decisions must be fair, correctable, and ethical, in addition to being accurate) and which is therefore not appropriate for automated systems. Not all decisions are suitable for automation, nor does it seem desirable for us as humans to rely on decision-making processes in which the person involved is a clear candidate to be singled out as the source of the system's errors.

Availability of data and materials

The data and materials for these experiments are freely available at the Open Science Framework: https://osf.io/b6p4z/.

Preregistration

Experiment 2 was preregistered at https://aspredicted.org/ph9br.pdf.

Notes

  1. Although automated decision-making in justice is used both to decide about past events (e.g., the conviction of a defendant) and to estimate the probability of future events (e.g., the aforementioned risk of recidivism of a defendant), we opted for the first option in our experiment because we considered that it would be easier for participants who are not experts in justice to decide on a past action than to predict a future one.

  2. Of the total number of participants, only a few reported work or educational experience in justice (n = 12) or technology (n = 35).

  3. In addition, these low compliance data did not allow us to analyze whether participants’ self-confidence or confidence in the system might contribute to increased automation bias.

  4. The sample consisted of 52 participants (69.2% men, 28.8% women, 2% no response), all employees of the same company (a Spanish technology consulting company) and over 18 years old (M = 35.6, SD = 8.92).

  5. For each participant, the testimonies in the three incorrect trials were randomly assigned to suggest either innocence or guilt. Because of this randomness, some participants received innocence testimonies in all three incorrect cases, some received guilt testimonies in all three, and some received a mixture of the two. This is reflected in the ANOVA degrees of freedom. In the AIsupport→Judgment group, participants were presented with a total of n = 117 trials with innocence testimonies and n = 111 trials with guilt testimonies; in the Judgment→AIsupport group, the corresponding numbers were n = 112 and n = 110 (see the illustrative sketch below).
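As a purely illustrative aside, the following short Python simulation (our own sketch, not the original experimental software) assumes an independent random draw of the testimony type for each incorrect trial, consistent with the note above, and shows why the group totals hover around, but rarely hit exactly, a 50/50 split between innocence and guilt testimonies.

```python
# Illustrative simulation of the randomization described in Note 5:
# each participant receives three incorrect trials, and the testimonies in
# each of those trials independently suggest innocence or guilt at random.
import random
from collections import Counter

def simulate_group(n_participants, n_incorrect_trials=3, seed=None):
    rng = random.Random(seed)
    totals = Counter()
    for _ in range(n_participants):
        # A participant may end up with all-innocence, all-guilt, or a mixture.
        verdict_types = [rng.choice(["innocence", "guilt"])
                         for _ in range(n_incorrect_trials)]
        totals.update(verdict_types)
    return totals

# 76 participants x 3 incorrect trials = 228 trials, which matches the total
# number of incorrect trials reported above for one group (117 + 111).
print(simulate_group(76, seed=2024))
# The expected split is 114/114; the observed counts deviate by chance,
# which is why the innocence and guilt totals differ across groups.
```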

References


Significance statement

Artificial intelligence algorithms and automated decision support systems are becoming increasingly common in the judicial sector, but they are not free from errors and biases. For this reason, human-in-the-loop processes have been proposed as a guarantee to safeguard the fairness and integrity of such decisions. However, little is known about how humans react to AI decision support systems, and about which conditions ensure that they are able to detect AI errors and reject the AI support when appropriate. The present experiments were designed to better understand how an automated decision system can impact human decisions in the legal context, and whether this process can be improved by manipulating some variables, such as the time at which the different steps of the process take place. Our results show the importance of the humans involved in the decision process reporting their judgment before they receive AI support: when the AI assessment is incorrect, human judgment is more accurate if emitted before receiving the erroneous assessment. Thus, the time at which the AI support is provided may be of great importance. Our results contribute to an increasing amount of scientific evidence suggesting that it is important to understand how human decisions are affected by AI, and to consider which decisions it makes sense to automate and how to do it.

Funding

Support for this research was provided by Grant PID2021-126320NB-I00 funded by the Spanish Government MCIN/AEI/10.13039/501100011033 and by ERDF A way of making Europe, as well as by Grant IT1696-22 from the Basque Government, both awarded to HM. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information


Contributions

Conceived and designed the experiment: UA, KGL, HM, MA; Experimental software: UA. Conducted the experiment: UA. Analyzed the data: UA. Wrote the manuscript: UA, HM, KGL. Revised the manuscript: UA, KGL, HM, MA.

Corresponding author

Correspondence to Helena Matute.

Ethics declarations

Ethics approval and consent to participate

The Ethics Review Board of the University of Deusto approved the procedure for this experiment. Participation was anonymous, and participants submitted their responses voluntarily. No personal information was collected.

Consent for publication

The authors authorize CR:PI to publish this research.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Agudo, U., Liberal, K.G., Arrese, M. et al. The impact of AI errors in a human-in-the-loop process. Cogn. Research 9, 1 (2024). https://doi.org/10.1186/s41235-023-00529-3


Keywords