Adapting to the algorithm: how accuracy comparisons promote the use of a decision aid

Liang, Garston; Sloane, Jennifer F.; Donkin, Christopher; Newell, Ben R.

doi:10.1186/s41235-022-00364-y

Original article
Open access
Published: 08 February 2022


Cognitive Research: Principles and Implications volume 7, Article number: 14 (2022)


Abstract

In three experiments, we sought to understand when and why people use an algorithmic decision aid. Distinct from recent approaches, we explicitly state the algorithm’s accuracy while also providing summary feedback and training that allow participants to assess their own skills. Our results highlight that such direct performance comparisons between the algorithm and the individual encourage a strategy of selective reliance on the decision aid: individuals ignored the algorithm when the task was easier and relied on the algorithm when the task was harder. Our systematic investigation of summary feedback, training experience, and strategy hint manipulations shows that further opportunities to learn about the algorithm encourage not only increased reliance on the algorithm but also engagement in experimentation and verification of its recommendations. Together, our findings emphasize the decision-maker’s capacity to learn about the algorithm, providing insights into how we can improve the use of decision aids.

Introduction

Decision aids are increasingly in demand. Developments in data availability and computational capabilities have expanded the reach of these tools, which are often implemented as computer algorithms, into much of everyday life. From the mundane, such as deciding which TV show to binge next, to the momentous, such as recommending surgery to a patient, algorithms synthesize vast amounts of information to provide users with on-demand recommendations.^{Footnote 1}

The focus of this paper is to understand what guides individuals to rely upon a recommendation rather than making their own decision. Decision aids, while powerful, might not be the panacea to every problem. The uncertain surgeon who seeks out a medical decision aid by day might later ignore the algorithm behind Netflix’s show recommendations by night.

In this paper, three experiments show that individuals exhibit an acute selectivity in when they rely upon a recommendation. Across our experiments, we embed an imperfect but helpful algorithm in a perceptual decision-making task. We show that the information individuals learn about the accuracy of an algorithm is crucial to when they rely on a recommendation. Understanding one’s own performance relative to the algorithm’s accuracy equips the decision-maker with the knowledge of who (or what) is better suited to solving the problem at hand. Taken together, the experiments provide a systematic comparison of feedback, training, and strategic hints to understand how learning about the algorithm affects the way people use recommendations.

Comparing algorithm performance

There is a notable distinction between seeking advice from another person and seeking an algorithm’s recommendation. With a human advisor, the decision-maker can put themselves in the other’s shoes. The advisor may share the same reasoning process and step the person through the complexities of a situation (Prahl & Van Swol, 2017). By contrast, the steps an algorithm takes to produce a recommendation may be opaque or at the least unfamiliar to the ordinary user (Yeomans et al., 2019). To bridge this gap, algorithms are typically accompanied by descriptions that help convey why their recommendations can be trusted, for instance, by describing the mechanics of their statistical underpinnings. Such information can help decision-makers calibrate their expectations about how useful a recommendation might be.

A simple way to communicate a recommendation’s usefulness is to provide information about the algorithm’s accuracy. Accuracy highlights any performance benefits of relying on the recommendation and offers a benchmark against which individuals can judge their own performance (Parasuraman et al., 2000). Typically, accuracy is conveyed through (a) verbal descriptions that summarize performance, such as describing the algorithm as an 87% accurate medical diagnostician (e.g. in Longoni et al., 2019), or (b) feedback accumulated over multiple recommendations, such as providing information about what the algorithm recommended compared to the correct response (e.g. Dietvorst et al., 2015). In either format, accuracy information establishes a simple explanation for why a recommendation is or is not used; namely, that the preferred system (algorithm or personal judgement) is superior in performance.

Perhaps most interesting are instances where superior recommenders are shunned even in the presence of accuracy information promoting their virtues (e.g. Dietvorst et al., 2015; Mohoney & Houpt, 2019; Bartlett & McCarley, 2017, 2019). A good example comes from a set of experiments involving feedback and a helpful decision rule (Arkes et al., 1986). Participants examined student report cards and, based upon three grades, were asked to indicate the honours-roll status of each student (i.e. responding honours/not honours after each report card). Additionally, they were provided with a simple decision rule to aid them. The rule was 70% accurate: indicate honours for report cards with two or more A’s, and no honours for one or fewer.

Various instruction manipulations made clear the difficulty of surpassing this performance benchmark. For example, the debias condition was explicitly instructed that “most people can’t judge at a rate better than 70% correct … [those] who try actually perform a lot worse” (Arkes et al., 1986, p. 97). However, despite the heavy-handed instructions and ongoing feedback throughout the task, most individuals deviated from exclusively using the decision rule and scored lower than had they strictly complied. Surprisingly, this rule deviation was more prominent when feedback was present than when it was absent.

These rather curious results suggest that many individuals believed they could outperform the rule. Such behaviour may have been driven by scepticism about the validity of the rule, participants’ belief that their prior knowledge of college grades was superior to a simple rule, insufficient training in the task, or perhaps simply the desire to take on the challenge implied by the experimenter (e.g. “I am superior to most people so I will be able to do better”). Whatever the precise motivation, these kinds of results highlight the importance of being able to accurately assess one’s own level of (unaided) performance on a task when deciding whether to seek and follow an external recommendation (Arkes et al., 1986; Sieck & Arkes, 2005).

A matter of skill

Algorithmic decision aids hold a great deal of promise for highly skilled professions (e.g. sentencing decisions by judges; Kleinberg et al., 2018). Particularly in time-poor environments, algorithms can be helpful in outsourcing the peripheral features of a task and allowing the expert to focus on the more demanding details. Radiography is one such profession where visual search algorithms assist expert judgement in the detection of screening anomalies. Radiologists can outsource ambiguous cases to visual search algorithms that in turn recommend which anomalies require additional expert scrutiny.

Expertise is precisely what equips individuals to judge the utility of any decision aid tool. Expertise can also, however, be an impediment to using decision aids. Relative to lay populations, more knowledgeable experts typically reject recommendations from both algorithmic and human advisors (Logg et al., 2019; Yaniv, 2004; Arkes et al., 1986). Within the medical field, high levels of expertise typically beget overconfidence, where overestimating one’s own capabilities can lead to grave judgement errors (Berner & Graber, 2008; Croskerry & Norman, 2008; Sieck & Arkes, 2005).

Examining how people evaluate their skills relative to the algorithm can help determine when one should consult a decision aid. Our experiments incorporate manipulations that vary the complexity of training and veridical feedback to give people multiple opportunities to reassess their performance. Having multiple opportunities to re-evaluate their performance may lead individuals to adapt their reliance on a decision aid over time. For example, an individual may decrease their reliance if their skills gradually improve beyond the accuracy of the algorithm. However, if the algorithm consistently outperforms the individual, that individual may learn to be increasingly reliant on the algorithm’s suggestion.

Single-shot choice experiments have found that individuals adjust their preference for a decision aid based on information about its accuracy. For instance, while participants initially preferred a human physician to an equivalent-performing algorithm in a medical scenario, Bigman and Gray (2018) found a preference switch when participants were subsequently told the algorithm would outperform the physician. Individuals are also capable of disregarding unhelpful decision aids such as when they are told the recommendations are generated by a coin-flip (i.e. chance-level performance in a binary choice task; Douneva et al., 2019). Our experiments sought to combine and extend these findings in a within-subject investigation of how performance information alongside assessments of one’s own skill shapes when people consult an algorithm.

Overview of experiments

Across three experiments, we investigated how people relied on a decision aid that was situationally helpful. Figure 1 displays and describes the way in which we implemented the decision aid (see figure caption for details). In the main task, individuals made binary choice judgements that could be aided by an algorithm. If participants were uncertain, they could consult an algorithm that was set to a known accuracy level of 70%. This meant that on most, but crucially not all, occasions the algorithm would provide a correct recommendation (e.g. an arrow pointed in the recommended direction for the dot motion task, see Fig. 1). Importantly, participants were explicitly told of the algorithm’s accuracy level and the potential for an incorrect recommendation (i.e. 30% of the time the arrow points in the opposite direction to the motion of the dots).

We specifically chose the algorithm’s accuracy level to bisect the expected performance across two levels of task difficulty (explained further in Experiment 1a and 1b). For the easier stimuli, most individuals learnt the task to near perfection, and the vast majority surpassed the accuracy of the algorithm (i.e. median participant accuracy ~ 95% correct). By contrast, the harder versions of the stimuli continually proved to be difficult, even with increasing levels of training and feedback introduced in later experiments. The algorithm systematically outperformed all but a single individual for the harder version of the task (median participants’ accuracy ~ 52% correct).

Our primary aim was to examine how people subsequently adjusted their use of the 70%-accurate algorithm to these difficulty levels. A noteworthy implication of fixing the algorithm’s accuracy across stimulus difficulty is that the task includes situations where what is difficult for an algorithm may not be difficult for a human observer (i.e. easier trials where the algorithm is 70% correct). While we acknowledge this is not always the case, such situations can arise if the algorithm uses a different process compared to a human observer. For example, in a task distinguishing husky dogs from wolves, a human may recognize the facial subtleties of each animal while an image classifier might learn to recognize snow in the background of images of wolves (Ribeiro et al., 2016). Indeed, online CAPTCHA tests exist because classifier algorithms have difficulty recognizing simple objects that humans can easily identify. Our intent in including such situations is that we can directly examine whether individuals understand such limitations of the algorithm.

An additional benefit to this experimental setup is that it discouraged the exclusive reliance on either source of responses. Should an individual display an inherent aversion to the algorithm, their performance for the harder images would be at chance levels. Similarly, an individual that outsourced the entirety of the task to the decision aid would make a substantial number of simple and avoidable errors on the easier images. The best overall approach was to selectively seek the algorithm’s recommendation for the harder stimuli but disregard its recommendation for the easier stimuli.
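The trade-off between these three approaches can be illustrated with a little arithmetic. The sketch below is not part of the original analyses. It assumes an equal number of easier and harder trials and that a recommendation is available on every trial, and it plugs in the approximate median accuracies quoted above (about 0.95 for easier stimuli, about 0.52 for harder stimuli) together with the algorithm’s fixed accuracy of 0.70. All function and variable names are illustrative.

```python
ALGORITHM_ACCURACY = 0.70   # fixed accuracy of the decision aid
OWN_EASIER = 0.95           # approximate median unaided accuracy, easier stimuli
OWN_HARDER = 0.52           # approximate median unaided accuracy, harder stimuli


def expected_accuracy(use_aid_on_easier: bool, use_aid_on_harder: bool) -> float:
    """Expected proportion correct, assuming equal numbers of easier and harder trials."""
    easier = ALGORITHM_ACCURACY if use_aid_on_easier else OWN_EASIER
    harder = ALGORITHM_ACCURACY if use_aid_on_harder else OWN_HARDER
    return (easier + harder) / 2


strategies = {
    "never use the aid": expected_accuracy(False, False),
    "always use the aid": expected_accuracy(True, True),
    "selective (harder trials only)": expected_accuracy(False, True),
}

for name, accuracy in strategies.items():
    print(f"{name}: {accuracy:.3f}")
```

Under these assumptions, selective reliance yields roughly 0.825 correct overall, compared with roughly 0.735 for never using the aid and 0.70 for always using it.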

Examining a strategy of selectively using the algorithm distinguishes our experimental settings from many past studies where the best response is always to use the algorithm instead of one’s own judgement (Arkes et al., 1986; Dietvorst et al., 2015; Logg et al., 2019). While it is possible, via sufficient experience and feedback, that participants can learn that the best response strategy is always to rely on the algorithm (e.g. Sieck & Arkes, 2005), there is no guarantee that such a policy will be implemented. Repeated experience may instead inspire a variety of hypotheses regarding what behaviour is appropriate, such as wondering “does the experimenter always expect the same response or should I intervene across different stimuli?” (Brehmer, 1980), and, in turn, lead to maladaptive experimentation and suboptimal responding (Szollosi et al., 2019).

Our intent was to remove this experimental layering by including situations in which the best response was to avoid the decision aid (e.g. on an easier trial, participants may judge their own performance to be superior to a 70%-correct algorithm). These avoid trials provide the additional space for participants to exhibit their understanding of the task. By adjusting one’s reliance on an algorithm, our data allow for richer characterizations of people’s decision-aid behaviours beyond a dichotomy of algorithm users and avoiders.

Experiment 1a and 1b: automatic recommendations

We begin with situations where recommendations are provided automatically and without cost to the decision-maker. Such automatic recommendations resemble alert systems that monitor data and only interrupt the decision-maker when a criterion is met (e.g. emergency ward alerts when patient vitals fall below critical thresholds). In Experiment 1a participants learnt to categorize mammogram images as cancerous or non-cancerous and in Experiment 1b, a separate group of participants performed the dot motion judgement task outlined in Fig. 1. In both experiments, participants were provided with recommendations from an algorithm described as being 70% accurate. Our key question was whether adherence to this recommendation differed as a function of the difficulty of the to-be-classified stimulus. We hypothesized that individuals would avoid relying on the decision aid for easier images and reserve its use for the harder images.

Method

Participants

Experiment 1a and Experiment 1b were identical in design and differed only in their stimuli (see below). Experiment 1a was conducted with 55 psychology undergraduates (M_age = 19.1, SD = 1.16, female = 34) at UNSW, Sydney. Experiment 1b involved 32 participants drawn from the same pool (M_age = 19.1, SD = 1.16, female = 16). Participants received course credit for participation and were awarded a proportional payment out of $5.00 AUD based upon their performance in the task (M_1a = $3.44, SD_1a = 0.21, M_1b = $3.73, SD_1b = 0.22). Sample size was determined on the basis of similar past experiments on training in categorization (Giguère & Love, 2013; n = 50) and on dot motion with similarly large numbers of within-subject trials (e.g. Pilly & Seitz, 2009; n = 12).

Materials

Stimuli

Experiment 1a and 1b used different stimuli. Experiment 1a involved categorizing mammogram images as either cancerous or normal. We obtained anonymized images from the Digital Database for Screening Mammography (DDSM) that is freely available online (Heath et al., 2001).

To better understand our results using stimuli over which we had more experimental control, Experiment 1b used random dot arrays. These arrays were adapted from the native random dot motion plugin for jsPsych (de Leeuw, 2015; example in Fig. 1). In the array, 300 grey dots move across the screen in various straight lines with a proportion of the dots coherently moving along the 90°–270° axis. The task requires participants to determine the direction of movement along this axis as either left-motion or right-motion (shown in the orange arrows in Fig. 1). Distractor dots moved in straight lines but along different axes (shown in grey arrows). The difficulty of the task was manipulated through the proportion of coherently moving dots. For example, a higher coherence level indicates a larger proportion of dots moving along the 90°–270° axis.
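To give a sense of scale for the coherence manipulation, the sketch below converts a coherence level into a number of coherently moving dots in a 300-dot array. It is an illustration only, not the plugin’s implementation: it assumes that coherence translates into a fixed count of signal dots, and the function name is hypothetical. The example coherence values are the four levels selected for Experiment 1b.

```python
def coherent_dot_count(n_dots: int, coherence: float) -> int:
    """Number of dots moving in the signal direction, assuming a fixed proportion."""
    if not 0.0 <= coherence <= 1.0:
        raise ValueError("coherence must lie between 0 and 1")
    return round(n_dots * coherence)


N_DOTS = 300
for coherence in (0.25, 0.2, 0.02, 0.01):
    signal = coherent_dot_count(N_DOTS, coherence)
    print(f"coherence {coherence}: {signal} signal dots, {N_DOTS - signal} distractors")
```

On this reading, the signal in the lowest coherence levels rests on only a handful of dots among several hundred distractors.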

Prior to each experiment, we conducted pilot testing to determine the difficulty of the stimuli. In general, difficulty was determined based on the performance of pilot participants in two additional separate experiments (N = 107 for mammogram pilot, N = 34 for dot motion pilot). In these pilot experiments, participants were presented with the perceptual task and asked to categorize the stimuli to their best ability. Average levels of performance were determined for each individual image in the case of mammograms (hence the larger sample size) and each coherence level for the dot motion stimuli. In brief, stimuli for which performance was relatively high (i.e. ~ 80% correct for mammograms, ~ 90% correct for dot motion) were labelled “easier”, whereas stimuli for which performance was near chance levels (i.e. ~ 55% correct for both stimuli types) were labelled “harder”. For Experiment 1a, we retained 267 mammogram images from an initial sample of 471 images using the above performance criterion. For Experiment 1b and all subsequent experiments, we selected coherence levels of 0.25, 0.2, 0.02, and 0.01 where the former two levels were labelled “easier” and the latter two “harder”. The full details of these pilot experiments are presented in Additional file 1.

Decision aid algorithm

The algorithm was instantiated as a probabilistic cue that was positioned above the stimulus. In Experiment 1a, the algorithm’s recommendation was a red circle that signalled cancer-category membership. In Experiment 1b, the recommendation was a left-pointing arrow that signalled leftwards motion. This means the algorithm signals only a single outcome (cancer/left). This design feature was originally inspired by mammogram images where a decision-maker may prioritize identification of cancer positive outcomes rather than non-cancerous outcomes. While this asymmetry in the outcomes does not translate to random dot stimuli, we retained the single-outcome cue in order to facilitate comparisons between the experiments.

For trials when the recommendation appeared, its onset was simultaneous with the onset of the stimulus. Participants were told that when the recommendation appeared, the algorithm would signal the correct category on 70% of occasions. This performance constraint means that stimulus categories were unbalanced such that 70% of the cued images were cancer/left stimuli and 30% were normal/right stimuli. We refer to the algorithm’s recommendation as the cue.

The test stage was separated into cued blocks, in which the algorithm’s cue could appear, and control blocks. In the cued blocks, the cue appeared on half of the images and for an equal number of easier/harder images. Presenting the cue for half the stimuli meant that the absence of the cue did not always indicate the image was a normal/right stimulus, although this was more probable due to the unbalanced proportions of stimuli. We return to the interpretation of non-cued images in the “Discussion” section. Each participant received a random subset of images for which the cue would appear. Before each control block began, participants were reminded that the cue would never appear. We included the control block to isolate the influence of the cue on responses (see Fig. 1).
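The consequence of the unbalanced cue for non-cued images can be worked through with simple counts. The sketch below is an illustration under stated assumptions rather than the experiment code: it assumes a cued block of 80 images split evenly between the two categories, a cue on exactly half of the images, and exactly 70% of cued images belonging to the cued category (cancer/left). The function name and dictionary keys are hypothetical.

```python
def cue_breakdown(block_size=80, cue_fraction=0.5, cue_validity=0.7):
    """Count cued and non-cued trials by category in one cued block.

    "signal" is the category the cue points to (cancer / left-motion);
    "other" is the alternative category (normal / right-motion).
    """
    n_signal = block_size // 2                 # balanced categories (assumption)
    n_other = block_size - n_signal
    n_cued = round(block_size * cue_fraction)
    n_uncued = block_size - n_cued
    cued_signal = round(n_cued * cue_validity)  # cue is correct on these trials
    cued_other = n_cued - cued_signal           # cue is misleading on these trials
    uncued_signal = n_signal - cued_signal
    uncued_other = n_other - cued_other
    return {
        "cued_signal": cued_signal,
        "cued_other": cued_other,
        "uncued_signal": uncued_signal,
        "uncued_other": uncued_other,
        "p_other_given_no_cue": uncued_other / n_uncued,
    }


print(cue_breakdown())
```

Under these assumptions, 28 of the 40 non-cued images are normal/right stimuli, so the absence of the cue is itself informative about the category.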

Decision aid algorithm description

In the instructions and as a reminder at the start of each cued block participants were told, “The algorithm is there to help you—whenever you see the cue, there is a 70% chance that the image (dots in the panel) was a cancer image (moving to the left). Conversely, there is a 30% chance that the cue is indicating the incorrect response and the image (dots in the panel) is a normal image (moving to the right).” (Italics show instructions for Exp. 1a, instructions for Exp. 1b in parentheses). Participants were reminded that it was up to them to decide if they wished to use the cue or rely upon their own judgement.

Design

The experiments used a within-subject design where block type (cued and control block) alternated throughout the experiment. The first block was randomized between-subjects and collapsed in the analyses.

Training and test blocks

Both experiments were divided into an initial training stage followed by a longer test stage without feedback. In Experiment 1a, the number of trials in each stage was constrained by the number of unique mammograms from the norming procedure. Experiment 1b did not have these constraints as the random dot motion stimuli were computer generated. Consequently, the training stage of Experiment 1a consisted of 44 easier mammogram images (i.e. 22 easier cancer and 22 easier normal images). The training stage in Experiment 1b consisted of 80 easier images (40 left-motion and 40 right-motion). As a brief aside, our decision to train participants on easier images and then test them on a combination of harder and easier images follows from work on the impact of idealized training in category learning (Giguère & Love, 2013; Hornsby & Love, 2014).^{Footnote 2}

Each test block of images contained 80 images made up of the 2 × 2 category by difficulty matrix. Specifically, in Experiment 1a each block consisted of 20 easier cancer, 20 easier normal, 20 harder cancer, and 20 harder normal mammograms. There were four test blocks in total (for progression, see Fig. 2). Experiment 1b also consisted of the same 80-image matrix with left-motion or right-motion categories and a total of six test blocks.

Procedure

Participants were introduced to the categorization task and given examples of each stimulus category prior to starting their training. They were told that their task was to categorize their respective stimuli as either cancer (left-motion) or normal (right-motion). Participants entered their responses on a keyboard with the cancer (left-motion) response mapped to the “c” key and the normal (right-motion) response mapped to the “n” key. The instructions explained that in the training stage, they would receive feedback following each image informing them of the correct category. Feedback appeared below the stimulus as either green text for correct responses or red text for incorrect responses. Individuals entered responses to proceed to the next trial. A fixation cross was displayed for 1.5 s before the start of the following trial. In training, responses slower than 5 s prompted feedback asking participants to speed up.

Following training,^{Footnote 3} a new set of instructions then described the test stage and the algorithm (cue). In both experiments, participants were told the cue would appear above the stimulus and could help them by signalling the probable correct response. The algorithm description statement (see “Materials” section) was presented. Instructions then explained that the test stage would be separated into the two block types: cued blocks, where the cue would appear on a random half of the trials, and control blocks, where participants would complete the task on their own (see Fig. 2). A short quiz was administered prior to starting the test stage to ensure participant understanding of the instructions. Block type alternated throughout the task. Between each block a reminder screen stated either the cue’s chance of being correct (e.g. 70% chance of cancer) or a reminder that the upcoming control block would never display the cue. Once complete, participants were paid based upon their overall proportion of correct responses.

Results

For this and following experiments, we report Bonferroni corrected p-values for analyses involving multiple comparisons and remove responses with extreme response times (slower than 10 s, 0.04% of trials; or faster than 0.18 s, 0.2% of trials). Recall that the test stage alternated between the control blocks and the cued blocks where the algorithm recommended one response (cancer in Exp. 1a, left-motion in Exp. 1b). In Fig. 3, we separately present these trial types in each experiment.
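The two preprocessing conventions described here can be sketched as follows. This is a minimal illustration, not the authors’ analysis script: the function names are hypothetical, the response-time window follows the thresholds stated above, and the Bonferroni adjustment is the standard one (each p-value multiplied by the number of comparisons and capped at 1).

```python
def keep_trial(rt_seconds, fastest=0.18, slowest=10.0):
    """True if the response time falls inside the retained window (in seconds)."""
    return fastest <= rt_seconds <= slowest


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    n_comparisons = len(p_values)
    return [min(1.0, p * n_comparisons) for p in p_values]


# Hypothetical response times (s) and uncorrected p-values, for illustration only.
response_times = [0.12, 0.85, 1.40, 11.3, 2.05]
retained = [rt for rt in response_times if keep_trial(rt)]
print(retained)
print(bonferroni([0.01, 0.04, 0.5]))
```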

Beginning with the easier trials (top row of Fig. 3), the proportion of correct responses was high across both experiments (M_1a = 0.89, SD_1a = 0.08, M_1b = 0.92, SD_1b = 0.08). Nearly all individuals, except for a single participant in each experiment, surpassed the accuracy of the cue. To gauge the influence of the cue, we calculated difference scores in proportion correct between the cued trials and the control trials. In both experiments, the cue produced a minor numerical improvement (mammogram images, M_diff = 0.02; dot motion M_diff = 0.03). This high level of performance suggests that on the 30% of trials when the cue was misleading, individuals were able to overrule its recommendations.
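The difference score used here is simply the proportion correct on cued trials minus the proportion correct on control trials, computed per participant and then averaged. The sketch below illustrates the calculation with made-up data; the function names and the data layout are hypothetical and are not taken from the original analysis.

```python
from statistics import mean


def proportion_correct(responses):
    """Proportion of correct trials, given a list of booleans (True = correct)."""
    return sum(responses) / len(responses)


def cue_difference_score(cued_correct, control_correct):
    """Cued-minus-control difference in proportion correct for one participant."""
    return proportion_correct(cued_correct) - proportion_correct(control_correct)


# Hypothetical data for two participants, for illustration only.
participants = [
    {"cued": [True, True, True, False], "control": [True, True, False, False]},
    {"cued": [True, True, True, True], "control": [True, True, True, False]},
]
scores = [cue_difference_score(p["cued"], p["control"]) for p in participants]
print(scores, mean(scores))
```

A positive score indicates that a participant was more accurate when the cue was available than when working unaided.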

The lower panels of Fig. 3 present performance in the harder trials. Overall performance was worse for the harder stimuli than the easier stimuli as indicated by a main effect of difficulty (F(1, 80) = 1182.20, p < 0.001, η_p² = 0.93). Difference scores between cued and control blocks showed larger improvements for the mammogram images in Exp. 1a (M_diff = 0.13, SD_diff = 0.07) as compared to the dot motion stimuli in Exp. 1b (M_diff = 0.04, SD_diff = 0.07). This difference was supported by a two-way ANOVA with a significant difficulty (hard vs easy) by experiment (1a vs 1b) interaction (F(1, 80) = 21.68, p < 0.001, η_p² = 0.21). Despite this improvement, most participants performed worse than the algorithm in the cued trials (algorithm’s accuracy = 0.70; M_1a = 0.59, SD = 0.08, t(49) = −9.13, p < 0.001; M_1b = 0.60, SD = 0.07, t(31) = −8.59, p < 0.001). This suggests that, on occasion, participants also disagreed with the cue when it appeared. Together, our results show that participants selectively relied on the cue for the harder trials but could have improved their performance had they agreed with the algorithm’s recommendation more often.

Discussion experiment 1a and 1b

Across both experiments, we found that individuals relied upon the algorithm’s recommendation for the harder stimuli and ignored the cue for easier stimuli when it was potentially misleading. For the harder images, participants improved their performance when the cue appeared by agreeing with the cancer (left-motion) recommendation. Curiously, participants also overruled the cue for the harder images on a minority of cued trials, presumably to correct for the knowledge that there would be misleading recommendations. As an aside, we examined whether the overruling patterns resembled probability matching (e.g. responding “cancer” for 70% of the cued images and “not-cancer” for the remaining 30%; see Additional file 1 for details). Although seemingly plausible in the aggregate, probability matching did not appear in the individual-level data.

While these initial results were encouraging, certain features of the cue in Experiment 1a and 1b limited our understanding of how participants recruited the recommendation. The first feature was that the cue always prompted a single response (cancer, left-motion). One problem this creates is determining whether participants inferred anything from the absence of the cue. It is possible participants interpreted this absence to signal the opposite of the cued response (normal or right-motion). Introducing a recommendation that can signal both responses would ameliorate this concern. Second, the fact that the cue appeared unpredictably obscured whether participants actually needed the recommendation for a particular stimulus. In the next experiment, we addressed both features by handing individuals control over when they sought out a recommendation.

An open question is whether people would overrule a recommendation that was sought out rather than automatically provided. Akin to the idea of sunk costs, overruling the algorithm may be unappealing given the effort to acquire the recommendation in the first place, especially if the participants were already uncertain (Arkes & Blumer, 1985). To answer this question, we designed Experiment 2 with a recommendation-request feature to examine when participants would seek out the algorithm’s response. We expected more requests for the algorithm’s recommendation during the harder trials than the easier trials. Indeed, participants in Experiment 1a and 1b showed an acute proficiency at the easier version of the task, giving us little reason to believe they needed the recommendations at all. To narrow our focus onto decision aids themselves, rather than any stimuli-related effects (i.e. a response bias for mammogram judgements that favours false alarms over missed diagnoses), our subsequent experiments used dot motion stimuli to understand how participants use a recommendation when they voluntarily seek it out.

Experiment 2: requesting the recommendation

Experiment 2 incorporated two changes to the algorithm. The first was that the algorithm provided recommendations about both outcomes (left- and right-motion). The second change implemented the recommendation request feature. In Experiment 2, participants had the option to request the recommendation on any given trial. One benefit to this request response is that it distinguishes instances when participants did not need the algorithm from instances when they requested but overruled its recommendation.

Alongside these changes, we manipulated block-feedback and training experience. Block-feedback provided a summary of participant performance separately for each difficulty level. We anticipated that performance feedback would prompt performance comparisons with the algorithm and highlight the improvement that comes with selectively requesting the algorithm for the harder images.

Our second manipulation involved training experience. In the previous experiments, participants did not have any training experience with the harder stimuli. In the absence of any error correction during testing, participants may have believed themselves to have discovered a sufficiently workable rule for the harder stimuli and may not have perceived a need to improve their strategy. Introducing training experience with the harder stimuli alongside feedback opportunities should ameliorate any such illusions about their performance.