Experiment 2 incorporated two changes to the algorithm. First, the algorithm provided recommendations about both outcomes (left- and right-motion). Second, we implemented a recommendation request feature: participants had the option to request the recommendation on any given trial. One benefit of this request feature is that it distinguishes instances in which participants did not need the algorithm from instances in which they requested but overruled its recommendation.
Alongside these changes, we manipulated block-feedback and training experience. Block-feedback provided a summary of participant performance separately for each difficulty level. We anticipated that this feedback would prompt comparisons with the algorithm's performance and highlight the improvement available from selectively requesting the algorithm for the harder images.
Our second manipulation involved training experience. In the previous experiments, participants had no training experience with the harder stimuli. In the absence of any error correction during testing, participants may have believed they had discovered a sufficiently workable rule for the harder stimuli and thus perceived no need to improve their strategy. Introducing training experience with the harder stimuli, alongside feedback opportunities, should dispel any such illusions about their performance.
Method
Participants
Experiment 2 involved 168 psychology undergraduates (Mage = 19.8, SD = 2.82; 109 female) at UNSW, Sydney. Six participants were excluded (one for failing the instruction check 14 times, five for completing less than half the experiment). The data of a further six participants who did not finish but completed most of the experiment were retained, for a total of 162 participants. Participants received course credit and were awarded a small payment, proportional to their performance in the task, up to a maximum of $5.00 AUD (M = $3.88, SD = $0.22).
Materials
Stimuli
Experiment 2 used the same random dot motion stimuli as Experiment 1b. We retained the same coherence levels, with 0.01 and 0.02 constituting the harder stimuli and 0.20 and 0.25 constituting the easier stimuli.
Algorithm
In the training instructions, we introduced the computer algorithm as a 100px-by-100px green box positioned above the stimulus (see example in Fig. 1C). The green box remained on screen until the algorithm was requested or a left/right response was entered. Participants could request the algorithm by pressing the “g” key during any trial, at the cost of a one-second loading delay. Once the algorithm was requested, a white loading bar revolved within the box (see Fig. 1B). After one second, the green box and loading bar disappeared to reveal a green-coloured arrow pointing either leftward or rightward. The recommendation was generated independently on each trial by randomly drawing a number between one and ten, with a 70% probability of displaying the actual correct direction.
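A minimal sketch of this recommendation rule follows (our reconstruction; the paper specifies only the uniform draw and the 70% probability, so the particular cutoff below is an assumption):

```python
import random

def algorithm_recommendation(true_direction: str) -> str:
    """Return the algorithm's recommendation for one trial.

    Draws an integer from 1 to 10; seven of the ten equally likely
    outcomes (70%) yield the true motion direction, the remaining
    three (30%) yield the opposite direction.
    """
    draw = random.randint(1, 10)  # uniform draw between one and ten
    if draw <= 7:                 # assumed cutoff realizing the 70% chance
        return true_direction
    return "left" if true_direction == "right" else "right"
```

Because the draw is independent on every trial, the recommendation is correct 70% of the time regardless of stimulus difficulty, exactly as the instructions to participants stated.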
Algorithm description
Prior to the start of the test stage, we described the algorithm mechanics in more detail. Instructions stated, “The algorithm is there to help you—whenever you see an image, the algorithm will calculate a direction for that very same image”. Just as in the previous experiment, we explicitly noted, “There is a 70% chance that [the algorithm] calculates the correct direction. Conversely, there is a 30% chance that it calculates the wrong direction”. Regarding stimulus difficulty, we explained that, whatever perceptual difficulty the participant might experience, the algorithm was still able to calculate a direction. We specifically stated, “For both easier and harder images, the algorithm has a 70% chance of calculating the correct direction”.
Block of trials
Each block consisted of 80 random dot motion arrays distributed across the difficulty-by-direction matrix (easier/harder by left-/right-motion). The test stage comprised six blocks in total, and participants were informed of this length.
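For concreteness, the sketch below constructs one such block, assuming an even 20-trial split across the four difficulty-by-direction cells and an even split between the two coherence levels within each cell (the paper states only that trials were distributed across the matrix; function and field names are illustrative):

```python
import random

# Coherence levels from the Stimuli section.
COHERENCES = {"easier": [0.20, 0.25], "harder": [0.01, 0.02]}

def make_block(trials_per_cell: int = 20) -> list[dict]:
    """Build one 80-trial block (4 cells x 20 trials), then shuffle."""
    block = []
    for difficulty, levels in COHERENCES.items():
        for direction in ("left", "right"):
            for i in range(trials_per_cell):
                block.append({
                    "difficulty": difficulty,
                    "direction": direction,
                    # assumed even alternation between the two levels
                    "coherence": levels[i % 2],
                })
    random.shuffle(block)  # randomize presentation order within the block
    return block
```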
Design
The experiment used a 2 (training) × 2 (feedback) between-subjects design (nmin = 40). The two levels of the training factor were easy-only (as in Experiments 1a and 1b) and easy-hard training. The easy-only training condition completed a block of easier training stimuli (80 images) followed by summary feedback. The easy-hard training condition completed a block of easier training with summary feedback, followed by an additional block of harder images with summary feedback (160 images total). In the results, we examine whether this additional training block led to improvements in motion detection performance.
The training factor was crossed with the feedback factor. The two levels of feedback were block-feedback and no-block-feedback. At the end of a block of images, the block-feedback conditions received a summary screen that stated (a) the proportion correct for easier images, (b) the proportion correct for harder images, and (c) the overall number of algorithm requests in that block. After clicking next, a second screen presented a table of their past performance for easier and harder images in each previous block, including training.
The no-block-feedback conditions skipped these summary pages and proceeded to a standard “take a break” screen between each block of trials.
Procedure
Participants were told their task was to categorize the motion of each stimulus as left- or right-motion. Instructions prior to the training stage explained that an algorithm would be present in training, though participants could not interact with it at that point. In training, responses slower than 5 s prompted feedback to speed up.
Participants proceeded through their respective training procedures, receiving trial-by-trial feedback after each stimulus. Following a block of training trials, participants also viewed summary feedback for that block. After completing the training stage, further instructions explained the functionality of the algorithm and the subsequent test stage. Participants were briefed about the incentive structure and told there would be six test blocks in which the algorithm was available upon request by pressing the “g” key. In the test stage, there was no time limit for an individual trial. Once the test stage was completed, participants were paid in proportion to their overall performance, up to a maximum of $5.00 AUD.
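The exact payment rule is not specified beyond proportionality. One reading consistent with the reported figures (our assumption, not stated in the original) is that payment scaled linearly with overall proportion correct. Using the test-stage means reported below, and noting that easier and harder trials were equally frequent:

$$\text{payment} \approx \$5.00 \times \frac{M_{\text{easier}} + M_{\text{harder}}}{2} = \$5.00 \times \frac{0.95 + 0.61}{2} \approx \$3.90,$$

which is close to the reported mean payment of $3.88.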
Results
We structure the results as follows: we first examine participant performance in training and at test, followed by participants' algorithm requests.
Performance
Performance in the training blocks is shown in Panel A of Fig. 4. Performance for the easier stimuli was near ceiling in both training conditions (Measy-hard = 0.94 vs. Measy-only = 0.93, SE = 0.01). By comparison, performance in the harder training block was substantially lower and near chance (Measy-hard = 0.53, SE = 0.01). Notably, only a single participant performed as well as the algorithm on these harder trials. For all other participants, we were interested in whether their relatively low performance in training would lead them to rely on the algorithm during the test stage.
Panel B of Fig. 4 plots performance in the subsequent test stage. Similar to the results in training, stimulus difficulty had a large effect on performance (F(1, 158) = 4219.08, p < 0.001, ηp2 = 0.96). Across all conditions, participants performed better on the easier stimuli (top right, Panel B; M = 0.95, SE = 0.01) than on the harder stimuli (bottom right, Panel B; M = 0.61, SE = 0.01). Interestingly, despite the easy-hard training conditions undergoing an additional training block, their performance was similar to that of the easy-only training conditions, which underwent only the easier training block. This suggests that the additional training trials did not improve the detection of motion. Rather, the overall improvement in test performance relative to training appears to stem from requests for the algorithm.
Algorithm requests
Panel A of Fig. 5 shows the proportion of requests for the algorithm’s recommendation. Across conditions, participants overwhelmingly requested the algorithm when faced with harder stimuli (F(1, 158) = 242.56, p < 0.001, ηp2 = 0.61). Most participants made few requests, if any, for the easier stimuli (median = 7 requests/240 easier trials). This main effect of difficulty was qualified by a training-by-difficulty interaction (F(1, 158) = 8.61, p = 0.004, ηp2 = 0.05): the easy-hard training conditions requested the algorithm more than the easy-only training conditions for the harder stimuli (Measy-hard = 0.40, SE = 0.03 vs. Measy-only = 0.30, SE = 0.03) but not for the easier stimuli (Measy-hard = 0.08, SE = 0.01 vs. Measy-only = 0.08, SE = 0.02). In other words, training experience increased subsequent reliance on the algorithm for the appropriate, harder, stimuli.
With regard to feedback, we found no main effect (F(1, 158) = 0.25, p = 0.62) but, curiously, a feedback-by-training interaction (F(1, 158) = 5.89, p = 0.016, ηp2 = 0.04). For the easy-hard training conditions, requests for the algorithm were numerically higher with block-feedback (first-from-the-left, or red, bar in Fig. 5A) than without block-feedback (second, or green, bar in Fig. 5A; averaged over difficulty, 27% vs. 21%, t(320) = 1.43, p = 0.15). For the easy-only training conditions, however, the reverse was true: requests were significantly lower with block-feedback (fourth, or yellow, bar in Fig. 5A) than without block-feedback (third, or orange, bar; 15% vs. 24%, t(320) = 2.16, p = 0.03). This result is intriguing because the easy-only training groups had no exposure to the harder stimuli before test. While we return to this result in the “Discussion” section, a tentative interpretation is that summary block-feedback encouraged easy-only participants to track their own skill improvement on the harder stimuli rather than emphasizing the superiority of the algorithm.
To better understand the relationship between requests and performance, we separated algorithm-assisted test trials from participants’ own decisions in Panel B of Fig. 5. Across all conditions, the algorithm’s recommendation aided performance on the harder trials (lower panel; Mown = 0.59 vs. Massist = 0.67) but decreased performance on the easier trials (Mown = 0.95 vs. Massist = 0.83). This trial-type-by-difficulty interaction (F(1, 130) = 103.78, p < 0.001, ηp2 = 0.44) verified that the imperfect algorithm was indeed helpful when participants recruited its recommendation for the harder trials. Although requesting the algorithm slightly impaired performance on the easier trials, the overall low number of requests for these trials suggests that participants understood they did not need it.
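As a rough consistency check (our arithmetic, not reported in the original), overall accuracy on the harder trials can be expressed as a mixture of the two trial types, weighted by the request rate $r$:

$$P(\text{correct} \mid \text{harder}) = r \cdot M_{\text{assist}} + (1 - r) \cdot M_{\text{own}}.$$

Taking $r \approx 0.35$ (midway between the two training conditions’ request rates for harder stimuli) gives $0.35 \times 0.67 + 0.65 \times 0.59 \approx 0.62$, close to the reported overall M = 0.61 for the harder test trials.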
Discussion Experiment 2
In Experiment 2, we examined how block-feedback and training affected how people relied on an algorithm’s recommendation. Overall, requests for the algorithm were mostly reserved for the harder stimuli, suggesting that participants distinguished when the recommendation was useful from when it could be misleading. The improvement in performance on algorithm-assisted trials also suggests that, when participants asked for the recommendation, they generally accepted it. We return to algorithm agreement in the “General discussion” section.
Optimistically, these results suggest that further improvements in performance were possible by requesting the algorithm on more, if not all, of the harder trials. That is, if participants had relied exclusively on the recommendation for the harder stimuli, they could have matched the superior performance level of the algorithm. However, even in the condition given the most guidance, with both block-feedback and training experience, participants fell short of fully capitalizing on this strategy. Why might this be the case?
One motivational explanation is that some participants may have wanted additional practice with the difficult stimuli. Participants likely noticed that some stimuli in the task were considerably more difficult than others, particularly in the conditions that received easy-hard training. Despite the additional effort involved, participants may still have believed they could improve their abilities with practice. If there were such motivated individuals in the experiment, the algorithm may have been treated as a fallback, used only when participants were completely uncertain. These individuals may have persisted with the perceptual discrimination elements of the task for far longer, in the belief that deferring to the recommendation would rob them of the chance to improve their skills.
This motivational account may also speak to the lack of strong feedback effects. Recall that our motivation for block-feedback was to encourage participants to compare their performance to that of the algorithm. An alternative way to use the summary feedback, however, was to track one’s own improvement over time. Following each block, participants may have been more interested in comparing their performance to previous blocks than to the algorithm. Particularly for the easy-only training conditions, which had yet to encounter a harder stimulus, summary feedback between blocks was their only metric for gauging performance and error-correcting their response strategy. Considered together, a speculative interpretation of the lack of feedback effects is that participants used different methods to gauge their performance: block-feedback when it was available (forgoing the algorithm), and requests for the algorithm when feedback was absent.
In the next experiment, we sought to explicitly encourage performance comparisons with the algorithm and strengthen the feedback manipulation.