Evaluating the effectiveness of different perceptual training methods in a difficult visual discrimination task with ultrasound images

Marris, Jessica E.; Perfors, Andrew; Mitchell, David; Wang, Wayland; McCusker, Mark W.; Lovell, Timothy John Haynes; Gibson, Robert N.; Gaillard, Frank; Howe, Piers D. L.

doi:10.1186/s41235-023-00467-0

Original article
Open access
Published: 20 March 2023

Jessica E. Marris¹ (ORCID: 0000-0003-2658-3016),
Andrew Perfors¹,
David Mitchell²,³,
Wayland Wang³,
Mark W. McCusker³,⁴,
Timothy John Haynes Lovell³,
Robert N. Gibson³,⁴,
Frank Gaillard³,⁴ &
Piers D. L. Howe¹

Cognitive Research: Principles and Implications volume 8, Article number: 19 (2023)

Abstract

Recent work has shown that perceptual training can be used to improve the performance of novices in real-world visual classification tasks with medical images, but it is unclear which perceptual training methods are the most effective, especially for difficult medical image discrimination tasks. We investigated several different perceptual training methods with medically naïve participants in a difficult radiology task: identifying the degree of hepatic steatosis (fatty infiltration of the liver) in liver ultrasound images. In Experiment 1a (N = 90), participants completed four sessions of standard perceptual training, and participants in Experiment 1b (N = 71) completed four sessions of comparison training. There was a significant post-training improvement for both types of training, although performance was better when the trained task aligned with the task participants were tested on. In both experiments, performance initially improved rapidly, with learning becoming more gradual after the first training session. In Experiment 2 (N = 200), we explored the hypothesis that performance could be improved by combining perceptual training with explicit annotated feedback presented in a stepwise fashion. Although participants improved in all training conditions, performance was similar regardless of whether participants were given annotations, underwent training in a stepwise fashion, both, or neither. Overall, we found that perceptual training can rapidly improve performance on a difficult radiology task, albeit not to a level comparable to expert performance, and that similar levels of performance were achieved across the perceptual training paradigms we compared.

Introduction

With practice and experience, humans learn to extract the relevant perceptual features that guide decisions about stimuli in their environment, even when these features are difficult to verbalise (Kellman & Garrigan, 2009). Training that aims to improve perceptual skills is referred to as perceptual training (Chen et al., 2017), whereas the term perceptual learning describes the improvement in task performance (e.g. the ability to identify, detect, and discriminate stimuli) that results from this training (Sagi, 2011). Perceptual learning occurs across a wide range of simple visual tasks with basic stimuli (e.g. dots, line segments, and Gabor patches), such as motion direction detection (Ball & Sekuler, 1987), orientation discrimination (Fiorentini & Berardi, 1980), and texture discrimination (Karni & Sagi, 1991).

Perceptual training techniques have been increasingly applied to real-world visual tasks with complex stimuli. A growing body of work in the medical domain—for example, in radiology (Chen et al., 2017; Frank et al., 2020; Johnston et al., 2020; Sha et al., 2020; Sowden et al., 2000), dermatology (Rimoin et al., 2015; Xu et al., 2016), histopathology (Krasne et al., 2013), and cytopathology (Evered et al., 2014)—has found that perceptual training can lead to rapid and substantial improvements in performance on visual tasks with medical images. These findings are particularly relevant because medical professionals, such as radiologists, undergo many years of training to develop the expertise to interpret complex medical images. Traditionally, radiologists are trained to interpret and diagnose medical images in a primarily rule-based fashion, although this may not be the most efficient approach for learning complex visual tasks that require perceptual decisions (Johnston et al., 2020). The findings from the perceptual training literature suggest that perceptual training techniques could usefully supplement the traditional training that radiologists receive. Perceptual training also offers the benefit of immediate feedback, which has been found to be essential for learning to interpret radiology images (Sha et al., 2020), and is often delayed or absent in real-world medical image training.

However, the extent to which perceptual training is beneficial remains somewhat unclear: is it possible for participants to reach (or at least approach) expert-level performance when the tasks are highly complex? Whilst many studies have shown that perceptual training can lead to improvements in performance in visual tasks in the medical domain, relatively few studies have explored whether it is possible for participants to achieve similar levels of performance to experts, and if so, when. One exception is a study by Chen et al. (2017), who compared the performance of medically naïve participants that underwent perceptual training to identify proximal neck of femur fractures in X-ray images to that of experts (board-certified radiologists and radiology residents) across a series of experiments. The mean accuracy of the participants was approximately 90% after only two perceptual training sessions, which was only slightly lower than the accuracy of experts (94%). Whilst pre-training accuracy was not assessed in this experiment, Chen et al. (2017) found that pre-training accuracy was only slightly above chance (55.9%) in two similar experiments. This finding suggests that perceptual training can be a practical and efficient way of obtaining medical image discrimination expertise.

Given its potential usefulness, it seems timely to ask if perceptual training techniques are effective for medical image discrimination tasks that require finer judgements (i.e. beyond a two-choice judgement). Additionally, with such tasks of increased difficulty, is there a particular perceptual training technique that is more effective? The most common and simple perceptual training technique is to present stimuli sequentially. On each training trial, participants make a judgement (e.g. “Is there a hip fracture present?”) about a single stimulus and are then informed if they were correct. Although similar techniques are used in the categorisation literature (category learning and perceptual learning likely result from overlapping mechanisms; Carvalho & Goldstone, 2016), we refer to this as standard perceptual training as our review focuses on the perceptual training literature. However, recent successes with alternative training techniques, such as training participants with comparison images (e.g. Sha et al., 2020) or supplementing standard perceptual training with annotated feedback (e.g. Chen et al., 2017; Frank et al., 2020; Johnston et al., 2020), call into question whether the standard perceptual training technique is the most effective, especially for perceptual tasks more challenging than those typically studied (i.e. beyond two-choice tasks).

Our overarching aim is to assess which perceptual training methods are the most effective for training medically naïve participants to improve their performance in a difficult real-world medical image discrimination task. To address this goal, we systematically tested different perceptual training procedures across a series of experiments. In our studies, we chose to assess perceptual training with a task which experts and trainee radiologists find difficult: identifying the degree of hepatic steatosis (fatty infiltration of the liver) on ultrasound images. Additionally, we sought to gain a better understanding of the limits of these perceptual training techniques in our task, by comparing the post-training performance of trained novices to an estimate of expert performance.

Experiment 1a

Traditionally, perceptual learning studies with simple stimuli and tasks have involved multiple sessions with thousands of trials (Dosher & Lu, 2017; Gauthier et al., 1998). However, studies with complex real-world images tend to involve substantially fewer sessions and trials. This is often due to practical constraints such as the limited availability of suitable images (Chen et al., 2017) and time constraints related to recruiting and maintaining participants. Despite a shorter amount of perceptual training, many of these studies have found significant performance improvements (e.g. Chen et al., 2017; Johnston et al., 2020; Sha et al., 2020), suggesting that perceptual learning with complex medical image discrimination tasks can occur rapidly. For instance, the top five performers in Chen et al.’s (2017) study could be trained up to a level approaching that of experts within an hour of training. These findings suggest that perceptual training can be efficient and effective, although the task employed by Chen et al. (2017) was very simple, requiring participants only to learn to make a binary judgement. It is therefore unclear to what extent perceptual training can be used to assist participants in learning a more difficult visual image classification task, especially one requiring more than binary judgements.

If we can train naïve participants to perform a difficult visual discrimination task at a level comparable to experts in a short period of time with standard perceptual training, then there would be no need to investigate if there are more effective perceptual training paradigms. Thus, the aim of the current experiment was to assess the effectiveness of standard perceptual training on a difficult real-world visual image discrimination task—grading the severity of hepatic steatosis present in ultrasound images—to determine if there is a need to develop more effective perceptual training paradigms. We spaced the training over four sessions to allow participants the time and opportunity to learn this difficult task whilst balancing fatigue and time constraints. We allowed images to repeat during training to ensure that we had sufficient stimuli, as there is evidence that repeating images is not detrimental to learning (Chen et al., 2017; Johnston et al., 2020; Sha et al., 2020).

Consistent with the literature, we hypothesised that standard perceptual training would lead to an improvement in performance, as measured by a reduction in mean error from the pretest to the post-test. However, due to the difficult nature of the task, which requires finer discrimination than the two-choice tasks used in previous studies, we expected that participants would be unable to reduce their mean error to a benchmark level of expert performance (which we estimated from five experts who assisted with grading the stimuli). Finally, we hypothesised that learning would progress over the multiple sessions and that the average training performance (mean error) towards the end of each training session (the last 20 training trials) would improve over sessions.

Methods

Participants

Participants were recruited from Prolific. A pre-screening questionnaire was used to identify participants who had normal or corrected-to-normal vision, normal colour vision, no prior training or experience in radiology, and a willingness to participate in a multiple-session experiment. We invited 100 eligible people to participate. As we expected our task would be more difficult and have a smaller effect size than the task studied by Chen et al. (2017), we recruited a substantially larger sample (i.e. 100 instead of 25).

Data for 10 participants were excluded for non-completion of all sessions or for repeating or partially completing a session. The final sample consisted of 90 participants (M_age = 38.8 years, SD_age = 13.7, 45 female). Participants were compensated a total of £11.05 for completing the four sessions. To motivate performance, a bonus of £1 was awarded to the top 25% of performers.

Additionally, five experts (three consultant radiologists, one radiology fellow, and one radiology registrar) rated the stimuli. These experts were a convenience sample and did not participate in the experiments. From their ratings, we also obtained an estimate of expert performance, against which we compared the performance of our trained participants.

Materials

Abdominal ultrasounds of 505 unique livers were sourced from a tertiary care centre and reviewed as suitable for inclusion. Instead of using a single image of each liver, a collage image was constructed, as radiologists typically view several images when making decisions about these types of cases. Each collage contained four ultrasound views (two transverse and two longitudinal) that represented a liver (see Fig. 1 for an example).

As no objective measure is available to establish the severity of hepatic steatosis, the five experts independently graded each collage on a 7-point scale, ranging from 1 (Normal) to 7 (Severe). The grading scale was expanded so that it was more fine-grained than what is commonly used in practice, to better determine improvements in performance. For all 505 collages, the intraclass correlation coefficient estimate was 0.94, 95% CI [0.93, 0.95], which was calculated based on a mean-rating (k = 5), absolute-agreement, two-way random-effects model, and suggested excellent reliability (Ku & Li, 2016). For each collage, a gold standard consensus grade was determined from the average rating of the five experts. As we sought to select stimuli that were rated the most consistently by experts, collages where one or more experts’ ratings deviated by more than one grade from this consensus grade were excluded. The final pool of stimuli contained 386 collages. The stimuli were not equally distributed across the grades, with the majority depicting livers that were on the lower end (grades 1–3) of the scale (16%, 40%, and 11%, respectively) rather than the higher end (grades 4–7; 6%, 11%, 10%, and 6%, respectively). However, this is consistent with more severe cases occurring less frequently in practice, resulting in fewer suitable images of higher severity being available.
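The consensus and exclusion rule described above can be sketched in a few lines of Python. This is an illustrative reading rather than the authors' code: it assumes that "deviated more than one grade" means an absolute difference greater than 1 between an expert's rating and the five-expert mean, and the function names and example ratings are hypothetical.

```python
from statistics import mean

def consensus_grade(ratings):
    """Consensus grade for one collage: the mean of the expert ratings."""
    return mean(ratings)

def is_consistently_rated(ratings, max_deviation=1.0):
    """True if no expert's rating differs from the consensus by more than max_deviation.

    Assumption: deviation is the absolute difference from the all-expert mean.
    """
    consensus = consensus_grade(ratings)
    return all(abs(r - consensus) <= max_deviation for r in ratings)

# Hypothetical ratings from five experts on the 7-point scale (not real data).
example_ratings = {
    "collage_a": [2, 2, 3, 2, 2],  # every rating within one grade of the mean: kept
    "collage_b": [1, 4, 2, 2, 2],  # one rating is 1.8 grades above the mean: excluded
}

kept = {
    name: consensus_grade(ratings)
    for name, ratings in example_ratings.items()
    if is_consistently_rated(ratings)
}
print(kept)
```

Under this reading, a collage survives only when all five ratings sit within one grade of their mean, which is how the pool would shrink from 505 to 386 collages.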

The collages were randomly split into a training set (286 collages) and a test set (50 collages for the pretest and 50 collages for the post-test), with the condition that the distribution of grades was balanced across the sets. The collages were 750 pixels (width) by 562.5 pixels (height).
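One way to obtain a grade-balanced random split is proportional stratified sampling, sketched below. The article does not describe the exact procedure, so this is an assumption throughout: collages are assumed to be binned to integer grades, quotas are set with a largest-remainder rule, and the per-grade counts in the demo are hypothetical values chosen only to sum to 386.

```python
import random
from collections import defaultdict

def largest_remainder_quota(counts, total):
    """Share `total` slots across groups in proportion to their sizes."""
    n = sum(counts.values())
    exact = {g: total * c / n for g, c in counts.items()}
    quota = {g: int(x) for g, x in exact.items()}
    leftover = total - sum(quota.values())
    # Give the remaining slots to the groups with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda g: exact[g] - quota[g], reverse=True)
    for g in by_fraction[:leftover]:
        quota[g] += 1
    return quota

def stratified_split(grades, test_size=50, seed=0):
    """grades: dict of collage id -> integer grade.

    Returns (train, pretest, posttest) lists of ids. Assumes every grade
    contains at least twice its quota of collages.
    """
    rng = random.Random(seed)
    by_grade = defaultdict(list)
    for collage_id, grade in grades.items():
        by_grade[grade].append(collage_id)
    quota = largest_remainder_quota({g: len(ids) for g, ids in by_grade.items()}, test_size)
    train, pretest, posttest = [], [], []
    for g, ids in by_grade.items():
        rng.shuffle(ids)
        k = quota[g]
        pretest += ids[:k]          # first k shuffled ids of this grade
        posttest += ids[k:2 * k]    # next k ids, so both tests share a grade profile
        train += ids[2 * k:]        # everything else is used for training
    return train, pretest, posttest

# Hypothetical per-grade counts (grades 1 to 7) summing to 386.
demo_counts = {1: 62, 2: 155, 3: 42, 4: 23, 5: 42, 6: 39, 7: 23}
demo_grades = {f"g{g}_{i}": g for g, c in demo_counts.items() for i in range(c)}
train, pretest, posttest = stratified_split(demo_grades)
print(len(train), len(pretest), len(posttest))
```

Giving the pretest and post-test the same per-grade quota keeps the two tests comparable in difficulty, which matters because the main outcome is the change in mean error between them.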

Design and procedure

The experiment was developed using jsPsych (de Leeuw, 2015) to allow for the experiment to be completed online. All participants completed the experiment on a desktop or laptop computer with a minimum browser window size of 1024 × 700 pixels.

There were four self-paced training sessions, with a pretest at the beginning of the first session and a post-test at the end of the final session. Participants had a 48-h window to complete each session, with a 24-h break between when the window for a session closed and the window for the next session opened. Therefore, depending on when in the 48-h window the sessions were completed, there was a break of at least 24 h and up to 120 h between sessions.
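The 24 h and 120 h bounds follow directly from the window and break lengths. The short calculation below makes the two extreme cases explicit; the variable names are illustrative.

```python
WINDOW_H = 48  # each session can be completed at any time within a 48-h window
BREAK_H = 24   # time between one window closing and the next window opening

# Shortest gap: finish at the very end of one window and start
# at the very beginning of the next.
min_gap = BREAK_H

# Longest gap: finish at the very beginning of one window and start
# at the very end of the next.
max_gap = WINDOW_H + BREAK_H + WINDOW_H

print(min_gap, max_gap)
```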

At the start of the experiment, it was explained that the task was to grade the degree of fatty liver tissue in liver ultrasound images, using the 7-point scale. The description of the task was simplified into plain language to avoid technical terms that novices may have difficulty grasping. An annotated image was shown (Fig. 2) to provide basic instruction about the type of features that differ as the fattiness increases. In addition, four individual images of livers that represented grades 1, 3, 5, and 7 were shown. This rudimentary rule-based instruction was included because it is similar to the type of instruction that radiology trainees would initially receive.

Participants then completed the pretest where they graded 50 collages, with no limits on the time taken to view the stimuli or feedback. The collages were presented sequentially in a randomised order. Responses were made via the keyboard, and a prompt was displayed underneath each collage to remind participants of the response options. No feedback was provided during this phase.

Participants then underwent four sessions of perceptual training, which was also self-paced. There were 100 training trials per session (400 in total). The collages presented during the training phase were randomly sampled with replacement from the training set. To motivate participants, points were awarded during the training phase, and these points contributed towards earning the performance bonus. Points were awarded depending on the distance from the correct answer, with a higher number of points awarded for correct responses than near-correct responses. In the training phase, after grading a collage, the correct grade was immediately presented underneath the collage with a feedback message that differed depending on how near the response was to the correct answer (i.e. “Spot on! Correct” in green text for correct responses, “Almost” in blue text when one grade off the correct answer, “Not quite” in orange text for two grades from the correct answer, or “Incorrect” in red for responses more than two grades from the correct answer).
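The feedback rule amounts to a lookup on the distance between the response and the correct grade. The sketch below is written in Python for illustration only, since the experiment itself was built in jsPsych, and the function name is hypothetical. Point values are omitted because the article does not report them.

```python
def feedback(response, correct_grade):
    """Return the (message, colour) pair shown after a training response."""
    distance = abs(response - correct_grade)
    if distance == 0:
        return ("Spot on! Correct", "green")
    if distance == 1:
        return ("Almost", "blue")
    if distance == 2:
        return ("Not quite", "orange")
    return ("Incorrect", "red")  # more than two grades from the correct answer

# Example: the correct grade is 3 and the participant answers 5.
print(feedback(5, 3))
```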

To encourage careful responding, ten attention check trials were included over the four sessions. On these trials, the words “attention check” were overlaid in grey text on each image in the collage and the prompt below the stimulus instructed participants on how to respond (e.g. “Please respond 1: Normal”). If an incorrect response was made, participants were reminded that it was important to pay attention to the task. Participants who failed more than one attention check on average per session (i.e. more than four out of the ten attention checks over the four sessions) were excluded from the subsequent analyses.

Results

Due to a technical error, 1–10 trials of data were missing for five participants, so analyses were conducted on their remaining data. No participants failed the attention check criteria. The average total completion time for all sessions was 62 min.

As shown in Additional file 1: Fig. S1, performance on the post-test improved, with more responses closer to the consensus answer (e.g. distances 0 or 1) and fewer responses that were further (e.g. distances 5 or 6). To better quantify the overall improvement in performance, we computed the mean error for each participant on each test, which is shown in Fig. 3. The mean error represents the distance from the consensus answer, with a lower value indicating better performance. A paired-samples t test revealed that the mean error on the post-test was significantly lower than the pretest, t(89) = 13.68, p < .001, 95% CI [0.59, 0.79], d = 1.44.
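As a rough illustration of these quantities, the sketch below computes a participant's mean error and a paired-samples t statistic with the standardised mean difference of the paired scores (often written d_z). It assumes that "distance" means the absolute difference between response and consensus grade. The function names and numbers are hypothetical, not the study data.

```python
from math import sqrt
from statistics import mean, stdev

def mean_error(responses, consensus):
    """Mean absolute distance between a participant's grades and the consensus grades."""
    return mean(abs(r - c) for r, c in zip(responses, consensus))

def paired_t(pre, post):
    """Paired-samples t statistic, degrees of freedom, and d_z for pre minus post."""
    diffs = [a - b for a, b in zip(pre, post)]
    n = len(diffs)
    d_z = mean(diffs) / stdev(diffs)  # standardised mean of the paired differences
    t = d_z * sqrt(n)                 # same as mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1, d_z

# Hypothetical per-participant mean errors on the pretest and post-test.
pre_errors = [1.8, 1.6, 2.0, 1.5]
post_errors = [1.0, 1.1, 1.2, 0.9]
t, df, d_z = paired_t(pre_errors, post_errors)
print(round(t, 2), df, round(d_z, 2))
```

For a paired design t = d_z × √n, and the reported values are consistent with this relationship: 13.68 / √90 ≈ 1.44.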

To provide a reference point of expert performance, we first approximated the performance of our group of experts for the same collages that participants were tested on. However, we used a slightly different reference point to assess their performance, to avoid “double-dipping” the data. For each collage, we assessed each expert’s rating relative to the mean rating of the other four experts (i.e. for each expert we constructed a consensus rating using the ratings of the other four experts) and then calculated the overall mean error for the group of experts (shown by the blue dotted line in Fig. 3). As this used the same data that was used to select the reliably rated collages for use in the experiment, we also estimated the performance of the experts by repeating this procedure but for all 505 collages (shown by the black dotted line in Fig. 3). Using the results of this second more rigorous estimate of expert performance, a Welch independent samples t test found that the trained participants had significantly higher mean error than the experts, t(9.10) = 8.82, p < .001, 95% CI [0.33, 0.56], d = 2.14.
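The leave-one-out estimate of expert error can be sketched as follows. This is a minimal illustration that assumes error is the absolute difference between an expert's rating and the mean of the other experts' ratings; the function name and ratings are hypothetical.

```python
from statistics import mean

def leave_one_out_errors(ratings_by_collage):
    """ratings_by_collage: list of per-collage lists, one rating per expert.

    Returns each expert's mean absolute error, scored against the mean
    rating of the remaining experts on every collage.
    """
    n_experts = len(ratings_by_collage[0])
    per_expert = []
    for e in range(n_experts):
        errors = []
        for ratings in ratings_by_collage:
            others = ratings[:e] + ratings[e + 1:]  # drop expert e's own rating
            errors.append(abs(ratings[e] - mean(others)))
        per_expert.append(mean(errors))
    return per_expert

# Hypothetical ratings: two collages, five experts each.
demo = [[2, 2, 3, 2, 2], [5, 6, 5, 5, 4]]
expert_errors = leave_one_out_errors(demo)
print(expert_errors, mean(expert_errors))
```

Scoring each expert against a consensus that excludes their own rating stops an expert's rating from contributing to the standard it is judged against, which would otherwise bias the error estimate downwards.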

Figure 4 shows the average training performance over the course of each training session. A linear model with trial number as the predictor found that the average mean error decreased significantly over the first session, F(1, 98) = 22.77, p < .001. However, this trend did not continue over the second session, F(1, 98) = 3.28, p = .073, third session, F(1, 98) = 0.48, p = .491, or fourth session, F(1, 98) = 0.31, p = .566. As we did not conduct a post-test following each training session, we approximated the learning that occurred in each session by calculating the mean error for the final 20 training trials, which is given in Table 1. A one-way ANOVA found there was a significant difference in the mean error in the final 20 trials of the four sessions, F(3, 267) = 17.45, p < .001, η²_G = .08. Post hoc t tests with a Bonferroni correction revealed that the second, third, and fourth training sessions all had significantly lower mean error than the first training session (p < .001). All other comparisons were non-significant.
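The within-session trend analysis can be approximated with an ordinary least-squares fit of trial-wise average error on trial number. The sketch below assumes that errors are first averaged across participants at each trial position, which would give 100 points per session and residual degrees of freedom of 98, matching the reported F(1, 98). The function names and demo data are hypothetical.

```python
from statistics import mean

def trialwise_mean_error(errors_by_participant):
    """Average the error at each trial position across participants."""
    return [mean(trial) for trial in zip(*errors_by_participant)]

def linear_trend(y):
    """Least-squares fit of y on trial number 1..n.

    Returns (slope, F statistic, (df1, df2)). Undefined for a perfect fit,
    where the residual sum of squares is zero.
    """
    n = len(y)
    x = range(1, n + 1)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_reg = slope * sxy  # variation explained by the linear trend
    f_stat = ss_reg / (ss_res / (n - 2))
    return slope, f_stat, (1, n - 2)

def final_trials_mean(y, k=20):
    """Mean error over the last k training trials of a session."""
    return mean(y[-k:])

# Hypothetical session: error drifts down by 0.01 per trial with alternating noise.
session = [2.0 - 0.01 * i + (0.05 if i % 2 else -0.05) for i in range(1, 101)]
slope, f_stat, df = linear_trend(session)
print(round(slope, 4), df, round(final_trials_mean(session), 3))
```

A negative slope with a large F indicates that error fell over the session, and the final-20-trial mean gives the end-of-session summary of the kind reported in Table 1.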

Table 1 Mean error (Experiment 1a) and mean difficulty level of the comparison (Experiment 1b) for the last 20 training trials of each training session

Discussion

Consistent with our expectations and prior work, we found that perceptual training improved performance on our difficult visual discrimination task. However, unlike Chen et al. (2017), and as expected, standard perceptual training was not sufficient to train people to the level of expert performance.

When evaluating training performance, we found that meaningful improvements in performance occurred within the first training session, after which learning appeared to gradually plateau. There were no significant improvements in training performance for the later sessions. These findings are not entirely consistent with Sha et al. (2020) where there were significant improvements in learning between sessions, or with the substantial improvement over the entire training found by Johnston et al. (2020). Whilst some minor differences in methodology could account for this discrepancy (e.g. the number of images and sessions), it is plausible that our increased task difficulty limited the amount of learning that could occur with this simple perceptual training method.

Experiment 1b

Is there a more effective training regime than the standard perceptual training approach? One alternative perceptual training method, which we refer to as comparison training, involves presenting several stimuli simultaneously, with the purpose of facilitating comparison. Whilst there are variations that involve passive learning (e.g. presenting stimuli with their category labels for study), we are interested in active learning where participants make judgements and receive feedback, as this kind of testing can enhance learning (Roediger & Karpicke, 2006). In active comparison training, the stimuli presented on each trial generally depict different categories (e.g. a normal and severe case) and participants need to discriminate between the stimuli (e.g. “Which image is Normal?”), and then receive immediate feedback. Whilst only a few perceptual training studies with real-world images have used comparison training (e.g. Evered et al., 2014; Searston & Tangen, 2017; Sha et al., 2020), similar techniques are successfully used in the categorisation literature (Kang & Pashler, 2012; Meagher et al., 2017). Additionally, there is some evidence that simultaneous exposure is more effective for perceptual learning than sequential exposure, in tasks with stimuli such as faces (Mundy et al., 2007) and simpler checkerboard stimuli (Mundy et al., 2009). Therefore, a perceptual training regime that involves an active comparison between simultaneously presented stimuli offers a promising way to enhance learning.

It is theorised that simultaneously presenting stimuli enhances discriminative contrast by highlighting commonalities and differences and can improve discrimination ability (Hammer et al., 2008; Kang & Pashler, 2012). This is particularly relevant when discriminating between highly similar categories (Carvalho & Goldstone, 2014). For our stimuli, those which are closer in grades (e.g. Normal vs Normal-mild) are likely to be more confusable than grades that are further apart. Therefore, the process of comparing these highly similar stimuli is expected to facilitate learning.

The aim of the current experiment was to assess whether a training approach that facilitates comparison between stimuli is effective for training medically naïve participants to grade the severity of hepatic steatosis present in ultrasound images. We hypothesised that comparison training would improve post-training performance, as measured by a reduction in mean error. As in Experiment 1a, we did not expect that participants would be able to reach the level of expert performance. Although we did not find substantial benefits for multiple sessions in Experiment 1a, it is possible that comparison training could show a benefit, especially if it has the potential to teach the participant more. We therefore chose to test comparison training across four sessions, again expecting that average training performance towards the end of each training session (the last 20 training trials) would improve over sessions.