Object recognition ability predicts category learning with medical images
Cognitive Research: Principles and Implications volume 8, Article number: 9 (2023)
We investigated the relationship between category learning and domain-general object recognition ability (o). We assessed this relationship in a radiological context, using a category learning test in which participants judged whether white blood cells were cancerous. In Study 1, Bayesian evidence indicated no relationship between o and category learning. This lack of correlation occurred despite high reliability in all measurements. However, participants only received feedback on the first 10 of 60 trials. In Study 2, we assigned participants to one of two conditions: feedback on only the first 10 trials, or on all 60 trials of the category learning test. We found strong Bayesian evidence for a correlation between o and categorisation accuracy in the full-feedback condition, but not when feedback was limited to early trials. Moderate Bayesian evidence supported a difference between these correlations. Without feedback, participants may stick to simple rules they formulate at the start of category learning, when trials are easier. Feedback may encourage participants to abandon less effective rules and switch to exemplar learning. This work provides the first evidence relating o to a specific learning mechanism, suggesting this ability is more dependent upon exemplar learning mechanisms than rule abstraction. Object recognition ability could complement other sources of individual differences when predicting accuracy of medical image interpretation.
Accurate interpretation of medical images plays a crucial role in the diagnosis of many medical conditions. This process often requires the visual detection of abnormalities, such as lung nodules in radiographs or masses in mammograms. Although experts undergo substantial training, they cannot always make the correct decision (Brady, 2017; Graber et al., 2002). For many diagnostic tests, there are substantial discrepancies in accuracy between practitioners, in part due to differences in experience (Itani et al., 2019; Rudolph et al., 2021). Practitioners may even disagree with their own initial judgement when asked to review images a second time (Abujudeh et al., 2010). Although precise estimates of the prevalence of medical imaging errors are difficult to obtain, as errors vary widely based on test, practice setting, and population, estimates of real-world error rates range from < 1% to around 10% (Gergenti & Olympia, 2019; Lamoureux et al., 2021; Lockwood, 2017). Error rates can be higher still when the relevant disease is rare in the studied population (Kolb et al., 2002). These errors have multiple causes at the individual and the system level, including fatigue, communication failure, biased reasoning, failures of visual search, interpretive errors, technological errors, and poor technique (Lee et al., 2013; Waite et al., 2017). A majority of radiological errors are perceptual in nature, with practitioners failing to spot abnormalities, whereas a smaller but still substantial proportion of errors are due to failure to correctly categorise abnormalities (Donald & Barnard, 2012; Ferguson et al., 2021; Kim & Mansfield, 2014).
As the accurate interpretation of medical images relies on the detection and categorisation of objects, differences in diagnostic accuracy among practitioners may partially result from individual differences in visual abilities. The existence of such differences is supported by evidence for a domain-general object recognition ability (o). Confirmatory factor models demonstrate that diverse measures of object recognition, with differing task demands and differing object categories, load onto a single higher-order factor (Richler et al., 2019). The o factor explains variance in scores on object recognition tests beyond that explained by intelligence and visual working memory, and it can do so for both familiar and unfamiliar object categories (Richler et al., 2017; Sunday et al., 2022). In studies that are not concerned with investigating the structure of this visual ability, or that for practical reasons cannot achieve the sample size or the number of tasks required for structural equation modelling, an aggregate approach (Rushton et al., 1983) to measuring object recognition ability has been used (Chang & Gauthier, 2021; Chow et al., 2022). In this approach, z-scores on two object recognition tests that differ in format and stimuli are averaged to estimate the level of the underlying o ability. This approach provides a valid compromise in estimating o in smaller samples and when time is limited (Smithson et al., 2022). Using this approach, Sunday et al. (2018) found that o predicts the accurate detection of lung nodules in chest radiographs for both novices and experts, demonstrating a link between o and successful abnormality detection. The detection of lung nodules depends on successful visual search, but other radiological tasks rely less on visual search, and more on accurate categorisation.
As o captures the ability to learn individual identities, it is unclear whether it will also predict accurate categorisation. There are demonstrated individual differences in both speed and accuracy of category learning, in addition to differences in strategy use. Some people rely more on the abstraction of simple rules, leading to categorisation decisions based on one dimension. Others preferentially rely on judgements of perceptual similarity to category exemplars, which can be measured in a space defined by several relevant dimensions (Little & McDaniel, 2015; Wahlheim et al., 2016). For example, one may learn to categorise a skin mole as cancerous if it is asymmetrical, but one may also rely on comparisons of the mole to remembered examples of cancerous and non-cancerous moles. o predicts performance on many visual tasks that require judgements other than individuation, such as visual search, and judgements of summary statistics for ensembles (Chang & Gauthier, 2021; Sunday et al., 2018). Given that o can predict such a wide array of visual tasks, it is reasonable to question whether o could also predict accurate categorisation in a visual domain. There is some support for a relationship between individual differences in object recognition (measured by one of the tasks that tap into o) and accurate categorisation of medical images, at least under some conditions. In one study, participants categorised white blood cells as cancerous or not under conditions emphasising speed or accuracy, or when provided with a biased cue (Trueblood et al., 2018). Performance on an object recognition test predicted accurate categorisation, particularly for categorisation following biased cues. While this suggests that categorisation may rely on visual abilities under some conditions more than others, this work did not have sufficient power to compare correlations across conditions.
The study of domain-general high-level visual abilities is an emerging research area (Gauthier, 2018; Gauthier et al., 2022), and the extent to which these abilities can explain variability in performance on real-world tasks is still unclear. As diagnostic imaging has a heavy visual component, it is plausible that visual abilities may influence performance on these tasks. A good first step in showing this is to demonstrate that o can predict accurate categorisation of medical images. To investigate this, we created a three-alternative forced choice test in which participants learn to categorise white blood cell images as cancerous (blast) or non-cancerous (non-blast). We used a novice sample to test the relationship between o and categorisation in the absence of extensive pre-existing experience, which could contaminate the relationship.
Thirty-nine Vanderbilt University students participated for course credit. A further sixty-seven adults were recruited on Amazon Mechanical Turk. Recruitment criteria required the use of a US IP address, greater than 50 approved HITs, and a greater than 90% approval rating. We used a Bayesian stopping rule, collecting data in batches, until the Bayes factor for the correlation between o and performance on the Blast Test reached a suggested threshold for moderate evidence, BF10 > 3 or BF10 < 1/3 (Lee & Wagenmakers, 2014). From our total of 106 participants, we excluded 26 for below-chance performance on either of the two tests used for estimating o (see Footnote 1). This left 80 participants in the final analysis (Mage = 34.7, SD = 14.8; 32 men, 46 women, 2 other).
Materials and procedure
Participants completed three on-screen tests. First, they completed the Blast Test—a category learning test involving the identification of cancerous cells. After this, they completed the Novel Object Memory Test (NOMT) and the Object Matching Test, which were used to estimate o. Example stimuli for all three tests can be seen in Fig. 1. We used a fixed order of trials for all tests to minimise variance due to factors other than individual differences. Informed consent was obtained from all participants, and the study was approved by the Vanderbilt University Institutional Review Board.
Development of Blast Test
We obtained images of blast and non-blast blood cells from peripheral blood smears conducted at Vanderbilt University Medical Center. These images have been used in prior research on medical decision making (Hasan et al., 2021; Trueblood et al., 2018). They were categorised by expert consensus as blast or non-blast. They were additionally sorted into easy or hard categories on the basis of whether each cell image shared features common to the other category (Trueblood et al., 2018). We initially created 100 trials, each composed of two non-blast cells and one blast cell. The task on each trial was to identify the blast cell from a side-by-side display of the three images. Using three cells to choose from on each trial, rather than two, reduces the importance of response bias and lowers the rate of successful random guessing. These comprised 25 trials composed of one easy blast image and two easy non-blast images; 25 trials composed of one easy blast and two hard non-blast images; 25 trials composed of one hard blast and two easy non-blast images; and 25 trials composed of one hard blast and two hard non-blast images. On the basis of pilot testing, we selected trials from a broad range of difficulty levels to maximise the informativeness of our test across a wide range of ability levels. Trials were also selected for high reliability: we examined item-rest correlations and the internal consistency with each item dropped, removing items that reduced reliability. In the final Blast Test, trials were ordered from easiest to hardest based on our pilot data. To familiarise participants with the two categories, participants were first shown 6 blast blood cell images and 6 non-blast blood cell images, with category membership clearly labelled. Participants then completed 60 trials. Feedback indicating whether responses were correct appeared at the top of the screen for 1 s after each of the first 10 trials. No feedback was given for the remaining 50 trials. Percent correct over the 60 trials indexed performance.
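The scoring logic described above is simple: each trial has one blast cell among three options, so a guessing participant is expected to score 1/3. A minimal sketch with hypothetical responses (the real trial data are at https://osf.io/yaqs8/):

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 60

# Hypothetical data: position (0-2) of the blast cell on each trial,
# and the responses of a participant who guesses at random
correct_option = rng.integers(0, 3, size=n_trials)
responses = rng.integers(0, 3, size=n_trials)

# Percent correct over the 60 trials indexes performance
percent_correct = np.mean(responses == correct_option) * 100
chance = 100 / 3  # ~33.3% with three alternatives, vs. 50% with two
```

A guessing participant's expected score hovers around the 33% chance baseline, which is why the three-alternative format lowers the successful-guessing rate relative to a two-alternative design.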
Tests to estimate o
As in prior work, we used the aggregate of two object recognition tests to estimate o (e.g. Chang & Gauthier, 2021; Chow et al., 2022; Sunday et al., 2018). These two tests were chosen from a battery of tests that were good indicators of o in confirmatory factor models (Richler et al., 2019; Sunday et al., 2022), on the basis that they have different test constraints and use different categories of novel objects. The aggregation of scores from tests using different object categories and different task demands purifies the measurement of domain-general ability, by reducing the proportion of variance in scores that is due to irrelevant variation specific to particular task demands or stimuli (Rushton et al., 1983). The expected correlation for a pair of such tests is relatively low (0.3–0.4) because superficial features of the tests and stimuli are different. The aggregate of standardised performance on two tests provides a good estimate (r ≈ 0.8) of o measured as a factor score in a confirmatory factor analysis based on six tests (Smithson et al., 2022).
Novel object memory test
The NOMT was developed to assess object recognition ability (Richler et al., 2017). Participants were asked to memorise six exemplars from a category of novel objects (symmetrical Greebles; Gauthier & Tarr, 1997). They then viewed these six targets for as long as they needed, before completing six test trials. On each test trial, one target Greeble appeared alongside two distractor Greebles. Participants had unlimited time to select the target Greeble with their mouse on each trial. Participants then reviewed the targets and completed a further 18 test trials. Participants were then informed that the Greebles could appear in different viewpoints on remaining trials. The targets were presented again for review, prior to the final 24 test trials. Percent correct over the 48 test trials indexed performance.
Object matching test
On each trial participants had to determine whether two serially presented images displayed the same object. The objects were selected from four categories of novel objects: asymmetrical Greebles, Sheinbugs, and two distinct categories of Ziggerins (Richler et al., 2019). Each trial used either one or two objects from the same category. Each trial began with the presentation of a central fixation cross for 500 ms. The target object was then presented for 300 ms before a visual mask composed of scrambled object parts appeared for 500 ms. Finally, another object was presented which was either the same as the target or different. Participants had 4 s to respond by clicking either the "same" or "different" button on-screen. The target object could change in orientation or size from study to test, but participants were asked to judge only whether the identity of the object was the same. After an initial four practice trials, participants completed 70 test trials. Performance was indexed by a signal detection theory measure of sensitivity (d′). Timed-out responses were not included in the calculation. Less than 1% of all trials had timed-out responses.
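Sensitivity on a same/different task of this kind is conventionally computed as d′ = z(hit rate) − z(false-alarm rate). A minimal sketch with hypothetical trial counts (the paper does not specify its correction for extreme rates; the log-linear correction below is an assumption):

```python
from scipy.stats import norm

def dprime(hits, misses, false_alarms, correct_rejections):
    """d' for a same/different task: z(hit rate) - z(false-alarm rate).

    A log-linear correction (add 0.5 to each cell) keeps rates away
    from 0 and 1, where the z-transform is undefined.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# Hypothetical counts from 70 test trials, after dropping timed-out responses
d = dprime(hits=28, misses=7, false_alarms=12, correct_rejections=23)
```

Higher d′ indicates better discrimination of "same" from "different" trials independently of any bias towards one response.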
Descriptive statistics and reliability for each test can be seen in Table 1. To estimate o, z-scores for percent accuracy on the NOMT and d′ on the Object Matching Test were averaged. Correlational analyses used a Jeffreys-beta prior (Jeffreys, 1961) with a scale of 1 and were conducted with the BayesFactor package (Morey & Rouder, 2021) in R. BF+0 indicates a one-sided test in the positive direction, and BF10 is used for two-sided tests. We report highest posterior densities as 95% credibility intervals, and the median of the posterior distribution is used for parameter estimation. Our reported CIs and parameter estimates are always calculated from two-sided analyses. As expected, there was very strong Bayesian evidence for a positive correlation between performance on the NOMT and the Object Matching Test (r = 0.33, 95% CI [0.13, 0.52], BF+0 = 31.07). We obtained moderate evidence against a correlation between o and percent accuracy on the Blast Test (r = 0.03, 95% CI [− 0.18, 0.25], BF+0 = 0.18, Fig. 2), although this was somewhat sensitive to the choice of prior, with the Bayes factor rising above 1/3 for prior scales equalling or below 0.32.
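The aggregate estimate of o described above amounts to standardising each test's scores within the sample and averaging the two z-scores per participant. A sketch with hypothetical scores (the real data are at https://osf.io/yaqs8/):

```python
import numpy as np
from scipy.stats import zscore

# Hypothetical per-participant scores on the two object recognition tests
nomt_accuracy = np.array([0.75, 0.60, 0.90, 0.55, 0.80])  # percent correct, NOMT
matching_dprime = np.array([1.2, 0.8, 1.9, 0.5, 1.4])     # d', Object Matching Test

# o is estimated as the mean of the two standardised scores per participant,
# giving each test equal weight despite their different scales
o_estimate = (zscore(nomt_accuracy) + zscore(matching_dprime)) / 2
```

Because each z-scored test has mean 0 and unit variance, the aggregate is centred on 0, and participants above 0 performed better than average across both tests.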
Contrary to our hypothesis, o did not predict performance on the Blast Test. One possible reason for the lack of a relationship with category learning may be the limited amount of feedback that participants received. Although tests that are used to estimate o do not use feedback, and o can predict other skills measured in tests without feedback, such as working memory judgements with musical notation (Chang & Gauthier, 2021) or food oddball judgements (Gauthier & Fiestan, 2023), the strategies and mechanisms recruited during category learning may be particularly sensitive to feedback. Early on in category learning, people tend to rely on simple rule-based judgements and then update these rules as they receive further feedback, before shifting to similarity-based exemplar retrieval as expertise develops (Johansen & Palmeri, 2002). In Study 1, we only provided participants with feedback on the first ten trials of the Blast Test. As earlier trials in the Blast Test are easier, participants did not receive any feedback on more difficult trials. Individual differences in performance may thus result from divergent initial rule choices, or differing success in the application of these rules. Due to the limited feedback, participants may have seen no need to update their initial rules or may have had no basis on which to do so. Additionally, the lack of feedback may have discouraged a switch in strategy to reliance on judgements of perceptual similarity to prior exemplars. Harder trials are presumably more likely to require methods of judgement other than simple rule use. The tests used to estimate o require within-category individuation, which also cannot usually rely on the use of simple rules, as objects in a common category will share a basic configuration of parts.
To test whether the lack of association between o and Blast Test accuracy was due to the limited feedback, we repeated the study with the addition of a full-feedback condition, wherein participants received feedback for all 60 trials of the Blast Test.
Materials and procedure
Participants completed the same three tests as in Study 1. However, the tests were in a different fixed order: NOMT, Object Matching Test, and Blast Test. In Study 2, we compared the limited feedback and the full-feedback versions of the Blast Test, so placing this test last ensured that performance on the two object recognition tests could not be affected by assignment to either condition of the Blast Test. The NOMT was modified such that participants had a fixed 20 s to familiarise themselves with the six targets on each study trial, reducing differences in study time as a source of individual differences. The Object Matching Test was altered so that for the first 35 trials the study object was presented for 600 ms. This was done to lower difficulty on some trials, as mean d′ was low in Study 1 (0.97). For the remaining 35 trials, the study object was presented for 300 ms, as in Study 1. Another alteration was to allow unlimited time to respond, eliminating timed-out responses so that d′ was calculated for the exact same trials for all participants.
Due to the high percentage of participants excluded in Study 1 for below-chance performance (27.4%), we switched our recruitment platform to Prolific.co. We recruited 245 participants, with the requirement of English fluency. Our pre-set exclusion criteria excluded one participant who failed more than one of five attention checks spread throughout the study. This method of exclusion allowed us to treat low scores as valid. The attention checks were dummy trials in which participants were instructed to click on a specific response option. Participants were randomly assigned to the limited or the full-feedback condition. Due to an error, the first 6 participants were non-randomly assigned to the limited feedback condition. Once we reached 122 participants in the limited feedback condition (64 men, 57 women, 1 other; Mage = 25.4, SD = 6.3), we added 21 participants to the full-feedback condition (50 men, 67 women, 5 other; Mage = 26.3, SD = 8.8) to achieve equal group sizes (which were unequal due to random assignment as well as the initial error). We then ceased collecting data as we were able to find a conclusive Bayes factor for the existence of a correlation between o and categorisation accuracy in the full-feedback condition.
Descriptive statistics and reliability for each test can be seen in Table 2. As expected, there was a correlation between the NOMT and the Object Matching Test (r = 0.26, 95% CI [0.14, 0.37], BF+0 = 752.82) across all participants. There was inconclusive evidence for an overall difference in accuracy on the last 50 trials of the Blast Test (conditions diverge after the first 10 trials) between the limited and the full-feedback conditions (BF+0 = 2.48). For the limited feedback condition, there was inconclusive evidence against a correlation between o and percent accuracy on the Blast Test (r = 0.11, 95% CI [− 0.07, 0.28], BF+0 = 0.41, Fig. 3). For the full-feedback condition, there was strong evidence for a correlation between o and percent accuracy on the Blast Test (r = 0.26, 95% CI [0.10, 0.43], BF+0 = 18.75, Fig. 3). Using BFpack (Mulder et al., 2019) in R, we obtained moderate Bayesian evidence for the hypothesis that the correlation between o and blast percent accuracy is greater in the full-feedback condition than in the limited feedback condition, compared to its complement (BF10 = 8.34).
In support of our main hypothesis, we found a relationship between o and accuracy in the full-feedback condition of the Blast Test. We further showed that this relationship was greater than the relationship between o and Blast Test accuracy in the limited feedback condition. This suggests a relationship between o and category learning. We hypothesised that such a correlation would emerge in the full-feedback condition because greater feedback may allow for a shift from the use of simple rules for categorisation judgements to the use of complex rules, or a change in strategy from rule-based judgements to judgements based on perceptual comparison of cells against prior exemplars. However, because each trial included one blast image and two non-blast images, one task-specific strategy that participants could have employed is to select the odd one out. Perhaps particularly for the limited feedback condition, participants may have used this strategy instead of trying to learn to categorise blast cells explicitly. To test this possibility, in Study 3 we assessed participants in a no-feedback version of the Blast Test.
Participants and method
Fifty-one participants (Mage = 24.69, SD = 6.8; 24 men, 24 women, 3 other) were recruited on the Prolific.co platform, with a requirement for English fluency. We then assessed whether our stopping criterion had been met: a Bayes factor < 1/3 or > 3 for a correlation between o and Blast Test accuracy. No participants were excluded, as none missed more than one of five attention checks. Participants completed the same tests in the same order as in Study 2. The only differences were that the Blast Test gave no feedback on performance, and no examples of blast and non-blast cells were given prior to testing. Participants were instructed to try their best to choose the cancerous cell on each trial, despite the lack of examples or feedback.
Results and discussion
Descriptive statistics are presented in Table 3. Performance in the no-feedback condition of the Blast Test was slightly below chance level (chance = 33%; M = 30%, SD = 16%), and a one-sample Bayesian t-test yielded strong evidence against above-chance performance (BF+0 = 0.06). It appears that the presentation of a small number of examples and the presence of limited feedback are required for most people to perform above chance. Given the high performance in Studies 1 and 2, it is unlikely that participants in those studies made use of an odd-one-out strategy. There was moderate evidence against a correlation between o and accuracy in this version of the Blast Test (r = − 0.01, 95% CI [− 0.28, 0.25], BF+0 = 0.16).
To determine whether trial difficulty was consistent between the feedback conditions in Study 2 and the no-feedback condition in Study 3, we tested for cross-condition correlations for the percentage of participants who responded correctly on each trial. Figure 4 shows average accuracy per trial. To counter skewness, we applied a log transformation to the limited (− 0.75 to 0.32) and full (− 1.44 to − 0.15) feedback conditions, and a square root transformation to the no-feedback condition (0.82 to 0.25). There was a high correlation between trial accuracy in the full-feedback condition and the limited feedback condition (r = 0.83, 95% CI [0.74, 0.91], BF10 = 9.18 × 1013). There were much smaller correlations between trial accuracy in the no-feedback condition and the full feedback (r = 0.24, 95% CI [− 0.01, 0.48], BF10 = 0.94) and limited feedback (r = 0.31, 95% CI [0.08, 0.53], BF10 = 3.42) conditions.
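The cross-condition comparison above treats each condition as a vector of per-trial accuracies, transforms each vector to counter skew, and then correlates them. A sketch with hypothetical accuracies (the transform choices mirror the paper: log for the feedback conditions, square root for the no-feedback condition):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n_trials = 60

# Hypothetical proportion of participants answering each trial correctly:
# the two feedback conditions share difficulty structure plus noise,
# while the no-feedback condition is unrelated to them
acc_full = rng.uniform(0.4, 0.95, size=n_trials)
acc_limited = np.clip(acc_full + rng.normal(0, 0.08, size=n_trials), 0.05, 0.99)
acc_none = rng.uniform(0.1, 0.6, size=n_trials)

# Transform to reduce skew before correlating trial difficulty across conditions
r_full_limited, _ = pearsonr(np.log(acc_full), np.log(acc_limited))
r_none_limited, _ = pearsonr(np.sqrt(acc_none), np.log(acc_limited))
```

With shared difficulty structure, the first correlation is high, as between the two feedback conditions in the paper; the second stays near zero, paralleling the weaker no-feedback correlations.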
Although the Blast Test was not designed to reveal strategy use, we suspected that participants who rely primarily on a simple rule may choose to use size as the basis and select the largest cell as being cancerous. We measured the maximum diameter of the cells in each trial, excluding small appendages (except in the case of ties), and found that size was diagnostic in nine of the first ten trials, and 70% of first-half trials, but only diagnostic in 33% of second-half trials (see Footnote 2). If participants in the limited feedback condition rely on a simple rule, it is likely that they would form a size-based rule, given that it is so diagnostic for early trials. Indeed, we found a strong correlation between percent accuracy on each trial and whether size was diagnostic in the no-feedback (r = 0.56, 95% CI [0.37, 0.72], BF+0 = 26,208.3) and limited feedback conditions (r = 0.4, 95% CI [0.18, 0.60], BF+0 = 54.78), but a weaker one in the full-feedback condition (r = 0.31, 95% CI [0.08, 0.53], BF+0 = 7.43).
We found evidence in Study 2 that o predicts performance on a category learning task requiring the assessment of radiological images. This demonstrates a link between individual differences in individuation and categorisation, adding to a growing corpus of literature suggesting that individual differences in a wide variety of visual tasks are related (Chang & Gauthier, 2021, 2022; Growns et al., 2022). We also demonstrate in Studies 1 and 2 that o either does not predict categorisation accuracy, or predicts it to a lesser extent, when participants receive feedback on only a very limited number of easy categorisation trials. This is despite the fact that performance in this limited feedback condition was very similar to performance in a condition where feedback was given continuously. Furthermore, participants who received limited feedback in Studies 1 and 2 successfully categorised cells as cancerous or not with a success rate approximately double that of participants in Study 3, who received no feedback. In other words, we find the biggest difference in performance on the Blast Test between the no-feedback condition (Study 3) and any of the other conditions in which examples and some feedback were presented. This is not surprising given the extensive literature on the advantages of supervised learning for categories that are not based on a simple verbalisable rule (Ashby et al., 1999, 2002). In contrast, the difference between providing feedback only on ten trials vs. all trials was more modest in terms of average performance on the Blast Test. Nonetheless, this additional feedback made a substantial difference across individuals, leading to an advantage for those participants with higher o.
The correlation between o and performance in the full-feedback condition was small (r = 0.26, or r = 0.30 when accounting for attenuation from measurement error). The effect is similar to that in Sunday et al. (2018), who found a correlation of r = 0.28 between o and decisions on a test of tumour detection in chest radiographs, after controlling for intelligence. The correlation could be limited by the fact that the Blast Test is short and allows different strategies. But more importantly, o is conceived as a general ability that does not reflect specifics of the domain or the task constraints. Performance on any test is explained by a variety of factors, some of them general and some specific to the test. For instance, two similarly formatted matching tests may correlate more strongly (e.g. Growns et al., 2022), but some of this correlation may be due to specific task requirements. When tests with different formats use similar stimuli (e.g. faces, Wilmer et al., 2014), the resulting strong correlation is partially due to the common domain. But when two tests differ in both format and domain, like the tests we use to estimate o, the shared variance is expected to be smaller. Importantly, the advantage is that we can expect a domain-general ability to predict some of the variance in other very different tasks, such as the Blast categorisation test. This is somewhat analogous to intelligence, which is, for instance, a predictor of job performance in many domains, with effect sizes that are comparable to what we observe here (e.g. r = 0.33; see Ree & Earles, 1992, for a review). In addition, it is important to note that a small effect size when measured in a single task can translate into large consequences in the long run, in real-world situations where individuals make a very large number of perceptual decisions in the course of their work (Funder & Ozer, 2019).
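The disattenuated value mentioned above follows Spearman's classical correction for attenuation, r_true = r_obs / √(rel_X · rel_Y). A sketch with illustrative reliabilities (the reliabilities below are assumptions chosen to reproduce the reported 0.26 → ~0.30 correction; the actual values are in Table 2):

```python
import math

def disattenuate(r_observed, reliability_x, reliability_y):
    """Spearman's correction for attenuation due to measurement error.

    Divides the observed correlation by the geometric mean of the two
    measures' reliabilities, estimating the correlation between the
    error-free constructs.
    """
    return r_observed / math.sqrt(reliability_x * reliability_y)

# Illustrative reliabilities (hypothetical): 0.85 for the o aggregate,
# 0.88 for the Blast Test
r_corrected = disattenuate(0.26, 0.85, 0.88)
```

Because reliabilities are bounded by 1, the corrected correlation is always at least as large as the observed one, which is why attenuation can mask the strength of a relationship between underlying abilities.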
By crossing the measurement of o with an experimental manipulation, our results are the first to speak to the mechanisms that support this ability, because of the extensive literature distinguishing different modes for category learning. One influential model proposes two systems for category learning, one using simple explicit rules, and the other for learning more complex multidimensional categories that are difficult to verbalise (Ashby et al., 1999). A different account suggests that even within a single system, feedback is more critical for more cognitively demanding categorisation tasks (Le Pelley et al., 2019). In the Blast Test, early category learning can plausibly rely primarily on simple rule generation. As difficulty increases, a simple rule will become ineffective and participants need to switch to judgements of similarity to stored exemplars. In summary, many different visual tasks tap into o, but this ability may not support categorisation following simple verbalisable rules, or categorisation of a multidimensional nature without mechanisms responsible for supervised learning. Fully supervised learning may not be necessary; semi-supervised learning (e.g. labelling a few exemplars) is the predominant method by which humans learn categories (Gibson et al., 2013; LaTourrette & Waxman, 2019), and we would also expect o to predict categorisation abilities that have developed through this method. These conjectures could be more directly addressed by testing for correlations between o and accuracy on categorisation tasks that only require simple rules, require a combination of rules to increase cognitive demands, or which require multidimensional judgements which cannot easily be reduced to verbalisable rules. To make strong claims about the underlying categorisation strategies being employed by individual participants would require an analysis of response patterns in tasks that have been designed to reveal strategy use.
The finding that o can predict successful categorisation adds to existing knowledge that o can predict successful visual search in a radiological task (Sunday et al., 2018). Both perceptual and interpretative skills are fundamental for radiological diagnostics. Therefore, o may plausibly predict diagnostic accuracy, although we have not yet tested this in an ecologically valid design. Accuracy in category learning experiments can be influenced by task demands and sequence effects, in which case performance may reflect the successful use of task-specific strategies that may be hard to account for (Richler & Palmeri, 2014; Stewart et al., 2002). Nevertheless, when combined with the influences of experience and general intelligence, the contribution of o may provide a fuller explanation of individual differences in diagnostic accuracy. Further research should explore the relationship between o and categorisation accuracy in a radiological task using an expert sample. The contribution of o to performance, and the conditions necessary for it to be used, may well differ in experts. For instance, the contribution of o to categorisation may be greater for experts than for novices, because experts have more exemplars in memory and may rely less on simple categorisation rules. In addition, feedback may not be necessary for a relationship with o to emerge in experts who can already distinguish blast from non-blast cells.
Availability of data and materials
Data, analysis code, and materials are available at https://osf.io/yaqs8/. No studies were preregistered.
Nineteen of the excluded participants were from the Mechanical Turk sample, and seven were from the student sample.
Trial images are available at https://osf.io/yaqs8/.
Abujudeh, H. H., Boland, G. W., Kaewlai, R., Rabiner, P., Halpern, E. F., Gazelle, G. S., & Thrall, J. H. (2010). Abdominal and pelvic computed tomography (CT) interpretation: Discrepancy rates among experienced radiologists. European Radiology, 20(8), 1952–1957. https://doi.org/10.1007/s00330-010-1763-1
Ashby, F. G., Maddox, W. T., & Bohil, C. J. (2002). Observational versus feedback training in rule-based and information-integration category learning. Memory & Cognition, 30(5), 666–677. https://doi.org/10.3758/BF03196423
Ashby, F. G., Queller, S., & Berretty, P. M. (1999). On the dominance of unidimensional rules in unsupervised categorization. Perception & Psychophysics, 61(6), 1178–1199. https://doi.org/10.3758/BF03207622
Brady, A. P. (2017). Error and discrepancy in radiology: Inevitable or avoidable? Insights into Imaging, 8(1), 171–182. https://doi.org/10.1007/s13244-016-0534-1
Chang, T.-Y., & Gauthier, I. (2021). Domain-specific and domain-general contributions to reading musical notation. Attention, Perception, & Psychophysics, 83(7), 2983–2994. https://doi.org/10.3758/s13414-021-02349-3
Chang, T.-Y., & Gauthier, I. (2022). Domain-general ability underlies complex object ensemble processing. Journal of Experimental Psychology: General, 151(4), 966–972. https://doi.org/10.1037/xge0001110
Chow, J. K., Palmeri, T. J., & Gauthier, I. (2022). Haptic object recognition based on shape relates to visual object recognition ability. Psychological Research Psychologische Forschung, 86(4), 1262–1273. https://doi.org/10.1007/s00426-021-01560-z
Donald, J. J., & Barnard, S. A. (2012). Common patterns in 558 diagnostic radiology errors. Journal of Medical Imaging and Radiation Oncology, 56(2), 173–178. https://doi.org/10.1111/j.1754-9485.2012.02348.x
Ferguson, A., Assadsangabi, R., Chang, J., Raslan, O., Bobinski, M., Bewley, A., Dublin, A., Latchaw, R., & Ivanovic, V. (2021). Analysis of misses in imaging of head and neck pathology by attending neuroradiologists at a single tertiary academic medical centre. Clinical Radiology, 76(10), 786.e9-786.e13. https://doi.org/10.1016/j.crad.2021.06.011
Funder, D. C., & Ozer, D. J. (2019). Evaluating effect size in psychological research: Sense and nonsense. Advances in Methods and Practices in Psychological Science, 2(2), 156–168. https://doi.org/10.1177/2515245919847202
Gauthier, I. (2018). Domain-specific and domain-general individual differences in visual object recognition. Current Directions in Psychological Science, 27(2), 97–102. https://doi.org/10.1177/0963721417737151
Gauthier, I., Cha, O., & Chang, T.-Y. (2022). Mini review: Individual differences and domain-general mechanisms in object recognition. Frontiers in Cognition. https://doi.org/10.3389/fcogn.2022.1040994
Gauthier, I., & Fiestan, G. (2023). Food neophobia predicts visual ability in the recognition of prepared food, beyond domain-general factors. Food Quality and Preference, 103, 104702. https://doi.org/10.1016/j.foodqual.2022.104702
Gauthier, I., & Tarr, M. J. (1997). Becoming a “Greeble” expert: Exploring mechanisms for face recognition. Vision Research, 37(12), 1673–1682. https://doi.org/10.1016/S0042-6989(96)00286-6
Gergenti, L., & Olympia, R. P. (2019). Etiology and disposition associated with radiology discrepancies on emergency department patients. The American Journal of Emergency Medicine, 37(11), 2015–2019. https://doi.org/10.1016/j.ajem.2019.02.027
Gibson, B. R., Rogers, T. T., & Zhu, X. (2013). Human semi-supervised learning. Topics in Cognitive Science, 5(1), 132–172. https://doi.org/10.1111/tops.12010
Graber, M., Gordon, R., & Franklin, N. (2002). Reducing diagnostic errors in medicine: What’s the goal? Academic Medicine, 77(10), 981–992.
Growns, B., Dunn, J. D., Mattijssen, E. J. A. T., Quigley-McBride, A., & Towler, A. (2022). Match me if you can: Evidence for a domain-general visual comparison ability. Psychonomic Bulletin & Review, 29(3), 866–881. https://doi.org/10.3758/s13423-021-02044-2
Hasan, E., Eichbaum, Q., Seegmiller, A., Stratton, C., & Trueblood, J. S. (2021). Harnessing the wisdom of the confident crowd in medical image decision-making [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/wkqgs
Itani, M., Assaker, R., Moshiri, M., Dubinsky, T. J., & Dighe, M. K. (2019). Inter-observer variability in the American college of radiology thyroid imaging reporting and data system: In-depth analysis and areas for improvement. Ultrasound in Medicine & Biology, 45(2), 461–470. https://doi.org/10.1016/j.ultrasmedbio.2018.09.026
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford University Press.
Johansen, M., & Palmeri, T. J. (2002). Are there representational shifts during category learning? Cognitive Psychology, 45(4), 482–553. https://doi.org/10.1016/S0010-0285(02)00505-4
Kim, Y. W., & Mansfield, L. T. (2014). Fool me twice: Delayed diagnoses in radiology with emphasis on perpetuated errors. American Journal of Roentgenology, 202(3), 465–470. https://doi.org/10.2214/AJR.13.11493
Kolb, T. M., Lichy, J., & Newhouse, J. H. (2002). Comparison of the performance of screening mammography, physical examination, and breast US and evaluation of factors that influence them: An analysis of 27,825 patient evaluations. Radiology, 225(1), 165–175. https://doi.org/10.1148/radiol.2251011667
Lamoureux, C., Hanna, T. N., Sprecher, D., Weber, S., & Callaway, E. (2021). Radiologist errors by modality, anatomic region, and pathology for 1.6 million exams: What we have learned. Emergency Radiology, 28(6), 1135–1141. https://doi.org/10.1007/s10140-021-01959-6
LaTourrette, A., & Waxman, S. R. (2019). A little labeling goes a long way: Semi-supervised learning in infancy. Developmental Science, 22(1), e12736. https://doi.org/10.1111/desc.12736
Le Pelley, M. E., Newell, B. R., & Nosofsky, R. M. (2019). Deferred feedback does not dissociate implicit and explicit category-learning systems: Commentary on Smith et al. (2014). Psychological Science, 30(9), 1403–1409. https://doi.org/10.1177/0956797619841264
Lee, C. S., Nagy, P. G., Weaver, S. J., & Newman-Toker, D. E. (2013). Cognitive and system factors contributing to diagnostic errors in radiology. American Journal of Roentgenology, 201(3), 611–617. https://doi.org/10.2214/AJR.12.10375
Lee, M. D., & Wagenmakers, E.-J. (2014). Bayesian cognitive modeling: A practical course. Cambridge University Press.
Little, J. L., & McDaniel, M. A. (2015). Individual differences in category learning: Memorization versus rule abstraction. Memory & Cognition, 43(2), 283–297. https://doi.org/10.3758/s13421-014-0475-1
Lockwood, P. (2017). Observer performance in computed tomography head reporting. Journal of Medical Imaging and Radiation Sciences, 48(1), 22–29. https://doi.org/10.1016/j.jmir.2016.08.001
Morey, R. D., & Rouder, J. N. (2021). BayesFactor: Computation of Bayes factors for common designs. R package version 0.9.12-4.3. https://CRAN.R-project.org/package=BayesFactor
Mulder, J., Gu, X., Olsson-Collentine, A., Tomarken, A., Böing-Messing, F., Hoijtink, H., Meijerink, M., Williams, D. R., Menke, J., Fox, J.-P., Rosseel, Y., Wagenmakers, E.-J., & van Lissa, C. (2019). BFpack: Flexible bayes factor testing of scientific theories in R. ArXiv:1911.07728 [Stat]. http://arxiv.org/abs/1911.07728
Ree, M. J., & Earles, J. A. (1992). Intelligence is the best predictor of job performance. Current Directions in Psychological Science, 1(3), 86–89.
Richler, J. J., & Palmeri, T. J. (2014). Visual category learning. Wiley Interdisciplinary Reviews: Cognitive Science, 5(1), 75–94. https://doi.org/10.1002/wcs.1268
Richler, J. J., Tomarken, A. J., Sunday, M. A., Vickery, T. J., Ryan, K. F., Floyd, R. J., Sheinberg, D., Wong, A.C.-N., & Gauthier, I. (2019). Individual differences in object recognition. Psychological Review, 126(2), 226–251. https://doi.org/10.1037/rev0000129
Richler, J. J., Wilmer, J. B., & Gauthier, I. (2017). General object recognition is specific: Evidence from novel and familiar objects. Cognition, 166, 42–55. https://doi.org/10.1016/j.cognition.2017.05.019
Rudolph, J., Fink, N., Dinkel, J., Koliogiannis, V., Schwarze, V., Goller, S., Erber, B., Geyer, T., Hoppe, B. F., Fischer, M., Ben Khaled, N., Jörgens, M., Ricke, J., Rueckel, J., & Sabel, B. O. (2021). Interpretation of thoracic radiography shows large discrepancies depending on the qualification of the physician—quantitative evaluation of interobserver agreement in a representative emergency department scenario. Diagnostics, 11(10), 1868. https://doi.org/10.3390/diagnostics11101868
Rushton, J. P., Brainerd, C. J., & Pressley, M. (1983). Behavioral development and construct validity: The principle of aggregation. Psychological Bulletin, 94(1), 18–38. https://doi.org/10.1037/0033-2909.94.1.18
Smithson, C. J. R., Chow, J. K., Chang, T.-Y., & Gauthier, I. (2022). Measuring object recognition ability: Reliability, validity, and the aggregate z-score approach. Manuscript in preparation.
Stewart, N., Brown, G. D. A., & Chater, N. (2002). Sequence effects in categorization of simple perceptual stimuli. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28(1), 3–11. https://doi.org/10.1037/0278-7393.28.1.3
Sunday, M. A., Donnelly, E., & Gauthier, I. (2018). Both fluid intelligence and visual object recognition ability relate to nodule detection in chest radiographs. Applied Cognitive Psychology, 32(6), 755–762. https://doi.org/10.1002/acp.3460
Sunday, M. A., Tomarken, A., Cho, S.-J., & Gauthier, I. (2022). Novel and familiar object recognition rely on the same ability. Journal of Experimental Psychology: General. 151(3), 676-694. https://doi.org/10.1037/xge0001100
Trueblood, J. S., Holmes, W. R., Seegmiller, A. C., Douds, J., Compton, M., Szentirmai, E., Woodruff, M., Huang, W., Stratton, C., & Eichbaum, Q. (2018). The impact of speed and bias on the cognitive processes of experts and novices in medical image decision-making. Cognitive Research: Principles and Implications. https://doi.org/10.1186/s41235-018-0119-2
Wahlheim, C. N., McDaniel, M. A., & Little, J. L. (2016). Category learning strategies in younger and older adults: Rule abstraction and memorization. Psychology and Aging, 31(4), 346–357. https://doi.org/10.1037/pag0000083
Waite, S., Scott, J. M., Legasto, A., Kolla, S., Gale, B., & Krupinski, E. A. (2017). Systemic error in radiology. American Journal of Roentgenology, 209(3), 629–639. https://doi.org/10.2214/AJR.16.17719
Wang, M. W., & Stanley, J. C. (1970). Differential weighting: A review of methods and empirical studies. Review of Educational Research, 40(5), 663–705. https://doi.org/10.2307/1169462
Wilmer, J. B., Germine, L. T., & Nakayama, K. (2014). Face recognition: A model specific ability. Frontiers in Human Neuroscience. https://doi.org/10.3389/fnhum.2014.00769
Visual abilities have an important role to play in the diagnosis of disease, as diagnosis often requires the categorisation of medical images. This study demonstrates that a domain-general high-level visual ability can predict the accuracy of medical image categorisation, an important first step in establishing a link between visual abilities and diagnostic accuracy. We also contribute to the understanding of the nature of object recognition ability by showing that it relates to category learning. This relationship appeared in our tasks only when continuous feedback on category learning was given. It is plausible that continuous feedback allowed participants to abandon the use of a simple rule and/or to rely on exemplar comparisons. This suggests new directions for investigating the relationship between categorisation and recognition ability.
David K. Wilson Chair Research Fund (Vanderbilt University).
The authors declare no competing interest.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Smithson, C.J.R., Eichbaum, Q.G. & Gauthier, I. Object recognition ability predicts category learning with medical images. Cogn. Research 8, 9 (2023). https://doi.org/10.1186/s41235-022-00456-9